# Decision Tree Implementation (Python)

***

1. Classification
2. Regression

## 1. Classification

### Tools & Libraries

- **Pandas** (data analysis and manipulation)
- **NumPy** (multidimensional arrays, matrices, and numerical computation)
- **Matplotlib** (visualization)
- **Scikit-Learn** (machine learning)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset

***

**Source:**
*UCI Banknote Authentication Data Set*

In [2]:
dataset = pd.read_csv('./Resources/bill_authentication.csv')

#### Information

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have $400 \times 400$ pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A wavelet transform tool was used to extract features from the images.

In [3]:
def data_info(daten):
    print('Shape :')
    print(daten.shape)
    print('\nHead :')
    print(daten.head())

data_info(dataset)

Shape :
(1372, 5)

Head :
   Variance  Skewness  Curtosis  Entropy  Class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0


**Attributes :**
- variance of Wavelet Transformed image (continuous)
- skewness of Wavelet Transformed image (continuous)
- curtosis (i.e. kurtosis) of Wavelet Transformed image (continuous)
- entropy of image (continuous)
- class (integer) 

### Preprocessing

***

<u>**Attribute-Label Split.**</u>

- Attribute set: $X$ with corresponding labels: $y$.

In [4]:
X = dataset.drop('Class', axis=1) # all columns except 'Class'
y = dataset['Class'] # the 'Class' column only

<u>**Train-Test Split.**</u>

- 20% (test) and 80% (train)

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

print('\nTrain Data Shape :')
print(X_train.shape)

print('\nTest Data Shape :')
print(X_test.shape)


Train Data Shape :
(1097, 4)

Test Data Shape :
(275, 4)
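The split above is random, so the rows that land in the test set (and therefore the scores further down) change from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes synthetic stand-in data with the same shape as the banknote set; `X_demo`, `y_demo`, the class counts, and the seed `42` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1372 rows, 4 features, binary label (arbitrary class counts).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1372, 4))
y_demo = np.array([0] * 800 + [1] * 572)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print('Positive share (all / test):', y_demo.mean(), y_te.mean())
```

Passing the same `random_state` again returns the same partition, which makes results comparable between runs.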


### Model

***

- Scikit-Learn's `tree` module
- `DecisionTreeClassifier` class

#### Training

- `fit` method of the classifier

In [6]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
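With the defaults shown above (`max_depth=None`, `min_samples_leaf=1`), the tree keeps splitting until its leaves are pure, which can overfit. The sketch below shows one way to cap the depth and inspect the fitted tree. It trains on synthetic data from `make_classification`, not on the banknote file, and the feature names are reused only as labels; `max_depth=3` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 4-feature binary problem standing in for the banknote data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ['Variance', 'Skewness', 'Curtosis', 'Entropy']

# Limit the depth so the tree stays small enough to read.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_demo, y_demo)

print('Depth :', clf.get_depth())
print('Leaves:', clf.get_n_leaves())

# Plain-text dump of the learned decision rules.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The printed rules show which feature and threshold each split uses, which helps when judging whether a shallower tree would be enough.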

#### Prediction

- `predict` method of the classifier

In [7]:
y_pred = classifier.predict(X_test)

#### Testing

**Commonly used metrics:** confusion matrix, precision, recall, and F1 score

- Scikit-Learn's `metrics` module
- `confusion_matrix`: counts correct and incorrect predictions per class
- `classification_report`: per-class precision, recall, F1 score, and support

In [8]:
from sklearn.metrics import classification_report, confusion_matrix

print('\nConfusion Matrix :\n')
print(confusion_matrix(y_test, y_pred))

print('\n\nClassification Report :\n')
print(classification_report(y_test, y_pred))


Confusion Matrix :

[[139   2]
 [  3 131]]


Classification Report :

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       141
           1       0.98      0.98      0.98       134

   micro avg       0.98      0.98      0.98       275
   macro avg       0.98      0.98      0.98       275
weighted avg       0.98      0.98      0.98       275



### Interpretation

**confusion_matrix** returns a matrix $C$ such that $C_{i,j}$ is the number of observations known to be in group $i$ and predicted to be in group $j$. Thus, in binary classification:
- $C_{0,0}$ : number of true negatives, $tn$
- $C_{1,0}$ : number of false negatives, $fn$
- $C_{1,1}$ : number of true positives, $tp$
- $C_{0,1}$ : number of false positives, $fp$
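The four counts can be read out of the matrix in one step. A short sketch, using the confusion matrix printed above as a literal array (in the notebook itself the array would come from `confusion_matrix(y_test, y_pred)`):

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
C = np.array([[139, 2],
              [3, 131]])

# ravel() flattens row by row, giving C[0,0], C[0,1], C[1,0], C[1,1].
tn, fp, fn, tp = C.ravel()

print('tn =', tn, ' fp =', fp, ' fn =', fn, ' tp =', tp)
```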

**classification_report** reports precision, recall, F-score, and support for each class.

- <u>Precision</u> is the ratio $\frac{tp}{tp+fp}$, i.e. the ability of the classifier not to label a negative sample as positive.

- <u>Recall</u> is the ratio $\frac{tp}{tp+fn}$, i.e. the ability of the classifier to find all the positive samples.

- <u>F-Score</u> is a weighted harmonic mean of `precision` and `recall`. The `f1-score` column weights the two equally.
    - `1` (best value)
    - `0` (worst value)
- <u>Support</u> is the number of occurrences of each class in `y_true`.
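As a check on the definitions, the report's rows can be recomputed by hand from the confusion matrix counts. This is a plain-Python sketch; the helper name `prf` is hypothetical, and the counts are copied from the confusion matrix output above.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the confusion matrix [[139, 2], [3, 131]].
tn, fp, fn, tp = 139, 2, 3, 131

# Class 1 as the positive class.
p1, r1, f1_1 = prf(tp=tp, fp=fp, fn=fn)
# Class 0 as the positive class: the roles of the counts swap.
p0, r0, f1_0 = prf(tp=tn, fp=fn, fn=fp)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
# class 0: precision=0.98 recall=0.99 f1=0.98
# class 1: precision=0.98 recall=0.98 f1=0.98
```

The rounded values agree with the per-class rows of the classification report.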

***

|Total tested |Misclassified|Accuracy|
| --- | --- | --- |
| 275 | 5 | 98.2% |

***