## Implementing the Decision Tree Algorithm

* Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. 
* The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
* A decision tree is a tree in which each internal node represents a test on a feature (attribute), each branch represents the outcome of that test (a decision rule), and each leaf represents an outcome (a categorical or continuous value).
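The node/branch/leaf structure described above can be sketched as a small hand-written tree. This is only an illustration: the feature names `PPE` and `HNR` are columns of the dataset used below, but the thresholds, the tree shape, and the helper `predict_one` are made up for this example and are not learned from the data.

```python
# A hand-written tree: internal nodes test one feature against a threshold,
# branches are the two outcomes of that test, and leaves hold the predicted class.
# The thresholds are hypothetical values chosen for illustration only.
toy_tree = {
    "feature": "PPE", "threshold": 0.15,
    "left": {"leaf": 0},                      # PPE <= 0.15 -> class 0 (healthy)
    "right": {
        "feature": "HNR", "threshold": 22.0,
        "left": {"leaf": 1},                  # HNR <= 22.0 -> class 1 (Parkinson's)
        "right": {"leaf": 0},                 # HNR >  22.0 -> class 0 (healthy)
    },
}

def predict_one(node, sample):
    """Walk from the root to a leaf, following one branch per node."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

print(predict_one(toy_tree, {"PPE": 0.28, "HNR": 21.0}))  # 1
print(predict_one(toy_tree, {"PPE": 0.10, "HNR": 25.0}))  # 0
```

A trained `DecisionTreeClassifier` stores the same kind of structure, except that the features and thresholds are chosen automatically from the training data.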

#### 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

#### 2. Import the dataset from the data.csv file into a pandas DataFrame

The dataset that I have selected contains 24 columns (the recording name, 22 voice features and the 'status' label) and 195 rows (or instances). It is composed of biomedical voice measurements from 31 people, 23 of whom have Parkinson's Disease.
* Each column is a particular voice measure
* Each row corresponds to one voice recording from one of these individuals
* The aim here is to separate healthy people from people with Parkinson's Disease using the 'status' column, which is set to '0' for healthy people and '1' for people with Parkinson's Disease

In [2]:
dataset = pd.read_csv('data.csv')
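# Columns 1-22 hold the 22 voice features; column 0 is 'name' and the last column is 'status'
x = dataset.iloc[:, 1:23].values
# Alternative selection by column name:
#x = dataset.drop(['status', 'name'], axis=1)
y = dataset.loc[:, 'status'].values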
dataset.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE,status
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,1
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674,1
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634,1
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975,1
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335,1


In [3]:
x.shape

(195, 22)

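#### 3. Preprocess the data with MinMaxScaler before training

LabelEncoder and OneHotEncoder are part of the scikit-learn library, and they are used to convert categorical (text) data into numbers that predictive models can work with. They are not needed here, because every selected feature is already numeric.

MinMaxScaler is an estimator that scales and translates each feature individually so that it lies within a given range; here each feature is scaled to between -1 and 1. Note that the scaler below is fitted on the full dataset before the train/test split, so the test rows also influence the scaling. Decision trees are largely insensitive to this kind of feature scaling, so it should have little effect on the results here.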

In [4]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
x = scaler.fit_transform(x)
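To show what the scaling step does to a single column, here is a minimal plain-Python sketch of the min-max formula, `lo + (v - min) / (max - min) * (hi - lo)`. The helper name `min_max_scale` is my own, and the handling of a constant column is an assumption of this sketch. The input values are the first five `MDVP:Fo(Hz)` entries shown by `dataset.head()`, so the results differ from the notebook's output, which uses the minimum and maximum of the whole column.

```python
def min_max_scale(values, feature_range=(-1.0, 1.0)):
    """Rescale one column so its minimum maps to lo and its maximum to hi."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption for this sketch: a constant column maps to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

fo_hz = [119.992, 122.4, 116.682, 116.676, 116.014]
scaled = min_max_scale(fo_hz)
print(scaled)
```

The smallest value (116.014) lands on -1 and the largest (122.4) on 1; everything else falls in between, keeping its relative position.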

#### 4. Splitting our dataset into Training set and Test set

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 9)
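The sizes of the two sets follow from the row count and `test_size`. A quick check, assuming scikit-learn rounds a fractional test size up to a whole number of rows (my understanding of its behaviour) and using the 195 rows reported by `x.shape`:

```python
import math

n_samples = 195      # rows in the dataset, from x.shape
test_size = 0.25     # fraction passed to train_test_split

# Assumption: the test set gets ceil(test_size * n_samples) rows,
# and the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 146 49
```

These counts match the lengths of the `y_train` and `y_test` arrays printed in the cells that follow.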

In [37]:
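# Optional: standardise the features (zero mean, unit variance).
# The execution count (37) shows this cell was run out of order; the arrays
# printed below still hold the MinMax-scaled values, so it was not applied to them.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)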

* We can inspect the scaled features, the labels, and the resulting training and test sets

In [6]:
x

array([[-0.63138346, -0.77481654, -0.89037042, ...,  0.17153026,
        -0.21867743, -0.0053808 ],
       [-0.6033463 , -0.81013911, -0.4433544 , ...,  0.48267409,
        -0.05370956,  0.34265204],
       [-0.66992292, -0.88174367, -0.46942324, ...,  0.37274182,
        -0.18236124,  0.19336492],
       ...,
       [ 0.00546073, -0.43717403, -0.89854572, ..., -0.31484696,
         0.11793486, -0.63884033],
       [ 0.28578581,  0.20361309, -0.89144127, ..., -0.09423055,
        -0.36355605, -0.67372646],
       [ 0.46654868, -0.35441175, -0.85610326, ..., -0.16981039,
         0.00734563, -0.5690805 ]])

In [7]:
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [8]:
x_train

array([[-0.76161423, -0.86727089, -0.56938063, ..., -0.10276688,
        -0.49853639,  0.12736212],
       [-0.74244929, -0.89603274, -0.53852177, ...,  0.42168615,
         0.07524641, -0.23873512],
       [-0.19528212, -0.75046388, -0.03129642, ..., -0.21545904,
        -0.18050259, -0.50136695],
       ...,
       [ 0.46654868, -0.35441175, -0.85610326, ..., -0.16981039,
         0.00734563, -0.5690805 ],
       [-0.30210977, -0.74343366, -0.11542137, ..., -0.63276672,
        -0.18051327, -0.41200593],
       [-0.42002189, -0.61177419, -0.81993621, ..., -0.63580576,
        -0.64587778, -0.2977375 ]])

In [9]:
x_test

array([[-0.31872482, -0.56497545, -0.81276267, ..., -0.14534047,
        -0.20731822, -0.50930766],
       [ 0.2813497 , -0.53843045,  0.46957293, ..., -0.23953275,
        -0.92082453, -0.78550954],
       [ 0.15734811, -0.59101014,  0.23825809, ...,  0.1733762 ,
        -0.14651839, -0.518868  ],
       ...,
       [-0.62818154, -0.85670515, -0.43182839, ...,  0.13191016,
        -0.61608244, -0.3073434 ],
       [-0.41949794, -0.58611919, -0.79430493, ..., -0.30667981,
        -0.40641443, -0.23221934],
       [ 0.33082225,  0.89266869,  0.28711412, ...,  0.15261616,
         0.32296558, -0.06536489]])

In [10]:
y_train

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)

In [11]:
y_test

array([1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1], dtype=int64)

#### 5. Import Decision Tree Classifier from sklearn

* Create a Decision Tree classifier: DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.
* Train the model using the training set

#### Create a Decision Tree classifier with criterion='entropy', which measures the quality of a split by information gain

In [12]:
from sklearn.tree import DecisionTreeClassifier
classifier_entropy = DecisionTreeClassifier(criterion='entropy',random_state=0)
classifier_entropy.fit(x_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
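To make the 'entropy' criterion concrete, here is a minimal plain-Python sketch of Shannon entropy and the information gain of one candidate split. The helper names `entropy` and `information_gain` and the toy label lists are illustrative assumptions; this is not scikit-learn's internal implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy labels (hypothetical): a parent node and one candidate split of it.
parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]

print(entropy([0, 1]))                        # 1.0 (a 50/50 node is maximally impure)
print(information_gain(parent, left, right))  # positive: the split reduces impurity
```

When growing the tree, the split with the largest information gain is preferred at each node.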

#### Create a Decision Tree classifier with criterion='gini', which measures the quality of a split by Gini impurity

In [13]:
from sklearn.tree import DecisionTreeClassifier
classifier_gini = DecisionTreeClassifier(criterion='gini',random_state=0)
classifier_gini.fit(x_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
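The 'gini' criterion can be illustrated the same way. The sketch below computes the Gini impurity of a node, `1 - sum(p_k ** 2)`, and the impurity decrease of one candidate split. The helper names `gini` and `gini_decrease` and the toy labels are assumptions made for illustration, not scikit-learn internals.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two labels drawn at random from the node differ."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children

# Toy labels (hypothetical): a parent node and one candidate split of it.
parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]

print(gini(parent))                        # 0.46875
print(gini_decrease(parent, left, right))  # 0.28125
```

Gini impurity avoids the logarithm used by entropy; the two criteria often rank candidate splits similarly, but they can pick different splits, which is why the two classifiers in this notebook do not give identical predictions.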

#### 6. Predict the output using the .predict() method provided by sklearn

In [14]:
preds_entropy=classifier_entropy.predict(x_test)
preds_entropy

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1], dtype=int64)

In [15]:
preds_gini=classifier_gini.predict(x_test)
preds_gini

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1], dtype=int64)

#### 7. Import classification report, accuracy score and confusion matrix to view the results

In [16]:
# Import the evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

* Print the <b>confusion matrix</b>

For criterion = 'entropy'

In [17]:
cm = confusion_matrix(y_test,preds_entropy)
print(cm)

[[10  3]
 [ 0 36]]


For criterion = 'gini'

In [18]:
cm = confusion_matrix(y_test,preds_gini)
print(cm)

[[10  3]
 [ 2 34]]
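The two matrices above already contain everything needed for accuracy, precision and recall. Here is a minimal sketch of that arithmetic, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is my own.

```python
def per_class_metrics(cm):
    """Return (accuracy, {class: (precision, recall)}) from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    metrics = {}
    for k in range(n_classes):
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        precision = cm[k][k] / predicted_k if predicted_k else 0.0
        recall = cm[k][k] / actual_k if actual_k else 0.0
        metrics[k] = (precision, recall)
    return accuracy, metrics

acc_entropy, m_entropy = per_class_metrics([[10, 3], [0, 36]])
acc_gini, m_gini = per_class_metrics([[10, 3], [2, 34]])

print(round(acc_entropy, 4), round(acc_gini, 4))  # 0.9388 0.898
```

Both trees misclassify the same number of healthy recordings (3 of 13); the 'gini' tree additionally misclassifies 2 of the 36 Parkinson's recordings.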


* Print the <b>classification report</b>

For criterion = 'entropy'

In [19]:
print(classification_report(y_test,preds_entropy))

              precision    recall  f1-score   support

           0       1.00      0.77      0.87        13
           1       0.92      1.00      0.96        36

    accuracy                           0.94        49
   macro avg       0.96      0.88      0.91        49
weighted avg       0.94      0.94      0.94        49



For criterion = 'gini'

In [20]:
print(classification_report(y_test,preds_gini))

              precision    recall  f1-score   support

           0       0.83      0.77      0.80        13
           1       0.92      0.94      0.93        36

    accuracy                           0.90        49
   macro avg       0.88      0.86      0.87        49
weighted avg       0.90      0.90      0.90        49
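The f1-score, macro avg and weighted avg rows can be reproduced by hand. The sketch below does so for the 'gini' report, taking the per-class counts from its confusion matrix `[[10, 3], [2, 34]]`; the variable and function names are my own.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class values for the 'gini' tree, read off its confusion matrix.
support = {0: 13, 1: 36}                 # true samples per class
precision = {0: 10 / 12, 1: 34 / 37}     # correct / predicted as that class
recall = {0: 10 / 13, 1: 34 / 36}        # correct / actually that class

f1_scores = {k: f1(precision[k], recall[k]) for k in support}

# Macro average: plain mean over classes, each class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1_scores[k] * support[k] for k in support) / sum(support.values())

print(f"{macro_f1:.2f} {weighted_f1:.2f}")  # 0.87 0.90
```

Because class 1 has almost three times the support of class 0, the weighted average sits closer to the class 1 score, while the macro average exposes the weaker performance on the smaller class.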



* Print the <b>accuracy</b>

In [21]:
print('accuracy when criterion is ENTROPY:',accuracy_score(y_test.tolist(), preds_entropy.tolist())*100)

accuracy when criterion is ENTROPY: 93.87755102040816


In [22]:
print('accuracy when criterion is GINI:',accuracy_score(y_test.tolist(), preds_gini.tolist())*100)

accuracy when criterion is GINI: 89.79591836734694


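Thus, from the results above, the Decision Tree Classifier using the 'entropy' criterion (information gain) is more accurate on this test set than the one using 'gini': 93.88% versus 89.80%, that is, 46 versus 44 correct predictions out of 49. Because this comes from a single train/test split of a small dataset, the difference should not be read as a general advantage of entropy over Gini impurity.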