Decision Trees in Python with Scikit-Learn
-------------------------------------------------------------

Introduction
-----------------
A decision tree is one of the most frequently and widely used supervised machine learning algorithms, and it can perform both regression and classification tasks.

A decision tree is built from nodes that each test a single attribute, with the attribute that best separates the data placed at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition (or "decision") our sample meets. This process continues until a leaf node is reached, which contains the prediction, that is, the outcome of the decision tree.

Consider a scenario where a person asks to borrow your car for a day, and you have to decide whether or not to lend it to them. Several factors help determine your decision, some of which are shown in the figure below:

![decison_tree_image](datasets_n_images/images/decison_tree_image.png 'decison_tree_image')
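
The root-to-leaf walk described above can be sketched with a small hand-built tree. This is only an illustration: the two questions used here (`is_close_friend`, `has_valid_license`) are hypothetical factors chosen for the example, not necessarily the ones shown in the figure.

```python
# Hypothetical hand-built tree for the car-lending scenario.
# Internal nodes ask a yes/no question; leaf nodes hold the final decision.
tree = {
    "question": "is_close_friend",
    "yes": {
        "question": "has_valid_license",
        "yes": {"decision": "lend"},
        "no": {"decision": "do not lend"},
    },
    "no": {"decision": "do not lend"},
}


def decide(node, facts):
    """Walk from the root to a leaf, following the branch each answer selects."""
    while "decision" not in node:
        branch = "yes" if facts[node["question"]] else "no"
        node = node[branch]
    return node["decision"]


print(decide(tree, {"is_close_friend": True, "has_valid_license": True}))
print(decide(tree, {"is_close_friend": False, "has_valid_license": True}))
```

A learned tree works the same way at prediction time; the difference is that the questions and their order are chosen from the training data rather than written by hand.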

Advantages of Decision Trees
------------------------------

There are several advantages of using decision trees for predictive analysis:

>1. Decision trees can be used to predict both continuous and discrete values, i.e. they work well for both regression and classification tasks.

>2. They require relatively little effort to train.

>3. They can be used to classify non-linearly separable data.

>4. Prediction is fast: classifying a sample only requires walking from the root to a leaf, whereas KNN has to compare the sample against the stored training data.
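
The point about non-linearly separable data can be illustrated with a quick experiment. The sketch below uses synthetic XOR-style data (an assumption made for this example, not part of the tutorial's datasets): a point belongs to class 1 when its two coordinates have the same sign, so no single straight line separates the classes. Logistic regression is used as a stand-in for a linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: class 1 when both coordinates share a sign
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

X_xor_train, X_xor_test, y_xor_train, y_xor_test = train_test_split(
    X_xor, y_xor, test_size=0.2, random_state=0
)

# Fit a decision tree and a linear model on the same split
tree_acc = (
    DecisionTreeClassifier(random_state=0)
    .fit(X_xor_train, y_xor_train)
    .score(X_xor_test, y_xor_test)
)
linear_acc = (
    LogisticRegression()
    .fit(X_xor_train, y_xor_train)
    .score(X_xor_test, y_xor_test)
)

print("Decision tree accuracy:", tree_acc)
print("Logistic regression accuracy:", linear_acc)
```

On data like this the linear model should do little better than guessing, while the tree can carve the plane into axis-aligned regions that match the four quadrants.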

# 1. Decision Tree for Classification
---------------------------------------------------------

Here, we will predict whether a bank note is authentic or fake based on four attributes of an image of the note: the variance, skewness, and kurtosis (spelled "Curtosis" in the dataset) of the wavelet-transformed image, and the entropy of the image.

In [1]:
# Load the bank note dataset and take a first look at it
import pandas as pd
import numpy as np
%matplotlib inline

bankdata = pd.read_csv("./datasets_n_images/datasets_module_4/bill_authentication.csv")
print(bankdata.shape,"\n")
print(bankdata.head())

# Class = 0: not fake (authentic)
# Class = 1: fake

(1372, 5) 

   Variance  Skewness  Curtosis  Entropy  Class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0


In [2]:
X = bankdata.drop('Class',axis=1)
y = bankdata['Class']

In [3]:
X.head()

   Variance  Skewness  Curtosis  Entropy
0   3.62160    8.6661   -2.8073 -0.44699
1   4.54590    8.1674   -2.4586 -1.46210
2   3.86600   -2.6383    1.9242  0.10645
3   3.45660    9.5228   -4.0112 -3.59440
4   0.32924   -4.4552    4.5718 -0.98880


In [4]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [6]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
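
A fitted tree can be inspected to see which attribute it tests at the root and how much each attribute contributes. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the bank note CSV (the column names are reused purely as labels), and it limits the depth to 2 so the printout stays short. `export_text` and `feature_importances_` are part of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names borrowed from the bank note dataset, used here only as labels
feature_names = ["Variance", "Skewness", "Curtosis", "Entropy"]

# Synthetic stand-in data: 4 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# A shallow tree keeps the printed rules readable
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Text rendering of the learned rules; the first line is the root split
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)

# Impurity-based importance of each feature (sums to 1 when the tree has splits)
for name, importance in zip(feature_names, demo_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Running the same two calls on the `classifier` fitted above would show the rules learned from the real bank note data.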

In [7]:
y_pred = classifier.predict(X_test)

In [8]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
confusion_matrix(y_test, y_pred)

array([[160,   4],
       [  1, 110]], dtype=int64)

In [9]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98       164
           1       0.96      0.99      0.98       111

    accuracy                           0.98       275
   macro avg       0.98      0.98      0.98       275
weighted avg       0.98      0.98      0.98       275



In [10]:
accuracy_score(y_test, y_pred)

0.9818181818181818
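
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the accuracy and the class 1 precision, recall, and F1 score from the matrix printed above, relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[160, 4],
               [1, 110]])

# For the binary case, ravel() yields TN, FP, FN, TP with class 1 as "positive"
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()        # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of the notes predicted fake, how many were fake
recall_1 = tp / (tp + fn)              # of the fake notes, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("Accuracy:", round(accuracy, 4))
print("Class 1 precision:", round(precision_1, 2))
print("Class 1 recall:", round(recall_1, 2))
print("Class 1 f1-score:", round(f1_1, 2))
```

The results agree with the report: 270 of 275 test notes are classified correctly, and the four authentic notes flagged as fake are what pull the class 1 precision down to 0.96.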

In [11]:
# Repeat the split/train/evaluate steps for several test set sizes
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

for i in range(1, 10):
    # test_size grows from 0.15 (iteration 1) to 0.55 (iteration 9) in steps of 0.05
    test_size = 0.15 + 0.05 * (i - 1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

    # Train a fresh DecisionTreeClassifier on this split
    classifier = DecisionTreeClassifier()
    classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = classifier.predict(X_test)

    # Report the results for this iteration
    print("Iteration", i, ":\n")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred), "\n")
    print("\nclassification_report:\n", classification_report(y_test, y_pred))
    print("\nAccuracy:\n", accuracy_score(y_test, y_pred))



Iteration 1 :

Confusion Matrix:
 [[124   2]
 [  1  79]] 


classification_report:
               precision    recall  f1-score   support

           0       0.99      0.98      0.99       126
           1       0.98      0.99      0.98        80

    accuracy                           0.99       206
   macro avg       0.98      0.99      0.98       206
weighted avg       0.99      0.99      0.99       206


Accuracy:
 0.9854368932038835
Iteration 2 :

Confusion Matrix:
 [[151   1]
 [  2 121]] 


classification_report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       152
           1       0.99      0.98      0.99       123

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275


Accuracy:
 0.9890909090909091
Iteration 3 :

Confusion Matrix:
 [[191   2]
 [  2 148]] 


classification_report:
               precision    recal

# 2. Decision Tree for Regression
------------------------------------------------------

We will use the petrol_consumption.csv dataset to try to predict petrol consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles), and the proportion of the population with a driver's license.
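
Before fitting the real data, it helps to see how a regression tree produces a number. With scikit-learn's default squared-error criterion, each leaf predicts the mean of the training targets that fall into it. The sketch below uses a tiny made-up dataset (not the petrol data) and a tree limited to a single split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: two clearly separated groups of points
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([10, 12, 14, 100, 110, 120])

# max_depth=1 allows exactly one split, giving two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the leaf the sample lands in:
# left leaf mean = (10 + 12 + 14) / 3 = 12, right leaf mean = (100 + 110 + 120) / 3 = 110
toy_pred = stump.predict([[2.5], [11.5]])
print(toy_pred)
```

A tree grown without a depth limit keeps splitting until its leaves are pure, so on a small dataset its predictions are often exact target values copied from individual training records.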

In [12]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

dataset = pd.read_csv('./datasets_n_images/datasets_module_4/petrol_consumption.csv')

dataset.head()  

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                          0.58                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [13]:
dataset.describe()

       Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count        48.0            48.0            48.0                          48.0                48.0
mean     7.668333     4241.833333     5565.416667                      0.570333          576.770833
std       0.95077      573.623768     3491.507166                       0.05547          111.885816
min           5.0          3063.0           431.0                         0.451               344.0
25%           7.0          3739.0         3110.25                       0.52975               509.5
50%           7.5          4298.0          4735.5                        0.5645               568.5
75%         8.125         4578.75          7156.0                       0.59525              632.75
max          10.0          5342.0         17782.0                         0.724               968.0


In [14]:
X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,random_state=0)


from sklearn.tree import DecisionTreeRegressor 
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

In [15]:
y_pred

array([487., 524., 580., 554., 574.])

In [16]:
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df

    Actual  Predicted
29     534      487.0
4      410      524.0
26     577      580.0
30     571      554.0
32     577      574.0


**Remember:** in your case the records compared may be different, depending on the training and testing split. The train_test_split method splits the data randomly, so unless random_state is fixed (as it is above with random_state=0) we likely won't have the same training and test sets.
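
The effect of the seed is easy to check. The sketch below splits a small made-up array (the integers 0 to 19, standing in for row indices) three times: twice with the same `random_state` and once with a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for row indices
rows = np.arange(20)

# Same seed twice, then a different seed
_, test_a = train_test_split(rows, test_size=0.25, random_state=0)
_, test_b = train_test_split(rows, test_size=0.25, random_state=0)
_, test_c = train_test_split(rows, test_size=0.25, random_state=1)

print("Seed 0:      ", test_a)
print("Seed 0 again:", test_b)
print("Seed 1:      ", test_c)
print("Same split for the same seed:", np.array_equal(test_a, test_b))
```

The two splits made with the same seed are identical, while a different seed (or no seed at all) will usually select different rows.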

In [17]:
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

Mean Absolute Error: 36.8
Mean Squared Error: 3102.4
Root Mean Squared Error: 55.69919209467943


The mean absolute error for our algorithm is 36.8, which is less than 10% of the mean of all the values in the 'Petrol_Consumption' column (the mean is 576.77, so 10% of it is 57.677). This means that our algorithm did a reasonable prediction job, although the test set here contains only five records, so this estimate is rough.
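
The error metrics printed above can be reproduced by hand from the five actual/predicted pairs in the comparison table. The sketch below does the arithmetic with NumPy and relates the mean absolute error to the column mean reported by `dataset.describe()`.

```python
import numpy as np

# Actual and predicted values copied from the comparison table above
actual = np.array([534, 410, 577, 571, 577])
predicted = np.array([487.0, 524.0, 580.0, 554.0, 574.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error

# Mean of Petrol_Consumption, taken from the describe() output
mean_consumption = 576.770833
relative_mae = mae / mean_consumption

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("MAE as a share of the mean:", round(relative_mae, 3))
```

Most of the error comes from a single record (row 4, predicted 524 against an actual 410), which is why the root mean squared error is noticeably larger than the mean absolute error.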

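For comparison, the remaining cells fit a linear regression model to the same dataset, this time holding out 20% of the records for testing.

In [25]: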
# necessary imports
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# loading the dataset
dataset = pd.read_csv('./datasets_n_images/datasets_module_4/petrol_consumption.csv')

dataset.head() 

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                          0.58                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [27]:
dataset.describe()

       Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count        48.0            48.0            48.0                          48.0                48.0
mean     7.668333     4241.833333     5565.416667                      0.570333          576.770833
std       0.95077      573.623768     3491.507166                       0.05547          111.885816
min           5.0          3063.0           431.0                         0.451               344.0
25%           7.0          3739.0         3110.25                       0.52975               509.5
50%           7.5          4298.0          4735.5                        0.5645               568.5
75%         8.125         4578.75          7156.0                       0.59525              632.75
max          10.0          5342.0         17782.0                         0.724               968.0


In [28]:
X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

In [29]:
from sklearn.model_selection import train_test_split
# test_size defaults to 0.25 when omitted; random_state is the seed that makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training the Algorithm
from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [30]:
y_pred = regressor.predict(X_test)  

# Now let's compare some of our predicted values with the actual values 
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df  

    Actual   Predicted
29     534  469.391989
4      410  545.645464
26     577  589.668394
30     571  569.730413
32     577  649.774809
37     704  646.631164
34     487  511.608148
40     587  672.475177
7      467  502.074782
10     580  501.270734


In [31]:
# Evaluating the Algorithm
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

Mean Absolute Error: 56.8222474789647
Mean Squared Error: 4666.344787588363
Root Mean Squared Error: 68.3106491521517
