<img src="https://cellstrat2.s3.amazonaws.com/PlatformAssets/bluewhitelogo.svg" alt="drawing" width="200"/>

# ML Tuesdays - Session 4
## Machine Learning Track
### Decision Trees Exercise

### Guidelines
1. The notebook has been split into multiple steps with fine-grained instructions for each step. Use the instructions for each code cell to complete the code.
2. You can refer to the Logistic Regression Module in the Machine Learning Pack from CellStrat Hub.
3. Make use of the docstrings of the functions and classes using the `shift+tab` shortcut key.
4. Refer to the internet for an explanation of any algorithm.

## About the Dataset
The dataset consists of the following input variables:
1 - fixed acidity  
2 - volatile acidity  
3 - citric acid  
4 - residual sugar  
5 - chlorides  
6 - free sulfur dioxide  
7 - total sulfur dioxide  
8 - density  
9 - pH   
10 - sulphates   
11 - alcohol

The output variable gives the quality of the wine based on the input variables:

12 - quality (score between 0 and 10)

The decision tree is one of the most versatile algorithms in machine learning and can perform both classification and regression. It handles complex datasets well and is easy to read and interpret, which makes it a popular choice. When combined with ensemble techniques, which we will cover soon, it performs even better.
As the name suggests, the algorithm recursively divides the dataset into a tree-like structure based on rules and conditions on the input features, and then makes predictions by following those conditions.
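
To make the idea of "rules and conditions" concrete, here is a minimal pure-Python sketch of how a single split could be chosen using Gini impurity (the default criterion in scikit-learn's `DecisionTreeClassifier`). The helper names `gini` and `best_threshold` and the toy values are hypothetical, and the real library implementation is considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted unique values.

    Returns (threshold, weighted impurity of the two children).
    If no split improves on the parent impurity, threshold is None.
    """
    best = (None, gini(labels))
    uniq = sorted(set(values))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: alcohol content and a made-up quality label.
alcohol = [9.0, 9.4, 9.8, 10.5, 11.2, 12.0]
quality = ["low", "low", "low", "high", "high", "high"]

threshold, impurity = best_threshold(alcohol, quality)
print("parent impurity:", gini(quality))
print("best threshold:", threshold, "child impurity:", impurity)
```

A full tree repeats this search over every feature at every node, then recurses into the left and right subsets until a stopping rule (such as `max_depth` or `min_samples_leaf`) applies.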





### Import Libraries

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("winequality_red.csv")
data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5




1. Split to X and y data
2. Perform Train Test Split



In [3]:
X = data.drop(columns = 'quality')
y = data['quality']

In [4]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.30, random_state= 355)

### Create a Decision Tree without Preprocessing

Make use of the scikit-learn documentation and Google to understand the Decision Tree algorithm and apply it.


In [5]:
clf = DecisionTreeClassifier()
clf.fit(x_train,y_train)

DecisionTreeClassifier()

#### Visualise the tree

In [None]:
from sklearn import tree
plt.figure(figsize=(25,10))
tree.plot_tree(clf,filled=True)

In [7]:
feature_name=list(X.columns)
class_name = list(y_train.unique())
feature_name

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']
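
The feature names collected above can make a fitted tree much easier to read. As a hedged sketch, the snippet below trains a tiny tree on hypothetical toy data (not the wine dataset) and prints its rules with `sklearn.tree.export_text`, passing the column names through `feature_names`. The names `X_toy`, `y_toy`, `toy_features`, and `toy_clf` are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data using two of the wine feature names.
X_toy = [[9.0, 0.70], [9.4, 0.66], [9.8, 0.60],
         [10.5, 0.40], [11.2, 0.35], [12.0, 0.30]]
y_toy = [5, 5, 5, 6, 6, 6]
toy_features = ["alcohol", "volatile acidity"]

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# export_text renders the learned splits as indented if/else style rules.
rules = export_text(toy_clf, feature_names=toy_features)
print(rules)
```

`tree.plot_tree` accepts the same `feature_names` argument, plus `class_names`, which should be strings listed in ascending order of the class labels (for example `[str(c) for c in clf.classes_]`).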

### Make predictions on test data and check accuracy

In [8]:
clf.score(x_train, y_train)

1.0

In [9]:
y_pred = clf.predict(x_test)

In [10]:
# accuracy of classification tree
clf.score(x_test,y_test)

0.6229166666666667
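
For a classifier, `clf.score` reports plain accuracy, which is what `accuracy_score` computes from true and predicted labels. A confusion matrix additionally shows which quality scores get mistaken for which. The sketch below uses hypothetical labels rather than the notebook's real predictions, so the numbers are for illustration only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted quality scores (not taken from the wine data).
y_true_demo = [5, 5, 6, 6, 6, 7, 7, 5]
y_pred_demo = [5, 6, 6, 6, 5, 7, 6, 5]

acc = accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[5, 6, 7])

print(acc)  # 0.625 (5 of 8 predictions match)
print(cm)
```

In the notebook, the same calls would take `y_test` and the predictions from `clf.predict(x_test)`.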

### Try to tune some hyperparameters using the GridSearchCV algorithm

When we tune hyperparameters, we try to find the combination of hyperparameter values that gives the model with the highest accuracy. Let's go ahead and try to improve our model.


GridSearchCV is a method for tuning hyperparameters. We pass candidate values for each hyperparameter as a parameter grid.
It exhaustively generates every combination of the values passed.
Using the cross-validation score, grid search returns the combination of hyperparameters for which the model performs best.

In [11]:
# we are tuning five hyperparameters here, passing several candidate values for each
grid_param = {
    'criterion': ['gini', 'entropy'],
    'max_depth' : range(2,32,1),
    'min_samples_leaf' : range(1,10,1),
    'min_samples_split': range(2,10,1),
    'splitter' : ['best', 'random']
    
}
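
To get a feel for what "exhaustive" means for the grid above, the following sketch enumerates the combinations with `itertools.product` and counts the model fits implied by 5-fold cross-validation. The name `demo_grid` is a hypothetical copy of the grid, used so the snippet runs on its own.

```python
from itertools import product

# Same candidate values as the grid in the notebook.
demo_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(2, 32, 1),
    "min_samples_leaf": range(1, 10, 1),
    "min_samples_split": range(2, 10, 1),
    "splitter": ["best", "random"],
}

# Every combination of one value per hyperparameter.
combinations = list(product(*demo_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * 5  # one fit per candidate per fold with cv=5

first_candidate = dict(zip(demo_grid.keys(), combinations[0]))
print(n_candidates, n_fits)  # 8640 43200
print(first_candidate)
```

The grid therefore grows multiplicatively with each added hyperparameter, which is why the search can take a while even on a small dataset.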

In [12]:
grid_search = GridSearchCV(estimator=clf,
                           param_grid=grid_param,
                           cv=5,
                           n_jobs=-1)

In [13]:
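# Standardize the features to zero mean and unit variance.
# Note: the scaler is fit on the full dataset here, before the train/test split.
scaler = StandardScaler()

x_transform = scaler.fit_transform(X)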

In [14]:
x_train,x_test,y_train,y_test = train_test_split(x_transform,y,test_size = 0.30, random_state= 355)
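
The cells above fit `StandardScaler` on the full dataset before splitting, so the test rows influence the scaling statistics. One possible alternative, sketched here under assumptions, is to wrap the scaler and the tree in a `Pipeline` so that scaling is learned only from the training folds inside the grid search. The data comes from `make_classification` as a synthetic stand-in for the wine data, and the step names `"scaler"` and `"tree"` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=355
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
pipe_grid = {
    "tree__max_depth": [2, 4, 8],
    "tree__criterion": ["gini", "entropy"],
}

pipe_search = GridSearchCV(pipe, param_grid=pipe_grid, cv=5)
pipe_search.fit(X_tr, y_tr)  # the scaler is refit on each training fold

pipe_best_params = pipe_search.best_params_
pipe_test_score = pipe_search.score(X_te, y_te)
print(pipe_best_params)
print(pipe_test_score)
```

The grid here is deliberately small to keep the sketch fast; the notebook's larger grid could be used in the same way by prefixing each key with `tree__`.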

In [15]:
grid_search.fit(x_train,y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(2, 32),
                         'min_samples_leaf': range(1, 10),
                         'min_samples_split': range(2, 10),
                         'splitter': ['best', 'random']})

In [16]:
best_parameters = grid_search.best_params_
print(best_parameters)

{'criterion': 'entropy', 'max_depth': 26, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}


In [17]:
grid_search.best_score_

0.6095051249199231

In [18]:
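# Note: max_depth=24 here differs from the max_depth=26 printed by the grid search above;
# grid_search.best_estimator_ holds the model refit with the exact best parameters.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=24, min_samples_leaf=1, min_samples_split=2, splitter='random')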
clf.fit(x_train,y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=24, splitter='random')

In [19]:
clf.score(x_test,y_test)

0.5895833333333333

### Check whether the score has improved after using GridSearchCV
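
One way to make this comparison less error-prone is to score an untuned tree and the search's `best_estimator_` on the same held-out split, instead of retyping the best parameters by hand. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, not the wine dataset), with a deliberately small hypothetical grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=355
)

# Untuned tree with default hyperparameters.
baseline = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already retrained on all of X_tr.
tuned = search.best_estimator_

baseline_score = baseline.score(X_te, y_te)
tuned_score = tuned.score(X_te, y_te)
print(f"baseline test accuracy: {baseline_score:.3f}")
print(f"tuned test accuracy:    {tuned_score:.3f}")
print(f"mean CV accuracy of best candidate: {search.best_score_:.3f}")
```

Fixing `random_state` on the estimator keeps the comparison repeatable, which matters in particular when `splitter='random'` is among the candidates.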



In [None]:
# Visualizing the tree
from sklearn import tree
plt.figure(figsize=(25,10))
tree.plot_tree(clf,filled=True)

#### Note: passing all the hyperparameters to GridSearchCV doesn't guarantee the best result.
#### We have to experiment with the parameters by trial and error to get a better score.