# Decision Tree Classifier

A decision tree is one of the most widely used supervised machine learning algorithms, and it can perform both regression and classification tasks. The intuition behind the algorithm is simple, yet very powerful.

The algorithm builds a tree in which each internal node tests one attribute of the dataset, with the most important attribute placed at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition, or "decision", our sample meets. This process continues until a leaf node is reached, which contains the prediction or outcome of the decision tree.
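The paragraph above does not say how the "most important" attribute is identified. One common measure, and the default `criterion` of scikit-learn's `DecisionTreeClassifier`, is Gini impurity: a split is preferred when it leaves purer groups of labels. The sketch below is a minimal illustration under that assumption; the helper names `gini` and `split_impurity` and the toy labels are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Toy labels (hypothetical): 1 = survived, 0 = did not survive
parent = [1, 1, 1, 0, 0, 0]
split_a = ([1, 1, 1], [0, 0, 0])   # separates the classes perfectly
split_b = ([1, 0, 1], [0, 1, 0])   # leaves both groups mixed

print(gini(parent))                # 0.5
print(split_impurity(*split_a))    # 0.0
print(split_impurity(*split_b))    # about 0.444
```

A split like `split_a`, which drives the weighted impurity to zero, would be chosen over `split_b`, and the attribute producing it would sit nearer the root.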

This may sound a bit complicated at first, but you have probably been using decision trees to make decisions your entire life without knowing it. Consider a scenario where a person asks to borrow your car for a day, and you have to decide whether or not to lend it. Several factors help determine your decision, some of which are listed below:

1. Is this person a close friend or just an acquaintance? If the person is just an acquaintance, decline the request; if the person is a friend, move to the next step.
2. Is the person asking for the car for the first time? If so, lend them the car; otherwise, move to the next step.
3. Was the car damaged the last time they returned it? If yes, decline the request; if no, lend them the car.

The decision tree for the aforementioned scenario looks like this:

[Figure: decision tree for the car-lending scenario]
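The same three questions can be written as nested conditions, which is all a trained decision tree does at prediction time. This is only a sketch of the car-lending example; the function name `lend_car` and its argument names are made up for illustration.

```python
def lend_car(is_close_friend, is_first_request, damaged_last_time):
    """Walk the car-lending decision tree from the root to a leaf."""
    # Root node: close friend or just an acquaintance?
    if not is_close_friend:
        return "decline"
    # Second node: is this the first time they have asked?
    if is_first_request:
        return "lend"
    # Third node: was the car damaged the last time they returned it?
    if damaged_last_time:
        return "decline"
    return "lend"

print(lend_car(is_close_friend=False, is_first_request=True, damaged_last_time=False))  # decline
print(lend_car(is_close_friend=True, is_first_request=True, damaged_last_time=False))   # lend
print(lend_car(is_close_friend=True, is_first_request=False, damaged_last_time=True))   # decline
```

Each `return` corresponds to a leaf node, and each `if` to an internal node of the tree.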

### Advantages of Decision Trees
There are several advantages of using decision trees for predictive analysis:

- Decision trees can predict both continuous and discrete values, i.e. they work for both regression and classification tasks.
- They require relatively little effort to train.
- They can classify data that is not linearly separable.
- They are fast at prediction time: classifying a sample only requires following one path from the root to a leaf, whereas KNN has to search the stored training data for neighbours.
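To illustrate the point about data that is not linearly separable, the sketch below compares a decision tree with a linear model on XOR-style data. The synthetic dataset and the choice of `LogisticRegression` as the linear baseline are assumptions made for this example; neither comes from the notebook's Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data (an assumption for illustration):
# the label is 1 when both coordinates have the same sign.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# No single straight line separates the two classes, so the linear model struggles,
# while the tree can carve the plane into axis-aligned regions.
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

print('Decision tree accuracy      ', tree_acc)
print('Logistic regression accuracy', linear_acc)
```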

In [1]:
import pickle as pkl

with open('../data/titanic_tansformed.pkl', 'rb') as f:
    df_data = pkl.load(f)

In [2]:
df_data.head()

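| Unnamed: 0 | Survived | Age | SibSp | Parch | Fare | 2 | 3 | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 1 | 0 | 1 |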


In [3]:
df_data.shape

(889, 10)

In [4]:
# Features are every column except the target; the target is the Survived flag
data = df_data.drop("Survived", axis=1)
label = df_data["Survived"]

In [5]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size=0.2, random_state=101)

In [17]:
import time

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Train a tree with default hyperparameters and time the fit
tic = time.time()
dt_cla = DecisionTreeClassifier()
dt_cla.fit(data_train, label_train)
print('Time taken for training Decision Tree', (time.time() - tic), 'secs')

# Evaluate on the held-out test set
predictions = dt_cla.predict(data_test)
print('Accuracy', dt_cla.score(data_test, label_test))

# In the confusion matrix, rows are true classes and columns are predicted classes
print(confusion_matrix(label_test, predictions))
print(classification_report(label_test, predictions))

Time taken for training Decision Tree 0.002324819564819336 secs
Accuracy 0.7752808988764045
[[85 22]
 [18 53]]
             precision    recall  f1-score   support

          0       0.83      0.79      0.81       107
          1       0.71      0.75      0.73        71

avg / total       0.78      0.78      0.78       178
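The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix from the output above and recomputes accuracy, precision and recall; the helper names `precision` and `recall` are made up for this example.

```python
# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = [[85, 22],
      [18, 53]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls):
    """Correct predictions of `cls` divided by all predictions of `cls` (column sum)."""
    return cm[cls][cls] / (cm[0][cls] + cm[1][cls])

def recall(cls):
    """Correct predictions of `cls` divided by all true members of `cls` (row sum)."""
    return cm[cls][cls] / sum(cm[cls])

print('accuracy', round(accuracy, 4))   # accuracy 0.7753
for cls in (0, 1):
    print(cls, round(precision(cls), 2), round(recall(cls), 2))
# 0 0.83 0.79
# 1 0.71 0.75
```

For example, of the 107 passengers who truly belong to class 0, 85 were predicted correctly, giving a recall of 85 / 107, or about 0.79.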



### Hyperparameters for Decision Tree
- A decision tree has a number of [hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- The most commonly tuned parameters are
    - __max_depth__ - The maximum depth of the tree. Defaults to `None`, in which case nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples
    - __min_samples_split__ - The minimum number of samples required to split an internal node. Defaults to 2
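The effect of `max_depth` can be seen by fitting one unrestricted tree and one depth-limited tree on the same data. This is a sketch on synthetic stand-in data from `make_classification` (an assumption; it is not the Titanic set used in this notebook), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's Titanic data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth=None (the default) grows the tree until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# max_depth=3 stops splitting after three levels
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [('unrestricted', full_tree), ('max_depth=3', shallow_tree)]:
    print(name,
          '| depth', tree.get_depth(),
          '| leaves', tree.get_n_leaves(),
          '| train acc', tree.score(X_train, y_train),
          '| test acc', tree.score(X_test, y_test))
```

The unrestricted tree typically fits the training set perfectly, which is a sign of overfitting; limiting the depth trades some training accuracy for a simpler model.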

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values; min_samples_split must be an integer >= 2 (or a float fraction),
# so 1 is not a valid candidate
max_depth = [2, 3, 4, 5, 6, 7, 8]
min_samples_split = [2, 3, 4, 5, 10, 20]
score_func = 'accuracy'

# Exhaustive search over the grid with 5-fold cross-validation
dt_cla = DecisionTreeClassifier()
dt_grid = GridSearchCV(estimator=dt_cla,
                    param_grid=[{'max_depth': max_depth, 'min_samples_split': min_samples_split}],
                    cv=5,
                    scoring=score_func)
dt_grid.fit(data_train, label_train)
print('Best Score', dt_grid.best_score_)
print('Best Max Depth', dt_grid.best_estimator_.max_depth)
print('Best Split Samples', dt_grid.best_estimator_.min_samples_split)

Best Score 0.8185654008438819
Best Max Depth 5
Best Split Samples 2
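Note that `best_score_` is a mean cross-validated accuracy computed on the training data, so it is not directly comparable to the test-set accuracy of about 0.775 reported earlier. To compare like with like, the refitted `best_estimator_` can be scored on the held-out test set. The sketch below shows the pattern on synthetic stand-in data (an assumption, so that it runs on its own); in this notebook, `data_train`, `data_test`, `label_train` and `label_test` would take the place of the `X_*` and `y_*` variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's Titanic data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 5, 10]},
                    cv=5,
                    scoring='accuracy')
grid.fit(X_train, y_train)

# best_score_ is the mean accuracy over the 5 cross-validation folds
print('Best params  ', grid.best_params_)
print('CV accuracy  ', grid.best_score_)

# With refit=True (the default), best_estimator_ is retrained on all of X_train
test_acc = grid.best_estimator_.score(X_test, y_test)
print('Test accuracy', test_acc)
```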
