# DECISION TREE

I'm a beginner, so I drew on public solutions from GitHub, Kaggle, Google Developer and Udemy and combined them here for better understanding.

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

PROS
- easy to understand and can be visualised (super helpful for understanding)
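- requires little data preparation, although this model doesn't work with missing values, so those have to be handled beforehand
- able to handle both numerical and categorical data, whereas most other algorithms handle just one type of variable (note that scikit-learn's implementation expects categorical features to be encoded as numbers)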

CONS
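- overfitting - the tree can become over-complex, which makes it hard to understand and means it generalises poorly to new data
- instability - even small variations in the data can produce a completely different tree. Moreover, the learning algorithm is greedy, so it cannot guarantee to return the globally optimal decision tree
- biased trees - if some classes dominate, the tree can be biased towards them, so it helps to balance the dataset before fitting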

Source: http://scikit-learn.org/stable/modules/tree.html

Sources (general): http://dni-institute.in/blogs/step-by-step-tutorial-on-decision-tree-using-python/
http://stackabuse.com/decision-trees-in-python-with-scikit-learn/
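
The overfitting risk listed under CONS can be seen in a small sketch. It does not use the bill dataset: the data is synthetic and noisy (generated with `make_classification`, an assumption made only for illustration), and `max_depth=3` is an arbitrary choice. An unrestricted tree memorises the training set, while a shallow tree cannot, and it often holds up similarly or better on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (illustrative only, not the bill dataset)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# One tree grown without limits, one tree restricted to three levels
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unlimited depth", deep), ("max_depth=3", shallow)]:
    print(name,
          "| depth:", model.get_depth(),
          "| train accuracy:", round(model.score(X_train, y_train), 3),
          "| test accuracy:", round(model.score(X_test, y_test), 3))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting.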

In [24]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

In [25]:
bill = pd.read_csv(r'C:/Users/tomek/Desktop/data/bill_authentication.csv')  # load the dataset

In [26]:
bill.describe()

| | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| count | 1372.0 | 1372.0 | 1372.0 | 1372.0 | 1372.0 |
| mean | 0.433735 | 1.922353 | 1.397627 | -1.191657 | 0.444606 |
| std | 2.842763 | 5.869047 | 4.31003 | 2.101013 | 0.497103 |
| min | -7.0421 | -13.7731 | -5.2861 | -8.5482 | 0.0 |
| 25% | -1.773 | -1.7082 | -1.574975 | -2.41345 | 0.0 |
| 50% | 0.49618 | 2.31965 | 0.61663 | -0.58665 | 0.0 |
| 75% | 2.821475 | 6.814625 | 3.17925 | 0.39481 | 1.0 |
| max | 6.8248 | 12.9516 | 17.9274 | 2.4495 | 1.0 |


In [27]:
bill.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
Variance    1372 non-null float64
Skewness    1372 non-null float64
Curtosis    1372 non-null float64
Entropy     1372 non-null float64
Class       1372 non-null int64
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


In [28]:
bill.head()

| | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


We need to divide our data into a training set and a test set. Thanks to this we can train our algorithm on one set of data and then test it on a completely different set that the algorithm hasn't seen yet.

In [29]:
X = bill.drop('Class', axis=1)  # X is the attribute set: all columns except "Class", which is the label
y = bill['Class']  # y contains the corresponding labels from the "Class" column

In [30]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  
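
The split above is random, so the confusion matrix further down will differ a little on every run. A minimal sketch of a reproducible split follows. It uses hypothetical toy data rather than the bill dataset, and the `random_state=42` and `stratify=y` settings are additions for illustration that the original cell does not use. `stratify=y` keeps the class proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, one feature, two balanced classes
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# random_state fixes the shuffle; stratify=y preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 8 training samples and 2 test samples
print(sorted(y_test))             # one sample from each class
```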

In [31]:
from sklearn.tree import DecisionTreeClassifier  
classifier = DecisionTreeClassifier()  
classifier.fit(X_train, y_train)  

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Explanation of the parameters shown above:

- CLASS_WEIGHT - 'None' here, which means all classes are supposed to have weight one
- CRITERION - 'gini' - the Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. It is used to measure the quality of a split
- MAX_DEPTH - 'None' here, which means that nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples
- MAX_FEATURES - the number of features to consider when looking for the best split. 'None' (the default) means all features are considered (max_features=n_features)
- MAX_LEAF_NODES - 'None' here, which means an unlimited number of leaf nodes
- MIN_IMPURITY_DECREASE - a node will be split if the split induces a decrease of the impurity greater than or equal to this value
- MIN_IMPURITY_SPLIT - a threshold for stopping tree growth early: a node will split if its impurity is above the threshold, otherwise it is a leaf. It is deprecated in favour of min_impurity_decrease
- MIN_SAMPLES_LEAF - 1 here, the default. This is the minimum number of samples required to be at a leaf node
- MIN_SAMPLES_SPLIT - 2 here, the default. This is the minimum number of samples required to split an internal node
- MIN_WEIGHT_FRACTION_LEAF - 0 here, the default. This is the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node
- PRESORT - whether to presort the data to speed up the search for the best splits. 'False' is the default; with the default tree settings on large datasets, setting it to true may slow down the training process
- RANDOM_STATE - 'None' here, which means the random number generator is the RandomState instance used by np.random, so results can differ between runs
- SPLITTER - 'best' here, the default. It means the best split is chosen at each node (the alternative, 'random', chooses the best random split)
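
To make the CRITERION entry concrete, here is a minimal sketch of the Gini impurity in plain Python. The function names `gini` and `split_impurity` are hypothetical helpers written for this note, not part of scikit-learn. The impurity of a node is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted impurity of its two children (lower is better).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                              # 0.5, the maximum for two classes
print(split_impurity([0, 0, 0], [1, 1, 1]))      # 0.0, both children are pure
print(split_impurity([0, 0, 1], [0, 1, 1]))      # about 0.444, a weak split
```

The tree builder prefers the second-to-last split over the last one because it reduces the impurity the most.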

Nice explanations of the Decision Tree Classifier: https://medium.com/machine-learning-101/chapter-3-decision-trees-theory-e7398adac567, http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [19]:
y_pred = classifier.predict(X_test)  

In [20]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  

[[148   3]
 [  1 123]]
             precision    recall  f1-score   support

          0       0.99      0.98      0.99       151
          1       0.98      0.99      0.98       124

avg / total       0.99      0.99      0.99       275



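Explanation of the confusion matrix (rows are the actual classes, columns are the predicted classes; "no" is class 0 and "yes" is class 1):

- 148 - predicted no and actual no
- 3 - predicted yes and actual no
- 1 - predicted no and actual yes
- 123 - predicted yes and actual yes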


Source: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
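
The numbers in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below does this in plain Python; `class_metrics` is a hypothetical helper written for this note, and it assumes scikit-learn's layout where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
cm = [[148, 3],
      [1, 123]]

def class_metrics(cm, k):
    """Precision, recall and F1 score for class k."""
    tp = cm[k][k]                            # correctly predicted as k
    predicted_k = sum(row[k] for row in cm)  # column total: everything predicted as k
    actual_k = sum(cm[k])                    # row total: everything that really is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(2):
    p, r, f = class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
# prints "0 0.99 0.98 0.99" and "1 0.98 0.99 0.98", matching the report

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(round(accuracy, 3))  # 0.985, i.e. 271 correct out of 275
```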

In [22]:
from sklearn import tree

In [23]:
with open("dt_train_gini.txt", "w") as f:
    tree.export_graphviz(classifier, out_file=f)  # writes the tree in Graphviz DOT format; no need to reassign f

The visualisation is saved as a separate file in Graphviz DOT format, which a Graphviz tool can render as an image.
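
If Graphviz is not installed, the learned rules can also be printed as plain text. The sketch below is an alternative, not what this notebook does: it uses `export_text`, which exists only in scikit-learn 0.21 and later (newer than the version the outputs above suggest), and it fits a tiny hypothetical dataset with a single feature named "Variance" just to show the shape of the output.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): one feature, class 1 when the value is large
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text returns the decision rules as an indented string
rules = export_text(clf, feature_names=["Variance"])
print(rules)
```

With the real model, passing `feature_names=list(X.columns)` would label the splits with the dataset's column names.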