# Decision Tree

* You can download the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

__Team:__
* Sayonara Santos Araújo
* Lailson Azevedo

### Activities

1. We used entropy as the decision factor (the impurity measure of a node). Test the same random set of data with the Gini measure and compare the results.
Ref1.: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Ref2.: https://en.wikipedia.org/wiki/Decision_tree_learning

2. Balance the data in "train.csv", apply the Decision Tree algorithm, and submit the result to Kaggle. Try to improve on the result obtained in class (position 3100 on the leaderboard).
Dataset: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

3. (Optional) Run a Random Forest on the Kaggle competition and see whether the accuracy improves. Use 10, 100, or 1000 trees (depending on how much your computer can handle =]): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
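
Activity 2 asks for the data to be balanced before training. One possible approach, sketched here on synthetic data, is random undersampling of the majority class. The helper `undersample`, the toy feature names, and the use of `target` as the label column are assumptions for illustration; the real `train.csv` is not loaded here.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def undersample(df, target_col, random_state=0):
    """Hypothetical helper: keep as many rows per class as the rarest class has."""
    n_min = df[target_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    # Concatenate and shuffle so the classes are not stored in blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Synthetic stand-in for train.csv: roughly 5% positives, two made-up features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feat_a": rng.normal(size=1000),
    "feat_b": rng.integers(0, 5, size=1000),
    "target": (rng.random(1000) < 0.05).astype(int),
})

balanced = undersample(toy, "target")

# Fit a tree on the balanced frame only; evaluation data should stay unbalanced.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(balanced.drop(columns="target"), balanced["target"])
```

Undersampling discards majority-class rows, so oversampling or the `class_weight="balanced"` option of `DecisionTreeClassifier` may be worth comparing against it.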




In [2]:
import os
import pandas as pd
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [3]:
headers = ["buying", "maint", "doors", "persons","lug_boot", "safety", "class"]
data = pd.read_csv("car_data.csv", header=None, names=headers)

data = data.sample(frac=1).reset_index(drop=True) # shuffle
data.head()

,buying,maint,doors,persons,lug_boot,safety,class
0,high,high,4,4,big,high,acc
1,low,high,3,2,big,low,unacc
2,low,vhigh,3,4,small,high,acc
3,low,vhigh,5more,more,small,low,unacc
4,med,vhigh,2,4,med,low,unacc


In [4]:
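# Encode every column (features and label) as integer category codes.
for h in headers:
    data[h] = data[h].astype('category').cat.codes
# Move the label into the index so the frame's columns hold only the six features.
data.set_index("class", inplace=True)
data.head()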

class,buying,maint,doors,persons,lug_boot,safety
0,0,0,2,1,0,0
2,1,0,1,0,0,1
0,1,3,1,1,2,0
2,1,3,3,2,2,1
2,2,3,0,1,1,1


In [5]:
size = len(data)
train_size = int(math.floor(size * 0.7))
train_data = data[:train_size]
test_data = data[train_size:]
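
The manual 70/30 slice above relies on the earlier shuffle. As an alternative sketch, scikit-learn's `train_test_split` can shuffle, split, and keep class proportions similar in both parts in one call. The toy frame below is made-up data standing in for the encoded car table, and `random_state=42` is an arbitrary choice that makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the encoded car data (values are made up).
toy = pd.DataFrame({
    "buying": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
    "safety": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "class":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X = toy.drop(columns="class")
y = toy["class"]

# stratify=y keeps the class balance similar in the train and test parts;
# a fixed random_state reproduces the same split on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```

A fixed seed also makes it easier to guarantee that the Gini and entropy trees see exactly the same rows.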

In [6]:
## 1. We used entropy as the decision factor (the impurity measure of a node). Test the same random set of data with the Gini measure and compare the results.

## GINI
d_tree = DecisionTreeClassifier(criterion="gini")
d_tree.fit(train_data, train_data.index) 

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [7]:
pred = d_tree.predict(test_data.iloc[:, 0:6])  # predicted class codes for the test rows
score = d_tree.score(test_data, test_data.index)
print("Score Gini: {0:.4f}".format(score))

Score Gini: 0.9769
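
For reference, the two impurity measures compared in this activity can be written out directly. The sketch below computes them for the labels that reach a single node: Gini as one minus the sum of squared class proportions, and entropy as the sum of `p * log2(1 / p)`. The function names and the example label lists are illustrative and are not part of scikit-learn.

```python
import numpy as np


def class_proportions(labels):
    """Fraction of samples belonging to each distinct class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    p = class_proportions(labels)
    return float(np.sum(p * np.log2(1.0 / p)))


pure = ["unacc"] * 4                      # a node with a single class
mixed = ["unacc", "unacc", "acc", "acc"]  # a node split evenly between two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often produce similar trees.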


In [8]:
# draw the tree
import graphviz 
from sklearn import tree
dot_data = tree.export_graphviz(d_tree, out_file=None, feature_names=["buying", "maint", "doors", "persons","lug_boot", "safety"]) 
graph = graphviz.Source(dot_data) 
graph.render("sayonara_lailson_car_dataset_gini")

'sayonara_lailson_car_dataset_gini.pdf'

In [9]:
## 1. Test the same random set of data used for the Gini measure

## ENTROPY
d_tree_e = DecisionTreeClassifier(criterion="entropy")
d_tree_e.fit(train_data, train_data.index) 

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [10]:
pred_e = d_tree_e.predict(test_data.iloc[:, 0:6])  # predicted class codes for the test rows
score_e = d_tree_e.score(test_data, test_data.index)
print("Score Entropy: {0:.4f}".format(score_e))

Score Entropy: 0.9692
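
The two scores differ by less than one percentage point on a single random split, so the gap may simply reflect which rows landed in the test set. A hedged way to check is to score both criteria over several folds and look at the mean and spread. The sketch below uses synthetic data from `make_classification` as a stand-in for the car dataset; the sample size, fold count, and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 6 features and 4 classes, loosely mirroring the car data's shape.
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

results = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # 5-fold cross-validation; the default score for classifiers is accuracy.
    scores = cross_val_score(clf, X, y, cv=5)
    results[criterion] = (scores.mean(), scores.std())
    print("{0}: mean={1:.4f} std={2:.4f}".format(criterion, scores.mean(), scores.std()))
```

Because `cross_val_score` with an integer `cv` builds the same stratified folds for both classifiers, the two criteria are compared on identical splits.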


In [11]:
# draw the tree
import graphviz 
from sklearn import tree
dot_data_e = tree.export_graphviz(d_tree_e, out_file=None, feature_names=["buying", "maint", "doors", "persons","lug_boot", "safety"]) 
graph_e = graphviz.Source(dot_data_e) 
graph_e.render("sayonara_lailson_car_dataset_entropy")

'sayonara_lailson_car_dataset_entropy.pdf'