### Activities

1. We used Entropy as the decision criterion (the impurity measure of a node). Run the same randomly shuffled dataset with the Gini measure and compare the results.
Ref. 1: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Ref. 2: https://en.wikipedia.org/wiki/Decision_tree_learning

2. Balance the data in "train.csv", apply the Decision Tree algorithm, and submit the result to Kaggle. Try to improve on the result obtained in class (position 3100 on the leaderboard).
Dataset: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

3. (Optional) Run a Random Forest on the Kaggle competition and check whether the accuracy improves. Use 10, 100, or 1000 trees (depending on how much your computer can handle =]): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
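
For reference, the two impurity measures named in activity 1 can be computed by hand for a single node. The sketch below is illustrative only: the function names `entropy` and `gini` and the class counts are made up for this example, and it is not meant to reproduce scikit-learn's internal implementation.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # empty classes contribute nothing
    return float((p * np.log2(1.0 / p)).sum())

def gini(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

# Hypothetical nodes: one evenly mixed, one containing a single class.
mixed_node = [50, 50]
pure_node = [100, 0]

print("mixed:", entropy(mixed_node), gini(mixed_node))
print("pure: ", entropy(pure_node), gini(pure_node))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two criteria often pick similar splits.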




## Loading the Data

In [1]:
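# Libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report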



In [2]:
# Loading the data
headers = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
data = pd.read_csv("car_data.csv", header=None, names=headers)

# Shuffling the data
data = data.sample(frac=1).reset_index(drop=True)

# Encode every attribute as integer category codes to make classification easier
for h in headers:
    data[h] = data[h].astype('category')
    data[h] = data[h].cat.codes

# Display the data
data.head()

   buying  maint  doors  persons  lug_boot  safety  class
0       1      1      1        2         2       0      1
1       0      1      0        2         2       2      2
2       2      2      0        2         2       2      2
3       2      1      3        0         2       2      2
4       0      0      3        1         0       1      2
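
To make the encoding step above concrete, here is a small sketch of what `astype('category')` followed by `.cat.codes` produces. The series `toy` and its values are invented for illustration and are not taken from `car_data.csv`.

```python
import pandas as pd

# Made-up labels, standing in for one column of the car dataset.
toy = pd.Series(["low", "high", "med", "low"])
as_category = toy.astype("category")

# Categories are stored in sorted (here: alphabetical) order...
print(list(as_category.cat.categories))
# ...and each value is replaced by the position of its category.
print(list(as_category.cat.codes))
```

The codes follow the alphabetical order of the labels, not their meaning, so "high" ends up below "low". A decision tree can still split on such codes, but the integers should not be read as an ordinal scale.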


In [3]:
# Splitting the data into training and test sets
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
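
Because neither the shuffle nor the split above is seeded, the scores in this notebook change from run to run. The following sketch shows one possible way to make a split reproducible and keep the class proportions equal in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 6 integer features, 20% of rows in class 1.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(200, 6))
y_demo = np.array([0] * 160 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("share of class 1 in train/test:", y_tr.mean(), y_te.mean())
```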

# Question 1

In [4]:
# Defining the Decision Tree classifiers
# 1 -> Entropy; 2 -> Gini
dTree1 = DecisionTreeClassifier(criterion="entropy")
dTree2 = DecisionTreeClassifier(criterion="gini")


In [5]:
# Fit the two models defined above on the training data
dTree1.fit(X_train, y_train)
dTree2.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [6]:
# Performance evaluation of Model 1
print("#### Entropy Model ####")
print("Decision Tree Score: {0:.3}".format(dTree1.score(X_test, y_test)))

print("\nClassification Report:")
yPreds = dTree1.predict(X_test)
print(classification_report(y_test, yPreds))

#### Entropy Model ####
Decision Tree Score: 0.971

Classification Report:
             precision    recall  f1-score   support

          0       0.95      0.93      0.94       116
          1       0.85      0.88      0.87        26
          2       0.99      1.00      0.99       353
          3       0.91      0.88      0.89        24

avg / total       0.97      0.97      0.97       519



In [7]:
# Performance evaluation of Model 2
print("#### Gini Model ####")
print("Decision Tree Score: {0:.3}".format(dTree2.score(X_test, y_test)))

print("\nClassification Report:")
yPreds = dTree2.predict(X_test)
print(classification_report(y_test, yPreds))

#### Gini Model ####
Decision Tree Score: 0.971

Classification Report:
             precision    recall  f1-score   support

          0       0.95      0.93      0.94       116
          1       0.92      0.88      0.90        26
          2       0.99      1.00      0.99       353
          3       0.91      0.88      0.89        24

avg / total       0.97      0.97      0.97       519
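
The two reports above come from a single random split, so a small difference between the criteria may just be noise. One way to compare them more robustly is cross-validation, sketched below. This is only an illustration: the data comes from `make_classification` rather than `car_data.csv`, and the names `X_demo`, `y_demo`, and `results` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

results = {}
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # Accuracy on 5 different train/validation partitions of the same data.
    scores = cross_val_score(tree, X_demo, y_demo, cv=5)
    results[criterion] = scores
    print("{0}: mean accuracy {1:.3f} (std {2:.3f})".format(
        criterion, scores.mean(), scores.std()))
```

If the gap between the mean scores is smaller than the standard deviation across folds, the data gives little reason to prefer one criterion over the other.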



# Question 2

In [8]:
# Loading the data
data = pd.read_csv("train.csv")

# Dropping the id column and shuffling the data
data = data.drop(["id"], axis=1)
data = data.sample(frac=1).reset_index(drop=True)

# Display the data
data.head()

   target  ps_ind_01  ps_ind_02_cat  ps_ind_03  ps_ind_04_cat  ps_ind_05_cat  ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin  ps_ind_09_bin  ...  ps_calc_11  ps_calc_12  ps_calc_13  ps_calc_14  ps_calc_15_bin  ps_calc_16_bin  ps_calc_17_bin  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
0       0          0              1          5              0              0              1              0              0              0  ...           5           4           2           5               0               0               0               0               1               0
1       0          0              2          2              0              0              1              0              0              0  ...           3           1           3          11               0               1               1               0               0               0
2       0          5              1          3              0              0              0              0              0              1  ...           6           3           3           5               0               0               1               0               0               0
3       0          0              1          2              0              0              0              1              0              0  ...           3           1           4           7               0               1               0               0               0               1
4       0          0              1          7              0              0              0              0              0              1  ...           9           4           4           8               0               1               0               0               1               0


In [9]:
# Splitting the data into training and test sets
X = data.iloc[:, 1:].values
y = data.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
# Defining and fitting the Decision Tree classifier
dTree = DecisionTreeClassifier(criterion="gini")
dTree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [11]:
# Performance evaluation on the test set
yPred = dTree.predict(X_test)
print("#### Gini Model ####")
print("Decision Tree Score: {0:.3}".format(dTree.score(X_test, y_test)))

print("\nClassification Report:")
print(classification_report(y_test, yPred))

#### Gini Model ####
Decision Tree Score: 0.918

Classification Report:
             precision    recall  f1-score   support

          0       0.96      0.95      0.96    114606
          1       0.05      0.06      0.05      4437

avg / total       0.93      0.92      0.92    119043
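
The report above shows why accuracy alone is misleading here: the overall score is 0.918, yet recall for class 1 is only 0.06, because fewer than 4% of the rows are positive. Activity 2 asks for the data to be balanced first. A minimal sketch of one option, random undersampling of the majority class, follows. The helper `undersample` and the frame `toy` are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def undersample(frame, label_col, random_state=0):
    """Randomly downsample every class to the size of the rarest class."""
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    shuffled = pd.concat(parts).sample(frac=1, random_state=random_state)
    return shuffled.reset_index(drop=True)

# Made-up frame with roughly the imbalance seen above (4% positives).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "target": np.array([1] * 40 + [0] * 960),
    "feature_a": rng.randint(0, 10, size=1000),
})

balanced = undersample(toy, "target")
print(balanced["target"].value_counts())
```

Balancing should be applied to the training rows only, so that the held-out set keeps the original class distribution. An alternative that discards no rows is passing `class_weight="balanced"` to `DecisionTreeClassifier`.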



In [35]:
# Preparing the Kaggle submission
# Loading the test data
data = pd.read_csv("test.csv")
# The first column is the id, which was not used as a feature during training
yPred = dTree.predict(data.values[:,1:])

In [36]:
# Saving the submission .csv
data['target'] = yPred
data[['id', 'target']].to_csv('submission.csv', index=False)
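
For activity 3, a Random Forest can be swapped in for the single tree. The sketch below runs on synthetic, imbalanced data from `make_classification` rather than on `train.csv`, and every name in it (`X_demo`, `forest`, `proba`) is an assumption made for illustration. It also scores predicted probabilities instead of hard labels. To my understanding the competition ranks submissions by the normalized Gini coefficient, which for a binary target equals 2 * AUC - 1, so writing `predict_proba(...)[:, 1]` to the `target` column is likely to score better than the 0/1 output of `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for train.csv.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# n_estimators is the number of trees: 10, 100, or 1000 as the activity suggests.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", n_jobs=-1, random_state=0
)
forest.fit(X_tr, y_tr)

# Probability of class 1 for each held-out row, instead of a hard 0/1 label.
proba = forest.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print("ROC AUC: {0:.3f}  normalized Gini: {1:.3f}".format(auc, 2 * auc - 1))
```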