# Decision tree


Illustrates how the decision tree algorithm works with continuous data.

-------------------------------------------------------------------------------

### Dataset: Sonar, Mines vs. Rocks

https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29

208 instances

60 attributes

2 classes (rock, mine)

### Uploading a local file

To upload a dataset, use the ```files``` object from the ```google.colab``` package.

Upload the file that contains the Sonar dataset described above.

```
# If you want a button to upload the file directly in the notebook
#  from google.colab import files
#  uploaded = files.upload()
#  df = pd.read_csv(next(iter(uploaded.keys())))
```

```
# If you want to read from a folder already saved in your Google Drive
#  from google.colab import drive
#  drive.mount('/content/drive')
#  df = pd.read_excel('/content/drive/My Drive/arquivo.xlsx', sheet_name=0)
```



In [None]:
%%capture
!pip install pydotplus
!pip install dtreeviz

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

In [None]:
# Trick to download the file directly from its share link
download_url = 'https://drive.google.com/uc?export=download&id='
url_drive_file = 'https://docs.google.com/spreadsheets/d/1cGZN3X8ydgwbbsaiQK5_\
yUwf3VTLH19W/edit?usp=sharing&ouid=114919786921075985733&rtpof=true&sd=true'

download_path = download_url + url_drive_file.split('/')[-2]

sonar = pd.read_excel(download_path, sheet_name=0)

sonar.head()

### Loading the data and splitting it into training and test sets

### Data transformation

The class column is converted to unique, sequential integer labels (in the snippet, <code>dados</code> stands for the column of class values).

<code>
 le = LabelEncoder()
  
 le.fit(dados)
</code>
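
As a minimal sketch of this transformation, the example below encodes a small hypothetical class column. The labels `"M"` and `"R"` are an assumption made for illustration; the values stored in the spreadsheet may differ. `LabelEncoder` assigns the integer codes in sorted order of the class values, which is what later determines the order of the class names passed to the reports.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class column (assumed labels, not read from the real file)
labels = ["R", "M", "M", "R", "M"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # classes in sorted order
print(encoded)                        # integer code of each label
print(le.inverse_transform(encoded))  # codes mapped back to the labels
```

With these assumed labels, `"M"` receives code 0 and `"R"` receives code 1.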


### Splitting the dataset

<code>train_test_split(X, y)</code> splits the original dataset into training and test sets.

In the code below, 10% of the data is used for testing and 90% for training.
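
To make the proportions concrete, here is a small sketch on synthetic data with the same shape as the Sonar dataset (208 rows, 60 attributes). The arrays `X_demo` and `y_demo` are random placeholders, not the real data; only the resulting sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data with the same shape as the Sonar dataset
rng = np.random.default_rng(0)
X_demo = rng.random((208, 60))
y_demo = rng.integers(0, 2, size=208)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# The test size is rounded up: 10% of 208 rows gives 21 test rows
print(X_tr.shape, X_te.shape)
```

The split leaves 187 rows for training and 21 for testing.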







In [None]:
print("\nDimensions: {0}".format(sonar.shape))
print("\nColumns: {0}".format(sonar.keys()))
print(sonar.describe(), sep='\n')

# Features: every column except the last one
X = sonar.iloc[:, 0:(sonar.shape[1] - 1)]

# Target: the last column, encoded as sequential integer labels
le = LabelEncoder()
y = le.fit_transform(sonar.iloc[:, (sonar.shape[1] - 1)])

# Split the dataset: 10% for testing, 90% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.1)

### Model induction


The three steps for inducing a model are:

1.   Instantiate the model: ``` DecisionTreeClassifier()```
2.   Train the model: ```fit()```
3.   Test the model: ```predict()```
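
The three steps above can be sketched end to end on synthetic data. This is only an illustration: `make_classification` generates an artificial dataset, and the names `X_demo`, `y_demo` and `clf` are placeholders rather than part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Artificial two-class dataset with continuous attributes
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)  # 1. instantiate the model
clf.fit(X_tr, y_tr)                           # 2. train the model
y_hat = clf.predict(X_te)                     # 3. test the model

print("Training accuracy:", clf.score(X_tr, y_tr))
print("Test accuracy:", accuracy_score(y_te, y_hat))
```

A tree grown without depth limits usually fits the training set perfectly, so the training accuracy alone says little about how the model generalizes; the test accuracy is the number to watch.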



In [None]:
sonar_tree = DecisionTreeClassifier(random_state=0, criterion='gini')#, max_depth=2)
sonar_tree = sonar_tree.fit(X_train, y_train)

In [None]:
print("Accuracy (training set):", sonar_tree.score(X_train, y_train))

In [None]:
y_pred = sonar_tree.predict(X_test)
y_pred

In [None]:
print("Test accuracy:", accuracy_score(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred, target_names=["Mine", "Rock"]))

# Confusion matrix: rows are the true classes, columns are the predicted classes
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_table = pd.DataFrame(data=cnf_matrix, index=["Mine", "Rock"], columns=["Mine (pred)", "Rock (pred)"])
print(cnf_table)

### Displaying the decision tree

In [None]:
import pydotplus
from IPython.display import Image

# Create DOT data
dot_data = tree.export_graphviz(sonar_tree, out_file=None,
                                proportion=False,
                                rounded=True,
                                filled=True,
                                feature_names=np.arange(0, 60),
                                class_names=["mine", "rock"])

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())
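
When Graphviz is not available, a text rendering of the tree can be a useful alternative. The sketch below uses `sklearn.tree.export_text` on a small synthetic tree; the data, the `attr_` feature names and the depth limit are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small artificial dataset and a shallow tree, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Hypothetical attribute names: attr_0, attr_1, ...
feature_names = [f"attr_{i}" for i in range(X_demo.shape[1])]

# Each line of the result is a threshold test or a leaf with its class
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Each internal node appears as a threshold test on one continuous attribute, which is how a decision tree handles continuous data.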


# Hyperparameter tuning using grid search and random search

In [None]:
sonar_tree

DecisionTreeClassifier(random_state=0)

In [None]:
# Current parameters
sonar_tree.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 0,
 'splitter': 'best'}

In [None]:
# See the other available hyperparameters:
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [None]:
from sklearn.model_selection import RandomizedSearchCV

from time import time

In [None]:
np.arange(3, 15)

array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [None]:
# specify parameters and distributions to sample from
tree_params = {"max_depth": np.arange(3, 15),
               "criterion": ["gini", "entropy"],
               "min_samples_split": np.arange(2, 5),
               "min_samples_leaf": np.arange(2, 5),
               }

tree_params

{'criterion': ['gini', 'entropy'],
 'max_depth': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 'min_samples_leaf': array([2, 3, 4]),
 'min_samples_split': array([2, 3, 4])}
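
To get a sense of the size of this search space, the sketch below counts every combination in the dictionary with `ParameterGrid`. The dictionary is repeated here only so that the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the dictionary defined above
tree_params = {"max_depth": np.arange(3, 15),
               "criterion": ["gini", "entropy"],
               "min_samples_split": np.arange(2, 5),
               "min_samples_leaf": np.arange(2, 5),
               }

# 12 depths * 2 criteria * 3 split sizes * 3 leaf sizes
n_combinations = len(ParameterGrid(tree_params))
print(n_combinations)
```

There are 216 combinations, so a randomized search with 20 iterations evaluates fewer than 10% of them. Note also that a split must leave at least `min_samples_leaf` samples on each side, so with `min_samples_leaf` of 2 or more the values 2, 3 and 4 for `min_samples_split` should all behave the same way.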

In [None]:
# Run the randomized search
n_iter_search = 20

random_search = RandomizedSearchCV(sonar_tree,  # model
                                   param_distributions=tree_params,  # parameter values defined above
                                   n_iter=n_iter_search,  # increase this value to test more parameter combinations
                                   cv=5,  # 5-fold cross-validation
                                   random_state=0)



In [None]:
start = time()

random_search.fit(X_train, y_train)

print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings \n\n" % ((time() - start), n_iter_search))



RandomizedSearchCV took 1.17 seconds for 20 candidate parameter settings 




In [None]:
random_search.cv_results_

{'mean_fit_time': array([0.01078868, 0.00784626, 0.0064712 , 0.00902362, 0.00914016,
        0.00720119, 0.00710907, 0.00944781, 0.00620041, 0.00729995,
        0.00930452, 0.00938554, 0.00971446, 0.00885253, 0.00737467,
        0.00872865, 0.00911932, 0.0060451 , 0.00903873, 0.00682626]),
 'mean_score_time': array([0.00292239, 0.0025569 , 0.00220351, 0.00256982, 0.00265713,
        0.00282454, 0.00299029, 0.00263057, 0.00273585, 0.00245748,
        0.00322366, 0.00248661, 0.00319562, 0.00258241, 0.00276031,
        0.00227933, 0.00231314, 0.00232663, 0.00221176, 0.00223994]),
 'mean_test_score': array([0.7170697 , 0.7170697 , 0.72745377, 0.70583215, 0.71123755,
        0.71692745, 0.72745377, 0.69530583, 0.70640114, 0.7170697 ,
        0.7170697 , 0.7116643 , 0.71123755, 0.70071124, 0.7170697 ,
        0.70583215, 0.70611664, 0.73271693, 0.7116643 , 0.7170697 ]),
 'param_criterion': masked_array(data=['entropy', 'gini', 'gini', 'entropy', 'entropy',
                    'gini', 'gini',
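
The raw `cv_results_` dictionary is hard to read. One option is to load it into a pandas DataFrame and sort by rank. The sketch below does this for a small search on synthetic data; the dataset, the parameter values and the name `search` are assumptions for illustration, not the notebook's own objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Artificial data and a deliberately small search space
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
params = {"max_depth": [3, 4, 5, 6], "criterion": ["gini", "entropy"]}

search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions=params,
                            n_iter=5, cv=5, random_state=0)
search.fit(X_demo, y_demo)

# One row per candidate, best candidates first
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

The same two lines that build and sort `results` should work on `random_search.cv_results_` from the cells above.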

In [None]:
random_search.best_params_

{'criterion': 'gini',
 'max_depth': 4,
 'min_samples_leaf': 3,
 'min_samples_split': 2}

In [None]:
random_search.best_estimator_

DecisionTreeClassifier(max_depth=4, min_samples_leaf=3, random_state=0)

In [None]:
sonar_tree

DecisionTreeClassifier(random_state=0)

In [None]:
print("Training accuracy:", random_search.best_estimator_.score(X_train, y_train))

y_pred = random_search.best_estimator_.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

Training accuracy: 0.9090909090909091
Test accuracy: 0.7142857142857143
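
The section title also mentions grid search, which evaluates every combination instead of a random sample. A minimal sketch with `GridSearchCV` follows; it uses synthetic data and a reduced grid (both assumptions made to keep it fast), not the Sonar data or the full dictionary above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Artificial data and a reduced grid, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
grid = {"max_depth": [3, 5, 7], "criterion": ["gini", "entropy"]}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid=grid, cv=5)
grid_search.fit(X_demo, y_demo)

# All 3 * 2 = 6 combinations are evaluated
print(len(grid_search.cv_results_["params"]))
print(grid_search.best_params_)
```

Grid search is exhaustive, so its cost grows with the product of the list lengths; randomized search trades that guarantee for a fixed budget of `n_iter` candidates.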


In [None]:
print(classification_report(y_test, y_pred, target_names=["Mine", "Rock"]))

# Confusion matrix: rows are the true classes, columns are the predicted classes
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_table = pd.DataFrame(data=cnf_matrix, index=["Mine", "Rock"], columns=["Mine (pred)", "Rock (pred)"])
print(cnf_table)

              precision    recall  f1-score   support

        Mine       0.60      0.75      0.67         8
        Rock       0.82      0.69      0.75        13

    accuracy                           0.71        21
   macro avg       0.71      0.72      0.71        21
weighted avg       0.74      0.71      0.72        21

      Mine (pred)  Rock (pred)
Mine            6            2
Rock            4            9
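
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below hard-codes the four counts shown above (rows are true classes, columns are predicted classes) and derives precision, recall and accuracy.

```python
import numpy as np

# Counts copied from the confusion matrix above
cm = np.array([[6, 2],
               [4, 9]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = np.trace(cm) / cm.sum()        # all correct / all test instances

print(np.round(precision, 2))  # 0.60 and 0.82
print(np.round(recall, 2))     # 0.75 and 0.69
print(round(accuracy, 2))      # 0.71
```

For example, 10 instances were predicted as the first class and 6 of them were correct, giving a precision of 0.60, while 6 of the 8 true members of that class were found, giving a recall of 0.75.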
