# Supervised Learning

We will work with one more dataset and try different Machine Learning algorithms. Our focus, however, is on the evaluation metrics studied in class.

In [None]:
import pandas as pd
from tabulate import tabulate

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Dataset

We will use the dataset available at this link: https://archive.ics.uci.edu/ml/datasets/spambase. It is stored in the `datasets` folder as `SpamDataset.csv`. The version in this repository already includes the column names; in the original repository that information lives in separate files. The data itself was not modified: only the column names were added.

In [None]:
data_ = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Adolfo/datasets/SpamDataset.csv")
data_

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,classe
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


The dataset is used to classify e-mails as SPAM or NOT SPAM. Each instance is an e-mail labeled with one of these two classes, described by 57 attributes plus 1 class attribute. Each of the 57 attributes is derived from the content of the e-mail text. The class (column `classe`) takes two values: 1 (Spam) and 0 (Not Spam).

Below is a description of the attributes, taken from the dataset documentation:

```
The last column of 'spambase.data' denotes whether the e-mail was 
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.  
Most of the attributes indicate whether a particular word or
character was frequently occuring in the e-mail.  The run-length
attributes (55-57) measure the length of sequences of consecutive 
capital letters.  For the statistical measures of each attribute, 
see the end of this file.  Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD 
= percentage of words in the e-mail that match WORD,
i.e. 100 * (number of times the WORD appears in the e-mail) / 
total number of words in e-mail.  A "word" in this case is any 
string of alphanumeric characters bounded by non-alphanumeric 
characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR,
i.e. 100 * (number of CHAR occurences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), 
i.e. unsolicited commercial e-mail.  
```

For more information, see the original dataset link.
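To make the attribute definitions above concrete, here is a minimal sketch that computes them for a raw text. The function names (`word_freq`, `char_freq`, `capital_runs`) and the sample text are hypothetical. Case-insensitive word matching and returning zeros for a text without capital letters are assumptions; the dataset documentation does not specify either detail.

```python
import re

def word_freq(text, word):
    """Percentage of words in `text` equal to `word`.
    A word is a run of alphanumeric characters; matching is
    case-insensitive (an assumption, not stated in the documentation)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)

def char_freq(text, char):
    """Percentage of characters in `text` equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0  # assumption: zeros when there are no capital letters
    return sum(runs) / len(runs), max(runs), sum(runs)

# Made-up sample text, only for illustration
sample = "FREE money!!! Order now, get FREE mail."
freq_free = word_freq(sample, "free")
freq_bang = char_freq(sample, "!")
avg_run, longest_run, total_run = capital_runs(sample)
print(f"word_freq_free={freq_free:.2f} char_freq_!={freq_bang:.2f}")
print(f"capital runs: average={avg_run}, longest={longest_run}, total={total_run}")
```

In the sample, 2 of the 7 words are "free" and the capital runs are `FREE`, `O`, and `FREE`, which is how the three `capital_run_length_*` attributes relate to each other.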

Let's separate the attributes from the class into the variables `X` and `y`, respectively.

In [None]:
X = data_[data_.columns[:-1]]
y = data_[data_.columns[-1]]

## Training the models

We will train the models seen in class: KNN, Decision Tree, and SVM. For each model we will vary some hyperparameters in order to find the best fit.

### KNN

We will use [`sklearn.neighbors.KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), the Scikit-Learn implementation of KNN. Here we vary K (the number of neighbors), set through the `n_neighbors` parameter. The best value of K depends on the data being trained. In general, a larger K reduces the effect of noise, but it also blurs the boundaries between classes, which can lead to misclassifications.
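The effect of K on noise can be seen in a small pure-Python sketch of the majority-vote idea. This is only an illustration, not how Scikit-Learn implements KNN; the `knn_predict` function and the toy points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up): class 0 around the origin, class 1 far away,
# plus one noisy point labeled 1 in the middle of class 0
train_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
query = (0.6, 0.6)

predictions = {k: knn_predict(train_points, train_labels, query, k) for k in (1, 3, 5)}
print(predictions)
```

With `k=1` the prediction follows the single noisy neighbor; with `k=3` or `k=5` the surrounding class 0 points outvote it.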

### Decision Tree

We will use [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), the Scikit-Learn implementation of decision trees. In this example we control the size of the generated tree. Remember that in some cases the size of the tree can be a problem: limiting it avoids excessive memory use and, in some cases, prevents an overfitted tree, which causes what we call `overfitting`. The size is controlled by the `max_depth` parameter, the maximum depth of the tree. If `None` (the default), nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples. Related parameters such as `min_samples_split` and `min_samples_leaf` also limit tree growth.

[See this link](https://stackoverflow.com/questions/46480457/difference-between-min-samples-split-and-min-samples-leaf-in-sklearn-decisiontre) to better understand the difference between `min_samples_split` and `min_samples_leaf`.
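The stopping rules can be summarized in a short sketch. It only mirrors the documented meaning of `max_depth`, `min_samples_split`, and `min_samples_leaf` for binary labels; the helper names (`gini`, `may_split`, `split_allowed`) are hypothetical and this is not Scikit-Learn's actual tree-building code.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0 or 1); 0.0 means the node is pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def may_split(labels, depth, max_depth=None, min_samples_split=2):
    """A node is a split candidate only if it is impure, not too deep, and large enough."""
    if max_depth is not None and depth >= max_depth:
        return False
    if len(labels) < min_samples_split:
        return False
    return gini(labels) > 0.0

def split_allowed(left_labels, right_labels, min_samples_leaf=1):
    """A candidate split is valid only if both children keep at least `min_samples_leaf` samples."""
    return len(left_labels) >= min_samples_leaf and len(right_labels) >= min_samples_leaf

node = [0, 0, 1, 1]
print(gini(node))                            # impurity of a perfectly mixed node
print(may_split(node, depth=3, max_depth=3)) # depth limit reached
print(split_allowed([0], [0, 1, 1], min_samples_leaf=2))
```

In short, `min_samples_split` is checked on the node before splitting, while `min_samples_leaf` is checked on the children a split would create.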

### SVM 

We will use [`sklearn.svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), the Scikit-Learn implementation of SVM. Three parameters matter here:

* `kernel`: Specifies the type of kernel used by the algorithm. The default is `rbf`; the accepted values are `linear`, `poly`, `rbf`, `sigmoid`, and `precomputed`. The experiments below keep `rbf`.
* `C`: A regularization parameter that controls how heavily the model is penalized for misclassifying training examples. The strength of the regularization is inversely proportional to `C`: high values penalize errors heavily, so the model fits the training data more closely (and may overfit); low values tolerate more training errors in exchange for a wider margin. Fewer training errors do not necessarily mean a better model, so the parameter must be varied and tested for each dataset.
* `gamma`: The kernel coefficient used by the `rbf` kernel (and also by `poly` and `sigmoid`). It controls how much curvature the decision boundary can have: higher values give more curvature, lower values less. This helps control how complex the generated model is.

[See this link](https://medium.com/@myselfaman12345/c-and-gamma-in-svm-e6cee48626be) to learn more about these parameters.
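To see what `gamma` does, it helps to look at the RBF kernel itself, which measures the similarity of two points as `exp(-gamma * ||a - b||^2)`. The sketch below evaluates that formula in plain Python for a few `gamma` values; the function name `rbf_kernel` and the two points are chosen only for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two points."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Two made-up points at squared distance 2
point_a = (0.0, 0.0)
point_b = (1.0, 1.0)

similarities = {gamma: rbf_kernel(point_a, point_b, gamma) for gamma in (0.001, 0.1, 10)}
for gamma, value in similarities.items():
    print(f"gamma={gamma}: similarity={value:.6f}")
```

With a small `gamma` the two points still look similar, so each training example influences a wide region and the boundary is smooth. With a large `gamma` the similarity drops to almost zero, so each example only influences its immediate neighborhood and the boundary can bend around individual points.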


Let's train our models :)





## Using Cross-Validation

We will use cross-validation to evaluate the chosen models. For each model we vary some parameters and use the metrics to assess the quality of each configuration.
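As a reminder of what cross-validation does, the sketch below splits sample indices into consecutive folds, each used once as the test set. It is a simplified, non-stratified illustration with a hypothetical helper name (`k_fold_indices`); when `cv` is left at its default, Scikit-Learn's `cross_validate` uses 5 folds and, for classifiers, stratifies them so that each fold keeps the class proportions.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for consecutive, non-shuffled folds."""
    base, remainder = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `remainder` folds get one extra sample
        size = base + (1 if fold < remainder else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 23 samples is an arbitrary number chosen for the example
folds = list(k_fold_indices(n_samples=23, n_splits=5))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Every sample ends up in exactly one test fold, and the reported metric is the mean over the folds.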



In [None]:
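from sklearn.model_selection import cross_validate

def cross_validate_model(model, value_X, value_y):
    """Cross-validate `model` and return the mean fit time, score time, accuracy, precision, recall, and F1."""

    # cv is left at its default (5 folds, stratified for classifiers)
    cross_result = cross_validate(model, value_X, value_y, scoring=('accuracy','precision','recall','f1'))

    result_values = [
        cross_result['fit_time'].mean(), 
        cross_result['score_time'].mean(), 
        cross_result['test_accuracy'].mean(), 
        cross_result['test_precision'].mean(), 
        cross_result['test_recall'].mean(), 
        cross_result['test_f1'].mean()
    ]

    return result_values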
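The four metrics requested in `scoring` can be computed by hand from the confusion counts. The sketch below does so in plain Python for binary labels, treating 1 (spam) as the positive class; the function name `binary_metrics` and the toy labels are made up, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate mail flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate mail kept

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example (made up): a cautious classifier that flags only one e-mail as spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The toy classifier never flags a legitimate e-mail, so its precision is perfect, but it misses most of the spam, so recall and F1 are low. This is why accuracy alone is not enough to judge a model.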

    

Let's start with KNN, with K ranging from 1 to 14 (`range(1, 15)` excludes the upper bound).

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

In [None]:
all_knn_results = []
for k in range(1, 15):

    # Cross-validation
    knn_ = KNeighborsClassifier(n_neighbors=k)
    results_ = cross_validate_model(knn_, X, y)

    # Build the result row: [k, fit time, score time, accuracy, precision, recall, F1]
    temp_list = [k]
    temp_list.extend(results_)
    all_knn_results.append(temp_list)


In [None]:
print(tabulate(all_knn_results, headers=['k','Fit time','Score time','Accuracy','Precision','Recall','F1']))

  k    Fit time    Score time    Accuracy    Precision    Recall        F1
---  ----------  ------------  ----------  -----------  --------  --------
  1  0.00895534      0.113317    0.777007     0.717829  0.724201  0.720072
  2  0.0064652       0.110525    0.764835     0.781242  0.563131  0.653822
  3  0.00668736      0.118536    0.776354     0.720975  0.719234  0.718433
  4  0.00643535      0.141741    0.767444     0.754983  0.613873  0.675613
  5  0.00577607      0.136667    0.772446     0.720053  0.699938  0.708679
  6  0.00641098      0.132635    0.760059     0.732517  0.62272   0.672297
  7  0.00613003      0.139926    0.768752     0.713367  0.70105   0.70594
  8  0.00647264      0.136924    0.759624     0.728924  0.628793  0.673932
  9  0.0061316       0.139597    0.759842     0.702385  0.687268  0.693561
 10  0.00611396      0.143459    0.759842     0.724332  0.639823  0.677938
 11  0.00560989      0.140635    0.760278     0.7

For the decision tree, we vary the parameter that controls the size of the tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
all_tree_results = []
for depth_ in range(1, 15): 
    tree_ = DecisionTreeClassifier(max_depth=depth_, random_state=42)
    temp_list = [depth_]
    results_ = cross_validate_model(tree_, X, y)
    temp_list.extend(results_)
    all_tree_results.append(temp_list)

In [None]:
print(tabulate(all_tree_results, headers=['Depth','Fit time','Score time','Accuracy','Precision','Recall','F1']))

  Depth    Fit time    Score time    Accuracy    Precision    Recall        F1
-------  ----------  ------------  ----------  -----------  --------  --------
      1   0.0111142    0.00628057    0.77505      0.825201  0.591349  0.677385
      2   0.0148803    0.00591307    0.840902     0.873568  0.73528   0.790326
      3   0.0219108    0.006213      0.869814     0.873883  0.791515  0.827812
      4   0.0242724    0.00615134    0.881762     0.900529  0.796452  0.843372
      5   0.0293884    0.00618162    0.889586     0.886307  0.836175  0.858561
      6   0.0351223    0.00631537    0.899148     0.893452  0.853804  0.871505
      7   0.0381377    0.00604525    0.899799     0.886041  0.867606  0.874813
      8   0.0481462    0.00700002    0.896538     0.882163  0.864849  0.871281
      9   0.047369     0.00607805    0.895018     0.880424  0.863197  0.869354
          

For the SVM, we vary the `C` and `gamma` parameters, which control the separating functions that are generated.

In [None]:
from sklearn.svm import SVC 

`C` takes the values `[0.01, 0.1, 1, 10, 100, 1000]`.

In [None]:
all_svm_results = []
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    print("C = %f" % C, end=" ")
    svm = SVC(kernel='rbf', C=C, random_state=42)
    temp_list = [C]
    result_ = cross_validate_model(svm, X, y)
    temp_list.extend(result_)
    all_svm_results.append(temp_list)



C = 0.010000 C = 0.100000 C = 1.000000 C = 10.000000 C = 100.000000 C = 1000.000000 

In [None]:
print(tabulate(all_svm_results, headers=['C','Fit time','Score time','Accuracy','Precision','Recall','F1']))

      C    Fit time    Score time    Accuracy    Precision    Recall        F1
-------  ----------  ------------  ----------  -----------  --------  --------
   0.01    0.993678      0.305682    0.668129     0.686506  0.301167  0.417589
   0.1     0.796206      0.243973    0.69008      0.670696  0.424713  0.51959
   1       0.762783      0.231045    0.705075     0.706021  0.43849   0.539785
  10       0.748279      0.224609    0.732235     0.745362  0.493119  0.592809
 100       0.724858      0.198525    0.821997     0.81729   0.724242  0.764404
1000       0.86045       0.140765    0.885458     0.871563  0.847211  0.856721


`gamma` takes the values `[0.001, 0.01, 0.1, 1, 10, 100, 1000]`, with `C` fixed at 1000.

In [None]:
all_svm_results = []
for gamma in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    print("Gamma = %.3f" % gamma, end=" ")
    svm = SVC(kernel='rbf', C=1000, gamma=gamma, random_state=42)
    temp_list = [gamma]
    result_ = cross_validate_model(svm, X, y)
    temp_list.extend(result_)
    all_svm_results.append(temp_list)



Gamma = 0.001 Gamma = 0.010 Gamma = 0.100 Gamma = 1.000 Gamma = 10.000 Gamma = 100.000 Gamma = 1000.000 

In [None]:
print(tabulate(all_svm_results, headers=['gamma','Fit time','Score time','Accuracy','Precision','Recall','F1']))

   gamma    Fit time    Score time    Accuracy    Precision    Recall        F1
--------  ----------  ------------  ----------  -----------  --------  --------
   0.001    1.75683       0.117346    0.859811     0.822488  0.838927  0.828333
   0.01     0.968663      0.179592    0.824821     0.761331  0.820176  0.788416
   0.1      1.18119       0.290426    0.726141     0.779357  0.424081  0.548392
   1        1.19236       0.36619     0.677682     0.989387  0.184196  0.307747
  10        1.5563        0.466601    0.664858     0.994286  0.150564  0.259727
 100        1.02461       0.294643    0.661164     0.994118  0.141187  0.24548
1000        1.0273        0.295675    0.655079     0.99375   0.125743  0.220919


## Now it's your turn

Analyze the results generated so far together with two more methods: Logistic Regression and RandomForest. Research these methods and see which parameters would be interesting to vary. Run the tests and choose, among all the models analyzed, the best one for the SPAM vs. NOT SPAM detection task.

* Logistic Regression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* Random Forest documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

## My answer

### Logistic Regression

https://www.datacamp.com/tutorial/understanding-logistic-regression-python  
https://realpython.com/logistic-regression-python/

X and y 

y_pred = clf.predict(X)



Parameters:

* `penalty` is a string (`'l2'` by default) that decides whether there is regularization and which approach to use. Other options are `'l1'`, `'elasticnet'`, and no penalty (`None` in recent versions; older versions used the string `'none'`).
* `dual` is a Boolean (`False` by default) that decides whether to use primal (when `False`) or dual formulation (when `True`).
* `tol` is a floating-point number (`0.0001` by default) that defines the tolerance for stopping the procedure.
* `C` is a positive floating-point number (`1.0` by default) that defines the relative strength of regularization. Smaller values indicate stronger regularization.
* `fit_intercept` is a Boolean (`True` by default) that decides whether to calculate the intercept 𝑏₀ (when `True`) or consider it equal to zero (when `False`).
* `intercept_scaling` is a floating-point number (`1.0` by default) that defines the scaling of the intercept 𝑏₀.
* `class_weight` is a dictionary, `'balanced'`, or `None` (default) that defines the weights related to each class. When `None`, all classes have the weight one.
* `random_state` is an integer, an instance of `numpy.random.RandomState`, or `None` (default) that defines what pseudo-random number generator to use.
* `solver` is a string that decides what solver to use for fitting the model. The default is `'lbfgs'` since scikit-learn 0.22 (it was `'liblinear'` before). Other options include `'liblinear'`, `'newton-cg'`, `'sag'`, and `'saga'`.
* `max_iter` is an integer (`100` by default) that defines the maximum number of iterations by the solver during model fitting.
* `multi_class` is a string that decides the approach to use for handling multiple classes. The default is `'auto'` since scikit-learn 0.22 (it was `'ovr'` before); the other options are `'ovr'` and `'multinomial'`. It is not relevant for a binary problem like this one.
* `verbose` is a non-negative integer (`0` by default) that defines the verbosity for the `'liblinear'` and `'lbfgs'` solvers.
* `warm_start` is a Boolean (`False` by default) that decides whether to reuse the previously obtained solution.
* `n_jobs` is an integer or `None` (default) that defines the number of parallel processes to use. `None` usually means to use one core, while `-1` means to use all available cores.
* `l1_ratio` is either a floating-point number between zero and one or `None` (default). It defines the relative importance of the L1 part in the elastic-net regularization.
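The role of `C` becomes clearer when the quantity being minimized is written out. The sketch below uses one common way of writing the L2-penalized logistic regression objective, `0.5 * ||w||^2 + C * (sum of log-losses)`, in plain Python. The function names, weights, and data points are made up for illustration, and this is not the solver code used by Scikit-Learn.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def regularized_log_loss(weights, bias, X, y, C):
    """L2-penalized objective: 0.5 * ||w||^2 + C * sum of per-sample log-losses."""
    penalty = 0.5 * sum(w * w for w in weights)
    data_loss = 0.0
    for features, label in zip(X, y):
        p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        data_loss += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return penalty + C * data_loss

# Toy setup (made up): two one-dimensional samples and a fixed weight vector
X_toy = [(0.0,), (1.0,)]
y_toy = [1, 0]
weights, bias = (2.0,), 0.0

penalty_only = 0.5 * sum(w * w for w in weights)
for C in (0.01, 1.0, 100.0):
    total = regularized_log_loss(weights, bias, X_toy, y_toy, C)
    print(f"C={C}: objective={total:.4f}, penalty share={penalty_only / total:.2%}")
```

With a small `C` almost the whole objective is the penalty on the weights, so the optimizer is pushed toward small weights (strong regularization). With a large `C` the data loss dominates and the weights are free to fit the training data.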


In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
all_logistic_results = []

for solver in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']:
  # cross_validate clones and fits the estimator itself, so no prior .fit() call is needed
  model = LogisticRegression(solver=solver, random_state=16, max_iter=100)
  temp_list = [solver]
  result_ = cross_validate_model(model, X, y)
  temp_list.extend(result_)
  all_logistic_results.append(temp_list)
  print(f'\nSolver {solver} done!\n')


Solver liblinear done!


Solver newton-cg done!



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Solver lbfgs done!


Solver sag done!


Solver saga done!

In [None]:
print(tabulate(all_logistic_results, headers=['Solver','Fit time','Score time','Accuracy','Precision','Recall','F1']))

Solver       Fit time    Score time    Accuracy    Precision    Recall        F1
---------  ----------  ------------  ----------  -----------  --------  --------
liblinear    0.133621    0.0144161     0.912626     0.894385  0.886912  0.889818
newton-cg    0.324195    0.00996494    0.912409     0.894715  0.885807  0.889466
lbfgs        0.135847    0.00955877    0.891323     0.855949  0.887469  0.86873
sag          0.328369    0.010554      0.440778     0.413074  0.991719  0.583129
saga         0.399501    0.00958128    0.426215     0.406993  0.994477  0.577519
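The convergence warning shown earlier suggests scaling the data, and the weak `sag` and `saga` rows are consistent with that advice: the Scikit-Learn documentation notes that these two solvers converge quickly only when the features have roughly the same scale. In this dataset, word frequencies are fractions of a percent while `capital_run_length_total` reaches the thousands. The sketch below shows z-score standardization in plain Python on two columns taken from the first three rows displayed above. The helper name `standardize_columns` is hypothetical; in practice `sklearn.preprocessing.StandardScaler` inside a pipeline (so that it is fitted on the training folds only) would be the usual choice, and rerunning the experiment would be needed to confirm the effect.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# word_freq_make and capital_run_length_total of the first three e-mails shown above
rows = [
    [0.00, 278],
    [0.21, 1028],
    [0.06, 2259],
]
scaled = standardize_columns(rows)
for original, rescaled in zip(rows, scaled):
    print(original, "->", [round(value, 3) for value in rescaled])
```

After rescaling, both columns vary over a comparable range, which is the condition gradient-based solvers such as `sag` and `saga` rely on.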


### Random Forest

https://youtu.be/jBGxiu8K11o  
https://data36.com/random-forest-in-python/


`n_estimators` determines the number of decision trees that make up the random forest. More trees generally give more stable predictions, but with diminishing returns and a longer training time.

`max_features` defines the number of features that each decision tree takes into consideration at each split. In recent scikit-learn versions the default for `RandomForestClassifier` is `"sqrt"` (the square root of the number of features); older versions called this same setting `"auto"`, a value that has since been removed. Using `sqrt` is the recommended setting.
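The two sources of randomness in a random forest can be sketched in a few lines of plain Python: each tree is trained on a bootstrap sample of the rows, and each split only looks at a random subset of the features. The helper names below are hypothetical and the code only illustrates the idea, assuming the `sqrt` rule is rounded down to an integer; it is not Scikit-Learn's implementation.

```python
import math
import random

N_FEATURES = 57  # number of attributes in the spam dataset described earlier

def features_per_split(n_features):
    """Number of candidate features per split under the square-root rule (rounded down)."""
    return max(1, math.isqrt(n_features))

def sample_split_candidates(n_features, rng):
    """Pick the feature indices a single split is allowed to consider."""
    return sorted(rng.sample(range(n_features), features_per_split(n_features)))

def bootstrap_indices(n_samples, rng):
    """Draw row indices with replacement to build the training set of one tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(16)  # fixed seed only to make the example repeatable
candidates = sample_split_candidates(N_FEATURES, rng)
rows_for_tree = bootstrap_indices(10, rng)
print(f"{features_per_split(N_FEATURES)} of {N_FEATURES} features per split: {candidates}")
print(f"bootstrap rows for one tree: {rows_for_tree}")
```

With 57 attributes, each split considers only 7 of them, which makes the trees differ from one another and is what lets averaging many trees reduce variance.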

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
all_rforests_results = []

for n_trees in [10, 20, 50, 80, 100]:
  # "sqrt" is the current name of the former "auto" setting for classifiers;
  # cross_validate fits the estimator itself, so no prior .fit() call is needed
  model = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt", random_state=16)
  temp_list = [n_trees]
  result_ = cross_validate_model(model, X, y)
  temp_list.extend(result_)
  all_rforests_results.append(temp_list)
  print(f'Done with {n_trees} trees!')

Done with 10 trees!
Done with 20 trees!
Done with 50 trees!
Done with 80 trees!
Done with 100 trees!


In [None]:
print(tabulate(all_rforests_results, headers=['N trees','Fit time','Score time','Accuracy','Precision','Recall','F1']))

  N trees    Fit time    Score time    Accuracy    Precision    Recall        F1
---------  ----------  ------------  ----------  -----------  --------  --------
       10   0.0506142    0.00713086    0.920667     0.919337  0.882492  0.899412
       20   0.0988116    0.00766759    0.923927     0.917876  0.89574   0.905158
       50   0.236628     0.0128105     0.928056     0.918298  0.906781  0.910926
       80   0.380094     0.0183857     0.929143     0.919687  0.906784  0.911779
      100   0.474663     0.0214505     0.928926     0.918455  0.907887  0.911677
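To choose among the models, one option is to take the best row of each result table and rank them. The sketch below does that with values copied from the tables above, using F1 as the ranking criterion, which is an assumption: for spam filtering one could reasonably prefer precision, since a false positive sends a legitimate e-mail to the spam folder. The KNN and decision tree tables are truncated in this notebook, so their entries are the best among the rows shown.

```python
# (model, setting, accuracy, precision, recall, F1) copied from the result tables above
best_rows = [
    ("KNN", "k=1", 0.777007, 0.717829, 0.724201, 0.720072),
    ("Decision Tree", "max_depth=7", 0.899799, 0.886041, 0.867606, 0.874813),
    ("SVM (rbf)", "C=1000", 0.885458, 0.871563, 0.847211, 0.856721),
    ("Logistic Regression", "solver=liblinear", 0.912626, 0.894385, 0.886912, 0.889818),
    ("Random Forest", "n_estimators=80", 0.929143, 0.919687, 0.906784, 0.911779),
]

# Rank by F1 (last column), best first
ranked = sorted(best_rows, key=lambda row: row[-1], reverse=True)
for model, setting, accuracy, precision, recall, f1 in ranked:
    print(f"{model:<20} {setting:<18} accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} F1={f1:.3f}")
```

Under this criterion Random Forest comes first, followed by Logistic Regression, and it also has the highest precision among the five rows, so the choice does not depend on which of the two metrics is preferred.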
