# Leaf Recognition with Machine Learning

An introduction to machine learning with scikit-learn
https://scikit-learn.org/stable/
https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Supervised learning
https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Unsupervised learning
https://scikit-learn.org/stable/unsupervised_learning.html

In this notebook you will complete your first Machine Learning project using Python.

A Machine Learning project follows a series of well-known steps:

- Define the problem.
- Prepare the data.
- Evaluate algorithms.
- Improve the results.
- Present the final results.
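
The steps listed here can be sketched end to end with scikit-learn. This is only a minimal illustration, not the workflow used later in this notebook: it assumes the built-in iris dataset as a stand-in, and the step names `scaler` and `clf` and the `C` values are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Define the problem: classify iris species from four measurements
X, y = load_iris(return_X_y=True)

# 2. Prepare the data: hold out a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# 3. Evaluate an algorithm: scaling + SVM baseline
baseline = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# 4. Improve the results: tune C with 5-fold cross-validation
search = GridSearchCV(baseline, {'clf__C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# 5. Final results: score the tuned model on the held-out data
final_score = search.score(X_test, y_test)
print(f'baseline={baseline_score:.3f} tuned={final_score:.3f}')
```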

![image.png](attachment:14b85f78-7ab2-4425-9a9e-d3bb08480ce5.png)

# Supervised Training for Modeling

![image.png](attachment:53109a22-6bcc-4eaf-bc75-3e4dcef808ad.png)

* Importing the libraries

In [78]:
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling
import pickle as pkl
import seaborn as sns 
import sys
import wandb

from sklearn import set_config
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, StandardScaler
from sklearn.svm import SVC

Installing pandas-profiling
```python
!{sys.executable} -m pip install pandas-profiling

!jupyter nbextension enable --py widgetsnbextension
```

Logging in to W&B
```python
!wandb login
```

Setting up the logger

In [2]:
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
# basicConfig is a no-op when the root logger already has handlers, so set the level explicitly
logging.getLogger().setLevel(logging.INFO)

In [3]:
RED = "\033[1;31m"
BLUE = "\033[1;34m"
GREEN = "\033[1;32m"
PINK = "\033[1;45m"
HEADER = '\033[95m'
OKBLUE = '\033[94m'
OKCYAN = '\033[96m'
OKGREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
MAG = "\033[1;45m"

Initializing wandb

In [4]:
run = wandb.init(project="eldorado_leaf", group="eda", save_code=True)

wandb: Currently logged in as: saraselis (use `wandb login --relogin` to force relogin)


# Load the dataset
We can load it directly from the .csv file.

Everything should load without errors. If you get an error, stop and install the missing libraries with `pip install <lib_name>`.

## Leaf Dataset

The provided data comprises the following shape (attributes 3 to 9) and texture (attributes 10
to 16) features:

1. Class (Species)
2. Specimen Number
3. Eccentricity
4. Aspect Ratio
5. Elongation
6. Solidity
7. Stochastic Convexity
8. Isoperimetric Factor
9. Maximal Indentation Depth
10. Lobedness
11. Average Intensity
12. Average Contrast
13. Smoothness
14. Third moment
15. Uniformity
16. Entropy

Since the dataset comes without column names, we set them here

In [5]:
names = ['Class', 'Specimen Number', 'Eccentricity', 'Aspect Ratio', 'Elongation', 'Solidity', 'Stochastic Convexity', 'Isoperimetric Factor',
         'Maximal Indentation Depth', 'Lobedness', 'Average Intensity', 'Average Contrast', 'Smoothness', 'Third moment',
         'Uniformity', 'Entropy']

Reading the dataset

In [6]:
data_leaf = pd.read_csv('leaf.csv', names=names)
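
To see what `names=` does for a headerless file, here is a small self-contained sketch. It assumes an in-memory buffer in place of `leaf.csv` and uses only the first four values of two rows from the preview in this notebook; the `raw` and `cols` names are made up for the example.

```python
import io

import pandas as pd

# In-memory stand-in for a headerless CSV file (two rows, four columns)
raw = io.StringIO("1,1,0.72694,1.4742\n1,2,0.74173,1.5257\n")
cols = ['Class', 'Specimen Number', 'Eccentricity', 'Aspect Ratio']

# names= supplies the header; without it the first data row becomes the column labels
df = pd.read_csv(raw, names=cols)
print(df.shape)
print(list(df.columns))
```

Passing `names=` is what stops pandas from consuming the first specimen as a header row.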

Dataset

In [7]:
data_leaf.sample()

Unnamed: 0,Class,Specimen Number,Eccentricity,Aspect Ratio,Elongation,Solidity,Stochastic Convexity,Isoperimetric Factor,Maximal Indentation Depth,Lobedness,Average Intensity,Average Contrast,Smoothness,Third moment,Uniformity,Entropy
118,11,11,0.52382,1.1117,0.67175,0.54701,0.62982,0.15157,0.13674,3.4028,0.026434,0.085792,0.007306,0.002137,0.000166,0.90513


Saving the processed dataset to wandb

In [8]:
artifact = wandb.Artifact(
        'data_leaf_processado',
        type='csv',
        description='Dataset after column processing')

artifact.add_file('leaf.csv')

<ManifestEntry digest: 8enho+coZZ231SjBNt9l1g==>

In [9]:
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f487c723d60>

In [10]:
data_leaf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Class                      340 non-null    int64  
 1   Specimen Number            340 non-null    int64  
 2   Eccentricity               340 non-null    float64
 3   Aspect Ratio               340 non-null    float64
 4   Elongation                 340 non-null    float64
 5   Solidity                   340 non-null    float64
 6   Stochastic Convexity       340 non-null    float64
 7   Isoperimetric Factor       340 non-null    float64
 8   Maximal Indentation Depth  340 non-null    float64
 9   Lobedness                  340 non-null    float64
 10  Average Intensity          340 non-null    float64
 11  Average Contrast           340 non-null    float64
 12  Smoothness                 340 non-null    float64
 13  Third moment               340 non-null    float64

pandas-profiling report for the dataset

In [11]:
profile = pandas_profiling.ProfileReport(data_leaf)
profile.to_widgets()


In [12]:
# Data shape
logging.info(f'Dataset shape: {data_leaf.shape}')

INFO : Dataset shape: (340, 16)


## Classes

In [13]:
# Classes (Labels)
y = data_leaf['Class']

In [14]:
y

0       1
1       1
2       1
3       1
4       1
       ..
335    36
336    36
337    36
338    36
339    36
Name: Class, Length: 340, dtype: int64

In [84]:
# Class shape
logging.info(y.shape)

# Class types
logging.info(type(y))
logging.info(y.dtypes)

INFO : (340,)
INFO : <class 'pandas.core.series.Series'>
INFO : int64


## Inputs

In [16]:
# Inputs (features)
X = data_leaf.drop('Class', axis=1)
X.head()

Unnamed: 0,Specimen Number,Eccentricity,Aspect Ratio,Elongation,Solidity,Stochastic Convexity,Isoperimetric Factor,Maximal Indentation Depth,Lobedness,Average Intensity,Average Contrast,Smoothness,Third moment,Uniformity,Entropy
0,1,0.72694,1.4742,0.32396,0.98535,1.0,0.83592,0.004657,0.003947,0.04779,0.12795,0.016108,0.005232,0.000275,1.1756
1,2,0.74173,1.5257,0.36116,0.98152,0.99825,0.79867,0.005242,0.005002,0.02416,0.090476,0.008119,0.002708,7.5e-05,0.69659
2,3,0.76722,1.5725,0.38998,0.97755,1.0,0.80812,0.007457,0.010121,0.011897,0.057445,0.003289,0.000921,3.8e-05,0.44348
3,4,0.73797,1.4597,0.35376,0.97566,1.0,0.81697,0.006877,0.008607,0.01595,0.065491,0.004271,0.001154,6.6e-05,0.58785
4,5,0.82301,1.7707,0.44462,0.97698,1.0,0.75493,0.007428,0.010042,0.007938,0.045339,0.002051,0.00056,2.4e-05,0.34214


In [83]:
# Input shape
logging.info(X.shape)

# Input types
logging.info(type(X))
logging.info(X.dtypes)

INFO : (340, 15)
INFO : <class 'pandas.core.frame.DataFrame'>
INFO : Specimen Number                int64
Eccentricity                 float64
Aspect Ratio                 float64
Elongation                   float64
Solidity                     float64
Stochastic Convexity         float64
Isoperimetric Factor         float64
Maximal Indentation Depth    float64
Lobedness                    float64
Average Intensity            float64
Average Contrast             float64
Smoothness                   float64
Third moment                 float64
Uniformity                   float64
Entropy                      float64
dtype: object


## Split the data into training and test sets

In [18]:
# test_size=None falls back to scikit-learn's default of 25% for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=None, random_state=0)

In [19]:
logging.info(X_train.shape)
logging.info(y_train.shape)
logging.info(X_test.shape)
logging.info(y_test.shape)

INFO : (255, 15)
INFO : (255,)
INFO : (85, 15)
INFO : (85,)
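
The 255/85 split follows from the default test fraction of 25%. The sketch below reproduces those shapes on synthetic data and also shows the `stratify` option, which keeps every class represented in both splits. The data is random, and the 34 balanced classes are a hypothetical layout chosen for the example, not the real class distribution of the leaf dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows and features as X
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(340, 15))
y_demo = np.repeat(np.arange(34), 10)  # hypothetical: 34 classes, 10 samples each

# test_size=None (with train_size=None) means a 25% test split
a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=None, random_state=0)
print(a_train.shape, a_test.shape)

# stratify=y_demo preserves class proportions in both splits
s_train, s_test, t_train, t_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)
n_test_classes = len(np.unique(t_test))
print(n_test_classes)
```

With few samples per class, a non-stratified split can leave some classes with very few test examples, which makes per-class metrics unstable.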


Saving locally

In [20]:
# Persist each split; the context manager closes every file handle
for name, split in [('X_train', X_train), ('y_train', y_train), ('X_test', X_test), ('y_test', y_test)]:
    with open(f'model/Xy/{name}.pkl', 'wb') as f:
        pkl.dump(split, f)

In [21]:
type(X_train)

pandas.core.frame.DataFrame

Saving the training and test data to W&B so that they are ready if we need them later.

In [21]:
x_train_wb = wandb.Artifact(
        'x_train',
        type='pandas.core.frame.DataFrame',
        description='X train')

# Call the artifact object, not the data split
x_train_wb.add_file('model/Xy/X_train.pkl')
run.log_artifact(x_train_wb)

<ManifestEntry digest: I+LmpPg5uXPL8Yvh13DK0w==>

In [24]:
y_train_wb = wandb.Artifact(
        'y_train',
        type='pandas.core.series.Series',
        description='Y train')

# Call the artifact object, not the data split
y_train_wb.add_file('model/Xy/y_train.pkl')
run.log_artifact(y_train_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f050460fd30>

In [25]:
x_teste_wb = wandb.Artifact(
        'x_teste',
        type='pandas.core.frame.DataFrame',
        description='X test')

# Call the artifact object, not the data split
x_teste_wb.add_file('model/Xy/X_test.pkl')
run.log_artifact(x_teste_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0504602df0>

In [26]:
y_teste_wb = wandb.Artifact(
        'y_teste',
        type='pandas.core.series.Series',
        description='Y test')

# Call the artifact object, not the data split
y_teste_wb.add_file('model/Xy/y_test.pkl')
run.log_artifact(y_teste_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f05045a75e0>

## Train the Machine Learning model

Creating a simple pipeline with only the preprocessing step and the classifier

In [30]:
steps = [('preprocessing', StandardScaler()),
         ('classifier', SVC(kernel='rbf'))]

pipe = Pipeline(steps)

model_deploy = pipe.fit(X_train, y_train)

In [31]:
model_deploy

Pipeline(steps=[('preprocessing', StandardScaler()), ('classifier', SVC())])

In [34]:
logging.info(f'Training set score: {model_deploy.score(X_train,y_train)}')
logging.info(f'Test set score: {model_deploy.score(X_test,y_test)}')

INFO : Training set score: 0.7019607843137254
INFO : Test set score: 0.4
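
A large gap between the two scores suggests the model does not generalize well. Cross-validation on the training set gives an additional estimate before the test set is touched. The sketch below assumes scikit-learn's built-in wine dataset as a stand-in, so its numbers say nothing about the leaf data, and the `_w` suffixed names are made up for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_w, y_w = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_w, y_w, random_state=0, stratify=y_w)

demo_pipe = Pipeline([('preprocessing', StandardScaler()),
                      ('classifier', SVC(kernel='rbf'))])

# Each fold refits the whole pipeline, so the scaler never sees the validation fold
cv_scores = cross_val_score(demo_pipe, Xw_train, yw_train, cv=5)
demo_pipe.fit(Xw_train, yw_train)

print(f'train={demo_pipe.score(Xw_train, yw_train):.3f}')
print(f'cv mean={cv_scores.mean():.3f}')
print(f'test={demo_pipe.score(Xw_test, yw_test):.3f}')
```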


In [35]:
set_config(display='diagram')
model_deploy  # click the diagram below to see the details of each step

## Save the Machine Learning model for deployment on IBM Cloud

In [37]:
# save the model to disk
with open('model/modelo/model_leaf_deploy.pkl', 'wb') as f:
    pkl.dump(model_deploy, f)

In [36]:
type(model_deploy)

sklearn.pipeline.Pipeline

In [39]:
model_deploy_wb = wandb.Artifact(
        'modelo_treinado',
        type='sklearn.pipeline.Pipeline',
        description='Trained SVM model')

model_deploy_wb.add_file('model/modelo/model_leaf_deploy.pkl')
run.log_artifact(model_deploy_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f47acac63a0>

# To improve the classification metrics of your first model, see the material from class 05

In [42]:
parameters = {'preprocessing': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
              'classifier__C': [10, 100, 1000],
              'classifier__kernel': ['linear', 'rbf', 'sigmoid']}
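
The keys of this dictionary address pipeline steps by name: `preprocessing` replaces the whole step, while `classifier__C` sets `C` on the step named `classifier`. As a quick check of how many combinations the search space holds, the sketch below counts them with `ParameterGrid`; the variable names `grid` and `n_candidates` are made up for the example.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, StandardScaler

grid = {'preprocessing': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
        'classifier__C': [10, 100, 1000],
        'classifier__kernel': ['linear', 'rbf', 'sigmoid']}

# 4 scalers * 3 values of C * 3 kernels
n_candidates = len(ParameterGrid(grid))
print(n_candidates)
```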

In [48]:
def best_params(values: dict, clf: Pipeline, x_treino: pd.DataFrame, x_teste: pd.DataFrame, y_treino: pd.Series, y_teste: pd.Series) -> tuple:
    '''
        Instantiates a randomized search and looks for the best parameters for the model.
        
        Params
        ------
        :values: parameters to be tested
        :clf: instance of the desired classifier
        :x_treino: training data
        :x_teste: test data
        :y_treino: training data - labels
        :y_teste: test data - labels
        
        Return
        ------
        :best: dict with the best parameters
        :random_clf: the fitted search object
    '''
    
    logging.info('Instantiating SVM')
    # The search space has only 4 * 3 * 3 = 36 combinations, so n_iter=200 is capped at 36 candidates
    random_clf = RandomizedSearchCV(clf, param_distributions=values, n_iter=200, verbose=1)
    
    logging.info('Training SVM')
    random_clf.fit(x_treino, y_treino)
    
    logging.info('Predict SVM')
    y_random_clf_rl = random_clf.predict(x_teste)
    
    logging.info('SVM parameters')
    print(RED, random_clf.get_params(), ENDC)
    
    logging.info('Best Params SVM')
    # Named `best` so the local variable does not shadow the function name
    best = random_clf.best_params_
    print(BLUE, best, ENDC)
    
    return best, random_clf

In [50]:
%%time

try:
    best, random_clf = best_params(parameters, model_deploy, X_train, X_test, y_train, y_test)

except Exception as error:
    logging.warning('Something went wrong...')
    logging.critical(error)

else:
    logging.info('Ok')
    
finally:
    logging.info('Search finished')

INFO : Instantiating SVM
INFO : Training SVM


Fitting 5 folds for each of 36 candidates, totalling 180 fits


INFO : Predict SVM
INFO : SVM parameters
INFO : Best Params SVM
INFO : Ok
INFO : Search finished


{'cv': None, 'error_score': nan, 'estimator__memory': None, 'estimator__steps': [('preprocessing', StandardScaler()), ('classifier', SVC())], 'estimator__verbose': False, 'estimator__preprocessing': StandardScaler(), 'estimator__classifier': SVC(), 'estimator__preprocessing__copy': True, 'estimator__preprocessing__with_mean': True, 'estimator__preprocessing__with_std': True, 'estimator__classifier__C': 1.0, 'estimator__classifier__break_ties': False, 'estimator__classifier__cache_size': 200, 'estimator__classifier__class_weight': None, 'estimator__classifier__coef0': 0.0, 'estimator__classifier__decision_function_shape': 'ovr', 'estimator__classifier__degree': 3, 'estimator__classifier__gamma': 'scale', 'estimator__classifier__kernel': 'rbf', 'estimator__classifier__max_iter': -1, 'estimator__classifier__probability': False, 'estimator__classifier__random_state': None, 'estimator__classifier__shrinking': True, 'estimator__classifier__tol': 0.001, 'estimator__classifier__verbose

In [54]:
best

{'preprocessing': StandardScaler(),
 'classifier__kernel': 'linear',
 'classifier__C': 10}
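
When every entry in the search space is a short list, a randomized search ends up trying all combinations, much like a grid search. Random search pays off when a parameter is sampled from a distribution instead. The following is a minimal sketch of that variation, not the search run in this notebook: it assumes the iris dataset as a stand-in and an arbitrary log-uniform range for `C`.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_i, y_i = load_iris(return_X_y=True)
pipe_i = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# C is drawn from a log-uniform distribution instead of a fixed list
dists = {'classifier__C': loguniform(1e-1, 1e3),
         'classifier__kernel': ['linear', 'rbf']}

# random_state makes the sampled candidates reproducible
rs = RandomizedSearchCV(pipe_i, param_distributions=dists,
                        n_iter=10, cv=5, random_state=0)
rs.fit(X_i, y_i)

print(rs.best_params_)
print(len(rs.cv_results_['params']))
```

Here `n_iter` really limits the number of candidates, because a continuous distribution has no finite set of combinations to exhaust.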

In [55]:
random_clf

In [57]:
# Stores the optimum model in best_pipe
best_pipe = random_clf.best_estimator_
logging.info(best_pipe)

INFO : Pipeline(steps=[('preprocessing', StandardScaler()),
                ('classifier', SVC(C=10, kernel='linear'))])


In [60]:
result_df = pd.DataFrame.from_dict(random_clf.cv_results_, orient='columns')
result_df.shape

(36, 16)

In [61]:
result_df.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_preprocessing,param_classifier__kernel,param_classifier__C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.007606,0.001032,0.002077,0.000184,StandardScaler(),linear,10,"{'preprocessing': StandardScaler(), 'classifie...",0.745098,0.764706,0.72549,0.72549,0.647059,0.721569,0.039992,1
1,0.007259,0.000497,0.002268,0.000264,MinMaxScaler(),linear,10,"{'preprocessing': MinMaxScaler(), 'classifier_...",0.666667,0.72549,0.686275,0.705882,0.54902,0.666667,0.062005,17
2,0.007427,0.000471,0.002617,0.000406,Normalizer(),linear,10,"{'preprocessing': Normalizer(), 'classifier__k...",0.27451,0.313725,0.254902,0.254902,0.27451,0.27451,0.021479,27
3,0.007469,0.000588,0.002416,0.000324,MaxAbsScaler(),linear,10,"{'preprocessing': MaxAbsScaler(), 'classifier_...",0.666667,0.72549,0.686275,0.686275,0.54902,0.662745,0.059988,18
4,0.008887,0.000987,0.002319,0.000133,StandardScaler(),rbf,10,"{'preprocessing': StandardScaler(), 'classifie...",0.705882,0.72549,0.764706,0.666667,0.54902,0.682353,0.073784,14
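
Sorting by `rank_test_score` is a quick way to read this table. The sketch below rebuilds only the five rows shown in the preview, reduced to the columns it needs. As a simplification, the scalers are stored as plain strings here, whereas the real `cv_results_` holds estimator objects.

```python
import pandas as pd

# The five preview rows, reduced to the columns used in this example
demo_results = pd.DataFrame({
    'param_preprocessing': ['StandardScaler()', 'MinMaxScaler()', 'Normalizer()',
                            'MaxAbsScaler()', 'StandardScaler()'],
    'param_classifier__kernel': ['linear', 'linear', 'linear', 'linear', 'rbf'],
    'param_classifier__C': [10, 10, 10, 10, 10],
    'mean_test_score': [0.721569, 0.666667, 0.274510, 0.662745, 0.682353],
    'rank_test_score': [1, 17, 27, 18, 14],
})

# Rank 1 is the candidate with the best mean cross-validated score
ranked = demo_results.sort_values('rank_test_score').reset_index(drop=True)
print(ranked[['param_preprocessing', 'param_classifier__kernel', 'mean_test_score']])
```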


In [85]:
sns.relplot(data=result_df,
    kind='line',
    x='param_classifier__C',
    y='mean_test_score',
    hue='param_preprocessing',
    col='param_classifier__kernel')
plt.show()

In [89]:
sns.relplot(data=result_df,
    kind='line',
    x='param_classifier__C',
    y='mean_test_score',
    hue='param_classifier__kernel',
    col='param_preprocessing')
plt.show()

In [74]:
categorias = random_clf.predict(X_test)

In [75]:
categorias

array([13, 30,  4,  9, 33,  8,  2, 31,  7, 31, 15,  9,  7, 23, 31, 27,  3,
        5,  7,  4, 12,  4,  4,  6,  1, 31,  6, 14, 15, 34, 29, 32, 10, 22,
       12, 13, 10, 10, 29, 28, 13,  6,  7,  6, 32, 26, 13,  8, 22, 27, 24,
        8, 13,  5, 32, 13, 32,  6,  1,  1, 32,  9, 25, 12,  2, 12, 12, 10,
        2, 13, 35,  2, 14,  5,  5,  4, 12, 26, 15, 36, 23, 15, 11,  9,  1])

In [79]:
resultados = classification_report(y_test, categorias)
print(resultados)



              precision    recall  f1-score   support

           1       0.75      1.00      0.86         3
           2       0.75      0.75      0.75         4
           3       1.00      0.50      0.67         2
           4       0.20      1.00      0.33         1
           5       0.25      1.00      0.40         1
           6       0.80      0.80      0.80         5
           7       1.00      0.67      0.80         6
           8       1.00      1.00      1.00         3
           9       0.75      1.00      0.86         3
          10       1.00      1.00      1.00         4
          11       1.00      1.00      1.00         1
          12       0.67      0.80      0.73         5
          13       0.71      1.00      0.83         5
          14       1.00      0.67      0.80         3
          15       1.00      1.00      1.00         4
          22       1.00      0.40      0.57         5
          23       0.50      1.00      0.67         1
          24       1.00    

In [90]:
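# Pass the sorted labels explicitly so each name lines up with its class
classes = sorted(set(y))
class_rep_por_classe = classification_report(y_test, categorias, labels=classes, target_names=classes, output_dict=True)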



In [91]:
class_rep_por_classe

{1: {'precision': 0.75,
  'recall': 1.0,
  'f1-score': 0.8571428571428571,
  'support': 3},
 2: {'precision': 0.75, 'recall': 0.75, 'f1-score': 0.75, 'support': 4},
 3: {'precision': 1.0,
  'recall': 0.5,
  'f1-score': 0.6666666666666666,
  'support': 2},
 4: {'precision': 0.2,
  'recall': 1.0,
  'f1-score': 0.33333333333333337,
  'support': 1},
 5: {'precision': 0.25, 'recall': 1.0, 'f1-score': 0.4, 'support': 1},
 6: {'precision': 0.8,
  'recall': 0.8,
  'f1-score': 0.8000000000000002,
  'support': 5},
 7: {'precision': 1.0,
  'recall': 0.6666666666666666,
  'f1-score': 0.8,
  'support': 6},
 8: {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 3},
 9: {'precision': 0.75,
  'recall': 1.0,
  'f1-score': 0.8571428571428571,
  'support': 3},
 10: {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 4},
 11: {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 1},
 12: {'precision': 0.6666666666666666,
  'recall': 0.8,
  'f1-score': 0.7272727272727272,
  'su
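
The dictionary form of the report is easier to inspect as a table. The sketch below uses tiny made-up labels in which class 3 is never predicted, which is the kind of situation that typically triggers undefined-metric warnings. It assumes a scikit-learn version that supports the `zero_division` argument.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up labels: class 3 appears in y_true but is never predicted
y_true = [1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 1, 2]

# zero_division=0 scores the undefined precision of class 3 as 0 without a warning
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Keep only the per-class rows by dropping the aggregate entries
per_class = pd.DataFrame(report).T.drop(['accuracy', 'macro avg', 'weighted avg'])
print(per_class.sort_values('f1-score'))
```

Sorting by `f1-score` puts the weakest classes first, which is a reasonable place to start when looking for where the model fails.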

Saving the model trained with random search

In [52]:
# Dump the fitted search object; model_deploy is the untuned baseline pipeline
with open('model/modelo/model_leaf_rs.pkl', 'wb') as f:
    pkl.dump(random_clf, f)

In [53]:
modelo_treinado_rs = wandb.Artifact(
        'modelo_treinado_rs',
        type='sklearn.model_selection.RandomizedSearchCV',
        description='Model trained with parameters optimized via random search')

modelo_treinado_rs.add_file('model/modelo/model_leaf_rs.pkl')
run.log_artifact(modelo_treinado_rs)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f47aca468e0>

Saving the best estimator

In [71]:
model = random_clf.best_estimator_

In [72]:
with open('model/modelo/best_estimator.pkl', 'wb') as f:
    pkl.dump(model, f)

In [73]:
best_estimator_wb = wandb.Artifact(
        'best_estimator',
        type='sklearn.pipeline.Pipeline',
        description='Best estimator')

best_estimator_wb.add_file('model/modelo/best_estimator.pkl')
run.log_artifact(best_estimator_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f47ac4f67c0>
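
A saved pipeline is only useful if it can be loaded back and produces the same predictions. The round trip can be sketched as follows, assuming the iris dataset and a temporary directory in place of the `model/modelo/` folder; the names `fitted`, `restored`, and `same` are made up for the example.

```python
import pickle as pkl
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_s, y_s = load_iris(return_X_y=True)
fitted = Pipeline([('preprocessing', StandardScaler()),
                   ('classifier', SVC(C=10, kernel='linear'))]).fit(X_s, y_s)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'best_estimator.pkl'  # hypothetical temporary location
    with open(path, 'wb') as f:
        pkl.dump(fitted, f)
    with open(path, 'rb') as f:
        restored = pkl.load(f)

# The restored pipeline should reproduce the original predictions
same = bool((restored.predict(X_s) == fitted.predict(X_s)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that wrote them.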