# Leaf Recognition with Machine Learning

An introduction to machine learning with scikit-learn
https://scikit-learn.org/stable/
https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Supervised learning
https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Unsupervised learning
https://scikit-learn.org/stable/unsupervised_learning.html

In this notebook you will complete your first Machine Learning project using Python.

A Machine Learning project follows a series of well-known steps:

- Define the problem.
- Prepare the data.
- Evaluate algorithms.
- Improve the results.
- Present the final results.

![image.png](attachment:14b85f78-7ab2-4425-9a9e-d3bb08480ce5.png)

# Supervised Training for Modeling

![image.png](attachment:53109a22-6bcc-4eaf-bc75-3e4dcef808ad.png)

* Importing the libraries

In [1]:
import logging
import pandas as pd
import pandas_profiling
import pickle as pkl
import sys
import wandb

from sklearn.model_selection import train_test_split

Installing pandas-profiling
```python
!{sys.executable} -m pip install pandas-profiling

!jupyter nbextension enable --py widgetsnbextension
```

Logging in to W&B
```python
!wandb login
```

Setting up the logger

In [5]:
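# Root logger handle, plus a format and level for every message it emits
logger = logging.getLogger()
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logger.setLevel(logging.INFO)  # same as the numeric level 20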

In [6]:
logging.info('sara')

INFO : sara
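
As a rough illustration of what the format string and the INFO level do, the sketch below routes a separate, hypothetical logger (`leaf_demo`) into an in-memory buffer so the result can be inspected. The logger name and the messages are made up for this example and are not part of the notebook.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration
demo_logger = logging.getLogger("leaf_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from the root logger

# Send the records to a string buffer using the notebook's format string
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter('%(levelname)s : %(message)s'))
demo_logger.addHandler(handler)

demo_logger.debug('hidden')        # below INFO, so it is discarded
demo_logger.info('shown')
demo_logger.warning('also shown')

captured = buffer.getvalue()
print(captured)
```

Messages below the configured level never reach the handler, which is why only the INFO and WARNING lines are captured.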


Initializing the wandb run

In [7]:
run = wandb.init(project="eldorado_leaf", group="eda", save_code=True)

wandb: Currently logged in as: saraselis (use `wandb login --relogin` to force relogin)


# Loading the dataset
We can load it directly from the .csv file.

Everything should load without errors. If you get an error, stop and install any libraries that are still missing with pip install <lib_name>.

## Leaf Dataset

The provided data comprises the following shape (attributes 3 to 9) and texture (attributes 10
to 16) features:

1. Class (Species)
2. Specimen Number
3. Eccentricity
4. Aspect Ratio
5. Elongation
6. Solidity
7. Stochastic Convexity
8. Isoperimetric Factor
9. Maximal Indentation Depth
10. Lobedness
11. Average Intensity
12. Average Contrast
13. Smoothness
14. Third moment
15. Uniformity
16. Entropy

Since the dataset ships without a header row, we define the column names ourselves.

In [8]:
names = ['Class', 'Specimen Number', 'Eccentricity', 'Aspect Ratio', 'Elongation', 'Solidity', 'Stochastic Convexity', 'Isoperimetric Factor',
         'Maximal Indentation Depth', 'Lobedness', 'Average Intensity', 'Average Contrast', 'Smoothness', 'Third moment',
         'Uniformity', 'Entropy']

Reading the dataset

In [9]:
data_leaf = pd.read_csv('leaf.csv', names=names)
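
To make the effect of `names=` concrete, here is a minimal sketch on two made-up rows (the values and the shortened column list are assumptions for illustration, not taken from `leaf.csv`). It compares reading a headerless CSV with and without explicit column names.

```python
import io

import pandas as pd

# Two made-up rows standing in for a headerless CSV file
raw = io.StringIO("1,0.72,1.47\n2,0.74,1.52\n")

# With names=, every line is treated as data
with_names = pd.read_csv(raw, names=['Class', 'Eccentricity', 'Aspect Ratio'])

# Without names=, the first data row is consumed as the header
raw.seek(0)
without_names = pd.read_csv(raw)

print(with_names.shape)
print(without_names.shape)
print(list(without_names.columns))
```

Without `names=`, one observation is silently lost to the header, so passing the names explicitly keeps all rows.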

Dataset

In [6]:
data_leaf.sample()

| | Class | Specimen Number | Eccentricity | Aspect Ratio | Elongation | Solidity | Stochastic Convexity | Isoperimetric Factor | Maximal Indentation Depth | Lobedness | Average Intensity | Average Contrast | Smoothness | Third moment | Uniformity | Entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 4 | 7 | 0.39289 | 1.1286 | 0.17039 | 0.96405 | 1.0 | 0.79407 | 0.011761 | 0.025176 | 0.025409 | 0.078211 | 0.00608 | 0.00153 | 0.000158 | 1.0326 |


Saving the processed dataset to wandb

In [7]:
artifact = wandb.Artifact(
        'data_leaf_processado',
        type='csv',
        description='Dataset após o processamento das colunas')

artifact.add_file('leaf.csv')

<ManifestEntry digest: 8enho+coZZ231SjBNt9l1g==>

In [8]:
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f86d8f1c3a0>

In [10]:
data_leaf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Class                      340 non-null    int64  
 1   Specimen Number            340 non-null    int64  
 2   Eccentricity               340 non-null    float64
 3   Aspect Ratio               340 non-null    float64
 4   Elongation                 340 non-null    float64
 5   Solidity                   340 non-null    float64
 6   Stochastic Convexity       340 non-null    float64
 7   Isoperimetric Factor       340 non-null    float64
 8   Maximal Indentation Depth  340 non-null    float64
 9   Lobedness                  340 non-null    float64
 10  Average Intensity          340 non-null    float64
 11  Average Contrast           340 non-null    float64
 12  Smoothness                 340 non-null    float64
 13  Third moment               340 non-null    float64

pandas-profiling report on the dataset

In [10]:
profile = pandas_profiling.ProfileReport(data_leaf)
profile.to_widgets()

In [11]:
# Data shape
logging.info(f'Dataset shape: {data_leaf.shape}')

INFO : Dataset shape: (340, 16)


## Classes

In [12]:
# Classes (Labels)
y = data_leaf['Class']

# Class distribution


In [13]:
y

0       1
1       1
2       1
3       1
4       1
       ..
335    36
336    36
337    36
338    36
339    36
Name: Class, Length: 340, dtype: int64
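
The class distribution mentioned in the cell above is not actually computed there. One possible way to inspect it is `value_counts()`, sketched here on a handful of made-up labels; in the notebook the same calls would presumably be applied to `y`.

```python
import pandas as pd

# Made-up labels; in the notebook this would be y = data_leaf['Class']
labels = pd.Series([1, 1, 1, 2, 2, 36], name='Class')

# Number of samples per class, ordered by class id
counts = labels.value_counts().sort_index()
print(counts)
print(f'{labels.nunique()} classes, smallest has {counts.min()} sample(s)')
```

With few samples per class, as the 340-row dataset suggests, very small classes are worth spotting before splitting the data.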

In [14]:
# Class shape
logging.info(y.shape)

# Class type
logging.info(type(y))

INFO : (340,)
INFO : <class 'pandas.core.series.Series'>


## Inputs

In [15]:
# Inputs (features)
X = data_leaf.drop('Class', axis=1)
X.head()

| | Specimen Number | Eccentricity | Aspect Ratio | Elongation | Solidity | Stochastic Convexity | Isoperimetric Factor | Maximal Indentation Depth | Lobedness | Average Intensity | Average Contrast | Smoothness | Third moment | Uniformity | Entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1.0 | 0.83592 | 0.004657 | 0.003947 | 0.04779 | 0.12795 | 0.016108 | 0.005232 | 0.000275 | 1.1756 |
| 1 | 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.005242 | 0.005002 | 0.02416 | 0.090476 | 0.008119 | 0.002708 | 7.5e-05 | 0.69659 |
| 2 | 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1.0 | 0.80812 | 0.007457 | 0.010121 | 0.011897 | 0.057445 | 0.003289 | 0.000921 | 3.8e-05 | 0.44348 |
| 3 | 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1.0 | 0.81697 | 0.006877 | 0.008607 | 0.01595 | 0.065491 | 0.004271 | 0.001154 | 6.6e-05 | 0.58785 |
| 4 | 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1.0 | 0.75493 | 0.007428 | 0.010042 | 0.007938 | 0.045339 | 0.002051 | 0.00056 | 2.4e-05 | 0.34214 |


In [16]:
# Input shape
logging.info(X.shape)

# Input type
logging.info(type(X))

INFO : (340, 15)
INFO : <class 'pandas.core.frame.DataFrame'>


## Splitting the data into training and test sets

In [30]:
# test_size=0.25 is the default that test_size=None falls back to, written out for clarity
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [18]:
logging.info(X_train.shape)
logging.info(y_train.shape)
logging.info(X_test.shape)
logging.info(y_test.shape)

INFO : (255, 15)
INFO : (255,)
INFO : (85, 15)
INFO : (85,)
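
The 255/85 split above follows from scikit-learn's default test fraction of 0.25. The sketch below reproduces those shapes on synthetic data of the same size and also shows the `stratify` option, which the notebook does not use; the random features and the 34 evenly sized classes are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the leaf data: 340 rows, 15 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(340, 15))
y_demo = np.arange(340) % 34  # 34 made-up classes with 10 samples each

# test_size=None falls back to the default test fraction of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=None, random_state=0)
print(X_tr.shape, X_te.shape)

# stratify=y_demo keeps the class proportions similar in both subsets
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)
print(np.unique(y_te_s).size)
```

With many classes and few samples per class, a stratified split is one way to make sure every class shows up in the test set; whether it helps on the real leaf data would need to be checked.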


Saving locally

In [46]:
# Write each split to disk; the with-block closes the file after the dump
for name, split in [('X_train', X_train), ('y_train', y_train),
                    ('X_test', X_test), ('y_test', y_test)]:
    with open(f'model/Xy/{name}.pkl', 'wb') as f:
        pkl.dump(split, f)

In [19]:
type(X_train)

pandas.core.frame.DataFrame
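
A quick way to check that a pickled split can be read back is a round trip, sketched here with a small made-up frame in a temporary directory. The frame contents and the temporary location are assumptions for the example; the notebook writes to `model/Xy/` instead.

```python
import pickle as pkl
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up frame standing in for X_train
frame = pd.DataFrame({'Eccentricity': [0.72, 0.74], 'Solidity': [0.98, 0.97]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'model' / 'Xy' / 'X_train.pkl'
    # open() does not create missing folders, so create them first
    target.parent.mkdir(parents=True, exist_ok=True)

    with open(target, 'wb') as f:
        pkl.dump(frame, f)
    with open(target, 'rb') as f:
        restored = pkl.load(f)

print(restored.equals(frame))
```

The `mkdir` call matters in practice: writing to a folder that does not exist yet raises `FileNotFoundError`.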

Saving to W&B

In [21]:
x_train_wb = wandb.Artifact(
        'x_train',
        type='pandas.core.frame.DataFrame',
        description='X treino')

# Attach the pickle to the artifact object created above, then log it
x_train_wb.add_file('model/Xy/X_train.pkl')
run.log_artifact(x_train_wb)

<ManifestEntry digest: I+LmpPg5uXPL8Yvh13DK0w==>

In [24]:
y_train_wb = wandb.Artifact(
        'y_train',
        type='pandas.core.frame.DataFrame',
        description='Y treino')

# Call add_file on the artifact, not on the y_train Series
y_train_wb.add_file('model/Xy/y_train.pkl')
run.log_artifact(y_train_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f050460fd30>

In [25]:
x_teste_wb = wandb.Artifact(
        'x_teste',
        type='pandas.core.frame.DataFrame',
        description='X teste')

# Attach the pickle to the artifact object created above, then log it
x_teste_wb.add_file('model/Xy/X_test.pkl')
run.log_artifact(x_teste_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0504602df0>

In [26]:
y_teste_wb = wandb.Artifact(
        'y_teste',
        type='pandas.core.frame.DataFrame',
        description='Y teste')

# Attach the pickle to the artifact object created above, then log it
y_teste_wb.add_file('model/Xy/y_test.pkl')
run.log_artifact(y_teste_wb)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f05045a75e0>

## Training the Machine Learning model

In [28]:
# Load Libs

import sklearn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import set_config

In [31]:
steps = [('preprocessing', StandardScaler()),
         ('classifier', SVC(kernel='rbf'))]

pipe = Pipeline(steps)

model_deploy = pipe.fit(X_train, y_train)

In [32]:
model_deploy

Pipeline(steps=[('preprocessing', StandardScaler()), ('classifier', SVC())])

In [33]:
print('Training set score: ' + str(model_deploy.score(X_train,y_train)))
print('Test set score: ' + str(model_deploy.score(X_test,y_test)))

Training set score: 0.7019607843137254
Test set score: 0.4
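
The gap between the training score (about 0.70) and the test score (0.40) comes from a single split, so it can vary with the split. Cross-validation is one common way to get a steadier estimate. The sketch below applies it to the same pipeline structure, using scikit-learn's bundled iris data as a stand-in; the dataset and the `cv=5` choice are assumptions for illustration, and in the notebook `X_train` and `y_train` would presumably take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data, not the leaf dataset
X_demo, y_demo = load_iris(return_X_y=True)

pipe_demo = Pipeline([('preprocessing', StandardScaler()),
                      ('classifier', SVC(kernel='rbf'))])

# Each fold refits the scaler and the classifier on that fold's training part only
scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

Because the scaler lives inside the pipeline, it is fitted only on each fold's training portion, which avoids leaking statistics from the held-out part.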


In [34]:
set_config(display='diagram')
model_deploy  # click the diagram below to see the details of each step

## Saving the Machine Learning model for deployment on IBM Cloud

In [35]:
# Save the fitted pipeline to disk; the with-block closes the file afterwards
with open('model_leaf_deploy.pkl', 'wb') as f:
    pkl.dump(model_deploy, f)

# To improve the classification metrics of your first model, see the material from lesson 05

In [37]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, MaxAbsScaler


In [53]:
parameters = {'preprocessing': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
              'classifier__C': [10, 100, 1000],
              'classifier__kernel': ['linear', 'rbf', 'sigmoid']}
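
The keys in this dictionary follow the pipeline's naming scheme: a bare step name such as `preprocessing` swaps the whole step, while `classifier__C` reaches the `C` parameter of the `classifier` step. As a small check, the sketch below counts the combinations the grid spans and verifies that every key is a valid pipeline parameter. `pipe_demo` and `grid` are local copies created for this example.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, StandardScaler
from sklearn.svm import SVC

pipe_demo = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

grid = {'preprocessing': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
        'classifier__C': [10, 100, 1000],
        'classifier__kernel': ['linear', 'rbf', 'sigmoid']}

# 4 scalers x 3 values of C x 3 kernels
n_candidates = len(ParameterGrid(grid))
print(n_candidates)  # 36

# Every key in the grid must be a parameter name the pipeline exposes
valid = set(pipe_demo.get_params())
keys_ok = all(key in valid for key in grid)
print(keys_ok)  # True
```

A randomized search samples from these 36 combinations, so any `n_iter` above 36 cannot add new candidates.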

In [54]:
import numpy as np

In [55]:
# ANSI codes used to color the printed parameters
RED, BLUE, ENDC = "\033[1;31m", "\033[1;34m", "\033[0m"

def best_params(values: dict, clf: Pipeline, x_treino: pd.DataFrame, x_teste: pd.DataFrame,
                y_treino: pd.Series, y_teste: pd.Series) -> tuple:
    '''
        Runs a randomized search over `values` and returns the best parameters found.

        Params
        ------
        :values: parameters to be tested
        :clf: instance of the estimator to tune
        :x_treino: training data
        :x_teste: test data
        :y_treino: training labels
        :y_teste: test labels (currently unused)

        Return
        ------
        :best: dict with the best parameters
        :random_clf: the fitted RandomizedSearchCV instance
    '''

    logging.info('Instanciando SVM')
    # n_iter=10 matches the 10 candidates in the recorded output; the grid has 36 combinations
    random_clf = RandomizedSearchCV(clf, param_distributions=values, n_iter=10, verbose=1)

    logging.info('Treinando SVM')
    random_clf.fit(x_treino, y_treino)

    logging.info('Predict SVM')
    y_random_clf_rl = random_clf.predict(x_teste)  # computed but not returned

    logging.info('Parametros SVM')
    print(RED, random_clf.get_params(), ENDC)

    logging.info('Best Params SVM')
    best = random_clf.best_params_
    print(BLUE, best, ENDC)

    return best, random_clf

In [56]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [57]:
%%time

try:
    best, random_clf = best_params(parameters, model_deploy, X_train, X_test, y_train, y_test)

except Exception as error:
    logging.warning('Aconteceu algum problema...')
    logging.critical(error)

else:
    logging.info('Ok')

finally:
    logging.info('Busca finalizada')

INFO : Instanciando SVM
INFO : Treinando SVM


Fitting 5 folds for each of 10 candidates, totalling 50 fits


INFO : Predict SVM
INFO : Parametros SVM
INFO : Best Params SVM
INFO : Ok
INFO : Busca finalizada


{'cv': None, 'error_score': nan, 'estimator__memory': None, 'estimator__steps': [('preprocessing', StandardScaler()), ('classifier', SVC())], 'estimator__verbose': False, 'estimator__preprocessing': StandardScaler(), 'estimator__classifier': SVC(), 'estimator__preprocessing__copy': True, 'estimator__preprocessing__with_mean': True, 'estimator__preprocessing__with_std': True, 'estimator__classifier__C': 1.0, 'estimator__classifier__break_ties': False, 'estimator__classifier__cache_size': 200, 'estimator__classifier__class_weight': None, 'estimator__classifier__coef0': 0.0, 'estimator__classifier__decision_function_shape': 'ovr', 'estimator__classifier__degree': 3, 'estimator__classifier__gamma': 'scale', 'estimator__classifier__kernel': 'rbf', 'estimator__classifier__max_iter': -1, 'estimator__classifier__probability': False, 'estimator__classifier__random_state': None, 'estimator__classifier__shrinking': True, 'estimator__classifier__tol': 0.001, 'estimator__classifier__verbose

In [58]:
best

{'preprocessing': StandardScaler(),
 'classifier__kernel': 'linear',
 'classifier__C': 1000}
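
A dictionary like this one can be fed back into a pipeline with `set_params`. The sketch below does that on scikit-learn's bundled iris data, which is only a stand-in for the leaf data; `best_demo`, `pipe_demo`, and the split are assumptions for illustration. Since `RandomizedSearchCV` refits the best candidate by default, the fitted search object should also expose it as `best_estimator_`, so this step is optional.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data, not the leaf dataset
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Copy of the best parameters reported above
best_demo = {'preprocessing': StandardScaler(),
             'classifier__kernel': 'linear',
             'classifier__C': 1000}

pipe_demo = Pipeline([('preprocessing', StandardScaler()),
                      ('classifier', SVC(kernel='rbf'))])

# Keys use the same step and step__parameter names as the search space
pipe_demo.set_params(**best_demo)
pipe_demo.fit(X_tr, y_tr)

test_score = pipe_demo.score(X_te, y_te)
print(test_score)
```

Comparing this score with the one from the default pipeline on the same split is a simple way to see whether the search actually helped.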

In [60]:
random_clf

In [46]:
RED = "\033[1;31m"
BLUE = "\033[1;34m"
GREEN = "\033[1;32m"
PINK = "\033[1;45m"
HEADER = '\033[95m'
OKBLUE = '\033[94m'
OKCYAN = '\033[96m'
OKGREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
MAG = "\033[1;45m"
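
These constants are ANSI escape sequences: printing one changes the terminal style until a reset code such as `ENDC` is printed. A small, hypothetical helper (`colorize` is not part of the notebook) shows one way to pair each color with a reset so that later output is not affected.

```python
# Subset of the codes defined above
RED = "\033[1;31m"
BLUE = "\033[1;34m"
ENDC = "\033[0m"


def colorize(text: str, color: str) -> str:
    """Wrap text in an ANSI color code and reset the style afterwards."""
    return f"{color}{text}{ENDC}"


message = colorize("best params", BLUE)
print(message)        # rendered in bold blue on ANSI-capable terminals
print(repr(message))  # shows the raw escape sequences
```

Without the trailing reset, everything printed afterwards would keep the same color in most terminals.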