<a href="https://colab.research.google.com/github/Rogerio-mack/IMT_CD_2024/blob/main/IMT_Lab_feature_selection_gridsearch_solucao.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<head>
  <meta name="author" content="Rogério de Oliveira">
  <meta institution="author" content="ITM">
</head>

<img src="https://maua.br/images/selo-60-anos-maua.svg" width=300, align="right">


# Lab: Feature and Hyperparameter Selection





# Case: **Classifying Types of Recycled Glass**

Our dataset contains recycled glass samples separated, according to their composition, into the following categories:

* **C = Construction glass**
* **V = Vehicle glass**
* **O = Other**

Follow the tasks below to select the features and models used to predict the glass class of new cases.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Data Acquisition

In [2]:
df = pd.read_csv('https://github.com/Rogerio-mack/IMT_CD_2024/raw/refs/heads/main/data/recycle_glasses.csv')
display(df.head())
df_cases = pd.read_csv('https://github.com/Rogerio-mack/IMT_CD_2024/raw/refs/heads/main/data/recycle_glasses_cases.csv')
display(df_cases.head())

Unnamed: 0,Pb,Na,Mg,Al,Si,K,Ca,B,Fe,Type of glass
0,1.52,13.64,D,1.1,71.78,0.06,8.75,0.0,0.0,C
1,1.52,13.89,D,1.36,72.73,0.48,7.83,0.0,0.0,C
2,1.52,13.53,C,1.54,72.99,0.39,7.78,0.0,0.0,C
3,1.52,13.21,D,1.29,72.61,0.57,8.22,0.0,0.0,C
4,1.52,13.27,D,1.24,73.08,0.55,8.07,0.0,0.0,C


Unnamed: 0,Pb,Na,Mg,Al,Si,K,Ca,B,Fe
0,1.52,12.81,C,1.48,73.89,0.6,8.12,0.0,0.01
1,1.52,12.89,C,1.52,74.1,0.67,7.83,0.0,0.01
2,1.52,12.9,D,1.19,73.44,0.6,8.43,0.0,0.01
3,1.52,13.33,B,1.52,73.04,0.58,8.79,0.0,0.01
4,1.51,13.81,B,3.5,70.89,1.68,5.87,2.2,0.01


# Ex1. Training and Test Sets, Label Encoding

Split the data into training and test sets and **label encode** the categorical data. **Stratify** the split by the target variable. Use 20% of the data for testing and `random_state=42`.
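
As a minimal sketch of what stratification does, the example below splits a hypothetical imbalanced label list (made up for illustration, not the lab data) and counts the classes on each side. With `stratify=y`, the test set should keep the 70/20/10 class proportions of the full data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels, not the glass dataset: 70 C, 20 V, 10 O
y = ['C'] * 70 + ['V'] * 20 + ['O'] * 10
X = [[i] for i in range(len(y))]  # dummy single-feature rows

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Class counts in each split follow the original proportions
print('train:', Counter(y_tr))
print('test: ', Counter(y_te))
```

Without `stratify`, a rare class such as `O` could end up with very few test samples, or none, which makes per-class metrics unreliable.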

In [3]:
from sklearn.model_selection import train_test_split

X = df.drop('Type of glass', axis=1)
y = df['Type of glass']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

In [4]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit the encoder on the training set only, then reuse the same mapping on the test set
X_train['Mg'] = le.fit_transform(X_train['Mg'])
X_test['Mg'] = le.transform(X_test['Mg'])
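
The sketch below illustrates how `LabelEncoder` maps categories to integers, and why the fitted encoder has to be reused rather than refitted. The category codes `A` to `D` are an assumption made for illustration; the real `Mg` column may hold other values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category codes, assumed for illustration only
le = LabelEncoder()
le.fit(['A', 'B', 'C', 'D', 'D', 'C'])

# Categories are stored in sorted order; the code is the position in that list
classes = list(le.classes_)
codes = list(le.transform(['C', 'D', 'B']))
print(classes)
print(codes)

# A category that was never seen during fit cannot be encoded
try:
    le.transform(['E'])
except ValueError as err:
    print('unseen label:', err)
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values. For input features, `OrdinalEncoder` is the usual alternative, although the integer codes it produces are comparable.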

In [5]:
X_train.sum().sum() == 16433.35

True

# Ex2. Select the Features

To build the model, exclude the 3 features with the lowest information gain (equivalent to mutual information in scikit-learn).
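
To make the equivalence concrete, the sketch below computes information gain by hand for a discrete feature, as the target entropy minus the entropy that remains once the feature value is known. The toy values are hypothetical. scikit-learn estimates the quantity differently for continuous features (with a nearest-neighbour method), so these numbers will not match `mutual_info_classif` on the lab data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target):
    """H(target) - H(target | feature) for two discrete sequences."""
    n = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(target) - conditional


# Hypothetical toy data, not taken from the glass dataset
mg = ['D', 'D', 'C', 'C', 'B', 'B']
glass = ['C', 'C', 'C', 'V', 'O', 'O']
constant = ['A'] * 6  # a feature that never changes carries no information

gain_mg = information_gain(mg, glass)
gain_constant = information_gain(constant, glass)
print(round(gain_mg, 3), round(gain_constant, 3))
```

A feature with zero gain, like the constant one here, tells the model nothing about the class, which is why the lowest-ranked features are the ones to drop.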

In [6]:
from sklearn.feature_selection import mutual_info_classif

# Calculate mutual information
mutual_info = mutual_info_classif(X_train, y_train)

# Create a DataFrame to store feature names and their mutual information scores
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Mutual Information': mutual_info})

# Sort the DataFrame by mutual information in descending order
feature_importance_df = feature_importance_df.sort_values('Mutual Information', ascending=False)

# Display the feature importance
print(feature_importance_df)

  Feature  Mutual Information
3      Al            0.208455
7       B            0.179701
5       K            0.176104
2      Mg            0.166526
6      Ca            0.128326
1      Na            0.115719
0      Pb            0.048593
4      Si            0.030558
8      Fe            0.000000


In [7]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

select_features = SelectKBest(mutual_info_classif, k=6).fit(X_train, y_train)
list(X_train.columns[select_features.get_support()])

['Na', 'Mg', 'Al', 'K', 'Ca', 'B']

In [8]:
feature_importance_df['Feature'].values[0:-3]

array(['Al', 'B', 'K', 'Mg', 'Ca', 'Na'], dtype=object)

In [9]:
# Keep every feature except the three with the lowest mutual information
selected_features = feature_importance_df['Feature'].values[0:-3]
X_train = X_train[selected_features]
X_test = X_test[selected_features]
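
For continuous features, the scores from `mutual_info_classif` can vary slightly between runs unless a seed is set, so the ranking order may change from one execution to the next. The sketch below, on synthetic stand-in data (an assumption, not the glass dataset), shows one way to fix the seed both in a direct call and inside `SelectKBest`.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data, not the glass dataset
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# The same seed gives the same scores on every call
mi_a = mutual_info_classif(X, y, random_state=0)
mi_b = mutual_info_classif(X, y, random_state=0)
print(np.allclose(mi_a, mi_b))

# SelectKBest takes any callable score function, so the seed can be bound with partial
seeded_mi = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(seeded_mi, k=3).fit(X, y)
print(selector.get_support(indices=True))
```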

# Ex3. RandomForest

Fit a `RandomForestClassifier(random_state=42)` model (`random_state` controls the randomness of the bootstrap sampling used to build the trees and of the feature sampling used when searching for the best split at each node). Use `GridSearchCV` to choose the best random forest among the following **hyperparameter** settings:

* Number of estimators (trees): 50, 100 and 200
* Tree depth: 5 to 10
* Node split criterion: `gini` index and `entropy`

Use **`f1_macro`** as the metric for selecting the best model, with 5 cross-validation folds. Finally, print the classification report of the best model.
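
Before running the search, it helps to know how many models it will train. The sketch below counts the combinations in this grid using only the standard library. The variable names are illustrative.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': range(5, 11),         # depths 5, 6, 7, 8, 9, 10
    'criterion': ['gini', 'entropy'],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 36 180
```

With the default `refit=True`, `GridSearchCV` trains one more model at the end, fitting the best combination on the whole training set.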

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, f1_score

base_estimator = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': range(5,11),
    'criterion': ['gini','entropy']
}

clf = GridSearchCV(base_estimator, param_grid, cv=5, scoring='f1_macro')

clf.fit(X_train, y_train)

print(clf.best_estimator_)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
              precision    recall  f1-score   support

           C       0.93      0.86      0.89        29
           O       1.00      1.00      1.00         3
           V       0.64      0.78      0.70         9

    accuracy                           0.85        41
   macro avg       0.85      0.88      0.86        41
weighted avg       0.87      0.85      0.86        41
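
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. The sketch below does this for the F1 column, using the rounded values printed in the report above, so the results agree only to two decimals.

```python
# Per-class F1 and support, copied from the report above (rounded values)
f1 = {'C': 0.89, 'O': 1.00, 'V': 0.70}
support = {'C': 29, 'O': 3, 'V': 9}

# Macro average: every class counts the same, whatever its size
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.86 0.86
```

Because the macro average gives the 3-sample class `O` the same weight as the 29-sample class `C`, `f1_macro` is a reasonable selection metric when minority classes matter.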



# Ex4. Predict

Predict the new cases using the best model obtained.

In [None]:
# Encode 'Mg' with the encoder fitted on the training data
df_cases['Mg'] = le.transform(df_cases['Mg'])
# Keep the same features, in the same order, as the training set
df_cases = df_cases[X_train.columns]
df_cases.head()

Unnamed: 0,Al,K,B,Ca,Mg,Na
0,1.48,0.6,0.0,8.12,2,12.81
1,1.52,0.67,0.0,7.83,2,12.89
2,1.19,0.6,0.0,8.43,3,12.9
3,1.52,0.58,0.0,8.79,1,13.33
4,3.5,1.68,2.2,5.87,1,13.81


In [None]:
clf.predict(df_cases)

array(['C', 'C', 'C', 'C', 'O', 'C', 'V'], dtype=object)

# Ex5. Multi Layer Perceptron

Adapt the `GridSearchCV` code you wrote for the `RandomForest` models to select the following hyperparameters of an MLP network:

```
    'hidden_layer_sizes': [(100,), (50, 50), (10, 10, 10)],
    'activation': ['relu', 'tanh']
```

where `hidden_layer_sizes` gives the layers and units of the neural network and `activation` the activation functions. Use `MLPClassifier(max_iter=500,random_state=42)` as the base model. What is the accuracy of the best MLP network obtained?

In [None]:
from sklearn.neural_network import MLPClassifier

base_estimator = MLPClassifier(max_iter=500,random_state=42)

param_grid = {
    'hidden_layer_sizes': [(100,), (50, 50), (10, 10, 10)],
    'activation': ['relu', 'tanh']
}

clf = GridSearchCV(base_estimator, param_grid, cv=5, scoring='f1_macro')

clf.fit(X_train, y_train)

print(clf.best_estimator_)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))



MLPClassifier(activation='tanh', hidden_layer_sizes=(10, 10, 10), max_iter=500,
              random_state=42)
              precision    recall  f1-score   support

           C       0.88      0.97      0.92        29
           O       0.67      0.67      0.67         3
           V       1.00      0.67      0.80         9

    accuracy                           0.88        41
   macro avg       0.85      0.77      0.79        41
weighted avg       0.89      0.88      0.87        41
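
The accuracy asked for in this exercise is the `accuracy` row of the report above (0.88). Note that `clf.score` would not return it: when the search is built with `scoring='f1_macro'`, its `score` method reports macro F1. The sketch below illustrates the difference on synthetic stand-in data with a reduced grid. Both the data and the grid are assumptions chosen to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class stand-in data, not the glass dataset
X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Reduced grid and fewer folds than in the exercise, to keep the sketch fast
param_grid = {'hidden_layer_sizes': [(10,), (10, 10)],
              'activation': ['relu', 'tanh']}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=42),
                      param_grid, cv=3, scoring='f1_macro')
search.fit(X_tr, y_tr)

y_hat = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_hat)
test_macro_f1 = f1_score(y_te, y_hat, average='macro')

print('accuracy:', test_accuracy)
print('macro F1:', test_macro_f1)
print('search.score:', search.score(X_te, y_te))  # follows `scoring`, not accuracy
```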



