# Exercise 1: Mushroom Classification

## Objective
Develop a classification model that predicts whether a mushroom is edible or poisonous based on its characteristics.

## Scenario
This exercise simulates a scenario in which students find mushrooms and need a machine learning model to predict whether they are edible or poisonous. To do so, they will train a classification model on the `mushrooms.csv` dataset.

## Steps to Follow

### 1. Load and Prepare the Data

- Split the dataset into training (75%) and test (25%) sets.

### 2. Train the Main Model

- Train a **Distributed Random Forest** (DRF) model with the default settings.
- DRF is H2O's distributed random forest implementation: it builds an ensemble of decision trees and combines their predictions, which works well for classification tasks.

### 3. Train Additional Models with Different Settings

- Train two more models, changing some settings such as the number of trees or the maximum depth.

### 4. Compare the Models



In [35]:
# Import libraries
import h2o

In [36]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


| Property | Value |
|---|---|
| H2O_cluster_uptime | 36 mins 17 secs |
| H2O_cluster_timezone | Europe/Lisbon |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.46.0.6 |
| H2O_cluster_version_age | 10 days |
| H2O_cluster_name | H2O_from_python_avlal_rc0d8l |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 7.315 Gb |
| H2O_cluster_total_cores | 16 |
| H2O_cluster_allowed_cores | 16 |


In [37]:
data = h2o.import_file("..\\data\\mushrooms.csv")

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
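
The reports further down in this notebook are multinomial and list three target levels (`class`, `e`, `p`), and the training and test confusion matrices total 6,106 + 2,019 = 8,125 rows. Both details suggest that the CSV's header line was parsed as a data row. The sketch below shows one possible way to avoid and detect that, assuming the first line of `mushrooms.csv` really is a header; `load_mushrooms` and `leaked_header_levels` are hypothetical helper names, not part of H2O.

```python
def load_mushrooms(path):
    """Import the CSV, telling H2O that the first line is a header row."""
    import h2o  # imported lazily; assumes h2o.init() has already been called

    # header=1 means "first line is a header"; 0 lets H2O guess, -1 means no header.
    return h2o.import_file(path, header=1)


def leaked_header_levels(levels, column_names):
    """Return target levels that are really column names (a sign the header was parsed as data)."""
    names = set(column_names)
    return [level for level in levels if level in names]


# Levels as they appear in this notebook's confusion matrices.
print(leaked_header_levels(["class", "e", "p"], ["class"]))
```

In a live session the levels could be read with `data[target].levels()[0]` after `asfactor()`. With the header handled, only `e` and `p` would be expected, and H2O would then likely treat the problem as binomial and report AUC.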


In [38]:
column_names = ["class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor", 
                "gill-attachment", "gill-spacing", "gill-size", "gill-color", 
                "stalk-shape", "stalk-root", "stalk-surface-above-ring", 
                "stalk-surface-below-ring", "stalk-color-above-ring", 
                "stalk-color-below-ring", "veil-type", "veil-color", 
                "ring-number", "ring-type", "spore-print-color", 
                "population", "habitat"]
data.columns = column_names


In [39]:
print(data.columns)

['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']


In [40]:
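# Prepare the data: pick the target and use every other column as a predictor.

target = "class"
predictors = [col for col in data.columns if col != target]
data[target] = data[target].asfactor()  # treat the target as categorical (classification)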

In [41]:
# Split the data into training (75%) and test (25%) sets

train, test = data.split_frame([0.75], seed=42)
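
`split_frame` uses probabilistic splitting, so the 75/25 ratio is approximate rather than exact. As a quick check, the sketch below computes the realized fractions from the row totals shown in this notebook's confusion matrices (6,106 training rows and 2,019 test rows). `split_fractions` is a hypothetical helper; in a live session the counts could come from `train.nrow` and `test.nrow`.

```python
def split_fractions(train_rows, test_rows):
    """Return the share of rows that ended up in each split."""
    total = train_rows + test_rows
    return train_rows / total, test_rows / total


# Row totals taken from the confusion matrices printed in this notebook.
train_frac, test_frac = split_fractions(6106, 2019)
print(f"train: {train_frac:.4f}, test: {test_frac:.4f}")
```

The realized split is close to, but not exactly, 75/25, which is the expected behavior for this method.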


In [42]:
# Baseline model (Distributed Random Forest) with default settings:

from h2o.estimators import H2ORandomForestEstimator
model_default = H2ORandomForestEstimator(seed=42)
model_default.train(x=predictors, y=target, training_frame=train)


drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 50 | 150 | 32488 | 1 | 14 | 6.0266666 | 2 | 39 | 12.593333 |

| Actual | class | e | p | Error | Rate |
|---|---|---|---|---|---|
| class | 1 | 0 | 0 | 0.0 | 0 / 1 |
| e | 0 | 3157 | 0 | 0.0 | 0 / 3,157 |
| p | 0 | 0 | 2948 | 0.0 | 0 / 2,948 |
| Total | 1 | 3157 | 2948 | 0.0 | 0 / 6,106 |

| k | hit_ratio |
|---|---|
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |

| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error | training_auc | training_pr_auc |
|---|---|---|---|---|---|---|---|
| 2024-11-12 18:07:43 | 0.000 sec | 0 | | | | | |
| 2024-11-12 18:07:43 | 0.007 sec | 1 | 0.0058713 | 0.0006728 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.010 sec | 2 | 0.0123308 | 0.0007140 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.012 sec | 3 | 0.0118487 | 0.0006177 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.015 sec | 4 | 0.0125168 | 0.0006545 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.018 sec | 5 | 0.0116248 | 0.0005597 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.020 sec | 6 | 0.0105149 | 0.0005659 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.023 sec | 7 | 0.0104480 | 0.0006184 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.025 sec | 8 | 0.0101161 | 0.0005690 | 0.0 | | |
| 2024-11-12 18:07:43 | 0.027 sec | 9 | 0.0097874 | 0.0005354 | 0.0 | | |

| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| odor | 46118.3828125 | 1.0 | 0.3789252 |
| spore-print-color | 16677.9335938 | 0.3616331 | 0.1370319 |
| stalk-surface-above-ring | 13085.0800781 | 0.2837281 | 0.1075117 |
| gill-size | 9406.75 | 0.2039696 | 0.0772892 |
| gill-color | 8195.0556641 | 0.1776961 | 0.0673335 |
| stalk-surface-below-ring | 5350.5556641 | 0.1160179 | 0.0439621 |
| ring-type | 3691.0795898 | 0.0800349 | 0.0303272 |
| population | 3173.3574219 | 0.0688089 | 0.0260734 |
| stalk-root | 2401.6491699 | 0.0520757 | 0.0197328 |
| habitat | 2000.2856445 | 0.0433728 | 0.0164351 |
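
To read this table, it may help to see how the columns relate: `scaled_importance` is each variable's `relative_importance` divided by the largest one, and `percentage` is its share of the total over all predictors. The sketch below recomputes the scaled values for the first four rows, with the numbers copied from the table above.

```python
# relative_importance values copied from the first four rows of the table above.
relative_importance = {
    "odor": 46118.3828125,
    "spore-print-color": 16677.9335938,
    "stalk-surface-above-ring": 13085.0800781,
    "gill-size": 9406.75,
}

# Scaled importance: divide every value by the largest relative importance.
largest = max(relative_importance.values())
scaled_importance = {
    name: value / largest for name, value in relative_importance.items()
}

for name, value in scaled_importance.items():
    print(f"{name}: {value:.4f}")
```

The `percentage` column cannot be recomputed from the rows shown alone, because its total runs over all 22 predictors and only the top ten appear here. Those ten account for roughly 90% of the total importance.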


In [43]:
# Evaluate the model on the test set

performance_default = model_default.model_performance(test)
print(performance_default)


ModelMetricsMultinomial: drf
** Reported on test data. **

MSE: 5.769846689798338e-05
RMSE: 0.007595950690860452
LogLoss: 0.0019025525416544286
Mean Per-Class Error: 0.0
AUC table was not computed: it is either disabled (model parameter 'auc_type' was set to AUTO or NONE) or the domain size exceeds the limit (maximum is 50 domains).
AUCPR table was not computed: it is either disabled (model parameter 'auc_type' was set to AUTO or NONE) or the domain size exceeds the limit (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
class    e     p    Error    Rate
-------  ----  ---  -------  ---------
0        0     0    nan      0 / 0
0        1051  0    0        0 / 1 051
0        0     968  0        0 / 968
0        1051  968  0        0 / 2 019

Top-3 Hit Ratios: 
k    hit_ratio
---  -----------
1    1
2    1
3    1


In [44]:
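# Train models with different settings.
# max_depth=20 is already the DRF default, so model_alt1 differs from the baseline only in ntrees.
# Only the last expression in a cell is displayed, so the summary below is model_alt2's.

model_alt1 = H2ORandomForestEstimator(ntrees=100, max_depth=20, seed=42)
model_alt1.train(x=predictors, y=target, training_frame=train)

model_alt2 = H2ORandomForestEstimator(ntrees=150, max_depth=30, seed=42)
model_alt2.train(x=predictors, y=target, training_frame=train)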


drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 150 | 450 | 97874 | 1 | 16 | 6.091111 | 2 | 43 | 12.628889 |

| Actual | class | e | p | Error | Rate |
|---|---|---|---|---|---|
| class | 1 | 0 | 0 | 0.0 | 0 / 1 |
| e | 0 | 3157 | 0 | 0.0 | 0 / 3,157 |
| p | 0 | 0 | 2948 | 0.0 | 0 / 2,948 |
| Total | 1 | 3157 | 2948 | 0.0 | 0 / 6,106 |

| k | hit_ratio |
|---|---|
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |

| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error | training_auc | training_pr_auc |
|---|---|---|---|---|---|---|---|
| 2024-11-12 18:07:57 | 0.001 sec | 0 | | | | | |
| 2024-11-12 18:07:57 | 0.006 sec | 1 | 0.0058713 | 0.0006728 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.008 sec | 2 | 0.0123308 | 0.0007140 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.011 sec | 3 | 0.0118487 | 0.0006177 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.014 sec | 4 | 0.0125168 | 0.0006545 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.017 sec | 5 | 0.0116248 | 0.0005597 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.020 sec | 6 | 0.0105149 | 0.0005659 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.023 sec | 7 | 0.0104480 | 0.0006184 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.024 sec | 8 | 0.0101161 | 0.0005690 | 0.0 | | |
| 2024-11-12 18:07:57 | 0.026 sec | 9 | 0.0097874 | 0.0005354 | 0.0 | | |

| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| odor | 143381.4062500 | 1.0 | 0.3930147 |
| spore-print-color | 43673.9492188 | 0.3045998 | 0.1197122 |
| gill-color | 28026.2480469 | 0.1954664 | 0.0768212 |
| stalk-surface-above-ring | 22073.0312500 | 0.1539463 | 0.0605031 |
| gill-size | 21653.3671875 | 0.1510194 | 0.0593528 |
| ring-type | 20196.3125000 | 0.1408573 | 0.0553590 |
| stalk-surface-below-ring | 17620.0878906 | 0.1228896 | 0.0482974 |
| population | 12314.0927734 | 0.0858835 | 0.0337535 |
| stalk-root | 8895.5449219 | 0.0620411 | 0.0243831 |
| habitat | 8409.0820312 | 0.0586483 | 0.0230497 |
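
The baseline and `model_alt1` both classify the test set without error in the reports printed in this notebook, so comparing them comes down to small differences in the probabilistic metrics. The sketch below ranks models on one metric at a time, using test-set values copied from those reports. `rank_models` is a hypothetical helper, and `model_alt2` is left out because its LogLoss is not fully visible in this excerpt.

```python
def rank_models(metrics, key):
    """Sort model names by one metric, smallest (best) value first."""
    return sorted(metrics, key=lambda name: metrics[name][key])


# Test-set values copied from the performance reports printed in this notebook.
test_metrics = {
    "model_default": {"mse": 5.769846689798338e-05, "logloss": 0.0019025525416544286},
    "model_alt1": {"mse": 5.0134810344409235e-05, "logloss": 0.002240948402333282},
}

for key in ("mse", "logloss"):
    print(key, rank_models(test_metrics, key))
```

On these numbers the two metrics disagree: `model_alt1` has the lower MSE while the baseline has the lower LogLoss, and both gaps are tiny. In a live session the same dictionary could be filled from `performance.mse()` and `performance.logloss()` on each performance object.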


In [45]:
# Compare the models on the test set:

performance_alt1 = model_alt1.model_performance(test)
performance_alt2 = model_alt2.model_performance(test)
print(performance_alt1)
print(performance_alt2)


ModelMetricsMultinomial: drf
** Reported on test data. **

MSE: 5.0134810344409235e-05
RMSE: 0.007080593925964773
LogLoss: 0.002240948402333282
Mean Per-Class Error: 0.0
AUC table was not computed: it is either disabled (model parameter 'auc_type' was set to AUTO or NONE) or the domain size exceeds the limit (maximum is 50 domains).
AUCPR table was not computed: it is either disabled (model parameter 'auc_type' was set to AUTO or NONE) or the domain size exceeds the limit (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
class    e     p    Error    Rate
-------  ----  ---  -------  ---------
0        0     0    nan      0 / 0
0        1051  0    0        0 / 1 051
0        0     968  0        0 / 968
0        1051  968  0        0 / 2 019

Top-3 Hit Ratios: 
k    hit_ratio
---  -----------
1    1
2    1
3    1
ModelMetricsMultinomial: drf
** Reported on test data. **

MSE: 4.045510600038969e-05
RMSE: 0.006360432846936574
LogLoss: 0.00