<a href="https://colab.research.google.com/github/Mads8760/Ciencia-de-dados/blob/main/Ciehncia_de_dados_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Great, you have been tasked with building a classification model to predict whether patients have a medical condition, based on the physical characteristics recorded in their exams. You have a dataset with 1000 entries and several variables. We will use Random Forest for this task. To make sure the model generalizes well to new data, follow these steps:

Data preprocessing: Clean and prepare the data, handling missing values and normalizing the variables if necessary.

Dataset split: Use train_test_split to separate the data into training and test sets. This is essential for evaluating the model's performance on unseen data. A common ratio is 80% for training and 20% for testing.

Model training: Configure and train the Random Forest model. Two important parameters are:

n_estimators: Sets the number of trees in the forest. More trees can improve accuracy, but they also increase computation time.

max_depth: Controls the maximum depth of the trees. Setting the depth too high can cause overfitting, where the model fits the training data too closely.

Model evaluation: Use the test set to evaluate the model's performance. Metrics such as precision, recall, and ROC-AUC are useful for understanding how effective the model is.

Hyperparameter tuning: Try different values for n_estimators and max_depth, and adjust other hyperparameters to optimize the model's performance.

Cross-validation: Use cross-validation to check that the model generalizes well. This involves splitting the data into several subsets and training the model several times, each time validating on a different subset.
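
The steps above could be sketched end to end roughly as follows. This is a minimal illustration on synthetic data, not the notebook's dataset: the binary target, the 10% missing rate, the median imputation, and the hyperparameter values (`n_estimators=100`, `max_depth=5`) are all assumptions chosen only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the exam data (assumption: binary "has condition" target)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=42)

# Simulate missing values in 10% of the rows of the first column
rng = np.random.default_rng(42)
X[rng.choice(len(X), size=100, replace=False), 0] = np.nan

# Hold out 20% for testing; stratify keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imputation lives inside the pipeline, so it is fitted on training data only
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]
print(f"precision: {precision_score(y_test, y_pred):.3f}")
print(f"recall:    {recall_score(y_test, y_pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_score):.3f}")

# 5-fold cross-validation on the training portion
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Keeping the imputer inside the pipeline means each cross-validation fold re-estimates the medians from its own training rows, which avoids leaking information from the validation rows.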

In [None]:
# Library imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.vq import whiten
from sklearn.datasets import fetch_openml, make_classification, make_blobs
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, FunctionTransformer)

In [None]:
# Dataset creation
# Note: as written, this generates 100 samples with 1000 features each
n_features = 1000
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3, n_informative=3, n_redundant=0, n_repeated=0, random_state=42)
df_exames = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(1, n_features + 1)])
df_exames['target'] = y

# Randomly introduce missing values (NaN) into 10% of the rows of selected feature columns
# (np.random is not seeded here, so the NaN positions change on every run)
n_missing = int(df_exames.shape[0] * 0.1)
for column in ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_10', 'feature_85', 'feature_100']:
    nan_indices = np.random.choice(df_exames.index, size=n_missing, replace=False)
    df_exames.loc[nan_indices, column] = np.nan

# Introduce NaN into 30% of the target values
nan_indices = np.random.choice(df_exames.index, size=int(df_exames.shape[0] * 0.3), replace=False)
df_exames.loc[nan_indices, 'target'] = np.nan

# Display the first rows of the generated DataFrame
df_exames.head()

index,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,feature_1000,target
0,-2.496798,-1.77411,-0.969557,0.752133,0.447822,-0.215069,-0.209955,0.166305,-0.9625,-0.843326,...,2.022175,1.010016,-0.817225,0.821408,-0.935867,-0.928132,0.513436,2.764449,0.46912,1.0
1,-1.229601,0.042362,-0.296112,,-0.4536,1.161694,-0.051008,-0.438628,-1.338061,-0.103512,...,1.741935,0.040834,-1.27528,0.959489,-0.193836,0.527222,-0.268952,0.153726,0.607632,0.0
2,0.966275,0.438332,-1.056666,0.101967,-0.089346,-2.488921,-0.683776,0.387034,-1.819535,0.644711,...,1.837564,0.939121,1.13802,-0.369675,0.68564,1.835811,-1.947247,-0.154965,0.265085,
3,0.429493,0.502348,0.638791,-0.514374,0.316587,0.36574,-1.247917,-1.796805,0.32282,-0.540515,...,-0.678134,0.308158,0.573973,0.575992,-0.610552,1.646878,-1.30721,1.756376,-1.163475,
4,0.185452,-0.631287,-1.34902,1.70918,0.154964,-1.30822,-0.762276,1.477811,0.442106,-1.246467,...,-1.40319,0.281284,0.943614,1.87928,-0.687072,-0.616835,-0.381389,0.516212,0.665049,2.0


In [None]:
# Dataset preprocessing: remove every row that contains a NaN
df_exames_dropna = df_exames.dropna()
df_exames_dropna.head()

index,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,feature_1000,target
0,-2.496798,-1.77411,-0.969557,0.752133,0.447822,-0.215069,-0.209955,0.166305,-0.9625,-0.843326,...,2.022175,1.010016,-0.817225,0.821408,-0.935867,-0.928132,0.513436,2.764449,0.46912,1.0
10,-0.284203,-0.545674,0.33842,-0.741677,-0.39668,-0.531686,-1.581019,0.804897,-1.4608,2.29919,...,-1.389481,-0.813833,0.728974,-1.037775,-0.106493,-1.295106,-1.06909,0.215552,-0.656305,0.0
13,0.372794,0.69945,-0.569361,-0.586923,-0.223939,-0.56963,0.759442,-0.267192,1.458885,0.742289,...,0.826946,-0.489467,-0.866973,-0.315278,1.635048,0.646045,-0.287082,0.023385,0.05563,0.0
14,-1.029845,-0.047204,0.5421,0.644721,2.418653,0.755762,0.312067,-1.085607,0.223681,1.416312,...,0.337466,0.829738,-0.811022,0.021458,0.359568,0.351285,-1.140769,0.16732,-1.179052,2.0
23,0.786045,-0.724998,0.547676,-0.422009,0.438511,-0.394073,1.392027,0.565048,1.336343,1.238833,...,-0.356574,-1.377081,-0.286106,-1.131583,2.117775,-0.269291,-0.976631,-0.405482,-0.705097,0.0
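
Calling `dropna()` with no arguments drops every row that has a NaN in any column, target included; the `describe()` output later in the notebook shows a count of 29, so most of the 100 rows are lost. One alternative, sketched here under the assumption that median imputation is acceptable for these features, is to drop only the rows whose target is missing and impute the feature columns. The toy frame below is hypothetical and stands in for `df_exames`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with the same kind of gaps as df_exames
df = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 4.0, 5.0],
    "feature_2": [10.0, 20.0, np.nan, 40.0, 50.0],
    "target": [0.0, 1.0, 1.0, np.nan, 2.0],
})

# A missing label cannot be imputed meaningfully, so drop only those rows
df_labeled = df.dropna(subset=["target"]).copy()

# Fill the remaining feature gaps with each column's median
feature_cols = [c for c in df_labeled.columns if c != "target"]
imputer = SimpleImputer(strategy="median")
df_labeled[feature_cols] = imputer.fit_transform(df_labeled[feature_cols])

print(f"rows kept: {len(df_labeled)} of {len(df)}")
print(df_labeled)
```

In a full workflow the imputer would normally be fitted on the training split only and then applied to the test split.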


In [None]:
# Check for null values
missing_values = df_exames_dropna.isnull().sum()
print("Null values per column:\n", missing_values)


Null values per column:
 feature_1       0
feature_2       0
feature_3       0
feature_4       0
feature_5       0
               ..
feature_997     0
feature_998     0
feature_999     0
feature_1000    0
target          0
Length: 1001, dtype: int64


In [None]:
df_exames_dropna.describe()

statistic,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,feature_1000,target
count,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,...,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0
mean,-0.170047,0.064423,0.047044,0.023554,0.019601,-0.079834,0.414756,0.122049,-0.016379,0.020955,...,-0.036752,-0.066322,0.081172,-0.108743,-0.166866,-0.15329,-0.342636,0.20865,0.0795,0.862069
std,1.057241,0.917344,1.049529,1.063437,1.141865,0.940737,1.110263,1.013396,0.989819,1.025838,...,0.910821,0.787165,0.96235,1.034467,0.96896,0.73364,0.955151,1.119711,1.035426,0.833415
min,-2.496798,-1.77411,-2.43728,-1.974216,-2.477904,-1.991574,-2.284768,-2.009193,-1.709226,-2.109161,...,-1.389481,-1.49395,-1.79693,-1.883351,-1.667394,-1.621882,-2.403811,-2.546118,-1.87324,0.0
25%,-0.580461,-0.651035,-0.569361,-0.686492,-0.662922,-0.56963,0.021015,-0.523303,-0.846434,-0.513127,...,-0.83609,-0.496637,-0.811022,-0.883464,-0.935867,-0.622049,-1.063025,-0.411289,-0.656305,0.0
50%,0.01623,0.32511,0.233404,0.394395,-0.034517,-0.214809,0.588206,0.226868,-0.028752,0.087964,...,-0.135333,-0.217657,0.152907,-0.077769,-0.200571,-0.069852,-0.264022,0.121461,0.201241,1.0
75%,0.531955,0.728722,0.547676,0.644721,0.484597,0.447577,1.048204,0.676787,0.564745,0.742289,...,0.425426,0.434264,0.728974,0.545141,0.2936,0.242118,0.075695,0.76752,0.727799,2.0
max,1.464206,1.517642,2.033767,1.934302,2.418653,1.736624,2.521524,2.75544,1.846397,2.29919,...,2.022175,1.733906,2.198003,2.306473,2.117775,1.108591,1.895551,2.764449,2.269044,2.0


In [None]:
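# Create the scaler
scaler_minmax = MinMaxScaler()

# Apply the transformation (note: this also rescales the target column)
df_selected_normalized = pd.DataFrame(scaler_minmax.fit_transform(df_exames_dropna), columns=df_exames_dropna.columns)
# Note: this displays the original frame; the scaled values are in df_selected_normalized
df_exames_dropna.head()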

index,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,feature_1000,target
0,-2.496798,-1.77411,-0.969557,0.752133,0.447822,-0.215069,-0.209955,0.166305,-0.9625,-0.843326,...,2.022175,1.010016,-0.817225,0.821408,-0.935867,-0.928132,0.513436,2.764449,0.46912,1.0
10,-0.284203,-0.545674,0.33842,-0.741677,-0.39668,-0.531686,-1.581019,0.804897,-1.4608,2.29919,...,-1.389481,-0.813833,0.728974,-1.037775,-0.106493,-1.295106,-1.06909,0.215552,-0.656305,0.0
13,0.372794,0.69945,-0.569361,-0.586923,-0.223939,-0.56963,0.759442,-0.267192,1.458885,0.742289,...,0.826946,-0.489467,-0.866973,-0.315278,1.635048,0.646045,-0.287082,0.023385,0.05563,0.0
14,-1.029845,-0.047204,0.5421,0.644721,2.418653,0.755762,0.312067,-1.085607,0.223681,1.416312,...,0.337466,0.829738,-0.811022,0.021458,0.359568,0.351285,-1.140769,0.16732,-1.179052,2.0
23,0.786045,-0.724998,0.547676,-0.422009,0.438511,-0.394073,1.392027,0.565048,1.336343,1.238833,...,-0.356574,-1.377081,-0.286106,-1.131583,2.117775,-0.269291,-0.976631,-0.405482,-0.705097,0.0
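
The scaling cell above passes the whole frame, target column included, to `MinMaxScaler`, and builds the new frame without the original row index. A possible variant, sketched on a hypothetical toy frame, scales only the feature columns and keeps both the index and the class labels intact.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame; the index mimics the gaps left by dropna()
df = pd.DataFrame(
    {"feature_1": [-2.0, 0.0, 2.0], "feature_2": [1.0, 3.0, 2.0], "target": [1.0, 0.0, 2.0]},
    index=[0, 10, 13],
)

feature_cols = df.columns.drop("target")
scaler = MinMaxScaler()

# Scale the features only, and reuse the original index so rows stay aligned
df_scaled = pd.DataFrame(
    scaler.fit_transform(df[feature_cols]), columns=feature_cols, index=df.index
)
df_scaled["target"] = df["target"]  # class labels are left as they are

print(df_scaled)
```

Tree-based models such as Random Forest split on thresholds, so they are generally insensitive to this kind of per-feature rescaling; the step matters more when the same data also feeds distance-based or gradient-based models.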


In [None]:
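# Note: this again describes the unscaled frame; use df_selected_normalized.describe() to inspect the scaled values
df_exames_dropna.describe()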

statistic,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999,feature_1000,target
count,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,...,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0
mean,-0.170047,0.064423,0.047044,0.023554,0.019601,-0.079834,0.414756,0.122049,-0.016379,0.020955,...,-0.036752,-0.066322,0.081172,-0.108743,-0.166866,-0.15329,-0.342636,0.20865,0.0795,0.862069
std,1.057241,0.917344,1.049529,1.063437,1.141865,0.940737,1.110263,1.013396,0.989819,1.025838,...,0.910821,0.787165,0.96235,1.034467,0.96896,0.73364,0.955151,1.119711,1.035426,0.833415
min,-2.496798,-1.77411,-2.43728,-1.974216,-2.477904,-1.991574,-2.284768,-2.009193,-1.709226,-2.109161,...,-1.389481,-1.49395,-1.79693,-1.883351,-1.667394,-1.621882,-2.403811,-2.546118,-1.87324,0.0
25%,-0.580461,-0.651035,-0.569361,-0.686492,-0.662922,-0.56963,0.021015,-0.523303,-0.846434,-0.513127,...,-0.83609,-0.496637,-0.811022,-0.883464,-0.935867,-0.622049,-1.063025,-0.411289,-0.656305,0.0
50%,0.01623,0.32511,0.233404,0.394395,-0.034517,-0.214809,0.588206,0.226868,-0.028752,0.087964,...,-0.135333,-0.217657,0.152907,-0.077769,-0.200571,-0.069852,-0.264022,0.121461,0.201241,1.0
75%,0.531955,0.728722,0.547676,0.644721,0.484597,0.447577,1.048204,0.676787,0.564745,0.742289,...,0.425426,0.434264,0.728974,0.545141,0.2936,0.242118,0.075695,0.76752,0.727799,2.0
max,1.464206,1.517642,2.033767,1.934302,2.418653,1.736624,2.521524,2.75544,1.846397,2.29919,...,2.022175,1.733906,2.198003,2.306473,2.117775,1.108591,1.895551,2.764449,2.269044,2.0


In [None]:
# Separate the features from the target
selected_features = df_exames_dropna.columns[:-1]
X = df_exames_dropna[selected_features]
y = df_exames_dropna["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Min-Max normalized data: fit the scaler on the training split only, so that
# no information from the test rows leaks into the scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train)
X_test_minmax = scaler_minmax.transform(X_test)
y_train_minmax, y_test_minmax = y_train, y_test
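
With only 29 rows and three classes, a plain random split can leave a class under-represented in the 6-row test set. Passing `stratify=y` to `train_test_split` is one way to keep the class proportions equal in both splits. The labels below are hypothetical (30 samples, 10 per class) and serve only to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 samples, 10 per class
y = np.repeat([0, 1, 2], 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```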

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Random Forest model
# Note: the target holds class labels (0, 1, 2), so RandomForestClassifier would
# match the classification task; the regressor treats the labels as numbers
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature importances from the Random Forest
rf_importance = rf.feature_importances_

# Build a DataFrame with the results
rf_feature_importance = pd.DataFrame({
    "Feature": X.columns,
    "Random Forest Importance": rf_importance
}).sort_values(by="Random Forest Importance", ascending=False)

# Display the table
display(rf_feature_importance)

index,Feature,Random Forest Importance
810,feature_811,0.195349
688,feature_689,0.101246
344,feature_345,0.061565
571,feature_572,0.039834
606,feature_607,0.037330
...,...,...
361,feature_362,0.000000
362,feature_363,0.000000
363,feature_364,0.000000
364,feature_365,0.000000
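
The cell above fits a `RandomForestRegressor`, although the target holds class labels. `RandomForestClassifier` exposes the same `feature_importances_` attribute, so the ranking could be produced in a similar way for the classification setting. The sketch below uses synthetic data, and the choices `shuffle=False` (which places the three informative features in the first three columns) and `n_clusters_per_class=1` are assumptions made only to keep the example easy to read.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; with shuffle=False the 3 informative features come first
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=42,
)
columns = [f"feature_{i}" for i in range(1, X.shape[1] + 1)]

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importance = (
    pd.Series(clf.feature_importances_, index=columns)
    .sort_values(ascending=False)
)
print(importance.head(5))
```

Impurity-based importances can be unstable when there are far more features than rows, as in this notebook, so the ranking is best read as a rough indication.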


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score


# Define the hyperparameters to be tested
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}

# Create the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Create the GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all available cores for parallel processing
    verbose=2  # Show the GridSearch progress
)

# Run GridSearchCV on the Min-Max normalized training set (can be swapped for the other sets)
grid_search.fit(X_train_minmax, y_train_minmax)

# Best set of hyperparameters found
print("\nBest hyperparameters:", grid_search.best_params_)

# Best Random Forest model
best_rf = grid_search.best_estimator_

# Predict on the test set
y_pred_best_rf = best_rf.predict(X_test_minmax)

# Evaluate the tuned model
rmse_best_rf = np.sqrt(mean_squared_error(y_test_minmax, y_pred_best_rf))
r2_best_rf = r2_score(y_test_minmax, y_pred_best_rf)

print("Best estimator: ",grid_search.best_estimator_)
print(f"\nBest Random Forest - RMSE: {rmse_best_rf:.2f} | R²: {r2_best_rf:.2f}")

Fitting 5 folds for each of 216 candidates, totalling 1080 fits

Best hyperparameters: {'bootstrap': False, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
Best estimator:  RandomForestRegressor(bootstrap=False, min_samples_leaf=4, min_samples_split=10,
                      n_estimators=50, random_state=42)

Best Random Forest - RMSE: 0.62 | R²: 0.52
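
The search above scores candidates with RMSE and reports R², which treat the class labels 0, 1, and 2 as numeric quantities. A classification counterpart might look like the sketch below. It runs on synthetic data, and the deliberately small grid, the `f1_macro` scoring, and the `StratifiedKFold` settings are assumptions picked to keep the example quick, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic 3-class data standing in for the exam features
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Deliberately small grid so the example runs quickly
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5], "min_samples_leaf": [1, 4]}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_macro",  # classification metric averaged over the 3 classes
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print(f"Best CV macro-F1: {grid_search.best_score_:.3f}")

# Per-class precision and recall on the held-out test set
print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
```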


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
# New hyperparameters

# Define the hyperparameters to be tested
param_grid = {
    "n_estimators": [25, 50, 200],
    "max_depth": [None, 5, 15, 15],  # note: 15 is listed twice, so those candidates are evaluated twice
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}

# Create the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Create the GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=2,  # 2-fold cross-validation
    n_jobs=-1,  # Use all available cores for parallel processing
    verbose=2  # Show the GridSearch progress
)

# Run GridSearchCV on the Min-Max normalized training set (can be swapped for the other sets)
grid_search.fit(X_train_minmax, y_train_minmax)

# Best set of hyperparameters found
print("\nBest hyperparameters:", grid_search.best_params_)

# Best Random Forest model
best_rf = grid_search.best_estimator_

# Predict on the test set
y_pred_best_rf = best_rf.predict(X_test_minmax)

# Evaluate the tuned model
rmse_best_rf = np.sqrt(mean_squared_error(y_test_minmax, y_pred_best_rf))
r2_best_rf = r2_score(y_test_minmax, y_pred_best_rf)

print("Best estimator: ",grid_search.best_estimator_)
print(f"\nBest Random Forest - RMSE: {rmse_best_rf:.2f} | R²: {r2_best_rf:.2f}")

Fitting 2 folds for each of 216 candidates, totalling 432 fits

Best hyperparameters: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Best estimator:  RandomForestRegressor(min_samples_leaf=4, n_estimators=200, random_state=42)

Best Random Forest - RMSE: 0.85 | R²: 0.11


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
# New hyperparameters

# Define the hyperparameters to be tested
param_grid = {
    "n_estimators": [5, 100, 200],
    "max_depth": [None, 15, 25, 35],
    "min_samples_split": [4, 6, 8],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}

# Create the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Create the GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=10,  # 10-fold cross-validation
    n_jobs=-1,  # Use all available cores for parallel processing
    verbose=2  # Show the GridSearch progress
)

# Run GridSearchCV on the Min-Max normalized training set (can be swapped for the other sets)
grid_search.fit(X_train_minmax, y_train_minmax)

# Best set of hyperparameters found
print("\nBest hyperparameters:", grid_search.best_params_)

# Best Random Forest model
best_rf = grid_search.best_estimator_

# Predict on the test set
y_pred_best_rf = best_rf.predict(X_test_minmax)

# Evaluate the tuned model
rmse_best_rf = np.sqrt(mean_squared_error(y_test_minmax, y_pred_best_rf))
r2_best_rf = r2_score(y_test_minmax, y_pred_best_rf)

print("Best estimator: ",grid_search.best_estimator_)
print(f"\nBest Random Forest - RMSE: {rmse_best_rf:.2f} | R²: {r2_best_rf:.2f}")

Fitting 10 folds for each of 216 candidates, totalling 2160 fits

Best hyperparameters: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 100}
Best estimator:  RandomForestRegressor(min_samples_split=8, random_state=42)

Best Random Forest - RMSE: 0.86 | R²: 0.08
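
The three searches report noticeably different test scores (R² of 0.52, 0.11, and 0.08), and with 29 rows the 20% test split holds only 6 of them, so part of that spread may simply be noise. One way to get a feel for the variability of a score on small data is repeated cross-validation. The sketch below is an illustration on synthetic data, with the sample size (60 rows), the classifier settings, and the fold and repeat counts all chosen as assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Small synthetic set (assumption: 60 rows) to mimic a data-poor setting
X, y = make_classification(
    n_samples=60, n_features=20, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5 folds repeated 3 times with different shuffles gives 15 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
print(f"min={scores.min():.3f}, max={scores.max():.3f}")
```

A wide gap between the minimum and maximum fold scores suggests that differences between hyperparameter settings of a similar size should be interpreted with caution.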
