# LAB | Hyperparameter Tuning

**Load the data**

Final step: maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
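
To make "default values" concrete, here is a minimal sketch that prints the defaults scikit-learn assigns to the Random Forest hyperparameters tuned later in this lab. The variable names `defaults` and `tuned_names` are ours, and some defaults (for example `max_features`) differ between scikit-learn versions, so treat the printed values as specific to your installation.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are constructor arguments; get_params() returns their current values.
defaults = RandomForestClassifier().get_params()

# The hyperparameters explored in the grid search later in this lab
tuned_names = ["n_estimators", "max_depth", "max_features",
               "min_samples_split", "min_samples_leaf", "bootstrap"]

for name in tuned_names:
    print(f"{name}: {defaults[name]!r}")
```

Any value not passed explicitly to the constructor falls back to these defaults, which is what the earlier labs relied on.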

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [9]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Convert categorical variables to numeric with one-hot encoding
# (note: this also encodes identifier-like columns such as PassengerId, Cabin, and Name)
spaceship_encoded = pd.get_dummies(spaceship)

# Separate the features (X) from the target variable (y)
X = spaceship_encoded.drop('Transported', axis=1)
y = spaceship_encoded['Transported']

# 1. Impute missing values (NaN) with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# 2. Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# 3. Feature selection (SelectKBest keeps the 10 highest-scoring features)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_scaled, y)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Show the first rows of the scaled and selected X_train
print(X_train[:5])


[[-5.77428657e-02 -3.40589867e-01  3.06649292e-01 -2.69022628e-01
   9.42847454e-01 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [-8.24922656e-01 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
   9.42847454e-01 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [-5.77428657e-02 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
  -1.06061696e+00  1.75479500e+00 -1.29285779e+00  1.36468464e+00
   1.95689664e+00 -1.45918801e+00]
 [-6.15691804e-01  4.30826867e-17  5.91192079e-01 -2.69022628e-01
  -1.06061696e+00 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [ 5.00206073e-01 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
  -1.06061696e+00  1.75479500e+00 -1.29285779e+00  1.36468464e+00
   1.95689664e+00 -1.45918801e+00]]
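
The cell above fits the imputer, scaler, and selector on the full dataset before splitting, so statistics from the test rows influence the preprocessing. A possible alternative, sketched here on synthetic data rather than the Spaceship Titanic file, is to split first and wrap the three steps in a `Pipeline` that is fitted on the training rows only. The names `X_demo`, `y_demo`, and `preprocess` are ours and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=42)

# Inject roughly 5% missing values so the imputer has something to do
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Split BEFORE fitting any preprocessing step
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif, k=10)),
])

# Fit on the training rows only, then apply the fitted steps to the test rows
X_tr_sel = preprocess.fit_transform(X_tr, y_tr)
X_te_sel = preprocess.transform(X_te)

print(X_tr_sel.shape, X_te_sel.shape)  # (400, 10) (100, 10)
```

The same pipeline object can also be passed to a cross-validated search, in which case the preprocessing is refitted inside every fold.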


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Define the hyperparameter space to explore
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees
    'max_depth': [None, 10, 20, 30],  # Maximum depth of each tree
    'max_features': ['auto', 'sqrt', 'log2'],  # Features considered per split ('auto' was removed in scikit-learn 1.3)
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples in a leaf node
    'bootstrap': [True, False]        # Whether to use bootstrap sampling
}
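
Before running a search it helps to know how large the grid is. The sketch below counts the combinations in the grid defined above using scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `total_fits` are ours, and the 5 folds assume the `cv=5` setting used later in this lab.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# ParameterGrid enumerates every combination: 3 * 4 * 3 * 3 * 3 * 2
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5  # assumption: matches cv=5 in the grid search configuration
total_fits = n_candidates * cv_folds

print(n_candidates)  # 648
print(total_fits)    # 3240
```

Every extra value in any list multiplies the total, so the cost of an exhaustive search grows quickly.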


- Evaluate your model

In [11]:
from sklearn.metrics import accuracy_score

# Initial training with the untuned model
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline model accuracy: {accuracy:.4f}")


Baseline model accuracy: 0.7487
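
A single train/test split gives one accuracy number, which can shift with the random split. As a complementary check, here is a minimal sketch of a cross-validated baseline using `cross_val_score`. It runs on synthetic data (`X_demo`, `y_demo` are placeholders of ours, not the lab's features), so the score it prints says nothing about the Spaceship Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: replace with X_train, y_train in the lab)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

baseline = RandomForestClassifier(n_estimators=50, random_state=42)

# Five accuracy scores, one per held-out fold
cv_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

The spread across folds gives a rough sense of how much of a later improvement could be noise.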


**Grid/Random Search**

For this lab we will use Grid Search.
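
For comparison, the randomized alternative named in the heading samples a fixed number of combinations instead of trying all of them. The sketch below uses `RandomizedSearchCV` on synthetic data; the small parameter lists, `n_iter=5`, and `cv=3` are arbitrary choices of ours to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# 3 * 3 * 3 = 27 possible combinations
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate only 5 of the 27 combinations
    cv=3,
    random_state=42,   # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

Random search trades exhaustiveness for a predictable budget: the number of fits is `n_iter` times the number of folds, regardless of how large the search space is.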

- Define hyperparameters to fine tune.

In [None]:
# Configure GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)


- Run Grid Search

In [13]:
# Run Grid Search
grid_search.fit(X_train, y_train)

# Show the best hyperparameters found
print(f"Best hyperparameters found: {grid_search.best_params_}")


Fitting 5 folds for each of 648 candidates, totalling 3240 fits


1080 fits failed out of a total of 3240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
622 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Ema\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Ema\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Users\Ema\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Ema\AppData\Local\Programs\Python\Python312\Lib\sit

Best hyperparameters found: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 150}
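
The traceback in the log is cut off, so the exact error is not shown. One explanation that fits the numbers: `max_features='auto'` was deprecated in scikit-learn 1.1 and removed in 1.3, so every candidate using it fails parameter validation. This is our inference, not something the log confirms. The sketch below checks that the count of `'auto'` candidates matches the 1080 failed fits and builds a grid without that value; `fixed_grid` is a name of ours.

```python
from sklearn.model_selection import ParameterGrid

# Grid as used in this lab
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV configuration

# Candidates that use the removed 'auto' value
auto_candidates = [p for p in ParameterGrid(param_grid)
                   if p['max_features'] == 'auto']
failed_fits = len(auto_candidates) * cv_folds
print(failed_fits)  # 1080

# Same grid without 'auto' (assumption: this is the intended fix)
fixed_grid = dict(param_grid, max_features=['sqrt', 'log2'])
print(len(ParameterGrid(fixed_grid)))  # 432
```

Because failed candidates are scored as `nan`, the search still returns a best estimator from the remaining combinations, which is why the run above finished with a valid result.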


- Evaluate your model

In [14]:
# Get the best model from the search
best_rf_model = grid_search.best_estimator_

# Predict on the test set with the best model
y_pred_best = best_rf_model.predict(X_test)

# Evaluate the accuracy of the tuned model
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Tuned model accuracy: {accuracy_best:.4f}")


Tuned model accuracy: 0.7763
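
Beyond the single best combination, a fitted search keeps the cross-validated score of every candidate in `cv_results_`. Here is a minimal sketch of how that table could be inspected, run on synthetic data with a deliberately tiny grid. All names (`small_grid`, `search`, `results`) are ours, and the printed scores are unrelated to the lab's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Tiny grid so the sketch runs in seconds
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search.fit(X_tr, y_tr)

# One row per candidate, best-ranked first
results = (pd.DataFrame(search.cv_results_)
           .sort_values("rank_test_score")
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]])
print(results.to_string(index=False))

# Cross-validated score of the winner vs. its score on held-out data
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.4f}")
```

If several candidates score within one standard deviation of the best, the simpler or cheaper one may be the more sensible pick.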
