## Kaggle – DataTops®
Luismi has decided on a change of scenery and has bought a laptop shop. His only specialty, however, is Data Science, so he has decided to build an ML model to set the best prices.

Could you help Luismi improve that model?

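## Metric
The root mean squared error (RMSE) is the standard deviation of the residuals (the prediction errors). Residuals measure how far the data points lie from the regression line, and RMSE measures how spread out those residuals are. In other words, it tells you how tightly the data cluster around the line of best fit. In the formula below, $d_i$ is the observed value, $f_i$ is the predicted value, $\sigma_i$ is a per-observation scale factor and $n$ is the number of observations; with $\sigma_i = 1$ it reduces to the unweighted RMSE that this notebook computes with `np.sqrt(mean_squared_error(...))`.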


$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\frac{d_i - f_i}{\sigma_i}\Big)^2}}$$
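As a small worked example of the unweighted case ($\sigma_i = 1$), the sketch below computes RMSE by hand with NumPy. The prices and the `rmse` helper are made up for illustration and are not part of the competition data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return float(np.sqrt(np.mean(residuals ** 2)))


# Hypothetical prices in euros, for illustration only
actual = [500.0, 900.0, 1500.0]
predicted = [550.0, 800.0, 1500.0]

# Residuals are -50, 100 and 0, so the mean squared error is 12500 / 3
toy_rmse = rmse(actual, predicted)
print(round(toy_rmse, 2))
```

Because the errors are squared before averaging, the 100-euro miss weighs four times as much as the 50-euro miss.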


## Libraries

In [342]:
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
import urllib.request
import bootcampviztools as bvz
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

## Data

In [343]:
# For this to work you need to download the data files from Kaggle
df_train = pd.read_csv("./data/train.csv", index_col = 0)
df_train.index.name = None
df_train

Unnamed: 0,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_in_euros
755,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i3 6006U 2GHz,8GB,256GB SSD,Intel HD Graphics 520,Windows 10,1.86kg,539.00
618,Dell,Inspiron 7559,Gaming,15.6,Full HD 1920x1080,Intel Core i7 6700HQ 2.6GHz,16GB,1TB HDD,Nvidia GeForce GTX 960<U+039C>,Windows 10,2.59kg,879.01
909,HP,ProBook 450,Notebook,15.6,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8GB,1TB HDD,Nvidia GeForce 930MX,Windows 10,2.04kg,900.00
2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94
286,Dell,Inspiron 3567,Notebook,15.6,Full HD 1920x1080,Intel Core i3 6006U 2.0GHz,4GB,1TB HDD,AMD Radeon R5 M430,Linux,2.25kg,428.00
...,...,...,...,...,...,...,...,...,...,...,...,...
28,Dell,Inspiron 5570,Notebook,15.6,Full HD 1920x1080,Intel Core i5 8250U 1.6GHz,8GB,256GB SSD,AMD Radeon 530,Windows 10,2.2kg,800.00
1160,HP,Spectre Pro,2 in 1 Convertible,13.3,Full HD / Touchscreen 1920x1080,Intel Core i5 6300U 2.4GHz,8GB,256GB SSD,Intel HD Graphics 520,Windows 10,1.48kg,1629.00
78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,2TB HDD,Intel HD Graphics 620,No OS,2.2kg,519.00
23,HP,255 G6,Notebook,15.6,1366x768,AMD E-Series E2-9000e 1.5GHz,4GB,500GB HDD,AMD Radeon R2,No OS,1.86kg,258.00


## Data exploration

In [344]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 912 entries, 755 to 229
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           912 non-null    object 
 1   Product           912 non-null    object 
 2   TypeName          912 non-null    object 
 3   Inches            912 non-null    float64
 4   ScreenResolution  912 non-null    object 
 5   Cpu               912 non-null    object 
 6   Ram               912 non-null    object 
 7   Memory            912 non-null    object 
 8   Gpu               912 non-null    object 
 9   OpSys             912 non-null    object 
 10  Weight            912 non-null    object 
 11  Price_in_euros    912 non-null    float64
dtypes: float64(2), object(10)
memory usage: 92.6+ KB


## Data preprocessing

Our target is the `Price_in_euros` column.

In [345]:
df_train['OpSys'].value_counts()

OpSys
Windows 10      741
Linux            48
No OS            44
Windows 7        29
Chrome OS        24
macOS            11
Windows 10 S      7
Mac OS X          6
Android           2
Name: count, dtype: int64

In [346]:
# Copy of the original dataframe, used for the transformations
df_preprocessed = df_train.copy()

# 1) Encode Company as numeric
le_company = LabelEncoder()
df_preprocessed["Company"] = le_company.fit_transform(df_preprocessed["Company"])

# 2) Reduce Product to one keyword (the last word if it mixes letters and digits, otherwise the first)
def extract_product_keyword(product):
    words = product.split()
    if words[-1].isalnum() and any(c.isalpha() for c in words[-1]) and any(c.isdigit() for c in words[-1]):
        return words[-1]  # Last word mixes letters and digits: use it
    return words[0]  # Otherwise use the first word

df_preprocessed["Product"] = df_preprocessed["Product"].apply(extract_product_keyword)
le_product = LabelEncoder()
df_preprocessed["Product"] = le_product.fit_transform(df_preprocessed["Product"])

# 3) Encode TypeName as numeric
le_type = LabelEncoder()
df_preprocessed["TypeName"] = le_type.fit_transform(df_preprocessed["TypeName"])

# 4) Reduce ScreenResolution to its last WIDTHxHEIGHT token (e.g. "1920x1080")
def extract_screen_res(resolution):
    matches = re.findall(r'\d+x\d+', resolution)
    return matches[-1] if matches else "Unknown"

df_preprocessed["ScreenResolution"] = df_preprocessed["ScreenResolution"].apply(extract_screen_res)
le_screen = LabelEncoder()
df_preprocessed["ScreenResolution"] = le_screen.fit_transform(df_preprocessed["ScreenResolution"])

# 5) Split Cpu into three columns (first word, third word, last word)
def extract_cpu_parts(cpu):
    parts = cpu.split()
    first_word = parts[0] if len(parts) > 0 else "Unknown"
    third_word = parts[2] if len(parts) > 2 else "Unknown"
    last_word = parts[-1] if len(parts) > 0 else "Unknown"
    return first_word, third_word, last_word

df_preprocessed[["CPU_Brand", "CPU_Model", "CPU_Speed"]] = df_preprocessed["Cpu"].apply(lambda x: pd.Series(extract_cpu_parts(x)))

# Encode the CPU columns as numeric
le_cpu_brand = LabelEncoder()
le_cpu_model = LabelEncoder()
le_cpu_speed = LabelEncoder()

df_preprocessed["CPU_Brand"] = le_cpu_brand.fit_transform(df_preprocessed["CPU_Brand"])
df_preprocessed["CPU_Model"] = le_cpu_model.fit_transform(df_preprocessed["CPU_Model"])
df_preprocessed["CPU_Speed"] = le_cpu_speed.fit_transform(df_preprocessed["CPU_Speed"])

# 6) Split Memory into two columns (first word, last word)
def extract_memory_parts(memory):
    parts = memory.split()
    first_word = parts[0] if len(parts) > 0 else "Unknown"
    last_word = parts[-1] if len(parts) > 1 else "Unknown"
    return first_word, last_word

df_preprocessed[["Memory_Capacity", "Memory_Type"]] = df_preprocessed["Memory"].apply(lambda x: pd.Series(extract_memory_parts(x)))

# Encode the Memory columns as numeric
le_memory_capacity = LabelEncoder()
le_memory_type = LabelEncoder()

df_preprocessed["Memory_Capacity"] = le_memory_capacity.fit_transform(df_preprocessed["Memory_Capacity"])
df_preprocessed["Memory_Type"] = le_memory_type.fit_transform(df_preprocessed["Memory_Type"])

# 7) Split Gpu into two columns (reuses extract_memory_parts: first word, last word)
df_preprocessed[["GPU_Brand", "GPU_Model"]] = df_preprocessed["Gpu"].apply(lambda x: pd.Series(extract_memory_parts(x)))

# Encode the GPU columns as numeric
le_gpu_brand = LabelEncoder()
le_gpu_model = LabelEncoder()

df_preprocessed["GPU_Brand"] = le_gpu_brand.fit_transform(df_preprocessed["GPU_Brand"])
df_preprocessed["GPU_Model"] = le_gpu_model.fit_transform(df_preprocessed["GPU_Model"])

# 8) Reduce OpSys to six categories and encode as numeric
def categorize_os(os):
    os = os.lower()
    if "windows" in os:
        return "Windows"
    elif "linux" in os:
        return "Linux"
    elif "mac" in os:  # matches both "macOS" and "Mac OS X"
        return "Mac"
    elif "android" in os:
        return "Android"
    elif "chrome" in os:  # matches "Chrome OS"
        return "Chrome"
    else:
        return "Others"  # e.g. "No OS"

df_preprocessed["OpSys"] = df_preprocessed["OpSys"].apply(categorize_os)
le_os = LabelEncoder()
df_preprocessed["OpSys"] = le_os.fit_transform(df_preprocessed["OpSys"])

# Clean up and convert columns to numeric

# 1) Convert Ram to numeric (stripping "GB")
df_preprocessed["Ram"] = df_preprocessed["Ram"].str.replace("GB", "", regex=True).astype(int)

# 2) Convert Weight to numeric (stripping "kg")
df_preprocessed["Weight"] = df_preprocessed["Weight"].str.replace("kg", "", regex=True).astype(float)

# 3) Cast GPU_Brand and GPU_Model to str, so the loop below encodes them a second time.
# Caution: the codes are then sorted as text ("10" < "2"), so the final GPU codes
# no longer match le_gpu_brand / le_gpu_model, which are reused later for test.csv.
df_preprocessed["GPU_Brand"] = df_preprocessed["GPU_Brand"].astype(str)
df_preprocessed["GPU_Model"] = df_preprocessed["GPU_Model"].astype(str)

# Label-encode any remaining column that is still 'object'
for col in df_preprocessed.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df_preprocessed[col] = le.fit_transform(df_preprocessed[col])

# Drop the original columns that have already been transformed
df_preprocessed.drop(columns=["Cpu", "Memory", "Gpu"], inplace=True)

# Rename the dataframe

df_train_processed = df_preprocessed

df_train_processed


Unnamed: 0,Company,Product,TypeName,Inches,ScreenResolution,Ram,OpSys,Weight,Price_in_euros,CPU_Brand,CPU_Model,CPU_Speed,Memory_Capacity,Memory_Type,GPU_Brand,GPU_Model
755,7,85,3,15.6,3,8,5,1.86,539.00,1,27,22,6,2,1,8
618,4,107,1,15.6,3,16,5,2.59,879.01,1,29,18,4,0,2,32
909,7,143,3,15.6,3,8,5,2.04,900.00,1,29,19,4,0,2,27
2,1,124,4,13.3,1,8,3,1.34,898.94,1,28,8,1,3,1,17
286,4,107,3,15.6,3,4,2,2.25,428.00,1,27,11,4,0,0,51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28,4,107,3,15.6,3,8,5,2.20,800.00,1,28,7,6,2,0,9
1160,7,158,0,13.3,3,8,5,1.48,1629.00,1,28,15,6,2,1,8
78,10,105,3,15.6,3,8,4,2.20,519.00,1,28,17,7,0,1,19
23,7,85,3,15.6,0,4,4,1.86,258.00,0,20,6,9,0,0,62


In [347]:
df_train_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 912 entries, 755 to 229
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           912 non-null    int32  
 1   Product           912 non-null    int32  
 2   TypeName          912 non-null    int32  
 3   Inches            912 non-null    float64
 4   ScreenResolution  912 non-null    int32  
 5   Ram               912 non-null    int32  
 6   OpSys             912 non-null    int32  
 7   Weight            912 non-null    float64
 8   Price_in_euros    912 non-null    float64
 9   CPU_Brand         912 non-null    int32  
 10  CPU_Model         912 non-null    int32  
 11  CPU_Speed         912 non-null    int32  
 12  Memory_Capacity   912 non-null    int32  
 13  Memory_Type       912 non-null    int32  
 14  GPU_Brand         912 non-null    int32  
 15  GPU_Model         912 non-null    int32  
dtypes: float64(3), int32(13)
memory usage: 74.8 KB


-----------------------------------------------------------------------------------------------------------------

In [348]:
df_test = pd.read_csv("./data/test.csv", index_col=0)
df_test.index.name = None
df_test

Unnamed: 0,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
209,Lenovo,Legion Y520-15IKBN,Gaming,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD,Nvidia GeForce GTX 1060,No OS,2.4kg
1281,Acer,Aspire ES1-531,Notebook,15.6,1366x768,Intel Celeron Dual Core N3060 1.6GHz,4GB,500GB HDD,Intel HD Graphics 400,Linux,2.4kg
1168,Lenovo,V110-15ISK (i3-6006U/4GB/1TB/No,Notebook,15.6,1366x768,Intel Core i3 6006U 2.0GHz,4GB,1TB HDD,Intel HD Graphics 520,No OS,1.9kg
1231,Dell,Inspiron 7579,2 in 1 Convertible,15.6,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows 10,2.191kg
1020,HP,ProBook 640,Notebook,14.0,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,4GB,256GB SSD,Intel HD Graphics 620,Windows 10,1.95kg
...,...,...,...,...,...,...,...,...,...,...,...
820,MSI,GE72MVR 7RG,Gaming,17.3,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD + 1TB HDD,Nvidia GeForce GTX 1070,Windows 10,2.9kg
948,Toshiba,Tecra Z40-C-12X,Notebook,14.0,IPS Panel Full HD 1920x1080,Intel Core i5 6200U 2.3GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.47kg
483,Dell,Precision M5520,Workstation,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,8GB,256GB SSD,Nvidia Quadro M1200,Windows 10,1.78kg
1017,HP,Probook 440,Notebook,14.0,1366x768,Intel Core i5 7200U 2.5GHz,4GB,500GB HDD,Intel HD Graphics 620,Windows 10,1.64kg


## 2. Replicate the preprocessing for ``test.csv``

In [349]:
# Apply the same transformations as for the training data

# 1) Encode Company with the LabelEncoder fitted on train (unseen values become -1)
df_test["Company"] = df_test["Company"].map(lambda x: le_company.transform([x])[0] if x in le_company.classes_ else -1)

# 2) Reduce Product to one keyword (last word if it mixes letters and digits, otherwise the first), then label-encode
df_test["Product"] = df_test["Product"].apply(extract_product_keyword).map(lambda x: le_product.transform([x])[0] if x in le_product.classes_ else -1)

# 3) Encode TypeName as numeric
df_test["TypeName"] = df_test["TypeName"].map(lambda x: le_type.transform([x])[0] if x in le_type.classes_ else -1)

# 4) Reduce ScreenResolution to its last WIDTHxHEIGHT token, then label-encode
df_test["ScreenResolution"] = df_test["ScreenResolution"].apply(extract_screen_res).map(lambda x: le_screen.transform([x])[0] if x in le_screen.classes_ else -1)

# 5) Split Cpu into three columns and encode as numeric
df_test[["CPU_Brand", "CPU_Model", "CPU_Speed"]] = df_test["Cpu"].apply(lambda x: pd.Series(extract_cpu_parts(x)))
df_test["CPU_Brand"] = df_test["CPU_Brand"].map(lambda x: le_cpu_brand.transform([x])[0] if x in le_cpu_brand.classes_ else -1)
df_test["CPU_Model"] = df_test["CPU_Model"].map(lambda x: le_cpu_model.transform([x])[0] if x in le_cpu_model.classes_ else -1)
df_test["CPU_Speed"] = df_test["CPU_Speed"].map(lambda x: le_cpu_speed.transform([x])[0] if x in le_cpu_speed.classes_ else -1)

# 6) Split Memory into two columns and encode as numeric
df_test[["Memory_Capacity", "Memory_Type"]] = df_test["Memory"].apply(lambda x: pd.Series(extract_memory_parts(x)))
df_test["Memory_Capacity"] = df_test["Memory_Capacity"].map(lambda x: le_memory_capacity.transform([x])[0] if x in le_memory_capacity.classes_ else -1)
df_test["Memory_Type"] = df_test["Memory_Type"].map(lambda x: le_memory_type.transform([x])[0] if x in le_memory_type.classes_ else -1)

# 7) Split Gpu into two columns and encode as numeric
df_test[["GPU_Brand", "GPU_Model"]] = df_test["Gpu"].apply(lambda x: pd.Series(extract_memory_parts(x)))
df_test["GPU_Brand"] = df_test["GPU_Brand"].map(lambda x: le_gpu_brand.transform([x])[0] if x in le_gpu_brand.classes_ else -1)
df_test["GPU_Model"] = df_test["GPU_Model"].map(lambda x: le_gpu_model.transform([x])[0] if x in le_gpu_model.classes_ else -1)

# 8) Reduce OpSys to six categories and encode as numeric
df_test["OpSys"] = df_test["OpSys"].apply(categorize_os).map(lambda x: le_os.transform([x])[0] if x in le_os.classes_ else -1)

# Clean up and convert columns to numeric

# 1) Convert Ram to numeric (stripping "GB")
df_test["Ram"] = df_test["Ram"].str.replace("GB", "", regex=True).astype(int)

# 2) Convert Weight to numeric (stripping "kg")
df_test["Weight"] = df_test["Weight"].str.replace("kg", "", regex=True).astype(float)

# Make sure GPU_Brand and GPU_Model are stored as integers
df_test["GPU_Brand"] = df_test["GPU_Brand"].astype(int)
df_test["GPU_Model"] = df_test["GPU_Model"].astype(int)

# Make sure no categorical column is left untransformed.
# Note: `le` is whichever encoder the training loop fitted last, so it is not column-specific.
# The body never runs here: the only object columns left (Cpu, Memory, Gpu) are not in df_train_processed.
for col in df_test.select_dtypes(include=['object']).columns:
    if col in df_train_processed.columns:  # Avoid problems with new columns
        df_test[col] = df_test[col].map(lambda x: le.transform([x])[0] if x in le.classes_ else -1)

# Drop the original columns that have already been transformed
df_test.drop(columns=["Cpu", "Memory", "Gpu"], inplace=True)

# Rename the dataframe

df_test_processed = df_test

df_test_processed

Unnamed: 0,Company,Product,TypeName,Inches,ScreenResolution,Ram,OpSys,Weight,CPU_Brand,CPU_Model,CPU_Speed,Memory_Capacity,Memory_Type,GPU_Brand,GPU_Model
209,10,121,1,15.6,3,16,4,2.400,1,29,20,11,2,2,3
1281,0,46,3,15.6,0,4,2,2.400,1,17,7,9,0,1,8
1168,10,185,3,15.6,0,4,4,1.900,1,27,11,4,0,1,16
1231,4,107,0,15.6,3,8,5,2.191,1,28,17,6,2,1,26
1020,7,143,3,14.0,3,4,5,1.950,1,28,17,6,2,1,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
820,11,42,1,17.3,3,16,5,2.900,1,29,20,11,0,2,4
948,16,173,3,14.0,3,4,5,1.470,1,28,14,1,2,1,16
483,4,-1,5,15.6,3,8,5,1.780,1,29,20,6,2,2,46
1017,7,144,3,14.0,0,4,5,1.640,1,28,17,9,0,1,26
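
Comparing this table with the processed training table suggests that the GPU codes are not aligned: `Intel HD Graphics 520` appears as `GPU_Model` 8 in train (row 755) but 16 in test (row 1168), and `Intel HD Graphics 620` as 19 in train (row 78) but 26 in test (row 1231). A likely cause is that the training cell casts the already-encoded GPU columns to `str` and label-encodes them again, which sorts the codes as text. The sketch below reproduces that effect on a toy range of codes; the range of 70 codes is an assumption, since the real number of GPU classes is not shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy integer codes standing in for a first round of label encoding (assumed range)
codes = np.arange(70)

# Casting to str and encoding again, as the training cell does for the GPU columns
as_text = codes.astype(str)
reencoded = LabelEncoder().fit_transform(as_text)

# Text order is '0', '1', '10', '11', ..., '19', '2', '20', ... so the codes get permuted
print(int(reencoded[16]), int(reencoded[26]))
```

Under this assumption the original code 16 becomes 8 and 26 becomes 19, which matches the pairs seen in the two tables.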


In [350]:
df_test_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 391 entries, 209 to 421
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           391 non-null    int32  
 1   Product           391 non-null    int64  
 2   TypeName          391 non-null    int32  
 3   Inches            391 non-null    float64
 4   ScreenResolution  391 non-null    int64  
 5   Ram               391 non-null    int32  
 6   OpSys             391 non-null    int32  
 7   Weight            391 non-null    float64
 8   CPU_Brand         391 non-null    int64  
 9   CPU_Model         391 non-null    int64  
 10  CPU_Speed         391 non-null    int64  
 11  Memory_Capacity   391 non-null    int32  
 12  Memory_Type       391 non-null    int32  
 13  GPU_Brand         391 non-null    int32  
 14  GPU_Model         391 non-null    int32  
dtypes: float64(2), int32(8), int64(5)
memory usage: 36.7 KB
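
The repeated `lambda` used above to send unseen categories to -1 could be factored into one helper. This is a minimal sketch rather than the notebook's own code: `encode_with_fallback` is a hypothetical name and the brand lists are toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_with_fallback(series, encoder, unknown=-1):
    """Map each value to its fitted code; values unseen at fit time get `unknown`."""
    lookup = {label: code for code, label in enumerate(encoder.classes_)}
    return series.map(lookup).fillna(unknown).astype(int)


# Toy data: "Razer" does not appear in the training brands
train_brands = pd.Series(["HP", "Dell", "Apple"])
test_brands = pd.Series(["Dell", "Razer", "HP"])

encoder = LabelEncoder().fit(train_brands)
encoded = encode_with_fallback(test_brands, encoder)
print(encoded.tolist())
```

Building the lookup dictionary once avoids calling `encoder.transform` for every row, and the explicit `astype(int)` keeps the resulting dtype the same for every column.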


In [351]:
# 1️⃣ Ratio between weight and screen size
df_train_processed["Weight_per_Inch"] = df_train_processed["Weight"] / df_train_processed["Inches"]
df_test_processed["Weight_per_Inch"] = df_test_processed["Weight"] / df_test_processed["Inches"]

# 2️⃣ Relative difference between CPU and RAM (may help detect premium laptops)
# Note: CPU_Brand is a label code here, not a performance measure
df_train_processed["CPU_Ratio"] = df_train_processed["CPU_Brand"] / (df_train_processed["Ram"] + 1)
df_test_processed["CPU_Ratio"] = df_test_processed["CPU_Brand"] / (df_test_processed["Ram"] + 1)

# 3️⃣ Storage factor (Memory_Capacity is a label code, not a size in GB)
df_train_processed["Storage_Efficiency"] = df_train_processed["Memory_Capacity"] / (df_train_processed["Weight"] + 1)
df_test_processed["Storage_Efficiency"] = df_test_processed["Memory_Capacity"] / (df_test_processed["Weight"] + 1)

## Modeling

### 1. Define X and y

In [352]:
X = df_train_processed.drop(['Price_in_euros'], axis=1)
y = df_train_processed['Price_in_euros'].copy()

### 2. Split into X_train, X_test, y_train, y_test

In [353]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

### 3. Baseline models


In [354]:

# Train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))


In [355]:
# Train an XGBoost model
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))


### 4. Compute metrics and evaluate the models

Remember that the competition is scored with the ``RMSE`` metric.

In [356]:
# Compare results
rmse_results = {"Random Forest RMSE": rmse_rf, "XGBoost RMSE": rmse_xgb}
rmse_results

{'Random Forest RMSE': 362.5017316790963, 'XGBoost RMSE': 346.820799800208}
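
To put RMSE values like these in context, it can help to compare them with a predictor that always outputs the training mean. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data; the arrays and the names `X_toy`, `y_toy` and `rmse_dummy` are made up for illustration and say nothing about the laptop dataset itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for X and y)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 1000 + 300 * X_toy[:, 0] + rng.normal(scale=50, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The dummy model ignores the features and always predicts the training mean
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
rmse_dummy = np.sqrt(mean_squared_error(y_val, dummy.predict(X_val)))
print(f"Mean-predictor RMSE: {rmse_dummy:.2f}")
```

A model whose validation RMSE is not clearly below this reference is learning little from the features.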

------------------------------------------------------------

## Optimization

In [357]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)

Best hyperparameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}


In [358]:
xgb_params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

xgb_grid = RandomizedSearchCV(XGBRegressor(random_state=42), xgb_params, cv=3, scoring='neg_root_mean_squared_error', n_iter=10, random_state=42)
xgb_grid.fit(X_train, y_train)

print("Best hyperparameters for XGBoost:", xgb_grid.best_params_)

Best hyperparameters for XGBoost: {'subsample': 0.8, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.2, 'colsample_bytree': 0.8}


In [359]:
best_rf_params = grid_search.best_params_

best_rf_model = RandomForestRegressor(**best_rf_params, random_state=42)
best_rf_model.fit(X_train, y_train)

In [360]:
best_xgb_params = xgb_grid.best_params_

best_xgb_model = XGBRegressor(**best_xgb_params, random_state=42)
best_xgb_model.fit(X_train, y_train)

In [361]:
y_pred_rf = best_rf_model.predict(X_test)  # Random Forest
y_pred_xgb = best_xgb_model.predict(X_test)  # XGBoost

In [362]:
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

print(f"Random Forest RMSE: {rmse_rf}")
print(f"XGBoost RMSE: {rmse_xgb}")

Random Forest RMSE: 361.75480031936445
XGBoost RMSE: 326.73041990653036


-----------------------------------------------------------------

## Once the model is ready, it is time to predict ``test.csv``

**REMEMBER: APPLY TO `test.csv` THE SAME TRANSFORMATIONS YOU APPLIED TO `train.csv`.**


For example:
- Standardization/Normalization
- Outlier removal
- Column removal
- Creation of new columns
- Handling of null values
- And a long list of other techniques that you, as a Data Scientist, have judged best for your dataset.
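
One way to make sure train and test go through identical transformations is to fit the preprocessing once inside a scikit-learn `Pipeline`. The sketch below is not the approach used in this notebook. It assumes toy data with only two feature columns, and it swaps `LabelEncoder` for `OrdinalEncoder`, whose `handle_unknown="use_encoded_value"` option sends categories unseen during fitting to a chosen value.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy frames for illustration; "Razer" is absent from the training data
train = pd.DataFrame({
    "Company": ["HP", "Dell", "Apple", "HP"],
    "Ram": [8, 16, 8, 4],
    "Price_in_euros": [539.0, 879.0, 899.0, 258.0],
})
test = pd.DataFrame({"Company": ["Dell", "Razer"], "Ram": [8, 16]})
features = ["Company", "Ram"]

# Encode the categorical column; pass the numeric one through unchanged
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Company"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
])

# Fitting learns the encoding from train only; predict reuses it on test
model.fit(train[features], train["Price_in_euros"])
preds = model.predict(test[features])
encoded_test = model.named_steps["prep"].transform(test[features])
print(encoded_test)
```

Because the encoder lives inside the fitted pipeline, there is no second copy of the preprocessing code to keep in sync for `test.csv`.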

### 1. Load the `test.csv` data to predict.


In [363]:
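# The submission uses the tuned Random Forest; note that XGBoost reached a lower validation RMSE above
predictions_submit = best_rf_model.predict(df_test_processed)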
predictions_submit

array([1456.22375   ,  296.705025  ,  419.2719    ,  904.09798992,
        915.15375   ,  469.09215   ,  867.65773   , 1067.14015   ,
       1339.10125   ,  382.63445   , 2494.31363286, 1457.860525  ,
        566.30245   , 1803.97296333, 1039.95453333,  709.53205   ,
       1817.55715   , 1338.6505    , 1658.9093    ,  711.2609    ,
       1422.95325   ,  411.684175  ,  776.8554    , 1492.1717    ,
        469.8887    ,  764.46218515,  532.79145   ,  648.78975   ,
       2283.53      , 1037.80158019, 1885.76173333,  459.86705   ,
        810.6541    , 3071.522     , 2086.3181    , 1661.44196667,
        666.554375  , 2050.46001667,  925.8733873 , 1413.06195952,
        823.3212    ,  861.4982    ,  522.4883    , 1113.78480162,
       1313.65926667, 1037.65633333, 1104.71464924,  554.58565   ,
        690.87103333,  491.89935   , 1792.12855714,  785.63024333,
       1120.96115   ,  609.237575  , 1731.18143333, 1746.24145655,
        692.0186    ,  945.63856555,  896.02094929,  560.25435

### 3. **What will you upload to Kaggle?**

**To upload the prediction to Kaggle, it must have a specific shape.**

In this case, the **SAME** shape as `sample_submission.csv`.

In [364]:
sample = pd.read_csv("./data/sample_submission.csv")
sample

Unnamed: 0,laptop_ID,Price_in_euros
0,209,1949.1
1,1281,805.0
2,1168,1101.0
3,1231,1293.8
4,1020,1832.6
...,...,...
386,820,474.3
387,948,1468.8
388,483,520.4
389,1017,515.1
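
"Same shape" here can be read as three conditions: the same number of rows and columns, the same column names, and the same `laptop_ID` values in the same order. The sketch below spells those checks out on toy frames; `matches_sample` is a hypothetical helper and the prices are invented.

```python
import pandas as pd


def matches_sample(candidate, sample, id_col="laptop_ID"):
    """Return True when shape, column names and IDs all agree with the sample."""
    if candidate.shape != sample.shape:
        return False
    if list(candidate.columns) != list(sample.columns):
        return False
    # Compare plain lists so that differing row indexes do not matter
    return candidate[id_col].tolist() == sample[id_col].tolist()


# Toy frames for illustration
toy_sample = pd.DataFrame({"laptop_ID": [209, 1281, 1168], "Price_in_euros": [1.0, 2.0, 3.0]})
good = pd.DataFrame({"laptop_ID": [209, 1281, 1168], "Price_in_euros": [950.5, 300.0, 420.1]})
shuffled = pd.DataFrame({"laptop_ID": [1281, 209, 1168], "Price_in_euros": [300.0, 950.5, 420.1]})

print(matches_sample(good, toy_sample), matches_sample(shuffled, toy_sample))
```

The shuffled frame fails even though it contains the same IDs, because Kaggle matches predictions to rows by ID and the check expects the sample's order.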


### 4. Put your predictions in a dataframe called ``submission``.

In [365]:
# How do we create the submission?
submission = pd.DataFrame({
    "laptop_ID": df_test_processed.index,
    "Price_in_euros": predictions_submit
})

In [366]:
submission.head()
submission['Price_in_euros'] = submission['Price_in_euros'].apply(lambda x: round(x, 1))
submission

Unnamed: 0,laptop_ID,Price_in_euros
0,209,1456.2
1,1281,296.7
2,1168,419.3
3,1231,904.1
4,1020,915.2
...,...,...
386,820,1762.7
387,948,940.5
388,483,1800.7
389,1017,899.8


### 5. Run it through the CHECKER (`chequeador`) to confirm that it really is ready to upload to Kaggle.

In [367]:
def chequeador(df_to_submit):
    """
    This function makes sure your submission has the shape Kaggle requires.
    
    If it does, the dataframe is saved to a `csv` file and is ready to upload to Kaggle.
    
    If it does not, READ THE MESSAGE AND DO WHAT IT SAYS.
    
    If it still fails:
    - turn off your computer, 
    - go for a walk, 
    - turn it on again, 
    - open this notebook and 
    - read it all again. 
    We all deserve a second chance. So do you.
    """
    if df_to_submit.shape == sample.shape:
        # Compare element-wise and then reduce; `a.all() == b.all()` only compares two truth values
        if (df_to_submit.columns == sample.columns).all():
            if (df_to_submit.laptop_ID.values == sample.laptop_ID.values).all():
                print("You're ready to submit!")
                # Save the dataframe that was checked; index = False is very important
                df_to_submit.to_csv("submission.csv", index = False)
                urllib.request.urlretrieve("https://www.mihaileric.com/static/evaluation-meme-e0a350f278a36346e6d46b139b1d0da0-ed51e.jpg", "gfg.png")     
                img = Image.open("gfg.png")
                img.show()   
            else:
                print("Check the ids and try again")
        else:
            print("Check the names of the columns and try again")
    else:
        print("Check the number of rows and/or columns and try again")
        print("\nSecret message from the TA: I can't believe that after this whole notebook you changed the rows of `test.csv`. I'm crying.")

In [368]:
chequeador(submission)

You're ready to submit!
