For this problem we decided as a team to use a supervised learning model. Specifically, we chose a regression model, because the value to predict, the CCS, is continuous, so framing the task as classification would not make sense.

We begin the exercise by importing the data from public_test.csv and public_train.csv.

In [133]:
# Import the data:
import sklearn
import pandas as pd
# /workspaces/Practica-IA-3/public_test.csv
# /workspaces/Practica-IA-3/public_train.csv
# DataFrames
test_data = pd.read_csv("public_test.csv")
train_data = pd.read_csv("public_train.csv")
# Only the last expression of a cell is displayed, so the output below is train_data.head().
test_data.head()
train_data.head()

Unnamed: 0,ccs,adduct,desc_1,desc_2,desc_3,desc_4,desc_5,desc_6,desc_7,desc_8,...,fgp_611,fgp_612,fgp_613,fgp_614,fgp_615,fgp_616,fgp_617,fgp_618,fgp_619,fgp_620
0,155.71001,Monomer_[M+H],-5.27492,420.85751,0.79886,60.68518,0.22702,54.36304,55.73885,8.28095,...,0,0,1,0,1,1,1,0,1,0
1,179.56,Monomer_[M-H],-2.0,1171.97156,1.13333,121.96276,0.1552,100.4539,97.68512,11.36667,...,1,0,0,1,0,1,0,1,1,0
2,180.0,Monomer_[M+Na],-0.7006,1447.22644,1.30616,141.10509,0.13916,110.47855,108.96603,12.96667,...,0,1,1,0,1,0,0,0,0,1
3,155.53999,Monomer_[M+Na],0.0,479.806,0.92774,74.90678,0.21166,65.6459,66.68343,6.83333,...,1,0,1,1,1,0,1,1,0,0
4,173.5,Monomer_[M+H],-1.63669,978.85028,1.24843,112.27611,0.15789,87.88085,89.68015,10.73333,...,0,1,0,1,0,0,1,0,1,0


We will use a gradient boosting model, specifically the XGBoost implementation, which supports regression tasks, so it will act as a gradient-boosted regressor. Next comes the data preprocessing.


In [134]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import numpy as np

# Data preprocessing: first, split off the target and see what the data look like.

y = train_data["ccs"]
x = train_data.drop(columns=["ccs"])

# Note: this also picks up "Unnamed: 0", which appears to be a saved row index rather than a descriptor.
numeric_features = x.select_dtypes(include=np.number).columns.tolist()
categorical_features = ["adduct"]

# Printing the data shows some NaN values, so we apply imputation:
# print(test_data)
# print(train_data)
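
Rather than scanning the printed frame by eye, the missing values can be counted per column. The sketch below uses a small made-up frame (`toy`, with invented values) in place of `train_data`, so the column contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data; the real data has many more columns.
toy = pd.DataFrame({
    "adduct": ["Monomer_[M+H]", np.nan, "Monomer_[M+Na]"],
    "desc_1": [-5.27, np.nan, -0.70],
    "desc_2": [420.86, 1171.97, 1447.23],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

Applied to the real data, the same two lines on `train_data` and `test_data` would show which columns actually need imputation.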


### Model + preprocessing

In [104]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

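def create_preproc_and_model(numeric_features, categorical_features, params):
    """Build a pipeline that imputes, scales and one-hot encodes the features, then fits an XGBoost regressor."""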
    scaler = StandardScaler()
    numeric_imputer = SimpleImputer(strategy="mean")
    categorical_imputer = SimpleImputer(strategy="most_frequent")
    oh_encoder = OneHotEncoder(handle_unknown='ignore')

    numeric_transformer = Pipeline(steps=[
        ("imputer", numeric_imputer),
        ("scaler", scaler)
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", categorical_imputer),
        ("encoder", oh_encoder)
    ])

    transformer = ColumnTransformer(
        transformers=[
            ('numeric_imp', numeric_transformer, numeric_features),
            ('categorial_imp', categorical_transformer, categorical_features) 
        ]
    )
    
    # Create the XGBoost model with the given parameters
    model = xgb.XGBRegressor(**params)
    
    return Pipeline(steps=[
        ("transformer", transformer),
        ("model", model)
    ])
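
To see what the preprocessing half of this pipeline produces, here is a minimal sketch that rebuilds the same kind of `ColumnTransformer` on a made-up four-row frame. The names `toy` and `toy_transformer` and the values are assumptions for illustration, and the model step is left out so the example only needs scikit-learn. The toy categorical column has no missing values, so its imputer is a no-op here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column.
toy = pd.DataFrame({
    "adduct": ["Monomer_[M+H]", "Monomer_[M-H]", "Monomer_[M+H]", "Monomer_[M-H]"],
    "desc_1": [1.0, np.nan, 3.0, 2.0],
})

toy_transformer = ColumnTransformer(transformers=[
    # Numeric branch: fill gaps with the column mean, then standardize.
    ("numeric", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["desc_1"]),
    # Categorical branch: fill gaps with the most frequent value, then one-hot encode.
    ("categorical", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["adduct"]),
])

transformed = toy_transformer.fit_transform(toy)
# The result may be sparse depending on its density; convert for inspection.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
print(transformed.shape)
```

The output has one standardized numeric column followed by one indicator column per adduct category. The missing `desc_1` entry is replaced by the column mean, so it ends up at zero after scaling.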


### Training + checking the best parameters

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


# No random_state is set, so the split (and the metrics computed on it) change from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Tune the model and select the configuration that performs best.
params = {
        "objective": ["reg:squarederror"],  # Regression objective
        "booster": ["gblinear"],            # Use the linear booster
        "alpha": [0.1],                     # L1 regularization (Lasso)
        "lambda": [0.7, 0.8, 0.9],          # L2 regularization (Ridge)
        "n_estimators": [200, 300],         # Number of boosting rounds
        "learning_rate": [0.02, 0.03],      # Conservative vs. fast
        "enable_categorical": [True]        # If there are categorical columns
    }
# The candidate lists go to GridSearchCV below, which sets every one of them,
# so the pipeline itself is built without parameters.
pipeline = create_preproc_and_model(numeric_features, categorical_features, params={})

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid={"model__" + key: value for key, value in params.items()},  # Prefix each parameter with "model__"
                           cv=3,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1,
                           verbose=1)

grid_search.fit(x_train, y_train)
print("Best parameters found:", grid_search.best_params_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters found: {'model__alpha': 0.1, 'model__booster': 'gblinear', 'model__enable_categorical': True, 'model__lambda': 0.7, 'model__learning_rate': 0.03, 'model__n_estimators': 300, 'model__objective': 'reg:squarederror'}
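
The "12 candidates, totalling 36 fits" in the log follows directly from the grid: 3 values of `lambda` times 2 of `n_estimators` times 2 of `learning_rate`, with every other parameter fixed to a single value, evaluated on 3 folds each. As a quick check, scikit-learn's `ParameterGrid` can enumerate the same grid without training anything:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, with the "model__" prefix already applied.
param_grid = {
    "model__objective": ["reg:squarederror"],
    "model__booster": ["gblinear"],
    "model__alpha": [0.1],
    "model__lambda": [0.7, 0.8, 0.9],
    "model__n_estimators": [200, 300],
    "model__learning_rate": [0.02, 0.03],
    "model__enable_categorical": [True],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 2 * 2 = 12 combinations
n_fits = n_candidates * 3                      # cv=3 folds per candidate
print(n_candidates, n_fits)
```

Counting candidates this way before calling `fit` gives a rough idea of how long a larger grid would take.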


In [129]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

best_model = grid_search.best_estimator_
predicciones = best_model.predict(x_test)

# Evaluation
mse = mean_squared_error(y_test, predicciones)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predicciones)
r2 = r2_score(y_test, predicciones)

# Results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R² Score:", r2)

Mean Squared Error (MSE): 118.12900489559534
Root Mean Squared Error (RMSE): 10.868716800781744
Mean Absolute Error (MAE): 5.905903130023832
R² Score: 0.9628629163718576
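
For reference, the four metrics can be written out by hand. The sketch below uses NumPy on four made-up values (`y_true` and `y_pred` are invented for illustration, not taken from the dataset) and computes the same quantities that the scikit-learn helpers return.

```python
import numpy as np

# Hypothetical CCS values and predictions, for illustration only.
y_true = np.array([150.0, 180.0, 200.0, 170.0])
y_pred = np.array([152.0, 176.0, 205.0, 170.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # average squared error
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # average absolute error

# R² compares the residual error with the variance around the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

RMSE and MAE are in the same units as the CCS values. In the results above, the RMSE (about 10.9) is noticeably larger than the MAE (about 5.9), which suggests that a few large errors weigh heavily on the squared metric.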


### With the parameters fixed

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# A new random split is drawn here, so the test rows differ from the ones used in the grid search.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Fit the model with the best parameters found by the grid search.
params = {
        "objective": "reg:squarederror",  # Regression objective
        "booster": "gblinear",            # Use the linear booster
        "alpha": 0.1,                     # L1 regularization (Lasso)
        "lambda": 0.7,                    # L2 regularization (Ridge)
        "n_estimators": 300,              # Number of boosting rounds
        "learning_rate": 0.03,            # Conservative vs. fast
        "enable_categorical": True        # If there are categorical columns
    }
pipeline = create_preproc_and_model(numeric_features, categorical_features, params=params)

pipeline.fit(x_train , y_train)

predicciones = pipeline.predict(x_test)

# Evaluation
mse = mean_squared_error(y_test, predicciones)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predicciones)
r2 = r2_score(y_test, predicciones)

# Results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R² Score:", r2)

Mean Squared Error (MSE): 70.38003294890628
Root Mean Squared Error (RMSE): 8.389280836216313
Mean Absolute Error (MAE): 5.476769536943677
R² Score: 0.9792038141759865
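
These numbers differ from the earlier evaluation (MSE of about 70.4 versus 118.1) even though the hyperparameters are the same. One likely reason is that each cell calls `train_test_split` without a `random_state`, so the held-out rows are different each time. A minimal sketch, on made-up arrays (`x_demo` and `y_demo` are assumptions for illustration), of how fixing the seed makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for x and y.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state return the same held-out rows.
_, x_test_a, _, y_test_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
_, x_test_b, _, y_test_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

same_split = np.array_equal(y_test_a, y_test_b) and np.array_equal(x_test_a, x_test_b)
print(same_split)
```

With a fixed seed, the tuned and the refitted model would be scored on the same rows, which makes their metrics directly comparable.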


### Viewing the output predictions

In [124]:
predicciones_test = pipeline.predict(test_data)

# Show the predictions
print("Predictions for test_data:")
print(predicciones_test)

Predictions for test_data:
[176.9566  261.0208  228.08781 ... 218.80911 168.30586 279.02972]
