## Kaggle – DataTops®
Luismi has decided to make a change and has bought a laptop store. However, his only specialty is Data Science, so he has decided to build an ML model to set the best prices.

Could you help Luismi improve that model?

## Metric
Root mean squared error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points lie from the regression line; RMSE measures how spread out those residuals are. In other words, it tells you how concentrated the data are around the line of best fit.


$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\frac{d_i - f_i}{\sigma_i}\Big)^2}}$$

Here $d_i$ is the observed value, $f_i$ is the predicted value, and $\sigma_i$ is a per-observation scale term; with $\sigma_i = 1$ the expression reduces to the usual unweighted RMSE.
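As a minimal sketch of the formula, assuming $d_i$ are the observed prices, $f_i$ the predictions, and $\sigma_i = 1$ unless stated otherwise. The function name `rmse` and the toy values are hypothetical and are not part of the notebook's `functions` module.

```python
import numpy as np

def rmse(observed, predicted, sigma=1.0):
    """Root mean squared error of the (optionally scaled) residuals."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Residuals d_i - f_i, divided by the scale term sigma_i
    scaled_residuals = (observed - predicted) / sigma
    return float(np.sqrt(np.mean(scaled_residuals ** 2)))

# Toy prices in euros (made-up values)
prices = [500.0, 1000.0, 1500.0]
predictions = [450.0, 1100.0, 1500.0]
print(rmse(prices, predictions))
```

Because the residuals are squared before averaging, a single large miss weighs more than several small ones.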


Competition URL:

https://www.kaggle.com/t/dc38762d4f004b6d9301a3bbbc7640b9

## Libraries

In [15]:
import functions as fnc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from PIL import Image
import urllib.request

from sklearn.model_selection import train_test_split, cross_val_score


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from sklearn.metrics import root_mean_squared_error, mean_squared_error


## Utility functions

In [16]:
# Fit several models in one go and print their results
def execute_models(model_names, model_array, train_X, train_y, test_X, test_y):
    result = []
    best_rmse = float("inf")  # lowest RMSE seen so far (avoids shadowing the built-in min)
    idx_min = 0
    for idx, name in enumerate(model_names):
        mensaje = f"Modelo ({idx}): {name}"

        model_array[idx].fit(train_X, train_y)
        y_pred = model_array[idx].predict(test_X)

        # Each line reports [RMSE, value of the estimator's score() on the test set]
        RMSE = fnc.get_RSME(test_y, y_pred)
        mensaje += f" [{RMSE}, {model_array[idx].score(test_X, test_y)}]"
        print(mensaje)
        result.append(RMSE)

        # Keep track of the model with the lowest RMSE
        if RMSE < best_rmse:
            best_rmse = RMSE
            idx_min = idx

    print()
    print(f"> El mejor modelo es {model_names[idx_min]}: {result[idx_min]}")

    return model_array[idx_min]

In [17]:
def chequeador(df_to_submit, sample):
    """
    Checks that your submission has the shape Kaggle requires.

    If it does, the dataframe is saved to a `csv` file, ready to upload to Kaggle.

    If it does not, READ THE MESSAGE AND DO WHAT IT SAYS.

    If it still fails:
    - turn off your computer,
    - go for a walk,
    - turn it back on,
    - open this notebook and
    - read it all again.
    We all deserve a second chance. You too.
    """
    if df_to_submit.shape == sample.shape:
        # Compare the column labels element-wise (same names, same order)
        if list(df_to_submit.columns) == list(sample.columns):
            # Compare the IDs by value, ignoring the row index
            if (df_to_submit["laptop_ID"].to_numpy() == sample["laptop_ID"].to_numpy()).all():
                print("You're ready to submit!")
                df_to_submit.to_csv("./submission/submission__NEW.csv", index = False) # index = False is essential
                # urllib.request.urlretrieve("https://www.mihaileric.com/static/evaluation-meme-e0a350f278a36346e6d46b139b1d0da0-ed51e.jpg", "gfg.png")     
                # img = Image.open("./img/gfg.png")
                # img.show()   
            else:
                print("Check the ids and try again")
        else:
            print("Check the names of the columns and try again")
    else:
        print("Check the number of rows and/or columns and try again")
        print("\nSecret message from the TA: I can't believe that after this whole notebook you changed the rows of `test.csv`. I'm crying.")
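When validating a submission, note that `Series.all()` reduces a whole column to a single boolean, so comparing two `.all()` results only compares truthiness, not the values themselves. The sketch below contrasts that with an element-wise comparison on two toy frames; the data and the variable names are hypothetical.

```python
import pandas as pd

# Toy frames: same shape and column names, but different IDs
sample = pd.DataFrame({"laptop_ID": [1, 2, 3], "Price_in_euros": [0.0, 0.0, 0.0]})
wrong = pd.DataFrame({"laptop_ID": [7, 8, 9], "Price_in_euros": [500.0, 900.0, 1200.0]})

# Truthiness comparison: both sides reduce to True, so the mismatch goes unnoticed
truthiness_check = wrong["laptop_ID"].all() == sample["laptop_ID"].all()

# Element-wise comparison by value, ignoring the row index
ids_match = bool((wrong["laptop_ID"].to_numpy() == sample["laptop_ID"].to_numpy()).all())

# Column labels compared name by name, in order
columns_match = list(wrong.columns) == list(sample.columns)

print(truthiness_check, ids_match, columns_match)  # True False True
```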


## Data

In [18]:
# Read the training file
df = fnc.get_dataframe("./data/train.csv")
df


| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat | Price_in_euros |
|---|---|---|---|---|---|---|---|---|---|
| 755 | 15.6 | 8 | 1.86 | 8.0 | 4.0 | 11.0 | 6.0 | 6.0 | 539.00 |
| 618 | 15.6 | 16 | 2.59 | 5.0 | 2.0 | 11.0 | 10.0 | 6.0 | 879.01 |
| 909 | 15.6 | 8 | 2.04 | 8.0 | 4.0 | 11.0 | 10.0 | 6.0 | 900.00 |
| 2 | 13.3 | 8 | 1.34 | 2.0 | 5.0 | 11.0 | 6.0 | 7.0 | 898.94 |
| 286 | 15.6 | 4 | 2.25 | 5.0 | 4.0 | 11.0 | 4.0 | 3.0 | 428.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28 | 15.6 | 8 | 2.20 | 5.0 | 4.0 | 11.0 | 4.0 | 6.0 | 800.00 |
| 1160 | 13.3 | 8 | 1.48 | 8.0 | 1.0 | 11.0 | 6.0 | 6.0 | 1629.00 |
| 78 | 15.6 | 8 | 2.20 | 11.0 | 4.0 | 11.0 | 6.0 | 5.0 | 519.00 |
| 23 | 15.6 | 4 | 1.86 | 8.0 | 4.0 | 6.0 | 4.0 | 5.0 | 258.00 |


In [19]:
# Split the data into train_set and test_set
train_set, test_set = train_test_split(df, test_size = 0.20, random_state = 33)

# Scaled sets
train_set_scaled = fnc.transform(train_set)
test_set_scaled = fnc.transform(test_set)
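How `fnc.transform` scales the data is not shown in this notebook. As a hedged sketch of one common approach, the block below learns the mean and standard deviation from the training rows only and reuses them for the test rows, so both sets end up on the same scale. The helper names `fit_standardizer` and `apply_standardizer` and the toy values are hypothetical.

```python
import numpy as np

def fit_standardizer(train_values):
    """Learn per-column mean and population std from the training rows only."""
    train_values = np.asarray(train_values, dtype=float)
    mean = train_values.mean(axis=0)
    std = train_values.std(axis=0)          # population std (ddof=0)
    std = np.where(std == 0, 1.0, std)      # avoid dividing by zero on constant columns
    return mean, std

def apply_standardizer(values, mean, std):
    """Scale any set of rows with statistics learned elsewhere."""
    return (np.asarray(values, dtype=float) - mean) / std

# Toy columns standing in for Inches and Ram_num (made-up rows)
train = np.array([[13.3, 4.0], [15.6, 8.0], [17.3, 16.0]])
test = np.array([[14.0, 8.0]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)   # reuses the training statistics
print(test_scaled)
```

Fitting the statistics on the training rows alone keeps information from the held-out rows out of the preprocessing step.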


In [20]:
train_set.describe()

| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat | Price_in_euros |
|---|---|---|---|---|---|---|---|---|---|
| count | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 |
| mean | 14.963786 | 8.310014 | 2.022366 | 7.347051 | 3.595336 | 10.60631 | 7.120713 | 5.646091 | 1114.024582 |
| std | 1.467629 | 5.23381 | 0.675763 | 4.153179 | 1.237922 | 1.50092 | 2.178623 | 1.008679 | 697.233517 |
| min | 10.1 | 2.0 | 0.69 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 174.0 |
| 25% | 14.0 | 4.0 | 1.48 | 5.0 | 3.0 | 11.0 | 6.0 | 6.0 | 579.0 |
| 50% | 15.6 | 8.0 | 2.02 | 8.0 | 4.0 | 11.0 | 6.0 | 6.0 | 990.0 |
| 75% | 15.6 | 8.0 | 2.3 | 11.0 | 4.0 | 11.0 | 10.0 | 6.0 | 1498.0 |
| max | 18.4 | 64.0 | 4.7 | 19.0 | 6.0 | 13.0 | 11.0 | 7.0 | 6099.0 |


In [21]:
train_set_scaled.describe()

| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat | Price_in_euros |
|---|---|---|---|---|---|---|---|---|---|
| count | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 | 729.0 |
| mean | 1.871388e-15 | 9.746814e-18 | 3.4113850000000004e-17 | 7.347051 | 3.595336 | 10.60631 | 7.120713 | 5.646091 | 1114.024582 |
| std | 1.000687 | 1.000687 | 1.000687 | 4.153179 | 1.237922 | 1.50092 | 2.178623 | 1.008679 | 697.233517 |
| min | -3.316319 | -1.206453 | -3.170071 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 174.0 |
| 25% | -0.6571469 | -0.8240598 | -0.8055597 | 5.0 | 3.0 | 11.0 | 6.0 | 6.0 | 579.0 |
| 50% | 0.4337956 | -0.05927356 | 0.158257 | 8.0 | 4.0 | 11.0 | 6.0 | 6.0 | 990.0 |
| 75% | 0.4337956 | -0.05927356 | 0.5604831 | 11.0 | 4.0 | 11.0 | 10.0 | 6.0 | 1498.0 |
| max | 2.342945 | 10.64773 | 2.774863 | 19.0 | 6.0 | 13.0 | 11.0 | 7.0 | 6099.0 |
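The `std` of 1.000687 (rather than exactly 1) in the three scaled columns is consistent with a scaler that divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). This is an inference from the numbers, not something the notebook states. For n = 729 rows the expected ratio is sqrt(n / (n - 1)), which the sketch below checks, first directly and then on synthetic data.

```python
import numpy as np

n = 729  # number of training rows reported by describe()

# Ratio between sample std (ddof=1) and population std (ddof=0)
ratio = np.sqrt(n / (n - 1))
print(round(float(ratio), 6))  # 1.000687

# Same effect on synthetic data: standardize with ddof=0, then measure with ddof=1
rng = np.random.default_rng(0)
x = rng.normal(size=n)
z = (x - x.mean()) / x.std()       # population std, as many scalers use
sample_std = float(z.std(ddof=1))  # what describe() would report
print(round(sample_std, 6))
```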


## Modeling

### Split into X_train, X_test, y_train, y_test

In [22]:
X_train, y_train = fnc.get_X_y(train_set)
X_test, y_test = fnc.get_X_y(test_set)

X_train_scaled, y_train_scaled = fnc.get_X_y(train_set_scaled)
X_test_scaled, y_test_scaled = fnc.get_X_y(test_set_scaled)

In [23]:
X_train

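| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat |
|---|---|---|---|---|---|---|---|---|
| 373 | 15.6 | 8 | 2.40 | 11.0 | 2.0 | 11.0 | 10.0 | 5.0 |
| 625 | 15.6 | 16 | 2.94 | 12.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 370 | 15.6 | 8 | 2.20 | 11.0 | 4.0 | 11.0 | 10.0 | 5.0 |
| 773 | 13.3 | 4 | 1.65 | 5.0 | 4.0 | 11.0 | 6.0 | 6.0 |
| 1179 | 14.0 | 16 | 1.70 | 12.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 667 | 17.3 | 32 | 4.42 | 5.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 243 | 17.3 | 32 | 4.70 | 3.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 727 | 17.3 | 8 | 2.63 | 8.0 | 4.0 | 11.0 | 10.0 | 6.0 |
| 640 | 15.6 | 4 | 1.80 | 11.0 | 4.0 | 11.0 | 8.0 | 6.0 |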


In [24]:
X_train_scaled

| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat |
|---|---|---|---|---|---|---|---|---|
| 373 | 0.433796 | -0.059274 | 0.692356 | 11.0 | 2.0 | 11.0 | 10.0 | 5.0 |
| 625 | 0.433796 | 1.470299 | 1.321175 | 12.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 370 | 0.433796 | -0.059274 | 0.422748 | 11.0 | 4.0 | 11.0 | 10.0 | 5.0 |
| 773 | -1.134434 | -0.824060 | -0.468646 | 5.0 | 4.0 | 11.0 | 6.0 | 6.0 |
| 1179 | -0.657147 | 1.470299 | -0.376145 | 12.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 667 | 1.592922 | 4.529444 | 2.584542 | 5.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 243 | 1.592922 | 4.529444 | 2.774863 | 3.0 | 2.0 | 11.0 | 10.0 | 6.0 |
| 727 | 1.592922 | -0.059274 | 0.975919 | 8.0 | 4.0 | 11.0 | 10.0 | 6.0 |
| 640 | 0.433796 | -0.824060 | -0.199038 | 11.0 | 4.0 | 11.0 | 8.0 | 6.0 |


## Training all models

##### Models on scaled data

In [25]:
# Train every candidate model on the training data

linear_regression = LinearRegression()
ridge = Ridge(alpha = 185, random_state=33)
lasso = Lasso(alpha = 115, random_state=33)
elasticNet = ElasticNet(alpha = 110, l1_ratio = 1, random_state=33)
knn_regressor = KNeighborsRegressor(n_neighbors=10)

model_names_transform = ["LinearRegression", "Ridge", "Lasso", "ElasticNet", "KNeighborsRegressor"]
model_array_transform = [linear_regression, ridge, lasso, elasticNet, knn_regressor] 

selected_model = execute_models(model_names_transform, model_array_transform, X_train_scaled, y_train, X_test_scaled, y_test)

Modelo (0): LinearRegression [475.8087681945345, 0.46361865419742154]
Modelo (1): Ridge [467.78021077804766, 0.4815671955937523]
Modelo (2): Lasso [488.80634970233297, 0.4339139386862306]
Modelo (3): ElasticNet [488.54222357415085, 0.4345255417216016]
Modelo (4): KNeighborsRegressor [440.51691209370586, 0.540237065302875]

> El mejor modelo es KNeighborsRegressor: 440.51691209370586
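Each result line shows two numbers: the RMSE and the value returned by the estimator's `score` method, which for scikit-learn regressors is the coefficient of determination R². The two are linked through the variance of the test target, since R² = 1 - MSE / Var(y). A minimal sketch on toy numbers (made-up values, not taken from the dataset):

```python
import numpy as np

# Toy observed and predicted prices in euros
y_true = np.array([500.0, 900.0, 1300.0, 2100.0])
y_pred = np.array([600.0, 850.0, 1250.0, 1900.0])

# RMSE from the mean squared residual
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# R^2 as 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = float(1.0 - ss_res / ss_tot)

# Same value obtained from the RMSE and the population variance of the target
r2_from_rmse = float(1.0 - rmse ** 2 / np.var(y_true))

print(rmse, r2, r2_from_rmse)
```

This is why, on a fixed test set, a lower RMSE always goes together with a higher R².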


##### Models on unscaled data

In [26]:
# Train every candidate model on the training data

random_forest = RandomForestRegressor(max_depth= 5, random_state=33)
decision_tree = DecisionTreeRegressor(max_depth=5, random_state=33)
ada_boost = AdaBoostRegressor(n_estimators=200, random_state=33)
gradient_boosting = GradientBoostingRegressor() 
xgb = XGBRegressor(max_depth = 5, random_state = 33)
xgbrf = XGBRFRegressor(random_state=33)
lgbm = LGBMRegressor(max_depth= 5, verbose = -1, n_jobs= -1, random_state = 33)
cat_boost = CatBoostRegressor(n_estimators=200, loss_function='RMSE', learning_rate=0.4, verbose = False, random_state=33)
svr = SVR()

# The names must follow the same order as model_array, otherwise the results are mislabeled
model_names = ["RandomForestRegressor", "AdaBoostRegressor", "GradientBoostingRegressor", "DecisionTreeRegressor", "XGBRegressor", "XGBRFRegressor",
               "LGBMRegressor","CatBoostRegressor", "SVR"]
model_array = [random_forest, ada_boost, gradient_boosting, decision_tree, xgb, xgbrf, lgbm, cat_boost, svr]

selected_model = execute_models(model_names, model_array, X_train, y_train, X_test, y_test)


Modelo (0): RandomForestRegressor [364.2229298828056, 0.6857007454837518]
Modelo (1): AdaBoostRegressor [470.5619062451242, 0.475383051840794]
Modelo (2): GradientBoostingRegressor [333.8807415676069, 0.7358859392352846]
Modelo (3): DecisionTreeRegressor [384.2815443377197, 0.6501290868985434]
Modelo (4): XGBRegressor [283.9273297446064, 0.8090044803226362]
Modelo (5): XGBRFRegressor [353.58255816627235, 0.7037963192581403]
Modelo (6): LGBMRegressor [347.4151817520197, 0.7140392882280425]
Modelo (7): CatBoostRegressor [307.98554665492577, 0.7752656511317806]
Modelo (8): SVR [635.9678378402336, 0.041749422073664166]

> El mejor modelo es XGBRegressor: 283.9273297446064


In [27]:
xgb

##### XGB optimization

In [28]:
xgb

param_grid = {
        'n_estimators': [None, 100, 250, 500, 750],
        'max_depth': [2, 5, 10],
        'learning_rate': [None, 0.1, 0.2, 0.3, 0.4],
        'subsample': [0.3,0.6,1],
        'colsample_bytree': [0.5,1],
}

# xgb_searchcv = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=4, scoring='neg_mean_squared_error' )

xgb_searchcv = RandomizedSearchCV(estimator=xgb,
                        cv=4,
                        n_iter=10,
                        param_distributions=param_grid,
                        scoring='neg_mean_squared_error' 
                        )
xgb_searchcv.fit(X_train,y_train)

In [29]:
y_pred = xgb_searchcv.best_estimator_.predict(X_test)
print('RMSE:', fnc.get_RSME(y_test,y_pred))

RMSE: 297.742117240665
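The search above draws 10 random combinations and scores them with negative MSE. The block below is a hedged sketch of two optional variations: fixing `random_state` so the sampled combinations are repeatable, and scoring with `neg_root_mean_squared_error` so the cross-validation score is on the same scale as the competition metric. It uses a small decision tree on synthetic data as a stand-in for `XGBRegressor`; the data, the grid and the helper `run_search` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the laptop features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=33)

param_distributions = {
    "max_depth": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

def run_search(seed):
    """Run a seeded randomized search scored directly in RMSE units."""
    search = RandomizedSearchCV(
        estimator=DecisionTreeRegressor(random_state=33),
        param_distributions=param_distributions,
        n_iter=5,
        cv=4,
        scoring="neg_root_mean_squared_error",
        random_state=seed,  # fixes which combinations are sampled
    )
    search.fit(X, y)
    return search

first = run_search(33)
second = run_search(33)

# The scorer is negated so that higher is better; flip the sign to read it as RMSE
cv_rmse = -first.best_score_
print(first.best_params_, cv_rmse)
```

With the same seed, both runs sample the same combinations and therefore select the same parameters.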


## Training all models with cross_val_score

In [30]:
metricas_cv = {}
valores = []
for nombre,modelo in zip(model_names + model_names_transform, model_array + model_array_transform):
    if nombre in model_names_transform:
        metricas_cv[nombre] = cross_val_score(modelo, X_train_scaled, y_train, cv = 3, scoring = "neg_mean_squared_error")
    else:
        metricas_cv[nombre] = cross_val_score(modelo, X_train, y_train, cv = 3, scoring = "neg_mean_squared_error")
    print(f"{type(modelo)} {np.mean(metricas_cv[nombre])}")
    valores.append(np.mean(metricas_cv[nombre]))
ganador = list(metricas_cv.keys())[np.argmax(valores)]

print()
print(f"El ganador es: {ganador}")

<class 'sklearn.ensemble._forest.RandomForestRegressor'> -134809.82123749217
<class 'sklearn.ensemble._weight_boosting.AdaBoostRegressor'> -203845.61129415082
<class 'sklearn.ensemble._gb.GradientBoostingRegressor'> -115364.25116307555
<class 'sklearn.tree._classes.DecisionTreeRegressor'> -180908.51907407376
<class 'xgboost.sklearn.XGBRegressor'> -133313.68146790835
<class 'xgboost.sklearn.XGBRFRegressor'> -123290.2375445473
<class 'lightgbm.sklearn.LGBMRegressor'> -127133.26714870795
<class 'catboost.core.CatBoostRegressor'> -108488.22603859229
<class 'sklearn.svm._classes.SVR'> -472634.90691775054
<class 'sklearn.linear_model._base.LinearRegression'> -181970.29878823212
<class 'sklearn.linear_model._ridge.Ridge'> -200073.8667344934
<class 'sklearn.linear_model._coordinate_descent.Lasso'> -210382.86141898847
<class 'sklearn.linear_model._coordinate_descent.ElasticNet'> -208916.52302097576
<class 'sklearn.neighbors._regression.KNeighborsRegressor'> -212085.04999191905

El ganador es: CatBoostRegressor
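The scores above are mean `neg_mean_squared_error` values, so they are negative and in squared euros. Taking the square root of the negated value gives a figure comparable to the RMSE reported elsewhere in the notebook. The sketch below does this for three of the printed values; note that the square root of the mean MSE is not identical to the mean of the per-fold RMSEs.

```python
import numpy as np

# Mean neg_mean_squared_error values copied from the output above
cv_neg_mse = {
    "CatBoostRegressor": -108488.22603859229,
    "GradientBoostingRegressor": -115364.25116307555,
    "SVR": -472634.90691775054,
}

# Negate and take the square root to express each score in euros
cv_rmse = {name: float(np.sqrt(-score)) for name, score in cv_neg_mse.items()}

# Print from best (lowest RMSE) to worst
for name, value in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.1f}")
```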

##### CatBoost optimization

In [31]:
pool_train = Pool(X_train, y_train)
pool_test = Pool(X_test)

# Fit from a CatBoost Pool
cat_boost.fit(pool_train)
y_pred = cat_boost.predict(X_test)
print('RMSE:', fnc.get_RSME(y_test,y_pred))
# best_score_ holds the training ("learn") RMSE, not the test RMSE
print(f"Pool RMSE: {cat_boost.best_score_}")

# Fit directly from the DataFrame to compare against the Pool
cat_boost.fit(X_train, y_train)
y_pred = cat_boost.predict(X_test)

print('RMSE:', fnc.get_RSME(y_test,y_pred))
print(f"Without Pool RMSE: {cat_boost.best_score_}")

RMSE: 307.98554665492577
Pool RMSE: {'learn': {'RMSE': 122.41062600591428}}
RMSE: 307.98554665492577
Without Pool RMSE: {'learn': {'RMSE': 122.41062600591428}}


In [None]:
param_grid= {
    'n_estimators': [100, 250, 500, 750],
    'depth': [3, 6, 12],
    'learning_rate': [0.1, 0.2, 0.3, 0.4],
    'colsample_bylevel': [0.5,1],
    "border_count": [125,250]
}

# catboost_searchcv = GridSearchCV(estimator=cat_boost, param_grid=param_grid, cv=4, scoring='neg_mean_squared_error' )
catboost_searchcv = RandomizedSearchCV(estimator=cat_boost, 
                                       param_distributions=param_grid, 
                                       n_iter=10, 
                                       scoring='neg_mean_squared_error' )
catboost_searchcv.fit(X_train, y_train)

In [None]:
y_pred = catboost_searchcv.best_estimator_.predict(X_test)
print('RMSE:', fnc.get_RSME(y_test,y_pred))

## Predict ``test.csv``

In [34]:
X_predict = fnc.get_dataframe("./data/test.csv")
#X_predict_scaled = fnc.transform(X_predict)

In [35]:
X_predict

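| | Inches | Ram_num | Weight_num | Company_cat | TypeName_cat | Cpu_split_cat | Gpu_split_cat | OpSys_split_cat |
|---|---|---|---|---|---|---|---|---|
| 209 | 15.6 | 16 | 2.400 | 10.0 | 2.0 | 12.0 | 7.0 | 4.0 |
| 1281 | 15.6 | 4 | 2.400 | 1.0 | 4.0 | 11.0 | 4.0 | 2.0 |
| 1168 | 15.6 | 4 | 1.900 | 10.0 | 4.0 | 12.0 | 4.0 | 4.0 |
| 1231 | 15.6 | 8 | 2.191 | 5.0 | 1.0 | 12.0 | 4.0 | 5.0 |
| 1020 | 14.0 | 4 | 1.950 | 8.0 | 4.0 | 12.0 | 4.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 820 | 17.3 | 16 | 2.900 | 11.0 | 2.0 | 12.0 | 7.0 | 5.0 |
| 948 | 14.0 | 4 | 1.470 | 16.0 | 4.0 | 12.0 | 4.0 | 5.0 |
| 483 | 15.6 | 8 | 1.780 | 5.0 | 6.0 | 12.0 | 8.0 | 5.0 |
| 1017 | 14.0 | 4 | 1.640 | 8.0 | 4.0 | 12.0 | 4.0 | 5.0 |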


In [37]:
selected_model = xgb_searchcv.best_estimator_
selected_model

In [38]:
sample = pd.read_csv("./data/sample_submission.csv")

predictions_submit = selected_model.predict(X_predict)
submission = pd.DataFrame({"laptop_ID": X_predict.index , fnc.TARGET:predictions_submit})

chequeador(submission, sample)


You're ready to submit!
