In [1]:
import os

import matplotlib.pyplot as plt
import seaborn as sns

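# Move up one level when the notebook is not already running from "src"
if not os.getcwd().endswith('src'):
    os.chdir(os.path.dirname(os.getcwd()))

# "pd", "np" and the "Regression" class are used below without their own imports,
# so they are presumably provided by these wildcard imports
from utils.modeling import *
from utils.functions import *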


In [2]:
df_diamonds = pd.read_csv(r'data\processed\diamonds_training.csv', index_col='id')
df_predict = pd.read_csv(r'data\processed\diamonds_testing.csv', index_col='id')


# Considerations
- The topic was found on Kaggle: https://www.kaggle.com/datasets/shivam2503/diamonds

- Work started with that dataset (see the files marked "UNUSED" and "no competition")

- A competition was then found, although it has already ended: https://www.kaggle.com/competitions/diamonds-part-datamad0122/overview

- The competition files were chosen instead. The only differences are that they are split into "train" and "test" and that the target variable is scaled

- The final dataset is a list of diamonds with their characteristics, and the goal is to predict the price

- The results will be compared with those of the competition

# EDA
- The steps of this first part are described step by step, in more detail, in the notebook titled "EDA_diamonds"

- That notebook does two things:
1) Essential modifications: duplicates are removed, columns are renamed and categorical variables are converted to numeric, in both "train" and "test".

2) Optional modifications: possible changes to the training dataframe are identified and tested in order to improve the models' results. The logic of these changes is stored as functions (when they are specific to this project) or classes (when it makes sense to keep them for future analyses), which are called below as needed.
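
As an illustration of the essential modifications listed above, the following dependency-free sketch removes duplicate rows, renames columns and converts a categorical column to numbers. The raw column names (`carat`, `cut`, `x`), the target name `cut quality` and the ordinal mapping are assumptions made for this example; the actual steps live in the "EDA_diamonds" notebook and may differ.

```python
# Hypothetical raw rows; column names and values are assumptions for illustration.
raw_rows = [
    {"carat": 0.30, "cut": "Ideal", "x": 4.31},
    {"carat": 0.30, "cut": "Ideal", "x": 4.31},  # exact duplicate of the first row
    {"carat": 1.01, "cut": "Fair", "x": 6.40},
]

# Assumed renaming scheme and ordinal encoding (worst to best cut).
RENAMES = {
    "carat": "weight (carat)",
    "cut": "cut quality",
    "x": "lenght (millimeters)",
}
CUT_ORDER = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}


def essential_changes(rows):
    """Drop exact duplicates, rename columns and encode the categorical column."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip rows that were already seen
        seen.add(key)
        renamed = {RENAMES.get(col, col): value for col, value in row.items()}
        renamed["cut quality"] = CUT_ORDER[renamed["cut quality"]]
        cleaned.append(renamed)
    return cleaned


clean_rows = essential_changes(raw_rows)
print(clean_rows)
```

The same function would be applied to both "train" and "test" so that the two files keep identical columns and encodings.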

# Modeling: selecting the changes
- The dataframes with the essential modifications are imported

- Optional modifications and various models are combined until the best result is found

- The models are tested in this notebook for convenience, but they are run without this level of detail in "train.py", from where they are saved to the "model" folder

- These are the modifications that are combined:

---------- Optional changes (tested) ----------

1) Removal of extremely high outliers ("depth (percentage)", "table (percentage)", "width (millimeters)", "depth (millimeters)").

2) Removal of rows that have 0 in all the size variables ("lenght (millimeters)", "width (millimeters)" and "depth (millimeters)").

3) Removal of the moderately high outliers shared by "depth (percentage)" and "table (percentage)".

4) Replacement of the remaining 0 in "lenght (millimeters)" with the corresponding "width (millimeters)".

5) Replacement of the remaining 0 in "depth (millimeters)" with a value computed from the corresponding "lenght", "width" and "depth (percentage)".

6) Replacement of the remaining outlier in "lenght (millimeters)" with the corresponding "width (millimeters)".

7) Log transform ("weight (carat)", "lenght (millimeters)", "width (millimeters)" and "depth (millimeters)").

8) Imputation to the next highest value ("weight (carat)").

9) Imputation to the boxplot maximum and minimum values ("depth (percentage)" and "table (percentage)").

10) Neutralization of outliers with a "Ridge" model ("depth (millimeters)").

11) "MinMax" scaling.

---------- Noted changes (not tested) ----------

1) Replacement of existing values with calculated values ("depth (percentage)").

2) Dropping the columns with very high correlation ("weight (carat)", "lenght (millimeters)", "width (millimeters)" and "depth (millimeters)").

3) Imputation of the maximum values of "clarity quality" to the value one point below ("clarity quality").


## Round 1: no changes
- In this first phase, all the models are tested without any additional modification

- The usage of the "Regression" class, which inherits from "Model", is shown in more detail in this round so that it serves as an example

- As expected, the results are not particularly good, but the tree-based models win, since they are less affected by outliers

- Since the competition is scored on RMSE, that is the metric given the most weight. The leaderboard can be seen here: https://www.kaggle.com/competitions/diamonds-part-datamad0122/leaderboard
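
For reference, the five metrics reported in the result tables below can be written out by hand. This is a minimal sketch using only the standard library; the function name `regression_metrics` and the sample values are assumptions for illustration, and the project's own `evaluate_metrics` method may compute them differently.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MSE, MAE, R2 and MAPE for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": mae,
        "r2_score": 1 - ss_res / ss_tot,
        "mape": mape,
    }


# Hypothetical target values on the same scale as the scaled "price" column.
metrics = regression_metrics([8.0, 9.0, 10.0], [8.5, 9.0, 9.5])
print(metrics)
```

RMSE is simply the square root of MSE, which is why both rows of each result table always rank the models in the same order.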

In [3]:
# First, tell the class which models will be used throughout the whole process
Regression.add_models(['LinearRegression',
                        'Ridge',
                        'DecisionTreeRegressor',
                        'KNeighborsRegressor',
                        'RandomForestRegressor',
                        'SVR',
                        'XGBRegressor'
                        ]
                        )


In [4]:
# Create the "Regression" instance with the "price" column as the target
round_1 = Regression(df_diamonds, 'price')


In [5]:
# Split the dataframe with the default parameters. The splits are kept just in case
X_train, X_test, y_train, y_test = round_1.split_dataframe()


In [6]:
# Request 10 folds; the best one is used to compare the models and see which one goes furthest
# The best fold is used instead of the mean of the metrics because, out of 10 splits, the different models are likely to get different numbers of "bad splits"
# If that were the case, comparing the means would be unfair
# With the best fold, on the other hand, it is very likely that at least one of the 10 splits gets the most out of each model
# That way, the models are compared on a more equal footing
# Since the target is a regression target, the instance automatically selects "KFold" instead of "StratifiedKFold"
# A "random_state" is set for the models that require it, and it is always the same
round_1_dict = round_1.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                                    ['RandomForestRegressor', 'random_state=43'],
                                                    ['XGBRegressor', 'random_state=43']
                                                ],
                                    kfolds_num=10
                                    )


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.88 sec(s). Total time: 0.88
Starting Ridge:
- Ridge done in 0.41 sec(s). Total time: 1.29
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 15.74 sec(s). Total time: 17.03
Starting SVR:
- SVR done in 351.86 sec(s). Total time: 368.89
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.64 sec(s). Total time: 371.53
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 229.65 sec(s). Total time: 601.19
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 31.02 sec(s). Total time: 632.21
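
The "best of 10 folds" comparison described in the comments of the previous cell can be illustrated with a small, self-contained sketch. It uses synthetic data and a one-variable least-squares line instead of the project's models; the helper names (`kfold_indices`, `fit_line`, `fold_scores`) are hypothetical and do not reproduce the internals of `Regression.apply_models`.

```python
import math
import random


def kfold_indices(n_samples, n_splits, seed=43):
    """Shuffle the row indices and cut them into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fold_scores(xs, ys, n_splits=10):
    """Train on every fold but one and return the test RMSE of each fold."""
    scores = []
    for test_idx in kfold_indices(len(xs), n_splits):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        predictions = [slope * xs[i] + intercept for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], predictions))
    return scores


# Synthetic data: a noisy straight line.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.3) for x in xs]

scores = fold_scores(xs, ys)
best_rmse = min(scores)                # the value used for the comparison here
mean_rmse = sum(scores) / len(scores)  # the more common summary
print(f"best fold: {best_rmse:.4f}  mean of folds: {mean_rmse:.4f}")
```

By construction the best fold is never worse than the mean of the folds, so the figures in the tables below should be read as each model's most favourable split rather than its average behaviour.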


In [7]:
# The results, as well as the trained model, can be inspected in a dictionary
round_1_dict


In [8]:
# Next, look at the metrics
round_1_metrics = round_1.evaluate_metrics()

round_1_metrics


{'LinearRegression': {'test': array([8.069, 9.093, 8.297, ..., 9.234, 8.818, 8.368]),
  'prediction': array([8.1751789 , 8.94534424, 7.94609508, ..., 9.13303572, 8.78005708,
         8.22465851]),
  'model': LinearRegression(),
  'metrics': {'rmse': 0.222037823248969,
   'mse': 0.0493007949531404,
   'mae': 0.12289980638635319,
   'r2_score': 0.9529090797126748,
   'mape': 0.01587291534328809}},
 'Ridge': {'test': array([8.069, 9.093, 8.297, ..., 9.234, 8.818, 8.368]),
  'prediction': array([8.17485364, 8.94499768, 7.94593798, ..., 9.13327901, 8.78003269,
         8.22426296]),
  'model': Ridge(),
  'metrics': {'rmse': 0.2219231703319744,
   'mse': 0.04924989353019452,
   'mae': 0.1229616970889257,
   'r2_score': 0.9529576995138916,
   'mape': 0.015880870501013333}},
 'KNeighborsRegressor': {'test': array([8.069, 9.093, 8.297, ..., 9.234, 8.818, 8.368]),
  'prediction': array([8.2228, 9.072 , 8.0682, ..., 9.09  , 8.7678, 8.706 ]),
  'model': KNeighborsRegressor(),
  'metrics': {'rmse':

In [9]:
# For easier viewing, the metrics are put into a dataframe
# The predictions are not very good, although the r2_score is high in all cases
round_1.create_dataframe()


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.222038,0.221923,0.183865,0.209064,0.130142,0.098101,0.094787,XGBRegressor: random_state=43,LinearRegression
mse,0.049301,0.04925,0.033806,0.043708,0.016937,0.009624,0.008985,XGBRegressor: random_state=43,LinearRegression
mae,0.1229,0.122962,0.135156,0.127477,0.088698,0.067123,0.066666,XGBRegressor: random_state=43,KNeighborsRegressor
r2_score,0.952909,0.952958,0.967709,0.958252,0.983822,0.990808,0.991418,XGBRegressor: random_state=43,SVR
mape,0.015873,0.015881,0.017806,0.016523,0.011417,0.008684,0.008602,XGBRegressor: random_state=43,KNeighborsRegressor


## Round 2: scaling
- Round 1 is repeated, but this time the variables are scaled

- Except for "Ridge", "Standard" scaling improves the non-tree models more than "MinMax" does. The tree-based models do not benefit: "DecisionTree" and "RandomForest" get marginally worse and "XGBRegressor" is practically unchanged. Linear regression is not affected in either case
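
The two scalers compared in this round apply different formulas to each column. The sketch below writes both out with the standard library; the sample values are hypothetical, and the functions only mirror the arithmetic of scikit-learn's `MinMaxScaler` and `StandardScaler` under their default settings.

```python
import statistics


def min_max_scale(values):
    """Map the values linearly onto the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical "table (percentage)" values with one extreme outlier.
table_pct = [55.0, 57.0, 58.0, 60.0, 95.0]

print(min_max_scale(table_pct))
print(standard_scale(table_pct))
```

With an extreme value present, min-max scaling squeezes the ordinary values into a narrow band near 0. That is one plausible reason, not verified here, why the distance-based models ("KNeighbors" and "SVR") respond differently to the two scalers.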

In [10]:
# Same process as in the previous round, but this time with MinMax scaling
# By RMSE, the non-tree models improve (LinearRegression is unchanged)
df_diamonds_2 = df_diamonds.copy()

round_2 = Regression(df_diamonds_2, 'price')
round_2.split_dataframe(scaler='MinMaxScaler')
round_2.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_2.evaluate_metrics()
round_2.create_dataframe()


-- Regression (MinMaxScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.24 sec(s). Total time: 0.24
Starting Ridge:
- Ridge done in 0.17 sec(s). Total time: 0.41
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 3.38 sec(s). Total time: 3.8
Starting SVR:
- SVR done in 285.42 sec(s). Total time: 289.22
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.7 sec(s). Total time: 291.92
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 245.17 sec(s). Total time: 537.09
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 10.62 sec(s). Total time: 547.7


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.222038,0.217754,0.165586,0.136532,0.130527,0.098128,0.094788,XGBRegressor: random_state=43,LinearRegression
mse,0.049301,0.047417,0.027419,0.018641,0.017037,0.009629,0.008985,XGBRegressor: random_state=43,LinearRegression
mae,0.1229,0.127961,0.115148,0.093104,0.088925,0.067186,0.066664,XGBRegressor: random_state=43,Ridge
r2_score,0.952909,0.954709,0.97381,0.982195,0.983726,0.990803,0.991418,XGBRegressor: random_state=43,LinearRegression
mape,0.015873,0.016526,0.015207,0.012264,0.011447,0.008692,0.008602,XGBRegressor: random_state=43,Ridge


In [11]:
# Check whether things improve with "StandardScaler". They do (except for Ridge, which gets worse)
df_diamonds_2b = df_diamonds.copy()

round_2b = Regression(df_diamonds_2b, 'price')
round_2b.split_dataframe(scaler='StandardScaler')
round_2b.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_2b.evaluate_metrics()
round_2b.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.29 sec(s). Total time: 0.29
Starting Ridge:
- Ridge done in 0.14 sec(s). Total time: 0.43
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 4.56 sec(s). Total time: 4.99
Starting SVR:
- SVR done in 314.16 sec(s). Total time: 319.15
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.76 sec(s). Total time: 321.91
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 235.85 sec(s). Total time: 557.76
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 10.85 sec(s). Total time: 568.6


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.222038,0.221959,0.14442,0.110864,0.130774,0.098193,0.094777,XGBRegressor: random_state=43,LinearRegression
mse,0.049301,0.049266,0.020857,0.012291,0.017102,0.009642,0.008983,XGBRegressor: random_state=43,LinearRegression
mae,0.1229,0.122939,0.10546,0.083091,0.089279,0.067219,0.066659,XGBRegressor: random_state=43,Ridge
r2_score,0.952909,0.952942,0.980078,0.98826,0.983665,0.99079,0.99142,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.015873,0.015878,0.013921,0.010905,0.011496,0.008697,0.008601,XGBRegressor: random_state=43,Ridge


## Round 3: removal (+ scaling)
- All the extremely high outliers are removed ("depth (percentage)", "table (percentage)", "width (millimeters)", "depth (millimeters)")

- Rows that have 0 in all three of these columns are removed: "lenght (millimeters)", "width (millimeters)" and "depth (millimeters)"

- The moderately high outliers shared by "depth (percentage)" and "table (percentage)" are removed

- The error improves in every model, to varying degrees

In [12]:
# Apply the removals, this time with a custom function, since these changes are specific to this project
df_diamonds_3 = df_diamonds.copy()

df_diamonds_3 = remove_all(df_diamonds_3)

print(f'Deleted rows: {len(df_diamonds) - len(df_diamonds_3)}')


Deleted rows: 20


In [13]:
# Scale and test the models
round_3 = Regression(df_diamonds_3, 'price')
round_3.split_dataframe(scaler='StandardScaler')
round_3.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_3.evaluate_metrics()
round_3.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.28 sec(s). Total time: 0.28
Starting Ridge:
- Ridge done in 0.14 sec(s). Total time: 0.42
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 4.73 sec(s). Total time: 5.15
Starting SVR:
- SVR done in 339.47 sec(s). Total time: 344.62
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.08 sec(s). Total time: 347.7
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 271.57 sec(s). Total time: 619.26
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 17.66 sec(s). Total time: 636.92


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.145078,0.145065,0.139798,0.105232,0.126619,0.092584,0.090504,XGBRegressor: random_state=43,LinearRegression
mse,0.021048,0.021044,0.019543,0.011074,0.016032,0.008572,0.008191,XGBRegressor: random_state=43,LinearRegression
mae,0.112841,0.112834,0.105272,0.081063,0.088104,0.066677,0.065687,XGBRegressor: random_state=43,LinearRegression
r2_score,0.979612,0.979616,0.98107,0.989273,0.984471,0.991697,0.992066,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.014806,0.014805,0.013886,0.010641,0.011362,0.008625,0.008474,XGBRegressor: random_state=43,LinearRegression


## Round 4: assignment (+ removal and scaling)
- The changes in this round are based on two facts verified during the EDA:

1) "lenght" and "width" are generally almost identical, since the diamonds are roughly circular.

2) "depth (percentage)" is obtained (according to the dataset's author) by dividing "depth (millimeters)" by the mean of "lenght" and "width", expressed as a percentage.

- The remaining 0 in "lenght (millimeters)" is replaced with the corresponding "width"

- The remaining 0 in "depth (millimeters)" is computed with the operation mentioned above

- The remaining outlier in "lenght (millimeters)" is replaced with the corresponding "width"

- By RMSE, every model improves slightly except "DecisionTree", which gets marginally worse (0.126619 to 0.126641)
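
The two assignments above follow directly from the depth formula. The sketch below inverts that formula to recover a missing "depth (millimeters)" and falls back to the width when the length is 0. The function names and the sample row are hypothetical; the project's `assign_values` function may be implemented differently.

```python
def depth_percentage(length_mm, width_mm, depth_mm):
    """Depth as a percentage of the mean of length and width."""
    return depth_mm / ((length_mm + width_mm) / 2) * 100


def depth_mm_from_percentage(length_mm, width_mm, depth_pct):
    """Invert the formula above to recover the depth in millimeters."""
    return depth_pct / 100 * (length_mm + width_mm) / 2


def fill_length(length_mm, width_mm):
    """Use the width as a stand-in when the length is recorded as 0."""
    return width_mm if length_mm == 0 else length_mm


# Hypothetical row whose depth in millimeters was recorded as 0.
length_mm, width_mm, depth_pct = 6.50, 6.46, 61.7
recovered_depth = depth_mm_from_percentage(length_mm, width_mm, depth_pct)
print(round(recovered_depth, 4))

# Hypothetical row whose length was recorded as 0.
print(fill_length(0, 6.46))
```

Feeding the recovered depth back into `depth_percentage` returns the original 61.7, which is a quick way to check that the two functions are consistent with each other.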

In [14]:
# Apply the removal and the assignment
df_diamonds_4 = df_diamonds.copy()

df_diamonds_4 = remove_all(df_diamonds_4)
df_diamonds_4 = assign_values(df_diamonds_4)


In [15]:
# Scale and test the models
round_4 = Regression(df_diamonds_4, 'price')
round_4.split_dataframe(scaler='StandardScaler')
round_4.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_4.evaluate_metrics()
round_4.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.24 sec(s). Total time: 0.25
Starting Ridge:
- Ridge done in 0.17 sec(s). Total time: 0.41
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 6.01 sec(s). Total time: 6.42
Starting SVR:
- SVR done in 326.11 sec(s). Total time: 332.53
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.22 sec(s). Total time: 335.74
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 279.61 sec(s). Total time: 615.36
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 27.84 sec(s). Total time: 643.2


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.144352,0.144347,0.139518,0.105131,0.126641,0.092534,0.09037,XGBRegressor: random_state=43,LinearRegression
mse,0.020838,0.020836,0.019465,0.011053,0.016038,0.008563,0.008167,XGBRegressor: random_state=43,LinearRegression
mae,0.112161,0.112159,0.105166,0.081034,0.088464,0.066706,0.065605,XGBRegressor: random_state=43,LinearRegression
r2_score,0.979816,0.979817,0.981145,0.989294,0.984465,0.991706,0.992089,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.014685,0.014685,0.013874,0.010637,0.011401,0.008629,0.008473,XGBRegressor: random_state=43,LinearRegression


## Round 5: logarithm (+ assignment, removal and scaling)
- The logarithm is applied to the columns "weight (carat)", "lenght (millimeters)", "width (millimeters)" and "depth (millimeters)"

- "KNeighbors", "SVR" and "DecisionTree" improve (the last one only very slightly). "XGBRegressor" stays the same. The rest get worse

In [16]:
# Apply the adjustments
df_diamonds_5 = df_diamonds.copy()

df_diamonds_5 = remove_all(df_diamonds_5)
df_diamonds_5 = assign_values(df_diamonds_5)

log_cols = ['weight (carat)', 'lenght (millimeters)', 'width (millimeters)', 'depth (millimeters)']
df_diamonds_5[log_cols] = np.log(df_diamonds_5[log_cols])


In [17]:
# Scale and test the models
round_5 = Regression(df_diamonds_5, 'price')
round_5.split_dataframe(scaler='StandardScaler')
round_5.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_5.evaluate_metrics()
round_5.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.25 sec(s). Total time: 0.25
Starting Ridge:
- Ridge done in 0.18 sec(s). Total time: 0.43
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 6.49 sec(s). Total time: 6.91
Starting SVR:
- SVR done in 363.74 sec(s). Total time: 370.66
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.61 sec(s). Total time: 373.27
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 232.83 sec(s). Total time: 606.1
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 13.75 sec(s). Total time: 619.85


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.144715,0.144684,0.137727,0.103949,0.126594,0.092607,0.09037,XGBRegressor: random_state=43,LinearRegression
mse,0.020942,0.020933,0.018969,0.010805,0.016026,0.008576,0.008167,XGBRegressor: random_state=43,LinearRegression
mae,0.112273,0.112272,0.104003,0.079632,0.088505,0.066736,0.065606,XGBRegressor: random_state=43,LinearRegression
r2_score,0.979715,0.979723,0.981626,0.989533,0.984477,0.991693,0.992089,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.014703,0.014703,0.013615,0.010411,0.011404,0.008633,0.008473,XGBRegressor: random_state=43,Ridge


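## Round 6: "boxplot" imputations (+ logarithm, assignment, removal and scaling)
- Outliers in "weight (carat)" are imputed to the next highest value, and those in "depth (percentage)" and "table (percentage)" to the boxplot maximum and minimum

- By RMSE, there are small improvements in every model except "DecisionTree" and "RandomForest", which get slightly worse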
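
One common way to impute to the boxplot limits is to clip every value to the Tukey fences (Q1 - 1.5·IQR and Q3 + 1.5·IQR). The sketch below does that with the standard library. Treating the fences as the "boxplot maximum and minimum" is an assumption, as are the function names and the sample data; the project's `impute_boxplot_min_max` may use a different definition.

```python
import statistics


def boxplot_bounds(values, whisker=1.5):
    """Return the lower and upper Tukey fences of the values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def impute_to_bounds(values, whisker=1.5):
    """Replace every value outside the fences with the nearest fence."""
    lower, upper = boxplot_bounds(values, whisker)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical "depth (percentage)" values with one low and one high outlier.
depth_pct = [43.0, 60.5, 61.0, 61.5, 62.0, 62.5, 63.0, 79.0]

print(boxplot_bounds(depth_pct))
print(impute_to_bounds(depth_pct))
```

Unlike the removals in Round 3, this keeps every row and only pulls the extreme values back towards the bulk of the distribution.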

In [18]:
# Apply the relevant adjustments
df_diamonds_6 = df_diamonds.copy()

df_diamonds_6 = remove_all(df_diamonds_6)
df_diamonds_6 = assign_values(df_diamonds_6)

log_cols = ['weight (carat)', 'lenght (millimeters)', 'width (millimeters)', 'depth (millimeters)']
df_diamonds_6[log_cols] = np.log(df_diamonds_6[log_cols])

df_diamonds_6 = impute_next_higher(df_diamonds_6)

df_diamonds_6 = impute_boxplot_min_max(df_diamonds_6, ['depth (percentage)', 'table (percentage)'])


In [19]:
# Run the test
round_6 = Regression(df_diamonds_6, 'price')
round_6.split_dataframe(scaler='StandardScaler')
round_6.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_6.evaluate_metrics()
round_6.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.22 sec(s). Total time: 0.22
Starting Ridge:
- Ridge done in 0.12 sec(s). Total time: 0.34
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 4.03 sec(s). Total time: 4.37
Starting SVR:
- SVR done in 305.43 sec(s). Total time: 309.8
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.79 sec(s). Total time: 312.59
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 295.88 sec(s). Total time: 608.47
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 16.03 sec(s). Total time: 624.49


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.144554,0.144528,0.136073,0.103512,0.127485,0.092778,0.090047,XGBRegressor: random_state=43,LinearRegression
mse,0.020896,0.020888,0.018516,0.010715,0.016252,0.008608,0.008109,XGBRegressor: random_state=43,LinearRegression
mae,0.112193,0.112198,0.103214,0.079394,0.089111,0.066822,0.065602,XGBRegressor: random_state=43,Ridge
r2_score,0.97976,0.979767,0.982065,0.989621,0.984257,0.991662,0.992146,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.014691,0.014692,0.013516,0.010382,0.011483,0.008643,0.008463,XGBRegressor: random_state=43,Ridge


## Round 7: "ridge" imputations (+ "boxplot" imputations, logarithm, assignment, removal and scaling)
- The remaining outliers in "depth (millimeters)" are imputed by applying a "Ridge" model to "weight (carat)", "lenght (millimeters)" and "width (millimeters)", with which it is very highly correlated

- By RMSE, "DecisionTree" and "RandomForest" improve slightly and the rest, including "Ridge", get marginally worse. "XGBRegressor" improves only in MAE and MAPE
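
The idea behind this imputation can be sketched with a one-predictor ridge regression in closed form: fit on the rows that are not outliers, then overwrite the outlier with the model's prediction. This is a simplification with hypothetical data and names; the notebook uses three predictors and the project's own `apply_ridge` function, whose details are not shown here.

```python
def fit_ridge_1d(xs, ys, alpha=1.0):
    """Ridge regression with one predictor and an unpenalized intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope towards 0
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical widths and depths in millimeters; 31.8 is an obvious outlier.
width = [4.0, 5.0, 6.0, 7.0, 8.0]
depth = [2.4, 3.0, 3.6, 31.8, 4.8]
is_outlier = [False, False, False, True, False]

# Fit only on the rows that are not flagged as outliers.
train_x = [w for w, flag in zip(width, is_outlier) if not flag]
train_y = [d for d, flag in zip(depth, is_outlier) if not flag]
slope, intercept = fit_ridge_1d(train_x, train_y, alpha=1.0)

# Replace each flagged value with the model's prediction.
imputed_depth = [
    slope * w + intercept if flag else d
    for w, d, flag in zip(width, depth, is_outlier)
]
print(imputed_depth)
```

Because the predictors are so strongly correlated with "depth (millimeters)", the predicted value lands between the neighbouring observations instead of at the boxplot limit.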

In [20]:
# Apply the changes
df_diamonds_7 = df_diamonds.copy()

df_diamonds_7 = remove_all(df_diamonds_7)
df_diamonds_7 = assign_values(df_diamonds_7)

log_cols = ['weight (carat)', 'lenght (millimeters)', 'width (millimeters)', 'depth (millimeters)']
df_diamonds_7[log_cols] = np.log(df_diamonds_7[log_cols])

df_diamonds_7 = impute_next_higher(df_diamonds_7)

df_diamonds_7 = impute_boxplot_min_max(df_diamonds_7, ['depth (percentage)', 'table (percentage)'])

df_diamonds_7 = apply_ridge(df_diamonds_7)


In [21]:
# Test the models
round_7 = Regression(df_diamonds_7, 'price')
round_7.split_dataframe(scaler='StandardScaler')
round_7.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_7.evaluate_metrics()
round_7.create_dataframe()


-- Regression (StandardScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.49 sec(s). Total time: 0.49
Starting Ridge:
- Ridge done in 0.24 sec(s). Total time: 0.74
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 6.43 sec(s). Total time: 7.16
Starting SVR:
- SVR done in 357.98 sec(s). Total time: 365.14
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.83 sec(s). Total time: 368.98
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 262.53 sec(s). Total time: 631.51
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 13.35 sec(s). Total time: 644.86


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.144558,0.144535,0.136088,0.103523,0.127068,0.092686,0.090426,XGBRegressor: random_state=43,LinearRegression
mse,0.020897,0.02089,0.01852,0.010717,0.016146,0.008591,0.008177,XGBRegressor: random_state=43,LinearRegression
mae,0.112202,0.112207,0.103228,0.079388,0.088676,0.066733,0.065449,XGBRegressor: random_state=43,Ridge
r2_score,0.979758,0.979765,0.982061,0.989619,0.98436,0.991679,0.99208,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.014692,0.014693,0.013518,0.010381,0.011428,0.008633,0.008445,XGBRegressor: random_state=43,Ridge


## Round 8: substitution (+ partial removal and assignment)
- The values of "depth (percentage)" are replaced with the actual results of the calculation from the corresponding columns

- As seen in the EDA, this produces many new outliers. The change is therefore applied to the original dataframe without other modifications, to check whether it brings an improvement at least in the tree-based models

- The only additional modification that cannot be skipped is the handling of the 0 values in those columns

- To check whether there really is an improvement, the models are run twice: once with only the partial removal and assignment, and once with the substitution added

- By RMSE, only "SVR" and "RandomForest" improve, and only marginally

In [22]:
# Apply the changes
df_diamonds_8a = df_diamonds.copy()

df_diamonds_8a = assign_values(df_diamonds_8a)

df_diamonds_8a = remove_all(df_diamonds_8a, zeros_only=True)

df_diamonds_8b = df_diamonds_8a.copy()

df_diamonds_8b['depth (percentage)'] = (df_diamonds_8b['depth (millimeters)'] / ((df_diamonds_8b['lenght (millimeters)']+df_diamonds_8b['width (millimeters)']) / 2)) * 100


In [23]:
# Test only the partial removal and the assignment
round_8a = Regression(df_diamonds_8a, 'price')
round_8a.split_dataframe()
round_8a.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_8a.evaluate_metrics()
round_8a.create_dataframe()


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.31 sec(s). Total time: 0.31
Starting Ridge:
- Ridge done in 0.17 sec(s). Total time: 0.48
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 6.99 sec(s). Total time: 7.47
Starting SVR:
- SVR done in 408.32 sec(s). Total time: 415.79
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.09 sec(s). Total time: 418.87
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 287.86 sec(s). Total time: 706.73
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 14.35 sec(s). Total time: 721.08


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.154359,0.154318,0.178104,0.161348,0.12791,0.092733,0.09112,XGBRegressor: random_state=43,KNeighborsRegressor
mse,0.023827,0.023814,0.031721,0.026033,0.016361,0.008599,0.008303,XGBRegressor: random_state=43,KNeighborsRegressor
mae,0.115958,0.11595,0.133578,0.12324,0.088054,0.066104,0.065954,XGBRegressor: random_state=43,KNeighborsRegressor
r2_score,0.97703,0.977042,0.96942,0.974903,0.984227,0.99171,0.991996,XGBRegressor: random_state=43,SVR
mape,0.015024,0.015023,0.017619,0.016011,0.011365,0.00856,0.008513,XGBRegressor: random_state=43,KNeighborsRegressor


In [24]:
# Test the substitution
round_8b = Regression(df_diamonds_8b, 'price')
round_8b.split_dataframe()
round_8b.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_8b.evaluate_metrics()
round_8b.create_dataframe()


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.39 sec(s). Total time: 0.39
Starting Ridge:
- Ridge done in 0.19 sec(s). Total time: 0.58
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 6.19 sec(s). Total time: 6.78
Starting SVR:
- SVR done in 465.98 sec(s). Total time: 472.75
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.64 sec(s). Total time: 476.4
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 327.77 sec(s). Total time: 804.17
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 18.2 sec(s). Total time: 822.36


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.155832,0.156025,0.178175,0.160862,0.128324,0.092704,0.091325,XGBRegressor: random_state=43,KNeighborsRegressor
mse,0.024284,0.024344,0.031746,0.025877,0.016467,0.008594,0.00834,XGBRegressor: random_state=43,KNeighborsRegressor
mae,0.1198,0.119931,0.133753,0.123338,0.088686,0.066106,0.066187,RandomForestRegressor: random_state=43,KNeighborsRegressor
r2_score,0.97659,0.976532,0.969395,0.975054,0.984125,0.991715,0.99196,XGBRegressor: random_state=43,SVR
mape,0.015685,0.015691,0.017646,0.016023,0.011451,0.008564,0.008551,XGBRegressor: random_state=43,KNeighborsRegressor


## Round 9: dropping columns
- The columns with a very high correlation, close to 1, are dropped outright ("weight (carat)", "lenght (millimeters)", "width (millimeters)" and "depth (millimeters)")

- The unmodified, unscaled "dataframe" is used as the baseline for comparison

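- With the highly correlated columns dropped, the non-tree models are reported to improve radically compared with any other round, while "DecisionTree" improves slightly and "RandomForest" gets worse compared with round 1. The table below, however, shows an RMSE close to 1 and an R² below 0.13 for every model, far worse than the round 8 and round 10 tables, so this conclusion should be rechecked against round 1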

- When the columns with negligible correlation are dropped as well, only "DecisionTree" and "RandomForest" improve
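
A minimal sketch of how such columns could be flagged, assuming the correlation in question is the absolute Pearson correlation with the target. The helper name `flag_by_target_correlation`, the thresholds and the toy columns are illustrative only; none of them come from the notebook or from `utils`.

```python
import pandas as pd


def flag_by_target_correlation(df, target, high=0.95, low=0.05):
    """Return (columns with |corr| >= high, columns with |corr| <= low) against the target."""
    # Absolute Pearson correlation of every feature with the target column
    corr = df.corr()[target].drop(target).abs()
    too_high = corr[corr >= high].index.tolist()
    negligible = corr[corr <= low].index.tolist()
    return too_high, negligible


# Toy data (made up): "size" tracks the price, "noise" is uncorrelated with it
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "noise": [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0],
    "table": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
    "price": [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0],
})

too_high, negligible = flag_by_target_correlation(toy, "price")
print(too_high, negligible)
```

The two lists correspond to the two candidate sets tried in this round: the first to the columns dropped in 9a, the second to the extra columns dropped in 9b.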


In [25]:
# Run the test
df_diamonds_9a = df_diamonds.copy()

df_diamonds_9a = df_diamonds_9a.drop(columns=['weight (carat)', 'lenght (millimeters)', 'width (millimeters)', 'depth (millimeters)'])

round_9a = Regression(df_diamonds_9a, 'price')
round_9a.split_dataframe()
round_9a.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_9a.evaluate_metrics()
round_9a.create_dataframe()


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.29 sec(s). Total time: 0.29
Starting Ridge:
- Ridge done in 0.15 sec(s). Total time: 0.44
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 2.49 sec(s). Total time: 2.93
Starting SVR:
- SVR done in 791.47 sec(s). Total time: 794.4
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 2.09 sec(s). Total time: 796.49
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 156.22 sec(s). Total time: 952.71
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 14.68 sec(s). Total time: 967.39


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.979343,0.979343,1.027563,0.986855,1.187047,1.025191,0.957208,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mse,0.959112,0.959112,1.055885,0.973882,1.40908,1.051016,0.916246,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mae,0.814501,0.814502,0.827875,0.80344,0.911736,0.816076,0.787496,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
r2_score,0.083879,0.08388,-0.008556,0.069771,-0.345919,-0.003905,0.124824,XGBRegressor: random_state=43,RandomForestRegressor: random_state=43
mape,0.107044,0.107044,0.108093,0.105367,0.1189,0.106677,0.103121,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43


In [26]:
# To get more out of this round, also test dropping the columns whose correlation is close to 0
df_diamonds_9b = df_diamonds_9a.copy()

df_diamonds_9b = df_diamonds_9b.drop(columns=['cut quality', 'depth (percentage)'])

round_9b = Regression(df_diamonds_9b, 'price')
round_9b.split_dataframe()
round_9b.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_9b.evaluate_metrics()
round_9b.create_dataframe()


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.17 sec(s). Total time: 0.17
Starting Ridge:
- Ridge done in 0.11 sec(s). Total time: 0.29
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 1.41 sec(s). Total time: 1.69
Starting SVR:
- SVR done in 900.09 sec(s). Total time: 901.78
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 1.79 sec(s). Total time: 903.57
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 48.73 sec(s). Total time: 952.31
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 14.07 sec(s). Total time: 966.37


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.979538,0.979538,1.055886,0.987238,0.979795,0.974961,0.970567,XGBRegressor: random_state=43,KNeighborsRegressor
mse,0.959495,0.959495,1.114896,0.974639,0.959999,0.950548,0.942,XGBRegressor: random_state=43,KNeighborsRegressor
mae,0.814723,0.814723,0.863278,0.803094,0.807722,0.80519,0.801743,XGBRegressor: random_state=43,KNeighborsRegressor
r2_score,0.083514,0.083514,-0.064921,0.069048,0.083032,0.09206,0.100225,XGBRegressor: random_state=43,DecisionTreeRegressor: random_state=43
mape,0.107069,0.107069,0.113267,0.105308,0.106017,0.105685,0.105264,XGBRegressor: random_state=43,KNeighborsRegressor
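
The comparison between the two tables of this round can be done mechanically. The sketch below copies the RMSE rows of rounds 9a and 9b into plain dictionaries and lists the models whose error dropped. The helper `improved_models` is a hypothetical name, not part of `utils`.

```python
# RMSE per model, copied from the round 9a and round 9b result tables
rmse_9a = {
    "LinearRegression": 0.979343,
    "Ridge": 0.979343,
    "KNeighborsRegressor": 1.027563,
    "SVR": 0.986855,
    "DecisionTreeRegressor": 1.187047,
    "RandomForestRegressor": 1.025191,
    "XGBRegressor": 0.957208,
}
rmse_9b = {
    "LinearRegression": 0.979538,
    "Ridge": 0.979538,
    "KNeighborsRegressor": 1.055886,
    "SVR": 0.987238,
    "DecisionTreeRegressor": 0.979795,
    "RandomForestRegressor": 0.974961,
    "XGBRegressor": 0.970567,
}


def improved_models(before, after):
    """Models whose error is lower in `after` than in `before` (lower RMSE is better)."""
    return [model for model in before if after[model] < before[model]]


improved = improved_models(rmse_9a, rmse_9b)
print(improved)  # ['DecisionTreeRegressor', 'RandomForestRegressor']
```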


## Round 10: "clarity quality" imputations
- Values of 7 in "clarity quality" are imputed to 6. This is done because the EDA showed that the size-related variables ('weight (carat)', 'lenght (millimeters)', 'width (millimeters)' and 'depth (millimeters)') stop decreasing from level 6 onwards

- The unmodified, unscaled "dataframe" is used as the baseline for comparison

- "LinearRegression", "Ridge" and "KNeighbors" improve compared with round 1
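
As a rough illustration of the reasoning, the sketch below uses made-up toy values (not the real dataset) to show the kind of check that motivates the change: the mean weight per clarity level stops decreasing at level 6, so level 7 is merged into it. Using `clip(upper=6)` is an assumption of this sketch. It gives the same result as assigning 6 to the rows equal to 7, provided 7 is the highest level.

```python
import pandas as pd

# Toy values (made up): weight falls as clarity rises, then flattens at level 6
toy = pd.DataFrame({
    "clarity quality": [4, 4, 5, 5, 6, 6, 7, 7],
    "weight (carat)": [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.50, 0.40],
})

# Mean weight per clarity level, the check the EDA is said to perform
mean_weight = toy.groupby("clarity quality")["weight (carat)"].mean()
print(mean_weight)

# Level 7 is no lighter than level 6 here, so both levels are merged into 6
merged = toy.copy()
merged["clarity quality"] = merged["clarity quality"].clip(upper=6)
print(merged["clarity quality"].value_counts().sort_index())
```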

In [27]:
# Check it
df_diamonds_10 = df_diamonds.copy()

df_diamonds_10.loc[df_diamonds_10['clarity quality'] == 7, 'clarity quality'] = 6

round_10 = Regression(df_diamonds_10, 'price')
round_10.split_dataframe()
round_10.apply_models(params_list=[['DecisionTreeRegressor', 'random_state=43'],
                                    ['RandomForestRegressor', 'random_state=43'],
                                    ['XGBRegressor', 'random_state=43']
                                    ],
                        kfolds_num=10
                    )
round_10.evaluate_metrics()
round_10.create_dataframe()


-- Regression: using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.4 sec(s). Total time: 0.4
Starting Ridge:
- Ridge done in 0.17 sec(s). Total time: 0.57
Starting KNeighborsRegressor:
- KNeighborsRegressor done in 8.32 sec(s). Total time: 8.89
Starting SVR:
- SVR done in 529.65 sec(s). Total time: 538.54
Starting DecisionTreeRegressor: random_state=43:
- DecisionTreeRegressor: random_state=43 done in 3.34 sec(s). Total time: 541.88
Starting RandomForestRegressor: random_state=43:
- RandomForestRegressor: random_state=43 done in 382.31 sec(s). Total time: 924.18
Starting XGBRegressor: random_state=43:
- XGBRegressor: random_state=43 done in 50.09 sec(s). Total time: 974.28


Unnamed: 0,LinearRegression,Ridge,KNeighborsRegressor,SVR,DecisionTreeRegressor: random_state=43,RandomForestRegressor: random_state=43,XGBRegressor: random_state=43,BEST,WORST
rmse,0.221734,0.221619,0.183265,0.209357,0.130545,0.099294,0.094895,XGBRegressor: random_state=43,LinearRegression
mse,0.049166,0.049115,0.033586,0.04383,0.017042,0.009859,0.009005,XGBRegressor: random_state=43,LinearRegression
mae,0.122548,0.122607,0.135284,0.127694,0.08999,0.068342,0.067275,XGBRegressor: random_state=43,KNeighborsRegressor
r2_score,0.953038,0.953086,0.967919,0.958134,0.983722,0.990583,0.991399,XGBRegressor: random_state=43,SVR
mape,0.015838,0.015845,0.017839,0.016556,0.011599,0.008856,0.008688,XGBRegressor: random_state=43,KNeighborsRegressor


# Modeling: model selection

- Total number of models tested with different changes up to this point: 78

- Each model improves with the following changes:

---------- LinearRegression ----------

· Removal

· Assignment

· "boxplot" imputations

· Dropping high-correlation columns

· "clarity quality" imputations

---------- Ridge ----------

· "MinMax" scaling

· Removal

· Assignment

· "boxplot" imputations

· Dropping high-correlation columns

· "clarity quality" imputations

---------- KNeighborsRegressor ----------

· "Standard" scaling

· Removal

· Logarithm

· "boxplot" imputations

· Dropping high-correlation columns

· "clarity quality" imputations

---------- SVR ----------

· "Standard" scaling

· Removal

· Assignment

· Logarithm

· "boxplot" imputations

· Substitution

· Dropping high-correlation columns

---------- DecisionTree ----------

· Removal

· Logarithm

· "ridge" imputations

· Dropping negligible-correlation columns

---------- RandomForest ----------

· Removal

· Assignment

· "ridge" imputations

· Substitution

· Dropping negligible-correlation columns

---------- XGBRegressor ----------

· "Standard" scaling

· Removal

· Assignment

· "boxplot" imputations


## "Ridge" and "LinearRegression"

- Since both improve with the same changes, they are tested together
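
The overlap between the two models can be read off the listing with a set intersection. In the sketch below the labels are English shorthand for the entries in the listing, and the dictionary itself is only an illustration, not a structure used by the notebook.

```python
# Changes that improved each model, transcribed from the listing above
changes = {
    "LinearRegression": {
        "removal",
        "assignment",
        "boxplot imputations",
        "drop high-correlation columns",
        "clarity quality imputations",
    },
    "Ridge": {
        "MinMax scaling",
        "removal",
        "assignment",
        "boxplot imputations",
        "drop high-correlation columns",
        "clarity quality imputations",
    },
}

# Changes both models share, and the ones only "Ridge" benefits from
shared = changes["LinearRegression"] & changes["Ridge"]
ridge_only = changes["Ridge"] - changes["LinearRegression"]

print(sorted(shared))
print(sorted(ridge_only))  # ['MinMax scaling']
```

The only difference is the scaler, which is why a single "dataframe" can serve both models.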

In [28]:
# Apply every change the two listings have in common
df_ridge_linear = df_diamonds.copy()

df_ridge_linear = remove_all(df_ridge_linear)
df_ridge_linear = assign_values(df_ridge_linear)

df_ridge_linear = impute_next_higher(df_ridge_linear, log=False)

df_ridge_linear = impute_boxplot_min_max(df_ridge_linear, ['depth (percentage)', 'table (percentage)'])

df_ridge_linear = df_ridge_linear.drop(columns=['weight (carat)', 'lenght (millimeters)', 'width (millimeters)', 'depth (millimeters)'])

df_ridge_linear.loc[df_ridge_linear['clarity quality'] == 7, 'clarity quality'] = 6


In [29]:
# Both are tested on a shared "dataframe"
# Since "Ridge" improves most with "MinMax" scaling and linear regression is not affected by it, that scaler is used
ridge_linear = Regression(df_ridge_linear, 'price')
ridge_linear.split_dataframe(scaler='MinMaxScaler')
ridge_linear.apply_models(selected_list=['LinearRegression', 'Ridge'],
                        kfolds_num=10
                    )
ridge_linear.evaluate_metrics()
ridge_linear.create_dataframe()


-- Regression (MinMaxScaler): using best of 10 folds --
Starting LinearRegression:
- LinearRegression done in 0.35 sec(s). Total time: 0.35
Starting Ridge:
- Ridge done in 0.18 sec(s). Total time: 0.53


Unnamed: 0,LinearRegression,Ridge,BEST,WORST
rmse,0.970784,0.970792,LinearRegression,Ridge
mse,0.942422,0.942437,LinearRegression,Ridge
mae,0.804835,0.804876,LinearRegression,Ridge
r2_score,0.087137,0.087123,LinearRegression,Ridge
mape,0.105599,0.105605,LinearRegression,Ridge
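
The comment in the cell above relies on ordinary least squares being insensitive to feature scaling, while ridge is not. The sketch below illustrates this on synthetic data with a closed-form solver written in NumPy. It is not the `Regression` class or scikit-learn's implementation, and `min_max` and `fit_predict` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def min_max(X):
    """Rescale every column to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


def fit_predict(X, y, alpha=0.0):
    """Closed-form linear fit with an unpenalized intercept; alpha > 0 gives ridge."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Solve (Xc'Xc + alpha * I) coef = Xc'(y - mean(y))
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                           Xc.T @ (y - y_mean))
    return Xc @ coef + y_mean


# Synthetic data: the second feature lives on a much larger scale than the first
rng = np.random.default_rng(43)
X = rng.normal(size=(50, 2)) * [1.0, 100.0]
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Least squares: in-sample predictions match with and without scaling
ols_same = np.allclose(fit_predict(X, y), fit_predict(min_max(X), y))
# Ridge: the penalty acts on the coefficients, so scaling changes the fit
ridge_same = np.allclose(fit_predict(X, y, alpha=1.0),
                         fit_predict(min_max(X), y, alpha=1.0))

print(ols_same, ridge_same)  # True False
```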
