# Project Stage 1. The regression task: polynomial and regularized models
**By: Alejandra Ossa Yepes**

This project builds a predictive model to estimate demand for a bike rental system. That knowledge can support service improvements and reveal the factors that affect the system's efficiency. Promoting sustainable mobility plans is one way to reduce CO2 emissions, which affect the planet's temperature and disrupt its natural cycle.

1. Apply regression techniques to build a predictive model that estimates demand for a bike rental system, following the machine learning cycle.
2. Determine, based on the data, which factors most influence demand.


In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import LinearRegression, Lasso,  Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, RobustScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline

### 1. Data loading

In [2]:
data_raw = pd.read_csv('Datos_Etapa-1.csv', sep = ',')

In [3]:
data_raw.shape

(17379, 9)

In [4]:
data_raw.head()

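|   | season | weekday | weathersit | temp | atemp | hum | windspeed | cnt | time_of_day |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Winter | 6 | Clear | 3.28 | 3.0014 | 0.81 | 0.0 | 16 | Night |
| 1 | Winter | 6 | Clear | 2.34 | 1.9982 | 0.8 | 0.0 | 40 | Night |
| 2 | Winter | 6 | Clear | 2.34 | 1.9982 | 0.8 | 0.0 | 32 | Night |
| 3 | Winter | 6 | Clear | 3.28 | 3.0014 | 0.75 | 0.0 | 13 | Night |
| 4 | Winter | 6 | Clear | 3.28 | 3.0014 | 0.75 | 0.0 | 1 | Night |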


We will use the variable `data` to hold a modified copy of the dataset.

In [5]:
data = data_raw.copy()  # independent copy, so later changes to data do not alter data_raw

### 2. Data preparation

Convert all categorical variables into numeric variables that the model can interpret, and remove duplicate rows.

In [6]:
data['season'].value_counts()

Summer    4496
Spring    4409
Winter    4242
Fall      4232
Name: season, dtype: int64

In [7]:
data['weathersit'].value_counts()

Clear         11413
Mist           4544
Light Rain     1419
Heavy Rain        3
Name: weathersit, dtype: int64

In [8]:
data['time_of_day'].value_counts()

Night      6471
Morning    5805
Evening    5103
Name: time_of_day, dtype: int64

In [9]:
p21 = data.isna().sum()
p21

season         0
weekday        0
weathersit     0
temp           0
atemp          0
hum            0
windspeed      0
cnt            0
time_of_day    0
dtype: int64

In [10]:
data.duplicated().sum()

42

In [11]:
data = data.drop_duplicates()  # drop the exact duplicate rows counted above

In [12]:
data.shape

(17337, 9)

In [13]:
data = pd.get_dummies(data)

In [14]:
data.shape

(17337, 17)
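
The jump from 9 to 17 columns comes from one-hot encoding: each of the three categorical columns is replaced by one indicator column per category (4 seasons + 4 weather situations + 3 times of day = 11), while the 6 numeric columns are kept. The following is a minimal, library-free sketch of that idea. The helper `one_hot` and the toy rows are hypothetical and only mimic what `pd.get_dummies` does.

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy rows, loosely modeled on the dataset's columns.
toy_rows = [
    {"temp": 3.28, "season": "Winter"},
    {"temp": 21.0, "season": "Summer"},
    {"temp": 12.5, "season": "Fall"},
]

encoded_rows = one_hot(toy_rows, "season")
for row in encoded_rows:
    print(row)
```

Each encoded row has exactly one indicator set to 1 within a group, which is why the `season_*`, `weathersit_*` and `time_of_day_*` columns in the real data always sum to 1 per group.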

In [15]:
data.head()

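|   | weekday | temp | atemp | hum | windspeed | cnt | season_Fall | season_Spring | season_Summer | season_Winter | weathersit_Clear | weathersit_Heavy Rain | weathersit_Light Rain | weathersit_Mist | time_of_day_Evening | time_of_day_Morning | time_of_day_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 3.28 | 3.0014 | 0.81 | 0.0 | 16 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 6 | 2.34 | 1.9982 | 0.8 | 0.0 | 40 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 6 | 2.34 | 1.9982 | 0.8 | 0.0 | 32 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 6 | 3.28 | 3.0014 | 0.75 | 0.0 | 13 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 6 | 3.28 | 3.0014 | 0.75 | 0.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |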


### 3. Data split

We now split the resulting dataset into a training set and a test set, using 80% of the data for training and the remaining 20% for testing. The target variable is `cnt`.
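
As an illustration of what an 80/20 shuffled split does, here is a library-free sketch. It is not scikit-learn's implementation: the helper `shuffle_split` is hypothetical, rounding the test size up is an assumption of this sketch, and the seed produces a different shuffle than `train_test_split(random_state=77)`.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.2, seed=77):
    """Return (train_idx, test_idx) after shuffling the row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Rounding the test size up is an assumption of this sketch.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


# 17337 is the number of rows left after removing duplicates.
train_idx, test_idx = shuffle_split(17337)
print(len(train_idx), len(test_idx))  # 13869 3468
```

Every row index lands in exactly one of the two sets, so the test rows are never seen during training.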

In [16]:
x = data.drop(['cnt'], axis=1)
y = data['cnt']

In [17]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=77)

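### 4. Scaling (min-max normalization)
We use a `MinMaxScaler()` object, which by default rescales each variable to the range [0, 1]. The scaler is fit on the training set only and then applied unchanged to the test set, so no information from the test set leaks into training.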
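
The rescaling can be sketched without any library as x' = (x - min) / (max - min), where min and max are learned from the training column. The helpers `fit_min_max` and `apply_min_max` and the toy values are hypothetical and only illustrate the mechanism.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of a training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Rescale values with x' = (x - lo) / (hi - lo)."""
    return [(x - lo) / (hi - lo) for x in column]


train_temp = [2.0, 10.0, 18.0]  # hypothetical training values
test_temp = [6.0, 22.0]         # hypothetical test values

lo, hi = fit_min_max(train_temp)  # statistics come from the training set only
scaled_train = apply_min_max(train_temp, lo, hi)
scaled_test = apply_min_max(test_temp, lo, hi)
print(scaled_train)  # [0.0, 0.5, 1.0]
print(scaled_test)   # [0.25, 1.25]
```

Note that a test value above the training maximum maps outside [0, 1]; the [0, 1] range is only guaranteed for the data the scaler was fit on.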

In [18]:
scaler = MinMaxScaler()

In [19]:
columns = x_train.columns
x_train = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train, columns=columns)

In [20]:
x_train.head()

|   | weekday | temp | atemp | hum | windspeed | season_Fall | season_Spring | season_Summer | season_Winter | weathersit_Clear | weathersit_Heavy Rain | weathersit_Light Rain | weathersit_Mist | time_of_day_Evening | time_of_day_Morning | time_of_day_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.367347 | 0.3939 | 0.94 | 0.263195 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.166667 | 0.673469 | 0.6364 | 0.74 | 0.193018 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.833333 | 0.591837 | 0.5909 | 0.73 | 0.228047 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.5 | 0.530612 | 0.5152 | 0.94 | 0.12284 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.530612 | 0.5152 | 0.68 | 0.105325 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


In [21]:
x_test = scaler.transform(x_test)
x_test = pd.DataFrame(x_test, columns=columns)

In [22]:
x_test.head()

|   | weekday | temp | atemp | hum | windspeed | season_Fall | season_Spring | season_Summer | season_Winter | weathersit_Clear | weathersit_Heavy Rain | weathersit_Light Rain | weathersit_Mist | time_of_day_Evening | time_of_day_Morning | time_of_day_Night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.166667 | 0.367347 | 0.3939 | 0.76 | 0.333373 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.833333 | 0.408163 | 0.4242 | 0.77 | 0.228047 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.693878 | 0.6515 | 0.65 | 0.350888 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.5 | 0.77551 | 0.7727 | 0.7 | 0.228047 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.5 | 0.387755 | 0.4091 | 0.4 | 0.421065 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |


### 5. Linear regression (baseline)

**Training a baseline model**
- With the prepared dataset, we start by training a baseline model: a plain multiple linear regression without polynomial terms or regularization. It shows how well the simplest model performs on this data and gives a reference point for the later models.

In [23]:
regresion = LinearRegression()

In [24]:
regresion.fit(x_train,y_train)

LinearRegression()

In [25]:
list(zip(x_train.columns, regresion.coef_))

[('weekday', 12.816725501738333),
 ('temp', 225.6772130816975),
 ('atemp', 113.45289270976444),
 ('hum', -145.45922526098028),
 ('windspeed', -8.206862182863528),
 ('season_Fall', 39.912409951478615),
 ('season_Spring', 5.330225609480314),
 ('season_Summer', -25.506469980872406),
 ('season_Winter', -19.736165580086027),
 ('weathersit_Clear', 11.073613741749526),
 ('weathersit_Heavy Rain', 14.93623031449973),
 ('weathersit_Light Rain', -33.289752877014216),
 ('weathersit_Mist', 7.279908820764852),
 ('time_of_day_Evening', 92.34233678998645),
 ('time_of_day_Morning', 4.956676442785875),
 ('time_of_day_Night', -97.29901323277247)]

In [26]:
y_pred_RL = regresion.predict(x_test)
# RMSE as the square root of the MSE (`squared=False` is deprecated since scikit-learn 1.4)
RMSE_RL = np.sqrt(mean_squared_error(y_test, y_pred_RL))
MAEP_RL = mean_absolute_error(y_test, y_pred_RL)
R2_RL =  r2_score(y_test, y_pred_RL)
print('------ Linear regression model ----')
print("RMSE: %.2f" % RMSE_RL)
print("MAE: %.2f" % MAEP_RL)
print('R²: %.2f' % R2_RL)

------ Linear regression model ----
RMSE: 139.81
MAE: 103.68
R²: 0.42
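
For reference, the three metrics reported above can be computed by hand. The sketch below uses only the standard library; the function name `regression_metrics` and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R²) for two equally long sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)  # penalizes large errors more
    mae = sum(abs(r) for r in residuals) / n              # average absolute error
    mean_true = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return rmse, mae, 1 - ss_res / ss_tot                 # R²: share of variance explained


# Hypothetical toy values, not taken from the dataset.
rmse, mae, r2 = regression_metrics([3, 5, 7, 9], [2, 5, 8, 12])
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R²: {r2:.2f}")
```

RMSE and MAE are expressed in the units of the target (here, rentals), while R² is unitless and equals 1 for a perfect fit and 0 for a model that always predicts the mean.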


### 6. Polynomial regression
- With the prepared dataset, we train a multivariate polynomial regression model.
- **Hyperparameter search:** Because hyperparameter values directly affect the performance of the resulting model, we search over the polynomial degrees `[2, 3]` to find the value that gives the best cross-validated performance.
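
To get a feel for what the degree controls, the sketch below expands one sample into all monomials up to a given degree and counts how many terms the 16 predictors produce. It is a library-free illustration with hypothetical helper names, not the code of `PolynomialFeatures`. Including the constant term is an assumption that mirrors that class's default `include_bias=True`.

```python
from itertools import combinations_with_replacement
from math import comb


def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= `degree`, constant term included."""
    return comb(n_features + degree, degree)


def polynomial_terms(sample, degree):
    """Expand one sample into all monomials up to `degree` (constant term first)."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(sample, d):
            product = 1.0
            for value in combo:
                product *= value
            terms.append(product)
    return terms


# For a sample (a, b) = (2, 3): 1, a, b, a², ab, b²
print(polynomial_terms([2.0, 3.0], degree=2))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
print(n_polynomial_terms(16, 2), n_polynomial_terms(16, 3))  # 153 969
```

The number of terms grows quickly with the degree, which is one reason the search is restricted to low degrees.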

In [27]:
degrees = [2,3]

In [28]:
def PolynomialRegression(degree=2, **kwargs):
    # Pipeline: polynomial feature expansion followed by ordinary least squares.
    # The default degree is only a placeholder; GridSearchCV sets it for each candidate.
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

In [29]:
param_grid = {'polynomialfeatures__degree':degrees}
kfold = KFold(n_splits=10, shuffle=True, random_state = 0)

In [30]:
param_grid

{'polynomialfeatures__degree': [2, 3]}

In [31]:
modelos_grid_p = GridSearchCV(PolynomialRegression(), param_grid, cv=kfold, n_jobs=-1, scoring = 'neg_root_mean_squared_error')

In [32]:
modelos_grid_p.fit(x_train, y_train)

GridSearchCV(cv=KFold(n_splits=10, random_state=0, shuffle=True),
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression())]),
             n_jobs=-1, param_grid={'polynomialfeatures__degree': [2, 3]},
             scoring='neg_root_mean_squared_error')

In [33]:
print("Best parameter: ", modelos_grid_p.best_params_)

Best parameter:  {'polynomialfeatures__degree': 2}
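
The selection above relies on 10-fold cross-validation: the training data is shuffled and cut into 10 folds, each candidate is trained on 9 folds and scored on the remaining one, and the candidate with the best average score wins. The sketch below illustrates those two steps with the standard library only. The helpers `kfold_indices` and `select_best` are hypothetical, and the per-fold errors are made-up numbers used purely to show the selection step.

```python
import random


def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def select_best(candidates, cv_error):
    """Return the candidate with the lowest mean validation error."""
    return min(candidates, key=cv_error)


# Made-up per-fold errors for two candidate degrees (illustration only).
fake_fold_error = {2: [1.0, 1.2, 1.1], 3: [1.4, 1.9, 1.3]}
best_degree = select_best(
    [2, 3], lambda d: sum(fake_fold_error[d]) / len(fake_fold_error[d])
)
splits = list(kfold_indices(25, n_splits=10))
print(best_degree, len(splits))
```

Because the choice is made on validation folds carved out of the training set, the test set stays untouched until the final evaluation.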


In [34]:
mejor_modelo_p = modelos_grid_p.best_estimator_

In [35]:
y_pred_p = mejor_modelo_p.predict(x_test)
# RMSE as the square root of the MSE (`squared=False` is deprecated since scikit-learn 1.4)
RMSE_P = np.sqrt(mean_squared_error(y_test, y_pred_p))
MAEP_P = mean_absolute_error(y_test, y_pred_p)
R2_P =  r2_score(y_test, y_pred_p)
print('------ Multiple polynomial regression model (degree '+str(modelos_grid_p.best_params_.get("polynomialfeatures__degree"))+')----')
print("RMSE: %.2f" % RMSE_P)
print("MAE: %.2f" % MAEP_P)
print('R²: %.2f' % R2_P)

------ Multiple polynomial regression model (degree 2)----
RMSE: 136.13
MAE: 99.36
R²: 0.45


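### 7. LASSO regularization
Lasso regularization adds an L1 penalty to the cost function: the regularization strength `alpha` multiplied by the sum of the absolute values of the coefficients. This penalty can shrink some coefficients to exactly zero.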
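
Why the penalty can produce coefficients that are exactly zero is easiest to see in the one-feature case. Assuming a single feature scaled so that its mean square is 1, minimizing (1/(2n))·Σ(y_i - w·x_i)² + alpha·|w| yields the unpenalized coefficient pulled toward zero by `alpha`, and set to zero if it is smaller than `alpha` in absolute value. The function below is a sketch of that "soft-thresholding" rule under this simplifying assumption; the name `soft_threshold` and the sample values are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink `rho` toward zero by `alpha`; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


# Hypothetical unpenalized coefficients, not taken from the notebook.
for rho in (3.0, 0.5, -2.0):
    print(rho, "->", soft_threshold(rho, alpha=1.0))
```

A larger `alpha` widens the band of coefficients that are set to zero, which is how Lasso ends up discarding weak predictors.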

In [36]:
lasso = Lasso(max_iter=2000)

In [37]:
lasso

Lasso(max_iter=2000)

In [38]:
param_grid = {'alpha': [1, 2, 3, 4, 5]}
kfold = KFold(n_splits=10, shuffle=True, random_state = 0)

In [39]:
modelos_grid_lasso = GridSearchCV(lasso, param_grid, cv=kfold, n_jobs=-1, scoring = 'neg_root_mean_squared_error')

In [40]:
modelos_grid_lasso.fit(x_train, y_train)

GridSearchCV(cv=KFold(n_splits=10, random_state=0, shuffle=True),
             estimator=Lasso(max_iter=2000), n_jobs=-1,
             param_grid={'alpha': [1, 2, 3, 4, 5]},
             scoring='neg_root_mean_squared_error')

In [41]:
print("Best parameter: ", modelos_grid_lasso.best_params_)

Best parameter:  {'alpha': 1}


In [42]:
mejor_modelo_lasso = modelos_grid_lasso.best_estimator_
list(zip(x_train.columns, mejor_modelo_lasso.coef_))

[('weekday', 3.7746469802889466),
 ('temp', 257.25615174828715),
 ('atemp', 0.0),
 ('hum', -119.82253579640862),
 ('windspeed', -0.0),
 ('season_Fall', 37.82845384505004),
 ('season_Spring', 11.82936694336835),
 ('season_Summer', -0.6131451032045586),
 ('season_Winter', -21.2390032774127),
 ('weathersit_Clear', 5.647151428345879),
 ('weathersit_Heavy Rain', -0.0),
 ('weathersit_Light Rain', -33.45759539354966),
 ('weathersit_Mist', 0.0),
 ('time_of_day_Evening', 92.17617722253503),
 ('time_of_day_Morning', -0.0),
 ('time_of_day_Night', -100.1944936466728)]

In [43]:
y_pred_lasso = mejor_modelo_lasso.predict(x_test)
# RMSE as the square root of the MSE (`squared=False` is deprecated since scikit-learn 1.4)
RMSE_L = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
MAEP_L = mean_absolute_error(y_test, y_pred_lasso)
R2_L =  r2_score(y_test, y_pred_lasso)
print('------ Lasso regression model -------')
print("RMSE: %.2f" % RMSE_L)
print("MAE: %.2f" % MAEP_L)
print('R²: %.2f' % R2_L)

------ Lasso regression model -------
RMSE: 139.97
MAE: 103.66
R²: 0.42


### 8. Comparison table

In [44]:
# Build the table from a list of rows so the metric columns stay numeric
# (np.array would cast every cell to a string).
rows = [[str(regresion), round(RMSE_RL, 2), round(MAEP_RL, 2), round(R2_RL, 2)],
        ['Polynomial_degree = ' + str(modelos_grid_p.best_params_.get("polynomialfeatures__degree")), round(RMSE_P, 2), round(MAEP_P, 2), round(R2_P, 2)],
        [str(lasso), round(RMSE_L, 2), round(MAEP_L, 2), round(R2_L, 2)]]
tb_f = pd.DataFrame(rows, columns=['Model', 'RMSE', 'MAE', 'R2'])

In [45]:
tb_f

|   | Model | RMSE | MAE | R2 |
|---|---|---|---|---|
| 0 | LinearRegression() | 139.81 | 103.68 | 0.42 |
| 1 | Polynomial_degree = 2 | 136.13 | 99.36 | 0.45 |
| 2 | Lasso(max_iter=2000) | 139.97 | 103.66 | 0.42 |


### 9. Questions
- **Which degree of polynomial transformation was selected using the validation technique?** 10-fold cross-validation selected a degree-2 polynomial regression. On the test set, that model obtains RMSE: 136.13, MAE: 99.36, R²: 0.45.

- **Which value of α was selected using the validation technique for Lasso regression?** The selected value was alpha = 1. Larger values of alpha strengthen the regularization and can drive the coefficients of more variables to zero. Note that 1 is the smallest value in the searched grid `[1, 2, 3, 4, 5]`, so values below 1 were not evaluated and might perform better.

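- **From the comparison table, which model performs best on the test set? How do you interpret the values of the performance metrics?** The best model is the degree-2 polynomial: it has the highest R² (0.45) and the lowest RMSE (136.13) and MAE (99.36), compared with linear regression and Lasso regression. Even so, an R² of 0.45 means the model explains less than half of the variance in `cnt`, and its predictions are off by about 99 rentals on average (MAE). Lasso performs almost identically to plain linear regression (RMSE 139.97 vs. 139.81).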

- **Which variables were selected by the Lasso model? Based on them, how do you interpret the problem? Reflect on how this new knowledge could help make decisions in the context of the problem.** In the Lasso regression, the coefficients of some variables [atemp, windspeed, weathersit_Heavy Rain, weathersit_Mist, time_of_day_Morning] are zero, so the model ignores them when making predictions. A distinctive property of Lasso regression is that, thanks to its penalty term, it can identify variables that are not relevant for estimating the target variable. Among the variables it keeps, the largest coefficients in absolute value belong to `temp` (257.26), `hum` (-119.82), `time_of_day_Night` (-100.19) and `time_of_day_Evening` (92.18): predicted demand rises with temperature and in the evening, and falls with humidity and at night, which can inform how many bikes to make available at different times of day and in different weather.
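
The split between selected and discarded variables can be reproduced directly from the coefficients printed in cell `In [42]`. The sketch below copies those values, rounded to two decimals, into a plain dictionary and ranks the surviving variables by absolute coefficient. The variable names `lasso_coefs`, `selected`, `dropped` and `ranked` are introduced here for illustration.

```python
# Lasso coefficients from the output of cell In [42], rounded to two decimals.
lasso_coefs = {
    "weekday": 3.77, "temp": 257.26, "atemp": 0.0, "hum": -119.82,
    "windspeed": 0.0, "season_Fall": 37.83, "season_Spring": 11.83,
    "season_Summer": -0.61, "season_Winter": -21.24, "weathersit_Clear": 5.65,
    "weathersit_Heavy Rain": 0.0, "weathersit_Light Rain": -33.46,
    "weathersit_Mist": 0.0, "time_of_day_Evening": 92.18,
    "time_of_day_Morning": 0.0, "time_of_day_Night": -100.19,
}

# Variables with a nonzero coefficient are the ones Lasso keeps.
selected = {name: coef for name, coef in lasso_coefs.items() if coef != 0}
dropped = [name for name, coef in lasso_coefs.items() if coef == 0]
ranked = sorted(selected.items(), key=lambda item: abs(item[1]), reverse=True)

print("Dropped:", dropped)
for name, coef in ranked:
    print(f"{name:>24}: {coef:8.2f}")
```

Because every predictor was rescaled to [0, 1] before fitting, each coefficient is roughly the change in predicted `cnt` when that variable moves from its training minimum to its maximum, which makes the magnitudes comparable across variables.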

## Closing
---
*Created by: Alejandra Ossa Yepes*  
*Version: September 04, 2022*  
*Universidad de los Andes*  