# Regularization

Let's try to regularize the model with Lasso regression, which shrinks the coefficients of our model towards zero.
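To make the shrinkage concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the diamonds dataset). scikit-learn's `Lasso` minimizes $\frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$, so a larger `alpha` should pull the coefficients closer to zero and can set some of them exactly to zero. The names `X_demo`, `y_demo` and `lasso_coefs` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only the first two of five features drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

def lasso_coefs(alpha):
    """Fit a Lasso with the given alpha and return its coefficients."""
    return Lasso(alpha=alpha).fit(X_demo, y_demo).coef_

coefs_weak = lasso_coefs(0.01)   # light penalty
coefs_strong = lasso_coefs(1.0)  # the default penalty used by Lasso()

print(np.round(coefs_weak, 3))
print(np.round(coefs_strong, 3))
```

Comparing the two printed vectors shows how much of each coefficient the stronger penalty removes.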

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import RepeatedKFold, GridSearchCV, train_test_split

In [2]:
diamonds = pd.read_csv('data/train_cate.csv')
diamonds.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,0.5,5,7,4,62.3,55.0,5.11,5.07,3.17,1845
1,1,1.54,2,2,5,63.6,60.0,7.3,7.33,4.65,10164
2,2,1.32,3,1,2,61.7,60.0,6.95,7.01,4.31,5513
3,3,1.2,5,2,3,62.1,55.0,6.83,6.79,4.23,5174
4,4,1.73,4,2,3,61.2,60.0,7.67,7.65,4.69,10957


In [3]:
def lr_results(train_real, test_real, train_pred, test_pred):
    r2_train = r2_score(train_real, train_pred)
    r2_test = r2_score(test_real, test_pred)
    
    mae_train = mean_absolute_error(train_real, train_pred)
    mae_test = mean_absolute_error(test_real, test_pred)
    
    rmse_train = mean_squared_error(train_real, train_pred)**.5
    rmse_test = mean_squared_error(test_real, test_pred)**.5

    results = {'train_set': [r2_train, mae_train, rmse_train], 
               'test_set': [r2_test, mae_test, rmse_test]}
    
    results_df = pd.DataFrame(results, index=['r2', 'mae', 'rmse'])
    
    return results_df

### Simple Lasso regression

Let's first apply a Lasso regression to our raw data to get an idea of how the model could be improved.

In [4]:
X = diamonds.iloc[:,1:10]
y = diamonds.iloc[:,10]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [6]:
model = Lasso().fit(X_train, y_train)



In [7]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [8]:
lr_results(y_train, y_test, y_train_pred, y_test_pred)

Unnamed: 0,train_set,test_set
r2,0.907895,0.886311
mae,812.631322,810.814385
rmse,1217.925635,1347.210878


As we can see, applying Lasso regression to our raw data does not improve the results of our model that much. This could be because of the alpha level of the model, which by default is 1. Let's now try a Lasso model with cross-validation over a grid of 100 alpha values, from 0 to 0.99 in steps of 0.01.
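The notebook's cell passes `normalize=True`, a parameter that was removed from `Lasso` in scikit-learn 1.2. As a hedged alternative for newer versions, the sketch below scales the features with `StandardScaler` in a pipeline and lets `LassoCV` (imported above) choose alpha. It is not an exact equivalent of `normalize=True`, which scaled by the L2 norm rather than the standard deviation. It also runs on synthetic data from `make_regression`, and the 0.01 to 1 grid and 5-fold CV are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diamonds features (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=10.0, random_state=0
)

# 100 strictly positive alphas; alpha=0 is left out because it disables the penalty.
alphas = np.linspace(0.01, 1.0, 100)

# Scaling lives inside the pipeline, so it is fitted together with the model.
pipe = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5))
pipe.fit(X_demo, y_demo)

best_alpha = pipe.named_steps["lassocv"].alpha_
print(best_alpha)
```

`LassoCV` fits the whole regularization path per fold, which is usually cheaper than running `GridSearchCV` over the same grid.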

In [9]:
# Note: `normalize` was removed in scikit-learn 1.2, so this cell needs an older release
model = Lasso(normalize=True)

x_val = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)

grid = {'alpha': np.arange(0,1,0.01)}

model = GridSearchCV(model, grid, scoring='neg_mean_squared_error', cv=x_val, n_jobs=-1).fit(X_train, y_train)



In [10]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [11]:
lr_results(y_train, y_test, y_train_pred, y_test_pred)

Unnamed: 0,train_set,test_set
r2,0.908031,0.840495
mae,812.187481,817.450026
rmse,1217.025821,1595.747679


In [12]:
model.best_params_

{'alpha': 0.0}

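The change is small and mixed: train RMSE improves marginally (1217.93 to 1217.03), but test RMSE rises from 1347.21 to 1595.75 and test $R^2$ falls from 0.886 to 0.840. The search also selects alpha = 0, which means no penalty is applied at all. Even so, a range of alphas with cross-validation may still show us a way to improve once the features are revised.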
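The `{'alpha': 0.0}` result above deserves a note: with alpha at zero the L1 penalty vanishes, so the selected model is effectively ordinary least squares. The sketch below illustrates this on synthetic data (an assumption), using a tiny positive alpha rather than exactly zero because scikit-learn advises against `alpha=0` with the coordinate descent solver.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic, well-conditioned data (assumption, not the diamonds dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=200)

ols = LinearRegression().fit(X_demo, y_demo)

# A tiny alpha stands in for alpha=0; tight tolerance so the solver fully converges.
nearly_unpenalized = Lasso(alpha=1e-8, max_iter=100_000, tol=1e-10).fit(X_demo, y_demo)

# Largest absolute difference between the two coefficient vectors.
max_diff = float(np.max(np.abs(ols.coef_ - nearly_unpenalized.coef_)))
print(max_diff)
```

If the grid search keeps choosing the smallest alpha, the penalty is not helping on that feature set.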

Let's now try these same models after dropping the columns that were highly correlated when we tried the simple linear regression models. First, let's drop the columns related to the size of the diamonds ('x', 'y' and 'z').
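As a rough sketch of how highly correlated columns can be spotted, the helper below lists column pairs whose absolute Pearson correlation exceeds a threshold. The data are synthetic columns that mimic size measurements, and the helper `correlated_pairs` and the 0.9 threshold are assumptions, not the procedure used in the linear regression notebook.

```python
import numpy as np
import pandas as pd

# Synthetic columns (assumption): x, y and z all derive from one underlying size.
rng = np.random.default_rng(42)
size = rng.uniform(4.0, 8.0, 500)
demo = pd.DataFrame({
    "x": size,
    "y": size + rng.normal(scale=0.05, size=500),
    "z": 0.62 * size + rng.normal(scale=0.05, size=500),
    "depth": rng.normal(61.7, 1.4, 500),
})

def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for every pair above the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

pairs = correlated_pairs(demo)
print(pairs)
```

On the real data the same helper could be called on the feature columns of `diamonds`.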

In [13]:
X = diamonds.iloc[:,1:-4]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [15]:
model = Lasso()

x_val = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)

grid = {'alpha': np.arange(0,1,0.01)}

model = GridSearchCV(model, grid, scoring='neg_mean_squared_error', cv=x_val, n_jobs=-1).fit(X_train, y_train)



In [16]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [17]:
lr_results(y_train, y_test, y_train_pred, y_test_pred)

Unnamed: 0,train_set,test_set
r2,0.90515,0.903746
mae,857.267988,862.153509
rmse,1231.117795,1258.950541


In [18]:
model.best_params_

{'alpha': 0.0}

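Compared with the previous run, RMSE drops for the test set (1595.75 to 1258.95) and rises slightly for the train set (1217.03 to 1231.12), so the gap between the two narrows and the model looks less overfitted. $R^2$ has also increased for the test set (0.840 to 0.904). Note that the train/test split was redrawn without a fixed seed, so part of the difference may come from the split itself.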
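Because each experiment calls `train_test_split` again without a seed, the feature sets are scored on different rows. One way to make the comparison fairer is to split row positions once with a fixed `random_state` and reuse them for every feature subset. The sketch below assumes a toy frame with hypothetical columns `a`, `b`, `c` and `target`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `diamonds` (assumption).
demo = pd.DataFrame(np.arange(40).reshape(10, 4), columns=["a", "b", "c", "target"])

# Split row positions once; random_state makes the split reproducible.
train_idx, test_idx = train_test_split(np.arange(len(demo)), test_size=0.2, random_state=1)

# Two feature subsets evaluated on exactly the same rows.
X_full, X_small = demo[["a", "b", "c"]], demo[["a", "b"]]
y = demo["target"]

X_full_test, X_small_test = X_full.iloc[test_idx], X_small.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print(sorted(test_idx.tolist()))
```

With shared rows, a change in RMSE between experiments reflects the features rather than the draw.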

Let's now try dropping the features that were not statistically significant when we performed an OLS test in our simple linear regression notebook. The slice below keeps 'x' and leaves out 'y' and 'z'.

In [19]:
X = diamonds.iloc[:,1:-3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [20]:
model = Lasso(normalize=True)

x_val = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)

grid = {'alpha': np.arange(0,1,0.01)}

model = GridSearchCV(model, grid, scoring='neg_mean_squared_error', cv=x_val, n_jobs=-1).fit(X_train, y_train)



In [21]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [22]:
lr_results(y_train, y_test, y_train_pred, y_test_pred)

Unnamed: 0,train_set,test_set
r2,0.9071,0.909604
mae,806.382908,805.133315
rmse,1218.682903,1218.934814


In [23]:
model.best_params_

{'alpha': 0.0}

In [24]:
diamonds_pred = pd.read_csv('data/pred_cate.csv')
diamonds_pred.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z
0,0,0.45,5,6,3,62.8,58.0,4.88,4.84,3.05
1,1,1.23,5,3,3,61.0,56.0,6.96,6.92,4.23
2,2,0.33,5,2,8,61.8,55.0,4.46,4.47,2.76
3,3,0.51,5,7,4,58.0,60.0,5.29,5.26,3.06
4,4,0.4,5,6,4,62.2,59.0,4.71,4.74,2.94


In [25]:
X_pred = diamonds_pred.iloc[:,1:-2]

In [26]:
diamonds_pred['price'] = model.predict(X_pred)

In [27]:
submission_1 = diamonds_pred.iloc[:,[0,-1]]

In [28]:
submission_1.set_index('id', inplace=True)

In [29]:
#submission_1.to_csv('data/submission_1.csv')

In [30]:
submission_1

id,price
0,934.091435
1,6772.397080
2,1376.813893
3,2376.159585
4,1062.559595
...,...
13480,-1364.008429
13481,4658.442226
13482,1872.047727
13483,-186.023546


It seems that adding the column 'x' back to our features has clearly improved the model: test $RMSE$ drops to about 1219, test $R^2$ rises to 0.910 and test $MAE$ falls to about 805.

Let's now rank the features with scikit-learn's RFE (recursive feature elimination) class, in order to see if there is a better feature combination for our model.
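A minimal sketch of what such a ranking could look like, assuming synthetic data from `make_regression`, a `Lasso(alpha=0.1)` base estimator and a target of three features (none of these choices come from the notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 7 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, n_informative=3, noise=5.0, random_state=0
)

# RFE refits the estimator repeatedly, dropping the weakest feature each round.
selector = RFE(Lasso(alpha=0.1), n_features_to_select=3).fit(X_demo, y_demo)

ranking = selector.ranking_   # 1 = kept; larger numbers were eliminated earlier
support = selector.support_   # boolean mask of the kept features

print(ranking)
print(support)
```

On the diamonds data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the mask could be mapped back to names with `X_train.columns[support]`.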

X = diamonds.iloc[:,1:-3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.26)

model = Lasso(normalize=True)

x_val = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)

grid = {'alpha': np.arange(0.0008,0.0009,0.0000001)}

model = GridSearchCV(model, grid, scoring='neg_mean_squared_error', cv=x_val, n_jobs=-1).fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

lr_results(y_train, y_test, y_train_pred, y_test_pred)

model.best_params_
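Beyond `best_params_`, the fitted search also stores the score for every alpha in `cv_results_`, which helps judge whether the best value really stands out or the curve is flat. The sketch below uses synthetic data and a five-value grid, both assumptions, and converts the negated MSE back to an RMSE-like figure.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid (assumptions) to keep the example fast.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(Lasso(), grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

# mean_test_score holds negated MSE; negate and take the root to get an RMSE.
table = pd.DataFrame({
    "alpha": [float(a) for a in search.cv_results_["param_alpha"]],
    "cv_rmse": np.sqrt(-search.cv_results_["mean_test_score"]),
}).sort_values("cv_rmse")

print(table)
```

The first row of the sorted table corresponds to the alpha reported by `best_params_`.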