# Regularized Linear Models Code Tutorial

<b><u>[Purpose]</u></b>
- Use regularized linear models for feature selection (dimensionality reduction)
- Use Ridge, Lasso, and ElasticNet
- Tune hyperparameters not only with a for loop but also with GridSearchCV

<b><u>[Process]</u></b>
- Data Path = 'https://github.com/GonieAhn/Data-Science-online-course-from-gonie/tree/main/Data%20Store'
- Define X's & Y
- Split Train & Valid data set
- Modeling (Ridge, Lasso, ElasticNet) & Hyperparameter Tuning
- Interpretation

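<b><u>[Caution]</u></b>
- Regularized linear models require the X's to be scaled before fitting
- The penalty acts on coefficient magnitude, and coefficient magnitude depends on each feature's units, so the features must share a common scale for every coefficient to be penalized equally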
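
The point about scaling can be checked numerically. The sketch below is not part of the original notebook: it uses synthetic data and a closed-form ridge solver written in NumPy, and `ridge_coef` and `minmax_scale` are hypothetical helper names introduced only for illustration. Two features contribute equally to the target, but one is recorded on a 0..1 scale and the other on a 0..1000 scale. The code compares how much of each unpenalized coefficient survives the ridge penalty before and after min-max scaling.

```python
import numpy as np

def ridge_coef(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept is not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

def minmax_scale(X):
    """Rescale every column to the [0, 1] range."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)       # feature measured on a 0..1 scale
x2 = rng.uniform(0, 1000, n)    # feature measured on a 0..1000 scale
X = np.column_stack([x1, x2])
# Both features move y by about 1.0 across their full range
y = 1.0 * x1 + 0.001 * x2 + rng.normal(0, 0.05, n)

alpha = 10.0
# Fraction of the unpenalized (alpha = 0) coefficient that remains under the penalty
shrink_raw = ridge_coef(X, y, alpha) / ridge_coef(X, y, 0.0)

Xs = minmax_scale(X)
shrink_scaled = ridge_coef(Xs, y, alpha) / ridge_coef(Xs, y, 0.0)

print("retained fraction (raw features):   ", shrink_raw)
print("retained fraction (scaled features):", shrink_scaled)
```

On the raw features the small-scale column, which needs a large coefficient, is shrunk noticeably while the large-scale column is left almost untouched. After scaling, both columns are shrunk by a similar amount.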

In [1]:
import os
import gc
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from regressors import stats
from sklearn.model_selection import GridSearchCV

In [2]:
%%time
# Data Load 
data = pd.read_csv("../Data Store/TOY_DATA.csv")
print(">>>> Data Shape : {}".format(data.shape))

>>>> Data Shape : (3500, 357)
Wall time: 229 ms


<b><u>[Data Selection]</u></b>
- Perform data cleaning

In [3]:
# Missing value dropping
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)
print("Data Shape : {}".format(data.shape)) 

Data Shape : (3500, 357)


In [4]:
# Select features using domain knowledge
sel_col = ["X23","X22","X21","X254","X247","X246","X245",
           "X244","X243","X242","X241","X16","X15","X14",
           "X13","X12","X11","X10","X9","X8","X7","X252",
           "X251","X250","X249","X248","X20","X19","X253",
           "X18","X17","X6","X5","X4"]

In [5]:
# Data Selection
Y = data['Y']
X = data[sel_col]

<b><u>[Data Split]</u></b>
- For large data sets, split the row indices and use them to index the data when fitting the model
- Reason: building separate train and validation data sets up front puts extra load on memory

In [6]:
idx = list(range(X.shape[0]))
train_idx, valid_idx = train_test_split(idx, test_size=0.3, random_state=2021)
print(">>>> # of Train data : {}".format(len(train_idx)))
print(">>>> # of valid data : {}".format(len(valid_idx)))

>>>> # of Train data : 2450
>>>> # of valid data : 1050


In [7]:
# Scaling
scaler = MinMaxScaler().fit(X.iloc[train_idx])
X_scal = scaler.transform(X)
X_scal = pd.DataFrame(X_scal, columns=X.columns)
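
The scaler above is fit on the training rows only, which keeps validation rows out of the scaling statistics. When cross-validation is used later, every fold still shares that single pre-fitted scaler. One alternative, shown here as a sketch on synthetic data and not taken from the original notebook, is to place the scaler and the model in a scikit-learn `Pipeline` so the scaler is refit inside each fold. The names `X_demo`, `y_demo`, `pipe`, and `param_grid` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales
X_demo = rng.normal(size=(300, 5)) * np.array([1, 10, 100, 1000, 0.1])
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# The scaler is a pipeline step, so each CV fold fits it on that fold's training part only
pipe = Pipeline([("scaler", MinMaxScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.001, 0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("CV RMSE: {:.4f}".format(-search.best_score_))
```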

<b><u>[Ridge Regression]</u></b>
- Hyperparameter Tuning using for Loop
- Hyperparameter Tuning using RidgeCV (cross-validation)
- For how to interpret the variables, see "[Class04] Regression Problem Code Tutorial"

In [8]:
penelty = [0.00001, 0.00005, 0.0001, 0.001, 0.01, 0.1, 0.3, 0.5, 0.6, 0.7, 0.9, 1, 10]

# Using For Loop !! 
# Ridge Regression
# select alpha by checking R2, MSE, RMSE
for a in penelty:
    model = Ridge(alpha=a).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])  # X_scal is already MinMax-scaled, so no further normalization is applied
    score = model.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
    pred_y = model.predict(X_scal.iloc[valid_idx])
    mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
    print("Alpha:{0:.5f}, R2:{1:.7f}, MSE:{2:.7f}, RMSE:{3:.7f}".format(a, score, mse, np.sqrt(mse))) 

Alpha:0.00001, R2:0.2161946, MSE:0.0094090, RMSE:0.0970002
Alpha:0.00005, R2:0.2161938, MSE:0.0094091, RMSE:0.0970003
Alpha:0.00010, R2:0.2161928, MSE:0.0094091, RMSE:0.0970003
Alpha:0.00100, R2:0.2161759, MSE:0.0094093, RMSE:0.0970014
Alpha:0.01000, R2:0.2160853, MSE:0.0094104, RMSE:0.0970070
Alpha:0.10000, R2:0.2163781, MSE:0.0094068, RMSE:0.0969889
Alpha:0.30000, R2:0.2174113, MSE:0.0093944, RMSE:0.0969249
Alpha:0.50000, R2:0.2180901, MSE:0.0093863, RMSE:0.0968829
Alpha:0.60000, R2:0.2182952, MSE:0.0093838, RMSE:0.0968701
Alpha:0.70000, R2:0.2184292, MSE:0.0093822, RMSE:0.0968618
Alpha:0.90000, R2:0.2185358, MSE:0.0093809, RMSE:0.0968552
Alpha:1.00000, R2:0.2185293, MSE:0.0093810, RMSE:0.0968556
Alpha:10.00000, R2:0.2082147, MSE:0.0095048, RMSE:0.0974927
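
Reading the best alpha off a printed list is easy to get wrong when the scores differ only in the third or fourth decimal place. A minimal sketch, assuming synthetic data and hypothetical names (`X_tr`, `X_va`, `rows`, `results_df`), collects the loop results in a DataFrame and picks the alpha with the lowest validation RMSE programmatically:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
# Synthetic train / validation data standing in for the notebook's scaled features
X_tr, X_va = rng.uniform(size=(200, 4)), rng.uniform(size=(80, 4))
w = np.array([0.5, -0.3, 0.0, 0.1])
y_tr = X_tr @ w + rng.normal(scale=0.1, size=200)
y_va = X_va @ w + rng.normal(scale=0.1, size=80)

rows = []
for a in [0.001, 0.01, 0.1, 1, 10]:
    m = Ridge(alpha=a).fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, m.predict(X_va))
    rows.append({"alpha": a, "r2": m.score(X_va, y_va), "rmse": np.sqrt(mse)})

# Sort so the best (lowest RMSE) candidate is in the first row
results_df = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
best_alpha = results_df.loc[0, "alpha"]

print(results_df)
print("Best alpha by validation RMSE:", best_alpha)
```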


In [9]:
model_best = Ridge(alpha=0.9).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
stats.summary(model_best, X_scal.iloc[train_idx], Y.iloc[train_idx], xlabels = list(X_scal.columns))

Residuals:
    Min      1Q  Median     3Q    Max
-0.2148 -0.0572 -0.0078 0.0442 0.6694


Coefficients:
            Estimate               Std. Error          t value   p value
_intercept  0.887908  0.016627+20799.4668470j   0.0000-0.0003j  0.999747
X23        -0.259066   0.01300800+0.00000000j -19.9158+0.0000j  0.000000
X22        -0.092163   0.02015100+0.00000000j  -4.5737+0.0000j  0.000005
X21         0.049482   0.01549500+0.00000000j   3.1934-0.0000j  0.001424
X254       -0.018696   0.01569000+0.00000000j  -1.1916+0.0000j  0.233534
X247        0.112202   0.02963200+0.00000000j   3.7865-0.0000j  0.000156
X246       -0.002081   0.38087500+0.00000000j  -0.0055+0.0000j  0.995640
X245       -0.011075   0.01753700+0.00000000j  -0.6315+0.0000j  0.527782
X244       -0.010399   0.01451900+0.00000000j  -0.7163+0.0000j  0.473902
X243       -0.014547   0.37709100+0.00000000j  -0.0386+0.0000j  0.969231
X242       -0.015234   0.02227900+0.00000000j  -0.6838+0.0000j  0.494167
X241        0.007162 

In [10]:
# Using RidgeCV (built-in cross-validation over the alpha grid)
ridge_cv=RidgeCV(alphas=penelty, cv=5)
model = ridge_cv.fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
print("Best Alpha:{0:.5f}, R2:{1:.4f}".format(model.alpha_, model.best_score_))

Best Alpha:0.10000, R2:0.2494
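
With an integer `cv` and the default scoring, `best_score_` should be the mean R2 across the cross-validation folds of the training rows, so the 0.2494 above is not directly comparable with the validation-set R2 values printed by the for loop. The sketch below, on synthetic data with illustrative names (`X_demo`, `y_demo`, `grid`), compares `RidgeCV` with an explicit `GridSearchCV` over `Ridge` on the same grid; the two are expected to agree because both run the same k-fold search.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(250, 6))
true_w = np.array([1.0, -0.5, 0.0, 0.0, 0.3, 0.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.2, size=250)
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Built-in cross-validated ridge
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

# Explicit grid search over the same alphas and the same 5-fold split
grid = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5).fit(X_demo, y_demo)

print("RidgeCV     : alpha={}, mean CV R2={:.4f}".format(ridge_cv.alpha_, ridge_cv.best_score_))
print("GridSearchCV: alpha={}, mean CV R2={:.4f}".format(grid.best_params_["alpha"], grid.best_score_))
```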


In [11]:
# Refit Ridge with the alpha selected by RidgeCV and evaluate on the validation set
model_best = Ridge(alpha=model.alpha_).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
score = model_best.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
pred_y = model_best.predict(X_scal.iloc[valid_idx])
mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
# Report the alpha that was actually used instead of a hard-coded value
print("Alpha:{0:.5f}, R2:{1:.7f}, MSE:{2:.7f}, RMSE:{3:.7f}".format(model.alpha_, score, mse, np.sqrt(mse)))
stats.summary(model_best, X_scal.iloc[train_idx], Y.iloc[train_idx], xlabels=list(X.columns))

Alpha:0.10000, R2:0.2163781, MSE:0.0094068, RMSE:0.0969889
Residuals:
    Min      1Q  Median     3Q    Max
-0.2162 -0.0569 -0.0077 0.0444 0.6757


Coefficients:
            Estimate               Std. Error          t value   p value
_intercept  0.889774  0.016624+20790.0555420j   0.0000-0.0003j  0.999746
X23        -0.264759   0.01304500+0.00000000j -20.2956+0.0000j  0.000000
X22        -0.107331   0.02008500+0.00000000j  -5.3439+0.0000j  0.000000
X21         0.046228   0.01544900+0.00000000j   2.9924-0.0000j  0.002796
X254       -0.030213   0.01564300+0.00000000j  -1.9315+0.0000j  0.053542
X247        0.142531   0.02953600+0.00000000j   4.8257-0.0000j  0.000001
X246        0.025412   0.37960000+0.00000000j   0.0669-0.0000j  0.946631
X245       -0.022026   0.01748500+0.00000000j  -1.2597+0.0000j  0.207885
X244       -0.006031   0.01448000+0.00000000j  -0.4165+0.0000j  0.677102
X243       -0.046104   0.37582800+0.00000000j  -0.1227+0.0000j  0.902375
X242       -0.024549   0.02220600+0

<b><u>[LASSO Regression]</u></b>
- Hyperparameter Tuning using for Loop
- Hyperparameter Tuning using LassoCV (cross-validation)
- For how to interpret the variables, see "[Class04] Regression Problem Code Tutorial"

In [12]:
penelty = [0.0000001, 0.0000005, 0.000001, 0.000005,0.00001, 0.00005, 0.0001, 0.001, 0.01]

# LASSO Regression
# select alpha by checking R2, MSE, RMSE
for a in penelty:
    model = Lasso(alpha=a).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])  # X_scal is already MinMax-scaled, so no further normalization is applied
    score = model.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
    pred_y = model.predict(X_scal.iloc[valid_idx])
    mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
    print("Alpha:{0:.7f}, R2:{1:.4f}, MSE:{2:.4f}, RMSE:{3:.4f}".format(a, score, mse, np.sqrt(mse)))

Alpha:0.0000001, R2:0.2158, MSE:0.0094, RMSE:0.0970
Alpha:0.0000005, R2:0.2157, MSE:0.0094, RMSE:0.0970
Alpha:0.0000010, R2:0.2156, MSE:0.0094, RMSE:0.0970
Alpha:0.0000050, R2:0.2163, MSE:0.0094, RMSE:0.0970
Alpha:0.0000100, R2:0.2179, MSE:0.0094, RMSE:0.0969
Alpha:0.0000500, R2:0.2229, MSE:0.0093, RMSE:0.0966
Alpha:0.0001000, R2:0.2189, MSE:0.0094, RMSE:0.0968
Alpha:0.0010000, R2:0.1831, MSE:0.0098, RMSE:0.0990
Alpha:0.0100000, R2:-0.0004, MSE:0.0120, RMSE:0.1096
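
The R2 of about zero at alpha 0.01 is what one would expect if the penalty has driven every coefficient to zero, leaving a model that predicts only the mean. The following sketch illustrates that behaviour on synthetic data; it is an assumption-laden illustration and not a rerun of the notebook, and `X_demo`, `y_demo`, `alpha_max`, and `counts` are hypothetical names. It counts the non-zero Lasso coefficients as alpha grows, up to a value just above the smallest alpha at which all coefficients vanish.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X_demo = rng.uniform(size=(300, 8))
true_w = np.array([1.0, -0.8, 0.5, 0, 0, 0, 0, 0])   # only three informative features
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=300)

# For scikit-learn's Lasso objective, all coefficients are zero once
# alpha >= max|Xc^T yc| / n (Xc, yc are the centered data)
Xc = X_demo - X_demo.mean(axis=0)
alpha_max = np.max(np.abs(Xc.T @ (y_demo - y_demo.mean()))) / len(y_demo)

counts = []
for frac in (0.001, 0.01, 0.1, 0.5, 1.1):
    a = frac * alpha_max
    model = Lasso(alpha=a).fit(X_demo, y_demo)
    n_nonzero = int(np.sum(model.coef_ != 0))
    counts.append(n_nonzero)
    print("alpha={:.6f}  non-zero coefficients={}  train R2={:.4f}".format(
        a, n_nonzero, model.score(X_demo, y_demo)))
```

The features whose coefficients survive at a moderate alpha are the ones Lasso effectively selects.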


In [14]:
model_best = Lasso(alpha=0.00005).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
stats.summary(model_best, X_scal.iloc[train_idx], Y.iloc[train_idx], xlabels=list(X.columns))

Residuals:
    Min      1Q  Median     3Q    Max
-0.2189 -0.0572 -0.0079 0.0438 0.6743


Coefficients:
            Estimate               Std. Error          t value   p value
_intercept  0.884437  0.016649+20797.3158180j   0.0000-0.0003j  0.999748
X23        -0.264110   0.01295500+0.00000000j -20.3869+0.0000j  0.000000
X22        -0.098391   0.02010100+0.00000000j  -4.8948+0.0000j  0.000001
X21         0.043437   0.01548600+0.00000000j   2.8050-0.0000j  0.005071
X254       -0.020798   0.01567700+0.00000000j  -1.3266+0.0000j  0.184750
X247        0.102123   0.02960800+0.00000000j   3.4491-0.0000j  0.000572
X246       -0.000000   0.38058200+0.00000000j   0.0000+0.0000j  1.000000
X245       -0.011616   0.01752700+0.00000000j  -0.6628+0.0000j  0.507553
X244       -0.004417   0.01451500+0.00000000j  -0.3043+0.0000j  0.760931
X243       -0.003153   0.37680100+0.00000000j  -0.0084+0.0000j  0.993324
X242       -0.006729   0.02226300+0.00000000j  -0.3023+0.0000j  0.762481
X241        0.000000 

In [15]:
# Cross Validation for LASSO
lasso_cv=LassoCV(alphas=penelty, cv=5)
model = lasso_cv.fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
print("Best Alpha : {:.7f}".format(model.alpha_))

Best Alpha : 0.0000500


In [16]:
model_best = Lasso(alpha=model.alpha_).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
score = model_best.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
pred_y = model_best.predict(X_scal.iloc[valid_idx])
mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
print("Alpha:{0:.7f}, R2:{1:.3f}, MSE:{2:.4f}, RMSE:{3:.4f}".format(model.alpha_, score, mse, np.sqrt(mse)))
stats.summary(model_best, X_scal.iloc[train_idx], Y.iloc[train_idx], xlabels=list(X.columns))

Alpha:0.0000500, R2:0.223, MSE:0.0093, RMSE:0.0966
Residuals:
    Min      1Q  Median     3Q    Max
-0.2189 -0.0572 -0.0079 0.0438 0.6743


Coefficients:
            Estimate               Std. Error          t value   p value
_intercept  0.884437  0.016649+20797.3158180j   0.0000-0.0003j  0.999748
X23        -0.264110   0.01295500+0.00000000j -20.3869+0.0000j  0.000000
X22        -0.098391   0.02010100+0.00000000j  -4.8948+0.0000j  0.000001
X21         0.043437   0.01548600+0.00000000j   2.8050-0.0000j  0.005071
X254       -0.020798   0.01567700+0.00000000j  -1.3266+0.0000j  0.184750
X247        0.102123   0.02960800+0.00000000j   3.4491-0.0000j  0.000572
X246       -0.000000   0.38058200+0.00000000j   0.0000+0.0000j  1.000000
X245       -0.011616   0.01752700+0.00000000j  -0.6628+0.0000j  0.507553
X244       -0.004417   0.01451500+0.00000000j  -0.3043+0.0000j  0.760931
X243       -0.003153   0.37680100+0.00000000j  -0.0084+0.0000j  0.993324
X242       -0.006729   0.02226300+0.0000000

<b><u>[ElasticNet]</u></b>
- Hyperparameter Tuning using for Loop
- Hyperparameter Tuning using GridSearchCV
- For how to interpret the variables, see "[Class04] Regression Problem Code Tutorial"

In [17]:
# alpha >= 0 sets the overall penalty strength; alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object.
alphas = [0.000001, 0.000005, 0.00001, 0.00005, 0.0001, 0.001, 0.005, 0.01, 0.05]
# l1_ratio (called beta here) ranges from 0 to 1: 1 is a pure L1 (Lasso) penalty and 0 is a pure L2 (Ridge) penalty.
# scikit-learn's documentation suggests putting more values close to 1 than close to 0; this grid instead stays near 0 (Ridge-like).
betas = [0.000001, 0.000005, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]

# ElasticNet Regression
# select alpha and beta by checking R2, MSE, RMSE
for a in alphas:
    for b in betas:
        model = ElasticNet(alpha=a, l1_ratio=b).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])  # X_scal is already MinMax-scaled, so no further normalization is applied
        score = model.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
        pred_y = model.predict(X_scal.iloc[valid_idx])
        mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
        print("Alpha:{0:.5f}, Beta: {1:.5f}, R2:{2:.6f}, MSE:{3:.4f}, RMSE:{4:.4f}".format(a, b, score, mse, np.sqrt(mse)))

Alpha:0.00000, Beta: 0.00000, R2:0.215883, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.00001, R2:0.215883, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.00010, R2:0.215883, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.00100, R2:0.215882, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.00500, R2:0.215881, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.01000, R2:0.215879, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.05000, R2:0.215867, MSE:0.0094, RMSE:0.0970
Alpha:0.00000, Beta: 0.10000, R2:0.215851, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.00000, R2:0.215912, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.00001, R2:0.215912, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.00010, R2:0.215912, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.00100, R2:0.215911, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.00500, R2:0.215905, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.01000, R2:0.215897, MSE:0.0094, RMSE:0.0970
Alpha:0.00001, Beta: 0.05000, R2:0.215835, MSE:0.0094, RMSE:0.

In [18]:
# Cross Validation for ElasticNet
grid = dict()
grid['alpha'] = alphas
grid['l1_ratio'] = betas

In [19]:
# define model
model = ElasticNet()
# define search
search = GridSearchCV(model, grid, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
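# perform the search
# NOTE: the search is run on the validation rows (valid_idx), so the validation metrics
# in the next cell are not from held-out data; searching on train_idx would avoid this.
results = search.fit(X_scal.iloc[valid_idx], Y.iloc[valid_idx])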
# summarize
print('RMSE: {:.4f}'.format(-results.best_score_))
print('Config: {}'.format(results.best_params_))

RMSE: 0.0968
Config: {'alpha': 0.001, 'l1_ratio': 0.1}
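
`ElasticNetCV` is imported in the first cell but never used. As a sketch of that alternative, on synthetic data and with illustrative grids that are not the notebook's (`X_demo`, `y_demo`, and `l1_ratios` are assumed names), it searches `alpha` and `l1_ratio` jointly by cross-validated mean squared error and exposes the selected values as `alpha_` and `l1_ratio_`:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(4)
X_demo = rng.uniform(size=(300, 6))
true_w = np.array([1.0, -0.5, 0.0, 0.0, 0.3, 0.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.2, size=300)

alphas = [0.0001, 0.001, 0.01, 0.1]
l1_ratios = [0.1, 0.5, 0.9, 1.0]

# 5-fold CV over every (l1_ratio, alpha) pair; the pair with the lowest mean MSE is kept
enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5).fit(X_demo, y_demo)

print("Best alpha   :", enet_cv.alpha_)
print("Best l1_ratio:", enet_cv.l1_ratio_)
```

Unlike `GridSearchCV`, this estimator does not take a `scoring` argument, so it fits best when mean squared error is the criterion of interest.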


In [20]:
model_best = ElasticNet(alpha=results.best_params_['alpha'], 
                        l1_ratio=results.best_params_['l1_ratio']).fit(X_scal.iloc[train_idx], Y.iloc[train_idx])
score = model_best.score(X_scal.iloc[valid_idx], Y.iloc[valid_idx])
pred_y = model_best.predict(X_scal.iloc[valid_idx])
mse = mean_squared_error(Y.iloc[valid_idx], pred_y)
print("Alpha:{0:.5f}, Beta: {1:.5f}, R2:{2:.6f}, MSE:{3:.4f}, RMSE:{4:.4f}".format(results.best_params_['alpha'], 
                                                                                   results.best_params_['l1_ratio'], 
                                                                                   score, mse, np.sqrt(mse)))
stats.summary(model_best, X_scal.iloc[train_idx], Y.iloc[train_idx], xlabels=list(X.columns))

Alpha:0.00100, Beta: 0.10000, R2:0.214514, MSE:0.0094, RMSE:0.0971
Residuals:
   Min      1Q  Median     3Q    Max
-0.218 -0.0576 -0.0073 0.0438 0.6678


Coefficients:
            Estimate               Std. Error          t value   p value
_intercept  0.868238  0.016688+20820.7268440j   0.0000-0.0003j  0.999754
X23        -0.241872   0.01300400+0.00000000j -18.5997+0.0000j  0.000000
X22        -0.060408   0.02028400+0.00000000j  -2.9781+0.0000j  0.002928
X21         0.037315   0.01560700+0.00000000j   2.3909-0.0000j  0.016880
X254       -0.002670   0.01580300+0.00000000j  -0.1690+0.0000j  0.865829
X247        0.063328   0.02985600+0.00000000j   2.1211-0.0000j  0.034014
X246        0.000000   0.38376500+0.00000000j   0.0000+0.0000j  1.000000
X245       -0.000000   0.01766400+0.00000000j   0.0000+0.0000j  1.000000
X244       -0.015758   0.01461900+0.00000000j  -1.0779+0.0000j  0.281179
X243        0.000000   0.37995200+0.00000000j   0.0000+0.0000j  1.000000
X242       -0.000000   0.0224
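
Coefficients reported as exactly zero in the summary above are the features the L1 part of the penalty has removed. A minimal sketch of turning a fitted model into a selected-feature list follows. It uses synthetic data and hypothetical feature names (`F0` to `F7`), and the `alpha` and `l1_ratio` values are chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
cols = ["F{}".format(i) for i in range(8)]          # hypothetical feature names
X_demo = pd.DataFrame(rng.uniform(size=(300, 8)), columns=cols)
# Only F0 and F1 carry signal; the rest are noise
y_demo = 1.0 * X_demo["F0"] - 0.8 * X_demo["F1"] + rng.normal(scale=0.1, size=300)

model = ElasticNet(alpha=0.01, l1_ratio=0.9).fit(X_demo, y_demo)

# Pair each coefficient with its feature name, then split by zero / non-zero
coef = pd.Series(model.coef_, index=cols)
selected = coef[coef != 0].index.tolist()
dropped = coef[coef == 0].index.tolist()

print("Selected features:", selected)
print("Dropped features :", dropped)
```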