# 14. Lasso Regression Model

## 14.1 Key Concepts

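Like the Ridge model, the Lasso regression model shrinks feature coefficients toward 0. Unlike Ridge, it can set the coefficients of unimportant variables to exactly 0, which removes those variables from the model.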
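
The difference from Ridge can be seen on a small synthetic dataset. The sketch below is only an illustration: the data from `make_regression` and the choice `alpha=5.0` are assumptions, not part of the house price analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 20 features, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Fit both models with the same (arbitrarily chosen) regularization strength.
ridge = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```

Ridge typically leaves every coefficient small but non-zero, while Lasso drops some features entirely.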

## 14.2 scikit-learn

The Lasso regression model is part of the sklearn.linear_model module.

|Regressor with variable selection||
|:--|:--|
|linear_model.ElasticNet() |Linear regression with combined L1 and L2 priors as regularizer. |
|linear_model.ElasticNetCV() |Elastic Net model with iterative fitting along a regularization path. |
|linear_model.Lars() |Least Angle Regression model, a.k.a. LAR. |
|linear_model.LarsCV() |Cross-validated Least Angle Regression model. |
|**linear_model.Lasso()** |Linear Model trained with L1 prior as regularizer (aka the Lasso). |
|linear_model.LassoCV() |Lasso linear model with iterative fitting along a regularization path. |
|linear_model.LassoLars() |Lasso model fit with Least Angle Regression, a.k.a. Lars. |
|linear_model.LassoLarsCV() |Cross-validated Lasso, using the LARS algorithm.|
|linear_model.LassoLarsIC() |Lasso model fit with Lars using BIC or AIC for model selection|
|linear_model.OrthogonalMatchingPursuit() |Orthogonal Matching Pursuit model (OMP)|
|linear_model.OrthogonalMatchingPursuitCV() |Cross-validated Orthogonal Matching Pursuit model (OMP)|

The main hyperparameter is alpha.


|Hyper Parameter||
|:--|:--|
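|alpha |Sets the regularization strength. The default is 1.0. The closer it is to 0, the weaker the regularization, and the results approach those of ordinary linear regression.|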
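
To make the role of alpha concrete, the following sketch counts how many coefficients stay non-zero as alpha grows. The synthetic data and the alpha values are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption, not the house price data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

alphas = [0.01, 1.0, 10.0, 1000.0]  # hypothetical values chosen for illustration
nonzero_counts = []
for alpha in alphas:
    # A larger max_iter helps the solver converge for very small alpha.
    model_demo = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    n_nonzero = int(np.count_nonzero(model_demo.coef_))
    nonzero_counts.append(n_nonzero)
    print(f"alpha={alpha:>7}: {n_nonzero} non-zero coefficients")
```

With a small alpha the fit stays close to ordinary least squares; with a large enough alpha every coefficient is pushed to 0 and the model predicts only the intercept.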

## 14.3 Analysis Code

In [2]:
# Load the data
import pandas as pd
data2 = pd.read_csv('./extrafiles/house_price.csv', encoding='utf-8')

print(data2.columns)

X = data2[data2.columns[1:5]]
y = data2[['house_value']]

print(X.columns)

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sanity check: the target mean should be similar in the train and test sets
# (no stratify argument is used here; stratify is meant for categorical targets)
print(y_train.mean())
print(y_test.mean())

# Feature scaling - MinMaxScaler (min-max normalization, fitted on the training set only)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

X_scaled_train = scaler.transform(X_train)
X_scaled_test = scaler.transform(X_test)

Index(['housing_age', 'income', 'bedrooms', 'households', 'rooms',
       'house_value'],
      dtype='object')
Index(['income', 'bedrooms', 'households', 'rooms'], dtype='object')
house_value    189260.967812
dtype: float64
house_value    188391.001357
dtype: float64
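
The scaling steps above can also be bundled with the estimator. The sketch below uses `make_pipeline` on synthetic stand-in data (an assumption, since the CSV file is not reproduced here), so the scaler is always fitted on whatever data the pipeline is fitted on and reused at prediction time.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the house price data (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# fit() fits the scaler on the training data, then fits Lasso on the scaled features.
pipe = make_pipeline(MinMaxScaler(), Lasso())
pipe.fit(X_tr, y_tr)

# score() applies the already fitted scaler to the test data before predicting.
r2_test = pipe.score(X_te, y_te)
print("Test R^2:", r2_test)
```

If such a pipeline is passed to a cross-validated search, the scaler is re-fitted inside each fold, and the regularization parameter is addressed as `lasso__alpha`.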


In [3]:
# Fit the Lasso model
from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X_scaled_train, y_train)
pred_train = model.predict(X_scaled_train)
model.score(X_scaled_train, y_train)

0.5455724679313863

In [4]:
# Apply the model to the test data
pred_test = model.predict(X_scaled_test)
model.score(X_scaled_test, y_test)

0.5626850497564577
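
Since the point of Lasso is variable selection, it can help to look at the fitted coefficients next to the feature names. The sketch below uses hypothetical random data whose column names mirror the notebook; the values and the choice `alpha=0.05` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical toy data: column names mirror the notebook, values are random.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.random((200, 4)), columns=["income", "bedrooms", "households", "rooms"]
)
# By construction, only "income" and "rooms" influence the target.
y_toy = 3.0 * X_toy["income"] - 2.0 * X_toy["rooms"] + rng.normal(0, 0.1, 200)

model_toy = Lasso(alpha=0.05).fit(X_toy, y_toy)

# Pair each coefficient with its feature name.
coef_table = pd.Series(model_toy.coef_, index=X_toy.columns)
print(coef_table)
print("Dropped features:", list(coef_table[coef_table == 0].index))
```

The same pattern would apply to the fitted `model` above by pairing `model.coef_` with `X.columns`.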

In [5]:
# Regression metrics: R-squared and RMSE
# RMSE (Root Mean Squared Error)
from sklearn.metrics import mean_squared_error
import numpy as np
MSE_train = mean_squared_error(y_train, pred_train)
MSE_test = mean_squared_error(y_test, pred_test)
print("Train RMSE:", np.sqrt(MSE_train))
print("Test RMSE:", np.sqrt(MSE_test))

Train RMSE: 64340.34152172676
Test RMSE: 63220.748913873045
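
RMSE is expressed in the same unit as the target, so the values above are in units of house_value. As a hand-checkable illustration of the formula, the sketch below computes RMSE on three hypothetical values both manually and through scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Tiny hand-checkable example (hypothetical values).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3.
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# The same value via scikit-learn's MSE followed by a square root.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```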


In [7]:
# Suppress warnings (e.g., Lasso warns when alpha=0.0 is tried)
import warnings
warnings.filterwarnings("ignore")

# Hyperparameter tuning - Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.5, 1, 2, 3]}

grid_search = GridSearchCV(Lasso(), param_grid, cv=5)
grid_search.fit(X_scaled_train, y_train)

GridSearchCV(cv=5, estimator=Lasso(),
             param_grid={'alpha': [0.0, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1,
                                   0.5, 1, 2, 3]})

In [8]:
# Check the tuning results
print("Best Parameter : {}".format(grid_search.best_params_))
print("Best Score : {:.4f}".format(grid_search.best_score_))
print("Test set Score : {:.4f}".format(grid_search.score(X_scaled_test, y_test)))

Best Parameter : {'alpha': 0.5}
Best Score : 0.5452
Test set Score : 0.5627
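
The model table in 14.2 also lists `LassoCV`, which selects alpha by cross-validation along a regularization path. The following sketch shows one way it might replace the manual grid; it runs on synthetic stand-in data (an assumption), and alpha=0 is left out of the candidate list because Lasso is not intended for that setting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house price data (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Candidate values mirror the grid above, without alpha=0.
candidate_alphas = [1e-4, 1e-3, 1e-2, 0.1, 0.5, 1, 2, 3]
lasso_cv = LassoCV(alphas=candidate_alphas, cv=5).fit(X_tr, y_tr)

print("Selected alpha:", lasso_cv.alpha_)
print("Test R^2:", lasso_cv.score(X_te, y_te))
```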


In [10]:
# Hyperparameter tuning 2 - Randomized Search
from sklearn.model_selection import RandomizedSearchCV
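from scipy.stats import randint

# Note: randint is a discrete (integer) distribution, so only integer alphas are sampled here;
# a continuous distribution such as scipy.stats.uniform or loguniform suits alpha better.
param_distribs = {'alpha': randint(low=0.00001, high=100)}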

random_search = RandomizedSearchCV(Lasso(), param_distributions=param_distribs, n_iter=100, cv=5)
random_search.fit(X_scaled_train, y_train)

RandomizedSearchCV(cv=5, estimator=Lasso(), n_iter=100,
                   param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017AEBE181C0>})

In [11]:
# Check the tuning results
print("Best Parameter : {}".format(random_search.best_params_))
print("Best Score : {:.4f}".format(random_search.best_score_))
print("Test set Score : {:.4f}".format(random_search.score(X_scaled_test, y_test)))

Best Parameter : {'alpha': 0}
Best Score : 0.5452
Test set Score : 0.5627
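
Because `randint` only produces integers, the search above never tries values between 0 and 1. A minimal sketch of a continuous alternative, assuming `scipy.stats.loguniform` and synthetic stand-in data, could look like this; the bounds, `n_iter=20`, and `random_state=42` are arbitrary illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the house price data (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)

# loguniform draws continuous values spread evenly across orders of magnitude.
param_distribs_demo = {"alpha": loguniform(1e-5, 100)}

search_demo = RandomizedSearchCV(
    Lasso(max_iter=10000),
    param_distributions=param_distribs_demo,
    n_iter=20,
    cv=5,
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

best_alpha = search_demo.best_params_["alpha"]
print("Best alpha:", best_alpha)
```

Sampling on a log scale gives small alphas such as 0.001 and large ones such as 10 a comparable chance of being tried.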
