# Regression and Regularization

Using regularization - Lasso and Ridge

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.metrics import mean_squared_error, root_mean_squared_error
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline



We are going to use the `Credit` data from ISLP, which can be [downloaded from this link](https://drive.google.com/uc?download&id=1joK1gnatsAANBNBXCBk1vtEr1g3FwR2W).

In [None]:
# upload the file from your local disk (works in Google Colab only)
from google.colab import files
uploaded = files.upload()

In [None]:
# load the Credit data (from ISLP) into a DataFrame

Credit = pd.read_csv('Credit.csv')
Credit.head()


This is a data set regarding credit card customers.  We want to see if we can predict how much balance they will keep on their cards based on demographics and other features.

In [None]:
Credit.describe()

In [None]:
Credit.hist(figsize=(15,15))
plt.show()

All of the features seem fairly well behaved, and there are no missing values.

Because we are doing regression, we need to apply `get_dummies` to the categorical features (make sure to use `dtype=int`, because regression algorithms need numeric inputs).

In [None]:
Credit=pd.get_dummies(Credit,drop_first=True, dtype=int)
Credit.head()
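To see what `get_dummies` does with these settings, here is a minimal sketch on a made-up three-row frame. The column names mimic the `Credit` data, but the values are purely illustrative.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values)
toy = pd.DataFrame({
    "Income": [14.9, 106.0, 104.6],
    "Student": ["No", "Yes", "No"],
})

# drop_first=True drops the first level ("No"), so only Student_Yes remains;
# dtype=int gives 0/1 integers instead of True/False
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Dropping the first level avoids a redundant column: once `Student_Yes` is known, a `Student_No` column would add no information and would be perfectly collinear with the intercept.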

In [None]:
# define our X and y

X=Credit.drop(["ID","Balance"],axis=1)
y=Credit["Balance"]

First fit a standard OLS model

I like `OLS` from `statsmodels.api` because it provides a nice regression output table, but `LinearRegression` from `sklearn.linear_model` would also work fine here.

(OLS=_Ordinary Least Squares_ and is equivalent to Multiple Linear Regression)
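As a quick check of that equivalence, the sketch below fits the same synthetic data twice: once by solving least squares directly with a column of ones added for the intercept, and once with `LinearRegression`, which adds the intercept itself. The data and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Least squares "by hand": prepend a column of ones so an intercept is estimated
X_const = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_const, y_demo, rcond=None)

# scikit-learn fits the intercept itself (fit_intercept=True by default)
lin = LinearRegression().fit(X_demo, y_demo)

print("by hand :", beta.round(4))
print("sklearn :", np.r_[lin.intercept_, lin.coef_].round(4))
```

The column of ones plays the same role as the constant that `sm.add_constant` adds for statsmodels, whose `OLS` does not include an intercept on its own.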

In [None]:
from statsmodels.api import OLS

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

# statsmodels' OLS does not add an intercept automatically, so add a constant column first
X_train = sm.add_constant(X_train)
mlr_model=OLS(y_train,X_train).fit()

print(mlr_model.summary(slim=True))

Review the regression table. Recall that features with a p-value (the `P>|t|` column) below 0.05 are typically labelled as significant. Which features are significant?

The table above is based entirely on the training set; let's calculate an RMSE on the test set.

In [None]:
# as with the training set, add the constant column so the columns match the fitted model
X_test = sm.add_constant(X_test)
y_pred=mlr_model.predict(X_test)

# scikit-learn's argument order is (y_true, y_pred)
mlr_rmse=root_mean_squared_error(y_test, y_pred)

print(mlr_rmse.round(1))

96.6


Can you interpret this RMSE value in the context of the problem?
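As a reminder of what the number measures, here is RMSE computed by hand on four made-up observations. It is the square root of the average squared prediction error, so it is expressed in the same units as the target.

```python
import numpy as np

# made-up actual and predicted values, just to show the arithmetic
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat                       # -10, 10, -20, 20
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 400 + 400) / 4)
print(round(float(rmse_manual), 2))           # 15.81
```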

## Ridge Regression

Now let's see if ridge regression can do any better.  Remember that Ridge with $\alpha=0$ is the same as regular multiple regression (OLS) and should match the value above.
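The sketch below illustrates both points on synthetic data (not the `Credit` data): with `alpha=0` the Ridge coefficients agree with `LinearRegression`, and a large `alpha` pulls the coefficient vector toward zero. The value `alpha=100` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge0 = Ridge(alpha=0).fit(X_demo, y_demo)       # no penalty
ridge100 = Ridge(alpha=100).fit(X_demo, y_demo)   # heavy penalty

print("same as OLS at alpha=0:", np.allclose(ols.coef_, ridge0.coef_))
print("coefficient norm, alpha=0  :", np.linalg.norm(ridge0.coef_).round(3))
print("coefficient norm, alpha=100:", np.linalg.norm(ridge100.coef_).round(3))
```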


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

alpha_best = 0
rmse_best = 10000

for a in np.arange(0,10,.5):
  model = Ridge(alpha=a, max_iter=10000) # what we called lambda in class is alpha in the Ridge function

  model.fit(X_train, y_train)
  y_pred=model.predict(X_test)
  rmse=root_mean_squared_error(y_test, y_pred) # argument order is (y_true, y_pred)
  print("alpha=",a,": rmse = ",rmse.round(3))
  if rmse < rmse_best:
      rmse_best = rmse
      alpha_best = a
print("\nBest alpha = ",alpha_best)

What is the best value of $\alpha$?  What does this tell you?
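One caveat when reading the result: the loop picks $\alpha$ by looking at test-set RMSE, so the winning RMSE is somewhat optimistic. A common alternative, sketched here on synthetic data, is to choose $\alpha$ by cross-validation on the training set only and touch the test set once at the end. `RidgeCV` is scikit-learn's helper for this; the grid and the 5 folds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=99)

# alpha is chosen by 5-fold cross-validation within the training data
alphas = np.arange(0.5, 10, 0.5)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)

# the test set is used only once, to report the final error
rmse_test = np.sqrt(np.mean((y_te - ridge_cv.predict(X_te)) ** 2))
print("alpha chosen by CV:", ridge_cv.alpha_)
print("test RMSE:", rmse_test.round(3))
```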

## Lasso Regression

Lasso regularized regression works like Ridge, but uses a different penalty (L1) based on the absolute values of the coefficients.  Can we do better?
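For reference, the two penalties can be written side by side. This is a sketch in the parametrization that scikit-learn documents for `Ridge` and `Lasso`, with $\alpha$ playing the role of $\lambda$ from class. The symbols $\beta$ (the coefficients, with the intercept left unpenalized) and $n$ (the number of training rows) are notation introduced here.

$$
\text{Ridge:}\quad \min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \alpha \sum_{j} \beta_j^2
$$

$$
\text{Lasso:}\quad \min_{\beta}\; \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha \sum_{j} \lvert \beta_j \rvert
$$

Because the squared-error term is scaled differently in the two objectives, the same numeric $\alpha$ does not correspond to the same amount of shrinkage in Ridge and Lasso.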

In [None]:
alpha_best = 0
rmse_best = 10000

for a in np.arange(1,10,.5):
  model = Lasso(alpha=a, max_iter=10000) # what we called lambda in class is alpha in the Lasso function

  model.fit(X_train, y_train)
  y_pred=model.predict(X_test)
  rmse=root_mean_squared_error(y_test, y_pred)
  print("alpha=",a,": rmse = ",rmse.round(3))
  if rmse < rmse_best:
      rmse_best = rmse
      alpha_best = a
print("\nBest alpha = ",alpha_best)

Looks like we might be able to do better with a little shrinkage!  Which method gives the best RMSE overall?

Let's take this best method and look at the shrunken coefficients compared to the full (OLS) model.

In [None]:
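# "no shrink" baseline: plain OLS (scikit-learn advises against Lasso with alpha=0 for numerical reasons)
model_noshrink = LinearRegression()
model_noshrink.fit(X_train,y_train)

# refit the Lasso at the best alpha found in the search above
model_best = Lasso(alpha=alpha_best, max_iter=10000)
model_best.fit(X_train,y_train)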

coef_table = zip(X.columns,model_best.coef_.round(4),model_noshrink.coef_.round(4))


coef_df = pd.DataFrame(coef_table, columns=['colname', 'coef_best','coef_noshrink'])

print(coef_df)

Note the coefficients that were shrunk ALL the way to zero.  This is one of the great features of the Lasso: it judges that these features add little beyond the others and effectively removes them from the model by setting their coefficients to zero.
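The contrast with Ridge is easiest to see on data where we know which features matter. In this sketch the data are synthetic: only the first two of six columns drive the response, and `alpha=0.5` is an arbitrary illustrative value for both models.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(500, 6))
# only the first two columns drive y; the other four are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("lasso:", lasso.coef_.round(3))
print("ridge:", ridge.coef_.round(3))
print("exact zeros (lasso, ridge):",
      int(np.sum(lasso.coef_ == 0)), int(np.sum(ridge.coef_ == 0)))
```

Ridge shrinks coefficients toward zero but leaves them nonzero, whereas the L1 penalty can set them to exactly zero, which is why only the Lasso acts as a feature selector.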

## Regularization with Logistic Regression

We will look again at the Tayko data from earlier in the class: [download link here](https://drive.google.com/uc?download&id=1wo7x7PmnCJ5-79RZXJSIAa7eS8DdrX-y).  This is a company that is trying to predict Purchase from catalog mailings and other customer attributes.



In [None]:

from google.colab import files
uploaded = files.upload()


Saving Tayko.csv to Tayko.csv


In [None]:
from sklearn.linear_model import LogisticRegression


tayko = pd.read_csv('Tayko.csv')

X=tayko.drop(["sequence_number","Spending","Purchase"],axis=1)
y=tayko["Purchase"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

# fit a logistic regression of y on X; the very large C makes the penalty negligible
model = LogisticRegression(max_iter=10000, C=10000)
model.fit(X_train, y_train)

coef_table = zip(X.columns, model.coef_[0])  # Access coefficients from coef_[0]
coef_df = pd.DataFrame(coef_table, columns=['Feature', 'Coefficient'])
print(coef_df)

y_pred=model.predict(X_test)
rmse=root_mean_squared_error(y_test, y_pred) # argument order is (y_true, y_pred)
print("RMSE= ",rmse.round(3))
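Since `model.predict` returns class labels (0 or 1), every squared error is either 0 or 1, and the RMSE computed here is simply the square root of the misclassification rate. A small check with made-up labels:

```python
import numpy as np

# made-up labels: two of the eight predictions are wrong
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

rmse_labels = np.sqrt(np.mean((y_true - y_hat) ** 2))
accuracy = np.mean(y_true == y_hat)

print(rmse_labels, accuracy, np.sqrt(1 - accuracy))   # 0.5 0.75 0.5
```

So a lower RMSE in this section always means a higher accuracy, and vice versa.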

In [None]:
# For logistic regression the C parameter is 1/alpha

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

alpha_best = 0
rmse_best = 10000

for c in np.arange(.1,2,.1):
  model = LogisticRegression(penalty='l1',C=c,solver='liblinear') # C is the inverse of what we called lambda in class, so a smaller C means more shrinkage
  model.fit(X_train, y_train)
  y_pred=model.predict(X_test)
  rmse=root_mean_squared_error(y_test, y_pred)
  print("C=",round(c,2),": rmse = ",rmse.round(4))
  if rmse < rmse_best:
      rmse_best = rmse
      alpha_best = c
print("\nBest C = ",alpha_best.round(2))

You can switch from Lasso to Ridge by changing `penalty` from `'l1'` to `'l2'`.
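To see what the switch changes, the sketch below fits both penalties on synthetic data in which only the first two of eight features influence the class. The value `C=0.02` is an arbitrary strong-penalty setting chosen for illustration, and `zero_counts` is just a helper dictionary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(500, 8))
# only the first two columns influence the class; the rest are noise
logits = 2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
y_demo = (logits + rng.logistic(size=500) > 0).astype(int)

zero_counts = {}
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(penalty=penalty, C=0.02, solver="liblinear")
    clf.fit(X_demo, y_demo)
    # count coefficients that were set to exactly zero
    zero_counts[penalty] = int(np.sum(clf.coef_[0] == 0))
    print(penalty, "-> coefficients exactly zero:", zero_counts[penalty], "of", X_demo.shape[1])
```

As with linear regression, the L1 penalty can remove features outright, while the L2 penalty only shrinks them.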

Now let's build the table to see if the lasso removes any features.

In [None]:
# C=10 applies only a weak penalty, so this serves as the (nearly) unshrunk reference
model_noshrink = LogisticRegression(penalty='l1',C=10,solver='liblinear')
model_noshrink.fit(X_train,y_train)


C_best = 0.5
model_best = LogisticRegression(penalty='l1',C=C_best,solver='liblinear')
model_best.fit(X_train,y_train)

pd.DataFrame(zip(X.columns, model_best.coef_[0].round(3), model_noshrink.coef_[0].round(3)),
             columns=['Feature', 'Coefficient_best','Coefficient_noshrink'])
