#Regularized linear model - Ridge regression and lasso

##Introduction

LASSO, which stands for Least Absolute Shrinkage and Selection Operator, is a technique for controlling model complexity, alongside variable selection and ridge regression. In this notebook we fit ordinary least squares, LASSO, and ridge regression with scikit-learn and compare the coefficients each model produces. For more information about LASSO you can refer to the LASSO Page.
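
As a reference for the two penalties compared in this notebook, the objectives can be written as below. This follows the form given in the scikit-learn documentation for `Ridge` and `Lasso`; the symbols are introduced here for illustration: $X$ is the feature matrix, $y$ the target, $\beta$ the coefficient vector, $n$ the number of samples, and $\alpha$ the penalty strength (the `alpha` argument).

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
```

The L1 term can set coefficients exactly to zero, while the L2 term only shrinks them, which is why the two models report such different coefficient lists on the same data.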

##Target audience

This notebook is targeted toward data scientists who understand linear regression and want to find out how to fit LASSO and ridge regression in Python. An operationalization step is also included to show how you can deploy a web service in Azure based on the selected model.

##Data

In [89]:
import pandas as pd
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

file_path = 'C:\\Users\\sujat\\OneDrive\\Coursera\\Projects\\blogging\\student\\'


In [None]:
def load_data():
    student = pd.read_csv(file_path + "student.csv")
    return student


In [None]:
student = load_data()

In [119]:
len(student)
len(student.columns.tolist())

#student.head(1)

student_X = student.iloc[:,0:52]
student_y = student.iloc[:,52:]


#print student_X
#print student_y

In [85]:
#Preprocessing - One-hot encoding

# get_dummies keeps numeric columns as-is and expands each categorical column
ohc_X = pd.get_dummies(student_X)
print(ohc_X.head())


#print student_X

   age  Medu  Fedu  traveltime.x  studytime.x  failures.x  famrel.x  \
0   15     1     1             2            4           1         3   
1   15     1     1             1            2           2         3   
2   15     2     2             1            1           0         4   
3   15     2     4             1            3           0         4   
4   15     3     3             2            3           2         4   

   freetime.x  goout.x  Dalc.x       ...        famsup.y_no  famsup.y_yes  \
0           1        2       1       ...                  0             1   
1           3        4       2       ...                  0             1   
2           3        1       1       ...                  0             1   
3           3        2       1       ...                  0             1   
4           2        1       2       ...                  0             1   

   paid.y_no  paid.y_yes  activities.y_no  activities.y_yes  higher.y_no  \
0          0           1          
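
The output above shows that each yes/no column becomes a pair such as `famsup.y_no` and `famsup.y_yes`. The two columns in a pair carry the same information, which is why the unregularized model later reports mirrored coefficients for them. The sketch below, on a made-up toy frame (the values are hypothetical, not taken from `student.csv`), shows how `drop_first=True` keeps one column per binary feature. Whether to drop a level is a modeling choice and not something the original notebook does.

```python
import pandas as pd

# Hypothetical toy data; column names only mimic the student data set.
toy = pd.DataFrame({
    "age": [15, 16, 17],
    "higher": ["yes", "no", "yes"],
    "school": ["GP", "MS", "GP"],
})

# One indicator column per category level (what the notebook does).
full = pd.get_dummies(toy)

# Drop the first level of each categorical to avoid redundant columns.
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```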

In [120]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    ohc_X, student_y, test_size=0.33, random_state=42)

In [138]:
import numpy as np
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# The coefficients
#print('Coefficients: \n', regr.coef_)
# Mean squared error on the test set
# (.values gives a plain array so np.mean returns a single scalar)
print("Mean squared error: %.2f"
      % np.mean((regr.predict(X_test) - y_test.values) ** 2))
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(X_test, y_test))

#print(len(regr.coef_[0]))


Mean squared error: 2.57
R^2 score: 0.71
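
The first number printed above is the mean of the squared test errors, and the second is the value returned by `score`, which for a regressor is the coefficient of determination R^2. A minimal sketch on synthetic data (the data and the 70/30 split are assumptions for illustration only) shows the same two quantities computed with `sklearn.metrics`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: 3 features, the third has no effect on the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.dot(X, np.array([1.5, -2.0, 0.0])) + rng.normal(scale=0.5, size=100)

# Simple hold-out split: first 70 rows train, last 30 rows test.
model = LinearRegression().fit(X[:70], y[:70])
pred = model.predict(X[70:])

mse = mean_squared_error(y[70:], pred)      # mean of squared errors
r2 = r2_score(y[70:], pred)                 # same value as model.score(...)
manual_mse = np.mean((pred - y[70:]) ** 2)  # the formula used in the notebook

print("Mean squared error: %.2f" % mse)
print("R^2 score: %.2f" % r2)
```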


In [139]:
def print_model(coefs, names=None, sort=False):
    # Format the coefficients as a readable "coef * name + ..." string
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Largest absolute coefficient first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

names = X_train.columns.tolist()
print("Linear model: " + print_model(regr.coef_[0], names, True))



Linear model: 0.735 * G2.y + 0.657 * famrel.y + -0.591 * failures.y + -0.56 * higher.y_no + 0.56 * higher.y_yes + -0.56 * famrel.x + -0.379 * reason_other + -0.378 * guardian.x_father + -0.371 * guardian.y_mother + 0.31 * studytime.x + -0.307 * studytime.y + 0.306 * famsup.x_yes + -0.306 * famsup.x_no + 0.294 * Mjob_at_home + 0.292 * school_GP + -0.292 * school_MS + 0.284 * reason_course + 0.282 * traveltime.x + -0.273 * internet_no + 0.273 * internet_yes + -0.271 * failures.x + -0.257 * health.y + 0.254 * Pstatus_A + -0.254 * Pstatus_T + -0.243 * Mjob_other + 0.229 * health.x + -0.221 * Dalc.y + -0.217 * Fjob_other + -0.211 * activities.y_no + 0.211 * activities.y_yes + 0.207 * famsize_GT3 + -0.207 * famsize_LE3 + 0.206 * guardian.x_mother + 0.205 * famsup.y_no + -0.205 * famsup.y_yes + 0.201 * Dalc.x + 0.198 * guardian.y_father + 0.184 * G1.y + -0.181 * Walc.x + 0.172 * guardian.x_other + 0.172 * guardian.y_other + 0.161 * Walc.y + -0.155 * goout.y + 0.153 * Fjob_health + -0.144 * fr

In [136]:
#Lasso
from sklearn.linear_model import Lasso
# alpha sets the strength of the L1 penalty; larger values zero out more coefficients
lasso = Lasso(alpha=.3)
lasso.fit(X_train, y_train)

print("Lasso model: " + print_model(lasso.coef_, names, True))


 Lasso model:  0.845 * G2.y + 0.133 * G1.y + 0.022 * G3.x + 0.021 * G1.x + -0.001 * absences.x + -0.0 * age + 0.0 * Medu + 0.0 * Fedu + 0.0 * traveltime.x + 0.0 * studytime.x + -0.0 * failures.x + 0.0 * famrel.x + -0.0 * freetime.x + -0.0 * goout.x + -0.0 * Dalc.x + -0.0 * Walc.x + -0.0 * health.x + 0.0 * G2.x + 0.0 * traveltime.y + 0.0 * studytime.y + -0.0 * failures.y + 0.0 * famrel.y + -0.0 * freetime.y + -0.0 * goout.y + -0.0 * Dalc.y + -0.0 * Walc.y + -0.0 * health.y + 0.0 * absences.y + 0.0 * school_GP + -0.0 * school_MS + 0.0 * sex_F + -0.0 * sex_M + -0.0 * address_R + 0.0 * address_U + 0.0 * famsize_GT3 + -0.0 * famsize_LE3 + 0.0 * Pstatus_A + -0.0 * Pstatus_T + 0.0 * Mjob_at_home + -0.0 * Mjob_health + -0.0 * Mjob_other + 0.0 * Mjob_services + 0.0 * Mjob_teacher + 0.0 * Fjob_at_home + 0.0 * Fjob_health + -0.0 * Fjob_other + -0.0 * Fjob_services + 0.0 * Fjob_teacher + 0.0 * reason_course + 0.0 * reason_home + -0.0 * reason_other + -0.0 * reason_reputation + 0.0 * nursery_no + -
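
The notebook fixes `alpha=.3` by hand. One common alternative is to let cross-validation pick the penalty strength with `LassoCV`. The sketch below uses synthetic data (an assumption for illustration, not the student data), where only the first two of ten features affect the target:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 10 features, only the first two are informative.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0] + [0.0] * 8)
y = np.dot(X, true_coef) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation over an automatically generated grid of alphas.
lasso_cv = LassoCV(cv=5).fit(X, y)

n_nonzero = int(np.sum(lasso_cv.coef_ != 0))
print("Chosen alpha: %.4f" % lasso_cv.alpha_)
print("Non-zero coefficients: %d of %d" % (n_nonzero, X.shape[1]))
```

On the student data the same call would take `X_train` and the target as a 1-D array, for example `y_train.values.ravel()`.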

In [137]:
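#Ridge regression
from sklearn.linear_model import Ridge
# alpha sets the strength of the L2 penalty; coefficients shrink toward zero
# but are typically not set exactly to zero
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print("Ridge model: " + print_model(ridge.coef_[0], names, True))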

Ridge model: 0.749 * G2.y + -0.465 * failures.y + 0.265 * higher.y_yes + -0.265 * higher.y_no + -0.263 * failures.x + -0.246 * reason_other + -0.232 * school_MS + 0.232 * school_GP + -0.214 * Mjob_other + 0.211 * internet_yes + -0.211 * internet_no + -0.21 * Pstatus_T + 0.21 * Pstatus_A + 0.209 * reason_course + 0.203 * famrel.y + 0.198 * Mjob_at_home + 0.182 * traveltime.x + 0.181 * G1.y + 0.168 * famsize_GT3 + -0.168 * famsize_LE3 + -0.144 * guardian.x_father + -0.139 * guardian.y_mother + 0.139 * traveltime.y + 0.124 * guardian.x_other + 0.124 * guardian.y_other + 0.123 * romantic.x_yes + -0.123 * romantic.x_no + -0.115 * Fjob_other + -0.112 * higher.x_no + 0.112 * higher.x_yes + 0.11 * Fjob_at_home + -0.109 * famrel.x + -0.097 * address_R + 0.097 * address_U + 0.096 * famsup.x_yes + -0.096 * famsup.x_no + -0.093 * activities.y_no + 0.093 * activities.y_yes + -0.092 * sex_M + 0.092 * sex_F + -0.088 * nursery_yes + 0.088 * nursery_no + 0.082 * age + 0.079 * reason_home + -0.072 * Fjo
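
Comparing the two outputs above, the lasso model keeps a handful of non-zero coefficients while the ridge model keeps all of them at reduced size. A small sketch on synthetic data (assumed here for illustration; the `alpha` values simply mirror the ones used in the notebook) counts how many coefficients each model sets exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only the first three are informative.
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 1.0]
y = np.dot(X, true_coef) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.3).fit(X, y)
ridge = Ridge(alpha=10).fit(X, y)

# Count coefficients that are exactly zero in each fitted model.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print("Lasso zero coefficients: %d of %d" % (lasso_zeros, X.shape[1]))
print("Ridge zero coefficients: %d of %d" % (ridge_zeros, X.shape[1]))
```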