(10 points) Perform feature scaling (using L2 norm) over the auto data set.
Use two thirds of the data for training and the remaining one third for testing.
Train a multivariate linear regression (sklearn.linear_model.LinearRegression)
with “mpg” as the response and all other variables except “car name” as the
predictors. What’s the coefficient for the “year” attribute, and what does the
coefficient suggest? What’s the accuracy (mean squared error) of the model
on the test data (one third of the mpg data set)?

In [1]:
#import the required modules

import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn import metrics

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split





In [2]:
#Read the csv file and dataset

data=pd.read_csv("auto-mpg.csv",na_values="?")
data.describe()

             mpg  cylinders  displacement  horsepower       weight  acceleration  model year    origin
count      398.0      398.0         398.0       392.0        398.0         398.0       398.0     398.0
mean   23.514573   5.454774    193.425879  104.469388  2970.424623      15.56809    76.01005  1.572864
std     7.815984   1.701004    104.269838    38.49116   846.841774      2.757689    3.697627  0.802055
min          9.0        3.0          68.0        46.0       1613.0           8.0        70.0       1.0
25%         17.5        4.0        104.25        75.0      2223.75        13.825        73.0       1.0
50%         23.0        4.0         148.5        93.5       2803.5          15.5        76.0       1.0
75%         29.0        8.0         262.0       126.0       3608.0        17.175        79.0       2.0
max         46.6        8.0         455.0       230.0       5140.0          24.8        82.0       3.0


In [3]:
#Make sure the horsepower field is numeric for the calculations later
#and fill the missing values (6 in horsepower, read in as NaN from "?") with 0

data['horsepower'] = pd.to_numeric(data['horsepower'])
data = data.fillna(0)
data.head()


    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0       130.0    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5          70       1                ford torino


In [4]:
#Drop car name as we don't need it for predictions

data1=data.drop(columns='car name')


#Drop mpg as it is the value we need to predict

data1=data1.drop(columns='mpg')

X=data1[list(data1.columns)]

y=data['mpg']

#Perform feature scaling (using L2 norm)
X_normalized=preprocessing.normalize(X, norm='l2')

#Split of training and testing data according to problem statement. 1/3rd data for testing
X_train, X_test, y_train, y_test =train_test_split(X_normalized, y, test_size=0.33,random_state=123)
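
To make the scaling step concrete, here is a minimal plain-Python sketch of what `preprocessing.normalize(X, norm='l2')` does with its default `axis=1`: each row (one car) is divided by its own Euclidean norm, rather than each column being rescaled. The helper name `l2_normalize_rows` is hypothetical; the two sample rows are the first two predictor rows from the `data.head()` output.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row so its Euclidean (L2) norm is 1; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized

# Column order: cylinders, displacement, horsepower, weight,
# acceleration, model year, origin
sample_rows = [
    [8, 307.0, 130.0, 3504, 12.0, 70, 1],
    [8, 350.0, 165.0, 3693, 11.5, 70, 1],
]

scaled_rows = l2_normalize_rows(sample_rows)
for row in scaled_rows:
    print([round(v, 4) for v in row])
```

Because weight is far larger than the other raw values, it dominates each row's norm: the scaled weight entry ends up close to 1 and the remaining entries become very small. This is one likely reason why the fitted coefficients come out in the hundreds or thousands.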


In [5]:
#Define the Linear Regression model

lr = LinearRegression()
model = lr.fit(X_train, y_train)

# Finding coefficients
coefficients_list = pd.DataFrame(model.coef_, X.columns,columns=['Coefficients'])
coefficients_list

              Coefficients
cylinders     -2456.030542
displacement       8.53963
horsepower     -163.080173
weight          -61.750582
acceleration   -525.006011
model year      980.794359
origin         1208.596008


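**The coefficient of the year attribute is 980.794359. Because it is positive, the model predicts a higher mpg for a later model year when the other predictors are held fixed, which suggests that newer cars tend to be more fuel-efficient. Since the predictors were L2-normalized before fitting, the magnitude applies to the scaled feature and should not be read as mpg gained per calendar year.**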

In [6]:
#Prediction on the test data
y_pred = model.predict(X_test)

#Calculate accuracy(mean squared error)
print('Accuracy here (Mean Squared Error): ',round(metrics.mean_squared_error(y_test, y_pred), 3))

Accuracy here (Mean Squared Error):  11.507
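
As a reminder of what this number measures, the sketch below computes the mean squared error by hand and converts the reported test MSE into a root mean squared error, which is in mpg units. Note that MSE is an error measure, so lower is better. The helper name `mse` and the toy values are hypothetical and only illustrate the arithmetic; they are not taken from the data set.

```python
import math

def mse(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed and predicted mpg values, for illustration only
toy_true = [18.0, 15.0, 30.0]
toy_pred = [20.0, 14.0, 27.0]
toy_mse = mse(toy_true, toy_pred)  # (4 + 1 + 9) / 3

# Square root of the test MSE printed above, expressed in mpg
reported_mse = 11.507
reported_rmse = math.sqrt(reported_mse)
print(round(toy_mse, 3), round(reported_rmse, 3))
```

Read this way, a test MSE of 11.507 corresponds to a typical prediction error of roughly 3.4 mpg.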


(10 points) Try linear regression with regularization (Ridge and Lasso) as implemented in sklearn (RidgeCV and LassoCV). Use the cross-validation approach and compare the coefficients for the different attributes.

In [7]:
#Set range of values for alphas and find best parameters using RidgeCV for Ridge regression and LassoCV for Lasso Regression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

#Set a range of alphas to choose from
alphas = np.logspace(-6, 6, 13)

#Call RidgeCV with RepeatedKFold cross validation
clf_ridge = RidgeCV(cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=1),alphas=alphas).fit(X_train, y_train)

#Get the best alpha from the cross validation
clf_ridge_alpha=clf_ridge.alpha_

#Set a range of alphas to choose from

alphas = np.logspace(-6, 6, 13)

#Call LassoCV with RepeatedKFold cross validation 

clf_lasso = LassoCV(cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=1),alphas=alphas).fit(X_train, y_train)

#Get the best alpha from the cross validation

clf_lasso_alpha=clf_lasso.alpha_
print(clf_ridge_alpha)
print(clf_lasso_alpha)


0.0001
0.001
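
To put the two selected values in context, the sketch below rebuilds the candidate grid in plain Python (`np.logspace(-6, 6, 13)` yields the 13 powers of ten from 1e-6 to 1e6) and locates each chosen alpha on it. The variable names are hypothetical.

```python
# Plain-Python equivalent of np.logspace(-6, 6, 13): 13 powers of ten
alpha_grid = [10.0 ** exponent for exponent in range(-6, 7)]

# Alphas printed by the cross-validation cell above
selected = {"ridge": 0.0001, "lasso": 0.001}

# Index of the grid value closest to each selected alpha
positions = {
    name: min(range(len(alpha_grid)), key=lambda i: abs(alpha_grid[i] - alpha))
    for name, alpha in selected.items()
}

for name, index in positions.items():
    print(name, "-> grid position", index, "of", len(alpha_grid) - 1)
```

Both choices sit near the low end of the grid, so cross-validation favors weak regularization here. The two alphas are not directly comparable, because scikit-learn's Lasso objective divides the squared-error term by twice the number of samples while the Ridge objective does not.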


In [8]:
#Use the best alpha value to find the coefficients for Ridge Regression

#Call the Ridge function
model_ridge = Ridge(alpha=clf_ridge_alpha)

#Apply Ridge Regression on given dataset
model_ridge.fit(X_train,y_train)

#Get the predictions
y_pred = model_ridge.predict(X_test)

#Get coefficients for each attribute
coefficients_list_ridge = pd.DataFrame(model_ridge.coef_, X.columns,columns=['Coefficients'])
print(coefficients_list_ridge)

#Get the accuracy and MSE
print('Accuracy (Mean Squared Error): ',round(metrics.mean_squared_error(y_test, y_pred), 3))

              Coefficients
cylinders      -144.450046
displacement    -40.091388
horsepower     -135.194772
weight          -55.469389
acceleration   -309.568770
model year      885.361242
origin          153.798514
Accuracy (Mean Squared Error):  11.953


In [10]:
#Use the best alpha value to find the coefficients for Lasso Regression


#Call the Lasso function
model_lasso = Lasso(alpha=clf_lasso_alpha)

#Apply Lasso Regression on given dataset
model_lasso.fit(X_train,y_train)

#Get the predictions
y_pred = model_lasso.predict(X_test)

#Get coefficients for each attribute
coefficients_list_lasso = pd.DataFrame(model_lasso.coef_, X.columns,columns=['Coefficients'])
print(coefficients_list_lasso)

#Get the accuracy and MSE
print('Accuracy (Mean Squared Error): ',round(metrics.mean_squared_error(y_test, y_pred), 3))

              Coefficients
cylinders        -0.000000
displacement    -38.064104
horsepower      -93.725406
weight            0.000000
acceleration     -0.000000
model year      813.792933
origin            0.000000
Accuracy (Mean Squared Error):  12.165


(10 points) Finally, compare the results obtained for ordinary linear regression,
Ridge, and Lasso (using the α values that gave the lowest test MSE for the
latter two). Does the type of regularization used affect the importance of the
attributes? How can you interpret these results?

Comparing Ridge with Lasso, the positive coefficients are larger under Ridge: 885.36 vs. 813.79 for model year, and 153.80 vs. 0 for origin.

The negative coefficients are also larger in magnitude under Ridge for cylinders (-144.45 vs. 0), displacement (-40.09 vs. -38.06), and horsepower (-135.19 vs. -93.73). Weight and acceleration keep non-zero Ridge coefficients (-55.47 and -309.57) but are set to 0 by Lasso.

Ordinary linear regression without regularization gives the lowest test MSE (11.507). Between the two regularized models, Ridge (11.953) performs better on the test data than Lasso (12.165).

The coefficients behave differently under each method. Lasso's L1 penalty drives some coefficients to exactly 0, so it acts as a form of feature selection: here only displacement, horsepower, and model year keep non-zero coefficients, which suggests that these three carry most of the predictive signal in the training data. Ridge's L2 penalty shrinks coefficients toward 0 (for example, cylinders goes from -2456.03 in ordinary linear regression to -144.45) but sets none of them to exactly 0. These results depend on the random_state used in train_test_split, so a different split may change which method performs best on this particular dataset.

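The type of regularization therefore does affect the apparent importance of the attributes: Ridge keeps all seven predictors with reduced weights, while Lasso keeps three and discards the rest.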
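
The difference between the two penalties can be illustrated with a simplified sketch. It assumes an orthonormal design matrix (which does not hold for the auto data), where both estimators have closed forms in terms of the unregularized coefficient: Ridge rescales it, and Lasso soft-thresholds it. The function names, the starting coefficients, and the alpha value are hypothetical.

```python
def ridge_shrink(beta, alpha):
    """Ridge under an orthonormal design, objective ||y - Xb||^2 + alpha * ||b||^2:
    every coefficient is scaled by the same factor 1 / (1 + alpha)."""
    return beta / (1.0 + alpha)

def lasso_shrink(beta, alpha):
    """Lasso under an orthonormal design, objective 0.5 * ||y - Xb||^2 + alpha * ||b||_1:
    soft-thresholding moves the coefficient toward 0 by alpha and clips at 0."""
    magnitude = max(abs(beta) - alpha, 0.0)
    return magnitude if beta >= 0 else -magnitude

# Hypothetical unregularized coefficients, for illustration only
ols_coefficients = [5.0, -3.0, 0.4, -0.2]
alpha = 0.5

ridge = [ridge_shrink(b, alpha) for b in ols_coefficients]
lasso = [lasso_shrink(b, alpha) for b in ols_coefficients]

for b, r, l in zip(ols_coefficients, ridge, lasso):
    print(f"ols={b:6.2f}  ridge={r:7.4f}  lasso={l:7.4f}")
```

In this toy setting Ridge keeps every coefficient non-zero, while Lasso zeroes the two coefficients whose magnitude is below alpha. This mirrors the pattern in the tables above, where only Lasso produces exact zeros.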

