Regularization - Regularization is a technique to reduce overfitting by Bias-Variance tradeoff.
There are 3 regularization techniques

1. Ridge regularization (L2 regularization)
2. Lasso regularization (L1 regularization)
3. Elastic net

RIDGE REGULARIZATION (L2)

In [310]:
import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
%matplotlib inline

In [311]:
df = pd.read_csv("C:/Users/2068671/OneDrive - Cognizant/Desktop/Datasets/Automobile_data.csv")

In [312]:
df.head()

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0


In [313]:
df.shape

(61, 10)

Handling missing values if any

In [314]:
df.isnull().sum()

index               0
company             0
body-style          0
wheel-base          0
length              0
engine-type         0
num-of-cylinders    0
horsepower          0
average-mileage     0
price               3
dtype: int64

In [315]:
df.dropna(how = "any", axis = 0, inplace = True)

In [316]:
df.isnull().sum()

index               0
company             0
body-style          0
wheel-base          0
length              0
engine-type         0
num-of-cylinders    0
horsepower          0
average-mileage     0
price               0
dtype: int64

Converting categorical to numeric

In [317]:
from word2number import w2n

In [318]:
df["num-of-cylinders"] = df["num-of-cylinders"].apply(w2n.word_to_num)

In [319]:
df.head()

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,alfa-romero,convertible,88.6,168.8,dohc,4,111,21,13495.0
1,1,alfa-romero,convertible,88.6,168.8,dohc,4,111,21,16500.0
2,2,alfa-romero,hatchback,94.5,171.2,ohcv,6,154,19,16500.0
3,3,audi,sedan,99.8,176.6,ohc,4,102,24,13950.0
4,4,audi,sedan,99.4,176.6,ohc,5,115,18,17450.0


In [320]:
def target_mean_encoding(data, column, target):
    labels = data.groupby(column)[target].mean().sort_values(ascending = True).index
    encoded_labels = {a:b for b,a in enumerate(labels, 0)}
    data[column] = data[column].map(encoded_labels)

In [321]:
target_mean_encoding(df, "company", "price")
target_mean_encoding(df, "body-style", "price")
target_mean_encoding(df, "engine-type", "price")

In [322]:
df.head()

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,10,3,88.6,168.8,3,4,111,21,13495.0
1,1,10,3,88.6,168.8,3,4,111,21,16500.0
2,2,10,0,94.5,171.2,4,6,154,19,16500.0
3,3,11,2,99.8,176.6,2,4,102,24,13950.0
4,4,11,2,99.4,176.6,2,5,115,18,17450.0


Dropping index variable

In [323]:
df.drop("index", axis = 1, inplace= True)

In [324]:
df.head()

Unnamed: 0,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,10,3,88.6,168.8,3,4,111,21,13495.0
1,10,3,88.6,168.8,3,4,111,21,16500.0
2,10,0,94.5,171.2,4,6,154,19,16500.0
3,11,2,99.8,176.6,2,4,102,24,13950.0
4,11,2,99.4,176.6,2,5,115,18,17450.0


Pre processing

In [325]:
X = df.iloc[:, :-1].values

In [326]:
y = df.iloc[:, -1].values

In [327]:
from sklearn.model_selection import train_test_split

In [328]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 0)

In [329]:
from sklearn.preprocessing import StandardScaler

In [330]:
sc = StandardScaler()

In [331]:
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

Ridge Regression 

In [332]:
from sklearn.linear_model import Ridge

In [333]:
R = Ridge(alpha = 1, solver = "svd", max_iter = 1000)   
# solver{‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’, ‘lbfgs’}, default=’auto’
# Closed form solution - ‘svd’, ‘cholesky’, ‘lsqr'
# Gradient solvers -  ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’, ‘lbfgs’. Pass max_iter with gradient solvers

In [334]:

R.fit(X_train, y_train)

Ridge(alpha=1, max_iter=1000, solver='svd')

In [335]:
R.coef_

array([ 2871.28999108,   361.99211279,  3626.58851532, -1421.12977056,
        1244.79058089,  -482.07879521,  6545.55397989,  1105.61635705])

Comparing with Linear regression

In [336]:
from sklearn.linear_model import LinearRegression

In [350]:
lin = LinearRegression()

In [338]:
lin.fit(X_train, y_train)

LinearRegression()

In [339]:
lin.coef_

array([ 2779.53943924,   493.56128767,  4548.49396841, -2421.25582675,
        1172.55712637, -1201.34996652,  7931.8321774 ,  1504.89863415])

In [340]:
lin.score(X_train, y_train)

0.9305157808190575

In [341]:
R.score(X_train, y_train)

0.9274882279411523

In [342]:
from sklearn.metrics import r2_score

In [343]:
reg_pred = R.predict(X_test)

In [344]:
r2_score(y_test, reg_pred)

0.7225049638779686

In [345]:
lin_pred = lin.predict(X_test)

In [346]:
r2_score(y_test, lin_pred)

0.7283792474488415

RIDGE REGRESSION USING SGD REGRESSOR

In [351]:
from sklearn.linear_model import SGDRegressor

In [445]:
sgd = SGDRegressor(penalty = "l2", max_iter= 1000, alpha= 0.01, learning_rate= "optimal")


In [446]:
sgd.fit(X_train, y_train)

SGDRegressor(alpha=0.01, learning_rate='optimal')

In [447]:
R_grad = sgd.predict(X_test)

In [448]:
r2_score(y_test, R_grad)

0.7420923184371659