Linear, Lasso, and Ridge Regression

Validating Machine Learning Models 
Step 1 - Loading the required libraries and modules.
Step 2 - Reading the data and performing basic data checks.
Step 3 - Creating arrays for the features and the response variable.
Step 4 - Trying out different model validation techniques.



Step 1 - Loading the Required Libraries and Modules

In [21]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Import the scikit-learn modules used below
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split
from sklearn.preprocessing import MinMaxScaler
from math import sqrt




Step 2 - Reading the Data and Performing Basic Data Checks

In [22]:
df = pd.read_csv("houses_train.csv")
df.head()

index,Unnamed: 0,price,condition,district,max_floor,street,num_rooms,region,area,url,num_bathrooms,building_type,floor,ceiling_height
0,5546,130000.0,newly repaired,Center,4,Sayat Nova Ave,3,Yerevan,96.0,http://www.myrealty.am/en/item/28244/3-senyaka...,1,stone,3,3.2
1,2979,65000.0,good,Arabkir,5,Hr.Kochar St,3,Yerevan,78.0,http://www.myrealty.am/en/item/18029/3-senyaka...,1,stone,2,2.8
2,2698,129000.0,good,Center,10,M.Khorenatsi St,3,Yerevan,90.0,http://www.myrealty.am/en/item/37797/3-senyaka...,1,panel,3,2.8
3,4548,52000.0,newly repaired,Center,14,Argishti St,2,Yerevan,53.0,http://www.myrealty.am/en/item/36153/2-senyaka...,1,monolit,5,3.0
4,2982,65000.0,newly repaired,Center,12,Mashtots Ave,2,Yerevan,47.0,http://www.myrealty.am/en/item/17566/2-senyaka...,1,panel,3,2.8


In [23]:
print("\n")
print(df.shape)
df.describe().transpose()



(5000, 14)


column,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,5000.0,3445.4724,1905.511742,0.0,1818.75,3433.5,5072.25,6812.0
price,5000.0,85660.0866,51328.921854,18500.0,50000.0,70000.0,105000.0,550000.0
max_floor,5000.0,8.6976,4.148349,1.0,5.0,9.0,11.0,23.0
num_rooms,5000.0,2.6908,0.822758,1.0,2.0,3.0,3.0,6.0
area,5000.0,81.5334,24.715806,27.0,65.0,80.0,97.0,149.0
num_bathrooms,5000.0,1.1662,0.40867,1.0,1.0,1.0,1.0,4.0
floor,5000.0,5.1666,3.395578,0.0,3.0,4.0,7.0,22.0
ceiling_height,5000.0,2.89476,0.144861,2.6,2.8,2.8,3.0,3.8


In [24]:
# Checking Null values
df.isnull().sum()*100/df.shape[0]

Unnamed: 0        0.0
price             0.0
condition         0.0
district          0.0
max_floor         0.0
street            0.0
num_rooms         0.0
region            0.0
area              0.0
url               0.0
num_bathrooms     0.0
building_type     0.0
floor             0.0
ceiling_height    0.0
dtype: float64

In [25]:
# Keep only the float64 columns. Note that this drops integer-typed
# columns as well as the text columns.
non_floats = [col for col in df if df[col].dtype != "float64"]
df = df.drop(columns=non_floats)
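
The filter above keeps a column only if its dtype is exactly float64. Judging by the df.head() output, that probably leaves price, area, and ceiling_height, while integer columns such as num_rooms are discarded along with the text columns. A minimal sketch of the same filter on a made-up toy frame (the values are illustrative, not taken from the dataset), using pandas' select_dtypes:

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset above.
toy = pd.DataFrame({
    "price": [130000.0, 65000.0],        # float64, kept
    "num_rooms": [3, 3],                 # int64, dropped
    "district": ["Center", "Arabkir"],   # object, dropped
    "area": [96.0, 78.0],                # float64, kept
})

# Keep only the float64 columns, which is what the loop above does.
floats_only = toy.select_dtypes(include="float64")
print(list(floats_only.columns))
```

If the integer features are meant to be used as predictors, select_dtypes(include="number") would keep them, although the "Unnamed: 0" index column would then need to be dropped explicitly.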

Step 3 - Creating Arrays for the Features and the Response Variable

In [26]:
x1 = df.drop('price', axis=1).values 
y1 = df['price'].values

Step 4 - Trying out Different Model Validation Techniques

Holdout Validation Approach - Train and Test Set Split

In [27]:
# Evaluate using a train and a test set
#    LinearRegression
#    y = a1*x1 + a2*x2 + a3*x3 + ... + an*xn + b
# Where the following is true:
#  y is the target variable.
#  x1, x2, x3, ..., xn are the features.
#  a1, a2, a3, ..., an are the coefficients.
#  b is the intercept.

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x1, y1, test_size=0.30, random_state=1)
model = LinearRegression()
model.fit(X_train, Y_train)
# For a regressor, score() returns R^2 (not classification accuracy),
# so the "Accuracy" figure printed below is R^2 expressed as a percentage.
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))

Accuracy: 44.45%
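
The 44.45% above is the test-set R^2, which is why it matches the test r2_score reported in the next cell. A small sketch on synthetic data (the data and variable names here are assumptions for illustration only) showing that LinearRegression.score and r2_score agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
reg = LinearRegression().fit(X_tr, y_tr)

# score() on a regressor is R^2 computed on the data passed in.
score = reg.score(X_te, y_te)
r2 = r2_score(y_te, reg.predict(X_te))
print(np.isclose(score, r2))
```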


In [31]:
# R^2 (coefficient of determination) regression score function.
# The best possible score is 1.0, and it can be negative.
# RMSE = sqrt(mean_squared_error); the values labeled "mean_squared_error" below are RMSE.
pred_train_model= model.predict(X_train)
print('train_model mean_squared_error:', np.sqrt(mean_squared_error(Y_train,pred_train_model)))
print('train_model r2_score:', r2_score(Y_train, pred_train_model))

pred_test_model= model.predict(X_test)
print('test_model mean_squared_error:',np.sqrt(mean_squared_error(Y_test,pred_test_model))) 
print('test_model r2_score:',r2_score(Y_test, pred_test_model))

train_model mean_squared_error: 38561.678308174465
train_model r2_score: 0.44654895126662675
test_model mean_squared_error: 37342.165989360175
test_model r2_score: 0.4444701543263193
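
Because the printed values are square roots of the MSE, they are in the same units as price (roughly 37,000 to 38,500 here) rather than in squared units. A minimal sketch of the difference between MSE and RMSE on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions for illustration.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Squared errors are 100, 100, and 900, so the MSE is 1100 / 3.
mse = mean_squared_error(y_true, y_pred)
# RMSE is the square root of the MSE and is in the units of the target.
rmse = np.sqrt(mse)
print("MSE:", mse)
print("RMSE:", rmse)
```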


Ridge Regression
Loss function = OLS loss (sum of squared residuals) + alpha * sum of squared coefficient values

In [35]:
rr = Ridge(alpha=0.01)
rr.fit(X_train, Y_train) 
pred_train_rr= rr.predict(X_train)
print('train_model mean_squared_error:', np.sqrt(mean_squared_error(Y_train,pred_train_rr)))
print('train_model r2_score:', r2_score(Y_train, pred_train_rr))

pred_test_rr= rr.predict(X_test)
print('test_model mean_squared_error:', np.sqrt(mean_squared_error(Y_test,pred_test_rr))) 
print('test_model r2_score:', r2_score(Y_test, pred_test_rr))

train_model mean_squared_error: 38561.67835117813
train_model r2_score: 0.4465489500322186
test_model mean_squared_error: 37342.185715820466
test_model r2_score: 0.44446956739516275


Lasso Regression
Loss function = OLS loss (sum of squared residuals) + alpha * sum of absolute values of the coefficients

In [37]:
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(X_train, Y_train) 
pred_train_lasso= model_lasso.predict(X_train)
print('train_model mean_squared_error:', np.sqrt(mean_squared_error(Y_train,pred_train_lasso)))
print('train_model r2_score:', r2_score(Y_train, pred_train_lasso))

pred_test_lasso= model_lasso.predict(X_test)
print('test_model mean_squared_error:', np.sqrt(mean_squared_error(Y_test,pred_test_lasso))) 
print('test_model r2_score:', r2_score(Y_test, pred_test_lasso))

train_model mean_squared_error: 38561.67830823798
train_model r2_score: 0.44654895126480365
test_model mean_squared_error: 37342.166748103344
test_model r2_score: 0.4444701317510632


K-fold Cross-Validation

In [18]:
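# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here.
kfold = model_selection.KFold(n_splits=10)
model_kfold = LinearRegression()
# cross_val_score uses the estimator's default scorer, which is R^2 for regressors.
results_kfold = model_selection.cross_val_score(model_kfold, x1, y1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))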

Accuracy: 44.10%
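
The 44.10% above is the mean R^2 over the ten folds, taken in file order without shuffling. If the rows may be ordered in some meaningful way, shuffling before splitting is a common precaution. A sketch on synthetic data (the data is an assumption for illustration) with shuffle=True, where random_state does take effect, and with the scorer named explicitly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative only, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# shuffle=True makes random_state meaningful and the split reproducible.
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

# Reporting the spread alongside the mean shows how stable the estimate is.
print("Mean R^2: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```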




Leave One Out Cross-Validation (LOOCV)

In [10]:
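loocv = model_selection.LeaveOneOut()
model_loocv = LinearRegression()
# Each LOOCV test fold holds a single sample, for which the default R^2 scorer
# is undefined; every fold therefore scores nan, and so does the mean below.
results_loocv = model_selection.cross_val_score(model_loocv, x1, y1, cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))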

Accuracy: nan%
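
The nan result arises because R^2 cannot be computed on a single held-out sample. One way to get a usable LOOCV estimate is to score each fold with an error metric that is defined for one sample, such as squared error, and aggregate afterwards. A sketch on a small synthetic dataset (the data is an assumption for illustration; on the 5,000-row housing data this would fit 5,000 models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=30)

# neg_mean_squared_error is well defined on a single test sample.
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)

# Scores are negated MSE values, so flip the sign before taking the root.
loocv_rmse = np.sqrt(-scores.mean())
print("LOOCV RMSE: %.3f" % loocv_rmse)
```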


