# Testing Models: Linear Regression #

In this series of notebooks, I apply various machine learning models to a data set of car and bike accidents. This notebook makes a first pass at linear regression models for the number of accidents.

In [8]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [9]:
# read in data set with categorical variables turned into dummy variables
df = pd.read_csv('data/cleaned_data/md_dum.csv')

In [10]:
# create X and y values for modeling
car_y = df.car_acc_score
car_X = df.drop(columns=['Unnamed: 0', 'car_acc_score', 'car_dens_score', 'bike_dens_score'])
bike_y = df.bike_acc_score
bike_X = df.drop(columns=['Unnamed: 0', 'bike_acc_score', 'car_dens_score', 'bike_dens_score'])

In [11]:
# split into train and test sets (70/30)
# both splits stratify on car_y with the same random_state, so they select the same rows
X_car_train, X_car_test, y_car_train, y_car_test = train_test_split(
    car_X, car_y, test_size=0.3, random_state=18, shuffle=True, stratify=car_y)
X_bike_train, X_bike_test, y_bike_train, y_bike_test = train_test_split(
    bike_X, bike_y, test_size=0.3, random_state=18, shuffle=True, stratify=car_y)

### Linear Regression ###

In [13]:
lr_car = LinearRegression()
lr_car.fit(X_car_train, y_car_train)
lr_car_score = lr_car.score(X_car_test, y_car_test)
print('Linear Regression Score on car accidents: {}'.format(lr_car_score))

lr_bike = LinearRegression()
lr_bike.fit(X_bike_train, y_bike_train)
lr_bike_score = lr_bike.score(X_bike_test, y_bike_test)
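print('Linear Regression Score on bike accidents: {}'.format(lr_bike_score))

Linear Regression Score on car accidents: 0.06329616419275474
Linear Regression Score on bike accidents: 0.06329616419275474


Not very good at all. (The bike line in this output repeats the car score because the run that produced it formatted `lr_car_score` in both print statements; the bike score itself was not recorded here.) Let's see if normalization helps.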
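
For context, the `score` method of scikit-learn regressors returns the coefficient of determination R², so a value near 0.06 means the model explains only about 6% of the variance in the target. Below is a minimal sketch of that computation in plain Python. The function name `r_squared` and the toy values are my own illustrations, not taken from the accident data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# toy values, not taken from the accident data
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # exact predictions
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than predicting the mean
print(perfect, baseline, worse)  # prints: 1.0 0.0 -3.0
```

A score of 0 corresponds to always predicting the mean of the evaluated targets, and the score turns negative when a model does worse than that.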

In [18]:
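# note: the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written
lr = LinearRegression(normalize=True)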
lr.fit(X_car_train, y_car_train)
lr_score = lr.score(X_car_test, y_car_test)
print('Linear Regression Score on car accidents with normalization: {}'.format(lr_score))

lr.fit(X_bike_train, y_bike_train)
lr_score = lr.score(X_bike_test, y_bike_test)
print('Linear Regression Score on bike accidents with normalization: {}'.format(lr_score))

Linear Regression Score on car accidents with normalization: 0.06315216507881061
Linear Regression Score on bike accidents with normalization: 0.0782305121586967


No, that didn't help either. Let's try some other regression methods.
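
This outcome is expected. Ordinary least squares has no penalty term, so rescaling a feature only rescales its coefficient and leaves the predictions, and therefore R², unchanged up to numerical effects. The single-feature sketch below illustrates this with made-up numbers; `fit_simple_ols` is a hypothetical helper, not part of scikit-learn.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# toy data, not taken from the accident data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_ols(x, y)
preds = [slope * a + intercept for a in x]

# rescale the feature by an arbitrary factor and refit
x_scaled = [a / 100.0 for a in x]
slope_s, intercept_s = fit_simple_ols(x_scaled, y)
preds_scaled = [slope_s * a + intercept_s for a in x_scaled]

# the coefficient absorbs the factor of 100; the predictions agree
max_diff = max(abs(p - q) for p, q in zip(preds, preds_scaled))
print(slope, slope_s, max_diff)
```

Scaling starts to matter only once a penalty on the coefficients is added, as in the Ridge, Lasso, and ElasticNet models.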

### Ridge Regression ###

In [19]:
ri = Ridge()
ri.fit(X_car_train, y_car_train)
ri_score = ri.score(X_car_test, y_car_test)
print('Ridge Regression Score on car accidents is: {}'.format(ri_score))

ri.fit(X_bike_train, y_bike_train)
ri_score = ri.score(X_bike_test, y_bike_test)
print('Ridge Regression Score on bike accidents is: {}'.format(ri_score))

Ridge Regression Score on car accidents is: 0.06339349364702906
Ridge Regression Score on bike accidents is: 0.07840052791046803


Ridge regression does not score meaningfully better than ordinary linear regression.
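
One likely reason the scores barely move: `Ridge` defaults to `alpha=1.0`, and its penalty `alpha * ||w||²` is added to the raw sum of squared errors, which grows with the number of rows. With many rows, a penalty of that size hardly changes the fit. The sketch below shows the effect for a single centered feature, where the ridge slope has the closed form `sxy / (sxx + alpha)`. The helper `ridge_slope` and the data are illustrative assumptions, not the notebook's data.

```python
def ridge_slope(x, y, alpha):
    """Single-feature ridge slope with an unpenalized intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + alpha)

# toy data, not taken from the accident data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# on five rows, increasing alpha visibly shrinks the slope toward zero
slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0.0, 1.0, 10.0, 100.0)}
print(slopes)

# on 1000 rows (the same pattern repeated), alpha=1.0 is almost invisible
x_big, y_big = x * 200, y * 200
slope_big = ridge_slope(x_big, y_big, alpha=1.0)
print(slope_big)
```

If ridge is worth revisiting, trying a range of much larger `alpha` values would be the natural next step.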

### Lasso ###

In [20]:
la = Lasso()
la.fit(X_car_train, y_car_train)
la_score = la.score(X_car_test, y_car_test)
print('Lasso Regression Score on car accidents is: {}'.format(la_score))

la.fit(X_bike_train, y_bike_train)
la_score = la.score(X_bike_test, y_bike_test)
print('Lasso Regression Score on bike accidents is: {}'.format(la_score))

Lasso Regression Score on car accidents is: 0.027179812707160966
Lasso Regression Score on bike accidents is: 0.0025571419372041326


That is actually worse. Finally, let's try ElasticNet, though since it just combines the two previous penalties, I don't expect it to do any better.
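
A plausible explanation for the drop: `Lasso` minimizes `(1 / (2 * n)) * RSS + alpha * ||w||₁` with a default `alpha=1.0`, and the L1 penalty sets a coefficient to exactly zero when its feature's signal is weaker than `alpha`. For a single centered feature this reduces to soft-thresholding the feature-target covariance, which the sketch below demonstrates on made-up data. The helpers `soft_threshold` and `lasso_slope` are hypothetical names of mine, not scikit-learn functions.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return exactly 0.0 when |value| <= alpha."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def lasso_slope(x, y, alpha):
    """Single-feature Lasso slope for the objective (1/(2n)) * RSS + alpha * |w|."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var = sum((a - mx) ** 2 for a in x) / n
    return soft_threshold(cov, alpha) / var

# toy data, not taken from the accident data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_strong = [2.1, 3.9, 6.2, 7.8, 10.1]   # strongly related to x
y_weak = [0.2, -0.1, 0.3, 0.0, 0.4]     # barely related to x

strong = lasso_slope(x, y_strong, alpha=1.0)  # shrunk, but still non-zero
weak = lasso_slope(x, y_weak, alpha=1.0)      # covariance below alpha, so the slope is zero
print(strong, weak)
```

When the features carry as little signal as the R² values here suggest, a default-strength L1 penalty can remove most or all of them.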

### ElasticNet ###

In [21]:
el = ElasticNet()
el.fit(X_car_train, y_car_train)
el_score = el.score(X_car_test, y_car_test)
print('ElasticNet Regression Score on car accidents is: {}'.format(el_score))

el.fit(X_bike_train, y_bike_train)
el_score = el.score(X_bike_test, y_bike_test)
print('ElasticNet Regression Score on bike accidents is: {}'.format(el_score))

ElasticNet Regression Score on car accidents is: 0.027937964552807393
ElasticNet Regression Score on bike accidents is: 0.018861221289405372


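Still no good. I wonder whether scaling the data will help appreciably. I already tried `normalize=True` for Linear Regression, which rescales the features in a similar spirit to standard scaling (it centers each feature and divides by its L2 norm rather than its standard deviation), and it didn't help much. I'll try standard scaling for the other methods here.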
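
Standard scaling means subtracting each feature's training mean and dividing by its training standard deviation, then applying those same training statistics to the test set. The plain-Python sketch below shows the idea for one column. The helpers `fit_scaler` and `transform` are illustrative names of mine, and the values are made up.

```python
import math

def fit_scaler(column):
    """Return the mean and population standard deviation (ddof=0) of a column."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return mean, std

def transform(column, mean, std):
    """Scale a column using previously fitted statistics."""
    return [(v - mean) / std for v in column]

# toy values, not taken from the accident data
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)                # statistics come from the training data only
train_scaled = transform(train, mean, std)   # mean 0 and variance 1 by construction
test_scaled = transform(test, mean, std)     # test data reuses the training statistics
print(train_scaled, test_scaled)
```

Wrapping the scaler and the model in a `Pipeline` handles this train-only fitting automatically.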

### Scaling the Data ###

In [23]:
scaler = StandardScaler()

def run_pipeline(steps, X_train, X_test, y_train, y_test):
    """Fit a pipeline built from `steps` on the training data and return its test-set score."""
    pipeline = Pipeline(steps)
    pipeline.fit(X_train, y_train)
    return pipeline.score(X_test, y_test)

steps = [('scaler', scaler), ('ridge_regression', ri)]
score = run_pipeline(steps, X_car_train, X_car_test, y_car_train, y_car_test)
print('Ridge Regression Score on car accidents with scaling is {}'.format(score))
score = run_pipeline(steps, X_bike_train, X_bike_test, y_bike_train, y_bike_test)
print('Ridge Regression Score on bike accidents with scaling is {}'.format(score))

steps = [('scaler', scaler), ('lasso_regression', la)]
score = run_pipeline(steps, X_car_train, X_car_test, y_car_train, y_car_test)
print('Lasso Regression Score on car accidents with scaling is {}'.format(score))
score = run_pipeline(steps, X_bike_train, X_bike_test, y_bike_train, y_bike_test)
print('Lasso Regression Score on bike accidents with scaling is {}'.format(score))

steps = [('scaler', scaler), ('elasticnet_regression', el)]
score = run_pipeline(steps, X_car_train, X_car_test, y_car_train, y_car_test)
print('ElasticNet Regression Score on car accidents with scaling is {}'.format(score))
score = run_pipeline(steps, X_bike_train, X_bike_test, y_bike_train, y_bike_test)
print('ElasticNet Regression Score on bike accidents with scaling is {}'.format(score))

Ridge Regression Score on car accidents with scaling is 0.06330043865227453
Ridge Regression Score on bike accidents with scaling is 0.07836707508206453
Lasso Regression Score on car accidents with scaling is 0.021570047248237745
Lasso Regression Score on bike accidents with scaling is -1.9331857254334395e-05
ElasticNet Regression Score on car accidents with scaling is 0.03812427226124482
ElasticNet Regression Score on bike accidents with scaling is -1.9331857254334395e-05


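None of these regression models has yielded good results. The identical, slightly negative bike scores for Lasso and ElasticNet with scaling are consistent with both models shrinking every coefficient to zero and predicting only the training mean. Next I will change this to a classification problem and try to apply Logistic Regression. [Go>>](Testing%20Models%20-%20Logistic%20Regression.ipynb)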