# Challenge: Validating a linear regression

Your goal is to achieve a model with a consistent R2 and only statistically significant parameters across multiple samples.

We'll use the property crime model you've been working on, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website.

Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made, then submit it for review with your mentor.

In [410]:
#importing modules and potential modules
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import nltk
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.model_selection import cross_val_score

In [411]:
#importing the NY crime dataset
data = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/NYCcrimedataframemlr.csv')

In [412]:
#changing column names
data_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data.columns = data_cols
data.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [413]:

#set X and Y
X = data[['Population','Murder','Robbery']]
Y = data['Property'].values.reshape(-1, 1)

In [414]:
#linear regression

# The model is Property ~ Population + Murder + Robbery:
# dependent variable on the left of the ~, independent variables on the right.
# regr is left unfitted for cross-validation; rfit is fitted on the full dataset.
regr = linear_model.LinearRegression()

rfit = linear_model.LinearRegression().fit(X, Y)
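The challenge asks for statistically significant parameters, but scikit-learn's `LinearRegression` does not report p-values. The following is a minimal sketch of how they could be computed from an ordinary least squares fit with NumPy and SciPy. The helper name `ols_pvalues` and the synthetic data are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from scipy import stats


def ols_pvalues(X, y):
    """Return OLS coefficients (intercept first) and two-sided p-values.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]  # n observations minus n parameters
    sigma2 = resid @ resid / dof             # unbiased residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))               # standard error of each coefficient
    t_stats = beta / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return beta, p_values


# Synthetic stand-in data: only the first feature actually drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + rng.normal(size=200)

beta_demo, p_demo = ols_pvalues(X_demo, y_demo)
print("coefficients:", beta_demo.round(2))
print("p-values:", p_demo.round(4))
```

On the real data, the same function could be called as `ols_pvalues(X, Y)`. A parameter is conventionally treated as significant when its p-value falls below a chosen threshold such as 0.05.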

In [415]:
#cross-val score
scores = cross_val_score(regr, X, Y, cv=5)
print(scores)

[0.93337636 0.54331535 0.2718115  0.83783696 0.91156553]


In [416]:
print("Mean R2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Mean R2: 0.70 (+/- 0.51)
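The fold scores above range widely. For a regressor, `cross_val_score` with an integer `cv` uses `KFold` without shuffling, so each fold is a contiguous block of rows in file order. If the rows are ordered in some meaningful way (an assumption, not something verified here), one fold can end up holding unusual cities. A minimal sketch of cross-validation with shuffled folds, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; replace X_demo / y_demo with the real X / Y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# shuffle=True assigns rows to folds at random; random_state makes it repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print(shuffled_scores)
print("Mean R2: %0.2f (+/- %0.2f)" % (shuffled_scores.mean(), shuffled_scores.std() * 2))
```

Shuffling does not remove the influence of outliers, but it stops the fold scores from depending on row order.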


In [417]:
#train test split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

#shape
X_train.shape, y_train.shape

X_test.shape, y_test.shape


((174, 3), (174, 1))

In [418]:
#fit
rfit.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [419]:
#y_pred

y_pred=rfit.predict(X_test)
print('R2 of linear regression on test set: {:.2f}'.format(rfit.score(X_test, y_test)))

R2 of linear regression on test set: 0.89


In [420]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y_test, y_pred, multioutput='uniform_average')

0.889788019714

In [421]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average')

408.91436400444326

In [422]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

12858640.86380284

In [423]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y_test, y_pred)

57.869654197261745

In [424]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.8892380101586287

In [425]:
#print statements for all scores are collected in the Summary section
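The five metrics computed above can be gathered in one place instead of one cell each. A minimal sketch follows; the function name `regression_report` and the toy values are hypothetical, chosen only for illustration.

```python
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return the regression metrics used in this notebook as a dict."""
    return {
        "Explained Variance Score": explained_variance_score(y_true, y_pred),
        "Mean Absolute Error": mean_absolute_error(y_true, y_pred),
        "Mean Squared Error": mean_squared_error(y_true, y_pred),
        "Median Absolute Error": median_absolute_error(y_true, y_pred),
        "R2 Score": r2_score(y_true, y_pred),
    }


# Toy values standing in for y_test and y_pred.
demo_report = regression_report([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in demo_report.items():
    print("The {} is: {}".format(name, round(value, 2)))
```

With the notebook's variables, the call would be `regression_report(y_test, y_pred)` for New York and `regression_report(y2_test, y2_pred)` for Georgia.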

In [426]:
#importing the GA dataset to compare
#note: same data cleaning will be done, followed by the reg, etc. 

data2 = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/GACcrimedataframemlr.csv')

In [427]:
#changing column names
data2_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data2.columns = data2_cols
data2.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [428]:
#Robbery and Murder are currently continuous variables. 
#For this model, please use these variables to create categorical 
#features where values greater than 0 are coded 1, 
#and values equal to 0 are coded 0.

#categorical feature
onecat = 1

In [429]:
#Robbery
data2['Robbery'] = np.where(data2['Robbery'] < onecat, 0, 1)

#Murder
data2['Murder'] = np.where(data2['Murder'] < onecat, 0, 1)
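An equivalent way to write the recoding is a boolean comparison cast to integers. The sketch below uses a small made-up frame rather than the Georgia data.

```python
import pandas as pd

# Made-up counts for illustration only.
toy = pd.DataFrame({"Robbery": [0, 3, 12, 0], "Murder": [0, 0, 1, 2]})

# Values greater than 0 become 1; values equal to 0 stay 0.
for col in ["Robbery", "Murder"]:
    toy[col] = (toy[col] > 0).astype(int)

print(toy)
```

The two forms differ on missing values: `np.where(x < 1, 0, 1)` codes `NaN` as 1 because the comparison is false, while `(x > 0).astype(int)` codes it as 0. It may be worth checking for missing values before relying on either.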

In [430]:

#set X2 and Y2
X2 = data2[['Population','Murder','Robbery']]
Y2 = data2['Property'].values.reshape(-1, 1)

In [431]:
#linear regression

# The model is Property ~ Population + Murder + Robbery:
# dependent variable on the left of the ~, independent variables on the right.
# regr2 is left unfitted for cross-validation; rfit2 is fitted on the full dataset.
regr2 = linear_model.LinearRegression()

rfit2 = linear_model.LinearRegression().fit(X2, Y2)

In [432]:
#cross-val score
scores2 = cross_val_score(regr2, X2, Y2, cv=5)
print(scores2)

[ 0.89945549  0.91656818  0.57613321 -0.07378674  0.71079632]


In [433]:
print("Mean R2: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))

Mean R2: 0.61 (+/- 0.72)


In [434]:
#train test split

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y2, test_size=0.5, random_state=0)

#shape
X2_train.shape, y2_train.shape

X2_test.shape, y2_test.shape


((127, 3), (127, 1))

In [435]:
#fit
rfit2.fit(X2_train, y2_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [436]:
#y2_pred

y2_pred=rfit2.predict(X2_test)
print('R2 of linear regression on test set: {:.2f}'.format(rfit2.score(X2_test, y2_test)))

R2 of linear regression on test set: 0.76


In [437]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y2_test, y2_pred, multioutput='uniform_average')

0.7650940720494886

In [438]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average')

338.2675202864556

In [439]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y2_test, y2_pred)

1836420.8078747469

In [440]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y2_test, y2_pred)

60.66255795802522

In [441]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y2_test, y2_pred)


0.7623545432138921
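The cells above fit a separate model to each state. Another way to read the challenge is to fit the model on one state and score it, unchanged, on the other. The sketch below shows that pattern on synthetic frames; `make_demo_frame`, its coefficients, and the frame names are assumptions for illustration and do not reflect the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Population", "Murder", "Robbery"]


def make_demo_frame(n_rows, seed):
    """Build a synthetic frame with the same columns the model uses."""
    rng = np.random.default_rng(seed)
    population = rng.integers(500, 50_000, size=n_rows)
    murder = rng.integers(0, 2, size=n_rows)    # already coded 0/1
    robbery = rng.integers(0, 2, size=n_rows)   # already coded 0/1
    prop = (0.02 * population + 30 * murder + 50 * robbery
            + rng.normal(scale=40, size=n_rows))
    return pd.DataFrame({"Population": population, "Murder": murder,
                         "Robbery": robbery, "Property": prop})


train_demo = make_demo_frame(300, seed=0)     # stands in for the training state
holdout_demo = make_demo_frame(250, seed=1)   # stands in for the other state

# Fit once on the training frame, then score on both frames without refitting.
model = LinearRegression().fit(train_demo[features], train_demo["Property"])
r2_in_sample = model.score(train_demo[features], train_demo["Property"])
r2_holdout = model.score(holdout_demo[features], holdout_demo["Property"])

print("In-sample R2: {:.2f}".format(r2_in_sample))
print("Holdout R2:   {:.2f}".format(r2_holdout))
```

This comparison is only meaningful when both frames are prepared identically. In this notebook the 0/1 recoding of `Murder` and `Robbery` is shown only for the Georgia data, so the New York features would need to be confirmed as coded the same way first.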

## Summary

### New York Dataset

In [456]:
print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y_test, y_pred, multioutput='uniform_average'), 2))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average'), 2))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y_test, y_pred), 2))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y_test, y_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y_test, y_pred), 2))

The Explained Variance Score of the NY dataset is: 0.89
The Mean Absolute Error of the NY dataset is: 408.91
The Mean Squared Error of the NY dataset is: 12858640.86
The Median Absolute Error of the NY dataset is: 57.87
The R2 Score of the NY dataset is: 0.89


### Georgia Dataset

In [454]:
print("The Explained Variance Score of the GA dataset is:", round(explained_variance_score(y2_test, y2_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the GA dataset is:", round(mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the GA dataset is:", round(mean_squared_error(y2_test, y2_pred), 2 ))
print("The Median Absolute Error of the GA dataset is:", round(median_absolute_error(y2_test, y2_pred), 2))
print("The R2 Score of the GA dataset is:", round(r2_score(y2_test, y2_pred), 2))

The Explained Variance Score of the GA dataset is: 0.77
The Mean Absolute Error of the GA dataset is: 338.27
The Mean Squared Error of the GA dataset is: 1836420.81
The Median Absolute Error of the GA dataset is: 60.66
The R2 Score of the GA dataset is: 0.76
