# Challenge: Validating a linear regression

Your goal is to achieve a model with a consistent R2 and only statistically significant parameters across multiple samples.

We'll use the property crime model you've been working with, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website.

Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.
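
As a rough illustration of the "set of folds" option, the sketch below compares an "old" and a "new" feature set on the same five folds with `cross_val_score`. The data is synthetic and every variable name (`population`, `robbery`, `noise_feature`, `property_crime`) is an assumption made for the example, not a column from the FBI:UCR files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a crime dataset (NOT the FBI:UCR data).
rng = np.random.default_rng(0)
n = 300
population = rng.uniform(1_000, 100_000, size=n)
robbery = rng.poisson(population / 5_000)
noise_feature = rng.normal(size=n)  # unrelated to the outcome
property_crime = 0.02 * population + 5 * robbery + rng.normal(0, 100, size=n)

X_old = np.column_stack([population, robbery, noise_feature])
X_new = np.column_stack([population, robbery])

# Reuse the same shuffled folds for both models so the comparison is like for like.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores_old = cross_val_score(LinearRegression(), X_old, property_crime, cv=folds, scoring="r2")
scores_new = cross_val_score(LinearRegression(), X_new, property_crime, cv=folds, scoring="r2")

print("old model R2 per fold:", np.round(scores_old, 3))
print("new model R2 per fold:", np.round(scores_new, 3))
print("old spread (max - min):", round(scores_old.max() - scores_old.min(), 3))
print("new spread (max - min):", round(scores_new.max() - scores_new.min(), 3))
```

A small spread between folds is one reasonable reading of "consistent R2"; how small is small enough is a judgment call.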

In [1136]:
#importing modules and potential modules
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import nltk
import math
import warnings
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.model_selection import cross_val_score
from sklearn import metrics  
from scipy import stats
# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd")

In [1137]:
#importing the NY crime dataset
data = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/NYCcrimedataframemlr.csv')

In [1138]:
#changing column names
data_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data.columns = data_cols
data.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [1139]:
data.dtypes

City                    object
Population               int64
Violent_Crime            int64
Murder                   int64
Rape1                  float64
Rape2                    int64
Robbery                  int64
Aggravated_Assault       int64
Property                 int64
Burglary                 int64
Larceny_Theft            int64
Motor_Vehicle_Theft      int64
Arson3                 float64
dtype: object

In [1140]:
#winsorizing data to "fix" anomalous values
#limits = 0.01 clips the lowest and highest 1% of each numeric column
for col in data_cols[1:]:
    data[col] = scipy.stats.mstats.winsorize(data[col], limits = 0.01)
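
To make the effect of `limits = 0.01` concrete, here is a minimal sketch on a toy array of 100 values with one extreme entry. The array is made up for illustration; with 100 values, 1% per tail works out to one value at each end.

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.arange(1, 101, dtype=float)  # 1, 2, ..., 100
values[-1] = 10_000                      # one extreme value standing in for a very large city

# A single float applies the same limit to both tails.
clipped = winsorize(values, limits=0.01)

print("before:", values.min(), values.max())
print("after: ", clipped.min(), clipped.max())
```

The extreme entry is not dropped; it is replaced by the nearest value that survives the cut, so the row count stays the same.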

In [1141]:
data.sort_values('Population', inplace=True, ascending=False)
print(data.head(n=3))

        City  Population  Violent_Crime  Murder  Rape1  Rape2  Robbery  \
0   New York      199134           1192      21    NaN     75      400   
1    Buffalo      199134           1192      21    NaN     75      400   
2  Rochester      199134           1192      21    NaN     75      400   

   Aggravated_Assault  Property  Burglary  Larceny_Theft  Motor_Vehicle_Theft  \
0                 696      6473      1781           4298                  394   
1                 696      6473      1781           4298                  394   
2                 696      6473      1781           4298                  394   

   Arson3  
0     NaN  
1     NaN  
2   132.0  


In [1142]:

#set X and Y
X = data[['Population','Murder','Robbery']]
Y = data['Property']

In [1143]:
#train test split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

#shape
X_train.shape, y_train.shape

X_test.shape, y_test.shape


((174, 3), (174,))

In [1144]:
# Fit an ordinary least squares linear regression on the training half.
# (scikit-learn takes the feature matrix and target directly; no formula string is needed.)
regr = linear_model.LinearRegression()

rfit = regr.fit(X_train, y_train)
rfit

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1145]:
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
# Note: this scores the full dataset (training and test halves together).
print(regr.score(X, Y))


Coefficients: 
 [ 1.58918239e-02  2.01231521e+02 -1.82430872e+00]

Intercept: 
 44.47387385248015

R-squared:
0.8832975478217796
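
scikit-learn's `LinearRegression` reports coefficients but not their significance, which the challenge asks about. The sketch below computes OLS standard errors, t-statistics, and two-sided p-values by hand with NumPy and SciPy on synthetic data; the feature names `signal` and `noise` are placeholders, and this is an illustration rather than the analysis run in this notebook. The `statsmodels` OLS summary reports the same quantities.

```python
import numpy as np
from scipy import stats

# Synthetic data: one feature that drives y and one that does not.
rng = np.random.default_rng(1)
n = 200
signal = rng.normal(size=n)
noise = rng.normal(size=n)  # has no effect on y
y = 3.0 + 2.0 * signal + rng.normal(size=n)

# Design matrix with an explicit intercept column.
X_design = np.column_stack([np.ones(n), signal, noise])

# OLS coefficients, residual variance, and coefficient covariance.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
resid = y - X_design @ beta
dof = n - X_design.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)

std_err = np.sqrt(np.diag(cov))
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided test of coef == 0

for name, b, p in zip(["Intercept", "signal", "noise"], beta, p_values):
    print(f"{name:>9}: coef={b: .3f}  p={p:.4f}")
```

A parameter whose p-value stays above the chosen threshold across samples is a candidate for removal in the revised model.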


In [1146]:
#test it: score() returns R-squared for a regressor (not classification accuracy)
accuracy = regr.score(X_test, y_test)
accuracy

0.8144104514642367

In [1147]:
#y_pred

y_pred=rfit.predict(X_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit.score(X_test, y_test)))

R-squared of the linear regression on the test set: 0.81


In [1148]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y_test, y_pred, multioutput='uniform_average')

0.8147753489634106

In [1149]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average')

164.02400659943194

In [1150]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

95749.3672671282

In [1151]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y_test, y_pred)

74.27122262808322

In [1152]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.8144104514642367
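
The value above matches `regr.score(X_test, y_test)` because both compute R² = 1 − SS_res / SS_tot. The short worked example below checks that identity on four made-up numbers and also derives the mean squared error from the same residuals. Note that MSE is in squared units, so its square root (about 309 for the NY test set) is the figure comparable to the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen so the arithmetic is easy to follow by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)
rmse_manual = np.sqrt(mse_manual)

print("R2  manual vs sklearn:", r2_manual, r2_score(y_true, y_hat))
print("MSE manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_hat))
print("RMSE:", round(rmse_manual, 3))
```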

In [1153]:
#TODO: print all scores together (see the Summary section)
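
One possible way to print every score in one place is a small helper that returns them as a dictionary. `regression_report` is a hypothetical name introduced here, not a scikit-learn function, and the toy arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, median_absolute_error, r2_score)


def regression_report(y_true, y_pred):
    """Return the five scores used in this notebook as a dict (hypothetical helper)."""
    return {
        "explained_variance": explained_variance_score(y_true, y_pred),
        "mean_absolute_error": mean_absolute_error(y_true, y_pred),
        "mean_squared_error": mean_squared_error(y_true, y_pred),
        "median_absolute_error": median_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }


# Toy values only; in the notebook the call would be regression_report(y_test, y_pred).
report = regression_report(np.array([10.0, 20.0, 30.0, 40.0]),
                           np.array([12.0, 18.0, 33.0, 39.0]))
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Calling the same helper on each dataset's test split keeps the metric list identical across states.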

In [1154]:
#importing the GA dataset to compare
#note: same data cleaning will be done, followed by the reg, etc. 

data2 = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/GACcrimedataframemlr.csv')

In [1155]:
#changing column names
data2_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data2.columns = data2_cols
data2.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [1156]:
#winsorize
#limits = 0.01 clips the lowest and highest 1% of each numeric column
for col in data2_cols[1:]:
    data2[col] = scipy.stats.mstats.winsorize(data2[col], limits = 0.01)


In [1157]:
#Robbery and Murder are currently continuous variables. 
#For this model, please use these variables to create categorical 
#features where values greater than 0 are coded 1, 
#and values equal to 0 are coded 0.

#categorical feature
onecat = 1

In [None]:
#Robbery
data2['Robbery'] = np.where(data2['Robbery'] < onecat, 0, 1)

#Murder
data2['Murder'] = np.where(data2['Murder'] < onecat, 0, 1)
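
For non-negative counts, the `np.where(... < 1, 0, 1)` pattern above gives the same result as a direct `> 0` comparison. The sketch below shows this on a toy frame, and also shows where the two differ: a missing value compares as False either way, so the `np.where` form codes it as 1 while the `> 0` form codes it as 0. The toy values are made up; the `Robbery` and `Murder` columns in this notebook are integer typed, so the difference should not arise here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Robbery": [0, 3, 0, 12]})

via_where = np.where(toy["Robbery"] < 1, 0, 1)   # pattern used in the cell above
via_compare = (toy["Robbery"] > 0).astype(int)   # equivalent for non-negative counts

print(via_where.tolist())
print(via_compare.tolist())

# With a missing value the two patterns disagree, because NaN comparisons are False.
with_gap = pd.Series([0.0, 2.0, np.nan])
gap_where = np.where(with_gap < 1, 0, 1)
gap_compare = (with_gap > 0).astype(int)

print(gap_where.tolist())
print(gap_compare.tolist())
```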

In [1158]:

#set X2 and Y2
X2 = data2[['Population','Murder','Robbery']]
Y2 = data2['Property']

In [1159]:
#train test split

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y2, test_size=0.5, random_state=0)

#shape
X2_train.shape, y2_train.shape

X2_test.shape, y2_test.shape


((127, 3), (127,))

In [1160]:
# Fit an ordinary least squares linear regression on the GA training half.
# (scikit-learn takes the feature matrix and target directly; no formula string is needed.)
regr2 = linear_model.LinearRegression()

rfit2 = regr2.fit(X2_train, y2_train)
rfit2

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1161]:
# Inspect the results.
print('\nCoefficients: \n', regr2.coef_)
print('\nIntercept: \n', regr2.intercept_)
print('\nR-squared:')
# Note: X and Y are the full NY dataset, so this scores the GA-trained model on NY data.
print(regr2.score(X,Y))


Coefficients: 
 [ 1.55050575e-02 -3.81625480e+01  1.49365071e+01]

Intercept: 
 102.91707523369007

R-squared:
0.6302126897537156
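
The score above comes from `regr2.score(X,Y)`, that is, a model fit on one state and scored on another. The sketch below illustrates that kind of cross-sample check on synthetic data, where the second "state" follows a different population-to-crime slope. `make_state` and its parameters are hypothetical and only exist for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def make_state(seed, slope, n=200):
    """Generate a synthetic 'state' (hypothetical helper, not FBI:UCR data)."""
    rng = np.random.default_rng(seed)
    population = rng.uniform(1_000, 100_000, size=n)
    property_crime = slope * population + rng.normal(0, 150, size=n)
    return population.reshape(-1, 1), property_crime


X_a, y_a = make_state(seed=0, slope=0.020)
X_b, y_b = make_state(seed=1, slope=0.030)  # different underlying relationship

model_a = LinearRegression().fit(X_a, y_a)

r2_within = model_a.score(X_a, y_a)  # the sample the model was fit on
r2_across = model_a.score(X_b, y_b)  # a sample the model never saw

print("R2 on own state:  ", round(r2_within, 3))
print("R2 on other state:", round(r2_across, 3))
```

A large drop from the first number to the second suggests the fitted relationship does not transfer, which is the signal this validation step is looking for.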


In [1162]:
#test it: score() returns R-squared for a regressor (not classification accuracy)
accuracy2 = regr2.score(X2_test, y2_test)
accuracy2

0.9502826276063441

In [1163]:
#y2_pred

y2_pred=rfit2.predict(X2_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit2.score(X2_test, y2_test)))

R-squared of the linear regression on the test set: 0.95


In [1164]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y2_test, y2_pred, multioutput='uniform_average')

0.9503514452752404

In [1165]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average')

164.41408473069006

In [1166]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y2_test, y2_pred)

94562.38423724392

In [1167]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y2_test, y2_pred)

106.07243063272134

In [1168]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y2_test, y2_pred)


0.9502826276063441

## Summary

### New York Dataset

In [1169]:
print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y_test, y_pred, multioutput='uniform_average'), 2))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average'), 2))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y_test, y_pred), 2))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y_test, y_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y_test, y_pred), 2))


The Explained Variance Score of the NY dataset is: 0.81
The Mean Absolute Error of the NY dataset is: 164.02
The Mean Squared Error of the NY dataset is: 95749.37
The Median Absolute Error of the NY dataset is: 74.27
The R2 Score of the NY dataset is: 0.81


### Georgia Dataset

In [1170]:
print("The Explained Variance Score of the GA dataset is:", round(explained_variance_score(y2_test, y2_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the GA dataset is:", round(mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the GA dataset is:", round(mean_squared_error(y2_test, y2_pred), 2 ))
print("The Median Absolute Error of the GA dataset is:", round(median_absolute_error(y2_test, y2_pred), 2))
print("The R2 Score of the GA dataset is:", round(r2_score(y2_test, y2_pred), 2))


The Explained Variance Score of the GA dataset is: 0.95
The Mean Absolute Error of the GA dataset is: 164.41
The Mean Squared Error of the GA dataset is: 94562.38
The Median Absolute Error of the GA dataset is: 106.07
The R2 Score of the GA dataset is: 0.95


In [1171]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df.head(10)

| | Actual | Predicted |
| --- | --- | --- |
| 6 | 4090 | 2796.907464 |
| 52 | 1040 | 1231.030472 |
| 269 | 29 | 86.793801 |
| 45 | 1140 | 460.53737 |
| 294 | 73 | 78.100973 |
| 189 | 23 | 148.247484 |
| 191 | 47 | 146.022628 |
| 116 | 158 | 235.671658 |
| 90 | 77 | 328.365416 |
| 249 | 78 | 99.348342 |


In [1172]:
df2 = pd.DataFrame({'Actual': y2_test, 'Predicted': y2_pred})  
df2.head(10)

| | Actual | Predicted |
| --- | --- | --- |
| 158 | 24 | 118.96481 |
| 83 | 450 | 319.444919 |
| 170 | 1082 | 1017.172472 |
| 101 | 695 | 280.150156 |
| 150 | 6800 | 4280.028983 |
| 199 | 708 | 344.687058 |
| 118 | 38 | 146.501792 |
| 227 | 548 | 290.93742 |
| 63 | 4 | 111.05723 |
| 135 | 503 | 692.610368 |
