Unit 2- Lesson 5- Project 2

## Challenge: Validating a linear regression

### Validating regression models for prediction

Statistical tests are useful for making sure a model is a good fit to the data it was tested on, and that all the features are useful to the model. However, to make sure a model has good predictive validity for new data, it is necessary to assess the performance of the model on new datasets.

The procedure is the same as what you learned in the Naive Bayes lesson: the holdout method and the cross-validation method are both available. You've already written code to run these kinds of validation for Naive Bayes; now you can try it again with linear regression. In this case, your goal is a model with a consistent $R^2$ and only statistically significant parameters across multiple samples.
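
As a minimal sketch of the two validation methods named above, the following runs a holdout split and a 5-fold cross-validation on a linear regression. The data is synthetic (an assumption made for illustration, not the FBI:UCR table), and the coefficients and noise level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the FBI:UCR table).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 5, size=200)

# Holdout: fit on 70% of the rows, score on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)

# Cross-validation: one R^2 per fold; a small spread suggests a stable fit.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5)

print("holdout R^2:", round(holdout_r2, 3))
print("fold R^2:", np.round(cv_r2, 3))
print("spread across folds:", round(cv_r2.max() - cv_r2.min(), 3))
```

A "consistent" model in this sense is one whose fold scores stay close to each other and to the holdout score.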

We'll use the property crime model you've been working with, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website. Options include other states' crime rates in 2013, crime rates in New York State in other years, or a combination of these.

### Iterate

Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.
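
One way to compare an old and a revised model fairly is to score both on the same folds. The sketch below does this with a `KFold` that has a fixed `random_state`, so both feature sets see identical splits. The data and the two feature sets (`"old"`, `"revised"`) are hypothetical and synthetic; only the comparison pattern is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
n = 300
population = rng.uniform(1_000, 100_000, size=n)
robbery = 0.001 * population + rng.normal(0, 20, size=n)
noise_feature = rng.normal(0, 1, size=n)
y = 0.02 * population + 4.0 * robbery + rng.normal(0, 50, size=n)

# Two hypothetical formulations of the model.
feature_sets = {
    "old": np.column_stack([population, noise_feature]),
    "revised": np.column_stack([population, robbery]),
}

# A fixed random_state makes every call to cross_val_score use the same folds.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
results = {
    name: cross_val_score(LinearRegression(), X, y, cv=folds)
    for name, X in feature_sets.items()
}

for name, scores in results.items():
    print(name, "mean R^2:", round(scores.mean(), 4), "std:", round(scores.std(), 4))
```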

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.



## Import Data and clean

In [1]:
# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


%matplotlib inline
sns.set_style('white')

In [2]:
# Read in the Excel file
df = pd.read_excel('table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls')

In [3]:
# The top few rows of the data frame are blank: drop them, take the column titles from row 3,
# then reset the index
df_nypd = df.drop([0, 1, 2, 3])
df_nypd.columns = df.iloc[3]
df_nypd = df_nypd.reset_index(drop=True)

In [4]:
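# Clean up the column titles for quick access and iteration
# (regex=False makes '(' and ')' literal characters on every pandas version)
df_nypd.columns = (
    df_nypd.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('\n', '_', regex=False)
)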
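
Since the same column clean-up is applied to both years, it could be wrapped in a small helper. This is only a sketch: `clean_columns` is a hypothetical name, and the raw header strings below are guesses at what the spreadsheet headers look like, chosen so that the results match the column names shown by `info()` later on.

```python
import pandas as pd


def clean_columns(columns: pd.Index) -> pd.Index:
    """Normalize header strings: trim, lowercase, and replace separators."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
        .str.replace("(", "", regex=False)
        .str.replace(")", "", regex=False)
        .str.replace("\n", "_", regex=False)
    )


# Hypothetical raw headers (assumed, not read from the actual file).
raw = pd.Index(
    [
        "City",
        "Murder and\nnonnegligent\nmanslaughter",
        "Rape\n(revised\ndefinition)1",
        "Larceny-\ntheft",
    ]
)
cleaned = clean_columns(raw)
print(list(cleaned))
```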


In [5]:
#last three rows are blank/null remove them from data set
df_nypd = df_nypd[:-3] 

In [6]:
#read in second data set from 2014
df2 = pd.read_excel('Table_8_Offenses_Known_to_Law_Enforcement_by_New_York_by_City_2014.xls')

In [7]:
#repeat the process- locate column title rows and drop blanks, then reset index
df_14 = df2.drop([0,1,2,3])
df_14.columns = df2.iloc[3]
df_14 = df_14.reset_index(drop=True)

In [8]:
# Clean the column titles for quick iteration
df_14.columns = (
    df_14.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('\n', '_', regex=False)
)

In [9]:
#drop blank rows from the end of the data set
df_14 = df_14[:-7]

In [10]:
#removed New York from data set as an outlier (even though it is where I live and I would like to be inclusive, it deserves its own model)
df_nypd= df_nypd.loc[df_nypd['city'] != 'New York']
df_14= df_14.loc[df_14['city'] != 'New York']


In [11]:
df_14.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 369 entries, 0 to 368
Data columns (total 13 columns):
city                                    369 non-null object
population                              369 non-null object
violent_crime                           369 non-null object
murder_and_nonnegligent_manslaughter    369 non-null object
rape_revised_definition1                227 non-null object
rape_legacy_definition2                 142 non-null object
robbery                                 369 non-null object
aggravated_assault                      369 non-null object
property_crime                          368 non-null object
burglary                                369 non-null object
larceny-_theft                          368 non-null object
motor_vehicle_theft                     369 non-null object
arson3                                  365 non-null object
dtypes: object(13)
memory usage: 40.4+ KB


In [12]:
df_nypd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 347 entries, 0 to 347
Data columns (total 13 columns):
city                                    347 non-null object
population                              347 non-null object
violent_crime                           347 non-null object
murder_and_nonnegligent_manslaughter    347 non-null object
rape_revised_definition1                0 non-null object
rape_legacy_definition2                 347 non-null object
robbery                                 347 non-null object
aggravated_assault                      347 non-null object
property_crime                          347 non-null object
burglary                                347 non-null object
larceny-_theft                          347 non-null object
motor_vehicle_theft                     347 non-null object
arson3                                  187 non-null object
dtypes: object(13)
memory usage: 38.0+ KB
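
Both `info()` outputs report every column as `object`, because the columns were sliced out of a sheet that also held header text. A possible follow-up step, sketched here on a tiny hand-made frame, is to convert the count columns with `pd.to_numeric`. The population values are the first three from the 2013 table; the `property_crime` values and the city names are made up for illustration.

```python
import pandas as pd

# Miniature stand-in for df_nypd, stored as object dtype like the real frame.
sample = pd.DataFrame(
    {
        "city": ["A", "B", "C"],            # placeholder names
        "population": [1861, 2577, 2846],
        "property_crime": [12, 24, None],   # made-up values, one missing
    }
).astype(object)

numeric_cols = ["population", "property_crime"]
converted = sample.copy()
# errors="coerce" turns anything unparseable into NaN instead of raising.
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted["property_crime"].isna().sum(), "missing target value(s)")
```

With numeric dtypes, comparisons such as `> 3` and calls such as `describe()` behave as expected for numbers rather than for generic Python objects.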


## Select Features

In [13]:
features = pd.DataFrame(df_nypd['population'])

In [14]:
#features['population_sqd'] = df_nypd['population']*df_nypd['population']

In [15]:
features['violent_crime'] = df_nypd['violent_crime'].dropna()
features['theft']= df_nypd['larceny-_theft'].dropna()
features['robbery']= df_nypd['robbery'].dropna()
features['assault']=df_nypd['aggravated_assault'].dropna()

In [16]:
features.head(20)

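|   | population | violent_crime | theft | robbery | assault |
|---|---|---|---|---|---|
| 0 | 1861 | 0 | 10 | 0 | 0 |
| 1 | 2577 | 3 | 20 | 0 | 3 |
| 2 | 2846 | 3 | 15 | 0 | 3 |
| 3 | 97956 | 791 | 3243 | 227 | 526 |
| 4 | 6388 | 23 | 165 | 4 | 16 |
| 5 | 4089 | 5 | 36 | 3 | 2 |
| 6 | 1781 | 3 | 10 | 0 | 3 |
| 7 | 118296 | 107 | 1882 | 31 | 68 |
| 8 | 9519 | 9 | 188 | 4 | 3 |
| 9 | 18182 | 30 | 291 | 12 | 18 |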


In [17]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 347 entries, 0 to 347
Data columns (total 5 columns):
population       347 non-null object
violent_crime    347 non-null object
theft            347 non-null object
robbery          347 non-null object
assault          347 non-null object
dtypes: object(5)
memory usage: 16.3+ KB


In [18]:
target = df_nypd['property_crime']

In [19]:
X = features.values
y = target.values

In [20]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

In [21]:
# Create a Linear regression and fit it to the training set
regr = LinearRegression()

regr.fit(X_train,y_train)

# Predict on the test data: y_pred
y_pred = regr.predict(X_test)

# Compute and print R^2 
print("R^2: {}".format(regr.score(X_test, y_test)))
print(regr.score(X_train, y_train))

R^2: 0.994386616143785
0.9988698267326669


In [22]:
cv_results= cross_val_score(regr, X, y, cv=3)
print(cv_results)
print(np.mean(cv_results))

[0.99880899 0.99171982 0.9972956 ]
0.9959414717008689
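
Note that `cv=3` with a regressor uses consecutive, unshuffled folds. If the rows have any ordering, fold scores can differ for reasons unrelated to the model. The sketch below, on synthetic data sorted by the feature to exaggerate the effect, contrasts the default with a shuffled `KFold`. The FBI table is listed by city name, so the effect there may well be milder than in this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption), deliberately sorted by the feature.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 100, size=150))
X = x.reshape(-1, 1)
y = 2.0 * x + rng.normal(0, 10, size=150)

# cv=3 -> KFold(n_splits=3) without shuffling: each fold is a contiguous block.
unshuffled = cross_val_score(LinearRegression(), X, y, cv=3)

# Shuffled folds with a fixed seed, so the result is reproducible.
shuffled_folds = KFold(n_splits=3, shuffle=True, random_state=42)
shuffled = cross_val_score(LinearRegression(), X, y, cv=shuffled_folds)

print("unshuffled:", np.round(unshuffled, 3))
print("shuffled:  ", np.round(shuffled, 3))
```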


In [23]:
data = pd.DataFrame(features)

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 347 entries, 0 to 347
Data columns (total 5 columns):
population       347 non-null object
violent_crime    347 non-null object
theft            347 non-null object
robbery          347 non-null object
assault          347 non-null object
dtypes: object(5)
memory usage: 16.3+ KB


In [25]:
# Instantiate and fit our model.
reg = LinearRegression()
Yy = y.reshape(-1, 1)
Xx = data[['population','robbery','violent_crime']]
reg.fit(Xx, Yy)

# Inspect the results.
print('\nCoefficients: \n', reg.coef_)
print('\nIntercept: \n', reg.intercept_)
print('\nR-squared:')
print(reg.score(Xx, Yy))


Coefficients: 
 [[ 0.01242641 -2.9106039   4.10925889]]

Intercept: 
 [29.02435237]

R-squared:
0.9272347381948852
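
The challenge also asks for only statistically significant parameters, which `LinearRegression` from scikit-learn does not report. One option is the `statsmodels` formula API imported at the top of the notebook. The sketch below fits an OLS model on synthetic data (an assumption, with an extra `noise` column added on purpose) and lists the terms whose p-value is below 0.05.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
n = 200
demo = pd.DataFrame(
    {
        "population": rng.uniform(1_000, 100_000, size=n),
        "robbery": rng.poisson(10, size=n).astype(float),
        "noise": rng.normal(0, 1, size=n),  # unrelated to the target by construction
    }
)
demo["property_crime"] = (
    0.02 * demo["population"] + 5.0 * demo["robbery"] + rng.normal(0, 30, size=n)
)

fit = smf.ols("property_crime ~ population + robbery + noise", data=demo).fit()

# p-values per term, and the terms that pass a 0.05 threshold.
print(fit.pvalues.round(4))
significant = fit.pvalues[fit.pvalues < 0.05].index.tolist()
print("significant terms:", significant)
print("R-squared:", round(fit.rsquared, 4))
```

Repeating this fit on each training fold would show whether the same parameters stay significant across samples.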


In [26]:
cv_results= cross_val_score(reg, Xx, Yy, cv=3)
print("Cross Validation Score",cv_results)
print(np.mean(cv_results))

Cross Validation Score [0.96095093 0.73770297 0.86115825]
0.8532707154782332


## Test on a new dataset

In [27]:
pd.options.display.max_rows = 3000

In [28]:
features2 = pd.DataFrame(df_14['population'])

In [29]:
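# Build the feature columns for the second dataset.
# NOTE: population comes from df_14 (2014), but the four crime columns below are
# read from df_nypd (2013), so each row mixes two years, matched only by row index.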
features2['violent_crime'] = df_nypd['violent_crime'].dropna()
features2['theft']= df_nypd['larceny-_theft'].dropna()
features2['robbery']= df_nypd['robbery'].dropna()
features2['assault']=df_nypd['aggravated_assault'].dropna()

In [30]:
#features2 = features2[:-20]

In [31]:
features2 = features2.dropna(axis=0, how='any')

In [32]:
features2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 347 entries, 0 to 347
Data columns (total 5 columns):
population       347 non-null object
violent_crime    347 non-null object
theft            347 non-null object
robbery          347 non-null object
assault          347 non-null object
dtypes: object(5)
memory usage: 16.3+ KB


In [33]:
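# Keep only rows with property_crime > 3 so that the target has the same length as features2.
# NOTE: this filter is applied independently of features2, so the remaining target rows
# are not guaranteed to belong to the same cities as the feature rows.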
target2 = df_14.loc[df_14['property_crime'] > 3, 'property_crime']

In [34]:
target2= target2.dropna()

In [35]:
target2.describe()

count     347
unique    225
top         9
freq        9
Name: property_crime, dtype: int64

In [36]:
X2 = features2.values
y2 = target2.values
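
Because `features2` and `target2` are filtered separately, their rows can end up the same length without referring to the same cities. A safer pattern, sketched here on a hypothetical four-row frame (`df_14_demo` and its values are made up), is to select features and target from one frame and drop incomplete rows in a single step.

```python
import pandas as pd

# Hypothetical miniature of the 2014 table (values are made up).
df_14_demo = pd.DataFrame(
    {
        "city": ["A", "B", "C", "D"],
        "population": [1000, 2000, None, 4000],
        "robbery": [1, 2, 3, 4],
        "violent_crime": [5, 6, 7, None],
        "property_crime": [10, 20, 30, 40],
    }
)

feature_cols = ["population", "robbery", "violent_crime"]
target_col = "property_crime"

# Drop a row if ANY feature or the target is missing, so X and y stay aligned.
clean = df_14_demo.dropna(subset=feature_cols + [target_col])

X2_demo = clean[feature_cols].to_numpy(dtype=float)
y2_demo = clean[target_col].to_numpy(dtype=float)

print(list(clean["city"]))
print(X2_demo.shape, y2_demo.shape)
```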

In [37]:
# Create training and test sets
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.3, random_state=42)

In [38]:
# Create a Linear regression and fit it to the training set
regr2 = LinearRegression()

regr2.fit(X2_train,y2_train)

# Predict on the test data: y_pred
y2_pred = regr2.predict(X2_test)

# Compute and print R^2 
print("R^2: {}".format(regr2.score(X2_test, y2_test)))
print(regr2.score(X2_train, y2_train))

R^2: -1.8022997535196148
0.00405185940750219


In [39]:
cv_results2= cross_val_score(regr2, X2, y2, cv=3)
print(cv_results2)
print(np.mean(cv_results2))

[-0.55444713 -0.0113326  -0.30386105]
-0.2898802614718677


In [40]:
data2 = pd.DataFrame(features2)

In [41]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 347 entries, 0 to 347
Data columns (total 5 columns):
population       347 non-null object
violent_crime    347 non-null object
theft            347 non-null object
robbery          347 non-null object
assault          347 non-null object
dtypes: object(5)
memory usage: 16.3+ KB


In [42]:
# Instantiate and fit our model.
reg2 = LinearRegression()
Yy2 = y2.reshape(-1, 1)
Xx2 = data2[['population','robbery','violent_crime']]
reg2.fit(Xx2, Yy2)

# Inspect the results.
print('\nCoefficients: \n', reg2.coef_)
print('\nIntercept: \n', reg2.intercept_)
print('\nR-squared:')
print(reg2.score(Xx2, Yy2))


Coefficients: 
 [[-4.83243646e-05 -2.38852727e+01  9.78650691e+00]]

Intercept: 
 [667.98771714]

R-squared:
0.0011754072490490763


In [43]:
cv_results2= cross_val_score(reg2, Xx2, Yy2, cv=3)
print("Cross Validation Score",cv_results2)
print(np.mean(cv_results2))

Cross Validation Score [-0.14163744 -0.01361468 -0.43842376]
-0.19789195837334125


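This was a bit tricky. On the 2013 data, the full model scores an R-squared of about 0.99 on both the holdout set and 3-fold cross-validation, and the three-feature model (population, robbery, violent_crime) scores 0.93 in-sample, with fold scores between 0.74 and 0.96. On the second dataset, neither model works: R-squared is close to zero or negative in every test, so the models do not have a consistent R-squared.

After some review, two things in the notebook likely explain this. First, the 2013 scores for the full model are inflated, because larceny-theft is itself a component of property crime in the UCR tables, so the `theft` feature leaks the target. Second, in cell In [29] the crime features are read from `df_nypd` (2013) rather than `df_14`, and in cell In [33] the target is filtered separately from the features, so feature rows and target rows no longer describe the same cities. The 2014 results should be rerun with aligned rows before drawing conclusions about the model itself.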
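
The cells above fit a fresh model on the second dataset rather than scoring the 2013 model on it. A more direct out-of-sample check is to fit on one year and call `score` on the other. The sketch below shows the pattern on two synthetic "years" drawn from the same process; `make_year` is a hypothetical helper and none of the numbers come from the FBI:UCR tables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)


def make_year(n, rng):
    """Generate one synthetic year of (features, target); illustrative only."""
    population = rng.uniform(1_000, 100_000, size=n)
    robbery = 0.001 * population + rng.normal(0, 10, size=n)
    X = np.column_stack([population, robbery])
    y = 0.02 * population + 3.0 * robbery + rng.normal(0, 40, size=n)
    return X, y


X_2013, y_2013 = make_year(300, rng)
X_2014, y_2014 = make_year(320, rng)

# Fit once on the first year, then score the SAME fitted model on the second year.
model = LinearRegression().fit(X_2013, y_2013)
r2_in = model.score(X_2013, y_2013)
r2_out = model.score(X_2014, y_2014)

print("in-sample R^2 (2013):    ", round(r2_in, 4))
print("out-of-sample R^2 (2014):", round(r2_out, 4))
```

A large gap between the two scores would point to overfitting or to a change in the relationship between years.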