# Challenge: Validating a linear regression

Your goal is to achieve a model with a consistent R2 and only statistically significant parameters across multiple samples.

We'll use the property crime model you've been working on, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website.

Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.

In [1931]:
#importing modules and potential modules
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import nltk
import math
import warnings
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from sklearn.model_selection import cross_val_score
from sklearn import metrics  
from scipy import stats
# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd")

In [1932]:
#importing the NY crime dataset
data = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/NYCcrimedataframemlr.csv')

In [1933]:
#changing column names
data_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data.columns = data_cols
data.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [1934]:
data.dtypes

City                    object
Population               int64
Violent_Crime            int64
Murder                   int64
Rape1                  float64
Rape2                    int64
Robbery                  int64
Aggravated_Assault       int64
Property                 int64
Burglary                 int64
Larceny_Theft            int64
Motor_Vehicle_Theft      int64
Arson3                 float64
dtype: object

In [1935]:
#winsorizing data to "fix" anomalous values
#every column except City is numeric, so apply the same 1% limits to each one
for col in data.columns.drop('City'):
    data[col] = scipy.stats.mstats.winsorize(data[col], limits = 0.01)
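
To make the effect of winsorizing concrete, here is a minimal sketch on a made-up ten-value array (the array and the 10% limits are assumptions chosen for readability, not the crime data). Values beyond the limits are replaced by the nearest value that is kept. With `limits = 0.01` on the 348 rows used here, that should work out to the three smallest and three largest values in each column.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical toy column: nine ordinary values and one extreme outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# limits=(0.1, 0.1) replaces the lowest 10% and the highest 10% of the values
# (one value on each side here) with the nearest value that is kept.
clipped = winsorize(values, limits=(0.1, 0.1))

print(clipped.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```

Because each column is winsorized on its own, the largest cities can end up sharing identical capped values in every column.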

In [1936]:
data.sort_values('Population', inplace=True, ascending=False)
print(data.head(n=3))

        City  Population  Violent_Crime  Murder  Rape1  Rape2  Robbery  \
0   New York      199134           1192      21    NaN     75      400   
1    Buffalo      199134           1192      21    NaN     75      400   
2  Rochester      199134           1192      21    NaN     75      400   

   Aggravated_Assault  Property  Burglary  Larceny_Theft  Motor_Vehicle_Theft  \
0                 696      6473      1781           4298                  394   
1                 696      6473      1781           4298                  394   
2                 696      6473      1781           4298                  394   

   Arson3  
0     NaN  
1     NaN  
2   132.0  


In [1937]:
#Robbery and Murder are currently continuous variables. 
#For this model, please use these variables to create categorical 
#features where values greater than 0 are coded 1, 
#and values equal to 0 are coded 0.

#categorical feature
onecat = 1

In [1938]:
#Robbery
data['RobberyCat'] = np.where(data['Robbery'] < onecat, 0, 1)

#Murder
data['MurderCat'] = np.where(data['Murder'] < onecat, 0, 1)

#reindex the columns to add these two
data = data.reindex(columns=['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3', 'MurderCat','RobberyCat'])

In [1939]:

#set X and Y
X = data[['Population','MurderCat','RobberyCat']]
Y = data['Property']

In [1940]:
#train test split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

#shape
X_train.shape, y_train.shape

X_test.shape, y_test.shape


((174, 3), (174,))

In [1941]:
#linear regression

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X.
regr = linear_model.LinearRegression()

rfit = regr.fit(X_train, y_train)
rfit

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1942]:
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
# Note: this score uses all of X and Y, so it mixes training and test rows.
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [ 2.68587240e-02  1.48941254e+02 -1.32300259e+01]

Intercept: 
 -67.43954052377154

R-squared:
0.818951663511899
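
The challenge asks for only statistically significant parameters, and `LinearRegression` does not report significance. The sketch below shows, on synthetic data rather than the crime data, how t-statistics and two-sided p-values can be computed for an ordinary least squares fit. All names in it (`x1`, `x2`, `p_values`) are illustrative assumptions. The statsmodels OLS summary reports the same quantities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)              # informative feature (synthetic)
x2 = rng.normal(size=n)              # pure-noise feature (synthetic)
y = 3.0 * x1 + rng.normal(size=n)    # the target depends on x1 only

# Design matrix with an explicit intercept column.
X_design = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares coefficients.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residual variance and the coefficient covariance matrix.
resid = y - X_design @ beta
dof = n - X_design.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)

# t-statistics and two-sided p-values for each coefficient.
t_stats = beta / np.sqrt(np.diag(cov))
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, p in zip(["intercept", "x1", "x2"], p_values):
    print(f"{name}: p = {p:.4f}")
```

A feature whose p-value stays above the chosen threshold across samples would be a candidate for removal from the model.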


In [1943]:
#test it
accuracy = regr.score(X_test, y_test)
accuracy

0.8163757550733051
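
A single 50/50 split gives one R2 value, while the challenge asks for a consistent R2 across multiple samples. One common way to check that is k-fold cross-validation. The sketch below runs on synthetic data (`X_demo` and `y_demo` are assumptions, not the crime data) and shuffles the folds, which matters here because `data` was sorted by Population earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and target.
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Shuffle before splitting so that sorted data does not produce
# folds that each cover only one size range.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

A small standard deviation across folds suggests the model is not overfitting to any one sample.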

In [1944]:
#y_pred

y_pred=rfit.predict(X_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit.score(X_test, y_test)))

R-squared of the linear regression on the test set: 0.82


In [1945]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y_test, y_pred, multioutput='uniform_average')

0.8167730316225764

In [1946]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average')

166.84398056312813

In [1947]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

94735.42775091765

In [1948]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y_test, y_pred)

85.44953244407714

In [1949]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.816375755073305
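
The explained variance score and R2 above are close but not identical. The sketch below recomputes both by hand on a made-up four-point example, following the definitions scikit-learn documents for these metrics (the arrays are assumptions for illustration). Every prediction is too high by exactly 1, so the residuals have zero variance: explained variance comes out at 1.0 while R2 is 0.8, because only R2 penalizes the constant bias.

```python
import numpy as np

# Hypothetical targets and predictions: each prediction is off by exactly +1.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([4.0, 6.0, 8.0, 10.0])

resid = y_true - y_hat

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Explained variance = 1 - Var(residuals) / Var(y); it ignores a constant offset.
explained_var = 1 - np.var(resid) / np.var(y_true)

# Error metrics, in the units of the target.
mae = np.mean(np.abs(resid))
mse = np.mean(resid ** 2)
median_ae = np.median(np.abs(resid))

print(round(r2, 3), round(explained_var, 3), mae, mse, median_ae)
```

Unlike R2, the three error metrics are not comparable across datasets whose targets have different scales, which is worth keeping in mind when setting the NY numbers beside the GA numbers.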

In [1950]:
#print commands for all scores??? ***Later

In [1951]:
#importing the GA dataset to compare
#note: same data cleaning will be done, followed by the reg, etc. 

data2 = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/GACcrimedataframemlr.csv')

In [1952]:
#changing column names
data2_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data2.columns = data2_cols
data2.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [1953]:
#winsorize
#every column except City is numeric, so apply the same 1% limits to each one
for col in data2.columns.drop('City'):
    data2[col] = scipy.stats.mstats.winsorize(data2[col], limits = 0.01)


In [1954]:
#Robbery and Murder are currently continuous variables. 
#For this model, please use these variables to create categorical 
#features where values greater than 0 are coded 1, 
#and values equal to 0 are coded 0.

#categorical feature
onecat = 1

In [1955]:
#Robbery
data2['RobberyCat'] = np.where(data2['Robbery'] < onecat, 0, 1)

#Murder
data2['MurderCat'] = np.where(data2['Murder'] < onecat, 0, 1)

#reindex the columns to add these two
data2 = data2.reindex(columns=['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3', 'MurderCat','RobberyCat'])

In [1956]:

#set X2 and Y2
X2 = data2[['Population','MurderCat','RobberyCat']]
Y2 = data2['Property']

In [1957]:
#train test split

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y2, test_size=0.5, random_state=0)

#shape
X2_train.shape, y2_train.shape

X2_test.shape, y2_test.shape


((127, 3), (127,))

In [1958]:
#linear regression

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X2.
regr2 = linear_model.LinearRegression()

rfit2 = regr2.fit(X2_train, y2_train)
rfit2

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1959]:
# Inspect the results.
print('\nCoefficients: \n', regr2.coef_)
print('\nIntercept: \n', regr2.intercept_)
print('\nR-squared:')
print(regr2.score(X2,Y2))


Coefficients: 
 [3.42391659e-02 2.81695445e+02 6.51055124e+01]

Intercept: 
 -12.103836620855361

R-squared:
0.8588997166405096


In [1960]:
#test it
accuracy2 = regr2.score(X2_test, y2_test)
accuracy2

0.8859235522905617

In [1961]:
#y2_pred

y2_pred=rfit2.predict(X2_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit2.score(X2_test, y2_test)))

R-squared of the linear regression on the test set: 0.89


In [1962]:
#explained variance score
from sklearn.metrics import explained_variance_score
explained_variance_score(y2_test, y2_pred, multioutput='uniform_average')

0.8859521039895434

In [1963]:
#Mean Absolute Error

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average')

205.87686430232523

In [1964]:
#Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y2_test, y2_pred)

216973.2703351046

In [1965]:
#Median Absolute Error
from sklearn.metrics import median_absolute_error
median_absolute_error(y2_test, y2_pred)

50.12889640442333

In [1966]:
#R2 Score
from sklearn.metrics import r2_score
r2_score(y2_test, y2_pred)


0.8859235522905617

## Summary

### New York Dataset

In [1967]:
print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y_test, y_pred, multioutput='uniform_average'), 2))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y_test, y_pred,  sample_weight=None, multioutput='uniform_average'), 2))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y_test, y_pred), 2))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y_test, y_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y_test, y_pred), 2))


The Explained Variance Score of the NY dataset is: 0.82
The Mean Absolute Error of the NY dataset is: 166.84
The Mean Squared Error of the NY dataset is: 94735.43
The Median Absolute Error of the NY dataset is: 85.45
The R2 Score of the NY dataset is: 0.82


### Georgia Dataset

In [1968]:
print("The Explained Variance Score of the GA dataset is:", round(explained_variance_score(y2_test, y2_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the GA dataset is:", round(mean_absolute_error(y2_test, y2_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the GA dataset is:", round(mean_squared_error(y2_test, y2_pred), 2 ))
print("The Median Absolute Error of the GA dataset is:", round(median_absolute_error(y2_test, y2_pred), 2))
print("The R2 Score of the GA dataset is:", round(r2_score(y2_test, y2_pred), 2))


The Explained Variance Score of the GA dataset is: 0.89
The Mean Absolute Error of the GA dataset is: 205.88
The Mean Squared Error of the GA dataset is: 216973.27
The Median Absolute Error of the GA dataset is: 50.13
The R2 Score of the GA dataset is: 0.89
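
The two summaries above come from two separately fitted models, one per state. Another way to validate, closer to the challenge's wording, is to fit on one dataset and score the same fitted model on the other. The sketch below illustrates the idea on synthetic data; `make_state`, the slopes, and the noise level are all hypothetical stand-ins, not values taken from the FBI:UCR files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def make_state(n, slope):
    """Hypothetical stand-in for one state's data: population -> property crime."""
    pop = rng.uniform(1_000, 100_000, size=n)
    prop = slope * pop + rng.normal(scale=200.0, size=n)
    return pop.reshape(-1, 1), prop

X_a, y_a = make_state(300, 0.027)   # plays the role of the fitting state
X_b, y_b = make_state(250, 0.034)   # plays the role of the validation state

# Fit once on state A, then score the same model on both states.
model = LinearRegression().fit(X_a, y_a)
r2_in = model.score(X_a, y_a)
r2_out = model.score(X_b, y_b)

print("R2 on the fitting state:   ", round(r2_in, 3))
print("R2 on the validation state:", round(r2_out, 3))
```

A large drop from the first score to the second would indicate that the coefficients do not transfer between states, even if each state fits well on its own.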


## A look at testing vs prediction

In [1969]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df.head(10)

| | Actual | Predicted |
|---|---|---|
| 6 | 4090 | 2699.244859 |
| 52 | 1040 | 835.652292 |
| 269 | 29 | 4.085242 |
| 45 | 1140 | 742.765195 |
| 294 | 73 | -10.60648 |
| 189 | 23 | 107.947927 |
| 191 | 47 | 104.187706 |
| 116 | 158 | 245.556496 |
| 90 | 77 | 412.364706 |
| 249 | 78 | 25.303634 |


In [1970]:
df2 = pd.DataFrame({'Actual': y2_test, 'Predicted': y2_pred})  
df2.head(10)

| | Actual | Predicted |
|---|---|---|
| 158 | 24 | 23.3337 |
| 83 | 450 | 234.298059 |
| 170 | 1082 | 1514.339103 |
| 101 | 695 | 378.410708 |
| 150 | 6800 | 3456.52155 |
| 199 | 708 | 191.088232 |
| 118 | 38 | 84.142459 |
| 227 | 548 | 621.587091 |
| 63 | 4 | 5.871725 |
| 135 | 503 | 1124.310938 |


#### I do not think these results are as accurate as they could be, so I would like to tweak the model to see whether I can improve it.

<b>The first thing that I would like to do is remove the categorical features from the model. </b>

<b>NY Dataset:</b>

In [1971]:
#Calling everything in this model _3
#set X3 and y3
X3 = data[['Population','Murder','Robbery']]
y3 = data['Property']

#train test split

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.5, random_state=0)

#shape
X3_train.shape, y3_train.shape

X3_test.shape, y3_test.shape

((174, 3), (174,))

In [1972]:
#linear regression

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X3.
regr3 = linear_model.LinearRegression()

rfit3 = regr3.fit(X3_train, y3_train)
rfit3

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1973]:
# Inspect the results.
print('\nCoefficients: \n', regr3.coef_)
print('\nIntercept: \n', regr3.intercept_)
print('\nR-squared:')
print(regr3.score(X3,y3))


Coefficients: 
 [ 1.58918239e-02  2.01231521e+02 -1.82430872e+00]

Intercept: 
 44.47387385248015

R-squared:
0.8832975478217796


In [1974]:
#test it
accuracy3 = regr3.score(X3_test, y3_test)
accuracy3

0.8144104514642367

In [1975]:
#y3_pred

y3_pred=rfit3.predict(X3_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit3.score(X3_test, y3_test)))

R-squared of the linear regression on the test set: 0.81


## Scoring

In [1976]:
print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y3_test, y3_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y3_test, y3_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y3_test, y3_pred), 2 ))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y3_test, y3_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y3_test, y3_pred), 2))

The Explained Variance Score of the NY dataset is: 0.81
The Mean Absolute Error of the NY dataset is: 164.02
The Mean Squared Error of the NY dataset is: 95749.37
The Median Absolute Error of the NY dataset is: 74.27
The R2 Score of the NY dataset is: 0.81


In [1977]:
df3 = pd.DataFrame({'Actual': y3_test, 'Predicted': y3_pred})  
df3.head(10)

| | Actual | Predicted |
|---|---|---|
| 6 | 4090 | 2796.907464 |
| 52 | 1040 | 1231.030472 |
| 269 | 29 | 86.793801 |
| 45 | 1140 | 460.53737 |
| 294 | 73 | 78.100973 |
| 189 | 23 | 148.247484 |
| 191 | 47 | 146.022628 |
| 116 | 158 | 235.671658 |
| 90 | 77 | 328.365416 |
| 249 | 78 | 99.348342 |


<b>There was barely any change (test R2 went from 0.82 to 0.81), so instead of doing the same to the Georgia dataset, I've decided to experiment with the columns in X.</b>

In [1978]:
#let's add a column to the NY dataset LR
#adding the Violent_Crime column

#Calling everything in this model _4
#set X4 and y4
X4 = data[['Population','Violent_Crime','Murder','Robbery']]
y4 = data['Property']

#train test split

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.5, random_state=0)

#shape
X4_train.shape, y4_train.shape

X4_test.shape, y4_test.shape


((174, 4), (174,))

In [1979]:
#linear regression

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X4.
regr4 = linear_model.LinearRegression()

rfit4 = regr4.fit(X4_train, y4_train)
rfit4

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1980]:
# Inspect the results.
print('\nCoefficients: \n', regr4.coef_)
print('\nIntercept: \n', regr4.intercept_)
print('\nR-squared:')
print(regr4.score(X4,y4))


Coefficients: 
 [ 1.51900217e-02  4.92003645e+00  1.55253901e+02 -1.37236286e+01]

Intercept: 
 36.35806744133794

R-squared:
0.9095012291598568


In [1981]:
#test it
accuracy4 = regr4.score(X4_test, y4_test)
accuracy4



0.8801088173390196

In [1982]:
#y4_pred

y4_pred=rfit4.predict(X4_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit4.score(X4_test, y4_test)))

R-squared of the linear regression on the test set: 0.88


## Scoring

In [1983]:
print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y4_test, y4_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y4_test, y4_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y4_test, y4_pred), 2 ))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y4_test, y4_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y4_test, y4_pred), 2))


The Explained Variance Score of the NY dataset is: 0.88
The Mean Absolute Error of the NY dataset is: 144.41
The Mean Squared Error of the NY dataset is: 61854.26
The Median Absolute Error of the NY dataset is: 72.55
The R2 Score of the NY dataset is: 0.88


In [1984]:
df4 = pd.DataFrame({'Actual': y4_test, 'Predicted': y4_pred})  
df4.head(10)

| | Actual | Predicted |
|---|---|---|
| 6 | 4090 | 3542.828184 |
| 52 | 1040 | 1328.293252 |
| 269 | 29 | 101.409277 |
| 45 | 1140 | 793.398361 |
| 294 | 73 | 73.42019 |
| 189 | 23 | 135.548909 |
| 191 | 47 | 133.422306 |
| 116 | 158 | 212.052479 |
| 90 | 77 | 317.552688 |
| 249 | 78 | 118.329431 |


<b>Scores improve noticeably after adding this column (test R2 rises from 0.81 to 0.88, and MSE falls from 95749.37 to 61854.26), so I would like to test it on the Georgia dataset.</b>

In [1985]:
#Calling everything in this model _5
#set X5 and y5
X5 = data2[['Population','Violent_Crime','Murder','Robbery']]
y5 = data2['Property']

#train test split

X5_train, X5_test, y5_train, y5_test = train_test_split(X5, y5, test_size=0.5, random_state=0)

#shape
X5_train.shape, y5_train.shape

X5_test.shape, y5_test.shape


((127, 4), (127,))

In [1986]:


#linear regression

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X5.
regr5 = linear_model.LinearRegression()

rfit5 = regr5.fit(X5_train, y5_train)
rfit5

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [1987]:
# Inspect the results.
print('\nCoefficients: \n', regr5.coef_)
print('\nIntercept: \n', regr5.intercept_)
print('\nR-squared:')
print(regr5.score(X5,y5))


Coefficients: 
 [ 1.32328973e-02  4.88621578e+00 -1.39957489e+01  4.79753949e+00]

Intercept: 
 53.24927733119881

R-squared:
0.9631291243526451


In [1988]:
#test it
accuracy5 = regr5.score(X5_test, y5_test)
accuracy5


0.9656660515072408

In [1989]:
#y5_pred

y5_pred=rfit5.predict(X5_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit5.score(X5_test, y5_test)))

R-squared of the linear regression on the test set: 0.97


## Scoring

In [1990]:
print("The Explained Variance Score of the GA dataset is:", round(explained_variance_score(y5_test, y5_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the GA dataset is:", round(mean_absolute_error(y5_test, y5_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the GA dataset is:", round(mean_squared_error(y5_test, y5_pred), 2 ))
print("The Median Absolute Error of the GA dataset is:", round(median_absolute_error(y5_test, y5_pred), 2))
print("The R2 Score of the GA dataset is:", round(r2_score(y5_test, y5_pred), 2))

The Explained Variance Score of the GA dataset is: 0.97
The Mean Absolute Error of the GA dataset is: 126.73
The Mean Squared Error of the GA dataset is: 65303.13
The Median Absolute Error of the GA dataset is: 62.03
The R2 Score of the GA dataset is: 0.97


In [1991]:
df5 = pd.DataFrame({'Actual': y5_test, 'Predicted': y5_pred})  
df5.head(15)

| | Actual | Predicted |
|---|---|---|
| 158 | 24 | 96.262621 |
| 83 | 450 | 440.123408 |
| 170 | 1082 | 1201.184893 |
| 101 | 695 | 325.423855 |
| 150 | 6800 | 4964.7571 |
| 199 | 708 | 535.540425 |
| 118 | 38 | 105.105599 |
| 227 | 548 | 457.828381 |
| 63 | 4 | 60.196548 |
| 135 | 503 | 667.007515 |


<b>R2 and explained variance increase substantially for the GA dataset as well (test R2 rises from 0.89 to 0.97).

Violent_Crime had a similar coefficient (about 4.9) in both datasets, so I decided to keep it. Next I decided to add 'Aggravated_Assault', bringing the total to 5 features. </b>

In [2004]:
#adding 'Aggravated_Assault' for 5 total
#Calling everything in this model _6
#set X6 and y6
X6 = data[['Population','Violent_Crime','Murder','Robbery','Aggravated_Assault']]
y6 = data['Property']

#train test split

X6_train, X6_test, y6_train, y6_test = train_test_split(X6, y6, test_size=0.5, random_state=0)

#shape
X6_train.shape, y6_train.shape

X6_test.shape, y6_test.shape

((174, 5), (174,))

In [2005]:

# Fit an ordinary least squares model with scikit-learn.
# Dependent variable: Property. Independent variables: the columns of X6.
regr6 = linear_model.LinearRegression()

rfit6 = regr6.fit(X6_train, y6_train)
rfit6


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [None]:
# Inspect the results.
print('\nCoefficients: \n', regr6.coef_)
print('\nIntercept: \n', regr6.intercept_)
print('\nR-squared:')
print(regr6.score(X6,y6))

In [2007]:
#test it
accuracy6 = regr6.score(X6_test, y6_test)
accuracy6

0.7850669913743296

In [2008]:
#y6_pred

y6_pred=rfit6.predict(X6_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit6.score(X6_test, y6_test)))

R-squared of the linear regression on the test set: 0.79


In [2009]:


print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y6_test, y6_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y6_test, y6_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y6_test, y6_pred), 2 ))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y6_test, y6_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y6_test, y6_pred), 2))

The Explained Variance Score of the NY dataset is: 0.79
The Mean Absolute Error of the NY dataset is: 156.2
The Mean Squared Error of the NY dataset is: 110888.25
The Median Absolute Error of the NY dataset is: 69.44
The R2 Score of the NY dataset is: 0.79


In [2010]:
df6 = pd.DataFrame({'Actual': y6_test, 'Predicted': y6_pred})  
df6.head(10)

| | Actual | Predicted |
|---|---|---|
| 6 | 4090 | 1893.617946 |
| 52 | 1040 | 1416.863365 |
| 269 | 29 | 38.556714 |
| 45 | 1140 | 1655.68867 |
| 294 | 73 | 54.582269 |
| 189 | 23 | 117.146167 |
| 191 | 47 | 115.344694 |
| 116 | 158 | 194.64662 |
| 90 | 77 | 251.456158 |
| 249 | 78 | 135.992961 |


Out of curiosity, I wanted to check the model using every column that contains no null values.

In [2012]:
#null count
data.isnull().sum()

City                     0
Population               0
Violent_Crime            0
Murder                   0
Rape1                  348
Rape2                    0
Robbery                  0
Aggravated_Assault       0
Property                 0
Burglary                 0
Larceny_Theft            0
Motor_Vehicle_Theft      0
Arson3                 161
MurderCat                0
RobberyCat               0
dtype: int64

In [2013]:
#will do a set with all but Rape1 and Arson3 (ignoring murdercat and robberycat)
X7 = data[['Population','Violent_Crime','Murder','Rape2','Robbery','Aggravated_Assault','Burglary','Larceny_Theft','Motor_Vehicle_Theft']]
y7 = data['Property']

#train test split

X7_train, X7_test, y7_train, y7_test = train_test_split(X7, y7, test_size=0.5, random_state=0)

#shape
X7_train.shape, y7_train.shape

X7_test.shape, y7_test.shape

((174, 9), (174,))

In [2014]:
#linear regression
regr7 = linear_model.LinearRegression()

rfit7 = regr7.fit(X7_train, y7_train)
rfit7

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [2015]:
# Inspect the results.
print('\nCoefficients: \n', regr7.coef_)
print('\nIntercept: \n', regr7.intercept_)
print('\nR-squared:')
print(regr7.score(X7,y7))


Coefficients: 
 [-4.53431792e-18  1.88737914e-14  1.22840974e-13 -7.42340217e-14
 -7.28583860e-15 -2.25375274e-14  1.00000000e+00  1.00000000e+00
  1.00000000e+00]

Intercept: 
 2.8421709430404007e-13

R-squared:
1.0


In [2016]:
#test it
accuracy7 = regr7.score(X7_test, y7_test)
accuracy7


1.0

In [2017]:
#y7_pred

y7_pred=rfit7.predict(X7_test)
print('R-squared of the linear regression on the test set: {:.2f}'.format(rfit7.score(X7_test, y7_test)))

R-squared of the linear regression on the test set: 1.00


In [2018]:

print("The Explained Variance Score of the NY dataset is:", round(explained_variance_score(y7_test, y7_pred, multioutput='uniform_average'), 2 ))
print("The Mean Absolute Error of the NY dataset is:", round(mean_absolute_error(y7_test, y7_pred,  sample_weight=None, multioutput='uniform_average'), 2 ))
print("The Mean Squared Error of the NY dataset is:", round(mean_squared_error(y7_test, y7_pred), 2 ))
print("The Median Absolute Error of the NY dataset is:", round(median_absolute_error(y7_test, y7_pred), 2))
print("The R2 Score of the NY dataset is:", round(r2_score(y7_test, y7_pred), 2))

The Explained Variance Score of the NY dataset is: 1.0
The Mean Absolute Error of the NY dataset is: 0.0
The Mean Squared Error of the NY dataset is: 0.0
The Median Absolute Error of the NY dataset is: 0.0
The R2 Score of the NY dataset is: 1.0


In [2019]:
df7 = pd.DataFrame({'Actual': y7_test, 'Predicted': y7_pred})  
df7.head(10)

| | Actual | Predicted |
|---|---|---|
| 6 | 4090 | 4090.0 |
| 52 | 1040 | 1040.0 |
| 269 | 29 | 29.0 |
| 45 | 1140 | 1140.0 |
| 294 | 73 | 73.0 |
| 189 | 23 | 23.0 |
| 191 | 47 | 47.0 |
| 116 | 158 | 158.0 |
| 90 | 77 | 77.0 |
| 249 | 78 | 78.0 |
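
A perfect score like this usually points to leakage rather than a better model. The fitted coefficients are 1.0 on Burglary, Larceny_Theft and Motor_Vehicle_Theft and effectively zero everywhere else, which suggests that Property is simply the sum of those three columns in this dataset. The sketch below shows, on synthetic counts (the `demo` frame is an assumption, not the crime data), how to check for that kind of identity and what a regression on the components looks like.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
parts = ["Burglary", "Larceny_Theft", "Motor_Vehicle_Theft"]

# Synthetic counts; the target is built as the exact sum of its components.
demo = pd.DataFrame({
    "Burglary": rng.integers(0, 500, size=100),
    "Larceny_Theft": rng.integers(0, 2000, size=100),
    "Motor_Vehicle_Theft": rng.integers(0, 200, size=100),
})
demo["Property"] = demo[parts].sum(axis=1)

# Check 1: is the target an exact sum of candidate features?
is_exact_sum = bool((demo[parts].sum(axis=1) == demo["Property"]).all())

# Check 2: a regression on the components recovers coefficients of 1 and R2 of 1.
fit = LinearRegression().fit(demo[parts], demo["Property"])
r2_leak = fit.score(demo[parts], demo["Property"])

print("target is exact sum of parts:", is_exact_sum)
print("coefficients:", np.round(fit.coef_, 6))
print("R2:", round(r2_leak, 6))
```

If the same check holds on the crime data, those three columns should be left out of X, since they would not be available when actually predicting property crime.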
