<h2>Challenge: make your own regression model</h2>

Now that you've spent some time playing with a sample multivariate linear regression model, it's time to make your own.

You've already gotten started by prepping the FBI:UCR Crime dataset (Thinkful mirror) in a previous assignment.

Using this data, build a regression model to predict property crimes. You can use the features you prepared in the previous assignment, new features of your own choosing, or a combination. The goal here is prediction rather than understanding mechanisms, so the focus is on creating a model that explains a lot of variance.

Submit a notebook with your model and a brief writeup of your feature engineering and selection process to review with your mentor.

In [48]:
# Import the modules used in this notebook, plus a few that may be useful later
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import nltk
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [49]:
#importing the crime dataset
data = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/Unit2-SupervisedLearning/master/crimedataframemlr.csv')

In [50]:
data.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0,,0,0,0,12,2,10,0,0.0
1,Addison Town and Village,2577,3,0,,0,0,3,24,3,20,1,0.0
2,Akron Village,2846,3,0,,0,0,3,16,1,15,0,0.0
3,Albany,97956,791,8,,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0,,3,4,16,223,53,165,5,


<b>Data Cleaning</b>

In [51]:
# Count the null values in each column
print(data.isnull().sum())

# Only 'Rape (revised definition)1' and 'Arson3' contain null values.
# Neither column is used in the model, so they are left as-is for now.

City                                        0
Population                                  0
Violent crime                               0
Murder and\nnonnegligent\nmanslaughter      0
Rape\n(revised\ndefinition)1              348
Rape\n(legacy\ndefinition)2                 0
Robbery                                     0
Aggravated\nassault                         0
Property\ncrime                             0
Burglary                                    0
Larceny-\ntheft                             0
Motor\nvehicle\ntheft                       0
Arson3                                    161
dtype: int64


In [52]:
#changing column names
data_cols= ['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2', 
            'Robbery', 'Aggravated_Assault', 'Property', 'Burglary', 
            'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3']
data.columns = data_cols
data.columns

Index(['City', 'Population', 'Violent_Crime', 'Murder', 'Rape1', 'Rape2',
       'Robbery', 'Aggravated_Assault', 'Property', 'Burglary',
       'Larceny_Theft', 'Motor_Vehicle_Theft', 'Arson3'],
      dtype='object')

In [53]:
#checking data types
data.dtypes

City                    object
Population               int64
Violent_Crime            int64
Murder                   int64
Rape1                  float64
Rape2                    int64
Robbery                  int64
Aggravated_Assault       int64
Property                 int64
Burglary                 int64
Larceny_Theft            int64
Motor_Vehicle_Theft      int64
Arson3                 float64
dtype: object

<b>The 'population' variable is already set for you, but you will need to create the last three features. Robbery and Murder are currently continuous variables. For this model, please use these variables to create categorical features where values greater than 0 are coded 1, and values equal to 0 are coded 0. </b>

In [54]:
#Create a population^2 column

data['Population2'] = data['Population']**2
data.head()

Unnamed: 0,City,Population,Violent_Crime,Murder,Rape1,Rape2,Robbery,Aggravated_Assault,Property,Burglary,Larceny_Theft,Motor_Vehicle_Theft,Arson3,Population2
0,Adams Village,1861,0,0,,0,0,0,12,2,10,0,0.0,3463321
1,Addison Town and Village,2577,3,0,,0,0,3,24,3,20,1,0.0,6640929
2,Akron Village,2846,3,0,,0,0,3,16,1,15,0,0.0,8099716
3,Albany,97956,791,8,,30,227,526,4090,705,3243,142,,9595377936
4,Albion Village,6388,23,0,,3,4,16,223,53,165,5,,40806544


In [55]:
#Robbery and Murder are currently continuous variables. 
#For this model, please use these variables to create categorical 
#features where values greater than 0 are coded 1, 
#and values equal to 0 are coded 0.

# Threshold for the binary features: counts below 1 (that is, 0) are coded 0,
# and counts of 1 or more are coded 1
onecat = 1


In [56]:
#Robbery
data['RobberyCat'] = np.where(data['Robbery'] < onecat, 0, 1)

#Murder
data['MurderCat'] = np.where(data['Murder'] < onecat, 0, 1)
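
The same 0/1 coding can also be written as a boolean comparison cast to integers, which avoids the separate threshold variable. This is only a minimal sketch on a small made-up frame: the `toy` DataFrame and its values are illustrative assumptions, not rows taken from the crime dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the crime dataset.
toy = pd.DataFrame({'Robbery': [0, 0, 4, 227], 'Murder': [0, 0, 0, 8]})

# Values greater than 0 become 1; values equal to 0 stay 0.
toy['RobberyCat'] = (toy['Robbery'] > 0).astype(int)
toy['MurderCat'] = (toy['Murder'] > 0).astype(int)

# Sanity check: this matches the np.where form with a threshold of 1.
assert (toy['RobberyCat'] == np.where(toy['Robbery'] < 1, 0, 1)).all()
assert (toy['MurderCat'] == np.where(toy['Murder'] < 1, 0, 1)).all()

print(toy)
```

Both forms give the same result for non-negative integer counts; the comparison form just states the rule ("greater than 0") directly.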

In [57]:
data.head()

Unnamed: 0,City,Population,Violent_Crime,Murder,Rape1,Rape2,Robbery,Aggravated_Assault,Property,Burglary,Larceny_Theft,Motor_Vehicle_Theft,Arson3,Population2,RobberyCat,MurderCat
0,Adams Village,1861,0,0,,0,0,0,12,2,10,0,0.0,3463321,0,0
1,Addison Town and Village,2577,3,0,,0,0,3,24,3,20,1,0.0,6640929,0,0
2,Akron Village,2846,3,0,,0,0,3,16,1,15,0,0.0,8099716,0,0
3,Albany,97956,791,8,,30,227,526,4090,705,3243,142,,9595377936,1,1
4,Albion Village,6388,23,0,,3,4,16,223,53,165,5,,40806544,1,0


In [58]:
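# Set the features (X) and the target (Y).
# Note: this uses the raw Murder and Robbery counts,
# not the binary RobberyCat and MurderCat features created above.
X = data[['Population','Population2','Murder','Robbery']]
Y = data['Property'].values.reshape(-1, 1)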

In [59]:
# Fit a linear regression on the full dataset

regr = linear_model.LinearRegression()
regr.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [60]:
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X,Y))


Coefficients: 
 [[ 1.59234099e-02 -1.01045767e-09  1.17559526e+02  2.09186042e+00]]

Intercept: 
 [24.1435902]

R-squared:
0.9987417422426106


In [61]:
# Split the data into training (70%) and test (30%) sets,
# then fit a linear regression on the training set only

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [62]:
#prediction
yprediction = regressor.predict(X_test)

In [63]:
# Show the first five predictions for the test set
print(yprediction[:5])

[[ 11.83987542]
 [ 10.88218422]
 [176.55202622]
 [568.07875575]
 [ 37.28520096]]
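
Printing a few predictions shows that the model runs, but not how well it predicts. A natural next step is to score the predictions against the held-out targets. The sketch below uses `r2_score` and `mean_squared_error` from `sklearn.metrics` on synthetic data; `X_demo`, `y_demo`, and the other names are assumptions for illustration and are not part of this notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the crime dataset).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 4, size=300)

# Same split settings as the cell above: 30% held out, fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Out-of-sample R-squared and root mean squared error.
test_r2 = r2_score(y_te, y_hat)
test_rmse = np.sqrt(mean_squared_error(y_te, y_hat))

print('Test R-squared:', test_r2)
print('Test RMSE:', test_rmse)
```

In this notebook the equivalent calls would presumably take `y_test` and `yprediction` as arguments.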
