## Colorado Crime Data: Logistic, Lasso and Ridge Regressions:

> This dataset is from the same FBI site where the New York Crime data was taken from and represents the same data from Colorado.  Below, we investigate how Logistic, Lasso and Ridge Regressions work with the data.

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

In [37]:
col_list = ['Population', 'Violent_Crime', 'Murder', 'Rape', 'Rape_2', 'Robbery', 
            'Assault', 'Property_Crime', 'Burglary', 'Larceny', 'MV_Theft', 'Arson']

data = pd.read_excel('Colorado_2013_Crime.xls', names = col_list, header = 3, 
                     index_col = 0, skiprows = [0], skipfooter = 2).drop('Rape_2', 1)

In [38]:
# A bit of feature engineering:

# Making a Binary Violent Crime Column:
data.loc[data['Violent_Crime'] == 0, 'Binary_Violent_Crime'] = 0
data.loc[data['Violent_Crime'] > 0, 'Binary_Violent_Crime'] = 1

#Same for Murder and Larceny:
data.loc[data['Larceny'] == 0, 'Binary_Larceny'] = 0
data.loc[data['Larceny'] > 0, 'Binary_Larceny'] = 1

data.loc[data['Murder'] == 0, 'Binary_Murder'] = 0
data.loc[data['Murder'] > 0, 'Binary_Murder'] = 1

In [39]:
# Doing some per-capita and per-Larceny columns in a new dataframe:

data_modified = pd.DataFrame(data)

data_modified['Larceny/Capita'] = data_modified.Larceny / data_modified.Population
data_modified['Assault/Capita'] = data_modified.Assault / data_modified.Population
data_modified['Violent_Crime/Capita'] = data_modified.Violent_Crime / data_modified.Population
data_modified['Murder/Capita'] = data_modified.Murder / data_modified.Population

data_modified['Violent_Crime/Larceny'] = data_modified.Violent_Crime / data_modified.Larceny

In [45]:
# Removing NaN and Inf values:
data.loc[data['Violent_Crime/Larceny'] == np.inf] = np.nan
data = data.dropna(axis = 0, how = 'any')

In [46]:

data_modified.loc[data_modified['Violent_Crime/Larceny'] == np.inf] = np.nan
data_modified = data_modified.dropna(axis = 0, how = 'any')

# Double checking for cleanliness:
# data_modified.where(cond = data_modified.values == np.nan).count()

#### Variables and Models:

> After doing a bit of feature engineering, now we can start by defining our input data and our target variables - as well as defining and fitting our first models:

In [68]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score

In [95]:
# Defining our input and target variables (or at least one of them for now) as well as our train/test sets:

x = data.drop('Binary_Violent_Crime', 1)

y = data.Binary_Violent_Crime

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = .2)

###### Logistic Regression:

In [96]:
log_regr1 = linear_model.LogisticRegression()

log_regr1.fit(X_train, y_train)
log_regr1.fit(X_test, y_test)

log1_train_score = log_regr1.score(X_train, y_train)
log1_test_score = log_regr1.score(X_test, y_test)

In [None]:
# Having trouble plotting this:

sns.regplot(logistic = True)

In [105]:
# Super over-fitting so far:

print("The training score is: {}".format(log1_train_score))
print("The test score is: {}".format(log1_test_score))
print("The coefficients are:\n{}".format(log_regr1.coef_))

The training score is: 0.9603960396039604
The test score is: 0.9615384615384616
The coefficients are:
[[ 1.96607564e-04  4.81992893e-01  9.54810693e-03  9.25295584e-02
   8.02894512e-02  2.99625777e-01  5.67619266e-02 -1.29456571e-01
  -7.08045793e-02  2.57023077e-01  5.26381408e-03 -5.88603566e-02
   9.54404438e-03 -4.35454438e-04  1.28442031e-04  2.53640385e-04
   1.79449678e-06  3.06817878e-02]]


###### Ridge Regression:

In [130]:
# Coefficients above are pretty large - trying to get them a bit smaller:

ridge_regr1 = linear_model.LogisticRegression(penalty = 'l2')

ridge_regr1.fit(X_train, y_train)
ridge_regr1.fit(X_test, y_test)

ridge1_train_score = ridge_regr1.score(X_train, y_train)
ridge1_test_score = ridge_regr1.score(X_test, y_test)

In [131]:
print("The training score is: {}".format(ridge1_train_score))
print("The test score is: {}".format(ridge1_test_score))
print("The coefficients are:\n{}".format(ridge_regr1.coef_))

The training score is: 0.9603960396039604
The test score is: 0.9615384615384616
The coefficients are:
[[ 1.96607565e-04  4.81992894e-01  9.54810696e-03  9.25295587e-02
   8.02894514e-02  2.99625777e-01  5.67619266e-02 -1.29456571e-01
  -7.08045795e-02  2.57023077e-01  5.26381409e-03 -5.88603568e-02
   9.54404442e-03 -4.35454439e-04  1.28442031e-04  2.53640386e-04
   1.79449679e-06  3.06817878e-02]]


###### Lasso Regression:

In [135]:
# Coefficients for the first two models are the same - seeing if Lasso/L1 gets us anything different:

lasso_regr1 = linear_model.LogisticRegression(penalty = 'l1', max_iter = 400)

lasso_regr1.fit(X_train, y_train)
lasso_regr1.fit(X_test, y_test)

lasso1_train_score = lasso_regr1.score(X_train, y_train)
lasso1_test_score = lasso_regr1.score(X_test, y_test)



In [137]:
# This looks much better:

print("The training score is: {}".format(lasso1_train_score))
print("The test score is: {}".format(lasso1_test_score))
print("The coefficients are:\n{}".format(lasso_regr1.coef_))

The training score is: 0.9306930693069307
The test score is: 0.8461538461538461
The coefficients are:
[[ 0.00027429  0.07360446  0.          0.          0.04515929  0.07867417
   0.00129802 -0.01071891 -0.00483718  0.02492427  0.          0.
   0.          0.          0.          0.          0.          0.        ]]
