## Colorado Crime Data: Logistic, Lasso and Ridge Regressions:

> This dataset comes from the same FBI site as the New York crime data and covers the same measures for Colorado.  Below, we investigate how logistic, lasso and ridge regressions behave on it.

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

In [37]:
col_list = ['Population', 'Violent_Crime', 'Murder', 'Rape', 'Rape_2', 'Robbery', 
            'Assault', 'Property_Crime', 'Burglary', 'Larceny', 'MV_Theft', 'Arson']

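# drop() takes the axis by keyword; the positional form drop('Rape_2', 1) fails on newer pandas.
data = pd.read_excel('Colorado_2013_Crime.xls', names = col_list, header = 3, 
                     index_col = 0, skiprows = [0], skipfooter = 2).drop(columns = 'Rape_2')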

In [38]:
# A bit of feature engineering:

# Making a Binary Violent Crime Column:
data.loc[data['Violent_Crime'] == 0, 'Binary_Violent_Crime'] = 0
data.loc[data['Violent_Crime'] > 0, 'Binary_Violent_Crime'] = 1

# Same for Larceny and Murder:
data.loc[data['Larceny'] == 0, 'Binary_Larceny'] = 0
data.loc[data['Larceny'] > 0, 'Binary_Larceny'] = 1

data.loc[data['Murder'] == 0, 'Binary_Murder'] = 0
data.loc[data['Murder'] > 0, 'Binary_Murder'] = 1
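
> The paired `.loc` assignments above can also be written as one comparison per column. A minimal sketch on a small made-up frame (`toy` and its counts are invented for illustration, not taken from the FBI table):

```python
import pandas as pd

# Toy stand-in for the crime table; these counts are invented for illustration.
toy = pd.DataFrame({'Violent_Crime': [0, 3, 12],
                    'Larceny': [5, 0, 40],
                    'Murder': [0, 0, 1]})

# A comparison yields booleans; astype(int) maps False/True to 0/1.
for col in ['Violent_Crime', 'Larceny', 'Murder']:
    toy['Binary_' + col] = (toy[col] > 0).astype(int)

print(toy)
```

> One difference to keep in mind: the `.loc` version leaves any row that matches neither condition as NaN, whereas here a missing count compares as False and becomes 0.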

In [39]:
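# Adding some per-capita and per-Larceny columns.
# Later cells read these columns from `data` as well, so `data_modified` is bound
# to the same frame here; pd.DataFrame(data) does not reliably share new columns
# with `data` across pandas versions.

data_modified = data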

data_modified['Larceny/Capita'] = data_modified.Larceny / data_modified.Population
data_modified['Assault/Capita'] = data_modified.Assault / data_modified.Population
data_modified['Violent_Crime/Capita'] = data_modified.Violent_Crime / data_modified.Population
data_modified['Murder/Capita'] = data_modified.Murder / data_modified.Population

data_modified['Violent_Crime/Larceny'] = data_modified.Violent_Crime / data_modified.Larceny

In [45]:
# Removing NaN and Inf values:
data.loc[data['Violent_Crime/Larceny'] == np.inf] = np.nan
data = data.dropna(axis = 0, how = 'any')

In [46]:

data_modified.loc[data_modified['Violent_Crime/Larceny'] == np.inf] = np.nan
data_modified = data_modified.dropna(axis = 0, how = 'any')

# Double-checking for cleanliness (NaN never compares equal to itself, so use isna()):
# data_modified.isna().sum()
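
> Why the clean-up is needed: dividing by a Larceny count of zero gives `inf` when the numerator is positive and NaN when it is also zero. A minimal sketch on invented numbers, using `replace` plus `dropna` as one possible way to remove both (this is a variation on the row-wise `.loc` assignment above, not the same code):

```python
import numpy as np
import pandas as pd

# Invented counts: a normal row, a positive/0 row, and a 0/0 row.
toy = pd.DataFrame({'Violent_Crime': [4, 2, 0], 'Larceny': [8, 0, 0]})
toy['Violent_Crime/Larceny'] = toy.Violent_Crime / toy.Larceny  # 0.5, inf, NaN

# Turn infinities into NaN, then drop every row that contains a NaN.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')

# Count what is left; isna() is used because `== np.nan` is always False.
remaining_nan = int(clean.isna().sum().sum())
remaining_inf = int(np.isinf(clean.to_numpy()).sum())

print(clean)
print(remaining_nan, remaining_inf)
```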

#### Variables and First Model (Regular Logistic Regression):

> With the feature engineering done, we can define our input data and target variables, and then define and fit our first models:

In [68]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score

In [70]:
# Defining our input and target variables (or at least one of them for now) as well as our train/test sets:

x = data.drop(labels = ['Rape', 'Binary_Violent_Crime', 'Assault/Capita', 'Violent_Crime/Capita',
       'Murder/Capita', 'Violent_Crime/Larceny'], axis = 1)

y = data.Binary_Violent_Crime

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = .2)

###### Logistic Regression for Practice:

In [71]:
log_regr1 = linear_model.LogisticRegression()

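# Fit on the training split only; refitting on the test split would overwrite
# this model and leave nothing held out for evaluation.
log_regr1.fit(X_train, y_train)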

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
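
> A fitted model is usually judged on data it has not seen. The sketch below shows one way to do that, with held-out accuracy from `score` and a 5-fold `cross_val_score` on the training split. It runs on synthetic data from `make_classification`, not on the Colorado table, and the names `X_demo`, `y_demo`, `X_tr`, `X_te` and `model` are placeholders chosen for this example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the crime features; not the Colorado data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Same settings as the repr above: L2 penalty, C=1.0, liblinear solver.
model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
model.fit(X_tr, y_tr)

# Accuracy on the held-out rows, and 5-fold cross-validated accuracy on the training rows.
test_accuracy = model.score(X_te, y_te)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

print('held-out accuracy:', test_accuracy)
print('5-fold CV accuracy:', cv_scores.mean())
```

> Note that `penalty='l2'` in the output above means the default logistic regression is already ridge-style regularized; passing `penalty='l1'` with the `liblinear` solver gives the lasso-style counterpart.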