In [1]:
import numpy as np
import pandas as pd
import math
from matplotlib import pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")



Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

1. Vanilla logistic regression
2. Ridge logistic regression
3. Lasso logistic regression

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

## Cleaning and Normalizing the Data.

In [2]:
# Access the data file from the FBI: UCR 
dataset = pd.read_excel("NYCCrime.xls", header=4)

In [3]:
# Change the dataset into a DataFrame
data = pd.DataFrame(dataset)

In [4]:
# Rename the group columns
data.columns = ['City', 'Population', 'Violent_crime', 'Murder_manslaughter', 'Rape1', 'Rape', 'Robbery',
                'Aggravated_assault', 'Property_crime', 'Burglary', 'Theft', 'Vehicle_theft', 'Arson']

In [5]:
# Drop the unnecessary columns
data = data.drop(['Rape1', 'City'], axis=1)

In [6]:
# Drop the last three rows with null values
data = data.drop(data.index[348:])

In [7]:
# Change the Arson null values to 0. 
data.Arson = np.nan_to_num(data.Arson)

In [8]:
data_group = data

In [9]:
# Function to remove outlier data
def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
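To see what this filter does, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the FBI data). Values more than m standard deviations from the mean are dropped, so re-aligning the result to the original index leaves NaN in the rejected rows.

```python
import numpy as np
import pandas as pd

def reject_outliers(data, m=2):
    # Keep only values within m standard deviations of the mean
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# Hypothetical toy values: one extreme point among small ones
toy = pd.Series([10, 12, 11, 13, 9, 500])
kept = reject_outliers(toy, m=2)

# Re-aligning to the original index leaves NaN where a value was rejected
aligned = kept.reindex(toy.index)
print(kept.tolist())
print(aligned.isna().sum())
```

This is why the filtering step is followed by a dropna(): any row with an outlier in at least one column ends up with a NaN and is removed.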

In [10]:
# Filter the continuous variables through the outlier removal function and then drop the null values. 
for group in data_group.loc[:, 'Population':]:
    data_group[group] = reject_outliers(data_group[group], m=2)
# .copy() gives an independent DataFrame, so the assignment in the next cell does not write to a slice.
data_group = data_group.dropna().copy()

In [11]:
# Recode Property_crime column to binary.
data_group['Property_crime'] = np.where(data_group['Property_crime']>=112, 1, 0)


In [12]:
# Independent variables
X = data_group.drop('Property_crime', axis=1, inplace=False)
# Dependent variable
Y = data_group['Property_crime']
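The features here sit on very different scales (population versus counts of rare crimes), and a regularization penalty treats all coefficients alike, so standardizing the inputs is a common preparatory step. A minimal sketch, assuming scikit-learn's StandardScaler and make_pipeline and using synthetic stand-in data (X_demo and y_demo are hypothetical, not the crime dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic stand-in: two features on very different scales
X_demo = np.column_stack([rng.normal(50000, 20000, 200), rng.normal(5, 2, 200)])
y_demo = (X_demo[:, 1] > 5).astype(int)

# The scaler lives inside the pipeline, so cross-validation would refit it per fold
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X_demo, y_demo)

# After scaling, each column has mean 0 and standard deviation 1
scaled = model.named_steps['standardscaler'].transform(X_demo)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```

Passing a pipeline like this to cross_val_score keeps the scaling statistics from leaking between training and test folds.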

## Vanilla logistic regression

In [13]:
# Declare a logistic regression classifier. A very large C makes the default L2 penalty negligible.
lr = LogisticRegression(C=1e20)
# Fit the variables to the logistic model.
fit = lr.fit(X, Y)

In [14]:
print('\nCoefficients: \n', fit.coef_)
print('\nIntercept: \n', fit.intercept_)

logistic_pred_y = lr.predict(X)
print('\n Logistic Accuracy by Property Crime')
print(pd.crosstab(logistic_pred_y, Y))

print('\n Percentage accuracy')
print(lr.score(X, Y))

# For a classifier, cross_val_score reports accuracy by default, not R2.
score = cross_val_score(fit, X, Y, cv=5)
print('\nEach cross-validated accuracy score: \n', score)
print("\nOverall Logistic Regression accuracy: %0.2f (+/- %0.2f)\n" % (score.mean(), score.std() * 2))


Coefficients: 
 [[ -1.06272199e-04   3.07708244e-02   2.63299447e-03   4.63583878e-03
    2.39181562e-02  -4.16164977e-04  -8.39111576e-03   1.58498936e-02
    6.32665645e-03  -3.57276926e-05]]

Intercept: 
 [-0.04937526]

 Logistic Accuracy by Property Crime
Property_crime   0    1
row_0                  
0               83    4
1               89  167

 Percentage accuracy
0.728862973761

Each cross-validated accuracy score: 
 [ 0.9         0.88405797  0.95588235  0.72058824  0.69117647]

Overall Logistic Regression accuracy: 0.83 (+/- 0.21)



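For the Ridge and Lasso models, how do I know which regularization strength to use? In scikit-learn's LogisticRegression the penalty is controlled by C, the inverse of lambda, so a smaller C means stronger regularization. For each model I'm going to iterate through different values of C to see which produces the best outcome.

In [15]:
# Set the different values of C (inverse regularization strength) to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20, 1e2, 1e3, 1e4, 1e8, 1e10, 1e15]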
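Because the C argument of LogisticRegression is the inverse of the penalty strength, the small values at the start of this list mean heavy regularization and the large values at the end mean almost none. A small sketch on synthetic data (make_classification, not the crime data) shows the L2-penalized coefficients shrinking toward zero as C decreases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

norms = {}
for C in [1e-4, 1e-2, 1, 1e2]:
    # The default penalty is L2; liblinear is chosen here as an assumption
    clf = LogisticRegression(C=C, solver='liblinear').fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector as a measure of shrinkage
    norms[C] = np.linalg.norm(clf.coef_)
    print(C, round(norms[C], 4))
```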

## Ridge logistic regression

In [19]:
# List for the training accuracy of the Ridge model at each C
Ridge_list = []

# Iterate through each C value (C is the inverse of the regularization strength)
for i in alpha_ridge:
    # Declare a Ridge (L2-penalized) logistic regression classifier.
    ridge = LogisticRegression(penalty='l2', C=i)
    # Fit the variables to the logistic model.
    ridgefit = ridge.fit(X, Y)
    # score() returns mean accuracy for a classifier; append it for each iteration
    Ridge_list.append(ridgefit.score(X, Y))

In [20]:
Ridge_list

[0.49854227405247814,
 0.49854227405247814,
 0.49854227405247814,
 0.67930029154518945,
 0.72011661807580174,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329,
 0.7288629737609329]

From the list above, I can see that the Ridge training accuracy stops improving at C = 1e-2 (the values are accuracies, not R2, and C is the inverse of lambda). Now I am going to print out the coefficients, intercept, accuracy and cross-validation scores of the model.
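The scores in that list are computed on the same rows the model was fit on, so they cannot reveal overfitting. One alternative, sketched here on synthetic data with a hypothetical grid (C_grid, X_demo and y_demo are illustrative names), is to compare the mean cross-validated accuracy for each C:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     random_state=1)

C_grid = [1e-4, 1e-2, 1, 1e2]  # hypothetical grid of inverse penalty strengths
cv_means = []
for C in C_grid:
    clf = LogisticRegression(C=C, solver='liblinear')
    # Mean accuracy over 5 held-out folds
    cv_means.append(cross_val_score(clf, X_demo, y_demo, cv=5).mean())

# Pick the C with the highest held-out accuracy
best_C = C_grid[int(np.argmax(cv_means))]
print(dict(zip(C_grid, np.round(cv_means, 3))), best_C)
```

scikit-learn also provides LogisticRegressionCV, which runs a search of this kind internally.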

In [25]:
# Declare a logistic regression classifier.
ridge = LogisticRegression(penalty='l2', C=1e-2)
# Fit the variables to the logistic model.
ridgefit = ridge.fit(X, Y)

In [26]:
print('\nRidge Coefficients: \n', ridgefit.coef_)
print('\nRidge Intercept: \n', ridgefit.intercept_)

ridge_pred_y = ridgefit.predict(X)
print('\n Ridge Logistic Accuracy by Property Crime')
print(pd.crosstab(ridge_pred_y, Y))

print('\n Percentage accuracy')
print(ridgefit.score(X, Y))

# For a classifier, cross_val_score reports accuracy by default, not R2.
score = cross_val_score(ridgefit, X, Y, cv=5)
print('\nEach cross-validated accuracy score: \n', score)
print("\nOverall Ridge Logistic Regression accuracy: %0.2f (+/- %0.2f)\n" % (score.mean(), score.std() * 2))


Ridge Coefficients: 
 [[ -1.05910349e-04   2.99280384e-02   2.56471122e-03   4.51514966e-03
    2.32888548e-02  -4.40677249e-04  -8.09462632e-03   1.58302213e-02
    6.15991498e-03  -3.42880697e-05]]

Ridge Intercept: 
 [-0.04810701]

 Ridge Logistic Accuracy by Property Crime
Property_crime   0    1
row_0                  
0               83    4
1               89  167

 Percentage accuracy
0.728862973761

Each cross-validated accuracy score: 
 [ 0.81428571  0.72463768  0.77941176  0.72058824  0.69117647]

Overall Ridge Logistic Regression accuracy: 0.75 (+/- 0.09)



## Lasso logistic regression

In [30]:
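# List for the training accuracy of the Lasso model at each C
Lasso_list = []

# Iterate through each C value (C is the inverse of the regularization strength)
for i in alpha_ridge:
    # Declare a Lasso (L1-penalized) logistic regression classifier.
    # solver='liblinear' supports the L1 penalty; the lbfgs default in newer scikit-learn does not.
    lasso = LogisticRegression(penalty='l1', C=i, solver='liblinear')
    # Fit the variables to the logistic model.
    lassofit = lasso.fit(X, Y)
    # score() returns mean accuracy for a classifier; append it for each iteration
    Lasso_list.append(lassofit.score(X, Y))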

In [31]:
Lasso_list

[0.50145772594752192,
 0.50145772594752192,
 0.50145772594752192,
 0.49854227405247814,
 0.63265306122448983,
 0.67346938775510201,
 0.99125364431486884,
 0.98833819241982512,
 0.99416909620991256,
 0.99125364431486884,
 0.98833819241982512,
 0.99125364431486884,
 0.99125364431486884,
 0.99125364431486884,
 0.99125364431486884,
 0.99125364431486884]

From the list above, I can see that the Lasso training accuracy stops improving around C = 10 (the values are accuracies, not R2, and C is the inverse of lambda). Now I am going to print out the coefficients, intercept, accuracy and cross-validation scores of the model.
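What distinguishes the L1 penalty is that it can set coefficients exactly to zero. A sketch on synthetic data, assuming the liblinear solver (one of the scikit-learn solvers that accepts penalty='l1'), counts the nonzero coefficients at a few values of C:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)

nonzero = {}
for C in [1e-3, 1e-1, 10]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X_demo, y_demo)
    # Number of features the model actually uses at this penalty strength
    nonzero[C] = int(np.count_nonzero(clf.coef_))
    print(C, nonzero[C])
```

Inspecting which coefficients survive at a moderate C is one way to use the Lasso model for feature selection.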

In [32]:
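# Declare an L1-penalized logistic regression classifier.
# solver='liblinear' supports the L1 penalty; the lbfgs default in newer scikit-learn does not.
lasso = LogisticRegression(penalty='l1', C=10, solver='liblinear')
# Fit the variables to the logistic model.
lassofit = lasso.fit(X, Y)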

In [33]:
print('\nLasso Coefficients: \n', lassofit.coef_)
print('\nLasso Intercept: \n', lassofit.intercept_)

lasso_pred_y = lassofit.predict(X)
print('\n Lasso Logistic Accuracy by Property Crime')
print(pd.crosstab(lasso_pred_y, Y))

print('\n Percentage accuracy')
print(lassofit.score(X, Y))

# For a classifier, cross_val_score reports accuracy by default, not R2.
score = cross_val_score(lassofit, X, Y, cv=5)
print('\nEach cross-validated accuracy score: \n', score)
print("\nOverall Lasso Logistic Regression accuracy: %0.2f (+/- %0.2f)\n" % (score.mean(), score.std() * 2))


Lasso Coefficients: 
 [[ -1.50888520e-04   2.24946254e-02   1.07153502e+00  -1.92452802e-01
    3.85397369e-02   2.26731147e-02   1.96318673e-01   1.39286617e-01
    8.37299493e-02  -2.78537072e-01]]

Lasso Intercept: 
 [-15.64517878]

 Lasso Logistic Accuracy by Property Crime
Property_crime    0    1
row_0                   
0               171    2
1                 1  169

 Percentage accuracy
0.991253644315

Each cross-validated accuracy score: 
 [ 0.98571429  0.97101449  0.95588235  0.97058824  0.97058824]

Overall Lasso Logistic Regression accuracy: 0.97 (+/- 0.02)



## Evaluation of models

The vanilla logistic regression model had a training accuracy of 0.73 and a mean cross-validated accuracy of 0.83, but its fold scores were inconsistent (0.69 to 0.96) and it produced many false positives (89, versus 4 false negatives). The Ridge model offered no improvement: at C = 1e-2 its training accuracy and confusion table are identical to the vanilla model's, and its mean cross-validated accuracy was the lowest of the three at 0.75. Finally, the Lasso model performed best, with the highest cross-validated accuracy (0.97, +/- 0.02). It misclassified only 3 of 343 observations, and its errors were not skewed toward either class.
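The difference in error balance can be made concrete from the crosstabs printed above, where rows are predictions and columns are actual classes. A short calculation using only those counts (summarize is a hypothetical helper written for this illustration):

```python
def summarize(tn, fn, fp, tp):
    # Standard confusion-matrix summaries for the positive class
    total = tn + fn + fp + tp
    return {
        'accuracy': (tn + tp) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Counts read from the vanilla (and identical Ridge) crosstab
vanilla = summarize(tn=83, fn=4, fp=89, tp=167)
# Counts read from the Lasso crosstab
lasso = summarize(tn=171, fn=2, fp=1, tp=169)

print(vanilla)
print(lasso)
```

For the vanilla model, precision is 167/256 (about 0.65) while recall is 167/171 (about 0.98): it catches nearly every positive case but labels many negatives as positive. For the Lasso model both values are about 0.99.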