# Comparing Logistic Regression Models:
Vanilla Logistic Regression <br>
Ridge Logistic Regression <br>
Lasso Logistic Regression

# Task:
1. Pick a dataset with binary outcome and potential for at least 15 features. <br>
2. Engineer your features, then create 3 models. <br>
3. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). <br>
4. Evaluate all 3 models and decide on the best. Be clear on decisions that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the 3. <br>
5. Reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but wish you could have done?

[California - Offenses Known to Law Enforcement](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_california_by_city_2013.xls)

In [1]:
# Import modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm

# Aesthetics.
%matplotlib inline
sns.set_style('white')

In [2]:
# Load data.
cal_crime = pd.read_csv('~/src/data/unit3/cal-crime-2013.csv')
print(cal_crime.shape)
cal_crime.head()

(462, 13)


Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson
0,Adelanto,31165,198,2,,15,52,129,886,381,372,133,17
1,Agoura Hills,20762,19,0,,2,10,7,306,109,185,12,7
2,Alameda,76206,158,0,,10,85,63,1902,287,1285,330,17
3,Albany,19104,29,0,,1,24,4,557,94,388,75,7
4,Alhambra,84710,163,1,,9,81,72,1774,344,1196,234,7


# Data cleaning

In [3]:
cal_crime.columns = ['city', 'population', 'violent_crime', 'murder', 'rape_1', 'rape_2', 'robbery',
                     'agg_assault', 'property_crime', 'burglary', 'larceny',
                     'motor_theft', 'arson']
cal_crime.tail()

Unnamed: 0,city,population,violent_crime,murder,rape_1,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson
457,Yountville,2969,1,0,,1,0,0,57,17,34,6,0
458,Yreka,7639,49,1,,2,2,44,278,71,193,14,2
459,Yuba City,65133,174,2,,15,39,118,1980,438,1210,332,16
460,Yucaipa,52524,107,0,,7,31,69,926,262,534,130,13
461,Yucca Valley,21214,86,3,,7,15,61,429,141,234,54,2


In [4]:
cal_crime.isnull().sum()

city                0
population          0
violent_crime       0
murder              0
rape_1            462
rape_2              0
robbery             0
agg_assault         0
property_crime      0
burglary            0
larceny             0
motor_theft         0
arson               0
dtype: int64

In [5]:
cal_crime = cal_crime.drop(['rape_1'], axis=1)

In [6]:
# Log population
cal_crime['population'] = np.log(cal_crime['population'])

# Create a binary target feature

In [7]:
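# Sum the crime columns, then divide by logged population.
# Note: violent_crime and property_crime are already totals of their component
# columns, so this sum counts those components twice.
cal_crime['crime_sum'] = cal_crime.iloc[:, 2:].sum(axis=1)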
cal_crime['crimes_by_log_pop'] = cal_crime['crime_sum'] / cal_crime['population']
cal_crime.head()

Unnamed: 0,city,population,violent_crime,murder,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson,crime_sum,crimes_by_log_pop
0,Adelanto,10.347051,198,2,15,52,129,886,381,372,133,17,2185,211.171281
1,Agoura Hills,9.94088,19,0,2,10,7,306,109,185,12,7,657,66.090731
2,Alameda,11.241195,158,0,10,85,63,1902,287,1285,330,17,4137,368.021356
3,Albany,9.857653,29,0,1,24,4,557,94,388,75,7,1179,119.602506
4,Alhambra,11.346989,163,1,9,81,72,1774,344,1196,234,7,3881,342.029064
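The rows above suggest that `violent_crime` and `property_crime` are already totals of their component columns (for Adelanto, 2 + 15 + 52 + 129 = 198 and 381 + 372 + 133 = 886). A minimal sketch of the difference, using two rows copied from the output above; `total_offenses` is a hypothetical alternative and is not what the rest of this notebook uses:

```python
import pandas as pd

# Two rows copied from the cal_crime.head() output above.
sample = pd.DataFrame({
    'city': ['Adelanto', 'Agoura Hills'],
    'violent_crime': [198, 19],
    'murder': [2, 0],
    'rape_2': [15, 2],
    'robbery': [52, 10],
    'agg_assault': [129, 7],
    'property_crime': [886, 306],
    'burglary': [381, 109],
    'larceny': [372, 185],
    'motor_theft': [133, 12],
    'arson': [17, 7],
})

# Summing every crime column, as in cell 7, reproduces crime_sum.
all_columns_sum = sample.drop(columns='city').sum(axis=1)

# Hypothetical alternative: add only the two aggregate columns plus arson,
# so that each offense is counted once.
total_offenses = sample[['violent_crime', 'property_crime', 'arson']].sum(axis=1)

print(all_columns_sum.tolist())
print(total_offenses.tolist())
```

Either total could be divided by logged population; the choice changes which cities land above the threshold, so the results further down would need to be rerun.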


In [8]:
cal_crime.describe()

Unnamed: 0,population,violent_crime,murder,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson,crime_sum,crimes_by_log_pop
count,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0
mean,10.248111,269.692641,3.030303,13.125541,103.971861,149.564935,1883.084416,412.158009,1168.404762,302.521645,13.426407,4318.980519,360.118917
std,1.358883,1004.375521,13.711276,43.273952,493.353465,476.014021,5427.650545,1026.459128,3512.717199,982.348315,70.413206,12872.690813,902.075346
min,4.744932,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,0.0,9.0,1.367648
25%,9.409018,27.0,0.0,2.0,5.0,17.0,281.0,75.5,161.25,22.0,1.0,642.5,67.656225
50%,10.374733,77.5,0.0,5.0,21.0,51.0,721.0,175.0,444.5,77.0,3.0,1627.0,156.198614
75%,11.183263,200.75,2.0,12.0,65.75,120.0,1879.25,401.25,1194.5,250.5,10.0,4279.5,387.558765
max,15.171017,16524.0,251.0,764.0,7885.0,7624.0,85844.0,15728.0,55734.0,14382.0,1430.0,206166.0,13589.464657


In [9]:
# 360.118917 is the mean of crimes_by_log_pop (see describe() above).
# Cities above the mean are labeled 1, all others 0.
cal_crime['crime_cat'] = (cal_crime['crimes_by_log_pop'] > 360.118917).astype(int)

In [10]:
# Candidate engineered features (currently commented out, so none are added).
#cal_crime['population_squared'] = cal_crime['population'] * cal_crime['population']
#cal_crime['murder_cat'] = cal_crime['murder'].apply(lambda x: 1 if x > 0 else 0)
#cal_crime['robbery_cat'] = cal_crime['robbery'].apply(lambda x: 1 if x > 0 else 0)
#cal_crime['arson_cat'] = cal_crime['arson'].apply(lambda x: 1 if x > 0 else 0)
#cal_crime['violent_crime_cat'] = cal_crime['violent_crime'].apply(lambda x: 1 if x > 0 else 0)
print(cal_crime.shape)

(462, 15)


In [11]:
print(cal_crime['crime_cat'].sum())

124
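With 124 positive cities out of 462, the classes are imbalanced, so accuracy needs a reference point. A quick sketch of the accuracy a model would get by always predicting the majority class, using only the two counts printed above:

```python
n_cities = 462      # rows in cal_crime
n_positive = 124    # cal_crime['crime_cat'].sum()

n_negative = n_cities - n_positive

# Accuracy of always predicting the more common class.
baseline_accuracy = max(n_positive, n_negative) / n_cities

print(f'Positive share: {n_positive / n_cities:.3f}')
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any accuracy reported for the models is best read against this baseline rather than against 0.5.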


In [12]:
cal_crime = cal_crime.drop(['crime_sum', 'crimes_by_log_pop'], axis=1)
cal_crime.head()

Unnamed: 0,city,population,violent_crime,murder,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson,crime_cat
0,Adelanto,10.347051,198,2,15,52,129,886,381,372,133,17,0
1,Agoura Hills,9.94088,19,0,2,10,7,306,109,185,12,7,0
2,Alameda,11.241195,158,0,10,85,63,1902,287,1285,330,17,1
3,Albany,9.857653,29,0,1,24,4,557,94,388,75,7,0
4,Alhambra,11.346989,163,1,9,81,72,1774,344,1196,234,7,0


In [13]:
# Standardize the numeric features (zero mean, unit variance).
# Note: the scaler is fit on all rows, before the train/test split.
crime_categorical = cal_crime['crime_cat']
numeric = cal_crime.drop(['city', 'crime_cat'], axis=1)
cal_crime = pd.DataFrame(StandardScaler().fit_transform(numeric), columns=numeric.columns)

In [14]:
df = cal_crime
df['crime_cat'] = crime_categorical
df.head()

Unnamed: 0,population,violent_crime,murder,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson,crime_cat
0,0.072888,-0.071458,-0.075224,0.043363,-0.105458,-0.043249,-0.183904,-0.030388,-0.226966,-0.172755,0.050807,0
1,-0.226336,-0.249871,-0.221248,-0.257374,-0.190682,-0.299822,-0.29088,-0.295664,-0.280259,-0.296063,-0.091366,0
2,0.731601,-0.111327,-0.221248,-0.072305,-0.038497,-0.182051,0.003489,-0.122064,0.033228,0.028002,0.050807,1
3,-0.287649,-0.239904,-0.221248,-0.280508,-0.162274,-0.306131,-0.244585,-0.310293,-0.222406,-0.231861,-0.091366,0
4,0.809539,-0.106343,-0.148236,-0.095439,-0.046613,-0.163123,-0.02012,-0.066473,0.007864,-0.069829,-0.091366,0
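Cell 13 fits the scaler on all 462 rows before the split, so the test rows influence the means and variances used for scaling. One way to avoid that, sketched here on synthetic stand-in data (not the crime data), is to put the scaler and the classifier in a scikit-learn pipeline so the scaler only ever sees training rows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crime features (assumption: not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The scaler is fit inside the pipeline, on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# At scoring time the test rows are scaled with the training statistics.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The same pipeline object can be passed to `cross_val_score`, which then refits the scaler inside each fold.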


# Target & Features

In [15]:
# Pseudocode
# Y = df['target']                  .... Y is the target
# X = df.drop(['target'], axis=1)   .... X is the features

In [16]:
target = df['crime_cat']
features = df.drop(['crime_cat'], axis=1)
train, test = train_test_split(df, test_size=0.25, random_state=42)

feature_cols = features.columns

X_test = test[feature_cols]
Y_test = test['crime_cat']
X_train = train[feature_cols]
Y_train = train['crime_cat']
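The split above is purely random, so the share of positive cities can differ between the training and test sets. A small sketch of a stratified split, using placeholder data with the same class counts as `crime_cat` (124 ones, 338 zeros); the `stratify` argument is the only change from the call above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as crime_cat.
y = np.array([1] * 124 + [0] * 338)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Set sizes, then the share of positives in each set.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```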

# Vanilla Logistic Regression

In [17]:
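# Declare a logistic regression classifier.
# A very large C makes the default l2 penalty negligible, so this is effectively unregularized.
# solver='liblinear' is set explicitly; it was the default in older scikit-learn releases.
lr = LogisticRegression(C=1e9, solver='liblinear')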

# Fit the model.
fit = lr.fit(X_train, Y_train)

# Display.
print('Coefficients')
print(pd.DataFrame(fit.coef_, columns=feature_cols))
print('\nIntercept:', fit.intercept_)
print('\nTrain Set Accuracy:')
print(lr.score(X_train, Y_train))
print('\nTest Set Accuracy:')
print(lr.score(X_test, Y_test))
pred_y_sklearn = lr.predict(X_test)
print('\nTest Set Accuracy By crime_cat')
print(pd.crosstab(pred_y_sklearn, Y_test))

# Cross-Validation.
cvScoreTest = cross_val_score(lr, X_test, Y_test, cv=10)
print('\nVanilla LogRegr Test Acc: %0.2f (+/- %0.2f)' % (cvScoreTest.mean(), cvScoreTest.std() * 2))
cvScoreTrain = cross_val_score(lr, X_train, Y_train, cv=10)
print('\nVanilla LogRegr Train Acc: %0.2f (+/- %0.2f)' % (cvScoreTrain.mean(), cvScoreTrain.std() * 2))

Coefficients
   population  violent_crime     murder     rape_2    robbery  agg_assault  \
0  -15.299046      31.887458 -30.725208  16.376182  39.510607    25.728039   

   property_crime   burglary     larceny  motor_theft     arson  
0      106.508104  38.162159  140.393418    46.576656 -3.682357  

Intercept: [18.59339915]

Train Set Accuracy:
1.0

Test Set Accuracy:
0.9827586206896551

Test Set Accuracy By crime_cat
crime_cat   0   1
row_0            
0          82   0
1           2  32

Vanilla LogRegr Test Acc: 0.96 (+/- 0.09)

Vanilla LogRegr Train Acc: 0.99 (+/- 0.02)
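For reference, the three models in this notebook differ only in the penalty term and in `C`. As I understand the scikit-learn documentation, with labels coded as $y_i \in \{-1, +1\}$ and ignoring solver-specific details such as how the intercept is treated, the objective being minimized can be written as:

```latex
\min_{w,\,b}\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^{\top} w + b\right)}\right) + R(w),
\qquad
R(w) =
\begin{cases}
\tfrac{1}{2}\lVert w \rVert_2^2 & \text{ridge (l2)} \\[4pt]
\lVert w \rVert_1 & \text{lasso (l1)}
\end{cases}
```

`C` multiplies the data-fit term, so a larger `C` means weaker regularization. With `C=1e9` the penalty is negligible, which is consistent with the very large coefficients printed above.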


# Ridge Logistic Regression

In [19]:
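# Declare logistic regression classifier - l2 regularization.
# solver='liblinear' is set explicitly; it was the default in older scikit-learn releases.
ridgeRegression = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')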

# Fit the model.
ridgeFit = ridgeRegression.fit(X_train, Y_train)

# Display.
print('Coefficients')
print(pd.DataFrame(ridgeFit.coef_, columns=feature_cols))
print('\nIntercept:', ridgeFit.intercept_)
print('\nTrain Set Accuracy:')
print(ridgeRegression.score(X_train, Y_train))
print('\nTest Set Accuracy:')
print(ridgeRegression.score(X_test, Y_test))
pred_y_sklearn = ridgeRegression.predict(X_test)
print('\nTest Set Accuracy By crime_cat')
print(pd.crosstab(pred_y_sklearn, Y_test))

# Cross-Validation.
cvScoreTest = cross_val_score(ridgeRegression, X_test, Y_test, cv=10)
print('\nRidge Regr Test Acc: %0.2f (+/- %0.2f)' % (cvScoreTest.mean(), cvScoreTest.std() * 2))
cvScoreTrain = cross_val_score(ridgeRegression, X_train, Y_train, cv=10)
print('\nRidge Regr Train Acc: %0.2f (+/- %0.2f)' % (cvScoreTrain.mean(), cvScoreTrain.std() * 2))

Coefficients
   population  violent_crime    murder    rape_2   robbery  agg_assault  \
0     0.86917       1.188779  0.257912  1.291886  0.966335      1.38188   

   property_crime  burglary   larceny  motor_theft     arson  
0        2.462639  1.804646  2.712368     2.021859  0.780831  

Intercept: [-0.24442676]

Train Set Accuracy:
0.976878612716763

Test Set Accuracy:
0.9396551724137931

Test Set Accuracy By crime_cat
crime_cat   0   1
row_0            
0          82   5
1           2  27

Ridge Regr Test Acc: 0.90 (+/- 0.19)

Ridge Regr Train Acc: 0.97 (+/- 0.04)
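The ridge model above uses `C=1.0` without comparing it to other values. A minimal sketch of choosing `C` by cross-validated grid search, run on synthetic stand-in data (not the crime data); the candidate grid is an assumption, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 11 features, like the crime table.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=42
)

# Candidate values of C; smaller C means stronger regularization.
candidate_Cs = np.logspace(-3, 3, 7)

# LogisticRegression uses an l2 penalty by default.
search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid={'C': candidate_Cs},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

best_C = search.best_params_['C']
print(best_C, search.best_score_)
```

On the real data, the search would be fit on `X_train` and `Y_train` only, leaving the test set for the final comparison.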


# Lasso Logistic Regression

In [20]:
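# Declare logistic regression classifier - l1 regularization.
# The l1 penalty needs a solver that supports it, such as liblinear
# (the current default solver, lbfgs, does not).
lassoRegression = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')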

# Fit the model.
lassoFit = lassoRegression.fit(X_train, Y_train)

# Display.
print('Coefficients')
print(pd.DataFrame(lassoFit.coef_, columns=feature_cols))
print('\nIntercept:', lassoFit.intercept_)
print('\nTrain Set Accuracy:')
print(lassoRegression.score(X_train, Y_train))
print('\nTest Set Accuracy:')
print(lassoRegression.score(X_test, Y_test))
pred_y_sklearn = lassoRegression.predict(X_test)
print('\nTest Set Accuracy By crime_cat')
print(pd.crosstab(pred_y_sklearn, Y_test))

# Cross-Validation.
cvScoreTest = cross_val_score(lassoRegression, X_test, Y_test, cv=10)
print('\nLasso Regr Test Acc: %0.2f (+/- %0.2f)' % (cvScoreTest.mean(), cvScoreTest.std() * 2))
cvScoreTrain = cross_val_score(lassoRegression, X_train, Y_train, cv=10)
print('\nLasso Regr Train Acc: %0.2f (+/- %0.2f)' % (cvScoreTrain.mean(), cvScoreTrain.std() * 2))

Coefficients
   population  violent_crime  murder    rape_2  robbery  agg_assault  \
0         0.0            0.0     0.0  0.957523      0.0      1.73337   

   property_crime  burglary   larceny  motor_theft  arson  
0       10.758782       0.0  5.547575     3.023149    0.0  

Intercept: [0.2901841]

Train Set Accuracy:
0.9942196531791907

Test Set Accuracy:
0.9827586206896551

Test Set Accuracy By crime_cat
crime_cat   0   1
row_0            
0          83   1
1           1  31

Lasso Regr Test Acc: 0.94 (+/- 0.18)

Lasso Regr Train Acc: 0.99 (+/- 0.04)
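The l1 penalty set several coefficients to exactly zero, so the lasso fit doubles as a feature selector. A short sketch that lists which features were kept, using the coefficients copied from the lasso output above:

```python
import pandas as pd

feature_cols = ['population', 'violent_crime', 'murder', 'rape_2', 'robbery',
                'agg_assault', 'property_crime', 'burglary', 'larceny',
                'motor_theft', 'arson']

# Coefficients copied from the lasso output above.
lasso_coefs = [0.0, 0.0, 0.0, 0.957523, 0.0, 1.73337,
               10.758782, 0.0, 5.547575, 3.023149, 0.0]

coef_series = pd.Series(lasso_coefs, index=feature_cols)

# Features with a nonzero coefficient, largest first.
kept = coef_series[coef_series != 0].sort_values(ascending=False)

# Features the l1 penalty removed from the model.
dropped = coef_series[coef_series == 0].index.tolist()

print(kept)
print(dropped)
```

Five of the eleven features remain, with `property_crime` and `larceny` carrying the largest weights.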


#### Initial thoughts:
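The vanilla model may be overfit: its training accuracy is 1.0 and its coefficients are far larger than those of the regularized models. <br>
Ridge performed the worst on every accuracy measure. <br>
Test set accuracy is the same for lasso and vanilla (0.983). <br>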


# Thoughts
Vanilla logistic regression is the best model, as measured by the mean of the cross-validation scores and their lower variability. Vanilla scored highest in training set accuracy (1.0) and in cross-validated test set accuracy (0.96, vs. 0.94 for lasso and 0.90 for ridge). Vanilla and lasso surprisingly tied on test set accuracy (0.983 each); I'd like to go over this further with my mentor. Vanilla and lasso both had training set cross-validation scores of 0.99, while vanilla's spread was lower at (+/-) 0.02 vs. lasso's (+/-) 0.04, where each (+/-) value is two standard deviations of the fold scores.
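The comparison above runs `cross_val_score` separately for each model. A sketch of scoring all three on the same shuffled, stratified folds, so that differences between models are not caused by differences in the folds; it runs on synthetic stand-in data (not the crime data), and the fold settings are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with 11 features, like the crime table.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=42
)

# The same three configurations used in this notebook.
models = {
    'vanilla': LogisticRegression(C=1e9, solver='liblinear'),
    'ridge': LogisticRegression(penalty='l2', C=1.0, solver='liblinear'),
    'lasso': LogisticRegression(penalty='l1', C=1.0, solver='liblinear'),
}

# One fold definition shared by every model.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=folds)
    results[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.2f} (std {scores.std():.2f})')
```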