# Comparing Logistic Regression Models:
Vanilla Logistic Regression <br>
Ridge Logistic Regression <br>
Lasso Logistic Regression

# Task:
1. Pick a dataset with a binary outcome and the potential for at least 15 features. <br>
2. Engineer your features, then create 3 models. <br>
3. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). <br>
4. Evaluate all 3 models and decide on the best. Be clear on the decisions that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the 3. <br>
5. Reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but wish you could have done?
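
As a rough orientation for steps 2 to 4, the three models can be sketched in scikit-learn as below. This is a minimal sketch, not the notebook's final modeling code: the data is synthetic (`make_classification` stands in for the crime data), the `C` values are placeholders rather than tuned choices, and "vanilla" is approximated by a very large `C`, which makes the default L2 penalty negligible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 400 rows, 15 features, binary outcome.
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=42)

# In scikit-learn, C is the inverse of the regularization strength.
models = {
    # Very large C makes the default L2 penalty negligible ("vanilla").
    'vanilla': LogisticRegression(C=1e9, max_iter=1000),
    # Default penalty is L2 (ridge); C=1.0 is a placeholder, not a tuned value.
    'ridge': LogisticRegression(C=1.0, max_iter=1000),
    # L1 (lasso) needs a solver that supports it, such as liblinear.
    'lasso': LogisticRegression(penalty='l1', C=1.0, solver='liblinear'),
}

scores = {}
for name, model in models.items():
    # Scaling inside the pipeline is refit on each training fold.
    pipe = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    print(f'{name}: mean CV accuracy = {scores[name]:.3f}')
```

Accuracy is only one possible evaluation criterion; the same loop works with other `scoring` values such as `'roc_auc'`.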

[California - Offenses Known to Law Enforcement](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_california_by_city_2013.xls)

In [53]:
# Import modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm

# Aesthetics.
%matplotlib inline
sns.set_style('white')

In [54]:
# Load data.
cal_crime = pd.read_csv('~/src/data/unit3/cal-crime-2013.csv')
print(cal_crime.shape)
cal_crime.head()

(462, 13)


Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson
0,Adelanto,31165,198,2,,15,52,129,886,381,372,133,17
1,Agoura Hills,20762,19,0,,2,10,7,306,109,185,12,7
2,Alameda,76206,158,0,,10,85,63,1902,287,1285,330,17
3,Albany,19104,29,0,,1,24,4,557,94,388,75,7
4,Alhambra,84710,163,1,,9,81,72,1774,344,1196,234,7


# Data cleaning

In [55]:
cal_crime.columns = ['city', 'population', 'violent_crime', 'murder', 'rape_1', 'rape_2', 'robbery',
                     'agg_assault', 'property_crime', 'burglary', 'larceny',
                     'motor_theft', 'arson']
cal_crime.tail()

Unnamed: 0,city,population,violent_crime,murder,rape_1,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson
457,Yountville,2969,1,0,,1,0,0,57,17,34,6,0
458,Yreka,7639,49,1,,2,2,44,278,71,193,14,2
459,Yuba City,65133,174,2,,15,39,118,1980,438,1210,332,16
460,Yucaipa,52524,107,0,,7,31,69,926,262,534,130,13
461,Yucca Valley,21214,86,3,,7,15,61,429,141,234,54,2


In [56]:
cal_crime.isnull().sum()

city                0
population          0
violent_crime       0
murder              0
rape_1            462
rape_2              0
robbery             0
agg_assault         0
property_crime      0
burglary            0
larceny             0
motor_theft         0
arson               0
dtype: int64

In [57]:
cal_crime = cal_crime.drop(['rape_1'], axis=1)

In [58]:
# Feature engineering: a squared population term plus binary indicators
# (1 if the city recorded at least one such offense, else 0).
cal_crime['population_squared'] = cal_crime['population'] ** 2
cal_crime['murder_cat'] = (cal_crime['murder'] > 0).astype(int)
cal_crime['robbery_cat'] = (cal_crime['robbery'] > 0).astype(int)
cal_crime['arson_cat'] = (cal_crime['arson'] > 0).astype(int)
cal_crime['violent_crime_cat'] = (cal_crime['violent_crime'] > 0).astype(int)
print(cal_crime.shape)

(462, 17)


# Creating a binary target feature: "dangerous_city"

In [59]:
cal_crime.columns

Index(['city', 'population', 'violent_crime', 'murder', 'rape_2', 'robbery',
       'agg_assault', 'property_crime', 'burglary', 'larceny', 'motor_theft',
       'arson', 'population_squared', 'murder_cat', 'robbery_cat', 'arson_cat',
       'violent_crime_cat'],
      dtype='object')
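
The column listing above does not yet contain `dangerous_city`, and the cell that defines it is not shown here. One possible definition, purely as an illustration, is to flag cities whose violent crime rate per 1,000 residents is above the median. The rule, the threshold, and the `sample` frame below are assumptions, not the definition this notebook actually uses; the five rows are the ones displayed by `cal_crime.head()`.

```python
import pandas as pd

# First five rows of the table shown earlier (city, population, violent_crime).
sample = pd.DataFrame({
    'city': ['Adelanto', 'Agoura Hills', 'Alameda', 'Albany', 'Alhambra'],
    'population': [31165, 20762, 76206, 19104, 84710],
    'violent_crime': [198, 19, 158, 29, 163],
})

# Hypothetical rule: violent crimes per 1,000 residents above the sample median.
rate = 1000 * sample['violent_crime'] / sample['population']
sample['dangerous_city'] = (rate > rate.median()).astype(int)
print(sample[['city', 'dangerous_city']])
```

Whatever rule is chosen, the target has to be created from the raw counts, before the features are standardized.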

# Standardize

In [60]:
# Standardize the numeric features only. The binary target `dangerous_city`
# must stay 0/1, so it is excluded from scaling if it is present.
cal_crime = cal_crime.drop(['city'], axis=1)
scale_cols = cal_crime.columns.drop('dangerous_city', errors='ignore')
cal_crime[scale_cols] = StandardScaler().fit_transform(cal_crime[scale_cols])
cal_crime.head()

Unnamed: 0,population,violent_crime,murder,rape_2,robbery,agg_assault,property_crime,burglary,larceny,motor_theft,arson,population_squared,murder_cat,robbery_cat,arson_cat,violent_crime_cat
0,-0.180906,-0.071458,-0.075224,0.043363,-0.105458,-0.043249,-0.183904,-0.030388,-0.226966,-0.172755,0.050807,-0.065954,1.090676,0.286299,0.495261,0.093454
1,-0.23117,-0.249871,-0.221248,-0.257374,-0.190682,-0.299822,-0.29088,-0.295664,-0.280259,-0.296063,-0.091366,-0.066719,-0.916863,0.286299,0.495261,0.093454
2,0.036716,-0.111327,-0.221248,-0.072305,-0.038497,-0.182051,0.003489,-0.122064,0.033228,0.028002,0.050807,-0.059105,-0.916863,0.286299,0.495261,0.093454
3,-0.239181,-0.239904,-0.221248,-0.280508,-0.162274,-0.306131,-0.244585,-0.310293,-0.222406,-0.231861,-0.091366,-0.066812,-0.916863,0.286299,0.495261,0.093454
4,0.077804,-0.106343,-0.148236,-0.095439,-0.046613,-0.163123,-0.02012,-0.066473,0.007864,-0.069829,-0.091366,-0.057167,1.090676,0.286299,0.495261,0.093454
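
To make the transformation concrete, the sketch below checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its standard deviation. It uses only the five populations from the first rows shown earlier, so the resulting numbers will not match the output above, which was computed over all 462 cities.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Populations of the first five cities shown earlier, as a single column.
population = np.array([[31165.0], [20762.0], [76206.0], [19104.0], [84710.0]])

scaled = StandardScaler().fit_transform(population)

# StandardScaler computes (x - mean) / std using the population std (ddof=0).
manual = (population - population.mean(axis=0)) / population.std(axis=0)

print(np.allclose(scaled, manual))
```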


# Target & Features

In [61]:
df = cal_crime

In [64]:
# Split the data into a target (Y) and features (X).
target = df['dangerous_city']
features = df.drop(['dangerous_city'], axis=1)
feature_cols = features.columns

# Hold out 25% of the rows as a test set; random_state makes the split reproducible.
train, test = train_test_split(df, test_size=0.25, random_state=42)

X_train = train[feature_cols]
Y_train = train['dangerous_city']
X_test = test[feature_cols]
Y_test = test['dangerous_city']
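
One caveat with this order of operations: the features were standardized before the split, so the test rows contributed to the means and standard deviations. A common alternative, sketched below on synthetic stand-in data (an assumption, since `dangerous_city` is not defined in the cells shown), is to split first and let a pipeline fit the scaler on the training rows only. The `stratify` argument is also an addition here, not something the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for the 462 x 16 features and binary target.
X, y = make_classification(n_samples=462, n_features=16, random_state=42)

# stratify=y keeps the class balance similar in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and standard deviations to transform the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, Y_train)
test_accuracy = model.score(X_test, Y_test)
print(f'test accuracy: {test_accuracy:.3f}')
```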