<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data

_Authors: Joseph Nelson (SF)_

---

### Multinomial logistic regression models

So far, we have been using logistic regression for binary problems where there are only two class labels. Logistic regression can be extended to dependent variables with multiple classes.

There are two ways sklearn solves multiple-class problems with logistic regression: a multinomial loss or a "one vs. rest" (OvR) process where a model is fit for each target class vs. all the other classes. 

**Multinomial vs. OvR**
- (both) 'k' classes
- (M) 'k-1' sets of coefficients with 1 reference category
- (OvR) 'k' models, one per class vs. all the others ('k*(k-1)/2' is the count for one-vs-one, not OvR)
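
To make the counts above concrete, here is a minimal NumPy sketch of the reference-category form of the multinomial model: `k-1` logits plus a fixed zero logit for the reference class give `k` probabilities. The function name and the logit values are made up for illustration; this is not sklearn's internal implementation.

```python
import numpy as np

def reference_softmax(logits_non_reference):
    """Turn k-1 logits (one per non-reference class) into k class probabilities.

    The reference class gets a fixed logit of 0 and is placed first.
    """
    logits = np.concatenate([[0.0], np.asarray(logits_non_reference, dtype=float)])
    logits -= logits.max()  # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Hypothetical logits for a 3-class problem: only 2 are modeled explicitly
probs = reference_softmax([0.5, -1.2])
print(probs, probs.sum())

k = 3
print('multinomial (reference coding):', k - 1, 'sets of coefficients')
print('one vs. rest:', k, 'binary models')
```

Note that sklearn's `LogisticRegression` stores one coefficient row per class in `coef_` for multiclass problems rather than dropping a reference class, so the `k-1` count describes the textbook formulation.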

You will use gridsearch in conjunction with multinomial logistic regression to optimize a model that predicts the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary lab imports**

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'



### 1. Read in the data

In [34]:
sf_crime= pd.read_csv('/Users/Mahendra/desktop/GA/hw/6.3.4_optimization-gridsearch_hyperparameters-lab/datasets/sf_crime_train.csv')
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [3]:
# A:

### 2. Create columns for hour, month, and year from the 'Dates' column.

> *Hint: `pd.to_datetime` may or may not be helpful.*


In [37]:
# parse the 'Dates' column once, then pull out each component
dates = pd.DatetimeIndex(sf_crime['Dates'])
sf_crime['year'] = dates.year
sf_crime['month'] = dates.month
sf_crime['day'] = dates.day
sf_crime['time'] = dates.time
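
As an alternative to `pd.DatetimeIndex`, the `pd.to_datetime` route mentioned in the hint might look like the sketch below. It uses a small made-up frame (`demo`) in place of `sf_crime`, and the format string is an assumption based on the `5/13/15 23:53` values shown in the output above.

```python
import pandas as pd

# Hypothetical two-row frame standing in for sf_crime
demo = pd.DataFrame({'Dates': ['5/13/15 23:53', '5/13/15 23:30']})

# Parse the strings once; the format is assumed from the sample rows
parsed = pd.to_datetime(demo['Dates'], format='%m/%d/%y %H:%M')

# The .dt accessor exposes the individual datetime components
demo['hour'] = parsed.dt.hour
demo['month'] = parsed.dt.month
demo['year'] = parsed.dt.year

print(demo)
```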

In [38]:
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,year,month,day,time
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,13,23:53:00
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,13,23:53:00
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,2015,5,13,23:33:00
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,2015,5,13,23:30:00
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,2015,5,13,23:30:00


### 3. Validate and clean the data.

In [40]:
sf_crime.drop(['Dates'], axis = 1, inplace = True)

In [42]:
sf_crime.isnull().sum()

Category      0
Descript      0
DayOfWeek     0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
year          0
month         0
day           0
time          0
dtype: int64

In [43]:
sf_crime.Category.value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

In [44]:
sf_crime.Descript.value_counts()

GRAND THEFT FROM LOCKED AUTO                             2127
STOLEN AUTOMOBILE                                         625
AIDED CASE, MENTAL DISTURBED                              591
DRIVERS LICENSE, SUSPENDED OR REVOKED                     589
BATTERY                                                   520
PETTY THEFT FROM LOCKED AUTO                              498
PETTY THEFT OF PROPERTY                                   484
LOST PROPERTY                                             468
WARRANT ARREST                                            429
MALICIOUS MISCHIEF, VANDALISM                             361
FOUND PROPERTY                                            353
MALICIOUS MISCHIEF, VANDALISM OF VEHICLES                 340
GRAND THEFT FROM UNLOCKED AUTO                            321
SUSPICIOUS OCCURRENCE                                     305
INVESTIGATIVE DETENTION                                   246
FOUND PERSON                                              233
PETTY TH

In [45]:
sf_crime.DayOfWeek.value_counts()

Wednesday    2930
Friday       2733
Saturday     2556
Thursday     2479
Sunday       2456
Monday       2447
Tuesday      2399
Name: DayOfWeek, dtype: int64

### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor
- loitering 
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What should your "baseline" category be?**

In [46]:
# A:

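# non-crimes
zeros = ['non-criminal', 'runaway', 'secondary codes', 'suspicious occ', 'warrants']
# non-violent crimes; the data labels the liquor category 'LIQUOR LAWS', so match that name
# note: 'other offenses' is not in the list above; this answer treats it as non-violent
ones  = ['bad checks', 'bribery', 'drug/narcotic', 'drunkenness', 'embezzlement', 'forgery/counterfeiting', 'fraud', 
         'gambling','liquor laws', 'loitering', 'trespass', 'other offenses']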

In [47]:
crime_cat = []
# iterate through the sf_crime Category column
for crime in sf_crime['Category']:
    # lowercase the value so it matches the lists above
    crime = crime.lower()
    # check which list the category belongs to
    if crime in zeros:
        # append the corresponding high-level label
        crime_cat.append('non-crime')
    elif crime in ones:
        crime_cat.append('non-violent')
    else:
        crime_cat.append('violent')
        
# take that list and add it to the DF
sf_crime['cat_number'] = crime_cat
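
The loop above can also be written as a vectorized lookup. The sketch below is one possible version, using a made-up four-row frame (`demo`) and a shortened non-violent list purely for illustration; `label_map` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy frame standing in for sf_crime
demo = pd.DataFrame({'Category': ['WARRANTS', 'FRAUD', 'ASSAULT', 'RUNAWAY']})

non_crime = ['non-criminal', 'runaway', 'secondary codes', 'suspicious occ', 'warrants']
non_violent = ['bribery', 'drug/narcotic', 'fraud', 'trespass']  # shortened for the demo

# Build one lookup table from sub-category to high-level label
label_map = {name: 'non-crime' for name in non_crime}
label_map.update({name: 'non-violent' for name in non_violent})

# Anything not found in the lookup maps to NaN, which is then filled with 'violent'
demo['cat_number'] = demo['Category'].str.lower().map(label_map).fillna('violent')

print(demo)
```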

In [50]:
dummies = pd.get_dummies(sf_crime[['DayOfWeek','PdDistrict','Resolution']], drop_first = True)
sf_crime = sf_crime.merge(dummies, left_index = True, right_index = True,how = 'outer')

In [52]:
# A:
x = sf_crime.drop(['Category','Descript','DayOfWeek','PdDistrict',
                   'Resolution','Address','X','Y','cat_number','time'], axis = 1)
y = sf_crime['cat_number'].values

In [53]:
x.columns

Index([u'year', u'month', u'day', u'DayOfWeek_Monday', u'DayOfWeek_Saturday',
       u'DayOfWeek_Sunday', u'DayOfWeek_Thursday', u'DayOfWeek_Tuesday',
       u'DayOfWeek_Wednesday', u'PdDistrict_CENTRAL', u'PdDistrict_INGLESIDE',
       u'PdDistrict_MISSION', u'PdDistrict_NORTHERN', u'PdDistrict_PARK',
       u'PdDistrict_RICHMOND', u'PdDistrict_SOUTHERN', u'PdDistrict_TARAVAL',
       u'PdDistrict_TENDERLOIN', u'Resolution_ARREST, CITED',
       u'Resolution_CLEARED-CONTACT JUVENILE FOR MORE INFO',
       u'Resolution_EXCEPTIONAL CLEARANCE', u'Resolution_JUVENILE BOOKED',
       u'Resolution_LOCATED', u'Resolution_NONE', u'Resolution_NOT PROSECUTED',
       u'Resolution_PSYCHOPATHIC CASE', u'Resolution_UNFOUNDED'],
      dtype='object')

### 5. Standardize the predictor matrix

In [54]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(x)
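
For reference, standardizing subtracts each column's mean and divides by its standard deviation. A minimal NumPy sketch on made-up numbers, assuming the population standard deviation (`ddof=0`) that `StandardScaler` uses:

```python
import numpy as np

# Made-up matrix: 3 rows, 2 columns on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Column-wise z-score: (value - column mean) / column standard deviation
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```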

### 6. Find the optimal hyperparameters (optimal regularization) to predict your crime categories.

> **Note:** Gridsearching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently - the gridsearch object is more general and can be applied to any model. The `LogisticRegressionCV` is specific to tuning the logistic regression hyperparameters. I recommend the logistic regression one, but the downside is that lasso and ridge must be searched separately.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - handles multinomial loss, L2 only
    - `sag` - handles multinomial loss, large datasets, L2 only, works best on scaled data
    - `lbfgs` - handles multinomial loss, L2 only
    - `liblinear` - small datasets, no warm starts
- `Cs`: regularization strengths (smaller values are stronger penalties)
- `cv`: cross-validation generator or number of folds
- `penalty`: `'l1'` - LASSO, `'l2'` - Ridge 
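
When `Cs` is given as an integer, the sklearn documentation describes the grid as that many values on a logarithmic scale between 1e-4 and 1e4. The sketch below builds an equivalent grid by hand with NumPy; `C_grid` is just an illustrative name.

```python
import numpy as np

# 15 candidate values of C, evenly spaced on a log scale from 1e-4 to 1e4
C_grid = np.logspace(-4, 4, 15)

# Small C means a strong penalty, large C means a weak penalty
print(C_grid[0], C_grid[-1])
print(len(C_grid))
```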

In [8]:
# Example:
# fit model with five folds and lasso regularization
# Cs can be an integer (e.g. Cs=15 tests a grid of 15 values) or, as below, an explicit list of values to try
# remember: Cs describes the inverse of regularization strength

# logreg_cv = LogisticRegressionCV(solver='liblinear', 
#                                  Cs=[1,5,10], 
#                                  cv=5, penalty='l1')
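
The more general `GridSearchCV` route mentioned in the note could be sketched as follows. This runs on synthetic data from `make_classification` rather than the crime data, and searches only over `C` with the default L2 penalty; the grid values and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the crime data
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# Candidate inverse regularization strengths (illustrative values)
C_grid = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validated search over C, scored by accuracy
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': C_grid},
                    cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

Other hyperparameters can be added as extra keys in `param_grid`, provided the chosen solver supports every combination in the grid.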

**Split data into training and testing with 50% in testing.**

In [55]:
# A:
# split the standardized predictors (Xs from step 5) rather than the raw matrix x
x_train, x_test, y_train, y_test = train_test_split(Xs, y, test_size=0.5, random_state=12)

**Gridsearch hyperparameters for the training data.**

In [58]:
# A:
logreg_cv = LogisticRegressionCV(Cs=100, cv=5, penalty='l1', scoring='accuracy', solver='liblinear')
logreg_cv.fit(x_train, y_train)

LogisticRegressionCV(Cs=100, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring='accuracy', solver='liblinear', tol=0.0001,
           verbose=0)

**Find the best parameters for each target class.**

In [11]:
# A:
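
One possible way to get a separate best `C` per target class is sketched below on synthetic data, not the crime data. It wraps `LogisticRegressionCV` in `OneVsRestClassifier` so that each class gets its own binary model and its own cross-validated `C`. With the `multi_class='ovr'` fit shown above, the fitted `logreg_cv.C_` attribute should hold comparable per-class values, but that depends on the sklearn version, so treat this as an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem standing in for the crime data
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# One binary LogisticRegressionCV per class (class vs. rest)
ovr = OneVsRestClassifier(LogisticRegressionCV(Cs=10, cv=5, max_iter=1000))
ovr.fit(X_demo, y_demo)

# Collect the selected C for each class; np.ravel copes with scalar or array C_
best_C = {int(label): float(np.ravel(est.C_)[0])
          for label, est in zip(ovr.classes_, ovr.estimators_)}

print(best_C)
```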

**Build three logistic regression models using the best parameters for each target class.**

In [12]:
# A:

### 7. Build confusion matrices for the models above
- Use the holdout test data from the train-test split

In [13]:
# A:
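
As a reminder of how to read the result, here is a small sketch of `confusion_matrix` on made-up labels (not model output). Rows are the true classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

labels = ['non-crime', 'non-violent', 'violent']

# Made-up true and predicted labels for six observations
y_true_demo = ['violent', 'violent', 'non-crime', 'non-violent', 'violent', 'non-crime']
y_pred_demo = ['violent', 'non-violent', 'non-crime', 'non-violent', 'violent', 'violent']

# Entry [i, j] counts observations of true class i predicted as class j
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)
```

In a real answer, `y_true_demo` would be replaced by `y_test` and `y_pred_demo` by the predictions of a fitted model on `x_test`.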

### 8. Print classification reports for your three models.

In [14]:
# A:

**Describe the metrics in the classification report.**

In [15]:
# A:
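
The per-class numbers in a classification report can be derived from a confusion matrix. The sketch below computes precision, recall, F1, and support with NumPy from a made-up 3x3 matrix (rows are true classes, columns are predicted classes); it illustrates the definitions and is not a replacement for `classification_report`.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [0, 1, 2]])

true_pos = np.diag(cm_demo)                # correctly predicted count per class

precision = true_pos / cm_demo.sum(axis=0)  # of everything predicted as the class, share that was right
recall = true_pos / cm_demo.sum(axis=1)     # of everything truly in the class, share that was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
support = cm_demo.sum(axis=1)               # number of true observations per class

print(precision, recall, f1, support)
```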