<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---

### Multinomial logistic regression models

So far, we have been using logistic regression for binary problems where there are only two class labels. Logistic regression can be extended to dependent variables with multiple classes.

There are two ways sklearn solves multiple-class problems with logistic regression: a multinomial loss or a "one vs. rest" (OvR) process where a model is fit for each target class vs. all the other classes. 

**Multinomial vs. OvR**
- (M) 'k-1' models with 1 reference category
- (OvR) 'k*(k-1)/2' models

You will use the gridsearch in conjunction with multinomial logistic to optimize a model that predicts the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary lab imports**

In [137]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Read in the data

In [138]:
crime_csv = './datasets/sf_crime_train.csv'

In [139]:
#read in the data using pandas
sf_crime = pd.read_csv(crime_csv)
sf_crime.drop('DayOfWeek',axis=1,inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [140]:
# check the shape of your dataframe
sf_crime.shape

(18000, 8)

In [141]:
#check whether there are any missing values
sf_crime.isnull().sum()
#do we need to fix anything here?
#we have duplicate rows, so we need to remove it.
sf_crime=sf_crime[~sf_crime.duplicated()]
sf_crime[sf_crime.duplicated()]

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y


In [142]:
#check what your datatypes are
sf_crime.dtypes
#do we need to fix anything here?
#yup the dates are registered as object.

Dates          object
Category       object
Descript       object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

### 2. Create column for year, month, day, hour, time, and date from 'Dates' column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [143]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime['Dates'])
sf_crime.dtypes

Dates         datetime64[ns]
Category              object
Descript              object
PdDistrict            object
Resolution            object
Address               object
X                    float64
Y                    float64
dtype: object

In [144]:
# create a new column for 'Year','Month',and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.dayofweek
#check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,2
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,2


In [145]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date

In [146]:
# Drop the 'Dates' column
sf_crime.drop('Dates',axis=1,inplace=True)
sf_crime.head()

Unnamed: 0,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,2,23,23:53:00,2015-05-13
1,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,2,23,23:53:00,2015-05-13
2,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,2015,5,2,23,23:33:00,2015-05-13
3,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,2015,5,2,23,23:30:00,2015-05-13
4,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,2015,5,2,23,23:30:00,2015-05-13


### 3. Validate and clean the data.

In [147]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)
sf_crime.Category.value_counts()

LARCENY/THEFT                  4869
OTHER OFFENSES                 2288
NON-CRIMINAL                   2254
ASSAULT                        1535
VEHICLE THEFT                   965
VANDALISM                       870
BURGLARY                        730
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  533
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           362
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
DRIVING UNDER THE INFLUENCE      42
PROSTITUTION                     42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          15
BRIBERY                     

In [148]:
# What's going on with 'TRESPASS' and 'TRESPASSING'?
# What's going on with 'ASSAULT' and 'ASSUALT'?
# fix these with .loc
#sf_crime.loc[2750,'Category']='ASSAULT'
#sf_crime.loc[4330,'Category']='ASSAULT'
#sf_crime.loc[5519,'Category']='TRESPASS'

In [149]:
# have a look to see whether you have all the days of the week in your data
sf_crime.Day_of_Week.value_counts()

2    2924
4    2724
5    2552
3    2475
6    2453
0    2440
1    2394
Name: Day_of_Week, dtype: int64

In [150]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out
sf_crime.Descript.value_counts()

GRAND THEFT FROM LOCKED AUTO                           2118
STOLEN AUTOMOBILE                                       623
AIDED CASE, MENTAL DISTURBED                            591
DRIVERS LICENSE, SUSPENDED OR REVOKED                   588
BATTERY                                                 519
PETTY THEFT FROM LOCKED AUTO                            495
PETTY THEFT OF PROPERTY                                 481
LOST PROPERTY                                           468
WARRANT ARREST                                          429
MALICIOUS MISCHIEF, VANDALISM                           360
FOUND PROPERTY                                          353
MALICIOUS MISCHIEF, VANDALISM OF VEHICLES               334
GRAND THEFT FROM UNLOCKED AUTO                          321
SUSPICIOUS OCCURRENCE                                   305
INVESTIGATIVE DETENTION                                 246
FOUND PERSON                                            233
PETTY THEFT FROM A BUILDING             

In [151]:
sf_crime.PdDistrict.value_counts()

SOUTHERN      3274
NORTHERN      2247
CENTRAL       2202
MISSION       2117
BAYVIEW       1675
INGLESIDE     1628
TARAVAL       1424
TENDERLOIN    1326
RICHMOND      1097
PARK           972
Name: PdDistrict, dtype: int64

In [152]:
sf_crime.Resolution.value_counts()

NONE                                      12829
ARREST, BOOKED                             4450
UNFOUNDED                                   367
ARREST, CITED                               100
JUVENILE BOOKED                              94
EXCEPTIONAL CLEARANCE                        58
PSYCHOPATHIC CASE                            28
LOCATED                                      25
CLEARED-CONTACT JUVENILE FOR MORE INFO       10
NOT PROSECUTED                                1
Name: Resolution, dtype: int64

In [153]:
# use .describe() to see whether the location coordinates seem appropriate
sf_crime.describe()

Unnamed: 0,X,Y,Year,Month,Day_of_Week,Hour
count,17962.0,17962.0,17962.0,17962.0,17962.0,17962.0
mean,-122.423637,37.768447,2015.0,3.489756,3.008629,13.640797
std,0.026522,0.0244,0.0,0.868845,1.966646,6.540842
min,-122.513642,37.708154,2015.0,2.0,0.0,0.0
25%,-122.434156,37.75381,2015.0,3.0,1.0,10.0
50%,-122.416949,37.775603,2015.0,3.0,3.0,15.0
75%,-122.406539,37.78539,2015.0,4.0,5.0,19.0
max,-122.365565,37.819923,2015.0,5.0,6.0,23.0


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor
- loitering 
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [154]:
NVC = ['BAD CHECKS','BRIBERY','DRUG/NARCOTIC','DRUNKENNESS',
     'EMBEZZLEMENT','FORGERY/COUNTERFEITING','FRAUD',
     'GAMBLING','LIQUOR','LOITERING','TRESPASS','OTHER OFFENSES']

NOT_C = ['NON-CRIMINAL','RUNAWAY','SECONDARY CODES','SUSPICIOUS OCC','WARRANTS']

#use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above

VC = [i for i in sf_crime.Category.unique() if (i not in NVC) and (i not in NOT_C)]
VC

['LARCENY/THEFT',
 'VEHICLE THEFT',
 'VANDALISM',
 'ROBBERY',
 'ASSAULT',
 'WEAPON LAWS',
 'BURGLARY',
 'STOLEN PROPERTY',
 'MISSING PERSON',
 'KIDNAPPING',
 'DRIVING UNDER THE INFLUENCE',
 'SEX OFFENSES FORCIBLE',
 'PROSTITUTION',
 'DISORDERLY CONDUCT',
 'ARSON',
 'FAMILY OFFENSES',
 'LIQUOR LAWS',
 'ASSUALT',
 'SUICIDE',
 'TRESPASSING',
 'SEX OFFENSES NON FORCIBLE',
 'EXTORTION']

In [155]:
#add a column called 'Type' into your dataframe that stores whether the observation was:
#Non-Violent, Violent, or Non-Crime
#use .map()!
def typecrime(x):
    if x in NOT_C: return 'NOT_CRIMINAL'
    if x in NVC: return 'NON-VIOLENT'
    if x in VC: return 'VIOLENT_CRIME'

sf_crime['Type']=sf_crime.Category.map(typecrime)

In [156]:
#find the baseline accuracy:
baseline=(sf_crime.Type.value_counts()/len(sf_crime))[0]
baseline

0.592640017815388

In [157]:
#create a target array with 'Type'
#create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
y = sf_crime.Type
X = sf_crime[['Day_of_Week','Month','Year','PdDistrict','Hour','Resolution']]

In [158]:
#use pd.get_dummies() to dummify your categorical variables
#remember to drop a column!
X = pd.get_dummies(X,drop_first=True)

### 5. Create a train/test/split and standardize the predictor matrices

In [166]:
#create a 50/50 train test split; 
#stratify based on your target variable
#use a random state of 2018
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.5,random_state=2018,shuffle=True,stratify=y)
X_train.shape

(8981, 22)

In [167]:
#standardise your predictor matrices
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)

### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [171]:
#create a default Logistic Regression model and find its mean cross-validated accuracy with your training data
#use 5 cross-validation folds
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
lm=LogisticRegression()
cross_val_score(lm,X_train,y_train,cv=5).mean()

0.6343363978652132

In [177]:
#create a confusion matrix with cross_val_predict
predictions = cross_val_predict(lm,X_test,y_test)
confusion = confusion_matrix(y_test,predictions)
pd.DataFrame(confusion,
             columns=sorted(y_train.unique()),
             index=sorted(y_train.unique()))

Unnamed: 0,NON-VIOLENT,NOT_CRIMINAL,VIOLENT_CRIME
NON-VIOLENT,813,35,886
NOT_CRIMINAL,360,137,1428
VIOLENT_CRIME,595,46,4681


### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** Gridsearching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently - the gridsearch object is more general and can be applied to any model. The `LogisticRegressionCV` is specific to tuning the logistic regression hyperparameters. I recommend the logistic regression one, but the downside is that lasso and ridge must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - Newton-cg - Handles Multinomial Loss, L2 only
    - Sag - Handles Multinomial Loss, Large Datasets, L2 Only, Works best on scaled data
    - lbfgs - Handles Multinomial Loss, L2 Only
    - liblinear - Small Datasets, no Warm Starts
- `C`: Regularization strengths (smaller values are stronger penalties)
- `penalty`: `'l1'` - Lasso, `'l2'` - Ridge 

In [199]:
#create a hyperparameter dictionary for a logistic regression
LogisticRegression()
params={
    'C': np.logspace(-3,3,10),
    'penalty': ['l1','l2'],
    'fit_intercept': [True]
}

In [200]:
#create a gridsearch object using LogisticRegression() and the dictionary you created above
grid=GridSearchCV(lm,params,scoring='accuracy',cv=5)

In [201]:
#fit the gridsearch object on your training data
grid.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.00000e-03, 4.64159e-03, 2.15443e-02, 1.00000e-01, 4.64159e-01,
       2.15443e+00, 1.00000e+01, 4.64159e+01, 2.15443e+02, 1.00000e+03]), 'penalty': ['l1', 'l2'], 'fit_intercept': [True]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [202]:
#print out the best parameters
grid.best_params_

{'C': 0.021544346900318832, 'fit_intercept': True, 'penalty': 'l1'}

In [203]:
#print out the best mean cross-validated score
grid.best_score_

0.6358980069034629

In [204]:
#assign your best estimator to the variable 'best_logreg'
best_logreg=grid.best_estimator_

In [205]:
#score your model on your testing data
best_logreg.score(X_test,y_test)

0.6285491593363768

### 8. Print out a classification report for your best_logreg model

In [206]:
#use your test data to create your classification report
predictions = best_logreg.predict(X_test)
print(classification_report(y_test,predictions))

               precision    recall  f1-score   support

  NON-VIOLENT       0.45      0.54      0.49      1734
 NOT_CRIMINAL       0.64      0.07      0.13      1925
VIOLENT_CRIME       0.68      0.86      0.76      5322

  avg / total       0.63      0.63      0.57      8981



### 9. Explore LogisticRegressionCV.  

With LogisticRegressionCV, you can access the best regularization strength for predicting each class! Read the documentation and see if you can implement a model with LogisticRegressionCV.

In [208]:
# A:
lmcv=LogisticRegressionCV(Cs=11)
lmcv.fit(X_train,y_test)

LogisticRegressionCV(Cs=11, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [216]:
lmcv.scores_

{'NON-VIOLENT': array([[0.80694723, 0.80694723, 0.80694723, 0.80694723, 0.80694723,
         0.80694723, 0.80694723, 0.80694723, 0.80694723, 0.80694723,
         0.80694723],
        [0.80694723, 0.80694723, 0.80694723, 0.80694723, 0.80694723,
         0.80694723, 0.80694723, 0.80694723, 0.80694723, 0.80694723,
         0.80694723],
        [0.80688273, 0.80688273, 0.80688273, 0.80688273, 0.80688273,
         0.80688273, 0.80688273, 0.80688273, 0.80688273, 0.80688273,
         0.80688273]]),
 'NOT_CRIMINAL': array([[0.78557114, 0.78557114, 0.78557114, 0.78557114, 0.78557114,
         0.78557114, 0.78557114, 0.78557114, 0.78557114, 0.78557114,
         0.78557114],
        [0.78557114, 0.78557114, 0.78557114, 0.78557114, 0.78557114,
         0.78557114, 0.78557114, 0.78557114, 0.78557114, 0.78557114,
         0.78557114],
        [0.78583361, 0.78583361, 0.78583361, 0.78583361, 0.78583361,
         0.78583361, 0.78583361, 0.78583361, 0.78583361, 0.78583361,
         0.78583361]]),
 'VIO