### Case Study 1
    Date : 16th March 2019
    Instructor : Bob Wu
    
# Machine Learning: Traffic Police Stop
>This case study explores real traffic-stop data collected and distributed by the Stanford Open Policing Project. The stops were recorded in the state of Rhode Island, USA. Each row in the dataset is the record of an individual stop. Variables such as the reason for the stop, the outcome of the stop, and the driver's demographics appear as fields. In this exercise we will use our newly learned indexing, boolean indexing, filtering, and method chaining skills to answer some real and interesting questions.



### Importing the Required Libraries
Data analysis often requires manipulations beyond Python's built-in capabilities, so we need to import external libraries/packages. It is standard practice to import the required libraries early in the notebook (usually in the first two cells).

In [1]:
## Data Analysis Libraries
import pandas as pd
import numpy as np

## Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.image as mpl

## Preprocessing Libraries
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

## Training Functions
from sklearn.model_selection import KFold, ParameterGrid

## Machine learning libraries
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

### Importing Dataset
Source: Stanford Open Policing Project


In [65]:
ri=pd.read_excel("Rhode Island Traffic Stops.xlsx")
ri.head()

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,0.0,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,0.0,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,0.0,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,,M,White,Call for Service,Other,False,,Arrest Driver,1.0,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,,F,White,Speeding,Speeding,False,,Citation,0.0,0-15 Min,False,Zone X3


## Exploring Data Types

To prepare the data in a machine-friendly format, all categorical data must be converted into numbers. One-hot encoding is used for nominal data and label encoding for ordinal data. Let's explore some of the categorical variables below.

In [26]:
ri['driver_gender'].value_counts() #counts frequency

M    62762
F    23774
Name: driver_gender, dtype: int64

In [21]:
ri['driver_race'].value_counts()

White       61872
Black       12285
Hispanic     9727
Asian        2390
Other         265
Name: driver_race, dtype: int64

In [22]:
ri['violation'].value_counts()

Speeding               48424
Moving violation       16224
Equipment              10922
Other                   4410
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64

In [23]:
ri['stop_outcome'].value_counts()

Citation            77092
Arrest Driver        2735
No Action             625
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

In [20]:
ri['stop_duration'].value_counts()

0-15 Min     69579
16-30 Min    13740
30+ Min       3220
Name: stop_duration, dtype: int64

In [19]:
ri['district'].value_counts()

Zone X4    24279
Zone K3    20405
Zone K2    18397
Zone X3    17013
Zone K1     8678
Zone X1     2969
Name: district, dtype: int64
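
By default `value_counts()` leaves missing values out, which is why the `driver_gender` counts above add up to 86,536 rather than the full number of rows. A minimal sketch on a made-up Series (`gender` is an illustrative name, not a column from the notebook) showing how `dropna=False` makes the missing entries visible:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri['driver_gender']
gender = pd.Series(["M", "F", "M", np.nan, "M"])

without_missing = gender.value_counts()             # NaN is left out by default
with_missing = gender.value_counts(dropna=False)    # NaN gets its own row

print(without_missing)
print(with_missing)
```

Comparing the two totals is a quick way to see how many rows a column is missing before deciding how to encode it.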

## Feature Types

The first step is to decide which features are relevant to the problem and separate them into categorical (nominal or ordinal) and numeric features.

In [108]:
nominal_feature = ["driver_gender","driver_race","violation","district"]
ordinal_feature = ["stop_duration","stop_outcome"]
numeric_feature = ["search_conducted","drugs_related_stop"]

all_features = nominal_feature + ordinal_feature + numeric_feature

## Encoding Categorical Variables

To prepare the data in a machine-friendly format, all categorical data must be converted into numbers. Here the nominal features are one-hot encoded with `pd.get_dummies`, and the ordinal features are label encoded as integer codes via `pd.Categorical`.

In [109]:
data = pd.get_dummies(ri[all_features],columns=nominal_feature)
data.head()

Unnamed: 0,stop_duration,stop_outcome,search_conducted,drugs_related_stop,driver_gender_F,driver_gender_M,driver_race_Asian,driver_race_Black,driver_race_Hispanic,driver_race_Other,...,violation_Other,violation_Registration/plates,violation_Seat belt,violation_Speeding,district_Zone K1,district_Zone K2,district_Zone K3,district_Zone X1,district_Zone X3,district_Zone X4
0,0-15 Min,Citation,False,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0-15 Min,Citation,False,False,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0-15 Min,Citation,False,False,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,16-30 Min,Arrest Driver,False,False,0,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,0-15 Min,Citation,False,False,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [110]:
data["stop_duration"] = pd.Categorical(data["stop_duration"]).codes
data["stop_outcome"] = pd.Categorical(data["stop_outcome"]).codes
data.head()

Unnamed: 0,stop_duration,stop_outcome,search_conducted,drugs_related_stop,driver_gender_F,driver_gender_M,driver_race_Asian,driver_race_Black,driver_race_Hispanic,driver_race_Other,...,violation_Other,violation_Registration/plates,violation_Seat belt,violation_Speeding,district_Zone K1,district_Zone K2,district_Zone K3,district_Zone X1,district_Zone X3,district_Zone X4
0,0,2,False,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,2,False,False,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,2,False,False,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,1,0,False,False,0,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,0,2,False,False,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
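
The two encodings can be compared side by side on a small, made-up frame. This is a sketch rather than the notebook's own cell: `toy` and `duration_order` are illustrative names. Passing the categories explicitly makes the integer codes follow the intended order instead of relying on the default sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring two columns of the stops data
toy = pd.DataFrame({
    "driver_gender": ["M", "F", "M", np.nan],
    "stop_duration": ["0-15 Min", "30+ Min", "16-30 Min", np.nan],
})

# One-hot encode the nominal column: one indicator column per category
one_hot = pd.get_dummies(toy["driver_gender"], prefix="driver_gender")

# Label-encode the ordinal column with an explicit category order
duration_order = ["0-15 Min", "16-30 Min", "30+ Min"]
codes = pd.Categorical(
    toy["stop_duration"], categories=duration_order, ordered=True
).codes

print(one_hot.columns.tolist())   # ['driver_gender_F', 'driver_gender_M']
print(codes.tolist())             # [0, 2, 1, -1]
```

Note that `.codes` maps a missing value to -1 rather than leaving it as NaN, so a later `fillna` call will not change those rows.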


## Setting up Dataset

In [111]:
data.fillna(0,inplace=True)

In [112]:
y = data['stop_outcome'].values #target column
X = data.drop(['stop_outcome'],axis=1).values #exclude the target column

## Classification

In [None]:
model = SVC()
model.fit(X,y)
print("score:",model.score(X,y))
yp = model.predict(X)
yp

In [106]:
data[(yp==1)].head()

Unnamed: 0,stop_duration,search_conducted,is_arrested,drugs_related_stop,driver_gender_F,driver_gender_M,driver_race_Asian,driver_race_Black,driver_race_Hispanic,driver_race_Other,...,stop_outcome_Citation,stop_outcome_N/D,stop_outcome_No Action,stop_outcome_Warning,district_Zone K1,district_Zone K2,district_Zone K3,district_Zone X1,district_Zone X3,district_Zone X4
3,1,False,1.0,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
16,2,False,1.0,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
61,2,True,1.0,False,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
71,2,False,1.0,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
82,2,False,1.0,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## Cross-validation Dataset

In [78]:
kf = KFold(n_splits=5,random_state=42,shuffle=True) #5 fold cross validation
kf

KFold(n_splits=5, random_state=42, shuffle=True)

## Train model

In [98]:
for train_ind, test_ind in kf.split(X):
    model = SVC()
    model.fit(X[train_ind],y[train_ind])
    print(model.score(X[test_ind],y[test_ind]))

1.0
1.0
1.0
1.0
1.0
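
The same loop can also be written with scikit-learn's `cross_val_score`, which handles the split, fit, and score steps in one call. The sketch below runs on synthetic data: `X_toy` and `y_toy` come from `make_classification` and only stand in for `X` and `y`, they are not the stops data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for X and y (assumption: not the traffic-stops data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones the estimator, fits it on each training fold,
# and returns the accuracy on the matching held-out fold
scores = cross_val_score(SVC(), X_toy, y_toy, cv=kf)

print(scores)
print("mean accuracy:", scores.mean())
```

Scores of exactly 1.0 on every fold, as in the output above, are worth a second look, since they often mean that one of the features already encodes the target.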


In [97]:
outcome = model.predict(X)

NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
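
This error is what scikit-learn raises when `predict` is called on an estimator that has not been fitted, for example when a cell that creates a fresh `SVC()` is run without the `fit` call. A small sketch on made-up data (`X_toy` and `y_toy` are illustrative) that reproduces the error and then shows the working order of calls:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.svm import SVC

# Tiny made-up dataset, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = SVC()
try:
    model.predict(X_toy)          # fails: the model has not seen any data yet
    raised = False
except NotFittedError:
    raised = True

model.fit(X_toy, y_toy)           # fit first ...
outcome = model.predict(X_toy)    # ... then predict works
print(raised, outcome.shape)      # True (4,)
```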

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 15 columns):
state                 91741 non-null object
stop_date             91741 non-null object
stop_time             91741 non-null object
county_name           0 non-null float64
driver_gender         86536 non-null object
driver_race           86539 non-null object
violation_raw         86539 non-null object
violation             86539 non-null object
search_conducted      91741 non-null bool
search_type           3307 non-null object
stop_outcome          86539 non-null object
is_arrested           86539 non-null object
stop_duration         86539 non-null object
drugs_related_stop    91741 non-null bool
district              91741 non-null object
dtypes: bool(2), float64(1), object(12)
memory usage: 9.3+ MB


* Examining the data types of columns
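
A summary like the one above is the kind of output `DataFrame.info()` produces. The sketch below uses a hypothetical three-row frame (`toy` is an illustrative name) and adds a per-column count of missing values, which makes an entirely empty column such as `county_name` easy to spot.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with a few of the same column names
toy = pd.DataFrame({
    "county_name": [np.nan, np.nan, np.nan],   # entirely missing
    "driver_gender": ["M", "F", np.nan],
    "search_conducted": [False, True, False],
})

toy.info()                      # dtypes and non-null counts per column
missing = toy.isna().sum()      # number of missing values per column
print(missing)
```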

## Analysis

Unnamed: 0,state,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


## What kinds of violations are captured in the dataset?

Which column should we separate out as a Series, and which Series method should we apply to answer this?

array(['Equipment/Inspection Violation', 'Speeding', 'Call for Service',
       nan, 'Other Traffic Violation', 'Registration Violation',
       'Special Detail/Directed Patrol', 'APB',
       'Motorist Assist/Courtesy', 'Suspicious Person',
       'Violation of City/Town Ordinance', 'Warrant',
       'Seatbelt Violation'], dtype=object)

### How many unique types of violation are there in the dataset?

12
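
One possible way to get both answers, sketched on a made-up Series (`violations` stands in for `ri['violation_raw']`): `unique()` lists the distinct values and includes NaN, while `nunique()` counts them and skips NaN by default. That difference would explain why the array above shows `nan` alongside the named values while the count is 12.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ri['violation_raw']
violations = pd.Series(["Speeding", "APB", "Speeding", np.nan, "Warrant"])

distinct = violations.unique()     # distinct values, NaN included
n_distinct = violations.nunique()  # number of distinct values, NaN excluded

print(distinct)
print(n_distinct)                  # 3
```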

Now that we know there are 12 different types of violation, it will be helpful to know the frequency of each violation in the dataset. For this we will use the `value_counts()` method of the Series.

### What is the proportion of each unique violation in the dataset?

Speeding                            48424
Other Traffic Violation             16224
Equipment/Inspection Violation      10922
Registration Violation               3703
Seatbelt Violation                   2856
Special Detail/Directed Patrol       2467
Call for Service                     1392
Motorist Assist/Courtesy              205
Violation of City/Town Ordinance      181
APB                                    91
Suspicious Person                      56
Warrant                                18
Name: violation_raw, dtype: int64
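
The output above lists absolute counts. To express them as proportions, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series (`violations` is illustrative):

```python
import pandas as pd

# Hypothetical stand-in for ri['violation_raw']
violations = pd.Series(["Speeding", "Speeding", "Speeding", "APB"])

counts = violations.value_counts()                 # absolute frequencies
shares = violations.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```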

## How many *females* were caught *speeding*?

(15646, 15)
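
The tuple above has the form of a `.shape` result, where the first element is the number of matching rows. One way to arrive at such a result is to combine two boolean masks, sketched here on a made-up frame (`toy` is illustrative):

```python
import pandas as pd

# Hypothetical frame with the two columns the question needs
toy = pd.DataFrame({
    "driver_gender": ["F", "M", "F", "F"],
    "violation_raw": ["Speeding", "Speeding", "Warrant", "Speeding"],
})

# Combine two conditions with &; each condition needs its own parentheses
is_female = toy["driver_gender"] == "F"
is_speeding = toy["violation_raw"] == "Speeding"
female_speeding = toy[is_female & is_speeding]

print(female_speeding.shape)   # (2, 2): 2 matching rows, 2 columns
```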

## What proportion of speeding violations resulted in arrest?
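
A possible approach, sketched on a made-up frame (`toy` is illustrative): filter to the speeding stops, then take the mean of the boolean `is_arrested` column, since the mean of booleans is the share of `True` values. In the real data `is_arrested` has missing values and an `object` dtype, so it would likely need `dropna()` or a cast to `bool` first.

```python
import pandas as pd

# Hypothetical frame with a clean boolean is_arrested column
toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Speeding", "Speeding", "Warrant"],
    "is_arrested": [False, False, True, False, True],
})

speeding = toy[toy["violation_raw"] == "Speeding"]
arrest_rate = speeding["is_arrested"].mean()   # mean of booleans = share of True

print(arrest_rate)   # 0.25
```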

## Is any one particular race more prone to arrest?

driver_race  violation_raw                     is_arrested
Asian        Call for Service                  False          0.785714
                                               True           0.214286
             Equipment/Inspection Violation    False          0.968037
                                               True           0.031963
             Motorist Assist/Courtesy          False          1.000000
             Other Traffic Violation           False          0.962264
                                               True           0.037736
             Registration Violation            False          0.956522
                                               True           0.043478
             Seatbelt Violation                False          0.942308
                                               True           0.057692
             Special Detail/Directed Patrol    False          0.982143
                                               True           0.017857
             Speed
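
A breakdown of this shape can be produced by grouping and then calling `value_counts(normalize=True)` on the grouped column. The sketch below groups a made-up frame by race only (`toy` is illustrative; the output above also groups by `violation_raw`), and adds a shorter route to the arrest rate itself.

```python
import pandas as pd

# Hypothetical frame with a clean boolean is_arrested column
toy = pd.DataFrame({
    "driver_race": ["Asian", "Asian", "White", "White", "White", "White"],
    "is_arrested": [False, True, False, False, False, True],
})

# Share of arrested / not arrested stops within each race group
breakdown = toy.groupby("driver_race")["is_arrested"].value_counts(normalize=True)

# Arrest rate only: the mean of a boolean column is the share of True values
arrest_rate = toy.groupby("driver_race")["is_arrested"].mean()

print(breakdown)
print(arrest_rate)
```

Rates computed this way describe the recorded stops only; groups with very few stops give noisy rates.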

## What is the Search Rate for each violation?

Since the data contains a `search_conducted` column, it is useful to know, for each type of violation, what fraction of stops led to a search. We will use the tools we have learned to answer this question.

violation_raw                     search_conducted
APB                               False               0.824176
                                  True                0.175824
Call for Service                  False               0.923132
                                  True                0.076868
Equipment/Inspection Violation    False               0.935726
                                  True                0.064274
Motorist Assist/Courtesy          False               0.926829
                                  True                0.073171
Other Traffic Violation           False               0.942986
                                  True                0.057014
Registration Violation            False               0.906562
                                  True                0.093438
Seatbelt Violation                False               0.968487
                                  True                0.031513
Special Detail/Directed Patrol    False               0.989461
    

## What is the Search Rate by Gender?
This question can be answered with the same technique applied above. The only addition is to include the gender variable in the group-by statement. This ensures that the final results are segmented by gender as well.

In [53]:
ri.groupby(['violation_raw','driver_race'])['search_conducted'].value_counts(normalize=True)[:,:,True].sort_values().plot(kind='barh',figsize=(6,8))

<matplotlib.axes._subplots.AxesSubplot at 0x21a855f2d68>
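
The cell above groups by `driver_race`. To segment by gender as the text describes, `driver_gender` can take its place in the group-by. A sketch on a made-up frame (`toy` is illustrative) that takes the mean of the boolean `search_conducted` column as the search rate and uses `unstack` to put the genders side by side:

```python
import pandas as pd

# Hypothetical frame with the three columns the question needs
toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Speeding", "Speeding", "APB", "APB"],
    "driver_gender": ["F", "F", "M", "M", "F", "M"],
    "search_conducted": [False, False, True, False, False, True],
})

# Search rate per (violation, gender) pair; unstack moves gender into columns
search_rate = (
    toy.groupby(["violation_raw", "driver_gender"])["search_conducted"]
       .mean()
       .unstack("driver_gender")
)

print(search_rate)
```

The resulting table has one row per violation and one column per gender, so it can be passed to `.plot(kind='barh')` for a grouped bar chart.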

## String Manipulation

## How many times was the search conducted because of Reasonable Suspicion?
Here we use the string accessor of the Series, which checks each value against the defined criterion. The result is a Boolean Series, and its mean is the fraction of all stops whose search type mentions "Suspicion".

In [142]:
ri['search_type'].str.contains('Suspicion',na=False).mean()

0.003433579315682192

### What Proportion of Searches were conducted where at least one cause was Reasonable Suspicion?
First we filter the records to those where a search was conducted. Next we subset to the `search_type` column only. On the resulting Series we use the pandas string methods, which are reached through the *str* accessor followed by the actual method.

In [151]:
ri[ri['search_conducted']==True]['search_type'].str.contains('Suspicion',na=False).mean()

0.09525249470819473
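
The two results differ only in their denominator: the first divides by all stops, the second by stops where a search took place. A sketch on a made-up frame makes this visible (`toy` and its `search_type` strings are illustrative values, not taken from the data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two stops with a search, two without
toy = pd.DataFrame({
    "search_conducted": [False, True, True, False],
    "search_type": [np.nan, "Reasonable Suspicion",
                    "Incident to Arrest,Probable Cause", np.nan],
})

# na=False treats a missing search_type as "does not contain"
mask = toy["search_type"].str.contains("Suspicion", na=False)

share_of_all_stops = mask.mean()                          # denominator: every stop
share_of_searches = mask[toy["search_conducted"]].mean()  # denominator: searched stops

print(share_of_all_stops, share_of_searches)   # 0.25 0.5
```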

## Bring your Creative Spirit out!

### Using the techniques presented above, can you utilize other columns in the dataframe to answer some other interesting questions?


> Hints: Utilize duration of stop, time of stop, etc., or try using violation_raw or search_conducted in more creative ways
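
As one possible starting point, the time of stop can be reduced to an hour of day and then grouped like any other column. The sketch below assumes `stop_time` holds `"HH:MM"` strings, as in the displayed rows; `toy` and the derived `hour` column are illustrative names.

```python
import pandas as pd

# Hypothetical frame; stop_time is assumed to be an "HH:MM" string
toy = pd.DataFrame({
    "stop_time": ["12:55", "23:15", "04:15", "23:40"],
    "is_arrested": [False, False, False, True],
})

# Parse the strings and keep only the hour of day
toy["hour"] = pd.to_datetime(toy["stop_time"], format="%H:%M").dt.hour

stops_per_hour = toy["hour"].value_counts().sort_index()
arrest_rate_by_hour = toy.groupby("hour")["is_arrested"].mean()

print(stops_per_hour)
print(arrest_rate_by_hour)
```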

In [174]:
ri

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
5,RI,2005-03-14,10:00,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
6,RI,2005-03-29,21:55,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
7,RI,2005-04-04,21:25,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K1
8,RI,2005-07-14,11:20,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
9,RI,2005-07-14,19:55,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
