## In this Notebook:

- **Do genders commit different violations?**
- **Does gender affect who gets a ticket for speeding?**
- **Does gender affect whose vehicle is searched?**
- **Does gender affect who is frisked during a search?**

Note: This is a simplified analysis, intended mainly as practice in subsetting the dataset. The exercise and its questions follow the DataCamp course "Analyzing Police Activity with Pandas".

In other words, this exploration looks at relationships in the data and does not attempt to explain causation. Even though the questions might imply a search for causes, the exploration is solely about correlations.

#### Importing the Prepared Dataset

In [10]:
#importing the pandas library to handle actions on the dataframe
import pandas as pd

#importing the prepared dataset
RI_traffic_prepd = pd.read_csv("RI_traffic_prepped.csv")
RI_traffic_prepd.drop('Unnamed: 0', axis='columns', inplace=True)
RI_traffic_prepd.head()

,driver_gender,driver_race,violation_raw,violation,search_conducted,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,date_time_stop
0,M,White,Equipment/Inspection Violation,Equipment,False,Citation,False,0-15 Min,False,Zone X4,2005-01-04 12:55:00
1,M,White,Speeding,Speeding,False,Citation,False,0-15 Min,False,Zone K3,2005-01-23 23:15:00
2,M,White,Speeding,Speeding,False,Citation,False,0-15 Min,False,Zone X4,2005-02-17 04:15:00
3,M,White,Call for Service,Other,False,Arrest Driver,True,16-30 Min,False,Zone X1,2005-02-20 17:15:00
4,F,White,Speeding,Speeding,False,Citation,False,0-15 Min,False,Zone X3,2005-02-24 01:20:00


### Do genders commit different violations?

- using value_counts() for categorical columns

In [15]:
RI_traffic_prepd.stop_outcome.value_counts()

Citation            77092
Arrest Driver        2735
No Action             625
N/D                   607
Arrest Passenger      343
Name: stop_outcome, dtype: int64

- The most common stop outcome is a citation followed by a warning.

In [16]:
#proportion of value counts
RI_traffic_prepd.stop_outcome.value_counts(normalize=True)

Citation            0.890835
Arrest Driver       0.031604
No Action           0.007222
N/D                 0.007014
Arrest Passenger    0.003964
Name: stop_outcome, dtype: float64

- With normalize=True, each value is that outcome's share of all stops, so citations account for about 89% of recorded outcomes.

Examining the proportions of the driver_race column.
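As a quick check on what normalize=True does, the sketch below computes the same proportions by hand. The Series is hypothetical toy data standing in for the stop_outcome column, so the numbers do not come from the Rhode Island dataset.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the stop_outcome column
outcomes = pd.Series(["Citation", "Citation", "Citation", "Arrest Driver"])

counts = outcomes.value_counts()                # raw count per category
shares = outcomes.value_counts(normalize=True)  # each count divided by the total

# The same proportions, computed by hand from the raw counts
manual_shares = counts / counts.sum()

print(shares)
print(manual_shares)
```

Both printouts should show the same proportions, which is all normalize=True adds over the plain counts.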

In [17]:
RI_traffic_prepd.driver_race.value_counts(normalize=True)

White       0.714961
Black       0.141959
Hispanic    0.112400
Asian       0.027618
Other       0.003062
Name: driver_race, dtype: float64

- The largest group is "White", at about 71% of stops.

In [18]:
#subsetting the data by a value in the driver_race column, in this case "white"
RI_white_drivers = RI_traffic_prepd[RI_traffic_prepd.driver_race == "White"]
RI_white_drivers.shape

(61872, 11)

Now if we look at the stop_outcome proportions in this subsetted dataset, we can see how they have changed.

In [19]:
RI_white_drivers.stop_outcome.value_counts(normalize=True)

Citation            0.902234
Arrest Driver       0.024017
No Action           0.007047
N/D                 0.006433
Arrest Passenger    0.002748
Name: stop_outcome, dtype: float64

**Observation**: Among "White" drivers the citation share is about 90%, slightly above the 89% for all drivers. Because "White" drivers make up about 71% of the stops, they contribute most of the overall stop_outcome figures, so the two sets of proportions stay close.
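One way to make this kind of comparison easier to read is to place the overall and subset proportions side by side. The sketch below is a minimal illustration on hypothetical toy data; the column labels all_drivers and white_drivers are made up for the example.

```python
import pandas as pd

# Toy data (hypothetical); the notebook itself would use RI_traffic_prepd
stops = pd.DataFrame({
    "driver_race": ["White", "White", "White", "Black", "Hispanic"],
    "stop_outcome": ["Citation", "Citation", "Arrest Driver",
                     "Citation", "Arrest Driver"],
})

overall = stops.stop_outcome.value_counts(normalize=True)
white_only = stops[stops.driver_race == "White"].stop_outcome.value_counts(normalize=True)

# Align the two Series on the outcome label so they can be read row by row
comparison = pd.concat([overall, white_only], axis="columns",
                       keys=["all_drivers", "white_drivers"])
print(comparison)
```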

### Does gender affect who gets a ticket for speeding?

- Filtering by multiple conditions

In [21]:
#creating a dataframe of female drivers
RI_female_drivers = RI_traffic_prepd[RI_traffic_prepd.driver_gender == 'F']
RI_female_drivers.shape

(23774, 11)

In [22]:
#creating another subset of the dataset where the dataset represents females that were arrested
RI_females_arrested = RI_traffic_prepd[(RI_traffic_prepd.driver_gender == 'F') &
                                      (RI_traffic_prepd.is_arrested == True)]
RI_females_arrested.shape

(669, 11)
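The same two-condition pattern could be pointed at the question in this section's heading by filtering on driver_gender and violation together, then comparing stop outcomes. The sketch below assumes that approach and uses hypothetical toy rows, including a "Warning" outcome chosen only for illustration.

```python
import pandas as pd

# Toy data (hypothetical); column names match the prepared dataset
stops = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "M", "F"],
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Equipment", "Other"],
    "stop_outcome": ["Citation", "Warning", "Citation", "Citation", "Citation", "Citation"],
})

# Both conditions must hold, so combine them with & and wrap each in parentheses
female_speeding = stops[(stops.driver_gender == "F") & (stops.violation == "Speeding")]
male_speeding = stops[(stops.driver_gender == "M") & (stops.violation == "Speeding")]

# Share of each stop outcome among speeding stops, per gender
print(female_speeding.stop_outcome.value_counts(normalize=True))
print(male_speeding.stop_outcome.value_counts(normalize=True))
```

Comparing the two printed distributions shows whether citations make up a similar share of speeding stops for each gender. It is a correlation only, in line with the note at the top of the notebook.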

In [23]:
#now creating a subset of the data where they are FEMALE -OR- ARRESTED
RI_females_OR_arrested = RI_traffic_prepd[(RI_traffic_prepd.driver_gender == 'F') | #<--
                                      (RI_traffic_prepd.is_arrested == True)]
RI_females_OR_arrested.shape

(31385, 11)

Here we include:
- All drivers who are female
- Plus all arrested drivers who are not female

With the | operator, a row is kept if either condition is met.
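The row counts for the two filters are linked: rows matching either condition equal the rows matching the first, plus the rows matching the second, minus the rows matching both. With the shapes above, that implies 31385 - 23774 + 669 = 8280 arrested rows. The sketch below checks the identity on hypothetical toy data.

```python
import pandas as pd

# Toy data (hypothetical)
stops = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "F"],
    "is_arrested": [True, False, True, False, False],
})

is_female = stops.driver_gender == "F"
is_arrested = stops.is_arrested  # already Boolean, so "== True" is not needed

n_female = is_female.sum()                  # rows where the driver is female
n_arrested = is_arrested.sum()              # rows with an arrest
n_both = (is_female & is_arrested).sum()    # AND: both conditions hold
n_either = (is_female | is_arrested).sum()  # OR: at least one condition holds

# Inclusion-exclusion: the overlap is counted once, not twice
print(n_either, n_female + n_arrested - n_both)
```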

### Does gender affect whose vehicle is searched?

Exploring mathematical calculations with Boolean values

**The mean of a Boolean Series is the proportion of True values (multiply by 100 for a percentage).**

In [24]:
#proportion of stops that did and did not result in an arrest
RI_traffic_prepd.is_arrested.value_counts(normalize=True)

False    0.909746
True     0.090254
Name: is_arrested, dtype: float64

In [25]:
#taking the mean of these Boolean values
RI_traffic_prepd.is_arrested.mean()

0.09025408486936048
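The mean matches the True proportion from value_counts because pandas treats True as 1 and False as 0 in arithmetic. A minimal sketch on a hypothetical toy Series:

```python
import pandas as pd

# Toy data (hypothetical): one arrest out of four stops
arrested = pd.Series([True, False, False, False])

# True counts as 1 and False as 0, so the mean is the fraction of True values
rate_from_mean = arrested.mean()

# The same figure by hand: number of True values divided by number of rows
rate_by_hand = arrested.sum() / len(arrested)

print(rate_from_mean, rate_by_hand)
```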

##### using the groupby() method to explore arrest rate by district

In [27]:
RI_traffic_prepd.district.unique()

array(['Zone X4', 'Zone K3', 'Zone X1', 'Zone X3', 'Zone K1', 'Zone K2'],
      dtype=object)

In [28]:
#can calculate the arrest rate in a particular zone
RI_traffic_prepd[RI_traffic_prepd.district == 'Zone K1'].is_arrested.mean()

0.06718137819774142

In [29]:
#grouping by district column will give us a way to compare across districts
RI_traffic_prepd.groupby('district').is_arrested.mean()

district
Zone K1    0.067181
Zone K2    0.069522
Zone K3    0.072286
Zone X1    0.230044
Zone X3    0.082466
Zone X4    0.117674
Name: is_arrested, dtype: float64

In [31]:
#grouping by multiple categories to look at arrest rate
RI_traffic_prepd.groupby(['district', 'driver_gender']).is_arrested.mean()

district  driver_gender
Zone K1   F                0.019169
          M                0.026588
Zone K2   F                0.022196
          M                0.034285
Zone K3   F                0.025156
          M                0.034961
Zone X1   F                0.019646
          M                0.024563
Zone X3   F                0.027188
          M                0.038166
Zone X4   F                0.042149
          M                0.049956
Name: is_arrested, dtype: float64

In [32]:
#swapping the grouping fields changes only the presentation; the values are the same
RI_traffic_prepd.groupby(['driver_gender', 'district']).is_arrested.mean()

driver_gender  district
F              Zone K1     0.019169
               Zone K2     0.022196
               Zone K3     0.025156
               Zone X1     0.019646
               Zone X3     0.027188
               Zone X4     0.042149
M              Zone K1     0.026588
               Zone K2     0.034285
               Zone K3     0.034961
               Zone X1     0.024563
               Zone X3     0.038166
               Zone X4     0.049956
Name: is_arrested, dtype: float64
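A third presentation option, not used in the notebook itself, is to pivot the inner index level into columns with unstack(), which gives one row per district and one column per gender. A sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical); column names match the prepared dataset
stops = pd.DataFrame({
    "district": ["Zone K1", "Zone K1", "Zone K1", "Zone X4", "Zone X4", "Zone X4"],
    "driver_gender": ["F", "M", "M", "F", "F", "M"],
    "is_arrested": [False, True, False, True, False, True],
})

# Arrest rate for every (district, gender) pair, as a Series with a MultiIndex
rates = stops.groupby(["district", "driver_gender"]).is_arrested.mean()

# Move the inner level (driver_gender) into columns for a side-by-side table
table = rates.unstack()
print(table)
```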

### Does gender affect who is frisked during a search?

In [33]:
RI_traffic_prepd.search_conducted.value_counts()

False    88434
True      3307
Name: search_conducted, dtype: int64

In [36]:
#counting search_type values, including NaN, which is expected wherever no search was conducted
RI_traffic_prepd.search_type.value_counts(dropna=False)

AttributeError: 'DataFrame' object has no attribute 'search_type'

- Oops, that column was dropped during preparation. The exploration shows that even though search_type had about 96% missing values, they were missing for a good reason. This is where one would go back, add the field back in, and note why so many of its values are missing.
- Had the column been kept, this exercise would have shown that the number of NaN values in search_type equals the number of "False" values in the search_conducted field. The correlation makes sense.
- If no search was conducted (search_conducted is False), no search_type would be listed either.
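If the search_type column were restored, the reasoning in the bullets above could be checked directly. The sketch below assumes that and uses hypothetical toy data, including made-up search_type values. It also guards against the AttributeError by testing for the column first, and flags frisks with str.contains as one possible way to approach the section question.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); search_type is NaN wherever no search was conducted
stops = pd.DataFrame({
    "search_conducted": [False, False, True, False, True],
    "search_type": [np.nan, np.nan, "Frisk", np.nan, "Probable Cause"],
})

# Check that the column exists before using it, to avoid the AttributeError
if "search_type" in stops.columns:
    n_missing = stops.search_type.isna().sum()     # NaN values in search_type
    n_no_search = (~stops.search_conducted).sum()  # False values in search_conducted

    # Row-by-row check: search_type is missing exactly when no search happened
    missing_matches_no_search = (stops.search_type.isna() == ~stops.search_conducted).all()

    # Flag searches whose type mentions a frisk; NaN rows are treated as False
    stops["frisk"] = stops.search_type.str.contains("Frisk", na=False)

    print(n_missing, n_no_search, missing_matches_no_search)
```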