# Stanford Open Policing Dataset

In this project, we will analyse police activity in the state of Rhode Island.

Our data source is the Stanford Open Policing Project, which collects data from 31 US states. To keep the size manageable, we will focus on a reduced dataset with selected fields, covering traffic stops in the state of Rhode Island only.

https://openpolicing.stanford.edu

Concretely, the data we will be analysing has been saved here:
https://github.com/JasonKwo/DataCamp-Data-Scientist-with-Python/blob/master/11-Analysing-Police-Activity-with-Pandas/police.csv



### Examining the dataset

Before beginning our analysis, it's important to familiarize ourselves with the dataset. We'll start with .info() to get an overview of the data, then examine the first few rows, and finally count the number of missing values in each column.

In [1]:
# Read 'police.csv' into a DataFrame named ri
import pandas as pd

file_name = 'https://raw.githubusercontent.com/JasonKwo/DataCamp-Data-Scientist-with-Python/master/11-Analysing-Police-Activity-with-Pandas/police.csv'
ri = pd.read_csv(file_name)  # for a local copy, use: ri = pd.read_csv('police.csv')

# Examine the DataFrame
# (.info() prints its summary and returns None, so print() also outputs 'None')
print(ri.info())

# Examine the head of the DataFrame
print(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())    # .isnull() returns a DataFrame of booleans (True where a value is missing); .sum() counts the True values per column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62045 entries, 0 to 62044
Data columns (total 15 columns):
state                 62045 non-null object
stop_date             62045 non-null object
stop_time             62045 non-null object
county_name           0 non-null float64
driver_gender         58275 non-null object
driver_race           58277 non-null object
violation_raw         58277 non-null object
violation             58277 non-null object
search_conducted      62044 non-null object
search_type           2461 non-null object
stop_outcome          58277 non-null object
is_arrested           58277 non-null object
stop_duration         58277 non-null object
drugs_related_stop    62044 non-null object
district              62044 non-null object
dtypes: float64(1), object(14)
memory usage: 7.1+ MB
None
  state   stop_date stop_time  county_name driver_gender driver_race  \
0    RI  2005-01-04     12:55          NaN             M       White   
1    RI  2005-01-23     23:15    

### Dropping columns

Often, a DataFrame will contain columns that are not useful to our analysis. Such columns should be dropped from the DataFrame, to make it easier for us to focus on the remaining columns.

Here, we will drop the county_name column because it contains only missing values, and the state column because every traffic stop took place in the same state (Rhode Island). Neither column carries any useful information.
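As a rough sketch of how such columns could be spotted programmatically rather than by eye, the snippet below flags columns that are entirely missing and columns that hold a single distinct value. The `toy` DataFrame and the variable names are made up for illustration and merely stand in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real police.csv
toy = pd.DataFrame({
    'state': ['RI', 'RI', 'RI'],
    'county_name': [np.nan, np.nan, np.nan],
    'driver_gender': ['M', 'F', np.nan],
})

# Columns in which every value is missing
all_missing = [col for col in toy.columns if toy[col].isnull().all()]

# Columns with exactly one distinct non-missing value
constant = [col for col in toy.columns if toy[col].nunique() == 1]

print(all_missing)  # ['county_name']
print(constant)     # ['state']
```

On the real DataFrame, the same two checks could be run against ri before deciding which columns to drop.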

In [2]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape) # We now have two fewer columns

(62045, 15)
(62045, 13)
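The cell above modifies ri in place. A possible alternative, sketched here on a made-up `toy` DataFrame, is to use the `columns` keyword of `drop` and assign the result to a new name, which leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset
toy = pd.DataFrame({
    'state': ['RI', 'RI'],
    'county_name': [None, None],
    'stop_date': ['2005-01-04', '2005-01-23'],
})

# drop() returns a new DataFrame; 'toy' itself is not modified
trimmed = toy.drop(columns=['county_name', 'state'])

print(toy.shape)      # (2, 3)
print(trimmed.shape)  # (2, 1)
```

Keeping the original around can make it easier to re-run a notebook cell without errors, since dropping an already-dropped column in place raises a KeyError.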


### Dropping rows

When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

During this project, the driver_gender column will be critical to many of our analyses. Because only a small fraction of rows (3,770 of 62,045, or about 6%) are missing driver_gender, we'll drop those rows from the dataset.
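One way to judge whether the missing fraction is small enough is to look at the share of missing values per column instead of the raw counts. The sketch below uses a made-up `toy` DataFrame with placeholder values; on the real data the same expression would be applied to ri.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are placeholders, not real records
toy = pd.DataFrame({
    'driver_gender': ['M', 'F', np.nan, 'M'],
    'search_type': [np.nan, np.nan, np.nan, 'X'],
})

# The mean of a boolean column is the share of True values,
# i.e. the fraction of rows that are missing in that column
missing_share = toy.isnull().mean()
print(missing_share)

# Drop only the rows that lack 'driver_gender'
cleaned = toy.dropna(subset=['driver_gender'])
print(cleaned.shape)  # (3, 2)
```

Note that search_type stays mostly missing after the drop, because `subset` restricts the check to driver_gender alone.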

In [3]:
# Count the number of missing values in each column
print(ri.isnull().sum())

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender          3770
driver_race            3768
violation_raw          3768
violation              3768
search_conducted          1
search_type           59584
stop_outcome           3768
is_arrested            3768
stop_duration          3768
drugs_related_stop        1
district                  1
dtype: int64
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           55814
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(58275, 13)
