# Lesson I

Now that you have learned the foundations of pandas, this course will give you the chance to apply that knowledge by answering interesting questions about a real dataset! You will explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. During the course, you will gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data. Analyzing Police Activity with pandas will give you valuable experience analyzing a dataset from start to finish, preparing you for your data science career!

## Stanford Open Policing Project Datasets

Let's start by introducing the data. We'll be working with a dataset of traffic stops by police officers that was collected by the **Stanford Open Policing Project**:

* They've collected data from 31 states.
* In this course, we'll focus on Rhode Island.
* To keep the file small, some of the columns and rows have been removed.
* The full datasets can be downloaded from the [Stanford Open Policing Project website](https://openpolicing.stanford.edu/).

### Preparing the Data

This chapter is about preparing the data for analysis. Before beginning an analysis, it's critical that we first:

* Examine the data, to make sure we understand it
* Clean the data

Let's start by importing necessary packages and datasets...

In [1]:
# Import Packages
import pandas as pd

# Load the 'police' dataset
ri = pd.read_csv('datasets/police.csv')

ri.head(3)

| | state | stop_date | stop_time | county_name | driver_gender | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RI | 2005-01-04 | 12:55 | NaN | M | White | Equipment/Inspection Violation | Equipment | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 1 | RI | 2005-01-23 | 23:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone K3 |
| 2 | RI | 2005-02-17 | 04:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |


* Each row represents a single traffic stop
* ``NaN`` indicates a missing value
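
Empty fields in a CSV file are what pandas reads in as ``NaN``. The following is a minimal sketch of that behavior, assuming a small made-up sample (the ``csv_text`` string and the ``sample`` name are hypothetical and are not taken from the real ``police.csv``):

```python
import io

import pandas as pd

# Hypothetical three-row sample; it only mimics the layout of the real file
csv_text = """stop_date,stop_time,county_name,driver_gender
2005-01-04,12:55,,M
2005-01-23,23:15,,M
2005-02-17,04:15,,
"""

# Empty fields between commas are parsed as missing values (NaN)
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Here every ``county_name`` entry and the last ``driver_gender`` entry should appear as ``NaN`` in the printed DataFrame.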

#### Locating Missing Values

It's important that we locate the missing values so that we can *proactively* decide how to handle them.

Recall that the ``isnull()`` method generates a DataFrame of ``True`` and ``False`` values:

* ``True`` if the element is missing
* ``False`` if it is not

In [2]:
ri.isnull()

| | state | stop_date | stop_time | county_name | driver_gender | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 1 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 2 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 3 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 4 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 91736 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 91737 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 91738 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |
| 91739 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False |


One useful trick is to take the sum of this DataFrame, which outputs a count of the number of missing values in each column.

Then we can compare this result to the DataFrame's shape.

In [3]:
# sum() adds up each column, treating True as 1 and False as 0
ri.isnull().sum()

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
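
Comparing the missing-value counts to the number of rows can also be done in code. This is a minimal sketch on a made-up DataFrame (``toy`` and its values are hypothetical, chosen only to illustrate the comparison):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is entirely missing
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', '2005-02-17', '2005-02-20'],
    'county_name': [np.nan, np.nan, np.nan, np.nan],
    'driver_gender': ['M', np.nan, 'F', 'M'],
})

# Count missing values per column (True counts as 1, False as 0)
missing_counts = toy.isnull().sum()

# The first element of shape is the number of rows
n_rows = toy.shape[0]

# Share of rows that are missing in each column
missing_share = missing_counts / n_rows

# Columns whose missing count equals the row count hold no data at all
all_missing = missing_counts[missing_counts == n_rows].index.tolist()

print(missing_counts)
print(missing_share)
print(all_missing)
```

A column that shows up in ``all_missing`` carries no information and is a natural candidate for dropping.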

#### Dropping a Column

In [4]:
# Let's look at the shape of the DataFrame
ri.shape

(91741, 15)

* The ``county_name`` column contains only missing values: its missing-value count (91741) equals the number of rows (91741).
* We can drop ``county_name`` using the ``drop()`` method.

In [5]:
ri.drop('county_name', axis='columns', inplace=True)
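
As an alternative to ``inplace=True``, ``drop()`` can return a new DataFrame and leave the original untouched. A minimal sketch on made-up data (``toy`` and ``trimmed`` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data with one empty column
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23'],
    'county_name': [None, None],
    'driver_gender': ['M', 'F'],
})

# Without inplace=True, drop() returns a new DataFrame
trimmed = toy.drop(columns='county_name')

print(toy.shape)      # the original still has all three columns
print(trimmed.shape)  # the copy has one column fewer
```

Which style to use is largely a matter of preference; assigning the result to a new name keeps the original available if a later step needs it.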

#### Dropping Rows

Finally, let's take a look at one more method related to missing values. The ``dropna()`` method is a great way to drop rows based on the presence of missing values in that row. 

For example, let's pretend that the ``stop_date`` and ``stop_time`` columns are critical to our analysis, and thus a row is useless to us without that data. We can tell pandas to drop all rows that have a missing value in either the ``stop_date`` or ``stop_time`` column.

In [6]:
ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)
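
To see how ``subset`` limits which columns are checked, here is a minimal sketch on made-up data (``toy`` and ``kept`` are hypothetical names, and the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in several columns
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', np.nan, '2005-02-17', '2005-02-20'],
    'stop_time': ['12:55', '23:15', np.nan, '17:15'],
    'search_type': [np.nan, np.nan, np.nan, 'Probable Cause'],
})

# Only 'stop_date' and 'stop_time' are checked for missing values;
# a missing 'search_type' does not cause a row to be dropped
kept = toy.dropna(subset=['stop_date', 'stop_time'])
print(kept)
```

The rows missing a date or a time are removed, while the first row is kept even though its ``search_type`` is missing.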

## Exercise

### Dropping More Columns and Rows

In [8]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'state' column ('county_name' was already dropped earlier)
ri.drop('state', axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(91741, 14)
(91741, 13)


In [10]:
# Count the number of missing values in each column
print(ri.isnull().sum())
print('-------------------------')

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
-------------------------
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(86536, 13)
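
A quick sanity check after dropping rows is to confirm that the number of rows removed equals the number of missing values in the column used for the drop. In the notebook, 91741 - 5205 = 86536, which matches the final shape. The same check can be sketched on made-up data (``toy`` and ``cleaned`` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows lack 'driver_gender'
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, 'F', np.nan, 'M'],
    'search_type': [np.nan, np.nan, 'Probable Cause', np.nan, np.nan],
})

rows_before = toy.shape[0]
missing_gender = toy['driver_gender'].isnull().sum()

# Drop the rows where 'driver_gender' is missing
cleaned = toy.dropna(subset=['driver_gender'])
rows_after = cleaned.shape[0]

# The number of rows removed should equal the missing count
assert rows_before - rows_after == missing_gender
print(rows_before, missing_gender, rows_after)
```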
