# Lesson I

Now that you have learned the foundations of pandas, this course will give you the chance to apply that knowledge by answering interesting questions about a real dataset! You will explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. During the course, you will gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data. Analyzing Police Activity with pandas will give you valuable experience analyzing a dataset from start to finish, preparing you for your data science career!

## Stanford Open Policing Project Datasets

Let's start by introducing the data. We'll be working with a dataset of traffic stops by police officers that was collected by the **Stanford Open Policing Project**:

* They've collected data from 31 states.
* In this course, we'll focus on Rhode Island.
* To keep the file small, some of the columns and rows have been removed.
* The full datasets can be downloaded from [the project website](https://openpolicing.stanford.edu/).

### Preparing the Data

This chapter is about preparing the data for analysis. Before beginning an analysis, it's critical that we first:

* Examine the data, to make sure we understand it
* Clean the data

Let's start by importing the necessary packages and loading the dataset.

In [1]:
# Import Packages
import pandas as pd

# 'Police' Datasets
ri = pd.read_csv('datasets/police.csv')

ri.head(3)

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4


* Each row represents a single traffic stop
* ``NaN`` indicates a missing value

#### Locating Missing Values

It's important that we locate the missing values so that we can *proactively* decide how to handle them.

Recall that the ``isnull()`` method generates a DataFrame of ``True`` and ``False`` values:

* ``True`` if the element is missing
* ``False`` if it is not

In [2]:
ri.isnull()

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91736,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91737,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91738,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91739,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False


One useful trick is to take the sum of this DataFrame, which outputs a count of the number of missing values in each column.

Then we can compare this result to the DataFrame's shape.

In [3]:
# sum() calculates the sum of each column
ri.isnull().sum()
# True = 1, False = 0

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
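The comparison between missing-value counts and the number of rows can be sketched on a small, made-up DataFrame. The `toy` frame and its values are hypothetical and are not part of the police data; the idea is only that a column is entirely empty when its missing count equals the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F"],
})

# isnull() gives True/False; sum() treats True as 1 and False as 0
missing_counts = toy.isnull().sum()
n_rows = toy.shape[0]

# Columns whose missing count equals the row count hold no data at all
all_missing = missing_counts[missing_counts == n_rows].index.tolist()

print(missing_counts)
print(all_missing)
```

Here only `county_name` qualifies, which mirrors the situation in the real dataset.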

#### Dropping a Column

In [4]:
# Let's look at the shape of the DataFrame
ri.shape

(91741, 15)

* The ``county_name`` column contains only missing values: its missing count (91741) equals the number of rows (91741).
* We can drop ``county_name`` using the ``drop()`` method.

In [5]:
ri.drop('county_name', axis='columns', inplace=True)
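As a side note, ``drop()`` does not have to modify the DataFrame in place. The sketch below uses a hypothetical `toy` frame (not the police data) to show that, without ``inplace=True``, the method returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace=True, drop() returns a new DataFrame
trimmed = toy.drop("b", axis="columns")

print(toy.shape)      # the original still has all three columns
print(trimmed.shape)  # the returned copy has one column fewer
```

Which style to use is a matter of preference; assigning the result to a new name keeps the original available for comparison.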

#### Dropping Rows

Finally, let's take a look at one more method related to missing values. The ``dropna()`` method is a great way to drop rows based on the presence of missing values in that row. 

For example, let's pretend that the ``stop_date`` and ``stop_time`` columns are critical to our analysis, and thus a row is useless to us without that data. We can tell pandas to drop all rows that have a missing value in either the ``stop_date`` or ``stop_time`` column.

In [6]:
ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)
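To see what the ``subset`` argument changes, here is a minimal sketch on a hypothetical `toy` frame (the values are made up). Only the listed columns are checked for missing values, so a row survives even if some other column is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "stop_date": ["2005-01-04", np.nan, "2005-02-17"],
    "stop_time": ["12:55", "23:15", np.nan],
    "search_type": [np.nan, np.nan, np.nan],
})

# Drop rows missing 'stop_date' OR 'stop_time'; 'search_type' is ignored
kept = toy.dropna(subset=["stop_date", "stop_time"])

# Without subset, any missing value in any column drops the row
kept_strict = toy.dropna()

print(kept)
print(len(kept_strict))
```

In this toy case, the plain ``dropna()`` call removes every row because `search_type` is always missing, which is why ``subset`` matters for a column like that.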

## Exercise

### Dropping More Columns and Rows

In [8]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'state' column ('county_name' was already dropped above)
ri.drop('state', axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(91741, 14)
(91741, 13)


In [10]:
# Count the number of missing values in each column
print(ri.isnull().sum())
print('-------------------------')

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
-------------------------
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(86536, 13)


# Lesson II

## Using proper Data Types

We'll continue cleaning the dataset by ensuring that each of the columns has the proper data type.

First let's take a look at the ``dtypes`` attribute of the DataFrame:

In [11]:
ri.dtypes

stop_date             object
stop_time             object
driver_gender         object
driver_race           object
violation_raw         object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested           object
stop_duration         object
drugs_related_stop      bool
district              object
dtype: object

Every Series has a data type, which pandas inferred automatically when it read the CSV file.

As we can see, the only data types currently in use are ``object`` and ``bool``.

* ``object`` usually means strings, though it can also indicate other Python objects, such as a ``list``.

* ``bool`` is short for 'Boolean'.

* pandas also supports other data types, such as ``int``, ``float``, ``datetime``, and ``category``.


## Why do data types matter?

* Data types affect which operations we can perform on a given Series.
* Avoid storing data as strings (when possible)
    - ``int``, ``float`` : enables mathematical operations
    - ``datetime``: enables date-based attributes and methods.
    - ``category``: typically uses less memory and can run faster.
    - ``bool``: enables logical and mathematical operations.
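A short sketch of two of these points, using made-up Series rather than the police data: a ``bool`` Series supports arithmetic directly, and converting a repetitive string column to ``category`` usually reduces memory. The exact byte counts depend on the pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical flags, for illustration only
flags = pd.Series([True, False, True, True])

# Because True counts as 1, sum() is a count and mean() is a proportion
print(flags.sum())
print(flags.mean())

# A repetitive text column stored as strings vs. as 'category'
as_text = pd.Series(["Zone X4", "Zone K3"] * 1000)
as_category = as_text.astype("category")

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)
```

The saving comes from storing each distinct value once and keeping small integer codes per row, so it is largest when a column has few unique values.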


## Fixing a Data type

We'll imagine a DataFrame named ``apple``:

```python
    apple

    '''
        date        time    price
    0   2/13/18    16:00    164.34
    1   2/14/18    16:00    167.37
    2   2/15/18    16:00    172.99 
    '''

    apple.price.dtype

    '''
    dtype('O')
    '''
```

When we check the data type, pandas reports a dtype of 'O', which stands for object and means that the numbers are actually stored as strings.

To change the data type from object to float, we can use the ``astype()`` method.

```python
    apple['price'] = apple.price.astype('float')

    apple.price.dtype

    '''
    dtype('float64')
    '''
```

* Dot notation: on the right side of the assignment, ``apple.price`` is the same as ``apple['price']``.
* On the left side of the assignment, use bracket notation: ``apple['price']``.
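The snippets above assume that `apple` already exists. A self-contained version might look like the following, where the DataFrame is rebuilt by hand from the three rows shown earlier (storing the prices as strings is an assumption made to reproduce the ``object`` dtype).

```python
import pandas as pd

# Rebuild the illustrative 'apple' DataFrame; prices are stored as strings
apple = pd.DataFrame({
    "date": ["2/13/18", "2/14/18", "2/15/18"],
    "time": ["16:00", "16:00", "16:00"],
    "price": ["164.34", "167.37", "172.99"],
})

# Dot notation on the right, bracket notation on the left
apple["price"] = apple.price.astype("float")

print(apple.price.dtype)

# Mathematical operations now work on the column
total = apple.price.sum()
print(round(total, 2))
```

Once the column is numeric, aggregations such as ``sum()`` and ``mean()`` behave as expected instead of operating on text.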

## Exercise

### Fixing a Data type

In this exercise, we'll change the data type of the ``is_arrested`` column from ``object`` to ``bool``, which is the most suitable type for a column containing ``True`` and ``False`` values.

Fixing the data type will enable us to use mathematical operations on the ``is_arrested`` column that would not be possible otherwise.

In [12]:
# Check the current data type of the 'is_arrested' column
print(ri.is_arrested.dtype)

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri.is_arrested.dtype)

object
bool
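One caveat worth keeping in mind: this conversion is safe here because the rows with missing values were dropped earlier, so ``is_arrested`` has no ``NaN`` left. The sketch below uses a hypothetical Series to show what can happen otherwise, since ``NaN`` is treated as truthy when an ``object`` column is cast to ``bool``.

```python
import numpy as np
import pandas as pd

# Hypothetical object column that still contains a missing value
arrested = pd.Series([True, False, np.nan], dtype="object")

# Casting directly turns the missing value into True
converted = arrested.astype("bool")
print(converted.tolist())

# Dropping missing values first avoids the silent change
cleaned = arrested.dropna().astype("bool")
print(cleaned.tolist())
```

Checking ``isnull().sum()`` before calling ``astype('bool')`` is a cheap way to catch this.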


# Lesson III

## Creating a DatetimeIndex

Now, we're going to build a DatetimeIndex for our DataFrame.

Let's take a look at the head of our dataset again.

In [13]:
ri.head(3)

Unnamed: 0,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4


As we can see, the date and stop time of each traffic stop are stored in separate columns, both of which are 'object' columns.

In this lesson, we will combine these two columns into a single column and convert it to pandas' datetime format.

Recall the ``apple`` DataFrame from the previous lesson. Its date and time are also stored in separate columns, so the first task is to combine them using a string method.

To combine columns, we're going to use the ``str.cat()`` method, which is short for concatenate.

```python
    combined = apple.date.str.cat(apple.time, sep=' ')

    combined
    '''
    0   2/13/18 16:00
    1   2/14/18 16:00
    2   2/15/18 16:00   
    '''
```

### Converting to datetime format

To convert the combined Series to datetime format, we simply pass it to the ``to_datetime()`` function and store the result in a new column. We don't even need to specify the format of the original data; pandas infers it.

```python
    apple['date_and_time'] = pd.to_datetime(combined)
```
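Put together, the two steps can be run end to end as in the sketch below. The `apple` DataFrame is rebuilt by hand from the rows shown earlier, and passing an explicit ``format`` is optional; it is included here only as an assumption about the layout (month/day/two-digit year), to make the parsing unambiguous.

```python
import pandas as pd

# Rebuild the illustrative 'apple' DataFrame
apple = pd.DataFrame({
    "date": ["2/13/18", "2/14/18", "2/15/18"],
    "time": ["16:00", "16:00", "16:00"],
})

# Step 1: concatenate the two string columns, separated by a space
combined = apple["date"].str.cat(apple["time"], sep=" ")

# Step 2: parse the combined strings; 'format' is optional but explicit
apple["date_and_time"] = pd.to_datetime(combined, format="%m/%d/%y %H:%M")

print(apple["date_and_time"])
```

Stating the format can also help when a date such as 2/3/18 could be read as either February 3 or March 2.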

### Setting the index

To set the datetime column as the index, we'll use the ``set_index()`` method.

* A datetime index makes it easier to:
    - Filter the DataFrame by date
    - Plot the data by date

***When an existing column becomes the index, it is no longer considered to be one of the DataFrame Columns.***

```python
    apple.set_index('date_and_time', inplace=True)
```
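The following sketch illustrates both points on a hand-built version of `apple` (the numeric prices are taken from the earlier example; building the frame this way is an assumption). After ``set_index()``, the column is no longer listed among the columns, and rows can be selected by passing a date string to ``.loc``.

```python
import pandas as pd

# Rebuild the illustrative 'apple' DataFrame with a datetime column
apple = pd.DataFrame({
    "price": [164.34, 167.37, 172.99],
    "date_and_time": pd.to_datetime([
        "2018-02-13 16:00", "2018-02-14 16:00", "2018-02-15 16:00",
    ]),
})

apple.set_index("date_and_time", inplace=True)

# The former column now lives in the index, not in the columns
print("date_and_time" in apple.columns)
print(apple.index.name)

# Filter by date: select every row that falls on 2018-02-14
one_day = apple.loc["2018-02-14"]
print(one_day)
```

Selecting by a date string like this only works because the index holds datetimes rather than plain strings.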

## Exercise

### Combining Object Columns

Currently, the date and time of each traffic stop are stored in separate object columns: ``stop_date`` and ``stop_time``.

In this exercise, you'll combine these two columns into a single column, and then convert it to ``datetime`` format. This will enable convenient date-based attributes that we'll use later in the course.

In [17]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object


### Setting the index

The last step that you'll take in this chapter is to set the ``stop_datetime`` column as the DataFrame's index. By replacing the default index with a ``DatetimeIndex``, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course!

In [19]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')
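As a preview of why a ``DatetimeIndex`` is convenient, the sketch below builds a hypothetical three-row frame from the first timestamps printed above and reads date-based attributes straight from the index. The `toy` frame is an illustration only, not the real `ri` DataFrame.

```python
import pandas as pd

# Hypothetical three-row frame using the first timestamps shown above
toy = pd.DataFrame(
    {"violation": ["Equipment", "Speeding", "Speeding"]},
    index=pd.to_datetime([
        "2005-01-04 12:55:00", "2005-01-23 23:15:00", "2005-02-17 04:15:00",
    ]),
)
toy.index.name = "stop_datetime"

# Date-based attributes are available directly on the index
hours = list(toy.index.hour)
months = list(toy.index.month)
print(hours)
print(months)

# Count stops per month by grouping on an index attribute
stops_per_month = toy.groupby(toy.index.month).size()
print(stops_per_month)
```

The same pattern should apply to `ri` once `stop_datetime` is its index.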
