<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/037__Working_with_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 5/6: DATA CLEANING IN PYTHON: ADVANCED

# MISSION 4: Working with Missing Data

Identify and deal with missing and incorrect data.


## 1. Introduction

In the last mission of this course, we're going to learn more about working with missing data. As we learned in [Working with Missing and Duplicate Data](https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data), data can be missing for a variety of reasons.

In this mission, we'll learn how to handle missing data without having to drop rows and columns using data on motor vehicle collisions released by [New York City](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95) and published on the NYC OpenData website. There is data on over 1.5 million collisions dating back to 2012, with additional data continuously added.

We'll work with an extract of the full data: crashes from the year 2018. We made several modifications to the data for teaching purposes, including randomly sampling the data to reduce its size.

Our data set is in a CSV called `nypd_mvc_2018.csv`. We can read our data into a pandas dataframe and inspect the first few rows of the data:

In [1]:

# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it and select “Get shareable link”, and cut out the unique id in the link.
# https://drive.google.com/file/d/137_5T2t59aksuPV2aLP5VldshtCb9vN1/view?usp=sharing
id = "137_5T2t59aksuPV2aLP5VldshtCb9vN1"

In [3]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('nypd_mvc_2018.csv')

In [4]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it and select “Get shareable link”, and cut out the unique id in the link.
# https://drive.google.com/file/d/111gD0MnU_ekqTMK-KYgNt7uP5WEVZJCl/view?usp=sharing
id = "111gD0MnU_ekqTMK-KYgNt7uP5WEVZJCl"

In [5]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('supplemental_data.csv')

In [6]:
import pandas as pd
import numpy as np
mvc = pd.read_csv("nypd_mvc_2018.csv")

print(mvc)

       unique_key        date  ... cause_vehicle_4 cause_vehicle_5
0         3869058  2018-03-23  ...             NaN             NaN
1         3847947  2018-02-13  ...             NaN             NaN
2         3914294  2018-06-04  ...             NaN             NaN
3         3915069  2018-06-05  ...             NaN             NaN
4         3923123  2018-06-16  ...             NaN             NaN
...           ...         ...  ...             ...             ...
57859     3835191  2018-01-26  ...             NaN             NaN
57860     3890674  2018-04-29  ...             NaN             NaN
57861     3946458  2018-07-21  ...             NaN             NaN
57862     3914574  2018-06-04  ...             NaN             NaN
57863     4034882  2018-11-29  ...             NaN             NaN

[57864 rows x 26 columns]


A summary of the columns and their data is below:

- `unique_key`: A unique identifier for each collision.
- `date`, `time`: Date and time of the collision.
- `borough`: The [borough](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City), or area of New York City, where the collision occurred.
- `location`: Latitude and longitude coordinates for the collision.
- `on_street`, `cross_street`, `off_street`: Details of the street or intersection where the collision occurred.
- `pedestrians_injured`: Number of pedestrians who were injured.
- `cyclist_injured`: Number of people traveling on a bicycle who were injured.
- `motorist_injured`: Number of people traveling in a vehicle who were injured.
- `total_injured`: Total number of people injured.
- `pedestrians_killed`: Number of pedestrians who were killed.
- `cyclist_killed`: Number of people traveling on a bicycle who were killed.
- `motorist_killed`: Number of people traveling in a vehicle who were killed.
- `total_killed`: Total number of people killed.
- `vehicle_1` through `vehicle_5`: Type of each vehicle involved in the accident.
- `cause_vehicle_1` through `cause_vehicle_5`: Contributing factor for each vehicle in the accident.

Let's quickly recap how to count missing values. We'll start by creating a dataframe with random null values:

In [7]:
data = np.random.choice([1.0, np.nan],
                        size=(3, 3),
                        p=[.3, .7])
df = pd.DataFrame(data, columns=['A','B','C'])
print(df)

     A    B    C
0  1.0  1.0  1.0
1  NaN  NaN  NaN
2  1.0  NaN  NaN


Next, we can use the `DataFrame.isnull()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) to identify which values are null:

In [8]:
print(df.isnull())

       A      B      C
0  False  False  False
1   True   True   True
2  False   True   True


We can chain the result to the `DataFrame.sum()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) to count the number of null values in each column:

In [9]:
print(df.isnull().sum())

A    1
B    2
C    2
dtype: int64


Let's use this technique to count the null values in our data set.

**Instructions:**

We have read the CSV file into a pandas dataframe called `mvc`.

1. Create a series that counts the number of null values in each of the columns in the `mvc` dataframe. Assign the result to `null_counts`.

In [10]:
#### Below the solution to make the next chapter work
data = np.random.choice([1.0, np.nan],
 size=(3, 3), p=[.3, .7])
df = pd.DataFrame(data, columns=['A','B','C'])
print(df)

null_counts = mvc.isnull().sum()

     A    B    C
0  1.0  NaN  NaN
1  NaN  1.0  1.0
2  NaN  NaN  NaN


## 2. Verifying the Total Columns

To give us a better picture of the null values in the data, let's calculate the percentage of null values in each column. Below, we divide the number of null values in each column by the total number of rows in the data set and multiply by 100:



In [11]:
null_counts_pct = null_counts / mvc.shape[0] * 100
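
As a quick cross-check, the same percentages can also be obtained by taking the mean of the boolean dataframe, since the mean of a boolean column is the fraction of `True` values. The sketch below uses a small made-up dataframe, `toy`, rather than `mvc`, and its names and values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Made-up data standing in for mvc: 4 rows, some values missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Approach used in this mission: null count divided by the number of rows.
pct_from_counts = toy.isnull().sum() / toy.shape[0] * 100

# Possible shortcut: the mean of a boolean column is its fraction of True values.
pct_from_mean = toy.isnull().mean() * 100

print(pct_from_counts)
print(pct_from_mean)
```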

We'll then add both the counts and percentages to a dataframe to make them easier to compare:

In [12]:
null_df = pd.DataFrame({'null_counts': null_counts, 'null_pct': null_counts_pct})
# Rotate the dataframe so that rows become columns and vice-versa
null_df = null_df.T.astype(int)

print(null_df)

             unique_key  date  ...  cause_vehicle_4  cause_vehicle_5
null_counts           0     0  ...            57111            57671
null_pct              0     0  ...               98               99

[2 rows x 26 columns]


About a third of the columns have no null values, with the rest ranging from less than 1% to 99%!

To make things easier, let's start by looking at the group of columns that relate to people killed in collisions.

We'll use a list comprehension to reduce our summary dataframe to just those columns:



In [13]:
killed_cols = [col for col in mvc.columns if 'killed' in col]
print(null_df[killed_cols])

             pedestrians_killed  cyclist_killed  motorist_killed  total_killed
null_counts                   0               0                0             5
null_pct                      0               0                0             0


We can see that each of the individual categories has no missing values, but the `total_killed` column has five missing values.

One option for handling this would be to remove – or drop – those five rows. This would be a reasonably valid choice since it's a tiny portion of the data, but let's think about what other options we have first.

If you think about it, the total number of people killed should be the sum of each of the individual categories. We might be able to "fill in" the missing values with the sums of the individual columns for that row. The technical name for filling in a missing value with a replacement value is **imputation**.



Let's look at how we could explore the rows where `total_killed` isn't equal to the sum of the other three columns. We'll illustrate this process using a series of diagrams. The diagrams won't contain values; they'll just show a grid to represent the values.

Let's start with a dataframe of just the four columns relating to people killed:
![img](https://s3.amazonaws.com/dq-content/370/verify_totals_1.svg)



We then select just the first three columns, and manually sum each row:

![img](https://s3.amazonaws.com/dq-content/370/verify_totals_2.svg)

We then compare the manual sum to the original total column to create a boolean mask that is `True` wherever the two values are *not* equal:

![img](https://s3.amazonaws.com/dq-content/370/verify_totals_3.svg)

Lastly, we use the boolean mask to filter the original dataframe to include only rows where the manual sum and original aren't equal:

![img](https://s3.amazonaws.com/dq-content/370/verify_totals_4.svg)
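
The four steps in the diagrams can be sketched on a tiny dataframe. The values in `toy_killed` below are made up for illustration and are not taken from the collision data. Only the column names match the real data set.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the four "killed" columns (values are illustrative only).
toy_killed = pd.DataFrame({
    "pedestrians_killed": [0, 1, 0, 0],
    "cyclist_killed":     [0, 0, 0, 0],
    "motorist_killed":    [0, 0, 1, 0],
    "total_killed":       [0.0, 1.0, 0.0, np.nan],
})

# Step 1: sum the three individual columns row by row.
manual_sum = toy_killed.iloc[:, :3].sum(axis=1)

# Step 2: True where the manual sum differs from the recorded total.
# A comparison with NaN using != evaluates to True, so rows with a
# missing total are flagged as well.
mismatch = manual_sum != toy_killed["total_killed"]

# Step 3: keep only the rows where the two values disagree.
non_eq = toy_killed[mismatch]
print(non_eq)
```

In this toy data, two rows are flagged: one where the recorded total is lower than the manual sum, and one where the total is missing.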

Let's use this strategy to look at the rows that don't match up!



**Instructions:**

We created a dataframe `killed`, containing the four columns that relate to people killed in collisions.

1. Select the first three columns from `killed` and sum each row. Assign the result to `killed_manual_sum`.
2. Create a boolean mask that checks whether each value in `killed_manual_sum` is not equal to the values in the `total_killed` column. Assign the boolean mask to `killed_mask`.
3. Use `killed_mask` to filter the rows in `killed`. Assign the result to `killed_non_eq`.


In [14]:
killed_cols = [col for col in mvc.columns if 'killed' in col]
killed = mvc[killed_cols].copy()

In [15]:
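### Part of the solution to make the next chapter work
# `killed` was created in the previous cell, so we don't need to rebuild it here.
# Sum the three individual columns (everything except `total_killed`) for each row.
killed_manual_sum = killed.iloc[:,:3].sum(axis=1)
# True where the manual sum differs from the recorded total (including missing totals).
killed_mask = killed_manual_sum != killed['total_killed']
# Keep only the rows that don't match.
killed_non_eq = killed[killed_mask]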

## 3. Filling and Verifying the Killed and Injured Data

The `killed_non_eq` dataframe we created in the previous exercise contained six rows:

In [16]:
killed_non_eq.head(6)

       pedestrians_killed  cyclist_killed  motorist_killed  total_killed
3508                    0               0                0           NaN
20163                   0               0                0           NaN
22046                   0               0                1           0.0
48719                   0               0                0           NaN
55148                   0               0                0           NaN
55699                   0               0                0           NaN


We can sort these rows into two categories:

1. Five rows where the `total_killed` is not equal to the sum of the other columns because the total value is missing.
2. One row where the `total_killed` is less than the sum of the other columns.

From this, we can conclude that filling null values with the sum of the columns is a fairly good choice for our imputation, given that only six rows out of around 58,000 don't match this pattern.

We've also identified a row that has suspicious data - one that doesn't sum correctly. Once we have imputed values for all rows with missing values for `total_killed`, we'll mark this suspect row by setting its value to `NaN`.

To do this, we'll learn to use the `Series.mask()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html). `Series.mask()` is useful when you want to replace certain values in a series based on a boolean mask. The syntax for the method is:

```
Series.mask(bool_mask, val_to_replace)
```

Let's look at an example with some simple data. We'll start with a series called `fruits`:
![img](https://s3.amazonaws.com/dq-content/370/mask_1.svg)

Next, we create a boolean series that matches values equal to the string `Banana`:

![img](https://s3.amazonaws.com/dq-content/370/mask_2.svg)


Lastly, we use `Series.mask()` to replace all the values that match the boolean series with a new value, `Pear`:
![img](https://s3.amazonaws.com/dq-content/370/mask_3.svg)



If we wanted to describe the logic of the code above, we'd say *For each value in the "fruits" series, if the corresponding value in the "bool" series is true, update the value to "Pear," otherwise leave the original value.*
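
The same logic can be written out in code. This is a minimal sketch: the values in `fruits` are illustrative rather than copied from the diagrams, and the boolean series is named `is_banana` here (an assumed name) to avoid shadowing Python's built-in `bool`.

```python
import pandas as pd

# Illustrative values; the exact contents of the diagrams are not reproduced here.
fruits = pd.Series(["Apple", "Banana", "Banana"])

# Boolean series: True where the value is "Banana".
is_banana = fruits == "Banana"

# Replace every value where the mask is True with "Pear";
# values where the mask is False are left unchanged.
result = fruits.mask(is_banana, "Pear")
print(result)
```

Note that `Series.mask()` returns a new series, so `fruits` itself is unchanged unless we assign the result back.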



In the first example above, we replaced the matching values with a single value, but we can also replace them with the corresponding values from a series that has identical index labels, like this `nums` series:

![img](https://s3.amazonaws.com/dq-content/370/mask_4.svg)

Let's look at how we can update the matching values in `fruits` with the corresponding values in `nums`:
![img](https://s3.amazonaws.com/dq-content/370/mask_5.svg)
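
Here is a rough sketch of that series-to-series replacement, followed by one way the same call might be applied to the totals problem described earlier in this section. All values are made up for illustration, and `toy_killed` is a hypothetical stand-in for the real `killed` dataframe.

```python
import numpy as np
import pandas as pd

# Part 1: replace matching values with values from another series.
# dtype=object lets the result hold both strings and numbers.
fruits = pd.Series(["Apple", "Banana", "Banana"], dtype=object)
nums = pd.Series([11, 12, 13])  # same index labels (0, 1, 2) as fruits

is_banana = fruits == "Banana"
# Where the mask is True, take the value with the same index label from nums.
replaced = fruits.mask(is_banana, nums)
print(replaced)

# Part 2: the same pattern on a toy version of the "killed" columns.
toy_killed = pd.DataFrame({
    "pedestrians_killed": [0, 0, 0],
    "cyclist_killed":     [0, 0, 0],
    "motorist_killed":    [0, 0, 1],
    "total_killed":       [0.0, np.nan, 0.0],
})
manual_sum = toy_killed.iloc[:, :3].sum(axis=1)

# Impute: where the total is missing, use the manual sum instead.
total = toy_killed["total_killed"]
total = total.mask(total.isnull(), manual_sum)

# Flag: where the total still disagrees with the manual sum, set it to NaN.
total = total.mask(total != manual_sum, np.nan)
print(total)
```

The order matters in Part 2: imputing first means that any `NaN` left at the end marks a suspicious row rather than a total that was simply missing.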



## 4. Assigning the Corrected Data Back to the Main Dataframe

## 5. Visualizing Missing Data with Plots

## 6. Analyzing Correlations in Missing Data

## 7. Finding the Most Common Values Across Multiple Columns

## 8. Filling Unknown Values with a Placeholder

## 9. Missing Data in the "Location" Columns

## 10. Imputing Location Data