<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/037__Working_with_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 5/6: DATA CLEANING IN PYTHON: ADVANCED

# MISSION 4: Working with Missing Data

Identify and deal with missing and incorrect data.


## 1. Introduction

In the last mission of this course, we're going to learn more about working with missing data. As we learned in [Working with Missing and Duplicate Data](https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data), data can be missing for a variety of reasons.

In this mission, we'll learn how to handle missing data without having to drop rows and columns. We'll use data on motor vehicle collisions released by [New York City](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95) and published on the NYC OpenData website. The full data set covers over 1.5 million collisions dating back to 2012, and new data is added continuously.

We'll work with an extract of the full data: crashes from the year 2018. We made several modifications to the data for teaching purposes, including randomly sampling the data to reduce its size.

Our data set is in a CSV called `nypd_mvc_2018.csv`. We can read our data into a pandas dataframe and print it to inspect the first and last few rows:

In [1]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it, select “Get shareable link”, and copy the unique file ID from the link.
# https://drive.google.com/file/d/137_5T2t59aksuPV2aLP5VldshtCb9vN1/view?usp=sharing
file_id = "137_5T2t59aksuPV2aLP5VldshtCb9vN1"  # named file_id to avoid shadowing the built-in id()

In [3]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('nypd_mvc_2018.csv')

In [4]:
# Repeat the same steps for the supplemental data file: copy the unique file ID from its shareable link.
# https://drive.google.com/file/d/111gD0MnU_ekqTMK-KYgNt7uP5WEVZJCl/view?usp=sharing
file_id = "111gD0MnU_ekqTMK-KYgNt7uP5WEVZJCl"  # named file_id to avoid shadowing the built-in id()

In [5]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('supplemental_data.csv')

In [6]:
import pandas as pd
mvc = pd.read_csv("nypd_mvc_2018.csv")

print(mvc)

       unique_key        date  ... cause_vehicle_4 cause_vehicle_5
0         3869058  2018-03-23  ...             NaN             NaN
1         3847947  2018-02-13  ...             NaN             NaN
2         3914294  2018-06-04  ...             NaN             NaN
3         3915069  2018-06-05  ...             NaN             NaN
4         3923123  2018-06-16  ...             NaN             NaN
...           ...         ...  ...             ...             ...
57859     3835191  2018-01-26  ...             NaN             NaN
57860     3890674  2018-04-29  ...             NaN             NaN
57861     3946458  2018-07-21  ...             NaN             NaN
57862     3914574  2018-06-04  ...             NaN             NaN
57863     4034882  2018-11-29  ...             NaN             NaN

[57864 rows x 26 columns]
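
As a small, self-contained illustration of the same read-and-inspect step, the sketch below parses a three-row CSV held in a string instead of the real file. The rows are made up for illustration (only the column names come from this data set), and `csv_text`, `sample`, and `first_rows` are hypothetical names. It is meant to suggest how an empty field in a CSV ends up as a missing value once pandas reads it.

```python
import io

import pandas as pd

# Hypothetical stand-in for nypd_mvc_2018.csv: the values are made up,
# and the second row deliberately leaves `borough` empty.
csv_text = """unique_key,date,borough
1,2018-01-01,MANHATTAN
2,2018-01-02,
3,2018-01-03,QUEENS
"""

sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns only the first n rows, which is handy for large files.
first_rows = sample.head(2)
print(first_rows)

# The empty field is read as a missing value (NaN).
print(sample["borough"].isnull())
```

With the real file, the same calls would apply to `mvc`, for example `mvc.head()`.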


A summary of the columns and their data is below:

- `unique_key`: A unique identifier for each collision.
- `date`, `time`: Date and time of the collision.
- `borough`: The [borough](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City), or area of New York City, where the collision occurred.
- `location`: Latitude and longitude coordinates for the collision.
- `on_street`, `cross_street`, `off_street`: Details of the street or intersection where the collision occurred.
- `pedestrians_injured`: Number of pedestrians who were injured.
- `cyclist_injured`: Number of people traveling on a bicycle who were injured.
- `motorist_injured`: Number of people traveling in a vehicle who were injured.
- `total_injured`: Total number of people injured.
- `pedestrians_killed`: Number of pedestrians who were killed.
- `cyclist_killed`: Number of people traveling on a bicycle who were killed.
- `motorist_killed`: Number of people traveling in a vehicle who were killed.
- `total_killed`: Total number of people killed.
- `vehicle_1` through `vehicle_5`: Type of each vehicle involved in the accident.
- `cause_vehicle_1` through `cause_vehicle_5`: Contributing factor for each vehicle in the accident.

Let's quickly recap how to count missing values. We'll start by creating a dataframe with random null values:

In [None]:
import numpy as np

# Build a 3x3 array where each cell is 1.0 (30% chance) or NaN (70% chance)
data = np.random.choice([1.0, np.nan],
                        size=(3, 3),
                        p=[.3, .7])
df = pd.DataFrame(data, columns=['A','B','C'])
print(df)
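
Because the values are drawn at random, the dataframe above will differ from run to run. If a repeatable example is preferred, one option is to draw from a seeded NumPy generator. The sketch below is only an illustration: the helper name `random_null_frame` and the seed value `0` are assumptions, not part of the original mission.

```python
import numpy as np
import pandas as pd


def random_null_frame(seed):
    """Return a 3x3 frame of 1.0 / NaN values drawn with a fixed seed."""
    rng = np.random.default_rng(seed)
    data = rng.choice([1.0, np.nan], size=(3, 3), p=[0.3, 0.7])
    return pd.DataFrame(data, columns=["A", "B", "C"])


# The seed value 0 is an arbitrary choice for this sketch.
df_seeded = random_null_frame(0)
print(df_seeded)

# The same seed reproduces the same pattern of nulls;
# DataFrame.equals() treats NaNs in matching positions as equal.
print(df_seeded.equals(random_null_frame(0)))
```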

Next, we can use the `DataFrame.isnull()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) to identify which values are null:

In [None]:
print(df.isnull())

We can chain the result to the `DataFrame.sum()` [method](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) to count the number of null values in each column:

In [None]:
print(df.isnull().sum())
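
To make the counting step easier to check by eye, here is a minimal sketch that uses hand-written values instead of random ones. The names `fixed` and `null_counts` are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hand-written values so the null counts are known in advance.
fixed = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [np.nan, np.nan, np.nan],
    "C": [1.0, 1.0, np.nan],
})

# isnull() gives a boolean frame; sum() treats True as 1 and False as 0,
# so summing each column counts its null values.
null_counts = fixed.isnull().sum()
print(null_counts)
```

Here the counts are 2, 3, and 1 for columns `A`, `B`, and `C`, matching the number of `NaN` entries written into each column.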

Let's use this technique to count the null values in our data set.