### All days of the challenge:

* [Day 1: Handling missing values](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values)
* [Day 2: Scaling and normalization](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data)
* [Day 3: Parsing dates](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates/)
* [Day 4: Character encodings](https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings/)
* [Day 5: Inconsistent Data Entry](https://www.kaggle.com/rtatman/data-cleaning-challenge-inconsistent-data-entry/)
___
Welcome to day 1 of the 5-Day Data Challenge! Today, we're going to be looking at how to deal with missing values. To get started, click the blue "Fork Notebook" button in the upper right-hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

> **Your turn!** As we work through this notebook, you'll see some notebook cells (a block of either code or text) that have "Your Turn!" written in them. These are exercises for you to do to help cement your understanding of the concepts we're talking about. Once you've written the code to answer a specific question, you can run the code by clicking inside the cell (box with code in it) with the code you want to run and then hitting CTRL + ENTER (CMD + ENTER on a Mac). You can also click in a cell and then click on the right-pointing "play" arrow to the left of the code. If you want to run all the code in your notebook, you can use the double "fast forward" arrows at the bottom of the notebook editor.

Here's what we're going to do today:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values automatically](#Filling-in-missing-values-automatically)

Let's get started!

# Take a first look at the data
________

The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, I'll be using a dataset of events that occurred in American Football games for demonstration, and you'll be using a dataset of building permits issued in San Francisco.

> **Important!** Make sure you run this cell yourself or the rest of your code won't work!

In [None]:
!ls -ltr ../input

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking to see if I see any missing values, which will be represented with `NaN` or `None`.

In [None]:
# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)

Yep, it looks like there are some missing values. What about in the sf_permits dataset?

In [None]:
# your turn! Look at a couple of rows from the sf_permits dataset. Do you notice any missing data?

# your code goes here :)
sf_permits.sample(10)

# See how many missing data points we have
___

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [None]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

In [None]:
missing_values_count.sort_values(ascending=False)[0:10]

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:


**drbwa:** Arrays in numpy are used to represent matrices. As such, each row is expected to have the same length. Otherwise, numpy cannot build a regular 2-d array: depending on the version, it either falls back to a 1-d array of Python lists or raises an error.

Running `np.prod(numpyarray.shape)` gives you the total number of cells.

In [None]:
a1 = np.array([2, 4, 6])
print(f"the product of all array elements: {np.prod(a1)}")
a2 = np.array([[1, 2, 3], [4, 5, 6]])
print(f"shape of array a2: {a2.shape}")
print(f"the product of the dimensions of the 2-d array a2: {np.prod(a2.shape)}")
print("this is the total number of elements or cells in this 2-d array")

In [None]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
nfl_missing_percent = (total_missing/total_cells) * 100
print(f"nfl_missing_percent: {nfl_missing_percent:.2f}%")

Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
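
The overall percentage hides which columns are responsible for the gaps. As a rough sketch, the same idea can be applied per column by taking the mean of the boolean mask that `isnull()` returns. The `toy` data frame and its values below are made up for illustration and are not part of either dataset.

```python
import numpy as np
import pandas as pd

# toy stand-in for a real dataset (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [10, 20, 30, 40],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
percent_missing_by_column = toy.isnull().mean() * 100
print(percent_missing_by_column.sort_values(ascending=False))

# the overall figure is the same calculation over every cell at once
overall_percent_missing = toy.isnull().to_numpy().mean() * 100
print(f"overall: {overall_percent_missing:.2f}%")
```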

In [None]:
# your turn! Find out what percent of the sf_permits dataset is missing

# get the number of missing data points per column
sf_missing_values_count = sf_permits.isnull().sum()
#sf_missing_values_count

# look at the # of missing points in the first ten columns
sf_missing_values_count[0:10]

# total number of cells in sf permits dataset
sf_total_cells = np.prod(sf_permits.shape)

# total number of cells with missing values in sf permits dataset
sf_total_missing = sf_missing_values_count.sum()

# percent of data missing in sf permits dataset
sf_missing_percent = (sf_total_missing / sf_total_cells) * 100
print(f"sf_missing_percent: {sf_missing_percent:.2f}%")

# Figure out why the data is missing
____
 
This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next! :)

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, I notice that the column `TimeSecs` has a lot of missing values in it: 

In [None]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

By looking at [the documentation](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like `PenalizedTeam`, that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty then it doesn't make sense to say *which* team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
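
To make the two cases concrete, here is a minimal sketch on a made-up data frame. The column names `penalized_team` and `seconds_left` and all of the values are hypothetical stand-ins, and linear interpolation is just one possible way to guess an unrecorded clock value, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd

# hypothetical toy plays; these are not rows from the NFL dataset
plays = pd.DataFrame({
    "penalized_team": ["NE", np.nan, np.nan, "DAL"],
    "seconds_left": [3600.0, np.nan, 3550.0, 3540.0],
})

# no penalty means there is no team to name, so label that case explicitly
plays["penalized_team"] = plays["penalized_team"].fillna("neither")

# the clock value exists but was not recorded; one simple guess is to
# interpolate linearly between the neighbouring plays
plays["seconds_left"] = plays["seconds_left"].interpolate()

print(plays)
```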

> **Tip:** This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more information.

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

## Your turn!

* Look at the columns `Street Number Suffix` and `Zipcode` from the `sf_permits` datasets. Both of these contain missing values. Which, if either, of these are missing because they don't exist? Which, if either, are missing because they weren't recorded?

In [None]:
sf_missing_values_count[['Street Number Suffix', 'Zipcode']]

**drbwa:** Every street has a zip code (I think). So if we find an entry without a zip code, we can assume that this is due to a data entry / capture problem. It would make sense to try and amend the missing information (e.g., look up zip code for a given address).

On the other hand, not every street number has a suffix (e.g. '44A', '44B'). If we encounter an entry without a Street Number Suffix, it may be due to this address not having a suffix.

# Drop missing values
___

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.)  

If you're sure you want to drop rows with missing values, pandas does have a handy function, `dropna()`, to help you do this. Let's try it out on our NFL dataset!

In [None]:
# remove all the rows that contain a missing value
nfl_data.dropna()

Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the *columns* that have at least one missing value instead.

In [None]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

We've lost quite a bit of data, but at this point we have successfully removed all the `NaN`'s from our data. 
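
Dropping every column that has even one gap is quite drastic. As a sketch of two gentler options, `dropna()` also accepts `thresh` (the minimum number of non-missing values required to keep a row or column) and `subset` (only look at the listed columns when dropping rows). The `toy` data frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_full": [0.5, np.nan, 0.7, 0.9],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# keep only the columns that have at least 3 non-missing values
cols_kept = toy.dropna(axis=1, thresh=3)
print(cols_kept.columns.tolist())

# drop rows only when a column we care about is missing
rows_kept = toy.dropna(subset=["mostly_full"])
print(len(rows_kept))
```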

In [None]:
# Your turn! Try removing all the rows from the sf_permits dataset that contain missing values. How many are left?
sf_permits_dropped_rows_with_nans = sf_permits.dropna()
sf_permits_dropped_rows_with_nans

In [None]:
# Now try removing all the columns with empty values. Now how much of your data is left?
sf_permits_dropped_cols_with_nans = sf_permits.dropna(axis=1)
print(f"Number of columns in original dataset: {sf_permits.shape[1]}")
print(f"Number of columns after dropping columns with nas: {sf_permits_dropped_cols_with_nans.shape[1]}")

# Filling in missing values automatically
_____

Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.

In [None]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

We can use the pandas `fillna()` function to fill in missing values in a dataframe for us. One option we have is to specify what we want the `NaN` values to be replaced with. Here, I'm saying that I would like to replace all the `NaN` values with 0.

In [None]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)

In [None]:
# replace all NA's with the value that comes directly after it in the same column, 
# then replace all the remaining NA's with 0
subset_nfl_data.bfill(axis=0).fillna(0)
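
Filling "backwards" is only one direction. As a small sketch on a made-up Series, pandas offers both `bfill()` (take the next known value) and `ffill()` (take the previous known value). Note that each of them leaves a gap at one end, which is why a final `fillna(0)` is still needed.

```python
import numpy as np
import pandas as pd

# hypothetical toy column with gaps at the start, middle and end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

backward = s.bfill()  # each gap takes the next known value; the last gap stays NaN
forward = s.ffill()   # each gap takes the previous known value; the first gap stays NaN

print(pd.DataFrame({"original": s, "bfill": backward, "ffill": forward}))
```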

Filling in missing values is also known as "imputation", and you can find more exercises on it [in this lesson, also linked under the "More practice!" section](https://www.kaggle.com/dansbecker/handling-missing-values). First, however, why don't you try replacing some of the missing values in the sf_permits dataset?

In [None]:
# Your turn! Try replacing all the NaN's in the sf_permits data with the one that
# comes directly after it and then replacing any remaining NaN's with 0
sf_permits.bfill(axis=0).fillna(0)

And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers). 

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and set the "Visibility" dropdown to "Public".

# More practice!
___

If you're looking for more practice handling missing values, check out these extra-credit\* exercises:

* [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values): In this notebook Dan shows you several approaches to imputing missing data using scikit-learn's imputer. 
* Look back at the `Zipcode` column in the `sf_permits` dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Francisco on the [Datasets listing](https://www.kaggle.com/datasets).) 

\* no actual credit is given for completing the challenge, you just learn how to clean data real good :P

# My (drbwa's) attempt at fixing some of the missing zip codes
I can think of two approaches to reduce the number of null zip codes in the `sf_permits` dataset.

* **Use a Web service API to look up zip codes.** There are Web services that allow you to look up zip codes according to a street address or latitude and longitude coordinates. This should result in fairly accurate data and is simple to implement.
* **Use another dataset that contains (some or most of) the missing zip codes.** You need to find some way to match on relevant columns in both datasets. It is likely that performing the matching correctly will require some wizardry due to different formats being used, etc. The accuracy depends not only on the quality of the supplemental dataset, but also on the accuracy with which you can match relevant rows in both datasets.

## Idea #1: Use a Web service API to look up zip codes
The idea is to use a Web service that can return a zip code given a partial address (state, city, street, and so on). As it turns out, there are plenty of Web services that sell programmatic access to address data including zip codes. And all of them charge for lookups beyond a small number of lookups per day.

Need a plan B.

## Idea #2: Use another dataset to find missing zip codes
Plan B is to use another dataset to find (most of) the missing zip codes.

I found a dataset on data.gov about San Francisco Fire Department (SFFD) service calls. It looks like this would require a lot of work and still leaves plenty of opportunities to introduce inaccuracies that are difficult to quantify and control. First, the sffd_service_calls.address attribute describes the mid-block of a reported incident. It may thus give a zip code that is different from the one that would match the street number of the same street in sf_permits. Second, if the way I go about matching is too permissive, we might match the wrong streets. Third, it does not help that I do not really know San Francisco and thus have no intuition that would allow me to come up with more promising approaches or be able to spot errors in the results easily.

Another dataset that looks more promising is the latest address data for San Francisco from openaddresses.io: [San Francisco](https://s3.amazonaws.com/data.openaddresses.io/runs/506567/us/ca/san_francisco.zip). We need to decide whether it makes more sense to match on longitude / latitude or on street name and number. For that, we check which of the two is more complete.

In [None]:
oa_sf_df = pd.read_csv('../input/openaddress-san-francisco/san_francisco/us/ca/san_francisco.csv')
oa_sf_df.sample(5)

Let's check how many missing values the longitude / latitude and street number / name columns have in both datasets.

In [None]:
# count of null values by column
oa_missing_values_count = oa_sf_df.isnull().sum()
permits_missing_values_count = sf_permits.isnull().sum()

print(f"OpenAddress number of rows with null longitude: {oa_missing_values_count['LON']}")
print(f"OpenAddress number of rows with null latitude: {oa_missing_values_count['LAT']}")
print(f"OpenAddress number of rows with null street number: {oa_missing_values_count['NUMBER']}")
print(f"OpenAddress number of rows with null street name: {oa_missing_values_count['STREET']}")
print("*" * 3)
# Location contains a tuple of longitude and latitude values
print(f"sfpermits number of rows with null location: {permits_missing_values_count['Location']}")
print(f"sfpermits number of rows with null street number: {permits_missing_values_count['Street Number']}")
print(f"sfpermits number of rows with null street name: {permits_missing_values_count['Street Name']}")
print(f"sfpermits number of rows with null street name suffix: {permits_missing_values_count['Street Suffix']}")

It appears that in this case we have more complete data for street names and numbers than for coordinates.

We note that the street number suffix in the OpenAddress dataset is part of the street number. For example:

In [None]:
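# na=False treats rows with a missing NUMBER as "no match" so the boolean mask stays valid
oa_sf_df[oa_sf_df["NUMBER"].str.contains('A', na=False)].sample(5)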

A little experiment to figure out how I might approach 'merging' the missing data from one data frame into another one.

In [None]:
streets = pd.Series(['BUSH ST', 'BUSH ST', 'SUTTER ST', 'PACIFIC AVE'])
numbers = pd.Series(['100', '200', '60', '80'])
zipcodes = pd.Series(['4339320', np.nan, np.nan, np.nan])
a = {'Street': streets, 'Number': numbers, 'Zipcode': zipcodes}
a_df = pd.DataFrame.from_dict(a)
a_df['Street-Number'] = a_df['Street'] + '-' + a_df['Number']

streets2 = pd.Series(['CLAY ST', 'BUSH ST', 'BERRY ST', 'SUTTER ST', 'PACIFIC AVE'])
numbers2 = pd.Series(['20', '100', '40', '60', '80'])
zipcodes2 = pd.Series(['40549', '4339320', '40545', '60213', '12345'])
irrelevant = pd.Series(['A', 'B', 'C', 'D', 'E'])
b = {'Street': streets2, 'Number': numbers2, 'Zipcode': zipcodes2, 'Superfluous': irrelevant}
b_df = pd.DataFrame.from_dict(b)
b_df['Street-Number'] = b_df['Street'] + '-' + b_df['Number']

In [None]:
print(a_df)
print("\n")
print(b_df)

In [None]:
a_df['Zipcode'] = a_df['Zipcode'].fillna(a_df['Street-Number'].map(b_df.set_index('Street-Number')['Zipcode']))

In [None]:
print(a_df)

This approach seems to do the trick. Now, let's apply this to the actual datasets.
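
For comparison, the same fill can also be written as a left merge followed by `fillna`. This is only a sketch of an alternative, not the approach used in the rest of this notebook, and the frames, the `key` column and the zip code values below are made up. One nice side effect is that `validate="many_to_one"` makes pandas raise an error if the lookup table contains duplicate keys.

```python
import numpy as np
import pandas as pd

# hypothetical toy frames; "key" plays the role of the Street-Number column
left = pd.DataFrame({
    "key": ["BUSH ST-100", "BUSH ST-200", "SUTTER ST-60"],
    "Zipcode": ["94104", np.nan, np.nan],
})
right = pd.DataFrame({
    "key": ["BUSH ST-100", "SUTTER ST-60", "CLAY ST-20"],
    "POSTCODE": ["94104", "94108", "94111"],
})

# a left merge keeps every row of `left`; unmatched rows get NaN in POSTCODE
merged = left.merge(right, on="key", how="left", validate="many_to_one")

# only the missing zip codes are filled; existing values are left alone
merged["Zipcode"] = merged["Zipcode"].fillna(merged["POSTCODE"])
merged = merged.drop(columns="POSTCODE")

print(merged)
```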

Let us examine step-by-step how our approach of filling the missing zipcode values in one DataFrame with the values from another DataFrame works.

First, we note that the formats of street names in the permits and in OA data frames do not match. The former has separate columns for street names and street suffix (e.g., St, Av). The latter combines these into one column whose values are in uppercase and uses 'AVE' instead of 'Av' to denote avenues. In addition, the street number suffix is a separate column in sf_permits, but it is simply part of the NUMBER column in the OA data frame. We will deal with this last difference later.

For now, we add a STREET column to the sf_permits dataframe that is a concatenation of its Street Name and Street Suffix columns, converts all characters to uppercase and replaces occurrences of 'Av' with 'AVE'.
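
One thing to watch for when replacing 'AV' with 'AVE': a plain substring replacement also rewrites any street *name* that happens to contain those two letters. The sketch below uses made-up name and suffix values to show the difference between a plain replacement and a regular expression anchored to a whole word at the end of the string.

```python
import pandas as pd

# hypothetical toy values standing in for Street Name and Street Suffix
names = pd.Series(["Davis", "Van Ness", "Bush"])
suffixes = pd.Series(["St", "Av", "St"])

combined = (names + " " + suffixes).str.upper()

# a plain substring replace also rewrites the 'AV' inside 'DAVIS'
naive = combined.str.replace("AV", "AVE", regex=False)

# anchoring to a whole word at the end of the string only touches the suffix
anchored = combined.str.replace(r"\bAV$", "AVE", regex=True)

print(pd.DataFrame({"combined": combined, "naive": naive, "anchored": anchored}))
```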

In [None]:
# Let's first load the sf_permits DF from scratch to overwrite any changes
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

In [None]:
# Number and proportion of missing Zipcode values before...
# get the number of missing data points in the Zipcode column of the sf_permits data frame
sf_missing_zipcodes_count_before = sf_permits['Zipcode'].isnull().sum()
# total number of Zipcode rows
sf_total_zipcodes = len(sf_permits['Zipcode'])
# percent of missing zipcodes data in sf permits dataset
sf_missing_zipcodes_percent_before = (sf_missing_zipcodes_count_before / sf_total_zipcodes) * 100

In [None]:
# Add the new STREET column to the sf_permits data frame as explained above.
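# The pattern is anchored so that only a trailing 'AV' suffix becomes 'AVE' (and not, e.g., the 'AV' in 'DAVIS').
sf_permits['STREET'] = (sf_permits['Street Name'] + ' ' + sf_permits['Street Suffix']).str.upper().str.replace(r'\bAV$', 'AVE', regex=True)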

Second, we create a new data frame based on the `oa_sf_df` data frame that contains only the columns we are interested in for this problem.

In [None]:
# take an independent copy of just the columns we need
oa_update_df = oa_sf_df[['STREET', 'NUMBER', 'POSTCODE']].copy()

Third, we create a column called STREET-NUMBER in both data frames that is a concatenation of the street name and number separated by a hyphen. In the sf_permits data frame, we append any non-NaN values from its Street Number Suffix column. In the oa_update_df data frame, the number suffix is already part of its NUMBER column.

I will explain shortly how we make use of this column to match rows in both data frames as needed to get the missing zip codes, but basically I am going to use the STREET-NUMBER column as a temporary index. As there are likely to be duplicate entries in this column (i.e. same street name and number) and as all of those should have the same postcode, we only keep unique values around (for oa_update_df which will sport the temporary index).
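
The assumption that duplicate keys always share the same postcode is easy to check before dropping duplicates. Here is a minimal sketch on a made-up data frame (the addresses and postcodes are hypothetical): count the distinct postcodes per key and list the keys that have more than one.

```python
import pandas as pd

# hypothetical toy addresses; the second key has two different postcodes
addresses = pd.DataFrame({
    "STREET-NUMBER": ["BUSH ST-100", "BUSH ST-100", "CLAY ST-20", "CLAY ST-20"],
    "POSTCODE": ["94104", "94104", "94111", "94133"],
})

# number of distinct postcodes seen for each key
postcodes_per_key = addresses.groupby("STREET-NUMBER")["POSTCODE"].nunique()

# for these keys, drop_duplicates would silently keep whichever row comes first
conflicting_keys = postcodes_per_key[postcodes_per_key > 1].index.tolist()
print(conflicting_keys)
```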

In [None]:
sf_permits['STREET-NUMBER'] = sf_permits['STREET'] + '-' + sf_permits['Street Number'].astype(str) + sf_permits['Street Number Suffix'].str.upper().fillna('')
oa_update_df['STREET-NUMBER'] = oa_update_df['STREET'] + '-' + oa_update_df['NUMBER']
oa_update_df.drop_duplicates('STREET-NUMBER', inplace=True)

Fourth, it is time for some magic to get the missing values from the oa_update data frame to update the correct rows in the sf_permits data frame. This involves a couple of Pandas methods.

[Series.fillna](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html) takes either a value (e.g., a scalar such as the number 0) or a dict or Series or DataFrame of values to use to fill the missing values in the Series upon which it is invoked. We are going to make use of this to fill the missing zip code values in the `sf_permits` data frame. However, there is a slight complication as `fillna` will by default map values based on matching indices. The indices in our two data frames do not really match in any meaningful way.
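
To see the index alignment in action, here is a small sketch with two made-up Series. The filler values are deliberately listed in reverse index order: `fillna` picks the replacement by index label, not by position.

```python
import numpy as np
import pandas as pd

# hypothetical toy Series; note the reversed index on `filler`
target = pd.Series([np.nan, "b", np.nan], index=[0, 1, 2])
filler = pd.Series(["X", "Y", "Z"], index=[2, 1, 0])

# the gap at label 0 takes filler[0] ("Z"), the gap at label 2 takes filler[2] ("X")
filled = target.fillna(filler)
print(filled)
```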

In order to overcome this issue, we will make use of [Series.map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html). 

The `map` method looks up each value of the first Series in the index of the second Series and returns the corresponding value of the second Series, while keeping the index of the first Series. For example:

In [None]:
x = pd.Series([1,2,3], index=['one', 'two', 'three'])
y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
x.map(y)

If one or more of the values in x do not match the index in y, then the map method does not find a corresponding mapping:

In [None]:
x = pd.Series([0,2,3], index=['one', 'two', 'three'])
y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
x.map(y)

Another example to illustrate how map works:

In [None]:
x = pd.Series([3,2,1], index=['one', 'two', 'three'])
y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
x.map(y)

We want to fill the NaNs in the `Zipcode` column in `sf_permits` with the values from the `POSTCODE` column in the `oa_update_df` data frame. We do so by making use of two tricks. The first trick is to use the `map` method to supply values to `fillna`. How do we parameterise `map` to give us the right `Zipcode` values? The second trick consists of asking `map` to match the `STREET-NUMBER` column values of `sf_permits` to a temporary index we set on `oa_update_df` consisting of its `STREET-NUMBER` values. And for each row that matches (value from series 1 matches index values of series 2), we want `map` to use the `oa_update_df.POSTCODE` value. That is, `map` is going to match rows from both data frames on `STREET-NUMBER` (column to temporary index) and use the corresponding `POSTCODE` values from matching rows in `oa_update_df` to set the `Zipcode` column in `sf_permits`.

There is another way to explain this. We remind ourselves that the `STREET-NUMBER` columns in both data frames are a concatenation of street names, numbers and suffixes. So what we are doing here is to ask Pandas to match the street names, numbers and suffixes in both data frames and use the postcode values from the second data frame for matching rows.

Here is what all of this logic looks like in code. Magic.

In [None]:
sf_permits['Zipcode'] = sf_permits['Zipcode'].fillna(sf_permits['STREET-NUMBER'].map(oa_update_df.set_index('STREET-NUMBER')['POSTCODE']))

Next, let's check whether a street address that had a NaN value before contains a zip code after this operation.

In [None]:
sf_permits[(sf_permits['Street Name'] == 'Washington') & (sf_permits['Street Number'] == 3191)]

Finally, let us compare the before and after statistics of how many zip codes are missing.

In [None]:
# Number and proportion of missing Zipcode values after...
# get the number of missing data points in the Zipcode column of the sf_permits data frame
sf_missing_zipcodes_count_after = sf_permits['Zipcode'].isnull().sum()
# percent of missing zipcodes data in sf permits dataset
sf_missing_zipcodes_percent_after = (sf_missing_zipcodes_count_after / sf_total_zipcodes) * 100

print("BEFORE")
print(f"Zipcode rows: {len(sf_permits.Zipcode)}")
print(f"Missing zipcodes count: {sf_missing_zipcodes_count_before}")
print(f"Missing zipcodes %: {sf_missing_zipcodes_percent_before:.2f}%")
print("*" * 3)
print("AFTER")
print(f"Zipcode rows: {len(sf_permits.Zipcode)}")
print(f"Missing zipcodes count: {sf_missing_zipcodes_count_after}")
print(f"Missing zipcodes %: {sf_missing_zipcodes_percent_after:.2f}%")

We would need to run some additional tests to ensure that we did not introduce too many errors and that we end up with high quality, useful data, but the situation of missing values looks much improved.
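
One such test, sketched here on made-up data, is to run the lookup for *every* row, including the rows whose zip code was already known, and measure how often the looked-up value agrees with the recorded one. The frame, the lookup Series and all values are hypothetical, and the sketch assumes both sides store zip codes as strings in the same format.

```python
import numpy as np
import pandas as pd

# hypothetical toy permits and lookup table
permits = pd.DataFrame({
    "STREET-NUMBER": ["BUSH ST-100", "CLAY ST-20", "SUTTER ST-60", "PINE ST-5"],
    "Zipcode": ["94104", "94111", np.nan, "94108"],
})
lookup = pd.Series(
    ["94104", "94133", "94108"],
    index=["BUSH ST-100", "CLAY ST-20", "SUTTER ST-60"],
)

# look up every row, including the ones whose zip code is already known
looked_up = permits["STREET-NUMBER"].map(lookup)

# rows where both a recorded and a looked-up zip code exist can be compared
comparable = permits["Zipcode"].notna() & looked_up.notna()
agreement = (permits.loc[comparable, "Zipcode"] == looked_up[comparable]).mean()

print(f"agreement on {comparable.sum()} comparable rows: {agreement:.0%}")
```

A low agreement rate on the rows we can check would suggest that the filled-in values deserve the same suspicion.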

*The End*