The first thing we'll need to do is load in the libraries and datasets we'll be using.

In [None]:
import numpy as np
import pandas as pd

# Reading the data set
data = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking to see if there are any missing values, which will be represented with NaN or None.

In [None]:
# look at a few rows of the data file. I can see a handful of missing data already!
data.head()

Yep, it looks like there are some missing values.

# See how many missing data points we have

Ok, now we know that we do have some missing values. Let's see how many we have in each column.

In [None]:
# get the number of missing data points per column
missing_value_count = data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_value_count[0:10]

The Street Number Suffix column has a very large number of NaN values.
That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [None]:
# how many total missing values do we have?
total_cells = np.prod(data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_value_count.sum()

(total_missing/total_cells)*100

Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
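
The overall percentage hides which columns are responsible, so it can help to compute the same figure per column. Here's a minimal sketch on a tiny made-up frame (the `toy` dataframe and its values are my own stand-ins, not rows from the real file). It relies on the fact that the mean of a True/False column is the share of True values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the permits data (hypothetical values, not the real file)
toy = pd.DataFrame({
    "Permit Number": ["A1", "A2", "A3", "A4"],
    "Street Number Suffix": [np.nan, np.nan, "A", np.nan],
    "Zipcode": [94110.0, np.nan, 94103.0, 94107.0],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
percent_missing = toy.isnull().mean() * 100

# sort so the columns with the most missing data come first
percent_missing = percent_missing.sort_values(ascending=False)
print(percent_missing)
```

On the real data, the same two lines should work with `data` in place of `toy`.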

# Figure out why the data is missing

This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

    Is this value missing because it wasn't recorded or because it doesn't exist?

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation" and we'll learn how to do it next! :))
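
To make the distinction concrete, here's a small sketch built around the oldest-child example. The `survey` dataframe, its column names and its values are all made up for illustration. The idea is that a second column can sometimes tell you which kind of missing you're looking at.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for this example
survey = pd.DataFrame({
    "num_children": [0, 2, 1, 0, 3],
    "oldest_child_height_cm": [np.nan, 140.0, np.nan, np.nan, 152.0],
})

is_missing = survey["oldest_child_height_cm"].isnull()

# no children: the height doesn't exist, so NaN is the right value to keep
does_not_exist = is_missing & (survey["num_children"] == 0)

# has children but no height: the value exists, it just wasn't recorded
not_recorded = is_missing & (survey["num_children"] > 0)

print("Doesn't exist:", does_not_exist.sum())
print("Not recorded: ", not_recorded.sum())
```

Only the rows flagged by `not_recorded` would be reasonable candidates for imputation. Real datasets rarely have such a clean tell-tale column, which is where the intuition comes in.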

Let's work through an example. Looking at the number of missing values in the data dataframe, I notice that the column Street Number Suffix has a lot of missing values in it:

In [None]:
# look at the # of missing points in every column
missing_value_count

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

Look at the columns Street Number Suffix and Zipcode from the Building_Permits dataset. Both of these contain missing values. Which, if either, of these are missing because they don't exist? Which, if either, are missing because they weren't recorded?

# Drop missing values

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas does have a handy function, dropna(), to help you do this. Let's try it out on our Building_Permits dataset!

In [None]:
# remove all the rows that contain a missing value
data.dropna()

Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the columns that have at least one missing value instead.
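
A gentler option, which this notebook doesn't otherwise use, is to drop a row only when a column you actually care about is missing. Here's a sketch using the `subset` argument of `dropna()` on a made-up frame (the `toy` dataframe and the choice of Zipcode as the key column are my assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in where every row has at least one missing value (hypothetical data)
toy = pd.DataFrame({
    "Permit Number": ["A1", "A2", "A3", "A4"],
    "Street Number Suffix": [np.nan, np.nan, "A", np.nan],
    "Zipcode": [94110.0, np.nan, np.nan, 94107.0],
})

# dropping on any missing value wipes out every row, just like above
all_dropped = toy.dropna()

# only drop rows where Zipcode itself is missing
zip_known = toy.dropna(subset=["Zipcode"])

print("Rows after plain dropna():", len(all_dropped))
print("Rows with a known Zipcode:", len(zip_known))
```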


In [None]:
# remove all columns with at least one missing value
column_with_na_dropped = data.dropna(axis=1)
column_with_na_dropped.head()

In [None]:
# just how much data did we lose?
print('Columns in original dataset: %d \n' % data.shape[1])
print("Columns with na's dropped: %d \n" % column_with_na_dropped.shape[1])

We've lost quite a bit of data, but at this point we have successfully removed all the NaN's from our data.
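
If dropping every column with a single NaN feels too aggressive, one possible middle ground is to drop only the columns that are mostly empty. The sketch below uses the `thresh` argument of `dropna()`, which keeps a column only if it has at least that many non-missing values. The `toy` dataframe and the 50% cutoff are arbitrary choices of mine, not something from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the permits data (hypothetical values)
toy = pd.DataFrame({
    "Permit Number": ["A1", "A2", "A3", "A4"],
    "Street Number Suffix": [np.nan, np.nan, "A", np.nan],
    "Zipcode": [94110.0, np.nan, 94103.0, 94107.0],
})

# keep a column only if at least half of its values are present
min_non_missing = int(np.ceil(0.5 * len(toy)))
mostly_full = toy.dropna(axis=1, thresh=min_non_missing)

print(list(mostly_full.columns))
```

The surviving columns can still contain NaN values, so this trades a bit of cleanliness for keeping more of the data.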

# Filling in missing values automatically

Another option is to try and fill in the missing values. For this next bit, I'm looking at just the first few rows of the Building_Permits data so that it will print well.

In [None]:
# look at a few rows of the data file.
data.head()



We can use the pandas fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.


In [None]:
# replace all NA's with 0
data.fillna(0)
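
A single 0 for every column is a blunt tool, since 0 isn't a sensible placeholder for a text column. `fillna()` also accepts a dictionary that maps column names to fill values, so each column can get its own default. This is a sketch on made-up data: the `toy` dataframe and the fill values are my assumptions, chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one numeric column (hypothetical values)
toy = pd.DataFrame({
    "Street Number Suffix": [np.nan, "A", np.nan],
    "Estimated Cost": [4000.0, np.nan, 1500.0],
})

# one fill value per column; columns not listed here are left untouched
fill_values = {"Street Number Suffix": "none", "Estimated Cost": 0}
filled = toy.fillna(value=fill_values)

print(filled)
```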


I could also be a bit more savvy and replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)


In [None]:
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
data.bfill(axis=0).fillna(0)
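
It's easier to see what back-filling does on a handful of values. Here's a tiny sketch on a made-up column (the `readings` dataframe is hypothetical), using `bfill()`, the dedicated pandas method for back-filling.

```python
import numpy as np
import pandas as pd

# A short made-up column with gaps in the middle and at the end
readings = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, np.nan]})

# bfill copies the next non-missing value in the column upward into each gap
back_filled = readings.bfill(axis=0)

# the trailing NaNs have nothing after them, so they need a fallback value
filled = back_filled.fillna(0)

print(filled["value"].tolist())  # [1.0, 3.0, 3.0, 0.0, 0.0]
```

The gap in the middle picks up the 3.0 that follows it, while the two gaps at the end fall through to the 0 fallback.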


Filling in missing values is also known as "imputation".

