# Missing Values

We've seen a preview of how Pandas handles missing values using the None type and NumPy NaN values. Missing values are common in data cleaning, and they can arise for any number of reasons. I just want to touch on a few here.

For instance, if you are running a survey and a respondent didn't answer a question, the missing value is actually an omission. This kind of missing data is called **Missing at Random (MAR)** if there are other variables that might be used to predict the variable which is missing. In my work, when I deliver surveys, I often find that missing data, say the interest in being involved in a follow-up study, has some correlation with another data field, like gender or ethnicity. If there is no relationship to other variables, then we call this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either by the process responsible for collecting that data, such as a researcher, or because it wouldn't make sense if it were collected. This last example is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university (students generally don't have offices).
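
To make the join case concrete, here is a minimal sketch. The two tables, the names in them, and the office number are made up for illustration; only the idea (people without offices end up with a missing value after the join) comes from the text above.

```python
import pandas as pd

# Hypothetical data: names, roles and the office number are invented.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen"],
    "role": ["student", "professor", "student"],
})
offices = pd.DataFrame({
    "name": ["Ben"],
    "office": ["B-214"],
})

# A left join keeps every person; anyone without a matching office gets NaN.
joined = people.merge(offices, on="name", how="left")
print(joined)
```

Here the NaN in the office column does not mean the data was lost. It means the value never existed, which matters when deciding whether to fill it.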

Let's look at some ways of handling missing data in pandas.

Let's start by importing pandas.

In [1]:
import pandas as pd

Let's open a CSV file to work with.

In [2]:
# Load the class grades into a DataFrame
csvDataFrame = pd.read_csv("class_grades.csv")

In [3]:
csvDataFrame

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.50
1,8,95.05,105.49,67.50,99.07,68.33
2,8,83.70,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.00,107.41,73.89
...,...,...,...,...,...,...
94,8,,103.71,45.00,93.52,61.94
95,7,,80.54,41.25,93.70,39.72
96,8,89.94,102.77,87.50,90.74,87.78
97,7,95.60,76.13,66.25,99.81,85.56


First, let's look at just the first 10 rows of the DataFrame.

In [4]:
csvDataFrame.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


###### isnull()
Now let's talk about isnull(). The general syntax is
###### < Name of dataframe >.isnull()
This method checks every value stored in the DataFrame and tests whether it is None or NaN. It returns a boolean mask of the same shape, with True wherever a value is missing.

Let's see an example.

In [5]:
csvDataFrame.isnull().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False
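
Because the mask is made of booleans, it can be summed or used as a filter. The sketch below uses a small made-up stand-in for the grades table (the `grades` name and its values are assumptions, not the contents of class_grades.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the grades table; the values are made up.
grades = pd.DataFrame({
    "Assignment": [57.14, np.nan, 83.70],
    "Midterm": [64.38, 67.50, np.nan],
    "Final": [52.50, 68.33, 48.89],
})

mask = grades.isnull()                      # True where a value is missing
missing_per_column = mask.sum()             # True counts as 1, so this counts NaNs per column
rows_with_gaps = grades[mask.any(axis=1)]   # rows with at least one missing value

print(missing_per_column)
print(rows_with_gaps)
```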


We already know about the dropna() method. Let's apply it here to drop every row that contains a NaN.

In [6]:
csvDataFrame.dropna().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22


Note that none of these operations changes the original DataFrame; each one returns a new object.
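
Dropping every row with any NaN can throw away a lot of data. As a rough sketch, dropna also accepts the how, subset and thresh parameters to control which rows go. The `grades` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical grades; the last row is entirely missing.
grades = pd.DataFrame({
    "Assignment": [57.14, np.nan, 83.70, np.nan],
    "Midterm": [64.38, 67.50, np.nan, np.nan],
    "Final": [52.50, 68.33, 48.89, np.nan],
})

all_complete = grades.dropna()                   # drop rows with any NaN (the default)
not_all_empty = grades.dropna(how="all")         # drop only rows where every value is NaN
has_midterm = grades.dropna(subset=["Midterm"])  # only consider the Midterm column
at_least_two = grades.dropna(thresh=2)           # keep rows with at least 2 non-missing values

print(len(all_complete), len(not_all_empty), len(has_midterm), len(at_least_two))
```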

###### fillna(< number >)
In its simplest form this method takes a single value, such as a number. It returns a copy of the DataFrame in which every NaN and None has been replaced with the value passed to fillna.

In [7]:
csvDataFrame.fillna(99).head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,99.0,63.15,48.89
3,7,99.0,99.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,99.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
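
A single constant such as 99 is rarely a sensible grade. As a hedged sketch, fillna also accepts a dict or a Series keyed by column name, so each column can get its own fill value. The `grades` frame and the fill values chosen here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one gap per column.
grades = pd.DataFrame({
    "Assignment": [50.0, np.nan, 70.0],
    "Midterm": [np.nan, 60.0, 80.0],
})

# A dict maps each column name to the fill value for that column.
filled_by_column = grades.fillna({"Assignment": 0, "Midterm": 50})

# A Series indexed by column name works the same way; here each column
# is filled with its own mean (computed over the non-missing values).
filled_with_mean = grades.fillna(grades.mean())

print(filled_by_column)
print(filled_with_mean)
```

Whether a mean, a zero, or no fill at all is appropriate depends on why the value is missing in the first place.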


Note that the changes made by the fillna and dropna methods do not affect the original DataFrame. To modify the original DataFrame directly, we pass the inplace=True parameter.

In [8]:
csvDataFrame.fillna(0,inplace=True)

In [9]:
csvDataFrame.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89


Let's import a new data file to manipulate.

In [10]:
csvDataFrame2=pd.read_csv("log.csv",index_col=0)

In [11]:
csvDataFrame2.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,,
1469974544,cheryl,intro.html,9,,
1469974574,cheryl,intro.html,10,,
1469977514,bob,intro.html,1,,
1469977544,bob,intro.html,1,,
1469977574,bob,intro.html,1,,
1469977604,bob,intro.html,1,,
1469974604,cheryl,intro.html,11,,
1469974694,cheryl,intro.html,14,,


In this data the first column is a timestamp in the Unix epoch format. The next column is the user name followed by a web page they're visiting and the video that they're playing. Each row of the DataFrame has a playback position. And we can see that as the playback position increases by one, the time stamp increases by about 30 seconds.

Except for user Bob. It turns out that Bob has paused his playback, so as time increases the playback position doesn't change. Note too how difficult it is to derive this knowledge from the data, because it's not sorted by timestamp as one might expect. This is actually not uncommon on systems which have a high degree of parallelism. There are a lot of missing values in the paused and volume columns. It's not efficient to send this information across the network if it hasn't changed, so this particular system just inserts null values into the database if there are no changes.


So let's try to fix this problem. Earlier in this notebook we discussed the fillna method. It accepts two special inputs through its method parameter:
###### 1)ffill
###### 2)bfill
The general syntax is
###### < Name of DataFrame >.fillna(method="ffill/bfill")
Let's talk about how these inputs work. ffill stands for forward fill. It replaces each NaN with the last valid value above it in the same column, that is, the value from the nearest previous row that has one.

On the other hand, bfill stands for backward fill. It replaces each NaN with the next valid value below it in the same column. Note that pandas 2.1 and later deprecate the method parameter of fillna in favor of the equivalent ffill() and bfill() methods.
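
The difference between the two directions is easiest to see on a tiny Series. This is a minimal sketch with made-up values; it uses the ffill() and bfill() methods, which behave like the corresponding fillna inputs.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # 1.0, 1.0, 1.0, 4.0, 4.0
backward = s.bfill()   # 1.0, 4.0, 4.0, 4.0, NaN (nothing comes after the last row)

print(forward.tolist())
print(backward.tolist())
```

A forward fill leaves leading gaps untouched and a backward fill leaves trailing gaps untouched, because there is no neighbor to copy from.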

Let's now use one of these inputs to fix our DataFrame.

In [12]:
csvDataFrame2=csvDataFrame2.reset_index()

In [13]:
csvDataFrame2=csvDataFrame2.set_index(["user","playback position"])

In [14]:
csvDataFrame2.head(30)

user,playback position,time,video,paused,volume
cheryl,5,1469974424,intro.html,False,10.0
cheryl,6,1469974454,intro.html,,
cheryl,9,1469974544,intro.html,,
cheryl,10,1469974574,intro.html,,
bob,1,1469977514,intro.html,,
bob,1,1469977544,intro.html,,
bob,1,1469977574,intro.html,,
bob,1,1469977604,intro.html,,
cheryl,11,1469974604,intro.html,,
cheryl,14,1469974694,intro.html,,


In [15]:
# bfill() is the current spelling of fillna(method="bfill"), which pandas 2.1+ deprecates
csvDataFrame2 = csvDataFrame2.bfill()

In [16]:
csvDataFrame2

user,playback position,time,video,paused,volume
cheryl,5,1469974424,intro.html,False,10.0
cheryl,6,1469974454,intro.html,False,10.0
cheryl,9,1469974544,intro.html,False,10.0
cheryl,10,1469974574,intro.html,False,10.0
bob,1,1469977514,intro.html,False,10.0
bob,1,1469977544,intro.html,False,10.0
bob,1,1469977574,intro.html,False,10.0
bob,1,1469977604,intro.html,False,10.0
cheryl,11,1469974604,intro.html,False,10.0
cheryl,14,1469974694,intro.html,False,10.0


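Above we tried to fix our DataFrame, and the missing paused and volume values in the rows shown are now filled. Keep in mind that bfill copies values from later rows. Because this system omits a value when it has not changed since an earlier record, a forward fill on data sorted by time would match the logging semantics more closely. With that caveat noted, let's move on.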
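
As a hedged alternative for this kind of log, one could sort by time first and then forward fill within each user, so that a user's last known setting carries forward. The toy `log` frame below is made up, and filling per user is an assumption about how the logging system works, not something the data file confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical log: rows arrive out of order, and unchanged settings are left empty.
log = pd.DataFrame({
    "time": [130, 100, 160, 110, 140],
    "user": ["bob", "cheryl", "cheryl", "bob", "bob"],
    "paused": [np.nan, False, np.nan, True, np.nan],
    "volume": [np.nan, 10.0, np.nan, 5.0, np.nan],
})

# 1) Put the events in chronological order.
log = log.sort_values("time")

# 2) Forward fill within each user so one user's settings never leak into another's.
log[["paused", "volume"]] = log.groupby("user")[["paused", "volume"]].ffill()

print(log)
```

In this toy data Bob's rows all end up paused with volume 5.0, while Cheryl's keep her own values, which a plain fill over the unsorted rows would not guarantee.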

Let's talk about another method in this set.
###### replace
The general syntax is
###### replace(a,b)
where a is the value we want to replace and b is the value we want to put in its place.

Let's see a few examples to understand this.

In [17]:
# Let's first create a new DataFrame.
newDataframe=pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],
                          "B":["A","B","C","D","E","F","G","H","I","J"],
                          "C":["i","ii","iii","iv","v","vi","vii","viii","ix","x"]})

In [18]:
newDataframe

Unnamed: 0,A,B,C
0,1,A,i
1,2,B,ii
2,3,C,iii
3,4,D,iv
4,5,E,v
5,6,F,vi
6,7,G,vii
7,8,H,viii
8,9,I,ix
9,10,J,x


In [19]:
# Example 1: Let's say we want to replace 2 with 200.
newDataframe.replace(2,200)

Unnamed: 0,A,B,C
0,1,A,i
1,200,B,ii
2,3,C,iii
3,4,D,iv
4,5,E,v
5,6,F,vi
6,7,G,vii
7,8,H,viii
8,9,I,ix
9,10,J,x


In [20]:
newDataframe

Unnamed: 0,A,B,C
0,1,A,i
1,2,B,ii
2,3,C,iii
3,4,D,iv
4,5,E,v
5,6,F,vi
6,7,G,vii
7,8,H,viii
8,9,I,ix
9,10,J,x


As we can see, the original DataFrame is unchanged.

In [21]:
# Example 2: Replacing A with Z
newDataframe.replace("A","Z")

Unnamed: 0,A,B,C
0,1,Z,i
1,2,B,ii
2,3,C,iii
3,4,D,iv
4,5,E,v
5,6,F,vi
6,7,G,vii
7,8,H,viii
8,9,I,ix
9,10,J,x
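
replace can also handle several values in one call. The sketch below shows the list and dict forms on a small made-up frame. The column names `num` and `letter` are hypothetical, chosen so they cannot be confused with the values being replaced.

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "letter": ["A", "B", "C"]})

# Two lists pair up element by element: 1 -> 100 and 2 -> 200.
swapped_many = df.replace([1, 2], [100, 200])

# A flat dict maps old values to new values anywhere in the frame.
swapped_dict = df.replace({"A": "Z", "B": "Y"})

# A nested dict limits the replacement to one column: only "C" in letter changes.
only_letter = df.replace({"letter": {"C": "X"}})

print(swapped_many)
print(swapped_dict)
print(only_letter)
```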


What else can replace do? Well, it supports regexes. Yes, regexes are back again.
Let's see the general syntax for combining replace with regexes.
###### replace(to_replace="< Regex pattern you want to replace >", value="< What you want to put in place of the replaced value >", regex=True)

Let's see an example to understand this better.

In [22]:
regexDatabase=pd.read_csv("log.csv")

In [23]:
regexDatabase.head(10)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


Here is what we are going to do. We will replace every page name in the video column with the string "fu*k regexes". Let's do it.

In [24]:
# Raw string, with the dot escaped so it matches a literal "." before html
regexDatabase.replace(to_replace=r"\w*\.html$", value="fu*k regexes", regex=True).head()

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,fu*k regexes,5,False,10.0
1,1469974454,cheryl,fu*k regexes,6,,
2,1469974544,cheryl,fu*k regexes,9,,
3,1469974574,cheryl,fu*k regexes,10,,
4,1469977514,bob,fu*k regexes,1,,
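
With regex=True only the part of the string that matches is substituted, and capture groups can be reused in the replacement. The sketch below runs on a made-up two-row frame shaped like the log. The `.mp4` extension is purely an illustrative assumption.

```python
import pandas as pd

# Hypothetical rows shaped like the log data.
log = pd.DataFrame({
    "user": ["cheryl", "bob"],
    "video": ["intro.html", "advanced.html"],
    "playback position": [5, 1],
})

# Strip the extension: only the matched ".html" suffix is removed.
trimmed = log.replace(to_replace=r"\.html$", value="", regex=True)

# Reuse the captured page name (\1) in the replacement string.
renamed = log.replace(to_replace=r"^(\w+)\.html$", value=r"\1.mp4", regex=True)

print(trimmed)
print(renamed)
```

Non-string columns such as playback position are left alone, since the pattern only applies to string values.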


One last note on missing values. When you use statistical functions on DataFrames, these functions typically ignore missing values. For instance, if you calculate the mean of a DataFrame, pandas skips missing values by default (skipna=True) rather than returning NaN. This is usually what you want, but you should be aware that values are being excluded. Why you have missing values really matters, depending upon the problem you are trying to solve. It might be unreasonable to infer missing values, for instance, if the data shouldn't exist in the first place.
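
A quick way to see this behavior is to compare the default mean with one computed with skipna=False, and to check how many values were actually used. The `scores` Series is made up for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([80.0, np.nan, 100.0])

mean_default = scores.mean()               # NaN is skipped: (80 + 100) / 2 = 90.0
mean_strict = scores.mean(skipna=False)    # NaN propagates, so the result is NaN
n_used = scores.count()                    # number of non-missing values: 2

print(mean_default, mean_strict, n_used)
```

Reporting the count alongside the statistic makes it clear how much data the number is really based on.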