# Missing Data


## 1. What is Missing Data?
There are many ways that data can go missing. An intuitive way to see this is to think of the data we enter in online forms. When we fill in an online (survey) form or create a new account, there may be **required fields** and **optional fields**. The fields that we leave empty will be missing, compared with other people who fill in every field. In the past, people often used a special number to represent missing data. This number acts as a symbol for missing data. For example, if the age is missing, they may put -1 or 999 instead. The problem is that the chosen symbol may match real data in the (far) future. In pandas, the symbol that represents missing data is **NaN**.
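
The risk of a number standing in for missing data can be seen in a small sketch. The ages below are made-up illustration values, not taken from any dataset in this notebook. The point is only that a marker such as 999 silently enters calculations, while pandas skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the third respondent left the field empty
ages_marker = pd.Series([25, 30, 999])    # 999 used as a "missing" marker
ages_nan = pd.Series([25, 30, np.nan])    # NaN used instead

# The marker is treated as a real age and distorts the mean
print(ages_marker.mean())

# NaN is skipped, so only 25 and 30 are averaged
print(ages_nan.mean())  # 27.5
```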

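**NaN** actually comes from NumPy. Note that NumPy 2.0 removed the capitalized `NaN` alias, so the import below may fail on newer installations; the lowercase `nan` works in every version.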

In [1]:
# We can write NaN in many ways
from numpy import NaN, NAN, nan

We have to understand that missing means there is no data. This is not the same as having data whose value is nothing (such as 0). A missing value is **neither True nor False**.

In [2]:
print(NaN == True)

False


In [3]:
print(NaN == False)

False


Missing data is not the same as having a value that represents nothing (such as 0).

In [4]:
print(NaN == 0)

False


Missing data is not the same as having a string object whose value is an empty string.

**An empty chair is not the same as a missing chair.**

In [8]:
print(NaN == '')

False


NaN is a state rather than a value, so it is not even equal to itself.

**A good person is not equal to another good person.**

In [7]:
print(NaN == NaN)

False


In [6]:
print(NaN == nan)

False


In [5]:
print(NaN == NAN)

False
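
Because `==` never matches NaN, an equality test cannot be used to find missing values. The following is a minimal sketch of checks that do work on a single float, using only the standard library and NumPy.

```python
import math
import numpy as np

x = np.nan

# NaN is not equal to itself, so this comparison is True only for NaN
print(x != x)          # True

# More readable checks for a single float
print(math.isnan(x))   # True
print(np.isnan(x))     # True
```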


## 2. How do we check that the value is NaN?
We can test whether a value is missing by calling the function **isnull(x)** that comes with pandas. If the data is missing, isnull will return **True**.

In [9]:
import pandas as pd

In [10]:
print(pd.isnull(NaN))

True


In [11]:
print(pd.isnull(nan))

True


In [12]:
print(pd.isnull(NAN))

True


The opposite function is **notnull(x)**. This function returns **True** when the value is not missing.

In [13]:
print(pd.notnull(NaN))

False


In [14]:
print(pd.notnull(5))

True


In [15]:
print(pd.notnull(''))

True


In [16]:
print(pd.notnull('NaN'))

True
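
Both functions also work element by element on a whole Series, not only on single values. The sketch below uses made-up values to show this. It also shows that pandas treats Python's `None` as missing, while an empty string is not.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed values: a number, NaN, an empty string and None
values = pd.Series([5, np.nan, "", None])

# isnull returns one boolean per element
mask = pd.isnull(values)
print(mask.tolist())                 # [False, True, False, True]

# notnull is the element-wise opposite
print(pd.notnull(values).tolist())   # [True, False, True, False]
```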


## VS Code Tips
### Let's install these **VS Code** extensions to help us work with csv files effectively.
(1) Excel Viewer

(2) Rainbow CSV

## 3. Importing CSV with missing data
If we have to read csv data, chances are that some fields may be missing (unless we work with pre-processed/cleaned data).

**read_csv** already has a mechanism to detect missing data and mark it as **NaN**.

First, let's look at the content of the csv file (**muicNaN.csv**) in **preview**.

We can see that there are 2 spots where the data is missing.

![CSV File](./assets/01PreviewCSV.png)

Now, we read the csv into our notebook with read_csv. Notice how pandas handles the missing values. We can see that those two spots are set to **NaN**.

In [9]:
df = pd.read_csv("./datasets/muicNaN.csv")
df

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,,2.4
4,6780005,Moe Otto,ICCS,3.1


If we don't want pandas to set them to **NaN** and want them to stay blank instead, we can set **keep_default_na=False**. The empty spots will then be read as empty strings.

In [11]:
df = pd.read_csv("./datasets/muicNaN.csv", keep_default_na=False)
df

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,,2.4
4,6780005,Moe Otto,ICCS,3.1


If we test whether the value at row 1, column 3 is null, we get this.

In [14]:
x = df.iloc[1,3] # Set the field that has no data in the csv as x
print(pd.isnull(x))  # --> False because it is an empty string

False


In [13]:
print(x == '')  # Check whether x is an empty string

True


In [104]:
y = df.iloc[3,2] # Set the field that has no data in the csv as y
print(pd.isnull(y))# --> False

False


In [105]:
print(y == '')  # Check whether y is an empty string

True
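
The effect of **keep_default_na** can be reproduced without any file on disk. The sketch below assumes a tiny made-up csv held in a string (`csv_text` is a hypothetical stand-in for a real file) and reads it twice through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical csv content with two empty fields
csv_text = "ID,Major,GPA\n1,ICCH,3.2\n2,ICJD,\n3,,2.4\n"

# Default behaviour: empty fields become NaN
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.isnull().sum().sum())   # 2

# keep_default_na=False: empty fields stay as empty strings
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw_df.isnull().sum().sum())       # 0
print(raw_df.loc[1, "GPA"] == "")        # True
```

Note that a numeric column containing an empty string can no longer be stored as numbers, so the whole column is kept as text.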


### When the file uses another symbol to represent missing data

We can ask pandas to mark those fields as NaN as well.

First, let's look at the csv file we want to import.

Assume that -1 is used to mark missing data (a GPA can never be negative).

![CSV File](./assets/02PreviewCSV2.png)

If we import it the usual way, we will get this.

Notice that only the empty field is set to **NaN**; the -1 values are kept as they are.

In [106]:
df = pd.read_csv("./datasets/muicNaN2.csv")
df

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,-1.0
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,-1,2.4
4,6780005,Moe Otto,,-1.0


Now, we can tell pandas to mark **-1** as **NaN** as well. This way, both the default missing values and -1 will be marked as **NaN**.

In [107]:
df = pd.read_csv("./datasets/muicNaN2.csv", na_values=-1)
df

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,,2.4
4,6780005,Moe Otto,,


If we don't want pandas to mark the default missing values, we can also set **keep_default_na=False**. Then only -1 is marked as **NaN**, and empty fields are kept as empty strings.

In [16]:
df = pd.read_csv("./datasets/muicNaN2.csv", na_values=-1, keep_default_na=False)
df

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,,2.4
4,6780005,Moe Otto,,
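
A single marker for the whole file can be too broad, because -1 might be a legitimate value in another column. **na_values** also accepts a dictionary that maps a column name to its own markers. The sketch below uses a small made-up csv string (`csv_text` is hypothetical) and only treats -1 as missing in the GPA column.

```python
import io
import pandas as pd

# Hypothetical csv content where -1 appears in two different columns
csv_text = "ID,Major,GPA\n1,ICCH,3.2\n2,-1,2.4\n3,ICCS,-1\n"

# Only the GPA column treats -1 as a missing-data marker
per_column = pd.read_csv(io.StringIO(csv_text), na_values={"GPA": [-1]})

# The -1 in GPA is read as NaN, the -1 in Major is left untouched
print(per_column)
```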


## 4. Entering Missing Data Manually
If we want to enter a value as missing data, we can use the keywords **NaN, NAN, nan**.

```python
from numpy import NaN, NAN, nan
```

We can use **NaN** in a Series or a DataFrame.

In [109]:
ser = pd.Series([20,10,nan,5])
ser

0    20.0
1    10.0
2     NaN
3     5.0
dtype: float64
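
The output above shows `20.0` rather than `20`. NaN is a floating-point value, so a Series of integers is converted to floats as soon as it contains one. A minimal sketch of this behaviour, using the same numbers:

```python
import numpy as np
import pandas as pd

complete = pd.Series([20, 10, 5])            # no missing data
with_gap = pd.Series([20, 10, np.nan, 5])    # one missing value

# Without NaN the Series keeps an integer dtype
print(complete.dtype)

# With NaN the whole Series is stored as floats
print(with_gap.dtype)   # float64
```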

In [110]:
df = pd.DataFrame({"Name":["Tom","Ted","Tim","Time"],"Club":["Board Game Club",NaN,NAN,nan]})
df

Unnamed: 0,Name,Club
0,Tom,Board Game Club
1,Ted,
2,Tim,
3,Time,


We can also create a new column that is filled entirely with NaN.

In [111]:
df["Hobby"] = NaN

## 5. Counting Missing Data
### 5.1 Count number of missing data

It is useful to check whether a dataframe has missing data. If it does, we can see whether the proportion of missing data is large or small, and then decide how to **handle** it properly.

We can use the **count** function to count the non-missing values in each column.

In [30]:
ebola = pd.read_csv('./datasets/country_timeseries.csv')
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


**count()** returns a Series with the number of non-missing values in each column. A column with 0 non-missing values will also show up.

In [32]:
cnt = ebola.count() # count data per column
cnt

Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64

In [14]:
# Let's see how many rows there are
num_rows = ebola.shape[0]
num_rows

122

In [37]:
# The difference between the number of rows and count() is the number of missing values in each column
num_missing = num_rows - ebola.count()
print(num_missing)

Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64


In [38]:
# If we sum them together, we can get the total number of missing data
print(num_missing.sum())

1214
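
To judge whether the amount of missing data is large or small, the counts can be divided by the number of rows. The sketch below uses a small made-up dataframe (`df_demo` is hypothetical) so that it runs without the csv file.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 8.0],
})

# Missing values per column: number of rows minus non-missing count
missing_per_column = len(df_demo) - df_demo.count()

# Share of missing values per column (0.0 = none, 1.0 = all)
missing_share = missing_per_column / len(df_demo)
print(missing_share)
```

Here column A is missing a quarter of its values and column B three quarters.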


### 5.2 Count total number of missing data
Another way to count the total number of missing values is to check whether each value is **null**.

**isnull(x)** works with a single value or an array.

If the data is **NaN**, it will return True.



In [39]:
import numpy as np
flags = ebola.isnull()
flags

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,False,False,False,True,False,True,True,True,True,True,False,True,False,True,True,True,True,True
1,False,False,False,True,False,True,True,True,True,True,False,True,False,True,True,True,True,True
2,False,False,False,False,False,True,True,True,True,True,False,False,False,True,True,True,True,True
3,False,False,True,False,True,True,True,True,True,True,True,False,True,True,True,True,True,True
4,False,False,False,False,False,True,True,True,True,True,False,False,False,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,False,False,False,False,False,True,True,True,True,True,False,False,False,True,True,True,True,True
118,False,False,False,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True
119,False,False,False,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True
120,False,False,False,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True


Each True means that the value is **NaN**.

**In NumPy, True is considered nonzero.**

In [40]:
np.count_nonzero(True)

1

In [41]:
np.count_nonzero(False)

0

Therefore, if we feed the **flags** dataframe into np.count_nonzero(x), we can count the total number of **NaN** values.

In [45]:
print(np.count_nonzero(flags)) # Compare this number with what we calculated earlier

1214


We can also count the missing values in any particular column.

In [48]:
# Another example, counting missing data in Cases_Nigeria
Ngr_flags = ebola["Cases_Nigeria"].isnull()
Ngr_flags

0      True
1      True
2      True
3      True
4      True
       ... 
117    True
118    True
119    True
120    True
121    True
Name: Cases_Nigeria, Length: 122, dtype: bool

In [49]:
print(np.count_nonzero(Ngr_flags))

84
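
Since True counts as 1, the boolean flags can also be summed directly with pandas. The sketch below uses a small made-up dataframe (`df_demo` is hypothetical) and checks that both approaches give the same total.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
})

flags_demo = df_demo.isnull()

# Summing the flags gives the missing count per column
per_column = flags_demo.sum()
print(per_column)

# Summing again gives the total for the whole dataframe
total = int(per_column.sum())
print(total)                                     # 3
print(np.count_nonzero(flags_demo.to_numpy()))   # 3
```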


## 6. Dealing with Missing Data
### Cleaning fields with missing data
There are many ways to deal with missing data. We may remove the rows with missing data, or we may replace each missing value with a value that is sensible in its place.

Let's create a simple dataframe to test.

In [51]:
df = pd.DataFrame({"X":[10,20,30,NaN,NaN,60,70,80,NaN,NaN],"Y":[NaN,NaN,15,20,25,NaN,NaN,NaN,45,50]})
df

Unnamed: 0,X,Y
0,10.0,
1,20.0,
2,30.0,15.0
3,,20.0
4,,25.0
5,60.0,
6,70.0,
7,80.0,
8,,45.0
9,,50.0


### 6.1 Replace NaN with a value
This method replaces NaN with a fixed value.
We can use it when we know that there should be a default value for that data point.

Ex: The amount of sugar in the coffee has a default value. If the barista didn't key it in, we may assume that it is a regular cup of coffee.

If we replace the missing value with **1.23**, we will get this.

![replace](./assets/03replace.png)

In [52]:
df_clean1 = df.fillna(1.23)
df_clean1

Unnamed: 0,X,Y
0,10.0,1.23
1,20.0,1.23
2,30.0,15.0
3,1.23,20.0
4,1.23,25.0
5,60.0,1.23
6,70.0,1.23
7,80.0,1.23
8,1.23,45.0
9,1.23,50.0
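
A single fixed value is rarely sensible for every column. `fillna` also accepts a dictionary that gives each column its own replacement. The sketch below uses a small made-up dataframe (`df_demo` is hypothetical); filling Y with its mean is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    "X": [10.0, np.nan, 30.0],
    "Y": [np.nan, 20.0, 40.0],
})

# X gets a default of 0, Y gets the mean of its known values (30.0)
filled = df_demo.fillna({"X": 0, "Y": df_demo["Y"].mean()})
print(filled)
```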


### 6.2 Fill Forward
Replace each NaN with the most recent non-missing value before it. A NaN at the very start has no earlier value, so it stays NaN.

![Fill Forward](./assets/04ffill.png)

In [53]:
df_clean2 = df.ffill()  # same result as fillna(method='ffill'), which newer pandas versions deprecate
df_clean2

Unnamed: 0,X,Y
0,10.0,
1,20.0,
2,30.0,15.0
3,30.0,20.0
4,30.0,25.0
5,60.0,25.0
6,70.0,25.0
7,80.0,25.0
8,80.0,45.0
9,80.0,50.0
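
Filling forward across a long gap may carry an old value too far. The `limit` parameter caps how many consecutive NaN values are filled. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Fill at most one NaN after each known value
limited = s.ffill(limit=1)
print(limited)
```

Only the first NaN in the gap is replaced by 1.0; the second one stays NaN.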


### 6.3 Fill Backward
Replace each NaN with the next available value after it. A NaN at the very end has no later value, so it stays NaN.

![back fill](./assets/05bfill.png)

In [54]:
df_clean3 = df.bfill()  # same result as fillna(method='bfill'), which newer pandas versions deprecate
df_clean3

Unnamed: 0,X,Y
0,10.0,15.0
1,20.0,15.0
2,30.0,15.0
3,60.0,20.0
4,60.0,25.0
5,60.0,45.0
6,70.0,45.0
7,80.0,45.0
8,,45.0
9,,50.0
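
Forward fill leaves NaN at the start and backward fill leaves NaN at the end. One possible way to cover both ends is to chain the two calls. This is a sketch with a made-up Series, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical series with NaN at the start, in the middle and at the end
s = pd.Series([np.nan, 15.0, np.nan, 25.0, np.nan])

# ffill covers the middle and the end, bfill then covers the start
both = s.ffill().bfill()
print(both.tolist())   # [15.0, 15.0, 15.0, 25.0, 25.0]
```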


### 6.4 Interpolate
This method linearly interpolates the missing values.

We may use **interpolate** when we know that the missing value should lie within the range of the surrounding data points.

Ex1:
If a group of friends always comes to the restaurant together and one person's age is missing, the interpolated age should not be far from that person's real age.

Ex2: If the temperature at 9:00am is missing, we can guess that it should be somewhere between the temperatures at 8:00am and 10:00am.

Note: the default direction of interpolate is **forward**, so NaN values after the last known value are filled with that value, while NaN values before the first known value stay NaN.

![interpolate](./assets/06interpolate.png)

In [55]:
df_clean4 = df.interpolate()
df_clean4

Unnamed: 0,X,Y
0,10.0,
1,20.0,
2,30.0,15.0
3,40.0,20.0
4,50.0,25.0
5,60.0,30.0
6,70.0,35.0
7,80.0,40.0
8,80.0,45.0
9,80.0,50.0
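
The values 40.0 and 50.0 in column X come from drawing a straight line between the known neighbours 30 and 60. The sketch below repeats that gap in a small Series and compares the result with the straight-line formula computed by hand.

```python
import numpy as np
import pandas as pd

# The gap from column X: 30, two missing values, then 60
s = pd.Series([30.0, np.nan, np.nan, 60.0])
filled = s.interpolate()

# Straight line by hand: y0 + (y1 - y0) * (i - i0) / (i1 - i0)
y0, y1, i0, i1 = 30.0, 60.0, 0, 3
manual = [y0 + (y1 - y0) * (i - i0) / (i1 - i0) for i in range(4)]

print(filled.tolist())
print(manual)   # [30.0, 40.0, 50.0, 60.0]
```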


### 6.5 Drop the rows with missing value
Simply drop (remove) every row that has a missing value.
If we have a lot of missing values, this method will significantly reduce the number of observations.

In [73]:
df_clean5 = df.dropna()
df_clean5

Unnamed: 0,X,Y
2,30.0,15.0
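
Dropping every row with any NaN leaves only one row here. `dropna` has parameters that make it less strict. The sketch below rebuilds the same test dataframe (named `df_demo` here so it does not overwrite `df`) and shows two of them.

```python
import numpy as np
import pandas as pd

nan = np.nan
df_demo = pd.DataFrame({
    "X": [10, 20, 30, nan, nan, 60, 70, 80, nan, nan],
    "Y": [nan, nan, 15, 20, 25, nan, nan, nan, 45, 50],
})

# Default: drop a row if any column is missing
print(len(df_demo.dropna()))                # 1

# Only look at column X when deciding what to drop
print(len(df_demo.dropna(subset=["X"])))    # 6

# Drop a row only when every column is missing
print(len(df_demo.dropna(how="all")))       # 10
```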
