# **Lecture 9A**
# **Missing Values in DataFrames**


Another common problem in real data is missing values: some of the data were not collected or are unavailable. They have to be represented correctly in Python; otherwise they will cause trouble when performing data analysis.

Before running the examples in this notebook, make sure that you execute the two cells below to connect to Google Drive and load the pandas module.

In [2]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# We also need the pandas module in this lecture
# Import the pandas module
import pandas as pd

---
**Example 1:** In the data file quizscores.xlsx, some of the cells are empty. Those empty cells will show up as **NaN** in DataFrames.
* **NaN** means **n**ot **a** **n**umber.
* It has a numeric type, **numpy.float64** (defined by the numpy module).
* Pandas will use **NaN** for both numeric and string missing data.
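
The points above can be checked on a small scale. The sketch below uses a hypothetical in-memory DataFrame (not the course file) that mirrors a few rows of the quiz data. It also shows one property worth knowing: NaN is not equal to itself, so `pd.isna()` is the safer way to test for a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data mirroring a few rows of the quiz table
df = pd.DataFrame({"name": ["John", None, "Susan"],
                   "quiz1": [np.nan, 66.0, 0.0]})

value = df.loc[0, "quiz1"]
print(type(value))                  # a NumPy floating-point type
print(value == value)               # NaN never compares equal to itself
print(pd.isna(value))               # pd.isna() detects the missing number
print(pd.isna(df.loc[1, "name"]))   # ... and the missing string
```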

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
display(quiz)
print(type(quiz.loc[0,"quiz1"]))

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


<class 'numpy.float64'>


In [None]:
# Read quizscores.csv data file
# Pandas handles missing values similarly in CSV files
quiz = pd.read_csv("/content/drive/MyDrive/Data/quizscores.csv")
display(quiz)
print(type(quiz.loc[0,"quiz1"]))

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,71.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


<class 'numpy.float64'>


In [4]:
# Read quizscores2.csv data file
# Pandas handles missing values similarly in CSV files
quiz = pd.read_csv("/content/drive/MyDrive/Data/quizscores2.csv")
display(quiz)
print(type(quiz.loc[0,"quiz1"]))
print(type(quiz.loc[0,"quiz5"]))


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,71.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


<class 'numpy.float64'>
<class 'str'>


---
**Example 2:** Counting non-missing values in a DataFrame can easily be done with the **df.describe()** or **df.count()** methods.
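
As a minimal sketch on hypothetical in-memory data (not the course file), `count()` can also be combined with `len()` to get the number of missing values, and `axis=1` counts non-missing values per row instead of per column.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data (not the course file)
df = pd.DataFrame({"name": ["John", None, "Susan"],
                   "quiz1": [np.nan, 66.0, 0.0],
                   "quiz2": [20, 64, 95]})

non_missing = df.count()           # non-missing values per column
missing = len(df) - non_missing    # missing values per column
per_row = df.count(axis=1)         # non-missing values per row

print(non_missing)
print(missing)
print(per_row)
```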


In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")

# Count the no. of non-missing cases

display(quiz.describe())
display(quiz.count())

Unnamed: 0,studentid,quiz1,quiz2,quiz3,quiz4,quiz5
count,5.0,4.0,5.0,3.0,4.0,0.0
mean,2.0,32.0,55.0,78.0,51.25,
std,1.581139,26.981475,36.925601,28.0,38.27423,
min,0.0,0.0,13.0,46.0,16.0,
25%,1.0,22.5,20.0,68.0,21.25,
50%,2.0,31.0,64.0,90.0,46.5,
75%,3.0,40.5,83.0,94.0,76.5,
max,4.0,66.0,95.0,98.0,96.0,


studentid    5
name         4
quiz1        4
quiz2        5
quiz3        3
quiz4        4
quiz5        0
dtype: int64

---
**Example 3:** If you know that there are missing values in your DataFrame **df**, you can find them by using **df.isnull()**.
* The **df.isnull()** method returns a DataFrame with the same number of rows and columns as **df**. Its values are True/False, indicating whether the corresponding value in **df** is missing or not.
* The method can also be applied to a column (i.e. a Series).
* Since True is equivalent to 1 and False is equivalent to 0, summing these True/False values with the **sum()** method gives the number of missing values in each column.
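
The bullets above can be sketched on a hypothetical in-memory DataFrame (not the course file). Besides counting, the True/False mask can be used to pick out the rows that contain at least one missing value; the `any(axis=1)` step is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data (not the course file)
df = pd.DataFrame({"name": ["John", None, "Susan", "Clementine"],
                   "quiz1": [np.nan, 66.0, 0.0, 32.0],
                   "quiz3": [46.0, 90.0, np.nan, 98.0]})

col_mask = df["quiz1"].isnull()           # isnull() on a single column (Series)
n_missing_quiz1 = col_mask.sum()          # True counts as 1, False as 0
total_missing = df.isnull().sum().sum()   # missing cells in the whole table

# Keep only the rows that have at least one missing value
incomplete_rows = df[df.isnull().any(axis=1)]

print(n_missing_quiz1)
print(total_missing)
print(incomplete_rows)
```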

In [6]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")

# Check for missing values
display(quiz)
miss = quiz.isnull()
display(miss)
display(quiz.isnull().sum())

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,False,False,True,False,False,False,True
1,False,True,False,False,False,False,True
2,False,False,False,False,True,False,True
3,False,False,False,False,False,False,True
4,False,False,False,False,True,True,True


studentid    0
name         1
quiz1        1
quiz2        0
quiz3        2
quiz4        1
quiz5        5
dtype: int64

In [None]:
# Count missing values
display(miss.sum())

studentid    0
name         1
quiz1        1
quiz2        0
quiz3        2
quiz4        1
quiz5        5
dtype: int64

---
**Example 4:** When you have **NaN** in your DataFrame, it will affect the calculation of various summary statistics.
* The **sum()** function sums all non-missing values in a column. If all of the values are missing, it returns 0.
* The **mean()** function calculates the average of all non-missing values in a column. If all of the values are missing, it returns **NaN**.
* Most other functions behave similarly to **mean()** when there are missing values. See the output of **describe()**.
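
A minimal sketch of this behaviour, using two hypothetical Series that mirror the quiz1 and quiz5 columns. It also shows two optional arguments of `sum()`, `skipna` and `min_count`, which are not used elsewhere in this lecture but can make missing data visible in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mirroring the quiz1 and quiz5 columns
quiz1 = pd.Series([np.nan, 66.0, 0.0, 32.0, 30.0])
quiz5 = pd.Series([np.nan] * 5)

print(quiz1.sum())               # sum of the 4 non-missing values
print(quiz1.mean())              # mean of the 4 non-missing values
print(quiz1.sum(skipna=False))   # NaN is no longer skipped, so the result is NaN

print(quiz5.sum())               # all values missing: sum is 0
print(quiz5.sum(min_count=1))    # require at least 1 valid value: result is NaN
print(quiz5.mean())              # all values missing: mean is NaN
```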

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
display(quiz)

# The sum() function will sum all non-missing values
print(quiz["quiz1"].sum())

# To calculate the mean, divide the sum by 4 (the missing value is not counted)
print(quiz["quiz1"].sum()/4)

# The mean() function can do it automatically
print(quiz["quiz1"].mean())


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


128.0
32.0
32.0


In [None]:
# The sum() function will return 0 if all data are missing
print(quiz["quiz5"].sum())

# The mean() function will return NaN (because the denominator is 0)
# Most other summary functions will also return NaN
print(quiz["quiz5"].mean())

0.0
nan


In [None]:
# Statistics for quiz5 are mostly NaN because there are no valid data
display(quiz[["quiz1","quiz2","quiz3","quiz4","quiz5"]].describe())

Unnamed: 0,quiz1,quiz2,quiz3,quiz4,quiz5
count,4.0,5.0,3.0,4.0,0.0
mean,32.0,55.0,78.0,51.25,
std,26.981475,36.925601,28.0,38.27423,
min,0.0,13.0,46.0,16.0,
25%,22.5,20.0,68.0,21.25,
50%,31.0,64.0,90.0,46.5,
75%,40.5,83.0,94.0,76.5,
max,66.0,95.0,98.0,96.0,


---
**Example 5:** You have to pay extra attention when creating new columns from other columns using basic math operations. If there are missing values in your data, you may end up with missing results.
* When a calculation involves a NaN, the result will become NaN.
* Since the data for quiz5 are all NaN, the resulting avg is also NaN for every row.
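
As a small sketch on hypothetical data, the `+` operator propagates NaN, while the `add()` method accepts a `fill_value` argument that substitutes a value for a missing operand. Treating a missing score as 0 here is only an assumption for illustration; whether it is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mirroring the first rows of quiz1 and quiz2
q1 = pd.Series([np.nan, 66.0, 0.0])
q2 = pd.Series([20.0, 64.0, 95.0])

plain = q1 + q2                     # NaN + 20 gives NaN
filled = q1.add(q2, fill_value=0)   # the missing score is treated as 0

print(plain)
print(filled)
```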

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
quiz["avg"] = (quiz["quiz1"]+quiz["quiz2"]+quiz["quiz3"]+quiz["quiz4"]+quiz["quiz5"])/5
display(quiz)

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5,avg
0,0,John,,20,46.0,23.0,,
1,1,,66.0,64,90.0,70.0,,
2,2,Susan,0.0,95,,16.0,,
3,3,Clementine,32.0,83,98.0,96.0,,
4,4,David,30.0,13,,,,


---
**Example 6:** In some situations, NaN can be handled directly by Pandas functions such as min(), max(), mean(), std(), etc.
* When we use the argument **axis=1**, it means that we are calculating row statistics.
* In earlier examples, we did not pass any arguments to these functions. We were relying on the default **axis=0**, which means calculating column statistics.
* When using these summary statistics functions, the calculation ignores NaN. For example, when calculating the avg variable, the mean() function averages only the non-missing quiz scores in each row, so the denominator is the number of non-missing values in that row (4 for row 1, 3 for row 0), not 5.
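
To make the denominator explicit, the sketch below (hypothetical in-memory data mirroring the first three rows of the quiz table) compares `mean(axis=1)` with a manual calculation that divides the row sum by the number of non-missing values in that row.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the first three rows of the quiz table
scores = pd.DataFrame({
    "quiz1": [np.nan, 66.0, 0.0],
    "quiz2": [20, 64, 95],
    "quiz3": [46.0, 90.0, np.nan],
    "quiz4": [23.0, 70.0, 16.0],
    "quiz5": [np.nan, np.nan, np.nan],
})

row_mean = scores.mean(axis=1)             # row statistics, NaN ignored
row_count = scores.count(axis=1)           # non-missing values in each row
manual = scores.sum(axis=1) / row_count    # same result as row_mean
col_mean = scores.mean()                   # default axis=0: column statistics

print(row_count)
print(row_mean)
print(col_mean)
```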

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
quiz["best"] = quiz[["quiz1","quiz2","quiz3","quiz4","quiz5"]].max(axis=1)
quiz["worst"] = quiz[["quiz1","quiz2","quiz3","quiz4","quiz5"]].min(axis=1)
quiz["avg"] = quiz[["quiz1","quiz2","quiz3","quiz4","quiz5"]].mean(axis=1)
display(quiz)

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5,best,worst,avg
0,0,John,,20,46.0,23.0,,46.0,20.0,29.666667
1,1,,66.0,64,90.0,70.0,,90.0,64.0,72.5
2,2,Susan,0.0,95,,16.0,,95.0,0.0,37.0
3,3,Clementine,32.0,83,98.0,96.0,,98.0,32.0,77.25
4,4,David,30.0,13,,,,30.0,13.0,21.5


---
**Example 7:** In some situations, missing values or infinity can be produced from calculations on non-missing data.
* When calculating the **improve** variable for row 2, we divided 95 by 0, which produces **inf** (infinity).
* **inf** is different from NaN in that arithmetic with it can still give an ordinary number, e.g. 1/inf is equal to zero.
* Typically you want to watch out for situations that produce **inf** and double-check whether there is any faulty logic in your program.
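
One possible way to watch out for **inf** is sketched below on hypothetical data: `np.isinf()` flags the infinite entries, and `replace()` can turn them into NaN so that they are treated as missing. Whether infinity should be treated as missing is a judgment call, not a rule from this lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mirroring the first rows of quiz1 and quiz2
quiz1 = pd.Series([np.nan, 66.0, 0.0])
quiz2 = pd.Series([20.0, 64.0, 95.0])

improve = quiz2 / quiz1          # 20/NaN gives NaN, 95/0 gives inf
is_inf = np.isinf(improve)       # True where the value is +inf or -inf

# Optionally treat infinite values as missing
cleaned = improve.replace([np.inf, -np.inf], np.nan)

print(improve)
print(is_inf.sum())
print(cleaned)
```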


In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
quiz["improve"] = quiz["quiz2"]/quiz["quiz1"]
quiz["improve_pct"] = quiz["improve"]*100
quiz["improve_inv"] = 1/quiz["improve"]
display(quiz)

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5,improve,improve_pct,improve_inv
0,0,John,,20,46.0,23.0,,,,
1,1,,66.0,64,90.0,70.0,,0.969697,96.969697,1.03125
2,2,Susan,0.0,95,,16.0,,inf,inf,0.0
3,3,Clementine,32.0,83,98.0,96.0,,2.59375,259.375,0.385542
4,4,David,30.0,13,,,,0.433333,43.333333,2.307692


---
**Example 8:** In some situations, we can replace missing values with some "sensible" values, e.g. mean, mode or median, so that we have a complete data set.
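
Note that a single fill value passed to `fillna()` is applied to every column, including text columns. A minimal sketch on hypothetical data, assuming you only want to fill the numeric columns and leave missing names alone:

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data (not the course file)
df = pd.DataFrame({"name": ["John", None, "Susan"],
                   "quiz1": [np.nan, 66.0, 0.0],
                   "quiz3": [46.0, 90.0, np.nan]})

# Pick out the numeric columns and fill only those with 0
num_cols = df.select_dtypes(include="number").columns
filled = df.copy()
filled[num_cols] = filled[num_cols].fillna(0)

print(filled)
```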

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
print("Before replacing missing values:")
display(quiz)

# Fill all NaN with a given value
# In this case, we replace all NaN with 0
quiz_filled = quiz.fillna(0)
print()
print("After replacing missing values:")
display(quiz_filled)


Before replacing missing values:


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,



After replacing missing values:


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,0.0,20,46.0,23.0,0.0
1,1,0,66.0,64,90.0,70.0,0.0
2,2,Susan,0.0,95,0.0,16.0,0.0
3,3,Clementine,32.0,83,98.0,96.0,0.0
4,4,David,30.0,13,0.0,0.0,0.0


---
**Example 9:** Usually, you want to replace missing values in different columns with different values. In that case, you can use a dictionary to specify the value used to fill missing cells in each column. In the dictionary, the keys are the column names, and the values are the values used for filling. Columns that do not appear in the dictionary are left unchanged.
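
A small sketch on hypothetical data showing two details of dictionary filling: columns without a key keep their NaN, and `fillna()` returns a new DataFrame rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data (not the course file)
df = pd.DataFrame({"name": ["John", None],
                   "quiz1": [np.nan, 66.0],
                   "quiz5": [np.nan, np.nan]})

# One fill value per column; quiz5 has no key, so it is not filled
fvalues = {"name": "UNKNOWN", "quiz1": 50}
filled = df.fillna(fvalues)

print(filled)
print(df)   # the original DataFrame still contains its NaN values
```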

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
print("Before replacing missing values:")
display(quiz)

# Fill each column with a separate value
# Values used for each column will be put in a dictionary
fvalues = {"name":"UNKNOWN","quiz1":50,"quiz2":60,"quiz3":70,"quiz4":80}
print()
print("After replacing missing values:")
display(quiz.fillna(fvalues))

Before replacing missing values:


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,



After replacing missing values:


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,50.0,20,46.0,23.0,
1,1,UNKNOWN,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,70.0,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,70.0,80.0,


---
**Example 10:** The method in Example 9 can be further improved by using values computed from the non-missing values. In this example, we replace missing values in quiz1 and quiz3 with the mean of the non-missing values in each column. The missing values in quiz4 are replaced with the median of the non-missing values.
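
When every numeric column should be filled with its own mean, the dictionary does not have to be written by hand. A minimal sketch on hypothetical data, relying on the fact that `fillna()` also accepts a Series whose index holds the column names:

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data (not the course file)
df = pd.DataFrame({"name": ["John", None, "Susan"],
                   "quiz3": [46.0, np.nan, 98.0],
                   "quiz4": [23.0, 70.0, np.nan]})

# Mean of each numeric column, indexed by column name
col_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean; name is left as-is
filled = df.fillna(col_means)

print(col_means)
print(filled)
```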

In [None]:
# Read quizscores.xlsx data file
quiz = pd.read_excel("/content/drive/MyDrive/Data/quizscores.xlsx",sheet_name="Sheet1")
display(quiz)

# Fill numeric columns with calculated statistics
# Replace missing values of quiz1 and quiz3 with the column mean
# Replace missing values of quiz4 with the column median
fvalues = {"quiz1":quiz["quiz1"].mean(),"quiz3":quiz["quiz3"].mean(),
           "quiz4":quiz["quiz4"].median()}
display(quiz.fillna(fvalues))

Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,,,


Unnamed: 0,studentid,name,quiz1,quiz2,quiz3,quiz4,quiz5
0,0,John,32.0,20,46.0,23.0,
1,1,,66.0,64,90.0,70.0,
2,2,Susan,0.0,95,78.0,16.0,
3,3,Clementine,32.0,83,98.0,96.0,
4,4,David,30.0,13,78.0,46.5,
