# Missing Data

*Originally by B. Mathayomchan<br>
Modified by K. Bunchongchit*

In the past, people often used an out-of-range number to stand for missing data. For example, if an age was missing, they might record -1 instead. In Pandas, the symbol that represents missing data is **NaN**.

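**NaN** actually comes from NumPy, which exposes it as `nan`. Older releases also accept the aliases `NaN` and `NAN`; NumPy 2.0 removed the `NaN` alias, so on recent versions the import below may need to be reduced to `nan`.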

In [1]:
from numpy import NaN, NAN, nan

In [2]:
print(NaN == True)

False


In [3]:
print(NaN == False)

False


In [4]:
print(NaN == 0)

False


In [5]:
print(NaN == '')

False


In [6]:
print(NaN == NaN)

False


In [7]:
print(NaN == nan)

False


In [8]:
print(NaN == NAN)

False
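Every comparison above is False because NaN follows the IEEE 754 floating-point rules: it compares unequal to everything, itself included. A minimal sketch using only the standard library (`float("nan")` is used here as a stand-in for the NumPy import, which is an assumption made for illustration):

```python
import math

# float("nan") is an ordinary Python float holding the IEEE 754 NaN value.
nan = float("nan")

# NaN is never equal to anything, so an equality test cannot detect it.
equal_to_itself = (nan == nan)

# As a consequence, "x != x" is True only when x is NaN.
differs_from_itself = (nan != nan)

# The readable way to test a single float is math.isnan.
detected = math.isnan(nan)

print(equal_to_itself, differs_from_itself, detected)
```

This is why a dedicated check is needed instead of `==`.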


### How do we check that the value is NaN?

In [9]:
import pandas as pd

In [10]:
print(pd.isnull(NaN))

True


In [11]:
print(pd.isnull(nan))

True


In [12]:
print(pd.isnull(NAN))

True


In [13]:
print(pd.notnull(NaN))

False


In [14]:
print(pd.notnull(5))

True


In [15]:
print(pd.notnull(''))

True


In [16]:
print(pd.notnull('NaN'))

True
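The same checks work element-wise on a whole Series. The sketch below uses made-up values (not taken from the course data) to show that `None` counts as missing, while the text `'NaN'` and the empty string do not:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to cover the cases tested above.
values = pd.Series([1.5, np.nan, None, "NaN", ""], dtype="object")

# isna() is the element-wise counterpart of pd.isnull().
mask = values.isna()

# Summing a boolean mask counts the True entries.
missing_count = int(mask.sum())

print(mask.tolist())
print(missing_count)
```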


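Let's import a CSV file that contains NaN entries. By default, `read_csv` converts empty fields and strings such as `NaN`, `NA`, and `N/A` into NaN.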

In [17]:
df = pd.read_csv("./datasets/muicNaN.csv")
df

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | ICCS | 3.1 |


In [18]:
df = pd.read_csv("./datasets/muicNaN.csv", keep_default_na=False)
df

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | ICCS | 3.1 |


In [19]:
x = df.iloc[1,3]
print(pd.isnull(x))

False


In [20]:
print(x=='')

False


In [21]:
y = df.iloc[3,2]
print(pd.isnull(y))

False


In [22]:
print(y=='')

False
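With `keep_default_na=False`, pandas stops treating its built-in markers as missing. That is why neither `pd.isnull(x)` nor `x == ''` is True above: the cells presumably hold the literal text `NaN`. The sketch below reproduces the effect on an in-memory CSV; the text is hypothetical and is not the content of `muicNaN.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text with one literal "NaN" and one empty field.
csv_text = "ID,Major,GPA\n1,ICCH,3.2\n2,ICJD,NaN\n3,,2.4\n"

default_df = pd.read_csv(io.StringIO(csv_text))
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Default parsing: "NaN" and the empty field both become real missing values.
default_missing = int(default_df.isna().sum().sum())

# keep_default_na=False: both stay as text, so nothing is reported missing.
raw_missing = int(raw_df.isna().sum().sum())
raw_gpa = raw_df.loc[1, "GPA"]      # the string "NaN"
raw_major = raw_df.loc[2, "Major"]  # the empty string

print(default_missing, raw_missing, repr(raw_gpa), repr(raw_major))
```

A side effect is that the GPA column is read as text rather than as numbers, because `NaN` is no longer recognised as a missing value.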


If the data file already uses a sentinel value (here -1) to mark missing entries, we can tell `read_csv` to treat that value as NaN with the `na_values` argument.

In [23]:
df = pd.read_csv("./datasets/muicNaN2.csv")
df

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | -1.0 |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | -1 | 2.4 |
| 4 | 6780005 | Moe Otto | NaN | -1.0 |


In [24]:
df = pd.read_csv("./datasets/muicNaN2.csv", na_values=-1)
df

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | NaN | NaN |
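A scalar `na_values` applies to every column, which can be too broad when the sentinel is a legitimate value elsewhere. `na_values` also accepts a dict keyed by column name. The following sketch assumes a hypothetical file in which -1 means "missing" in `GPA` but is a real number in `RankChange`; both the column names and the data are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical data: -1 marks a missing GPA but is a valid RankChange.
csv_text = "ID,GPA,RankChange\n1,3.2,-1\n2,-1,12\n3,2.4,9\n"

# A scalar sentinel is applied to every column.
all_cols = pd.read_csv(io.StringIO(csv_text), na_values=-1)

# A dict restricts the sentinel to the named column.
gpa_only = pd.read_csv(io.StringIO(csv_text), na_values={"GPA": [-1]})

all_gpa_missing = int(all_cols["GPA"].isna().sum())
all_rank_missing = int(all_cols["RankChange"].isna().sum())
only_gpa_missing = int(gpa_only["GPA"].isna().sum())
only_rank_missing = int(gpa_only["RankChange"].isna().sum())

print(all_gpa_missing, all_rank_missing)
print(only_gpa_missing, only_rank_missing)
```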


In [25]:
df = pd.read_csv("./datasets/muicNaN.csv", na_values=-1, keep_default_na=False)
df

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | ICCS | 3.1 |


## Dealing with missing values

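First, we need to know which columns have missing values. There are a number of ways to do this; one is to count the NaN entries per column with `isna().sum()`. Note that `df` was last loaded with `keep_default_na=False`, so its `NaN` entries are ordinary strings: the checks in the following cells find nothing missing, and the GPA column has dtype `object` instead of `float64`.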

In [26]:
# https://datatofish.com/count-nan-pandas-dataframe/
null_counts = df.isna().sum()
null_counts[null_counts > 0]

Series([], dtype: int64)
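The result is empty because this `df` contains no real NaN. On a frame that does, the same idiom lists the affected columns. A small sketch on invented data (not the course dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with real NaN values.
demo = pd.DataFrame({
    "Major": ["ICCH", "ICJD", np.nan],
    "GPA": [3.2, np.nan, np.nan],
})

# Missing values per column, then only the columns that have any.
null_counts = demo.isna().sum()
cols_with_missing = null_counts[null_counts > 0]

# Number of rows that contain at least one missing value.
rows_with_missing = int(demo.isna().any(axis=1).sum())

print(cols_with_missing.to_dict())
print(rows_with_missing)
```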

In [27]:
# https://pandas.pydata.org/docs/user_guide/missing_data.html
df_clean = df.dropna()
df_clean.head()

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | ICCS | 3.1 |
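No rows were dropped here because the `NaN` entries in this `df` are strings. With real NaN values, `dropna()` removes every row that has at least one, and `subset` narrows the check to chosen columns. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Major and one missing GPA.
demo = pd.DataFrame({
    "Name": ["Ann", "Ben", "Bob"],
    "Major": ["ICCH", "ICJD", np.nan],
    "GPA": [3.2, np.nan, 2.4],
})

# Drop a row if any column is missing.
complete_rows = demo.dropna()

# Drop a row only if GPA is missing; a missing Major is tolerated.
gpa_required = demo.dropna(subset=["GPA"])

print(complete_rows["Name"].tolist())
print(gpa_required["Name"].tolist())
```

Both calls return a new DataFrame and leave `demo` unchanged.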


In [28]:
df.dtypes

ID        int64
Name     object
Major    object
GPA      object
dtype: object

In [29]:
# https://pandas.pydata.org/docs/user_guide/missing_data.html
df_filled = df.interpolate()
df_filled

| | ID | Name | Major | GPA |
|---|---|---|---|---|
| 0 | 6780001 | Ann Molive | ICCH | 3.2 |
| 1 | 6780002 | Ben Kenobi | ICJD | NaN |
| 2 | 6780003 | Peter Cruiser | ICBA | 3.3 |
| 3 | 6780004 | Bob Hanger | NaN | 2.4 |
| 4 | 6780005 | Moe Otto | ICCS | 3.1 |
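`interpolate()` only has something to do on numeric columns that contain real NaN, which is why the frame above is unchanged. The sketch below applies it to a float Series holding the GPA values with a genuine gap. Linear interpolation assumes the rows are meaningfully ordered, which is rarely true for student records, so this illustrates the mechanics and is not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# GPA values as floats with one real missing entry.
gpa = pd.Series([3.2, np.nan, 3.3, 2.4, 3.1])

# The default method is linear: a gap is filled from its two neighbours.
filled = gpa.interpolate()

print(filled.tolist())
```

The gap at position 1 is filled with the midpoint of 3.2 and 3.3.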


In [30]:
# https://www.makeuseof.com/fill-missing-data-with-pandas/
df.fillna({"Major":df['Major'].mode()[0], 
           "GPA": df['GPA'].mean()})

TypeError: Could not convert 3.2NaN3.32.43.1 to numeric
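The error arises because GPA is an `object` column of strings, so `mean()` ends up trying to convert their concatenation to a number. One possible fix, sketched on an invented frame that mimics that state, is to turn the text into real missing values first and fill afterwards. `pd.to_numeric(..., errors="coerce")` and `Series.mask` are choices made for this sketch, not steps taken from the original notebook.

```python
import pandas as pd

# Hypothetical frame mimicking df above: "NaN" stored as text.
demo = pd.DataFrame({
    "Major": ["ICCH", "ICJD", "ICCH", "NaN"],
    "GPA": ["3.2", "NaN", "3.3", "2.4"],
})

# Convert GPA to floats; text that is not a number becomes NaN.
demo["GPA"] = pd.to_numeric(demo["GPA"], errors="coerce")

# Replace the literal text "NaN" in Major with a real missing value.
demo["Major"] = demo["Major"].mask(demo["Major"] == "NaN")

# Fill Major with the most frequent value and GPA with the column mean.
filled = demo.fillna({
    "Major": demo["Major"].mode()[0],
    "GPA": demo["GPA"].mean(),
})

print(filled)
```

Re-reading the file without `keep_default_na=False` would achieve the same conversion at load time.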