## Missing Data
Real-world data sets often contain missing values, and many models and ML methods cannot handle them.
When pandas reads in a missing value, it displays it as NaN.
There are three main options for dealing with missing data: 1) keep it, 2) remove it, or 3) replace it.


In [8]:
import numpy as np
import pandas as pd
import os
os.chdir("C:\\Users\\sfrie\\Python\\pandas\\udemy_pandas_files")

In NumPy and pandas, *np.nan* is the standard way to represent a missing value.
Newer versions of pandas also provide pd.NA.

In [4]:
print(np.nan)
print(pd.NA)

nan
<NA>


In [None]:
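#Typical comparisons should be avoided with missing values. If you check np.nan == np.nan, it returns False, because NaN is defined as unequal to every value, including itself.
#An identity check with "is" returns True here only because both sides are the very same np.nan object, so it is not a reliable test for values stored in a Series or data frame.
#To check whether a value is missing, prefer pd.isna() (or the isnull / notnull methods).
np.nan is np.nan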
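
The comparison behaviour described above can be checked directly. The snippet below is a minimal sketch rather than part of the original notebook; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# NaN compares unequal to everything, including itself.
equal_check = np.nan == np.nan          # False

# An identity check only succeeds when both sides are the very same object.
identity_check = np.nan is np.nan       # True, but fragile
other_nan = float("nan")                # a different NaN object
fragile_check = other_nan is np.nan     # False, even though other_nan is NaN

# pd.isna() tests the value itself and also recognises None and pd.NA.
robust_checks = [pd.isna(np.nan), pd.isna(other_nan), pd.isna(None), pd.isna(pd.NA)]

print(equal_check, identity_check, fragile_check, robust_checks)
```

Because `pd.isna()` looks at the value rather than the object, it is usually the safer choice for scalars.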

In [55]:
df = pd.read_csv('movie_scores.csv')
df.head()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


### .isnull / .notnull

In [12]:
#Let's check for null values. isnull returns a Boolean data frame that is True wherever a value is missing.
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [14]:
#To check for the opposite, we can use notnull
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [19]:
#Select all of the actors who are not missing their pre-movie score: notnull builds a Boolean mask, and passing that mask into the data frame keeps only the rows where it is True.
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [23]:
#To select on multiple null / not-null conditions, we use the normal multiple-condition syntax: each condition wrapped in () and combined with & or |
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
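
For readers without `movie_scores.csv`, the masks in this section can be reproduced on an in-memory frame. The sketch below assumes the five rows displayed above are the whole data set and treats the first displayed column as the row index. The names `missing_per_column` and `named_but_unscored` are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed in-memory copy of the five rows displayed above (not read from movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# True counts as 1, so summing the Boolean frame counts missing values per column.
missing_per_column = df.isnull().sum()

# Rows that have a first name but no pre-movie score.
named_but_unscored = df[df["pre_movie_score"].isnull() & df["first_name"].notnull()]

print(missing_per_column)
print(named_but_unscored)
```

Summing the Boolean frame is a quick way to see which columns need attention before deciding whether to drop or fill.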


### Drop Data

In [26]:
#Calling dropna with no arguments removes every row that contains at least one missing value.
df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [27]:
#The thresh (threshold) argument lets us drop rows conditionally: a row is kept only if it has at least that many non-null values.
df.dropna(thresh=1)
#Here, Hugh Jackman does not get dropped, because he has at least 1 non-null value. Only row 1, which is entirely null, is removed.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
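
To see how the threshold changes the result, the sketch below compares two `thresh` values and the related `how="all"` option. It runs on an assumed in-memory copy of the rows displayed above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed in-memory copy of the five rows displayed above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Keep rows with at least 1 non-null value: only the all-null row 1 is dropped.
at_least_one = df.dropna(thresh=1)

# Keep rows with at least 5 non-null values: Hugh Jackman has only 4, so he is dropped too.
at_least_five = df.dropna(thresh=5)

# how="all" drops a row only when every value in it is missing.
all_null_only = df.dropna(how="all")

print(list(at_least_one.index), list(at_least_five.index), list(all_null_only.index))
```

On this data, `thresh=1` and `how="all"` keep the same rows, since both remove only the rows that are entirely empty.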


In [28]:
#If we set axis = 1, dropna drops columns instead of rows. Here, it will drop every column, as each column has at least one missing value.
df.dropna(axis = 1)
#This is why the default is axis = 0, which drops rows.

0
1
2
3
4
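
Dropping along `axis=1` becomes more useful once the fully empty row is gone. The sketch below, again on an assumed in-memory copy of the displayed rows, first removes all-null rows and then drops only the columns that still contain a missing value. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed in-memory copy of the five rows displayed above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Every column contains at least one NaN (row 1), so no columns survive.
no_columns_left = df.dropna(axis=1)

# Remove the all-null row first, then drop the columns that still have gaps.
cleaned = df.dropna(how="all").dropna(axis=1)

print(no_columns_left.shape)
print(list(cleaned.columns))
```

The order of the two calls matters: run the other way round, the column drop would remove everything before the row drop has a chance to help.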


In [29]:
#The subset argument allows us to drop rows based on the contents of specific columns only.
df.dropna(subset = ['last_name'])
#This will not drop Hugh Jackman, because Last Name has a filled value.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


### Fill Values

In [30]:
#fillna with a single value sets every null value in the data frame to that filler.
df.fillna('New Value!')
#This typically isn't used, because some columns are numeric and others hold strings, so a single filler rarely suits every column.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,New Value!,New Value!,New Value!,New Value!,New Value!,New Value!
2,Hugh,Jackman,51.0,m,New Value!,New Value!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
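
One way around the mixed-type problem is to pass `fillna` a dictionary that maps column names to fill values. The sketch below uses an assumed in-memory subset of the displayed rows; the fillers `"Unknown"` and `0` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Assumed in-memory subset of the rows displayed above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
})

# Each key is a column name and each value is the filler for that column.
# Columns not listed in the dictionary (here, age) are left untouched.
filled = df.fillna({"first_name": "Unknown", "pre_movie_score": 0})

print(filled)
```

The numeric column stays numeric because it only ever receives a numeric filler.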


In [57]:
#Most commonly, we fill values for one specific column.
df['pre_movie_score'].fillna(0)
#To make this permanent, we would assign it back: df['pre_movie_score'] = df['pre_movie_score'].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [58]:
#When we aren't sure what to set the null values to, one common option is to fill them with the column's average value.
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64
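
The same idea extends to several columns at once. The sketch below, on an assumed in-memory subset of the displayed rows, fills each score column with its own mean and assigns the result back so the change persists. `score_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Assumed in-memory subset of the rows displayed above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

score_cols = ["pre_movie_score", "post_movie_score"]

# mean() skips NaN by default and returns one value per column;
# fillna matches those values to the columns by name.
df[score_cols] = df[score_cols].fillna(df[score_cols].mean())

print(df)
```

Note that this also fills the scores for row 1, which has no actor at all, so in practice it may make sense to drop fully empty rows before imputing.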