In [1]:
import pandas as pd
import numpy as np


In [3]:
df = pd.read_csv('movie_scores.csv')
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


How to check for and select NULL values?


In [4]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [5]:
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True
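
The Boolean tables above can also be summed to count missing values, because `True` counts as 1. The sketch below rebuilds the table by hand instead of reading `movie_scores.csv`, so it runs on its own; the inline data is an assumption copied from the output shown earlier.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Number of NaN cells in each column.
missing_per_column = df.isnull().sum()

# Number of NaN cells in each row (axis=1 sums across the columns).
missing_per_row = df.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```

Row 1 has six missing cells and row 2 has two, which matches the `True` values in the `isnull()` table.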


We want to know which rows of a column hold valid (non-null) data, so we call the `notnull()` method on that column.


In [6]:
df['pre_movie_score'].notnull()


0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

Now we select only the rows that are not null by passing this Boolean Series inside `df[...]`:

In [7]:
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
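
As a small sketch of what the filter does, the Boolean Series can be stored in a variable (here called `mask`, a name chosen for illustration) and reused. The table is rebuilt by hand so the example runs without the CSV file; that inline data is an assumption based on the output above.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# True where pre_movie_score holds a value, False where it is NaN.
mask = df["pre_movie_score"].notnull()

# df[mask] keeps only the rows where the mask is True.
with_score = df[mask]

# ~ flips the mask, so this keeps the rows where the score is missing.
without_score = df[~mask]

print(with_score.index.tolist())     # [0, 3, 4]
print(without_score.index.tolist())  # [1, 2]
```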


In [8]:
df['pre_movie_score'].isnull()

0    False
1     True
2     True
3    False
4    False
Name: pre_movie_score, dtype: bool

In [9]:
df[df['pre_movie_score'].isnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
1,,,,,,
2,Hugh,Jackman,51.0,m,,


How do we exclude only the row that is entirely `NaN`?
-------------
Filtering on a null `pre_movie_score` returns two rows: row 1, which is entirely `NaN`, and Hugh Jackman, who is missing only his two scores. To keep Hugh Jackman and exclude the empty row, we combine two conditions with the element-wise `and` operator `&`: the score is null and the first name is not null. Each condition must be wrapped in its own parentheses.

In [10]:
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
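
A more general way to get the same result, sketched below, avoids naming specific columns: `any(axis=1)` checks across a whole row, and `dropna(how="all")` removes only the rows in which every value is missing. The table is rebuilt by hand (an assumption based on the output above) so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Rows with at least one NaN AND at least one real value:
# partially missing, but not completely empty.
has_missing = df.isnull().any(axis=1)
has_data = df.notnull().any(axis=1)
partially_missing = df[has_missing & has_data]

# Drop only the rows in which every column is NaN.
without_empty_rows = df.dropna(how="all")

print(partially_missing.index.tolist())   # [2]
print(without_empty_rows.index.tolist())  # [0, 2, 3, 4]
```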


How to drop missing data?
------
Run `help(df.dropna)` to see all the options. The most common calls are:
```
df.dropna()          # drop every row that contains at least one NaN
df.dropna(thresh=1)  # keep only the rows that have at least 1 non-null value
df.dropna(thresh=5)  # keep only the rows that have at least 5 non-null values
```


In [11]:
df.dropna(thresh=1)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
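
To see why a given `thresh` keeps or drops a row, it can help to count the non-null values per row first. This is a minimal sketch on a hand-built copy of the table (an assumption based on the output above), not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Non-null values in each row; thresh is compared against this count.
non_null_counts = df.notnull().sum(axis=1)
print(non_null_counts.tolist())  # [6, 0, 4, 6, 6]

# thresh=1 drops only row 1 (0 non-null values).
at_least_one = df.dropna(thresh=1)

# thresh=5 also drops row 2 (Hugh Jackman has only 4 non-null values).
at_least_five = df.dropna(thresh=5)

print(at_least_one.index.tolist())   # [0, 2, 3, 4]
print(at_least_five.index.tolist())  # [0, 3, 4]
```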


In [12]:
# axis=1 drops every column that contains at least one NaN.
# Row 1 is entirely NaN, so every column qualifies and only the index is left.
df.dropna(axis=1)

0
1
2
3
4
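
One possible way to make `axis=1` useful on this table is to remove the completely empty row first and only then drop the columns that still contain a NaN. The sketch below assumes a hand-built copy of the table shown earlier.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Every column has a NaN in row 1, so all six columns are dropped.
all_dropped = df.dropna(axis=1)
print(all_dropped.shape)  # (5, 0)

# Remove the all-NaN row first, then drop the columns that still have a NaN.
# Only the two score columns are removed, because Hugh Jackman has no scores.
cleaned = df.dropna(how="all").dropna(axis=1)
print(cleaned.columns.tolist())  # ['first_name', 'last_name', 'age', 'sex']
```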


In [13]:
df.dropna(axis=0) # axis=0 (the default) drops every row that contains at least one NaN.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


Use `subset` to check for NaN only in specific columns:

In [15]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [16]:
df.dropna(subset=['last_name']) # only the last_name column is checked for NaN, so only row 1 is dropped

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
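
`subset` also accepts more than one column. As a sketch on a hand-built copy of the table (an assumption based on the output above), the following keeps only the rows that have both scores:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Drop a row if either of the two score columns is NaN;
# NaN values in the other columns are ignored.
scored = df.dropna(subset=["pre_movie_score", "post_movie_score"])

print(scored.index.tolist())  # [0, 3, 4]
```

Here Hugh Jackman is dropped along with the empty row, because the check now looks at the score columns instead of `last_name`.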


How to fill in missing data?
----
`df.fillna('NEW VALUE!')` replaces every NaN in the DataFrame with the given value:

In [None]:
df.fillna('NEW VALUE!')

In [20]:
df['pre_movie_score'].fillna(0) # replaces each NaN in this column with 0.0

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64
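
Different columns usually need different fill values. `fillna` also accepts a dictionary that maps column names to values; the sketch below uses a hand-built copy of the table, and the fill values `0` and `'unknown'` are picked only for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# One fill value per column; columns not listed (such as age) keep their NaN.
filled = df.fillna({"pre_movie_score": 0, "first_name": "unknown"})

print(filled["pre_movie_score"].tolist())  # [8.0, 0.0, 0.0, 6.0, 7.0]
print(filled.loc[1, "first_name"])         # unknown

# fillna returns a new object; the original df still contains its NaN values.
print(df["pre_movie_score"].isnull().sum())  # 2
```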

How to fill missing values with the column average?
-----
First compute the average with the `mean` method, which skips NaN values by default:

`df['pre_movie_score'].mean()`

Then fill in the missing data with that average:

`df['pre_movie_score'].fillna(df['pre_movie_score'].mean())`

In [21]:
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64
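
The call above returns a new Series and leaves `df` unchanged, so the result has to be assigned back to be kept. The sketch below does this for both score columns on a copy of a hand-built table (an assumption based on the output shown earlier); the names `score_columns` and `filled` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumed to match movie_scores.csv).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

score_columns = ["pre_movie_score", "post_movie_score"]

# Work on a copy so the original table stays available for comparison.
filled = df.copy()

# mean() returns one average per column (NaN values are skipped);
# fillna matches those averages to the columns by name.
column_means = filled[score_columns].mean()
filled[score_columns] = filled[score_columns].fillna(column_means)

print(filled["pre_movie_score"].tolist())   # [8.0, 7.0, 7.0, 6.0, 7.0]
print(filled["post_movie_score"].tolist())  # [10.0, 9.0, 9.0, 8.0, 9.0]
```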