## What Null/NA/NaN objects look like

Source: https://github.com/pandas-dev/pandas/issues/28095

A new `pd.NA` value (a singleton) is introduced to represent scalar missing values. Until now, pandas used several values to represent missing data: `np.nan` for float data, `np.nan` or `None` for object-dtype data, and `pd.NaT` for datetime-like data. The goal of `pd.NA` is to provide a "missing" indicator that can be used consistently across data types. `pd.NA` is currently used by the nullable integer and boolean data types and by the new string data type.
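To make the difference concrete, here is a minimal sketch (the variable names are illustrative, not from the source) showing how the same missing entry is stored under the default float inference and under the nullable dtypes mentioned above:

```python
import numpy as np
import pandas as pd

# Default inference: a missing entry turns an integer column into float64 with NaN.
floats = pd.Series([1, 2, None])

# Nullable extension dtypes keep their logical type and mark the gap as missing.
ints = pd.Series([1, 2, None], dtype="Int64")
bools = pd.Series([True, None, False], dtype="boolean")
strs = pd.Series(["a", None, "c"], dtype="string")

print(floats.dtype)                    # float64
print(ints.dtype, ints[2] is pd.NA)    # Int64 True
print(bools.dtype, bools[1] is pd.NA)  # boolean True
print(strs.isna().tolist())            # [False, True, False]
```

Requesting a nullable dtype explicitly is what opts a column into `pd.NA`; without it, pandas falls back to `np.nan` and a float column.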

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.nan # not-a-number

nan

In [4]:
pd.NA

<NA>

In [5]:
pd.NaT

NaT

---
## Note: Avoid ordinary comparisons with missing values

* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true

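The reasoning is that a missing value is unknown, so there is no way to tell whether two missing values are equal. That is why `==` does not return `True` in the cells below. The `in` and `is` checks return `True` only because they test object identity (`np.nan` is the same object each time), not because the values compare equal.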

## Summary Table
| Placeholder | Library | Type / dtype                          | Data Type Context                  |
|-------------|---------|---------------------------------------|------------------------------------|
| `np.nan`    | NumPy   | `float`                               | Floating-point numbers             |
| `pd.NA`     | pandas  | Nullable `Int64`, `boolean`, `string` | Any nullable type (int, bool, str) |
| `pd.NaT`    | pandas  | `datetime64`, `timedelta64`           | Datetime and timedelta data        |

In [6]:
np.nan == np.nan

False

In [7]:
np.nan in [np.nan]

True

In [8]:
np.nan is np.nan

True

In [9]:
pd.NA == pd.NA

<NA>
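Since equality is unreliable here, missing values are usually detected with `pd.isna` (or the `.isna()` method) instead. The following is a small sketch with made-up values, not part of the original notebook:

```python
import numpy as np
import pandas as pd

candidates = [np.nan, pd.NA, pd.NaT, None, 0, ""]

# pd.isna gives a plain True/False for each scalar, whatever the placeholder type.
flags = [bool(pd.isna(value)) for value in candidates]
print(flags)  # [True, True, True, True, False, False]

# Equality cannot be used as a filter: NaN != NaN, so this mask is all False.
ser = pd.Series([1.0, np.nan, 3.0])
eq_mask = (ser == np.nan).tolist()
na_mask = ser.isna().tolist()
print(eq_mask)  # [False, False, False]
print(na_mask)  # [False, True, False]
```

Note that `0` and the empty string are ordinary values, not missing ones.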

## Data

People were asked to score their opinions of actors from a 1-10 scale before and after watching one of their movies. However, some data is missing.

In [11]:
df = pd.read_csv('movie_scores.csv')

In [12]:
df.shape

(5, 6)

In [13]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
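If `movie_scores.csv` is not at hand, an equivalent DataFrame can be built inline. This is a sketch reconstructed from the output shown above; it assumes the file contains nothing beyond these five rows and six columns:

```python
import numpy as np
import pandas as pd

# Values copied from the displayed table; np.nan marks every missing cell.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

print(df.shape)         # (5, 6)
print(df.isna().sum())  # missing values per column
```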


## Checking and Selecting for Null Values

In [14]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [15]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [16]:
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [17]:
df['first_name']

0      Tom
1      NaN
2     Hugh
3    Oprah
4     Emma
Name: first_name, dtype: object

In [18]:
df[df['first_name'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [19]:
df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,


## Drop Data

In [20]:
# help(df.dropna)

In [21]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [22]:
df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [23]:
df.dropna(thresh=4)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [25]:
df.dropna(axis=1) # Every column contains at least one NaN, so all columns are dropped

0
1
2
3
4


In [26]:
df.dropna(thresh=4,axis=1)

Unnamed: 0,first_name,last_name,age,sex
0,Tom,Hanks,63.0,m
1,,,,
2,Hugh,Jackman,51.0,m
3,Oprah,Winfrey,66.0,f
4,Emma,Stone,31.0,f
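The `thresh` argument used above keeps a row (or column) only if it has at least that many non-null values. The sketch below uses a small stand-in frame (hypothetical data and column names) to show the equivalent manual count, plus `how="all"` for dropping only fully empty rows:

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration only.
df = pd.DataFrame({
    "name": ["Tom", np.nan, "Hugh"],
    "age": [63.0, np.nan, 51.0],
    "score": [8.0, np.nan, np.nan],
})

# thresh=2 keeps rows that have at least 2 non-null values.
kept = df.dropna(thresh=2)

# The same selection written out by counting non-null values per row.
manual = df[df.notna().sum(axis=1) >= 2]
print(kept.equals(manual))  # True

# how="all" drops a row only when every value in it is missing.
all_dropped = df.dropna(how="all")
print(all_dropped.index.tolist())  # [0, 2]
```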


## Fill Data

In [27]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [28]:
df.fillna("NEW VALUE!")  

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!
2,Hugh,Jackman,51.0,m,NEW VALUE!,NEW VALUE!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [29]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [30]:
df['first_name'].fillna("Empty")

0      Tom
1    Empty
2     Hugh
3    Oprah
4     Emma
Name: first_name, dtype: object

In [31]:
df['first_name'] = df['first_name'].fillna("Empty")

In [32]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,Empty,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [33]:
df['pre_movie_score'].mean()

np.float64(7.0)

In [34]:
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   first_name        5 non-null      object 
 1   last_name         4 non-null      object 
 2   age               4 non-null      float64
 3   sex               4 non-null      object 
 4   pre_movie_score   3 non-null      float64
 5   post_movie_score  3 non-null      float64
dtypes: float64(3), object(3)
memory usage: 372.0+ bytes


In [36]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,Empty,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [37]:
df.mean(numeric_only=True)

age                 52.75
pre_movie_score      7.00
post_movie_score     9.00
dtype: float64

In [38]:
df.fillna(df.mean(numeric_only=True))

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,Empty,,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
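Passing a Series of means works because `fillna` matches fill values to columns by name and leaves unmatched columns alone. A dict works the same way, which allows a different replacement per column. A minimal sketch with stand-in data (the column choices and the `"unknown"` label are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration only.
df = pd.DataFrame({
    "sex": ["m", np.nan, "f"],
    "age": [63.0, np.nan, 31.0],
    "score": [8.0, np.nan, 6.0],
})

# Keys are column names; each listed column gets its own replacement value.
fill_values = {"sex": "unknown", "age": df["age"].mean()}
filled = df.fillna(fill_values)

print(filled)
# "score" is not in the dict, so its missing value stays NaN.
```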


## Filling with Interpolation

Be careful with this technique: make sure you understand whether interpolation is a valid choice for your data. Several interpolation methods are available; the default is linear.

Full Docs on this Method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
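To illustrate why the choice of method matters, the sketch below (made-up values) compares the default `'linear'` method, which treats rows as equally spaced, with `'index'`, which uses the numeric index as the spacing:

```python
import numpy as np
import pandas as pd

# The index values are unevenly spaced: 0, 1, 10.
ser = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and places the gap halfway between its neighbours.
linear = ser.interpolate(method="linear")

# 'index' weights by index distance: 1 is one tenth of the way from 0 to 10.
by_index = ser.interpolate(method="index")

print(linear.loc[1])    # 5.0
print(by_index.loc[1])  # 1.0
```

Which result is appropriate depends on whether the index carries real spacing information, such as time or distance.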

In [39]:
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

In [40]:
ser = pd.Series(airline_tix)

In [41]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [42]:
ser.interpolate()

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64

In [43]:
ser.interpolate(method='linear')

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64

In [44]:
df = pd.DataFrame(ser,columns=['Price'])

In [45]:
df

Unnamed: 0,Price
first,100.0
business,
economy-plus,50.0
economy,30.0


In [46]:
df.interpolate()

Unnamed: 0,Price
first,100.0
business,75.0
economy-plus,50.0
economy,30.0


In [47]:
df = df.reset_index()

In [48]:
df

Unnamed: 0,index,Price
0,first,100.0
1,business,
2,economy-plus,50.0
3,economy,30.0


In [49]:
df.interpolate(method='linear',order=2) # `order` only matters for 'polynomial' and 'spline'; it has no effect on 'linear'


Unnamed: 0,index,Price
0,first,100.0
1,business,75.0
2,economy-plus,50.0
3,economy,30.0


In [50]:
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))

In [51]:
df

Unnamed: 0,a,b,c,d
0,0.0,,-1.0,1.0
1,,2.0,,
2,2.0,3.0,,9.0
3,,4.0,-4.0,16.0


In [52]:
df.interpolate(method='linear', limit_direction='forward', axis=0)

Unnamed: 0,a,b,c,d
0,0.0,,-1.0,1.0
1,1.0,2.0,-2.0,5.0
2,2.0,3.0,-3.0,9.0
3,2.0,4.0,-4.0,16.0
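In the output above, the leading `NaN` in column `b` is left alone, while the trailing `NaN` in column `a` is filled with the last valid value: with `limit_direction='forward'`, gaps are only filled after the first valid observation. A short sketch with made-up values compares the available directions:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# forward: the leading NaN stays; the trailing NaN repeats the last valid value.
forward = ser.interpolate(limit_direction="forward")

# backward: the trailing NaN stays; the leading NaN takes the first valid value.
backward = ser.interpolate(limit_direction="backward")

# both: leading and trailing gaps are filled.
both = ser.interpolate(limit_direction="both")

print(forward.tolist())
print(backward.tolist())
print(both.tolist())  # [1.0, 1.0, 2.0, 3.0, 3.0]
```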


For all available options, refer to the documentation:<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html