## What Null/NA/NaN Objects Look Like

Source: https://github.com/pandas-dev/pandas/issues/28095

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan for float data, np.nan or None for object-dtype data, and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type.

In [1]:
import numpy as np
import pandas as pd

In [5]:
print(None)

None


In [2]:
np.nan # not-a-number

nan

In [3]:
pd.NA

<NA>

In [4]:
pd.NaT

NaT
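To see where each placeholder shows up in practice, here is a minimal sketch. The three small Series are made up for illustration and are not part of the movie data used later.

```python
import numpy as np
import pandas as pd

# Default inference: the integers are upcast to float64 so np.nan can mark the gap.
floats = pd.Series([1, None, 3])

# Nullable integer dtype: the values stay integers and the gap is pd.NA.
ints = pd.Series([1, None, 3], dtype="Int64")

# Datetime data: the gap is pd.NaT.
dates = pd.Series(pd.to_datetime(["2020-01-01", None]))

for s in (floats, ints, dates):
    print(s.dtype, repr(s.iloc[1]))
```

The same input list gives a different missing-value marker depending on the dtype the Series ends up with.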

----
## Note! Avoid Ordinary Comparisons with Missing Values

* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true

The reasoning is that a missing value is unknown, so there is no way to tell whether two missing values are equal to each other.

## Summary Table
| Placeholder | Library   | Type                               | Data Type Context                   |
|-------------|-----------|------------------------------------|-------------------------------------|
| `np.nan`    | NumPy     | `float`                            | Floating-point numbers              |
| `pd.NA`     | Pandas    | Nullable `Int64`, `boolean`, `string` | Any nullable type (int, bool, str) |
| `pd.NaT`    | Pandas    | `datetime64`, `timedelta64`        | DateTime and TimeDelta data         |

In [6]:
np.nan == np.nan

False

In [7]:
np.nan in [np.nan]

True

In [8]:
np.nan is np.nan

True
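The `in` result above can look surprising next to `np.nan == np.nan`. Python's list membership test checks identity (`is`) before equality, and `np.nan` is a single shared object, so it is found in the list. The sketch below uses two separately created NaN floats (the names `a` and `b` are illustrative) to show the difference.

```python
import numpy as np

a = float("nan")
b = float("nan")  # a second, distinct NaN object

equal = a == b                    # NaN never compares equal, even to another NaN
same_object_found = a in [a]      # identity check succeeds
other_object_found = a in [b]     # identity fails, equality fails
shared_nan_found = np.nan in [np.nan]  # np.nan is one shared object

print(equal, same_object_found, other_object_found, shared_nan_found)
```

Because the answer depends on object identity, `in` is not a reliable way to test for missing values.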

In [9]:
pd.NA == pd.NA

<NA>

In [10]:
pd.NA is pd.NA

True

In [11]:
pd.NaT is pd.NaT

True

In [12]:
not pd.NA

TypeError: boolean value of NA is ambiguous

In [13]:
not pd.NaT

False

In [14]:
not np.nan

False
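Since `==`, `in`, and `not` each behave differently across the placeholders, a dedicated check is usually the safer route. Here is a minimal sketch using `pd.isna`; the small Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# pd.isna gives one consistent answer for all of the placeholders.
placeholders = [None, np.nan, pd.NA, pd.NaT]
flags = [bool(pd.isna(value)) for value in placeholders]
print(flags)

# The same check applied element-wise to a whole Series.
s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna()
print(mask.tolist())
```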

-------

## Data

People were asked to rate their opinion of several actors on a 1-10 scale, before and after watching one of their movies. However, some of the data is missing.

In [51]:
df = pd.read_csv('data/movie_scores.csv')

In [17]:
df.shape

(5, 6)

In [18]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


## Checking and Selecting for Null Values

In [19]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [20]:
df.isnull()

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [21]:
df.notnull()

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [22]:
df['first_name']

0      Tom
1      NaN
2     Hugh
3    Oprah
4     Emma
Name: first_name, dtype: object

In [23]:
df[df['first_name'].notnull()]

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
df[(df['pre_movie_score'].isnull()) & df['gender'].notnull()]

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
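A common follow-up to the boolean masks above is to count the missing values. The sketch below rebuilds the table by hand, since the CSV file itself is not included here; the hand-built frame is an assumption meant to mirror `data/movie_scores.csv`.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for data/movie_scores.csv (assumed to match the table above).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "gender": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Summing a boolean mask counts the True values, i.e. the missing cells per column.
missing_per_column = df.isnull().sum()

# Rows in which every single column is missing.
fully_empty_rows = df[df.isnull().all(axis=1)]

print(missing_per_column)
print(fully_empty_rows.index.tolist())
```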


## Drop Data

In [27]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(*, axis: 'Axis' = 0, how: 'AnyAll | lib.NoDefault' = <no_default>, thresh: 'int | lib.NoDefault' = <no_default>, subset: 'IndexLabel | None' = None, inplace: 'bool' = False, ignore_index: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.

    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.

        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.

        Only a single axis is allowed.

    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.

       

In [25]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [None]:
df.dropna()

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [30]:
df.dropna(thresh=4)

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
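The `how` and `thresh` arguments are easiest to compare side by side. `thresh` is the minimum number of non-missing values a row needs in order to be kept. This is a sketch on a hand-built copy of the table, which is assumed to match the CSV.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for data/movie_scores.csv (assumed to match the table above).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "gender": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

only_empty_dropped = df.dropna(how="all")  # drop rows where every value is missing
at_least_4 = df.dropna(thresh=4)           # keep rows with >= 4 non-missing values
at_least_5 = df.dropna(thresh=5)           # stricter: the Hugh Jackman row has only 4

print(only_empty_dropped.index.tolist())
print(at_least_4.index.tolist())
print(at_least_5.index.tolist())
```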


In [33]:
df2 = df.copy()

In [34]:
df2

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [37]:
df2.loc[3:,'gender']

3    f
4    f
Name: gender, dtype: object

In [38]:
df2.loc[3:,'gender'] = np.nan

In [39]:
df2

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,,6.0,8.0
4,Emma,Stone,31.0,,7.0,9.0


In [40]:
df2.dropna(subset=['gender'])

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,


In [41]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [42]:
df.dropna(axis=1) # Every column contains at least one NaN, so all columns are dropped

0
1
2
3
4


In [43]:
df.dropna(thresh=4,axis=1)

Unnamed: 0,first_name,last_name,age,gender
0,Tom,Hanks,63.0,m
1,,,,
2,Hugh,Jackman,51.0,m
3,Oprah,Winfrey,66.0,f
4,Emma,Stone,31.0,f
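To predict which columns `thresh` will keep, it can help to look at `df.count()`, which reports the number of non-missing values per column. This is a sketch on a hand-built copy of the table, which is assumed to match the CSV.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for data/movie_scores.csv (assumed to match the table above).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "gender": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

non_missing = df.count()                       # non-missing values per column
kept = df.dropna(thresh=4, axis=1)             # keep columns with >= 4 of them
same_selection = df.loc[:, non_missing >= 4]   # equivalent manual selection

print(non_missing)
print(kept.columns.tolist())
```

The two score columns have only three non-missing values each, which is why they fall below the threshold.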


## Fill Data

In [44]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [45]:
df.fillna("NEW VALUE!")  

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!
2,Hugh,Jackman,51.0,m,NEW VALUE!,NEW VALUE!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [52]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [54]:
df['first_name'].fillna("NO NAME")

0        Tom
1    NO NAME
2       Hugh
3      Oprah
4       Emma
Name: first_name, dtype: object

In [55]:
df['first_name'] = df['first_name'].fillna("NO NAME")
df['last_name'] = df['last_name'].fillna("NO NAME")

In [56]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NO NAME,NO NAME,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [63]:
df['age'].mean()

np.float64(52.75)

In [62]:
(63+51+66+31)/4

52.75
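The manual check divides by 4 rather than 5 because `mean()` skips missing values by default (`skipna=True`). Here is a small sketch on a stand-alone copy of the `age` values:

```python
import numpy as np
import pandas as pd

# Stand-alone copy of the age column shown above.
ages = pd.Series([63.0, np.nan, 51.0, 66.0, 31.0])

mean_default = ages.mean()             # NaN is ignored: (63 + 51 + 66 + 31) / 4
mean_strict = ages.mean(skipna=False)  # NaN propagates into the result

print(mean_default, mean_strict)
```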

In [66]:
df['age'].fillna(df['age'].mean())
# df['age'].fillna(52.75)

0    63.00
1    52.75
2    51.00
3    66.00
4    31.00
Name: age, dtype: float64

In [67]:
df['pre_movie_score']

0    8.0
1    NaN
2    NaN
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [68]:
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [70]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NO NAME,NO NAME,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [71]:
df['age'] = df['age'].fillna(df['age'].mean())
df['pre_movie_score'] = df['pre_movie_score'].fillna(df['pre_movie_score'].mean())
df['post_movie_score'] = df['post_movie_score'].fillna(df['post_movie_score'].mean())

In [72]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NO NAME,NO NAME,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [75]:
df['gender'].fillna('m')

0    m
1    m
2    m
3    f
4    f
Name: gender, dtype: object

In [76]:
df['gender'] = df['gender'].fillna('m')

In [77]:
df

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NO NAME,NO NAME,52.75,m,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [78]:
df.mean(numeric_only=True)

age                 52.75
pre_movie_score      7.00
post_movie_score     9.00
dtype: float64

In [79]:
df3 = pd.read_csv('data/movie_scores.csv')

In [81]:
df3

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [80]:
df3.mean(numeric_only=True)

age                 52.75
pre_movie_score      7.00
post_movie_score     9.00
dtype: float64

In [None]:
df3 = df3.fillna(df3.mean(numeric_only=True))

In [85]:
df3['gender'] = df3['gender'].fillna('m')

In [86]:
df3

Unnamed: 0,first_name,last_name,age,gender,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,52.75,m,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
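The column-by-column fills in this section can also be written as a single call, because `fillna` accepts a dict that maps column names to fill values. This is a sketch on a hand-built copy of the table, which is assumed to match the CSV. The fill values repeat the choices made above, including the arbitrary `'m'` for gender.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for data/movie_scores.csv (assumed to match the table above).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "gender": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# One fill value per column: text placeholders for names, column means for numbers.
fill_values = {
    "first_name": "NO NAME",
    "last_name": "NO NAME",
    "gender": "m",
    "age": df["age"].mean(),
    "pre_movie_score": df["pre_movie_score"].mean(),
    "post_movie_score": df["post_movie_score"].mean(),
}

filled = df.fillna(value=fill_values)  # returns a new DataFrame; df is unchanged
print(filled)
```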


## Filling with Interpolation

Be careful with this technique: interpolation is only appropriate when it is reasonable to assume that a missing value lies between its neighbors, so make sure this holds for your data. Several interpolation methods are available; the default is linear.

Full Docs on this Method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

In [91]:
np.linspace(100,50,3)

array([100.,  75.,  50.])

In [87]:
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

In [88]:
ser = pd.Series(airline_tix)

In [89]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [90]:
ser.interpolate()

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64

In [92]:
ser.interpolate(method='linear')

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
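The `'linear'` method ignores the index and treats the values as equally spaced, which is why the result matches `np.linspace`. The sketch below uses a made-up Series with two consecutive gaps to show this.

```python
import numpy as np
import pandas as pd

# Illustrative Series (not from the airline example) with two gaps in a row.
gappy = pd.Series([100.0, np.nan, np.nan, 40.0])

filled = gappy.interpolate()            # method='linear' is the default
expected = np.linspace(100, 40, 4)      # four equally spaced points from 100 to 40

print(filled.tolist())
print(expected.tolist())
```

Each gap is filled with the next equally spaced step between the two known endpoints.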

In [96]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [97]:
# ser.interpolate(method='polynomial',order=2)

In [98]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [99]:
df = pd.DataFrame(ser,columns=['Price'])

In [100]:
df

Unnamed: 0,Price
first,100.0
business,
economy-plus,50.0
economy,30.0


In [None]:
df.interpolate()  # the default method is 'linear'

Unnamed: 0,Price
first,100.0
business,75.0
economy-plus,50.0
economy,30.0


In [103]:
df = df.reset_index()

In [104]:
df

Unnamed: 0,index,Price
0,first,100.0
1,business,
2,economy-plus,50.0
3,economy,30.0


In [107]:
# df.interpolate(method='polynomial',order=2)

In [112]:
list('abcd')

['a', 'b', 'c', 'd']

In [108]:
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))

In [109]:
df

Unnamed: 0,a,b,c,d
0,0.0,,-1.0,1.0
1,,2.0,,
2,2.0,3.0,,9.0
3,,4.0,-4.0,16.0


In [114]:
df.interpolate(method='linear', limit_direction='forward')

Unnamed: 0,a,b,c,d
0,0.0,,-1.0,1.0
1,1.0,2.0,-2.0,5.0
2,2.0,3.0,-3.0,9.0
3,2.0,4.0,-4.0,16.0
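In the output above, the leading NaN in column `b` is left alone, while the trailing NaN in column `a` is filled with the last valid value. The sketch below compares that behavior with two other settings, `limit_direction='both'` and `limit_area='inside'`, on the same frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list("abcd"))

# Forward only: gaps before the first valid value stay missing.
forward = df.interpolate(method="linear", limit_direction="forward")

# Both directions: leading gaps are filled as well.
both = df.interpolate(method="linear", limit_direction="both")

# Inside only: fill gaps surrounded by valid values, leave the edges missing.
inside = df.interpolate(method="linear", limit_area="inside")

print(forward, both, inside, sep="\n\n")
```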


Refer to this link for the full list of options:<br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html