#### Handling Missing Values
Missing values occur when no data is stored for a particular cell in a dataset.
They can appear as:
* NaN (Not a Number)
* None
* A blank string ""
* A custom marker such as "?", "N/A", or "null"
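
pandas treats only NaN and None as missing by default, so blank strings and custom markers have to be converted before `isnull()` can see them. The following is a minimal sketch of one way to do that; the frame `raw` and the `markers` list are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data that mixes several representations of "missing".
raw = pd.DataFrame({
    "Name": ["Mickey", "", "Tengen", "N/A"],
    "Age": ["21", "?", "null", "24"],
})

# The marker list is an assumption for this example; adjust it to your data.
markers = ["", "?", "N/A", "null"]

# isin() flags the marker cells and mask() replaces them with NaN.
cleaned = raw.mask(raw.isin(markers), np.nan)

# With the markers gone, the Age column can be converted to numbers.
cleaned["Age"] = pd.to_numeric(cleaned["Age"])

print(cleaned.isnull().sum())
```

When the data comes from a file, the `na_values` parameter of `pd.read_csv` can perform the same conversion at load time.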

In [26]:
import pandas as pd
data = {
    'Name': ['Mickey', 'Rengoku', 'Tengen', 'Sanemi', 'Obanai', None],
    'Breathing': [None, 'Fire', 'Sound', 'Wind', 'Serpent', 'Water'],
    'Age': [21, 34, None, 24, 25, None],
}
df = pd.DataFrame(data)
print(df)

      Name Breathing   Age
0   Mickey      None  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound   NaN
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5     None     Water   NaN


In [20]:
print(df.isnull())

    Name  Breathing    Age
0  False       True  False
1  False      False  False
2  False      False   True
3  False      False  False
4  False      False  False
5   True      False   True


In [21]:
print(df.isnull().sum())

Name         1
Breathing    1
Age          2
dtype: int64


#### There are three major mechanisms that give rise to missing values

* MCAR: Missing Completely At Random. The probability that a value is missing is unrelated to both the observed and the missing data.
    * The missingness is entirely random, with no pattern at all.
    * Example: a survey respondent accidentally skips a question.
* MAR: Missing At Random. The missingness depends on other observed variables, not on the missing value itself.
    * Example: income is missing for young people because they do not have a job yet, not because of the income amount itself.
* MNAR: Missing Not At Random. The missingness depends on the missing value itself.
    * Example: people with high incomes tend not to disclose them, so income is missing because of the income itself.
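
To make the three mechanisms concrete, the sketch below simulates each one on synthetic data. The `people` frame, the 20% rate, and the age and income cutoffs are all assumptions made for illustration. Real MAR and MNAR processes are usually probabilistic; hard cutoffs are used here only to keep the dependence easy to see.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, fully observed data (hypothetical values).
people = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value.
mcar = people["income"].mask(rng.random(n) < 0.2)

# MAR: missingness depends on an observed variable (age), not on income.
mar = people["income"].mask(people["age"] < 25)

# MNAR: missingness depends on the unobserved value itself (high income).
mnar = people["income"].mask(people["income"] > 70_000)

# Fraction of missing income values under each mechanism.
print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```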

#### Three major strategies to handle missing values

* Remove rows that contain missing values

In [22]:
df_cleaned = df.dropna()
print(df)
print(df_cleaned)

      Name Breathing   Age
0   Mickey      None  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound   NaN
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5     None     Water   NaN
      Name Breathing   Age
1  Rengoku      Fire  34.0
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
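
Dropping every row that has any gap removed half of the rows here. `dropna` can be made less aggressive with its `subset` and `thresh` parameters. The sketch below rebuilds the same data under the hypothetical name `df_demo` so that it runs on its own.

```python
import pandas as pd

# Same data as the example above, rebuilt so the block is self-contained.
df_demo = pd.DataFrame({
    "Name": ["Mickey", "Rengoku", "Tengen", "Sanemi", "Obanai", None],
    "Breathing": [None, "Fire", "Sound", "Wind", "Serpent", "Water"],
    "Age": [21, 34, None, 24, 25, None],
})

# Drop a row only if 'Age' is missing; gaps in other columns are kept.
by_age = df_demo.dropna(subset=["Age"])

# Keep rows that have at least two non-missing values.
at_least_two = df_demo.dropna(thresh=2)

print(by_age)
print(at_least_two)
```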


* Fill with a constant or a summary statistic such as the mean

In [23]:
df_filled = df.copy()  # keep the original df for the examples that follow
# Assign back instead of fillna(..., inplace=True) on df[col], which stops working in pandas 3.0.
df_filled['Name'] = df_filled['Name'].fillna("Unknown")
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())
print(df_filled)

      Name Breathing   Age
0   Mickey      None  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound  26.0
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5  Unknown     Water  26.0


* Fill by interpolating between neighbouring values

In [27]:
df

      Name Breathing   Age
0   Mickey      None  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound   NaN
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5     None     Water   NaN


In [29]:
df['Age'] = df['Age'].interpolate()  # linear interpolation by default
print(df)

      Name Breathing   Age
0   Mickey      None  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound  29.0
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5     None     Water  25.0
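
The output above follows from how the default linear method works: a gap between two valid values becomes a point on the straight line joining them, and a trailing gap repeats the last valid value. The sketch below reproduces this on a standalone series (the name `age` is hypothetical) and shows the `limit_area` parameter as one way to leave trailing gaps untouched.

```python
import pandas as pd

age = pd.Series([21, 34, None, 24, 25, None])

# Default linear interpolation: index 2 becomes the midpoint of its
# neighbours, (34 + 24) / 2 = 29.0. The trailing gap at index 5 has no
# right-hand neighbour, so it repeats the last valid value, 25.0.
linear = age.interpolate()

# limit_area="inside" fills only gaps surrounded by valid values,
# so the trailing NaN is left as it is.
inside_only = age.interpolate(limit_area="inside")

print(linear)
print(inside_only)
```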


* Mean-based imputation works well when the data is approximately normally distributed.
* Median-based imputation is preferable when the distribution contains outliers.
* Mode-based imputation is used for categorical values.
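
The difference between the three choices is easiest to see on a small example. The sketch below uses hypothetical series: `ages` contains one extreme outlier, and `styles` is categorical.

```python
import pandas as pd

# Hypothetical ages with one extreme outlier (240, e.g. a data-entry error).
ages = pd.Series([21, 34, None, 24, 25, 240])

# The mean of the observed values is 68.8, pulled up by the outlier.
mean_fill = ages.fillna(ages.mean())

# The median of the observed values is 25.0, unaffected by the outlier.
median_fill = ages.fillna(ages.median())

# Categorical data has no mean or median, so the most frequent value is used.
styles = pd.Series(["Fire", "Water", None, "Fire"])
mode_fill = styles.fillna(styles.mode()[0])

print(mean_fill[2], median_fill[2], mode_fill[2])
```

Here the median gives a far more plausible age than the mean, which is why it is the usual choice when outliers are present.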


In [34]:
df['Breathing'].mode()[0]  # mode() ignores missing values by default

'Fire'

In [35]:
df['Breathing'] = df['Breathing'].fillna(df['Breathing'].mode()[0])
df

      Name Breathing   Age
0   Mickey      Fire  21.0
1  Rengoku      Fire  34.0
2   Tengen     Sound  29.0
3   Sanemi      Wind  24.0
4   Obanai   Serpent  25.0
5     None     Water  25.0
