# Handling Missing Values

Whether you fill in missing values with the mean, median, or mode depends on the data you are working with. The guidelines below will help you decide:

1. **Mean:** When a numerical variable is approximately normally distributed (symmetric, with no extreme outliers), you can use the mean to fill in its missing values.

2. **Median:** When a numerical variable is not normally distributed (for example, it is skewed or contains outliers), you can use the median to fill in its missing values, because the median is less affected by extreme values.

3. **Mode:** When the variable with missing values is categorical or discrete, you can use the mode (the most frequent value) to fill in its missing values.

So the first step is to check whether your data has missing values. If it does, look at each variable that has them. For a numerical variable, check its distribution: use the mean if it is approximately normal, and the median otherwise. For a categorical or discrete variable, use the mode. Because the choice is made per variable, one dataset can need a different measure for each column.
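The decision rule above can be sketched as a small helper. This is only a rule-of-thumb illustration: the function name `suggest_fill_strategy` and the skewness cutoff of 0.5 are assumptions made for this example, not part of pandas, and the helper treats every numeric column as continuous (a discrete numeric column may still be better served by the mode).

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_fill_strategy(series: pd.Series, skew_threshold: float = 0.5) -> str:
    """Suggest 'mean', 'median', or 'mode' for one column (hypothetical helper)."""
    # Non-numeric columns (strings, categories) have no mean or median.
    if not is_numeric_dtype(series):
        return "mode"
    # Sample skewness is close to 0 for roughly symmetric data.
    skew = series.dropna().skew()
    # The 0.5 cutoff is an assumed rule of thumb, not a fixed standard.
    if pd.notna(skew) and abs(skew) < skew_threshold:
        return "mean"
    return "median"


symmetric = pd.Series([1, 2, 3, 4, np.nan, 6, 7, 8, 9])
skewed = pd.Series([1, 2, 2, 3, 3, np.nan, 4, 100])
colors = pd.Series(["red", "blue", np.nan, "green", "blue"])

print(suggest_fill_strategy(symmetric))  # mean
print(suggest_fill_strategy(skewed))     # median
print(suggest_fill_strategy(colors))     # mode
```

A skewness check is only one way to judge normality; a histogram or a Q-Q plot is often more informative on real data.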

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = {'A': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, np.nan],
        'B': [2, 4, 6, 8, np.nan, 12, 14, 16, 18, np.nan],
        'C': ['red', 'blue', np.nan, 'green', 'green', 
              'blue', 'red', 'blue', 'green', np.nan]}
df = pd.DataFrame(data)
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  NaN   NaN  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  NaN   NaN    NaN
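Before filling anything, it can help to confirm how many values are missing in each column and how skewed the numeric columns are. The following is a minimal sketch that rebuilds the same sample data so it runs on its own; the variable names `missing_counts` and `skewness` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, np.nan],
    'B': [2, 4, 6, 8, np.nan, 12, 14, 16, 18, np.nan],
    'C': ['red', 'blue', np.nan, 'green', 'green',
          'blue', 'red', 'blue', 'green', np.nan],
})

# Number of missing entries per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Skewness of the numeric columns (missing values are skipped by default).
skewness = df[['A', 'B']].skew()
print(skewness)
```

Each column here has two missing entries, and the known values of `A` and `B` are symmetric around their centers, so their skewness is close to 0.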


## Using Mean

In [3]:
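# Mean of the known values in 'A' (NaN is skipped when computing it).
mean_A = df['A'].mean()
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and which may not update df.
df['A'] = df['A'].fillna(mean_A)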
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  5.0   NaN  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0   NaN    NaN


## Using Median

In [4]:
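# Median of the known values in 'B' (NaN is skipped when computing it).
median_B = df['B'].median()
# Assign the result back rather than relying on inplace=True.
df['B'] = df['B'].fillna(median_B)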
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  5.0  10.0  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0  10.0    NaN
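In this sample, column `B` is symmetric, so its mean and median are the same and either would give the same fill value. The difference shows up on skewed data. Here is a small sketch with made-up numbers (the `incomes` series is hypothetical, not from the dataset above) that includes one extreme outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical data: mostly similar values plus one extreme outlier.
incomes = pd.Series([30, 32, 35, np.nan, 38, 40, 1000])

mean_value = incomes.mean()      # pulled far upward by the outlier
median_value = incomes.median()  # stays near the bulk of the data

print(mean_value)    # roughly 195.83
print(median_value)  # 36.5

# Filling with the median keeps the imputed value typical of the data.
filled = incomes.fillna(median_value)
print(filled.tolist())
```

Filling the gap with the mean would insert a value that is larger than five of the six known values, which is why the median is usually the safer choice for skewed variables.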


## Using Mode

In [5]:
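# mode() returns every value tied for most frequent, in sorted order.
# Here 'blue' and 'green' both appear three times, so [0] picks 'blue'.
mode_C = df['C'].mode()[0]
df['C'] = df['C'].fillna(mode_C)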
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0   blue
3  4.0   8.0  green
4  5.0  10.0  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0  10.0   blue
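The three steps above can also be combined into one call, since `DataFrame.fillna` accepts a dictionary that maps column names to fill values. The sketch below rebuilds the original sample data so it runs on its own; `fill_values` and `filled` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Original sample data, before any filling.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, np.nan],
    'B': [2, 4, 6, 8, np.nan, 12, 14, 16, 18, np.nan],
    'C': ['red', 'blue', np.nan, 'green', 'green',
          'blue', 'red', 'blue', 'green', np.nan],
})

# One fill value per column, each chosen with a different measure.
fill_values = {
    'A': df['A'].mean(),     # mean for a roughly symmetric numeric column
    'B': df['B'].median(),   # median for the second numeric column
    'C': df['C'].mode()[0],  # most frequent category (first one on ties)
}

# fillna returns a new DataFrame; the original df is left unchanged.
filled = df.fillna(fill_values)
print(filled)
```

Returning a new DataFrame keeps the original data available, which makes it easy to compare the distributions before and after filling.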
