## Missing Values in datasets
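### Types of missing values in datasets
 1. MCAR - Missing completely at random: the chance that a value is missing is unrelated to any data, observed or not.
 2. MAR - Missing at random: the chance that a value is missing depends only on other observed columns.
 3. MNAR - Missing not at random: the chance that a value is missing depends on the missing value itself.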
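
The three mechanisms can be simulated on synthetic data to make the difference concrete. This is a minimal sketch: the `age` and `income` columns, the missingness probabilities, and the thresholds are all made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1000

# synthetic data (hypothetical columns, not from any real dataset)
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age), not on income itself
mar_mask = rng.random(n) < np.where(data["age"] > 60, 0.5, 0.1)

# MNAR: missingness depends on the hidden value itself (high incomes go missing)
mnar_mask = rng.random(n) < np.where(data["income"] > 60_000, 0.5, 0.1)

income_mcar = data["income"].mask(mcar_mask)
income_mar = data["income"].mask(mar_mask)
income_mnar = data["income"].mask(mnar_mask)

# under MNAR the observed mean is pulled below the true mean
print(data["income"].mean(), income_mcar.mean(), income_mnar.mean())
```

In this setup the MNAR column under-reports high incomes, so its observed mean sits below the true mean, which is the kind of bias that simple imputation cannot undo.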

In [2]:
# import packages
import seaborn as sns
import pandas as pd
import numpy as np

In [3]:
# creating a dataset / titanic dataset
df = sns.load_dataset('titanic')

# view the data headers
print(df.head())

# check for null values
df.isnull() # boolean DataFrame: True where a value is missing
df.isnull().sum() # number of missing values per column

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
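
Raw counts are easier to judge as a share of the rows. The sketch below uses a small stand-in frame with made-up values (not the Titanic data) and relies on the fact that the mean of a boolean column is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# small stand-in frame (hypothetical values, not the Titanic data)
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives booleans, so the column mean is the fraction missing
missing_share = sample.isnull().mean().sort_values(ascending=False)

# show as percentages, worst column first
print((missing_share * 100).round(1))
```

The same two calls should work on `df` from the cell above.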

In [4]:
# shape before the values dropped from the dataframe
print(f'before dropping null data : {df.shape}')

# dropna() returns a new dataframe without rows that contain nulls; df itself is unchanged
df.dropna()

# checking shape of the dataframe after dropping rows
print(f'after dropping null data : {df.dropna().shape}')

before dropping null data : (891, 15)
after dropping null data : (182, 15)
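
Dropping every row that has any null removes most of the data here (891 rows down to 182). `dropna` also accepts `subset` and `thresh` to drop more selectively. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# small stand-in frame (hypothetical values)
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "S"],
})

# drop a row only when one of the listed columns is missing
by_subset = sample.dropna(subset=["age", "embarked"])

# keep rows that have at least 2 non-null values
by_thresh = sample.dropna(thresh=2)

print(sample.dropna().shape)  # (0, 3): every row has at least one null
print(by_subset.shape)        # (2, 3)
print(by_thresh.shape)        # (3, 3)
```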


In [5]:
# before dropping columns shape
print(f'before dropping {df.shape}')

# remove columns that have null values (modifies df in place)
df.dropna(axis=1, inplace=True)

# showing head after drop columns
# df.head

# after dropping shape 
# print(f'after dropping columns {df.shape}')

before dropping (891, 15)


### Imputation of missing values
1. Mean value imputation
2. Median value imputation
3. Mode imputation
4. Random sample imputation

### Mean imputation replaces missing values with the mean (average) of the non-missing values in that column.

When to use:
-  Numerical data only
-  Data is missing at random
-  Want to preserve dataset size

Pros: Simple, preserves sample size
Cons: Reduces variance, can distort distributions

In [6]:
# creating a data set 
df = pd.DataFrame({
    'age': [25, 30, np.nan, 50, np.nan, 35, np.nan, 40],
    'salary': [1000, 2000, 5000, np.nan, 3000, np.nan, 1500, 3500]
})

print('before imputation')
print(df)

df['age_mean_imputed'] = df['age'].fillna(df['age'].mean())
df['salary_mean_imputed'] = df['salary'].fillna(df['salary'].mean())

print('after imputation')
print(df)


before imputation
    age  salary
0  25.0  1000.0
1  30.0  2000.0
2   NaN  5000.0
3  50.0     NaN
4   NaN  3000.0
5  35.0     NaN
6   NaN  1500.0
7  40.0  3500.0
after imputation
    age  salary  age_mean_imputed  salary_mean_imputed
0  25.0  1000.0              25.0          1000.000000
1  30.0  2000.0              30.0          2000.000000
2   NaN  5000.0              36.0          5000.000000
3  50.0     NaN              50.0          2666.666667
4   NaN  3000.0              36.0          3000.000000
5  35.0     NaN              35.0          2666.666667
6   NaN  1500.0              36.0          1500.000000
7  40.0  3500.0              40.0          3500.000000
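
The "reduces variance" drawback can be checked on the same `age` values: the mean is unchanged by the fill, but the spread shrinks because every imputed point sits exactly on the mean. A small sketch:

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, np.nan, 50, np.nan, 35, np.nan, 40])
age_imputed = age.fillna(age.mean())

# the mean is preserved by construction
print(age.mean(), age_imputed.mean())  # 36.0 36.0

# the sample variance drops: same squared deviations, more data points
print(age.var(), age_imputed.var())    # 92.5 and about 52.86
```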


### Median imputation replaces missing values with the median (middle value) of the non-missing values in that column.

When to use:
-  Numerical data only
-  Data has outliers
-  Distribution is skewed
-  More robust than mean imputation

Pros: Not affected by outliers, preserves sample size
Cons: Still reduces variance, can distort distributions

In [7]:
# creating a data set 
df = pd.DataFrame({
    'age': [25, 30, np.nan, 50, np.nan, 35, np.nan, 40],
    'salary': [1000, 2000, 5000, np.nan, 3000, np.nan, 1500, 3500]
})

print('before imputation')
print(df)

df['age_median_imputed'] = df['age'].fillna(df['age'].median())
df['salary_median_imputed'] = df['salary'].fillna(df['salary'].median())

print('after imputation')
print(df)


before imputation
    age  salary
0  25.0  1000.0
1  30.0  2000.0
2   NaN  5000.0
3  50.0     NaN
4   NaN  3000.0
5  35.0     NaN
6   NaN  1500.0
7  40.0  3500.0
after imputation
    age  salary  age_median_imputed  salary_median_imputed
0  25.0  1000.0                25.0                 1000.0
1  30.0  2000.0                30.0                 2000.0
2   NaN  5000.0                35.0                 5000.0
3  50.0     NaN                50.0                 2500.0
4   NaN  3000.0                35.0                 3000.0
5  35.0     NaN                35.0                 2500.0
6   NaN  1500.0                35.0                 1500.0
7  40.0  3500.0                40.0                 3500.0
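
To see why the median is preferred when outliers are present, compare the two fill values on a column with one extreme entry. The salary figures below are invented for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical salaries with one extreme outlier and one missing value
salary = pd.Series([1000, 2000, 3000, 1500, 3500, 100000, np.nan])

mean_fill = salary.mean()      # dragged upward by the outlier
median_fill = salary.median()  # stays near the bulk of the data

print(mean_fill)    # 18500.0
print(median_fill)  # 2500.0
```

Filling the gap with 18500 would place the imputed person far above five of the six observed salaries, while 2500 is a typical value.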


### Mode imputation replaces missing values with the most frequent value in that column. It is most useful for categorical variables with missing data.
Pros:
-  Simple and fast to implement
-  Works well with categorical data
-  Preserves the distribution of frequently occurring values

Cons:
-  Loses information about the original missing data
-  Can introduce bias if many values are missing
-  Reduces variance in the dataset
-  Not ideal for continuous/numerical data
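
The bias mentioned above shows up as an inflated share for the most common category. The sketch below uses a made-up `region` column with three missing entries; it also shows that `mode()` returns every tied value in sorted order, so `.iloc[0]` picks the first of them.

```python
import numpy as np
import pandas as pd

# hypothetical column: 3 of 8 values are missing
region = pd.Series(["US", "EU", "US", np.nan, np.nan, np.nan, "EU", "US"])

# value_counts ignores NaN by default, so this is the share among observed rows
share_before = region.value_counts(normalize=True)

region_filled = region.fillna(region.mode().iloc[0])
share_after = region_filled.value_counts(normalize=True)

print(share_before["US"], share_after["US"])  # 0.6 0.75

# with a tie, mode() returns all tied values, sorted
tied = pd.Series(["b", "a"]).mode()
print(list(tied))  # ['a', 'b']
```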


In [8]:
# creating dataframe with categorical data
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', np.nan, 'Tablet', 'Laptop', np.nan, 'Phone', 'Laptop'],
    'quantity': [1, 2, 1, np.nan, 3, 2, np.nan, 1],
    'region': ['US', 'EU', 'US', 'US', np.nan, 'EU', 'US', 'EU']
})

In [9]:
print('before mode imputation')
print(df)

df['product_mode_imputed'] = df['product'].fillna(df['product'].mode().iloc[0])
df['quantity_mode_imputed'] = df['quantity'].fillna(df['quantity'].mode().iloc[0])
df['region_mode_imputed'] = df['region'].fillna(df['region'].mode().iloc[0])
print('after mode imputation')
print(df)

before mode imputation
  product  quantity region
0  Laptop       1.0     US
1   Phone       2.0     EU
2     NaN       1.0     US
3  Tablet       NaN     US
4  Laptop       3.0    NaN
5     NaN       2.0     EU
6   Phone       NaN     US
7  Laptop       1.0     EU
after mode imputation
  product  quantity region product_mode_imputed  quantity_mode_imputed  \
0  Laptop       1.0     US               Laptop                    1.0   
1   Phone       2.0     EU                Phone                    2.0   
2     NaN       1.0     US               Laptop                    1.0   
3  Tablet       NaN     US               Tablet                    1.0   
4  Laptop       3.0    NaN               Laptop                    3.0   
5     NaN       2.0     EU               Laptop                    2.0   
6   Phone       NaN     US                Phone                    1.0   
7  Laptop       1.0     EU               Laptop                    1.0   

  region_mode_imputed  
0                  US  
1                  EU  
2                  US  
3                  US  
4                  US  
5                  EU  
6                  US  
7                  EU  

### Random sample imputation is a technique where missing values are replaced with randomly selected values from the existing (non-missing) values in the same column.

How it works:
1. Identify all non-missing values in a column
2. Randomly sample from these values
3. Replace missing values with the randomly sampled values
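
The three steps can be wrapped in a small helper. This is a sketch, not a library function: the name `random_sample_impute` and its `seed` parameter are assumptions made for illustration, and it uses NumPy's `default_rng` so that a fixed seed gives repeatable fills.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, seed=None):
    """Fill NaNs in `series` with values drawn from its observed entries."""
    rng = np.random.default_rng(seed)
    missing = series.isna()
    n_missing = int(missing.sum())

    # step 1: collect the non-missing values
    observed = series.dropna().to_numpy()
    if n_missing == 0 or len(observed) == 0:
        return series.copy()

    # step 2: sample from them with replacement
    draws = rng.choice(observed, size=n_missing, replace=True)

    # step 3: write the draws into the missing positions of a copy
    filled = series.copy()
    filled.loc[missing] = draws
    return filled

quantity = pd.Series([1, 2, 1, np.nan, 3, 2, np.nan, 1])
filled_a = random_sample_impute(quantity, seed=42)
filled_b = random_sample_impute(quantity, seed=42)
print(filled_a.equals(filled_b))  # True: same seed, same fill
```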

Pros and Cons:

Pros:
-  Preserves the original distribution of the data
-  Maintains variance better than mean/median/mode imputation
-  Works for both numerical and categorical data
-  Introduces less bias than using a single value

Cons:
-  Adds randomness (different results each run unless seed is set)
-  Doesn't consider relationships between variables
-  Can underestimate standard errors in statistical analysis
-  Assumes values are missing completely at random (MCAR); not suitable when missingness depends on other columns or on the missing value itself
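
The claim that random sampling maintains variance better than mean imputation can be checked on synthetic data. The sketch below assumes normally distributed values with about 30% removed completely at random; the sizes and parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for repeatability

# synthetic column with roughly 30% of values removed at random (MCAR)
values = pd.Series(rng.normal(100, 20, size=2000))
missing = rng.random(2000) < 0.3
observed = values.mask(missing)

# mean imputation: every gap gets the same value
mean_filled = observed.fillna(observed.mean())

# random sample imputation: each gap gets a draw from the observed values
random_filled = observed.copy()
random_filled[missing] = rng.choice(
    observed.dropna().to_numpy(), size=int(missing.sum()), replace=True
)

# compare spread against the observed (non-missing) data
print(observed.std(), mean_filled.std(), random_filled.std())
```

The mean-filled column should show a clearly smaller standard deviation, while the randomly filled column stays close to the observed one.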

In [10]:
# creating a dataframe
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', np.nan, 'Tablet', 'Laptop', np.nan, 'Phone', 'Laptop'],
    'quantity': [1, 2, 1, np.nan, 3, 2, np.nan, 1],
    'region': ['US', 'EU', 'US', 'US', np.nan, 'EU', 'US', 'EU']
})

In [18]:
print('before random sampling')
print(df)
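for column in df.columns:
    # locate the missing entries in this column
    missing_mask = df[column].isna()
    n_missing = missing_mask.sum()

    if n_missing > 0:
        # pool of observed values to draw from
        available_values = df[column].dropna()
        # draw with replacement; no seed is set, so results differ between runs
        random_samples = np.random.choice(available_values, size=n_missing, replace=True)
        # keep the original column and write the filled version to a new one
        df[f'{column}_random_sampled'] = df[column].copy()
        df.loc[missing_mask, f'{column}_random_sampled'] = random_samples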

print('after random sampling')
print(df)

before random sampling
  product  quantity region
0  Laptop       1.0     US
1   Phone       2.0     EU
2     NaN       1.0     US
3  Tablet       NaN     US
4  Laptop       3.0    NaN
5     NaN       2.0     EU
6   Phone       NaN     US
7  Laptop       1.0     EU
after random sampling
  product  quantity region product_random_sampled  quantity_random_sampled  \
0  Laptop       1.0     US                 Laptop                      1.0   
1   Phone       2.0     EU                  Phone                      2.0   
2     NaN       1.0     US                 Laptop                      1.0   
3  Tablet       NaN     US                 Tablet                      2.0   
4  Laptop       3.0    NaN                 Laptop                      3.0   
5     NaN       2.0     EU                 Laptop                      2.0   
6   Phone       NaN     US                  Phone                      2.0   
7  Laptop       1.0     EU                 Laptop                      1.0   

  region_