# **Random Imputation**

Random imputation replaces each missing value with a **randomly selected observed value** from the same variable. Because the replacements are drawn from the observed data, the method preserves the variable's distribution, but it adds noise and can introduce bias if not used carefully.
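
To make the idea concrete, here is a minimal sketch on a toy series. The values and the variable names (`ages`, `donors`, `imputed`) are made up for illustration and are not taken from the Titanic data used below.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])

missing = ages.isnull()

# Draw one observed value per missing slot (without replacement by default)
donors = ages.dropna().sample(int(missing.sum()), random_state=0)

imputed = ages.copy()
# .to_numpy() drops the donors' index so the values are assigned by position
imputed.loc[missing] = donors.to_numpy()

print(imputed.tolist())
```

Every filled slot holds a value that already occurs in the observed data, and the observed entries are left untouched.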

- **Advantages**:
    - Simple to implement
    - Preserves the variable's distribution
    - Quick to compute
    - Well suited for linear models, as it **doesn't distort the distribution**, regardless of the percentage of missing values
- **Disadvantages**:
    - Can introduce noise and bias
    - Does not account for relationships between variables
    - May not be appropriate for all variables
    - Affects the covariance with other features
    - **Not available as a built-in scikit-learn transformer, so it is usually implemented by hand with pandas**
    - Memory heavy for deployment, as the original training set must be stored so that values can be drawn from it to fill the missing values in incoming observations
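
The distribution and covariance points above can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the `age` and `fare` series are randomly generated (not the Titanic columns), roughly 20% of `age` is blanked out at random, and random imputation is compared with mean imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, correlated features (assumed for illustration only)
age = pd.Series(rng.normal(30, 14, n))
fare = 2.0 * age + pd.Series(rng.normal(0, 10, n))

# Blank out roughly 20% of the ages at random
age_missing = age.copy()
age_missing.loc[rng.random(n) < 0.2] = np.nan
mask = age_missing.isnull()

# Random imputation: fill each gap with a randomly drawn observed value
random_fill = age_missing.copy()
random_fill.loc[mask] = (
    age_missing.dropna().sample(int(mask.sum()), random_state=42).to_numpy()
)

# Mean imputation, for comparison
mean_fill = age_missing.fillna(age_missing.mean())

print('variance (observed only):', round(age_missing.var(), 2))
print('variance (random fill)  :', round(random_fill.var(), 2))
print('variance (mean fill)    :', round(mean_fill.var(), 2))
print('cov with fare (observed only):', round(age_missing.cov(fare), 2))
print('cov with fare (random fill)  :', round(random_fill.cov(fare), 2))
```

In this setup the variance after random imputation stays close to the observed variance, while mean imputation shrinks it. The covariance with `fare` drops after random imputation, because the filled values are drawn without regard to the other feature.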




In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('train.csv')[['Age','Fare','Survived']]
df.head()

|   | Age | Fare | Survived |
|---|---|---|---|
| 0 | 22.0 | 7.25 | 0 |
| 1 | 38.0 | 71.2833 | 1 |
| 2 | 26.0 | 7.925 | 1 |
| 3 | 35.0 | 53.1 | 1 |
| 4 | 35.0 | 8.05 | 0 |


In [3]:
np.round(df.isnull().mean()*100,2)

Age         19.87
Fare         0.00
Survived     0.00
dtype: float64

In [4]:
x = df.drop('Survived', axis=1)
y = df['Survived']  

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

x_train['Age_imputed'] = x_train['Age']
x_test['Age_imputed'] = x_test['Age']

print('MISSING VALUES IN Age_imputed : ', x_train['Age_imputed'].isnull().sum(), '\n')
x_train.sample(5)

MISSING VALUES IN Age_imputed :  140 



|   | Age | Fare | Age_imputed |
|---|---|---|---|
| 370 | 25.0 | 55.4417 | 25.0 |
| 339 | 45.0 | 35.5 | 45.0 |
| 823 | 27.0 | 12.475 | 27.0 |
| 589 | NaN | 8.05 | NaN |
| 861 | 21.0 | 11.5 | 21.0 |


In [5]:
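# .values drops the sampled index, so pandas assigns by position instead of aligning on labels (which would leave the NaNs in place)
x_train.loc[x_train['Age_imputed'].isnull(), 'Age_imputed'] = x_train['Age'].dropna().sample(x_train['Age'].isnull().sum(), random_state=42).values
# Test-set gaps are filled from the TRAINING pool, so no information from the test set is used for imputation
x_test.loc[x_test['Age_imputed'].isnull(), 'Age_imputed'] = x_train['Age'].dropna().sample(x_test['Age'].isnull().sum(), random_state=42).values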
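
One detail worth checking when assigning sampled values with pandas is index alignment. The sketch below uses a small made-up series with string labels (an assumption for readability) to compare assigning the sampled `Series` directly with assigning its underlying array.

```python
import numpy as np
import pandas as pd

# Toy series with explicit labels, made up for illustration
s = pd.Series([10.0, np.nan, 30.0, np.nan], index=['a', 'b', 'c', 'd'])
mask = s.isnull()

# The sampled values keep the labels of their source rows ('a' and 'c')
donors = s.dropna().sample(int(mask.sum()), random_state=1)

# Assigning the Series: pandas aligns on labels, and 'b'/'d' are not in donors
aligned = s.copy()
aligned.loc[mask] = donors

# Assigning the raw array: values are placed by position
positional = s.copy()
positional.loc[mask] = donors.to_numpy()

print('NaNs left after label-aligned assignment:', aligned.isnull().sum())
print('NaNs left after positional assignment   :', positional.isnull().sum())
```

The label-aligned assignment leaves the gaps unfilled, because the donor rows and the missing rows never share a label. Passing `.values` or `.to_numpy()` avoids this.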


In [6]:
"""
x_train.loc[x_train['Age_imputed'].isnull(), 'Age_imputed'] = x_train['Age'].dropna().sample(x_train['Age'].isnull().sum(), random_state=42).values

Breakdown:

1.  x_train['Age'].dropna().sample(..., random_state=42).values:
    -   x_train['Age'].dropna(): First, we get all the VALID (non-missing) age values. This is the pool we sample from.
    -   .sample(...): Then, we randomly pick values from this pool (without replacement by default).
    -   x_train['Age'].isnull().sum(): This is HOW MANY random values we need: the number of missing values in the 'Age' column.
    -   random_state=42: Ensures the same values are picked every time we run this code (for reproducibility).
    -   .values: Drops the index of the sampled rows. Without it, pandas aligns the assignment on index labels, none of which match the missing rows, so the NaNs would stay.

2.  x_train.loc[x_train['Age_imputed'].isnull(), 'Age_imputed'] = ... :
    -   'Age_imputed': The column we are modifying (a copy of 'Age' created earlier).
    -   x_train['Age_imputed'].isnull(): This is a filter. It selects ONLY the rows where the value is currently NaN (missing).
    -   = ...: We assign the random values we generated in step 1 into these specific empty slots.

Summary:
This line fills the missing values in 'Age_imputed' by randomly picking real, existing age values from the 'Age' column. It preserves the original distribution of the data better than filling with the mean or median.
"""