# Random sample imputation
Random sample imputation is a technique for handling missing values: each missing entry of a variable is filled with a value drawn at random from that variable's observed (non-missing) values.
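
The idea can be sketched in a few lines of pandas. This is a minimal illustration, not a reference implementation: the helper name `random_sample_impute`, its `random_state` parameter, and the toy `ages` series are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def random_sample_impute(series: pd.Series, random_state: int = 0) -> pd.Series:
    """Fill missing entries with values drawn at random from the observed ones.

    Hypothetical helper written for this illustration.
    """
    filled = series.copy()
    missing = filled.isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return filled
    observed = filled.dropna()
    # Sample with replacement so the draw also works when there are
    # more missing entries than observed values.
    draws = observed.sample(n=n_missing, replace=True, random_state=random_state)
    # Assign by position: to_numpy() drops the index labels of the sampled rows,
    # which would otherwise be used for alignment.
    filled[missing] = draws.to_numpy()
    return filled


ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])  # toy data
imputed = random_sample_impute(ages)
print(imputed)
```

Observed entries are left untouched; only the two missing positions receive values, and every value they receive already exists in the series.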

Because the replacements are drawn from the existing data, the imputed variable keeps roughly the same distribution and variance as the observed one. This avoids the distortion introduced by filling every gap with a single fixed value or a summary statistic such as the mean or median, which shrinks the variance and creates an artificial spike at that value.
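
A small experiment makes the variance argument concrete. The sketch below uses synthetic data (a normal variable with about 20% of its values removed completely at random); the variable names and parameters are assumptions chosen for illustration and are unrelated to the dataset used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic variable; roughly 20% of the entries are removed completely at random.
values = pd.Series(rng.normal(loc=30.0, scale=14.0, size=1000))
mask = rng.random(1000) < 0.2
with_gaps = values.mask(mask)
observed = with_gaps.dropna()

# Mean imputation: every gap receives the same value.
mean_filled = with_gaps.fillna(observed.mean())

# Random sample imputation: every gap receives a randomly drawn observed value.
random_filled = with_gaps.copy()
random_filled[mask] = observed.sample(
    n=int(mask.sum()), replace=True, random_state=0
).to_numpy()

print("observed std:      ", observed.std())
print("mean-imputed std:  ", mean_filled.std())
print("random-imputed std:", random_filled.std())
```

Mean imputation pulls the standard deviation down because a fifth of the values now sit exactly at the mean, while the randomly imputed column stays close to the spread of the observed data.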

Random sample imputation is most appropriate when values are missing completely at random (MCAR), that is, when the missingness does not depend on the variable itself or on other variables and carries no information of its own. Under that assumption, the observed values are a fair sample from which to draw plausible replacements.

However, random sample imputation ignores relationships between variables: a value is drawn without regard to the row's other columns, so an imputed value may be far from the true one, and correlations with other variables are weakened. The results also change from run to run unless the random seed is fixed.
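
The run-to-run variation can be controlled by seeding the sampler. The following sketch assumes pandas' `Series.sample` with its `random_state` argument; the `observed` values are toy data.

```python
import pandas as pd

observed = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0])  # toy data

# The same seed reproduces the same draw.
draw_a = observed.sample(n=3, replace=True, random_state=42)
draw_b = observed.sample(n=3, replace=True, random_state=42)

# A different seed will generally produce a different draw.
draw_c = observed.sample(n=3, replace=True, random_state=7)

print(draw_a.equals(draw_b))
print(draw_a.to_list(), draw_c.to_list())
```

Fixing the seed makes an analysis repeatable, but it does not remove the underlying uncertainty: a different seed gives a different, equally valid imputation.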

When applying random sample imputation, consider how these limitations affect subsequent analysis or modeling. The technique is often better suited to exploratory work or sensitivity analysis than to final imputation in critical applications, where model-based methods that use the other variables may be preferable.

## Advantages of random sample imputation

1. It is simple and easy to implement.
2. It does not require any special software.
3. It can be used with any type of variable, numerical or categorical.

## Disadvantages of random sample imputation

1. It can introduce bias if the values are not missing completely at random (MCAR).
2. It does not take into account the relationships between variables, so it can weaken their correlations.
3. The observed training values must be kept available to impute new data, which can be memory-intensive if the dataset is large.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv("train.csv", usecols=["Age", "Fare", "Survived"])

In [4]:
df

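| | Survived | Age | Fare |
|---|---|---|---|
| 0 | 0 | 22.0 | 7.2500 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.9250 |
| 3 | 1 | 35.0 | 53.1000 |
| 4 | 0 | 35.0 | 8.0500 |
| ... | ... | ... | ... |
| 886 | 0 | 27.0 | 13.0000 |
| 887 | 1 | 19.0 | 30.0000 |
| 888 | 0 | NaN | 23.4500 |
| 889 | 1 | 26.0 | 30.0000 |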


In [5]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [6]:
df.isnull().mean()*100

Survived     0.00000
Age         19.86532
Fare         0.00000
dtype: float64

In [7]:
X = df.drop(columns=["Survived"])
y = df["Survived"]

In [9]:
X.head()

| | Age | Fare |
|---|---|---|
| 0 | 22.0 | 7.25 |
| 1 | 38.0 | 71.2833 |
| 2 | 26.0 | 7.925 |
| 3 | 35.0 | 53.1 |
| 4 | 35.0 | 8.05 |


In [10]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [20]:
# Copy Age into a new column so the original, with its gaps, stays available for comparison
X_train["Age_imputed"] = X_train["Age"]
X_test["Age_imputed"] = X_test["Age"]

In [22]:
X_train.head()

| | Age | Fare | Age_imputed |
|---|---|---|---|
| 30 | 40.0 | 27.7208 | 40.0 |
| 10 | 4.0 | 16.7 | 4.0 |
| 873 | 47.0 | 9.0 | 47.0 |
| 182 | 9.0 | 31.3875 | 9.0 |
| 876 | 20.0 | 9.8458 | 20.0 |


In [21]:
X_test.head()

| | Age | Fare | Age_imputed |
|---|---|---|---|
| 707 | 42.0 | 26.2875 | 42.0 |
| 37 | 21.0 | 8.05 | 21.0 |
| 615 | 24.0 | 65.0 | 24.0 |
| 169 | 28.0 | 56.4958 | 28.0 |
| 68 | 17.0 | 7.925 | 17.0 |


In [None]:
X_train["Age_imputed"]
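
At this point `Age_imputed` is still a plain copy of `Age`, gaps included. One way the fill step could look is sketched below. It is self-contained, so it uses small stand-in frames with the notebook's column names instead of the real split; the values, the `donors` name, and the seeds are assumptions for illustration. Drawing the test set's replacements from the training ages is a common choice that keeps test data out of the imputation.

```python
import numpy as np
import pandas as pd

# Stand-in frames with the notebook's column names; the values are illustrative only.
X_train = pd.DataFrame({
    "Age": [40.0, 4.0, np.nan, 9.0, 20.0, np.nan],
    "Fare": [27.72, 16.70, 9.00, 31.39, 9.85, 7.75],
})
X_test = pd.DataFrame({
    "Age": [42.0, np.nan, 24.0],
    "Fare": [26.29, 8.05, 65.00],
})
X_train["Age_imputed"] = X_train["Age"]
X_test["Age_imputed"] = X_test["Age"]

# Donor pool: observed ages from the training set only.
donors = X_train["Age"].dropna()

for frame, seed in ((X_train, 0), (X_test, 1)):
    missing = frame["Age_imputed"].isna()
    draws = donors.sample(n=int(missing.sum()), replace=True, random_state=seed)
    # to_numpy() avoids index alignment between the sampled rows and the gaps.
    frame.loc[missing, "Age_imputed"] = draws.to_numpy()

print(X_train)
print(X_test)
```

The original `Age` column keeps its missing values, so the distributions of `Age` and `Age_imputed` can still be compared afterwards.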