# Data imputation

Data imputation is a statistical technique used to fill in missing or incomplete data points within a dataset. In real-world datasets, missing data is a common occurrence due to various reasons such as human error, equipment malfunction, or incomplete responses in surveys.

Data imputation methods involve estimating the missing values based on the information available in the dataset.

## Mean/Median/Mode Imputation

Use Case: Imagine you have a dataset containing information about customer ages, but some age values are missing. Mean, median, or mode imputation can be used to fill in these missing values.


Mean, median, or mode imputation can be useful in certain situations, particularly when dealing with missing data in numerical or categorical variables. Here's when you might consider using each method:

### Mean Imputation:



When the data is roughly symmetric: Mean imputation is most defensible when the variable is approximately symmetric (for example, close to normally distributed), so the mean sits near the center of the data. It preserves the mean of the observed values, but because every gap receives the same value it does not preserve the shape or spread of the distribution.

When the values are missing completely at random (MCAR): Under MCAR the observed values are a random sample of all values, so their mean is an unbiased estimate of the overall mean, and mean imputation leaves that estimate unchanged.

### Median Imputation:



When the data is skewed or contains outliers: Median imputation is more robust than mean imputation to outliers and skewed distributions because it's less affected by extreme values.

When the distribution is not normal: If the data is not normally distributed, using the median can be a better representation of the central tendency than the mean.
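A small sketch of the difference, using made-up ages (the values and variable names are illustrative assumptions, not taken from a real dataset): a single extreme entry pulls the mean fill value far from a typical age, while the median fill value barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one data-entry outlier (400) and one missing value
ages = pd.Series([20, 22, 25, 27, 30, 400, np.nan])

mean_fill = ages.mean()      # NaN is skipped; the outlier drags the mean upward
median_fill = ages.median()  # NaN is skipped; the outlier barely matters

print(f"Mean fill value:   {mean_fill:.2f}")
print(f"Median fill value: {median_fill:.2f}")

# Fill the gap with each candidate value
mean_imputed = ages.fillna(mean_fill)
median_imputed = ages.fillna(median_fill)
```

Here the median fill value falls between the two middle observed ages, whereas the mean fill value is larger than every observed age except the outlier itself.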

### Mode Imputation:



When dealing with categorical variables: Mode imputation is appropriate for categorical variables where the data is represented by categories rather than numerical values.

When the data has frequent values: Mode imputation can be useful when there are frequent or dominant values within a categorical variable.
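When a table mixes numerical and categorical columns, the choice can be made per column. The following is a minimal sketch under that assumption; the column names and values are hypothetical, and using the median for numeric columns is just one reasonable default.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with one numeric and one categorical column
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29],
    "color": ["red", "blue", np.nan, "green", "blue"],
})

filled = df.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Numeric column: fill with the median of the observed values
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Categorical column: fill with the most frequent observed value.
        # mode() can return several values on a tie; iloc[0] takes the first.
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled)
```

Note that when several categories are tied for most frequent, taking the first one is an arbitrary choice, which is one reason mode imputation deserves a second look on columns without a clearly dominant value.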

### Limitations of Mean, Median, or Mode Imputation

**Loss of variability**: Imputing missing values with the mean, median, or mode reduces the variability of the dataset, because every imputed entry sits at the same central value. Analyses that treat the imputed values as real observations will therefore underestimate standard errors.

**Distortion of relationships**: Imputation with central tendency measures ignores the other variables in the row, so it tends to weaken correlations between variables. The distortion is worse if missingness is related to the values of other variables.

**Sensitivity to distribution shape**: The mean is a poor stand-in for a typical value when the data is skewed or contains outliers, so mean imputation is hard to justify for such variables.

**Handling of categorical data**: Mean and median imputation are not suitable for categorical variables since they only work with numerical values. Mode imputation can be used for categorical data but may not capture the full complexity of the variable.

In summary, mean, median, or mode imputation can be quick and simple methods for handling missing data, but they should be used judiciously, considering the characteristics of the data and the potential impact on subsequent analyses. It's always important to assess the appropriateness of imputation methods based on the specific context of the dataset.
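The loss of variability and the weakening of relationships can be checked numerically. The sketch below uses synthetic data (the distribution parameters, the 30% missing rate, and the seed are arbitrary assumptions) and compares the observed values with the mean-imputed column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: y depends linearly on x, plus noise
n = 1000
x = pd.Series(rng.normal(50, 10, n))
y = 2 * x + pd.Series(rng.normal(0, 10, n))

# Remove roughly 30% of x completely at random (MCAR)
x_observed = x.copy()
x_observed[rng.random(n) < 0.3] = np.nan

# Mean imputation
x_imputed = x_observed.fillna(x_observed.mean())

print("Mean, observed vs imputed:", x_observed.mean(), x_imputed.mean())
print("Std,  observed vs imputed:", x_observed.std(), x_imputed.std())
# Series.corr() ignores pairs where either value is missing
print("Corr with y, observed vs imputed:", x_observed.corr(y), x_imputed.corr(y))
```

The mean is unchanged by construction, while the standard deviation and the correlation with `y` both shrink, because the imputed entries add rows without adding any spread.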

### Example

```python
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [5, np.nan, 7, 8, 9],
        'C': [np.nan, 12, 13, 14, 15]}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Mean imputation
mean_imputed_df = df.fillna(df.mean())
print("\nMean-imputed DataFrame:")
print(mean_imputed_df)

# Median imputation
median_imputed_df = df.fillna(df.median())
print("\nMedian-imputed DataFrame:")
print(median_imputed_df)

# Mode imputation (for categorical variables)
# Let's create a sample DataFrame with a categorical variable
data = {'A': ['red', 'blue', np.nan, 'green', 'blue']}
df_categorical = pd.DataFrame(data)

print("\nOriginal DataFrame with categorical variable:")
print(df_categorical)

# Mode imputation
mode_imputed_df = df_categorical.fillna(df_categorical.mode().iloc[0])
print("\nMode-imputed DataFrame:")
print(mode_imputed_df)
```


```text
Original DataFrame:
     A    B     C
0  1.0  5.0   NaN
1  2.0  NaN  12.0
2  NaN  7.0  13.0
3  4.0  8.0  14.0
4  5.0  9.0  15.0

Mean-imputed DataFrame:
     A     B     C
0  1.0  5.00  13.5
1  2.0  7.25  12.0
2  3.0  7.00  13.0
3  4.0  8.00  14.0
4  5.0  9.00  15.0

Median-imputed DataFrame:
     A    B     C
0  1.0  5.0  13.5
1  2.0  7.5  12.0
2  3.0  7.0  13.0
3  4.0  8.0  14.0
4  5.0  9.0  15.0

Original DataFrame with categorical variable:
       A
0    red
1   blue
2    NaN
3  green
4   blue

Mode-imputed DataFrame:
       A
0    red
1   blue
2   blue
3  green
4   blue
```


## Last Observation Carried Forward (LOCF)

Last Observation Carried Forward (LOCF) is a method of imputing missing values by carrying forward the last observed value to fill in the missing data points. It is commonly used in longitudinal or time-series data where observations are made at regular intervals.



### When to Use LOCF

**Longitudinal Data**: LOCF is particularly useful in longitudinal studies where measurements are taken over time. In such studies, it's often reasonable to assume that the last observed value is a good approximation of the missing value, especially if the data is relatively stable over time.

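**Clinical Trials**: LOCF has frequently been employed in clinical trials and medical research where subjects drop out or miss follow-up visits. It is often described as a conservative approach, on the expectation that freezing a subject's last value will not overstate a treatment effect. This is not guaranteed: depending on the trajectory of the outcome and the dropout pattern, LOCF can bias the estimated effect in either direction.

**Situations with Dropout**: When subjects drop out partway through a study, LOCF keeps their observed history in the analysis instead of discarding them entirely. It does not correct for informative dropout, where missingness is related to the outcome variable or other covariates, so results should be interpreted with that in mind.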
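In longitudinal data with several subjects stacked in one table, the carry-forward should presumably happen within each subject, not across subject boundaries. A minimal sketch, assuming a long-format table sorted by visit within subject (the column names `subject`, `visit`, and `score` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format study data: two subjects, four visits each
df = pd.DataFrame({
    "subject": ["A"] * 4 + ["B"] * 4,
    "visit": [1, 2, 3, 4] * 2,
    "score": [10, np.nan, 12, np.nan, np.nan, 20, np.nan, 22],
})

# Naive LOCF ignores subject boundaries:
# subject B's first visit inherits subject A's last value
df["locf_naive"] = df["score"].ffill()

# LOCF within each subject: B's first visit has nothing
# to carry forward, so it stays NaN
df["locf_by_subject"] = df.groupby("subject")["score"].ffill()

print(df)
```

A value that is still missing after the grouped fill (here, subject B's first visit) has no earlier observation to carry forward and needs a separate decision.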

### Limitations of LOCF

**Assumption of Continuity**: LOCF assumes that the last observed value remains valid and representative of the missing values until the next observation. However, this assumption may not hold true in all cases, especially if there are significant changes or fluctuations in the data between observations.

**Potential Bias**: LOCF can introduce bias, particularly in studies with informative dropout patterns. If the reason for missingness is related to the outcome variable or other factors being studied, carrying forward the last observation may artificially inflate or deflate the observed trends.

**Underestimation of Variability**: LOCF tends to underestimate the variability in the data since it essentially duplicates the last observed value for all missing points. This can lead to underestimation of standard errors and potentially affect the precision of statistical estimates.

**Misrepresentation of Data**: LOCF may not accurately represent the true underlying trends in the data, especially if there are systematic changes or trends between observations. Imputing missing values with the last observed value may obscure important temporal patterns or variations in the data.
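The effect on trending data can be seen in a toy case. The sketch below assumes a perfectly linear series (an idealized, made-up example) with three consecutive readings removed, then compares the LOCF result with the known true values.

```python
import numpy as np
import pandas as pd

# Idealized series that increases by 1 at every time step
truth = pd.Series(np.arange(8, dtype=float))

# Remove three consecutive readings in the middle
observed = truth.copy()
observed.iloc[3:6] = np.nan

# LOCF freezes the series at the last observed value
locf = observed.ffill()
error = truth - locf

print(pd.DataFrame({"truth": truth, "observed": observed,
                    "locf": locf, "error": error}))
print("Std of truth:", truth.std(), "| Std after LOCF:", locf.std())
```

The error grows with each step away from the last observation, and the imputed series has a flat stretch that the true series never had.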

In summary, while LOCF can be a convenient and conservative approach to handling missing data, especially in longitudinal studies with minimal missingness, it's essential to be aware of its limitations and potential biases. Researchers should carefully consider the appropriateness of LOCF in the context of their data and research questions, and explore alternative imputation methods if necessary.

### Example

```python
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, np.nan, 3, np.nan, 5],
        'B': [np.nan, 7, np.nan, 9, np.nan],
        'C': [11, 12, np.nan, np.nan, 15]}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Perform LOCF imputation with ffill()
# (fillna(method='ffill') is deprecated in recent pandas releases)
locf_imputed_df = df.ffill()
print("\nLOCF-imputed DataFrame:")
print(locf_imputed_df)
```


```text
Original DataFrame:
     A    B     C
0  1.0  NaN  11.0
1  NaN  7.0  12.0
2  3.0  NaN   NaN
3  NaN  9.0   NaN
4  5.0  NaN  15.0

LOCF-imputed DataFrame:
     A    B     C
0  1.0  NaN  11.0
1  1.0  7.0  12.0
2  3.0  7.0  12.0
3  3.0  9.0  12.0
4  5.0  9.0  15.0
```




## K-nearest neighbors (KNN) imputation:

K-nearest neighbors (KNN) imputation is a non-parametric method used to impute missing values in a dataset. It estimates missing values based on the values of their nearest neighbors in the feature space. 

### How it works:

Finding Nearest Neighbors: For each missing value, the algorithm identifies the k nearest neighbors, that is, the rows most similar on the features both rows have observed, among rows where the missing feature itself is observed.

Imputation: The missing value is then imputed based on the values of its nearest neighbors. This can be done by taking the mean, median, or mode of the values of the nearest neighbors.
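The two steps above can be sketched by hand for a single missing cell. This is an illustrative sketch, not a library implementation: the helper names `nan_euclidean` and `knn_impute_cell` are hypothetical, and the distance is written to follow the NaN-aware Euclidean distance that scikit-learn documents as the default metric for `KNNImputer`. It assumes every pair of rows shares at least one observed feature.

```python
import numpy as np

X = np.array([
    [1.0, 5.0, np.nan],
    [2.0, np.nan, 12.0],
    [np.nan, 7.0, 13.0],
    [4.0, 8.0, 14.0],
    [5.0, 9.0, 15.0],
])

def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows, rescaled by
    (total features / shared features) to compensate for the gaps."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt(a.size / shared.sum() * np.sum((a[shared] - b[shared]) ** 2))

def knn_impute_cell(X, row, col, k):
    # Step 1: candidates are other rows where the target column is observed
    candidates = [i for i in range(len(X))
                  if i != row and not np.isnan(X[i, col])]
    # Step 2: keep the k candidates closest to the target row
    nearest = sorted(candidates, key=lambda i: nan_euclidean(X[row], X[i]))[:k]
    # Step 3: impute with the mean of the neighbors' values in that column
    return nearest, X[nearest, col].mean()

nearest, value = knn_impute_cell(X, row=0, col=2, k=2)
print("Nearest rows:", nearest)   # rows 1 and 2
print("Imputed value:", value)    # mean of 12.0 and 13.0 -> 12.5
```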

### When to Use KNN Imputation:

**Complex Relationships**: KNN imputation is particularly useful when the relationships between variables are complex or nonlinear. Unlike simpler imputation methods like mean or median imputation, KNN can capture more intricate patterns in the data.

**Missingness Mechanism**: KNN imputation can be effective when the data is not missing completely at random (that is, not MCAR) and missingness is related to other observed variables. By leveraging information from similar data points, KNN can provide more accurate imputations in such cases than a single global value.

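**Moderate Dimensionality**: KNN imputation uses all available features to judge similarity, which helps when several variables jointly carry information about the missing one. It is not robust to the curse of dimensionality, however: as the number of features grows, distances between points become less informative, so it tends to work best with a moderate number of relevant features.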

**Small to Medium-Sized Datasets**: KNN imputation can be computationally expensive for large datasets due to the need to calculate distances between data points. It's more suitable for smaller to medium-sized datasets where computational resources are less constrained.

### Limitations of KNN Imputation

**Computationally Intensive**: Calculating distances between data points can be computationally intensive, especially for large datasets or datasets with high dimensionality. This can make KNN imputation impractical for some applications.

**Sensitive to Choice of k**: The performance of KNN imputation can be sensitive to the choice of the number of neighbors (k). Choosing an inappropriate value for k can lead to biased or inaccurate imputations.

**Need for Preprocessing**: KNN imputation relies on the notion of distance between data points, so it's important to preprocess the data and scale the features appropriately to ensure that all variables contribute equally to the distance calculations.

**Potential for Bias**: KNN imputation assumes that similar data points have similar missing values. However, this may not always hold true, especially in heterogeneous datasets where relationships between variables vary across different subgroups.
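The preprocessing point can be sketched as follows: standardize the features, impute in the scaled space, then map the result back to the original units. This is one possible arrangement, not the only one; the column names and values are hypothetical, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 45, 50],
    "income": [30000, 42000, 50000, np.nan, 90000, 95000],
    "tenure_years": [1, 3, 4, 8, np.nan, 12],
})

# Standardize each column; StandardScaler ignores NaNs when fitting
# and leaves them as NaN in the transformed output
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Impute in the scaled space so no single column dominates the distance
imputer = KNNImputer(n_neighbors=2)
scaled_imputed = imputer.fit_transform(scaled)

# Map the result back to the original units
imputed_df = pd.DataFrame(scaler.inverse_transform(scaled_imputed),
                          columns=df.columns)
print(imputed_df)
```

Without the scaling step, differences in `income` would be thousands of times larger than differences in `age` or `tenure_years`, so the neighbors would be chosen almost entirely by income.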

In summary, KNN imputation can be a powerful tool for handling missing data, especially in cases where relationships between variables are complex and missingness is structured. However, researchers should be mindful of its computational requirements, sensitivity to parameter choices, and potential limitations when applying it to their datasets.


### Example

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [5, np.nan, 7, 8, 9],
        'C': [np.nan, 12, 13, 14, 15]}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Initialize KNNImputer with k=2
imputer = KNNImputer(n_neighbors=2)

# Perform KNN imputation
knn_imputed = imputer.fit_transform(df)

# Convert the imputed array back to a DataFrame
knn_imputed_df = pd.DataFrame(knn_imputed, columns=df.columns)

print("\nKNN-imputed DataFrame:")
print(knn_imputed_df)
```


```text
Original DataFrame:
     A    B     C
0  1.0  5.0   NaN
1  2.0  NaN  12.0
2  NaN  7.0  13.0
3  4.0  8.0  14.0
4  5.0  9.0  15.0

KNN-imputed DataFrame:
     A    B     C
0  1.0  5.0  12.5
1  2.0  6.0  12.0
2  3.0  7.0  13.0
3  4.0  8.0  14.0
4  5.0  9.0  15.0
```


## Is Data Imputation Always Necessary?

No, it's not always necessary to perform data imputation. Whether or not data imputation is necessary depends on various factors including the nature of the missing data, the goals of the analysis, and the potential impact of missing data on the validity and interpretability of the results. Here are some scenarios where data imputation may not be necessary:

**Minimal Missingness**: If the dataset has very few missing values and the missingness is negligible in relation to the overall dataset, imputation may not be necessary. In such cases, it may be feasible to simply exclude observations with missing values from the analysis without significantly compromising the validity of the results.

**Missing Completely at Random (MCAR)**: If the missing data can be assumed to be missing completely at random (MCAR), meaning that the probability of missingness is unrelated to both the observed and the unobserved data, then analyzing only the complete rows gives unbiased estimates, at the cost of a smaller sample, and imputation is not required to avoid bias. MCAR is a strong assumption, however, so it's essential to assess its plausibility, for example with Little's MCAR test or by comparing rows with and without missing values on the observed variables.

**Complete Case Analysis**: In some cases, complete case analysis, where only observations with complete data are included in the analysis, may be sufficient and appropriate. This approach avoids the need for imputation but may result in a reduction in sample size and statistical power.

**Sensitivity Analyses**: In situations where imputation may introduce bias or uncertainty, such as causal inference studies or high-dimensional data, researchers may prefer to run sensitivity analyses that compare results across different missing-data strategies (including no imputation) instead of committing to a single imputed dataset.

**Exploratory Data Analysis**: During exploratory data analysis or preliminary investigations, researchers may choose to explore the patterns and characteristics of missing data before deciding whether imputation is necessary. Understanding the nature and mechanisms of missingness can inform the choice of appropriate imputation methods or alternative strategies for handling missing data.
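A first look at the extent of missingness might resemble the sketch below, which reuses the small sample table from the earlier examples. It reports missing values per column and shows what complete case analysis would keep.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, np.nan, 7, 8, 9],
                   'C': [np.nan, 12, 13, 14, 15]})

# Count and fraction of missing values in each column
missing_per_column = df.isna().sum()
missing_fraction = df.isna().mean()

# Number of rows with at least one missing value
rows_with_missing = df.isna().any(axis=1).sum()

# Complete case analysis keeps only the fully observed rows
complete_cases = df.dropna()

print("Missing per column:\n", missing_per_column)
print("Fraction missing per column:\n", missing_fraction)
print("Rows with any missing value:", rows_with_missing, "of", len(df))
print("Complete cases:\n", complete_cases)
```

In this table each column is missing only one value in five, yet the gaps fall in different rows, so complete case analysis would discard three of the five rows. Per-column missingness alone can understate how much data listwise deletion removes.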

In summary, while data imputation can be a useful technique for handling missing data, it's not always necessary or appropriate. Researchers should carefully consider the characteristics of the dataset, the assumptions underlying the missing data mechanism, and the implications of missing data on the validity and interpretation of their results before deciding whether to perform imputation.