# Imputing Missing Values

Imputation is the process of replacing missing data with substituted values to maintain the integrity of the dataset and enable accurate analysis. This technique helps to avoid the loss of valuable information and ensures that the dataset remains usable for statistical analysis and machine learning models.

## Objectives

1. **Understand the Importance of Handling Missing Data**:
   - Explain why missing data can be problematic in data analysis and machine learning.
   - Discuss the potential impact of missing values on model performance and data integrity.

2. **Identify Missing Data**:
   - Demonstrate how to detect missing values in a dataset using various methods.
   - Use visualizations to identify patterns and the extent of missing data.

3. **Explore Different Imputation Techniques**:
   - **Simple Imputation**:
     - Mean, median, and mode imputation.
     - Forward fill and backward fill methods.
   - **Advanced Imputation**:
     - K-Nearest Neighbors (KNN) imputation.
     - Multivariate imputation by chained equations (MICE).
     - Using machine learning models for imputation.

4. **Implement Imputation Techniques**:
   - Provide code examples for each imputation method using popular libraries such as Pandas, Scikit-learn, and fancyimpute.
   - Compare the results of different imputation techniques.

5. **Evaluate the Impact of Imputation**:
   - Assess how different imputation methods affect the dataset.
   - Use statistical measures and visualizations to compare the distributions before and after imputation.

6. **Best Practices and Considerations**:
   - Discuss when to use each imputation method.
   - Highlight potential pitfalls and how to avoid them.
   - Provide guidelines for choosing the appropriate imputation technique based on the dataset and problem context.

7. **Practical Application**:
   - Apply imputation techniques to a real-world dataset.
   - Demonstrate the end-to-end process from identifying missing values to evaluating the impact of imputation on model performance.

8. **Conclusion and Next Steps**:
   - Summarize key takeaways from the notebook.
   - Suggest further reading and advanced topics related to missing data imputation.

## Background

The notebook provides an in-depth tutorial on various methods for imputing missing values in data using Python. 

## Datasets Used

- **Sample DataFrame of Students**: This dataframe consists of artificially created missing values in the 'Age', 'Sex', and 'GPA' fields to illustrate basic imputation techniques.
- **Time Series Data**: A dataset with dates and production values used to demonstrate interpolation methods suitable for time series data.
- **Product Data**: The `ProdA` and `ProdB` columns of the time series data, used to illustrate KNN imputation, which estimates each missing value from similar rows.

## Basic Imputation

In [1]:
import numpy as np
import pandas as pd
pd.set_eng_float_format(accuracy=2, use_eng_prefix=True)

We will impute (estimate) the missing values from the observed data, using constants, neighboring values, or summary statistics such as the mean, median, and mode.

In [2]:
# Create a sample DataFrame
students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3]
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])
df

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,
2,st_102,,M,2.4
3,st_103,,F,
4,st_104,19.0,,3.0
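
Before imputing anything, it helps to see how much is missing and where. The following is a minimal sketch that rebuilds the same sample data and counts the missing entries; the variable names `missing_per_column` and `rows_with_missing` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above
students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3],
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])

# Number of missing entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing entry
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```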


In [3]:
# Fill NaN entries with zero
df.fillna(0)

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,0.0
2,st_102,0.0,M,2.4
3,st_103,0.0,F,0.0
4,st_104,19.0,0,3.0


In [4]:
# fillna returns a new DataFrame; the original df is unchanged
df

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,
2,st_102,,M,2.4
3,st_103,,F,
4,st_104,19.0,,3.0
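
Because `fillna` returns a new object, the filled result is lost unless it is assigned to a name. A small sketch of that pattern, with `df_zero` as an illustrative name:

```python
import numpy as np
import pandas as pd

students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3],
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])

# Keep the filled copy under its own name
df_zero = df.fillna(0)

# Count the missing entries that remain in each object
remaining_in_df = df.isna().sum().sum()
remaining_in_df_zero = df_zero.isna().sum().sum()
print(remaining_in_df, remaining_in_df_zero)
```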


We can replace the missing values with any value.

In [5]:
# Fill NaN entries with 100
df.fillna(100)

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,100.0
2,st_102,100.0,M,2.4
3,st_103,100.0,F,100.0
4,st_104,19.0,100,3.0
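
A single constant rarely suits every column: above, `100` ends up in the `Sex` column. One option is to pass `fillna` a dictionary that maps column names to fill values. The values below are arbitrary placeholders chosen for this sketch, not recommendations.

```python
import numpy as np
import pandas as pd

students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3],
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])

# Hypothetical per-column placeholders (illustration only)
fill_values = {'Age': 0, 'Sex': 'unknown', 'GPA': 0.0}
df_filled = df.fillna(value=fill_values)
print(df_filled)
```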


You can use forward fill (`ffill`) to propagate the previous value forward. Note that the result depends on row order, so sorting the DataFrame first changes what gets filled in!

In [6]:
# Using ffill to fill forward
print('Original DataFrame\n', df)
df.ffill()

Original DataFrame
   studentID    Age  Sex   GPA
0    st_100  17.00    M  3.70
1    st_101  17.00    M   NaN
2    st_102    NaN    M  2.40
3    st_103    NaN    F   NaN
4    st_104  19.00  NaN  3.00


Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,3.7
2,st_102,17.0,M,2.4
3,st_103,17.0,F,2.4
4,st_104,19.0,F,3.0


When `ffill` finds a `NaN` value, it replaces it with the value from the previous row of the same column, so values propagate down from row to row. A `NaN` in the first row has no previous value, so it stays `NaN`.
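
To illustrate how row order affects forward fill, the sketch below fills the same data twice: once in its original order and once after sorting by `GPA`. Sorting by `GPA` is an arbitrary choice made only for this demonstration.

```python
import numpy as np
import pandas as pd

students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3],
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])

# Forward fill in the original row order
filled_original = df.ffill()

# Forward fill after sorting by GPA (NaN rows are placed last by default)
filled_sorted = df.sort_values('GPA').ffill()

# Student st_103 (index 3) receives a different GPA in each case
print(filled_original.loc[3, 'GPA'], filled_sorted.loc[3, 'GPA'])
```

After sorting, `st_102` becomes the first row, so its missing `Age` has nothing above it and stays `NaN`.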


You can specify `axis=1` to propagate the values to the right instead, from column to column within each row.

In [7]:
# Forward fill along each row (axis=1): propagate values from left to right
print('Original DataFrame\n', df)
df.ffill(axis=1)

Original DataFrame
   studentID    Age  Sex   GPA
0    st_100  17.00    M  3.70
1    st_101  17.00    M   NaN
2    st_102    NaN    M  2.40
3    st_103    NaN    F   NaN
4    st_104  19.00  NaN  3.00


Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.00,M,3.70
1,st_101,17.00,M,M
2,st_102,st_102,M,2.40
3,st_103,st_103,F,F
4,st_104,19.00,19.00,3.00


With `axis=1`, `ffill` replaces each `NaN` with the value from the previous column of the same row.

Notice that:
- A `NaN` in the first column, or one preceded only by `NaN` values in its row, stays `NaN`.
- In our case, this behavior does not make sense: we get the value `st_102` in the `Age` column and `F` in the `GPA` column.

You can use backward fill (`bfill`) to propagate the next value backward.

In [8]:
# back-fill to propagate the next values backward
print('Original DataFrame\n', df)
df.bfill()

Original DataFrame
   studentID    Age  Sex   GPA
0    st_100  17.00    M  3.70
1    st_101  17.00    M   NaN
2    st_102    NaN    M  2.40
3    st_103    NaN    F   NaN
4    st_104  19.00  NaN  3.00


Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,2.4
2,st_102,19.0,M,2.4
3,st_103,19.0,F,3.0
4,st_104,19.0,,3.0


In [9]:
# back-fill to propagate the next values to the left
print('Original DataFrame\n', df)
df.bfill(axis=1)

Original DataFrame
   studentID    Age  Sex   GPA
0    st_100  17.00    M  3.70
1    st_101  17.00    M   NaN
2    st_102    NaN    M  2.40
3    st_103    NaN    F   NaN
4    st_104  19.00  NaN  3.00


Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.00,M,3.7
1,st_101,17.00,M,
2,st_102,M,M,2.4
3,st_103,F,F,
4,st_104,19.00,3.00,3.0


Again, this option makes no sense for our data: the `Age` column now contains `M` and `F`!

## Replacing Missing Values with a Central Tendency Measure

There are many techniques for replacing missing values. For instance, we can replace the missing values in a numeric column with a measure of central tendency computed from the observed values.

### Replacing Missing Values with the Mean

Suppose we want to replace the missing values in each column with that column's mean. This only applies to the numeric columns, `Age` and `GPA`.

In [10]:
# Calculating the mean
df[['Age','GPA']].mean()

Age    17.67
GPA     3.03
dtype: float64

We can replace the missing values with the mean in each column. Let's do it!

In [11]:
# filling missing values with mean column values
df_mean = df[['Age','GPA']].fillna(df[['Age','GPA']].mean())
df_mean

Unnamed: 0,Age,GPA
0,17.0,3.7
1,17.0,3.03
2,17.67,2.4
3,17.67,3.03
4,19.0,3.0


In [12]:
# Completing with the other columns
df_mean['studentID'] = df.studentID
df_mean['Sex'] = df.Sex
df_mean

Unnamed: 0,Age,GPA,studentID,Sex
0,17.0,3.7,st_100,M
1,17.0,3.03,st_101,M
2,17.67,2.4,st_102,M
3,17.67,3.03,st_103,F
4,19.0,3.0,st_104,


In [13]:
# Reordering attributes
df_mean = df_mean[['studentID', 'Age', 'Sex', 'GPA']]
df_mean

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,3.03
2,st_102,17.67,M,2.4
3,st_103,17.67,F,3.03
4,st_104,19.0,,3.0
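
The same result can probably be reached in one step. As a sketch, `df.mean(numeric_only=True)` returns the means of `Age` and `GPA` only, and `fillna` matches them to columns by name, so the non-numeric columns and the column order are left as they are.

```python
import numpy as np
import pandas as pd

students = [
    ['st_100', 17, 'M', 3.7],
    ['st_101', 17, 'M', np.nan],
    ['st_102', np.nan, 'M', 2.4],
    ['st_103', np.nan, 'F', np.nan],
    ['st_104', 19, np.nan, 3],
]
df = pd.DataFrame(students, columns=['studentID', 'Age', 'Sex', 'GPA'])

# Column means of the numeric columns only (Age and GPA)
column_means = df.mean(numeric_only=True)

# fillna aligns the means with the columns by name
df_mean = df.fillna(column_means)
print(df_mean)
```

The missing `Sex` value is untouched, because no mean exists for that column.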


### Replacing Missing Values with the Median

Now we want to replace the missing values in the original DataFrame with the median. Let's do it!

In [14]:
# Original DataFrame
df

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,
2,st_102,,M,2.4
3,st_103,,F,
4,st_104,19.0,,3.0


In [15]:
# Calculating the median
df[['Age','GPA']].median()

Age    17.00
GPA     3.00
dtype: float64

In [16]:
# filling missing values with the median by column
df_median = df[['Age','GPA']].fillna(df[['Age','GPA']].median())
df_median

Unnamed: 0,Age,GPA
0,17.0,3.7
1,17.0,3.0
2,17.0,2.4
3,17.0,3.0
4,19.0,3.0


In [17]:
# Completing with the other columns
df_median['studentID'] = df.studentID
df_median['Sex'] = df.Sex
df_median

Unnamed: 0,Age,GPA,studentID,Sex
0,17.0,3.7,st_100,M
1,17.0,3.0,st_101,M
2,17.0,2.4,st_102,M
3,17.0,3.0,st_103,F
4,19.0,3.0,st_104,


In [18]:
# Reordering attributes
df_median = df_median[['studentID', 'Age', 'Sex', 'GPA']]
df_median

Unnamed: 0,studentID,Age,Sex,GPA
0,st_100,17.0,M,3.7
1,st_101,17.0,M,3.0
2,st_102,17.0,M,2.4
3,st_103,17.0,F,3.0
4,st_104,19.0,,3.0


### Replacing Missing Values with the Mode

We want to replace the missing values of `Sex` with the mode, which is `M` in this case.
This example is for illustration only. If a `Sex` value is missing, you should try to recover it from the original data source.

In [19]:
df.Sex.value_counts()

Sex
M    3
F    1
Name: count, dtype: int64

In [20]:
df.Sex.fillna('M')

0    M
1    M
2    M
3    F
4    M
Name: Sex, dtype: object
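
Rather than typing `'M'` by hand, the mode can be computed from the column. A minimal sketch, assuming that taking the first entry is acceptable when several values tie for most frequent:

```python
import numpy as np
import pandas as pd

sex = pd.Series(['M', 'M', 'M', 'F', np.nan], name='Sex')

# mode() ignores NaN and returns a Series, because ties are possible;
# take the first entry as the fill value
most_common = sex.mode()[0]

sex_filled = sex.fillna(most_common)
print(sex_filled)
```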

## Replacing Missing Values with Another Value

Suppose you want to replace the missing values of `GPA` with a value you compute yourself. Let us start by creating a lambda function.

In [21]:
# It computes the mean of the extreme values: (min + max)/2
mean_ext = lambda l: (pd.Series(l).min() + pd.Series(l).max()) / 2

In [22]:
# Calculating the mean of extreme values for df.GPA
mean_ext(df.GPA)

3.05

In [23]:
# replacing NaN values in df.GPA
print('Original GPA column\n', df.GPA)
df.GPA.fillna(mean_ext(df.GPA))

Original GPA column
 0    3.70
1     NaN
2    2.40
3     NaN
4    3.00
Name: GPA, dtype: float64


0    3.70
1    3.05
2    2.40
3    3.05
4    3.00
Name: GPA, dtype: float64

In [24]:
# Using replace function
print('Original GPA column\n', df.GPA)
df.GPA.replace(np.nan, mean_ext(df.GPA))

Original GPA column
 0    3.70
1     NaN
2    2.40
3     NaN
4    3.00
Name: GPA, dtype: float64


0    3.70
1    3.05
2    2.40
3    3.05
4    3.00
Name: GPA, dtype: float64

### Using Interpolation to Replace Missing Values

Filling missing data is an active research area. Another technique is interpolation, which estimates each missing value from its neighboring observations and is especially useful for time series data.

In [25]:
# Creating a sample DataFrame
dates = pd.date_range(start="2022-05-05", periods=10).to_pydatetime().tolist()
prodA = [16, 21, np.nan, 21, 12, np.nan, 12, 20, np.nan, 30]
prodB = [36, 38, np.nan, 42, np.nan, 60, 47, 67, 73, 55]
dfd = pd.DataFrame({'Date': dates, 'ProdA': prodA, 'ProdB': prodB})
dfd

Unnamed: 0,Date,ProdA,ProdB
0,2022-05-05,16.0,36.0
1,2022-05-06,21.0,38.0
2,2022-05-07,,
3,2022-05-08,21.0,42.0
4,2022-05-09,12.0,
5,2022-05-10,,60.0
6,2022-05-11,12.0,47.0
7,2022-05-12,20.0,67.0
8,2022-05-13,,73.0
9,2022-05-14,30.0,55.0


In [26]:
# Counting the number of NaN values
dfd.isnull().sum()

Date     0
ProdA    3
ProdB    2
dtype: int64

In [27]:
# Using interpolate method to fill the missing values
print('Original data\n', dfd.ProdA)
dfd.ProdA.interpolate()

Original data
 0    16.00
1    21.00
2      NaN
3    21.00
4    12.00
5      NaN
6    12.00
7    20.00
8      NaN
9    30.00
Name: ProdA, dtype: float64


0    16.00
1    21.00
2    21.00
3    21.00
4    12.00
5    12.00
6    12.00
7    20.00
8    25.00
9    30.00
Name: ProdA, dtype: float64

The `interpolate()` method replaces the `NaN` values using an interpolation technique. The default is `method='linear'`.

Notice that `ProdA` at index 2 is `NaN` initially and 21 after the interpolation: 21 is the average of the values at indexes 1 and 3. Likewise, the value at index 8 becomes 25, the average of 20 and 30.
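
To make the arithmetic concrete, the sketch below repeats the linear interpolation by hand with `np.interp`, treating the row positions 0 to 9 as equally spaced x-coordinates, and compares it with the pandas result. The names `positions`, `known`, and `manual` are only illustrative.

```python
import numpy as np
import pandas as pd

prodA = pd.Series([16, 21, np.nan, 21, 12, np.nan, 12, 20, np.nan, 30])

# Row positions act as equally spaced x-coordinates
positions = np.arange(len(prodA))
known = prodA.notna().to_numpy()

# Straight-line interpolation between the observed points
manual = np.interp(positions, positions[known], prodA[known].to_numpy())

# The pandas default should give the same values
matches_pandas = np.allclose(manual, prodA.interpolate().to_numpy())
print(manual)
print(matches_pandas)
```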

In [28]:
# Using interpolate linear method to fill the missing values
print('Original data\n', dfd.ProdA)
dfd.ProdA.interpolate(method='linear')

Original data
 0    16.00
1    21.00
2      NaN
3    21.00
4    12.00
5      NaN
6    12.00
7    20.00
8      NaN
9    30.00
Name: ProdA, dtype: float64


0    16.00
1    21.00
2    21.00
3    21.00
4    12.00
5    12.00
6    12.00
7    20.00
8    25.00
9    30.00
Name: ProdA, dtype: float64

Time series data is sorted by date, so neighboring rows are neighboring points in time and interpolation gives sensible estimates.

Interpolation depends on row order: if the rows were arranged differently (which makes no sense for a time series), the interpolated values would change.
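
A quick way to see this dependence on order is to interpolate `ProdA` twice: once with the rows sorted by `Date`, and once after sorting by `ProdB`. The second ordering is deliberately meaningless and is used here only as an illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2022-05-05", periods=10)
prodA = [16, 21, np.nan, 21, 12, np.nan, 12, 20, np.nan, 30]
prodB = [36, 38, np.nan, 42, np.nan, 60, 47, 67, 73, 55]
dfd = pd.DataFrame({'Date': dates, 'ProdA': prodA, 'ProdB': prodB})

# Interpolate with rows in chronological order
by_date = dfd.sort_values('Date')['ProdA'].interpolate()

# Interpolate after an arbitrary reordering, then restore the original index order
reordered = dfd.sort_values('ProdB')['ProdA'].interpolate().sort_index()

comparison = pd.DataFrame({'by_date': by_date, 'by_ProdB': reordered})
print(comparison)
```

At index 5, for example, the chronological estimate lies between 12 and 12, while the reordered estimate lies between 30 and 20.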

In [29]:
# Using interpolate quadratic method to fill the missing values
print('Original data\n', dfd.ProdA)
dfd.ProdA.interpolate(method='polynomial', order=2)

Original data
 0    16.00
1    21.00
2      NaN
3    21.00
4    12.00
5      NaN
6    12.00
7    20.00
8      NaN
9    30.00
Name: ProdA, dtype: float64


0    16.00
1    21.00
2    23.67
3    21.00
4    12.00
5     8.66
6    12.00
7    20.00
8    26.35
9    30.00
Name: ProdA, dtype: float64

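This example uses a different interpolation method: a polynomial of order 2, which pandas computes with SciPy. The estimates are no longer averages of the neighboring values. Notice that the value at index 2 is now `23.67`.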

### Imputing with k Nearest Neighbors

The scikit-learn library provides the `KNNImputer` class for replacing missing values. Each missing value is replaced with the mean of that feature over the `n_neighbors` nearest rows, where the distance between two rows is computed from the features that both rows have observed.

In [30]:
from sklearn.impute import KNNImputer

In [31]:
dfd

Unnamed: 0,Date,ProdA,ProdB
0,2022-05-05,16.0,36.0
1,2022-05-06,21.0,38.0
2,2022-05-07,,
3,2022-05-08,21.0,42.0
4,2022-05-09,12.0,
5,2022-05-10,,60.0
6,2022-05-11,12.0,47.0
7,2022-05-12,20.0,67.0
8,2022-05-13,,73.0
9,2022-05-14,30.0,55.0


Let us work only with the columns with `NaN` values: `ProdA` and `ProdB`.

In [32]:
# Using n_neighbors=1
imputer = KNNImputer(n_neighbors=1)
impute_data = imputer.fit_transform(dfd[['ProdA','ProdB']])
dfd['ProdA_1'] = impute_data[:,0]
dfd['ProdB_1'] = impute_data[:,1]
dfd[['ProdA','ProdB','ProdA_1','ProdB_1']]

Unnamed: 0,ProdA,ProdB,ProdA_1,ProdB_1
0,16.0,36.0,16.0,36.0
1,21.0,38.0,21.0,38.0
2,,,18.86,52.25
3,21.0,42.0,21.0,42.0
4,12.0,,12.0,47.0
5,,60.0,30.0,60.0
6,12.0,47.0,12.0,47.0
7,20.0,67.0,20.0,67.0
8,,73.0,20.0,73.0
9,30.0,55.0,30.0,55.0


- Index 4 has `ProdA` = `12.00` and `ProdB` = `NaN`. With `n_neighbors=1`, the algorithm looks for the row whose `ProdA` is closest to `12.00` among the rows that have a `ProdB` value. The closest is index 6, where `ProdA` is `12.00` and `ProdB` is `47.00`, so the estimate for `ProdB` at index 4 is `47.00`.

- Index 5 has `ProdA` = `NaN` and `ProdB` = `60.00`. The algorithm looks for the row whose `ProdB` is closest to `60.00` among the rows that have a `ProdA` value. The closest is index 9, where `ProdB` is `55.00` and `ProdA` is `30.00`, so the estimate for `ProdA` at index 5 is `30.00`.

- Index 2 has two `NaN` values. This is an extreme case: no distance to any other row can be computed, so each estimate is the mean of the observed values in that column (`18.86` and `52.25`).
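
The reasoning for index 5 can be checked by hand. The sketch below finds the nearest donor row manually and compares it with what `KNNImputer` returns. It assumes that only rows with both values observed can serve as donors for this row, which holds for this dataset; `donors`, `nearest`, and `manual_value` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

prodA = [16, 21, np.nan, 21, 12, np.nan, 12, 20, np.nan, 30]
prodB = [36, 38, np.nan, 42, np.nan, 60, 47, 67, 73, 55]
X = pd.DataFrame({'ProdA': prodA, 'ProdB': prodB})

# Row 5 is missing ProdA; candidate donors have both values observed
donors = X.dropna()

# Donor whose ProdB is closest to the ProdB observed in row 5
nearest = (donors['ProdB'] - X.loc[5, 'ProdB']).abs().idxmin()
manual_value = donors.loc[nearest, 'ProdA']

# Same imputation with scikit-learn
imputed = KNNImputer(n_neighbors=1).fit_transform(X)
print(nearest, manual_value, imputed[5, 0])
```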

In [33]:
# Using n_neighbors=2
imputer = KNNImputer(n_neighbors=2)
impute_data = imputer.fit_transform(dfd[['ProdA','ProdB']])
dfd['ProdA_2'] = impute_data[:,0]
dfd['ProdB_2'] = impute_data[:,1]
dfd[['ProdA','ProdB','ProdA_1','ProdB_1', 'ProdA_2','ProdB_2']]

Unnamed: 0,ProdA,ProdB,ProdA_1,ProdB_1,ProdA_2,ProdB_2
0,16.0,36.0,16.0,36.0,16.0,36.0
1,21.0,38.0,21.0,38.0,21.0,38.0
2,,,18.86,52.25,18.86,52.25
3,21.0,42.0,21.0,42.0,21.0,42.0
4,12.0,,12.0,47.0,12.0,41.5
5,,60.0,30.0,60.0,25.0,60.0
6,12.0,47.0,12.0,47.0,12.0,47.0
7,20.0,67.0,20.0,67.0,20.0,67.0
8,,73.0,20.0,73.0,25.0,73.0
9,30.0,55.0,30.0,55.0,30.0,55.0


With `n_neighbors=2`, each imputed value is the mean of the two nearest neighbors, so most estimates change. At index 2 the estimates stay the same, because they are still the column means.

In [34]:
# Using n_neighbors=3
imputer = KNNImputer(n_neighbors=3)
impute_data = imputer.fit_transform(dfd[['ProdA','ProdB']])
dfd['ProdA_3'] = impute_data[:,0]
dfd['ProdB_3'] = impute_data[:,1]
dfd[['ProdA','ProdB','ProdA_1','ProdB_1', 'ProdA_2','ProdB_2', 'ProdA_3','ProdB_3']]

Unnamed: 0,ProdA,ProdB,ProdA_1,ProdB_1,ProdA_2,ProdB_2,ProdA_3,ProdB_3
0,16.0,36.0,16.0,36.0,16.0,36.0,16.0,36.0
1,21.0,38.0,21.0,38.0,21.0,38.0,21.0,38.0
2,,,18.86,52.25,18.86,52.25,18.86,52.25
3,21.0,42.0,21.0,42.0,21.0,42.0,21.0,42.0
4,12.0,,12.0,47.0,12.0,41.5,12.0,50.0
5,,60.0,30.0,60.0,25.0,60.0,20.67,60.0
6,12.0,47.0,12.0,47.0,12.0,47.0,12.0,47.0
7,20.0,67.0,20.0,67.0,20.0,67.0,20.0,67.0
8,,73.0,20.0,73.0,25.0,73.0,20.67,73.0
9,30.0,55.0,30.0,55.0,30.0,55.0,30.0,55.0


## Conclusions

Key Takeaways:
- Simple imputation techniques, like filling missing values with zeros, are quick but may not be suitable for maintaining the integrity of the data.
- Statistical imputations (mean, median, mode) provide more realistic substitutions but can introduce bias if not aligned with the data's distribution.
- Forward and backward filling are appropriate for ordered data (e.g., time series), where the sequence of data points is meaningful.
- Interpolation offers a sophisticated approach to estimating missing values by considering the trend in data points, which is ideal for sequential data.
- KNN imputation leverages the similarity between data points, providing a more informed and context-aware method for handling missing data, especially in datasets with strong patterns or relationships between variables.
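
One concrete form of the bias mentioned above: mean imputation leaves a column's mean unchanged but shrinks its spread, because every filled value sits exactly at the mean. A small sketch using the `GPA` values from the student example:

```python
import numpy as np
import pandas as pd

gpa = pd.Series([3.7, np.nan, 2.4, np.nan, 3.0], name='GPA')
gpa_mean_filled = gpa.fillna(gpa.mean())

# Mean and standard deviation before and after imputation (NaN ignored)
before = (gpa.mean(), gpa.std())
after = (gpa_mean_filled.mean(), gpa_mean_filled.std())
print('before:', before)
print('after: ', after)
```

The same before-and-after comparison can be repeated for the median, interpolation, or KNN estimates to see how each method changes the distribution.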

## References

- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. chapter 3