# Imputing Missing Values with Faiss Imputer

This notebook demonstrates how to use the `faiss-imputer` library to impute missing values in a DataFrame with Faiss-based nearest-neighbor search.

## Introduction

Handling missing values is a common challenge in data preprocessing. The `faiss-imputer` library addresses it with Faiss, a high-performance similarity search and clustering library developed by Facebook AI Research (FAIR). Faiss supplies the fast neighbor search that `faiss-imputer` relies on for imputation.

## Faiss-Imputer: A Python Library for Missing Data Imputation

`faiss-imputer` uses Faiss to perform k-nearest-neighbors imputation: each missing value is filled in from the rows most similar to the row that contains it. This is useful for datasets with incomplete records, because it lets data scientists and analysts keep those rows in downstream analyses instead of dropping them.
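To make the idea concrete, here is a minimal NumPy-only sketch of k-nearest-neighbors imputation. It is not the `faiss-imputer` implementation: the function name `knn_impute`, its parameters, and the choice to use only fully observed rows as donors are assumptions made for illustration, and the brute-force distance computation stands in for the neighbor search that Faiss would perform.

```python
import numpy as np

def knn_impute(X, k=3, strategy="median"):
    """Fill NaNs in X from the k nearest fully observed rows (illustrative only)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Donor rows: rows without any missing value (an assumption of this sketch)
    complete = X[~np.isnan(X).any(axis=1)]
    aggregate = np.median if strategy == "median" else np.mean
    for i in np.flatnonzero(np.isnan(X).any(axis=1)):
        missing = np.isnan(X[i])
        # Euclidean distance to each donor, using only the columns observed in row i
        dists = np.linalg.norm(complete[:, ~missing] - X[i, ~missing], axis=1)
        nearest = np.argsort(dists)[:k]
        # Aggregate the donors' values column by column for the missing cells
        filled[i, missing] = aggregate(complete[nearest][:, missing], axis=0)
    return filled

X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [10.0, 20.0, 30.0],
    [1.5, 2.5, 5.0],
])
filled = knn_impute(X, k=2, strategy="mean")
print(filled[1, 2])  # 4.0, the mean of the two nearest donors' values (3.0 and 5.0)
```

The sketch scans every donor row for every incomplete row, which is fine for a toy array but scales poorly. Replacing that scan with an index built for similarity search is where a library such as Faiss comes in.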

## Example Overview

In this notebook, I will walk through an example of using the `faiss-imputer` library to impute missing values in a synthetic dataset. I will generate a DataFrame, mask some of its values, impute them with `faiss-imputer`, and then compare the results with the original data. The aim is to show the basic workflow on a small example.

## Library Source

The `faiss-imputer` library used in this example can be found on GitHub: [FaissImputer Repository](https://github.com/ScionKim/FaissImputer).

---

Feel free to experiment with the example code to see how `faiss-imputer` might fit into your data preprocessing toolkit.


## Imports
Let's start by importing the required libraries.

In [1]:
import numpy as np
import pandas as pd
from faiss_imputer import FaissImputer

## Data Preparation
Next, I'll generate a random DataFrame with missing values and prepare it for imputation.

In [2]:
# Set the random seed for reproducibility
np.random.seed(42)

# Generate a random data frame with 10 rows and 5 columns
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 5)), columns=list('ABCDE'))

# Print the original data frame
print("Original data frame:")
print(df)

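# Introduce missing values: .iloc with two index lists blanks every combination
# of the 3 drawn rows and 3 drawn columns (repeated draws collapse), so the
# number of NaNs is usually not 3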
df_missing = df.copy()
df_missing.iloc[np.random.randint(0, 10, size=3), np.random.randint(0, 5, size=3)] = np.nan

# Print the data frame with missing values
print("Data frame with missing values:")
print(df_missing)

Original data frame:
    A   B   C   D   E
0  51  92  14  71  60
1  20  82  86  74  74
2  87  99  23   2  21
3  52   1  87  29  37
4   1  63  59  20  32
5  75  57  21  88  48
6  90  58  41  91  59
7  79  14  61  61  46
8  61  50  54  63   2
9  50   6  20  72  38
Data frame with missing values:
    A     B   C     D   E
0  51  92.0  14  71.0  60
1  20   NaN  86   NaN  74
2  87  99.0  23   2.0  21
3  52   NaN  87   NaN  37
4   1  63.0  59  20.0  32
5  75  57.0  21  88.0  48
6  90  58.0  41  91.0  59
7  79  14.0  61  61.0  46
8  61   NaN  54   NaN   2
9  50   6.0  20  72.0  38
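The output above contains six NaNs rather than three because `.iloc` with two integer lists selects every row/column combination, and two of the three drawn column positions coincided. If the goal is to mask an exact number of individual cells, one option is to sample flat positions without replacement and use paired NumPy indexing. The following is a sketch of that alternative, not the code that produced the output above; `rng`, `n_missing`, and the seed are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10, 5)), columns=list("ABCDE"))

n_missing = 3
# Sample distinct flat positions, then convert them to (row, column) pairs
flat = rng.choice(df.size, size=n_missing, replace=False)
rows, cols = np.unravel_index(flat, df.shape)

# Work on a float copy so the array can hold NaN and df stays untouched
values = df.to_numpy(dtype=float)
values[rows, cols] = np.nan  # paired indexing: one cell per (row, col) pair
df_missing = pd.DataFrame(values, columns=df.columns, index=df.index)

print(int(df_missing.isna().sum().sum()))  # 3
```

Sampling without replacement guarantees that the masked cells are distinct, so the number of missing values always equals `n_missing`.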


## Imputation with Faiss Imputer
Now, I'll create an instance of `FaissImputer` and use it to impute missing values.

In [3]:
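# Create a FaissImputer with non-default settings: the first argument (5) is
# presumably the number of neighbors, and strategy='median' fills each gap
# with the median of the neighbors' values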
imputer = FaissImputer(5, strategy='median')

# Fit the imputer on the data frame with missing values
imputer.fit(df_missing)

# Transform the data frame with missing values
df_imputed = imputer.transform(df_missing)
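Judging by the printed result in the next section, `transform` returns a plain NumPy array, so the column labels and index of the input are lost. A minimal sketch of restoring them follows. The small frame and the `imputed_array` values are hypothetical stand-ins, so the snippet runs without `faiss-imputer` installed.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (stand-in for the notebook's df_missing)
df_missing = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]},
    index=["r0", "r1", "r2"],
)

# Stand-in for the array an imputer's transform() would return (made-up values)
imputed_array = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 4.5]])

# Reattach the original labels so later steps can select by column name
df_imputed = pd.DataFrame(
    imputed_array, columns=df_missing.columns, index=df_missing.index
)
print(df_imputed)
```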

## Results
Finally, let's compare the imputed data frame with the original data frame to see how well the imputation worked.

In [4]:
# Print the imputed data frame
print("Imputed data frame:")
print(df_imputed)

# Mark cells that match the original exactly with 'O' and all others with 'X'
print("Comparison:")
print(np.where(df_imputed == df, 'O', 'X'))

Imputed data frame:
[[51.  92.  14.  71.  60. ]
 [20.  77.5 86.  45.5 74. ]
 [87.  99.  23.   2.  21. ]
 [52.  63.  87.  71.  37. ]
 [ 1.  63.  59.  20.  32. ]
 [75.  57.  21.  88.  48. ]
 [90.  58.  41.  91.  59. ]
 [79.  14.  61.  61.  46. ]
 [61.  63.  54.  71.   2. ]
 [50.   6.  20.  72.  38. ]]
Comparison:
[['O' 'O' 'O' 'O' 'O']
 ['O' 'X' 'O' 'X' 'O']
 ['O' 'O' 'O' 'O' 'O']
 ['O' 'X' 'O' 'X' 'O']
 ['O' 'O' 'O' 'O' 'O']
 ['O' 'O' 'O' 'O' 'O']
 ['O' 'O' 'O' 'O' 'O']
 ['O' 'O' 'O' 'O' 'O']
 ['O' 'X' 'O' 'X' 'O']
 ['O' 'O' 'O' 'O' 'O']]
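The 'O'/'X' grid only tests for exact equality, so it cannot distinguish a near miss from a large error. A complementary check is the absolute error on the cells that were masked. The sketch below copies the six relevant values from the outputs above (rows 1, 3, and 8; columns B and D) instead of rerunning the imputer, so it runs on its own. The variable names are illustrative.

```python
import numpy as np

# Original values at the masked cells: rows 1, 3, 8 and columns B, D
original = np.array([[82, 74], [1, 29], [50, 63]], dtype=float)
# Imputed values at the same cells, copied from the printed result
imputed = np.array([[77.5, 45.5], [63.0, 71.0], [63.0, 71.0]])

abs_err = np.abs(imputed - original)
mae = float(abs_err.mean())

print(abs_err)
print(round(mae, 2))  # 26.33
```

The errors range from 4.5 to 62 on a 0 to 99 scale. That spread is not surprising here, since the columns are independent random integers and neighboring rows carry no real information about the missing cells.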


## Conclusion
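In this example, I demonstrated how to use the `faiss-imputer` library to impute missing values in a DataFrame with Faiss-backed k-nearest-neighbors imputation. On this small random dataset, none of the six imputed values matched the originals exactly, so it is worth measuring imputation error on held-out cells before relying on the filled-in data.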