In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, accuracy_score

# Load the dataset
data = pd.read_csv("FastFoodNutritionMenuV2.csv")

# Clean column names to remove special characters and spaces
data.columns = data.columns.str.replace(r"[^\w\s]", "", regex=True).str.replace(r"\s+", "_", regex=True)

# Helper function to evaluate imputation for continuous data
def evaluate_continuous_imputation(original, imputed):
    # Keep only rows where the original value is known. Rows that were never masked
    # are included too, so they contribute zero error to the average.
    mask = original.notna()
    original_clean = original[mask]
    imputed_clean = imputed[mask]
    
    # Calculate the Mean Absolute Error (MAE) between original and imputed values
    mae = mean_absolute_error(original_clean, imputed_clean)
    print(f"Mean Absolute Error: {mae:.4f}")
    return mae

# Helper function to evaluate imputation for categorical data
def evaluate_categorical_imputation(original, imputed):
    # Keep only rows where the original label is known. Rows that were never masked
    # are included too, so they always count as correct.
    mask = original.notna()
    original_clean = original[mask]
    imputed_clean = imputed[mask]
    
    # Calculate the accuracy for categorical imputation
    accuracy = accuracy_score(original_clean, imputed_clean)
    print(f"Accuracy: {accuracy:.4f}")
    return accuracy

### Test 1: Univariate Imputation (Median) ###
# Select the attribute and simulate MCAR by introducing random missing values
data["Calories"] = pd.to_numeric(data["Calories"], errors="coerce")
original_calories = data["Calories"].copy()
data.loc[data.sample(frac=0.2, random_state=42).index, "Calories"] = np.nan

# Impute missing values using the median
calories_median = data["Calories"].median()
data["Calories"] = data["Calories"].fillna(calories_median)

# Evaluate the imputation
print(f"Median used for imputation: {calories_median}")
evaluate_continuous_imputation(original_calories, data["Calories"])

### Test 2: Bivariate Imputation (Regression) ###
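# Select the attribute and mask 20% of rows. Note: rows are drawn uniformly at random,
# so this is effectively MCAR rather than MAR. Reusing random_state=42 also selects the
# same rows as Test 1, whose Calories values are now the imputed median.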
data["Total_Fat_g"] = pd.to_numeric(data["Total_Fat_g"], errors="coerce")
original_fat = data["Total_Fat_g"].copy()
data.loc[data.sample(frac=0.2, random_state=42).index, "Total_Fat_g"] = np.nan

# Ensure predictor ("Calories") has no missing values
data["Calories"] = data["Calories"].fillna(calories_median)

# Create the dataset for regression
train_data = data.dropna(subset=["Total_Fat_g"])
X = train_data[["Calories"]]  # Predictor variable
y = train_data["Total_Fat_g"]  # Target variable

# Train a regression model
regressor = LinearRegression()
regressor.fit(X, y)

# Predict missing values
missing_data = data[data["Total_Fat_g"].isna()]
data.loc[data["Total_Fat_g"].isna(), "Total_Fat_g"] = regressor.predict(missing_data[["Calories"]])

# Evaluate the imputation
print("Regression imputation completed.")
evaluate_continuous_imputation(original_fat, data["Total_Fat_g"])

### Test 3: Multivariate Imputation (KNN) ###
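# Select the attribute and mask 20% of rows. Note: sampling is uniform and reuses
# random_state=42, so the same rows as in Tests 1 and 2 are masked. This is effectively
# MCAR; MNAR would require missingness to depend on the sodium value itself.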
data["Sodium_mg"] = pd.to_numeric(data["Sodium_mg"], errors="coerce")
original_sodium = data["Sodium_mg"].copy()
data.loc[data.sample(frac=0.2, random_state=42).index, "Sodium_mg"] = np.nan

# Perform KNN imputation
imputer = KNNImputer(n_neighbors=5)
data_imputed = pd.DataFrame(
    imputer.fit_transform(data[["Calories", "Sodium_mg"]]),
    columns=["Calories", "Sodium_mg"],
    index=data.index,  # keep row alignment explicit when assigning back
)

# Replace imputed sodium values in the original data
data["Sodium_mg"] = data_imputed["Sodium_mg"]

# Evaluate the imputation
print("Similarity-based imputation completed.")
evaluate_continuous_imputation(original_sodium, data["Sodium_mg"])


Median used for imputation: 240.0
Mean Absolute Error: 37.2109
Regression imputation completed.
Mean Absolute Error: 2.2496
Similarity-based imputation completed.
Mean Absolute Error: 78.4231


78.42314487632508

# Imputation Experiment

In this experiment, we run three imputation tests on a dataset (`FastFoodNutritionMenuV2.csv`). Each test introduces missing values into one attribute and then applies an imputation method to fill them in. The methods tested are:

1. **Univariate Imputation (Median)**
2. **Bivariate Imputation (Regression)**
3. **Multivariate Imputation (KNN)**

Each test uses a different attribute and a different method to evaluate how well the imputation techniques handle missing data.

## Test 1: Univariate Imputation (Median)
- **Attribute**: `Calories`
- **Missingness Type**: **MCAR (Missing Completely at Random)**
- **Imputation Method**: **Median Imputation**

### Steps:
1. The `Calories` attribute is selected, and 20% of its values are randomly replaced with missing values, simulating **MCAR** (the missingness is random and not dependent on any other variable).
2. Missing values are imputed using the **median** of the `Calories` attribute.
3. The imputation is evaluated using **Mean Absolute Error (MAE)**, comparing the imputed data to the original data.

### Purpose:
Evaluate how well the **median imputation** method works for continuous attributes when the missingness is completely random.
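
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's own code: the gamma-distributed `calories` series, the seed, and the variable names are assumptions made for the example, and the error is scored only on the rows that were masked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a calorie column (not the real menu data)
calories = pd.Series(rng.gamma(shape=2.0, scale=150.0, size=500), name="Calories")

# MCAR: every row has the same 20% chance of being masked, independent of any value
mcar_mask = pd.Series(rng.random(len(calories)) < 0.2, index=calories.index)
observed = calories.mask(mcar_mask)

# Impute with the median of the values that remain observed
median_value = observed.median()
imputed = observed.fillna(median_value)

# Score only the rows that were actually masked
mae_masked = (calories[mcar_mask] - imputed[mcar_mask]).abs().mean()
print(f"Median: {median_value:.1f}, MAE on masked rows: {mae_masked:.1f}")
```

Scoring only the masked rows keeps the untouched rows, which trivially have zero error, from pulling the average down.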

---

## Test 2: Bivariate Imputation (Regression)
- **Attribute**: `Total_Fat_g`
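- **Missingness Type**: **MAR (Missing at Random)** (intended; the code as written produces MCAR, see step 1)
- **Imputation Method**: **Regression Imputation**

### Steps:
1. The `Total_Fat_g` attribute is selected, and 20% of its values are replaced with missing values. The test is framed as **MAR**, where the probability of missingness depends on another observed variable such as `Calories`. Note that the code selects rows with `data.sample(frac=0.2, random_state=42)`, which samples uniformly at random, so the missingness actually produced is MCAR.
2. A **Linear Regression** model is trained using `Calories` as the predictor and `Total_Fat_g` as the target variable.
3. The trained regression model predicts the missing values for `Total_Fat_g`. Because the same `random_state` is reused, the masked rows are the same rows masked in Test 1, so their `Calories` predictor is the imputed median and the model returns the same prediction for each of them.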
4. The imputation performance is evaluated using **Mean Absolute Error (MAE)**.

### Purpose:
Evaluate the performance of **regression imputation** for attributes where missingness depends on other observed variables (i.e., **MAR**).
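
One way to make the missingness genuinely MAR is to let the masking probability depend on the observed `Calories` value. The sketch below does this on synthetic data; the linear fat-versus-calories relationship, the probability formula, and the seed are all assumptions chosen for illustration, not properties of the menu dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in data: fat roughly proportional to calories plus noise
n = 500
df = pd.DataFrame({"Calories": rng.uniform(50, 1200, size=n)})
df["Total_Fat_g"] = 0.05 * df["Calories"] + rng.normal(0, 3, size=n)
original_fat = df["Total_Fat_g"].copy()

# MAR: the chance of a missing fat value rises with the observed Calories value
# (about 5% for the lowest-calorie rows, about 35% for the highest, 20% on average)
p_missing = 0.05 + 0.30 * df["Calories"].rank(pct=True).to_numpy()
mar_mask = rng.random(n) < p_missing
df.loc[mar_mask, "Total_Fat_g"] = np.nan

# Fit on complete rows, then predict the masked rows from Calories
train = df.dropna(subset=["Total_Fat_g"])
model = LinearRegression().fit(train[["Calories"]], train["Total_Fat_g"])
df.loc[mar_mask, "Total_Fat_g"] = model.predict(df.loc[mar_mask, ["Calories"]])

# Score only the rows that were masked
mae_masked = (original_fat[mar_mask] - df.loc[mar_mask, "Total_Fat_g"]).abs().mean()
print(f"MAE on masked rows: {mae_masked:.2f}")
```

Here the predictor column is left intact, so each masked row gets a prediction based on its own `Calories` value.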

---

## Test 3: Multivariate Imputation (KNN)
- **Attribute**: `Sodium_mg`
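- **Missingness Type**: **MNAR (Missing Not at Random)** (intended; the code as written produces MCAR, see step 1)
- **Imputation Method**: **KNN (K-Nearest Neighbors)**

### Steps:
1. The `Sodium_mg` attribute is selected, and 20% of its values are replaced with missing values. The test is framed as **MNAR**, where the probability of missingness depends on the unobserved value itself (here, the sodium content). Note that the code selects rows with `data.sample(frac=0.2, random_state=42)`, which samples uniformly at random, so the missingness actually produced is MCAR, and the masked rows are the same ones masked in Tests 1 and 2.
2. **KNN Imputation** is performed with `KNNImputer(n_neighbors=5)` fitted on the `Calories` and `Sodium_mg` columns, so that `Calories` determines the neighbors used to impute the missing `Sodium_mg` values.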
3. The imputation performance is evaluated using **Mean Absolute Error (MAE)**.

### Purpose:
Evaluate the performance of **KNN imputation** when missingness depends on the unobserved value itself (i.e., **MNAR**).
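
A minimal sketch of MNAR masking followed by KNN imputation is shown below. It assumes synthetic data in which sodium rises with calories, and it makes the masking probability depend on the sodium value itself; the data-generating formula, the probability formula, and the seed are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)

# Synthetic stand-in data: sodium loosely tied to calories
n = 500
df = pd.DataFrame({"Calories": rng.uniform(50, 1200, size=n)})
df["Sodium_mg"] = 1.5 * df["Calories"] + rng.normal(0, 150, size=n)
original_sodium = df["Sodium_mg"].copy()

# MNAR: the chance of a missing sodium value depends on the sodium value itself
p_missing = 0.05 + 0.30 * df["Sodium_mg"].rank(pct=True).to_numpy()
mnar_mask = rng.random(n) < p_missing
df.loc[mnar_mask, "Sodium_mg"] = np.nan

# KNN imputation: neighbors are found via the columns each pair of rows shares
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(
    imputer.fit_transform(df[["Calories", "Sodium_mg"]]),
    columns=["Calories", "Sodium_mg"],
    index=df.index,
)
df["Sodium_mg"] = imputed["Sodium_mg"]

# Score only the rows that were masked
mae_masked = (original_sodium[mnar_mask] - df.loc[mnar_mask, "Sodium_mg"]).abs().mean()
print(f"MAE on masked rows: {mae_masked:.1f}")
```

Under MNAR the observed rows are a biased sample of the column (high-sodium rows are under-represented here), which is the situation this test is meant to probe.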

---

## Evaluation of Imputation Methods

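The performance of each imputation method is evaluated by comparing the imputed column to the original values. The metric is **Mean Absolute Error (MAE)**, the average absolute difference between the imputed and original values; lower MAE indicates better imputation. As implemented, `evaluate_continuous_imputation` averages over every row whose original value is known, including the roughly 80% of rows that were never masked and therefore contribute zero error, so the reported MAE understates the error on the imputed rows alone.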
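
The evaluation helper in the script selects rows with `original.notna()`, which keeps rows that were never masked. The sketch below, on synthetic data with assumed names and an assumed 20% masking rate, shows how an MAE averaged over all rows relates to an MAE averaged over the masked rows only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic column with no missing values to begin with
original = rng.normal(300, 100, size=1000)

# Mask exactly 20% of the rows
mask = np.zeros(1000, dtype=bool)
mask[rng.choice(1000, size=200, replace=False)] = True

# Median imputation using the unmasked rows
imputed = original.copy()
imputed[mask] = np.median(original[~mask])

mae_all = np.abs(original - imputed).mean()                 # averaged over all 1000 rows
mae_masked = np.abs(original[mask] - imputed[mask]).mean()  # averaged over the 200 masked rows

# Unmasked rows have zero error, so mae_all equals 0.2 * mae_masked
print(f"MAE (all rows): {mae_all:.2f}, MAE (masked rows): {mae_masked:.2f}")
```

With a 20% masking rate, the all-rows figure is one fifth of the masked-rows figure, which is worth keeping in mind when reading the printed MAE values.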

### Output Example:
For each test, the script will output:
1. For Test 1, the median used for imputation; for Tests 2 and 3, a message confirming that the imputation completed.
2. The **MAE** result, which measures the imputation error (lower is better).

---

## Summary of Tests:
- **Test 1**: **Univariate Imputation (Median)** for the `Calories` attribute with **MCAR** missingness.
- **Test 2**: **Bivariate Imputation (Regression)** for the `Total_Fat_g` attribute with intended **MAR** missingness (the code masks rows uniformly at random).
- **Test 3**: **Multivariate Imputation (KNN)** for the `Sodium_mg` attribute with intended **MNAR** missingness (the code masks rows uniformly at random).

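Comparing the MAE across these three tests gives a rough picture of how each imputation technique performs. Each test targets a different attribute with its own units (calories, grams, milligrams), so the MAE values are best read relative to each attribute's scale rather than compared directly.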
