# Assignment 2 Data Science - Imputation

In this notebook, we run three imputation tests on the `FastFoodNutritionMenuV2` dataset, which is hosted on GitHub, each using a different imputation technique. For each test, we:

- **(a)** Choose an attribute on which to test our imputation method.
- **(b)** Simulate missing values in that attribute (using a specific missingness mechanism: MCAR, MAR, or MNAR).
- **(c)** Apply an imputation approach to replace the missing values.
- **(d)** Evaluate the performance of the imputation by comparing the imputed values to the original (non-missing) values using Mean Absolute Error (MAE).

Below is a description of each test.

---

## Test 1: Univariate Imputation (Median)

- **Attribute**: `Calories`
- **Missingness Type**: **MCAR (Missing Completely at Random)**
- **Imputation Method**: **Median Imputation**

### Steps:
1. **Simulate Missingness**:  
   20% of the `Calories` values are randomly set to missing (NaN) to simulate MCAR: the values are missing completely at random, with no dependency on the other attributes.
   
2. **Imputation**:  
   Missing values in the `Calories` column are replaced directly by the median of the observed values.
   
3. **Evaluation**:  
   The Mean Absolute Error is computed only on the rows where missingness was simulated (and which originally had valid values). This provides a measure of how close the imputed values are to the original values.
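
The three steps above can be sketched on synthetic data. This is only an illustration, not the notebook's actual cell: the `toy` frame, its size, and the gamma parameters are assumptions chosen to give a right-skewed column that loosely resembles calorie counts.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Synthetic, right-skewed stand-in for a calorie column (not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Calories": rng.gamma(shape=2.0, scale=150.0, size=200).round()})
original = toy["Calories"].copy()

# Step 1 (MCAR): every row has the same chance of being masked.
missing_idx = toy.sample(frac=0.2, random_state=0).index
toy.loc[missing_idx, "Calories"] = np.nan

# Step 2: fill every gap with the median of the values that remain observed.
median_value = toy["Calories"].median()
toy["Calories"] = toy["Calories"].fillna(median_value)

# Step 3: score only the rows that were masked.
mae = mean_absolute_error(original.loc[missing_idx], toy.loc[missing_idx, "Calories"])
print(f"median={median_value:.1f}, MAE={mae:.1f}")
```

Because every gap receives the same constant, the MAE here equals the average distance of the masked original values from the median. It therefore reflects how spread out the column is rather than how well a model fits.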

### Purpose:
This test evaluates the effectiveness of **median imputation** for continuous data when the missingness is completely random. A lower MAE indicates that the median is a good replacement for the missing values.

---

## Test 2: Bivariate Imputation (Regression)

- **Attribute**: `Total_Fat_g`
- **Missingness Type**: **MAR (Missing at Random)**
- **Imputation Method**: **Regression Imputation** 

### Steps:
1. **Simulate Missingness**:  
   Missing values are introduced in `Total_Fat_g` for every row where `Calories` is below its median. This simulates MAR because whether a value is missing depends on another, observed attribute.
2. **Imputation**:  
   A linear regression model is trained with `Calories` as the predictor and `Total_Fat_g` as the target. The model predicts the missing values, and any negative predictions are clipped to 0, since fat content cannot be negative. The predictions then replace the missing values.
   
3. **Evaluation**:  
   MAE is calculated on the subset of rows where missingness was simulated (and that originally had valid `Total_Fat_g` values), allowing us to assess the regression imputation’s performance.
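
A minimal sketch of these steps on synthetic data follows. The linear relationship, its coefficients, and the noise level are made-up assumptions for illustration; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in: fat grows roughly linearly with calories (assumed coefficients).
rng = np.random.default_rng(1)
calories = rng.uniform(0, 1000, size=300)
fat = np.clip(0.05 * calories - 5 + rng.normal(0, 3, size=300), 0, None)
toy = pd.DataFrame({"Calories": calories, "Total_Fat_g": fat})
original_fat = toy["Total_Fat_g"].copy()

# MAR-style mask: missingness depends only on the observed Calories column.
mar_idx = toy[toy["Calories"] < toy["Calories"].median()].index
toy.loc[mar_idx, "Total_Fat_g"] = np.nan

# Fit on the complete rows, predict the masked rows, and clip negatives to 0.
observed = toy.dropna(subset=["Total_Fat_g"])
model = LinearRegression().fit(observed[["Calories"]], observed["Total_Fat_g"])
predicted = model.predict(toy.loc[mar_idx, ["Calories"]])
toy.loc[mar_idx, "Total_Fat_g"] = np.clip(predicted, 0, None)

mae = mean_absolute_error(original_fat.loc[mar_idx], toy.loc[mar_idx, "Total_Fat_g"])
print(f"MAE on masked rows: {mae:.2f}")
```

Note that with this masking rule the model is trained only on rows at or above the median of `Calories` and then applied to rows below it, so the imputed values are extrapolations of the fitted line.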

### Purpose:
This test assesses the performance of regression imputation for attributes with MAR missingness. It helps determine how well the relationship between `Calories` and `Total_Fat_g` can be used to recover missing fat values. As in the other tests, effectiveness is judged by how low the MAE is; as a rough rule of thumb, we treated an MAE below about 10 g as a decent estimate. In our run the MAE was 2.9645 g, which suggests that the linear model captures the relationship reasonably well.

---

## Test 3: Multivariate Imputation (KNN)

- **Attribute**: `Sodium_mg`
- **Missingness Type**: **MNAR (Missing Not at Random)**
- **Imputation Method**: **KNN Imputation** (a similarity-based approach)

### Steps:
1. **Simulate Missingness**:  
   To simulate MNAR, we identify the high-sodium rows (those above the 75th percentile) and randomly remove 30% of these values. This simulates a scenario where higher sodium values might be less likely to be reported.
   
2. **Imputation**:  
   KNN imputation is applied using `Calories` and `Total_Fat_g` as predictor variables. The KNN imputer finds similar observations based on these predictors and uses their values to impute the missing `Sodium_mg` entries. The missing values are replaced directly in the `Sodium_mg` column. 
   
3. **Evaluation**:  
   MAE is computed only for the rows that were set to missing (and originally had valid sodium values), providing a measure of how close the imputed values are to the original measurements.
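
The same procedure can be sketched on synthetic data. The data-generating coefficients below are assumptions for illustration only; the sketch uses scikit-learn's `KNNImputer` with five neighbours, as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in with made-up relationships between the three columns.
rng = np.random.default_rng(2)
calories = rng.uniform(50, 1000, size=400)
toy = pd.DataFrame({
    "Calories": calories,
    "Total_Fat_g": np.clip(0.05 * calories + rng.normal(0, 3, size=400), 0, None),
    "Sodium_mg": np.clip(1.5 * calories + rng.normal(0, 150, size=400), 0, None),
})
original_sodium = toy["Sodium_mg"].copy()

# MNAR-style mask: only values above the 75th percentile can go missing.
high = toy[toy["Sodium_mg"] > toy["Sodium_mg"].quantile(0.75)]
mnar_idx = high.sample(frac=0.3, random_state=2).index
toy.loc[mnar_idx, "Sodium_mg"] = np.nan

# The target column is passed to the imputer together with the predictors;
# neighbours are found using the columns that are present in each row.
cols = ["Calories", "Total_Fat_g", "Sodium_mg"]
imputed = KNNImputer(n_neighbors=5).fit_transform(toy[cols])
toy["Sodium_mg"] = imputed[:, cols.index("Sodium_mg")]

mae = mean_absolute_error(original_sodium.loc[mnar_idx], toy.loc[mnar_idx, "Sodium_mg"])
print(f"MAE on masked rows: {mae:.1f}")
```

`KNNImputer` measures distance on the raw, unscaled columns, so a feature with a large numeric range such as `Calories` will tend to dominate the neighbour search. Standardizing the predictors first is one option worth trying. Because the masked values all come from the upper tail, imputations drawn from the remaining observed rows may also tend to underestimate them.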

### Purpose:
This test evaluates the performance of similarity-based imputation (using KNN) for attributes with MNAR missingness. It examines whether similar observations (in terms of `Calories` and `Total_Fat_g`) can provide reasonable estimates for missing `Sodium_mg` values.

---

## Evaluation of Imputation Methods

For all tests, the evaluation metric is Mean Absolute Error (MAE). MAE is calculated as the average absolute difference between the imputed values and the original values (only on the rows where missingness was simulated and which originally had valid data). Lower MAE values indicate better performance of the imputation method.
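
As a small worked example of this definition, the sketch below computes MAE by hand on five made-up values, two of which were masked and then imputed. The arrays are purely illustrative and do not come from the dataset.

```python
import numpy as np

original = np.array([200.0, 350.0, 500.0, 120.0, 640.0])
imputed = np.array([200.0, 300.0, 500.0, 300.0, 640.0])
was_masked = np.array([False, True, False, True, False])

# Only the masked positions contribute: |350 - 300| = 50 and |120 - 300| = 180.
abs_errors = np.abs(original[was_masked] - imputed[was_masked])
mae = abs_errors.mean()
print(mae)  # 115.0
```

Rows that were never masked are excluded because their imputed value is identical to the original and would only pull the average towards zero.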

### Output Example:
- For **Test 1**, the script outputs the median used for imputation and the MAE for the imputed `Calories` values. Because the imputed value is the same for every imputed row, we chose not to print those rows.
- For **Test 2**, it prints the MAE for the regression-imputed `Total_Fat_g` values and shows a sample of rows with their original and imputed values. 
- For **Test 3**, it displays the MAE for the KNN-imputed `Sodium_mg` values and a sample of the modified rows.

By comparing the MAE values and inspecting the modified rows across the three tests, we can evaluate the effectiveness of different imputation techniques for various missingness mechanisms. Each test also writes its own modified CSV file to illustrate the change to the dataset. Compare each file with the original dataset on GitHub to see the effect of the imputation.


# Dataset Description: FastFoodNutritionMenuV2.csv

**Dataset Name:**  
Fast Food Nutrition Menu V2

**Source/Author:**  
Joakim Arvidsson

**Purpose:**  
The dataset is designed to offer comprehensive nutritional details for fast food items. It serves as a resource for nutritional analysis, public health research, and data science projects related to food and diet, enabling studies on caloric content, fat composition, sodium levels, and other key nutritional metrics.

**Shape:**  
- **Rows:** 1148  
- **Columns:** 14  

**Features and Descriptions:**

1. **Company** (Categorical)  
   - *Description*: The fast food chain or restaurant that offers the menu item.  
   

2. **Item** (Categorical)  
   - *Description*: The name or description of the menu item.  
  

3. **Calories** (Numerical)  
   - *Description*: Total energy content of the item in kilocalories (kcal).  
  

4. **Calories from Fat** (Numerical)  
   - *Description*: The number of calories provided by fat.
  

5. **Total Fat (g)** (Numerical)  
   - *Description*: Total fat content in grams.  
  

6. **Saturated Fat (g)** (Numerical)  
   - *Description*: Saturated fat content in grams.

7. **Trans Fat (g)** (Numerical)  
   - *Description*: Trans fat content in grams.  

8. **Cholesterol (mg)** (Numerical)  
   - *Description*: Cholesterol content measured in milligrams. 

9. **Sodium (mg)** (Numerical)  
   - *Description*: Sodium content in milligrams.  

10. **Carbs (g)** (Numerical)  
    - *Description*: Total carbohydrate content in grams.   

11. **Fiber (g)** (Numerical)  
    - *Description*: Dietary fiber content in grams.

12. **Sugars (g)** (Numerical)  
    - *Description*: Sugar content in grams.

13. **Protein (g)** (Numerical)  
    - *Description*: Protein content in grams. 

14. **Weight Watchers Pnts** (Numerical)  
    - *Description*: Weight Watchers points assigned to the menu item, useful for dietary tracking. Contains many missing values.  

---




# Code for Test 1

In [79]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error


# Load the dataset from GitHub
file_path = "https://raw.githubusercontent.com/Kevin-Nav07/Assignment-2/refs/heads/main/FastFoodNutritionMenuV2.csv"
data = pd.read_csv(file_path)


In [80]:

def evaluate_imputation(original, imputed, missing_idx):
    # Evaluate only on rows that were artificially set to missing and originally had a valid value
    eval_mask = original.index.isin(missing_idx) & original.notna()
    if eval_mask.sum() == 0:
        print("No imputation needed or no valid rows for evaluation.")
        return None
    mae = mean_absolute_error(original[eval_mask], imputed[eval_mask])
    print(f"Mean Absolute Error on imputed rows: {mae:.4f}")
    return mae

In [81]:


# Convert 'Calories' to numeric (coerce errors to NaN)
data["Calories"] = pd.to_numeric(data["Calories"], errors="coerce")


original_calories = data["Calories"].copy()

# Simulate missingness (MCAR): mask a random 20% of the rows.
# sample() is seeded through random_state, so no global NumPy seed is needed.
mcar_idx = data.sample(frac=0.2, random_state=42).index
data.loc[mcar_idx, "Calories"] = np.nan

# Imputation, replace missing values directly in 'Calories' with the median 
median_calories = data["Calories"].median()
data["Calories"] = data["Calories"].fillna(median_calories)

print("### Test 1: Univariate Imputation (Median) on 'Calories' (MCAR) ###")
print(f"Median used for imputation: {median_calories}")
evaluate_imputation(original_calories, data["Calories"], mcar_idx)

# save the modified dataset back
data.to_csv("FastFoodNutritionMenuV2_modified_test1.csv", index=False)


### Test 1: Univariate Imputation (Median) on 'Calories' (MCAR) ###
Median used for imputation: 240.0
Mean Absolute Error on imputed rows: 188.2143


# Test 2 Code

In [82]:

# Normalize column names: drop punctuation, then replace whitespace runs with underscores,
# so that "Total Fat (g)" becomes "Total_Fat_g" and "Sodium (mg)" becomes "Sodium_mg".
# No separate rename is needed, because the parentheses are already stripped here.
data.columns = data.columns.str.replace(r"[^\w\s]", "", regex=True).str.replace(r"\s+", "_", regex=True)

# Convert relevant columns to numeric
data["Calories"] = pd.to_numeric(data["Calories"], errors="coerce")
data["Total_Fat_g"] = pd.to_numeric(data["Total_Fat_g"], errors="coerce")
# Calories still holds the median-imputed values from Test 1; this fillna is a
# safeguard and changes nothing if Test 1 has already run.
median_calories = data["Calories"].median()
data["Calories"] = data["Calories"].fillna(median_calories)
original_total_fat = data["Total_Fat_g"].copy()

#  Simulate Missingness (MAR) by removing Total_Fat_g for rows where Calories is below its median.
mar_indices = data[data["Calories"] < data["Calories"].median()].index
data.loc[mar_indices, "Total_Fat_g"] = np.nan
# Imputation: Regression Imputation with Clipping
# Use 'Calories' as the predictor for Total_Fat_g.
train_df = data.dropna(subset=["Total_Fat_g"])
X_train = train_df[["Calories"]]
y_train = train_df["Total_Fat_g"]
test_df = data[data["Total_Fat_g"].isna()]
X_test = test_df[["Calories"]]
model = LinearRegression()
model.fit(X_train, y_train)
predicted_fat = model.predict(X_test)
predicted_fat = np.maximum(predicted_fat, 0)
data.loc[data["Total_Fat_g"].isna(), "Total_Fat_g"] = predicted_fat

# Evaluation only on rows that were simulated missing and originally had valid Total_Fat_g.
evaluation_mask = data.index.isin(mar_indices) & original_total_fat.notna()
mae_total_fat = mean_absolute_error(original_total_fat[evaluation_mask], data["Total_Fat_g"][evaluation_mask])
print("### Test 2: Bivariate Imputation (Linear Regression with Clipping) on 'Total_Fat_g' (MAR) ###")
print(f"Mean Absolute Error on imputed rows: {mae_total_fat:.4f}")

# Output the rows that were modified, showing original and imputed values then save the file 
modified_df = pd.DataFrame({
    "Original_Total_Fat_g": original_total_fat.loc[mar_indices],
    "Imputed_Total_Fat_g": data.loc[mar_indices, "Total_Fat_g"]
})
print("\nRows that were modified (first 5 examples):")
print(modified_df.head())

data.to_csv("FastFoodNutritionMenuV2_modified_test2.csv", index=False)


### Test 2: Bivariate Imputation (Linear Regression with Clipping) on 'Total_Fat_g' (MAR) ###
Mean Absolute Error on imputed rows: 2.9645

Rows that were modified (first 5 examples):
    Original_Total_Fat_g  Imputed_Total_Fat_g
34                  11.0             9.486264
37                   0.0             0.000000
38                   0.0             0.000000
42                   0.0             0.141898
43                   0.0             0.141898


# Test 3 Code


In [83]:

# Convert relevant columns to numeric, coercing unparseable entries to NaN
data["Calories"] = pd.to_numeric(data["Calories"], errors="coerce")
data["Total_Fat_g"] = pd.to_numeric(data["Total_Fat_g"], errors="coerce")
data["Sodium_mg"] = pd.to_numeric(data["Sodium_mg"], errors="coerce")
orig_sodium = data["Sodium_mg"].copy()
# Identify high-sodium rows (those above the 75th percentile). This is for simulating missing values with MNAR
sodium_threshold = data["Sodium_mg"].quantile(0.75)
df_high_sodium = data[data["Sodium_mg"] > sodium_threshold]
missing_indices = df_high_sodium.sample(frac=0.3, random_state=42).index
data.loc[missing_indices, "Sodium_mg"] = np.nan
# Imputation: Similarity-Based (KNN). 'Calories' and 'Total_Fat_g' act as predictors for 'Sodium_mg'.
# The target column must be part of the matrix passed to KNNImputer so that its NaNs can be filled.
predictor_cols = ["Calories", "Total_Fat_g", "Sodium_mg"]
knn_imputer = KNNImputer(n_neighbors=5)
knn_imputed = knn_imputer.fit_transform(data[predictor_cols])
data["Sodium_mg"] = knn_imputed[:, 2]
# Evaluation by creating a mask for rows that were simulated as missing and originally had valid values then output example modified rows
evaluation_mask = data.index.isin(missing_indices) & orig_sodium.notna()
mae_sodium = mean_absolute_error(orig_sodium[evaluation_mask], data["Sodium_mg"][evaluation_mask])
print("### Test 3: Multivariate Imputation (KNN) on 'Sodium_mg' (MNAR) ###")
print(f"Mean Absolute Error on imputed rows: {mae_sodium:.4f}")
modified_rows = pd.DataFrame({
    "Original_Sodium_mg": orig_sodium.loc[missing_indices],
    "Imputed_Sodium_mg": data.loc[missing_indices, "Sodium_mg"]
})
print("\nRows that were modified (first 5 examples):")
print(modified_rows.head())

data.to_csv("FastFoodNutritionMenuV2_modified_test3.csv", index=False)


### Test 3: Multivariate Imputation (KNN) on 'Sodium_mg' (MNAR) ###
Mean Absolute Error on imputed rows: 324.8195

Rows that were modified (first 5 examples):
      Original_Sodium_mg  Imputed_Sodium_mg
32                 730.0              870.0
668               1080.0              630.0
740               1500.0              894.0
521               1110.0             1300.0
1075               720.0              540.0


# Conclusion

In these experiments, we applied three imputation techniques to the FastFoodNutritionMenuV2 dataset, each under a different simulated missingness mechanism: MCAR for the Calories attribute using median imputation, MAR for the Total_Fat_g attribute using regression imputation with clipping, and MNAR for the Sodium_mg attribute using KNN imputation. Each method was evaluated with Mean Absolute Error (MAE) on the rows that were masked. In our run, the MAE was 188.2143 for median imputation of Calories, 2.9645 for regression imputation of Total_Fat_g, and 324.8195 for KNN imputation of Sodium_mg. Because these attributes are measured in different units and on different scales, the three numbers are best read alongside the sample rows rather than ranked against each other directly. Although all techniques successfully replaced the missing values, the results highlight areas for improvement, such as tuning model parameters and exploring more advanced imputation methods like MICE or other predictive approaches. Future work could also incorporate domain knowledge to refine the imputed values and assess the impact of imputation on downstream analyses.


# References


2. **Scikit-learn Documentation:**  
   - Imputation module: [https://scikit-learn.org/stable/modules/impute.html](https://scikit-learn.org/stable/modules/impute.html)  
   - Linear Regression: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)  
   - KNN Imputer: [https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

3. **Pandas Documentation:**  
   [https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)

4. **NumPy Documentation:**  
   [https://numpy.org/doc/](https://numpy.org/doc/)
