# Assignment 2: Data Cleaning - Part 2: Imputation
## Group 105
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error

## Introduction

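This notebook is the second part of Assignment 2 on data cleaning and focuses on imputation. Using the Auto-mpg dataset, we simulate missing values in a chosen attribute, replace them with an imputation method, and evaluate how close the imputed values are to the known true values.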

## Dataset Description

**Url:** https://www.kaggle.com/datasets/uciml/autompg-dataset <br>
**Name:** Auto-mpg Dataset <br>
**Author:** UCI Machine Learning Repository (originally from StatLib library, maintained at Carnegie Mellon University) <br>
**Purpose:** The dataset includes the technical specifications of cars. The original purpose of collection is not explicitly listed on Kaggle, but it appears to have been collected to evaluate how fuel consumption relates to various other attributes of vehicles (e.g., horsepower, weight). In 1993, Ross Quinlan used the dataset to train a machine learning model to predict fuel consumption based on the other eight features.<br>
**Shape:** There are 398 rows and 9 columns. (398, 9)<br>
**Features:** Further explanation of the features retrieved from https://code.datasciencedojo.com/tshrivas/dojoHub/tree/master/Auto%20MPG%20Data%20Set
- `mpg` (numerical): The vehicle's fuel efficiency, measured in miles per gallon (mpg).
- `cylinders` (categorical): The number of cylinders in the vehicle's engine.
- `displacement` (numerical): The total volume of the cylinders in the vehicle, measured in cubic inches.
- `horsepower` (numerical): A measurement of the vehicle's engine's power.
- `weight` (numerical): The weight of the vehicle, measured in pounds (lbs).
- `acceleration` (numerical): Time to go from 0 to 60 miles per hour, measured in seconds.
- `model year` (categorical): The year of release of the vehicle.
- `origin` (categorical): The region of manufacturing.
    - 1: USA
    - 2: Europe
    - 3: Japan
- `car name` (categorical): The name of the vehicle.

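**Missing Values:** Yes, there are missing values. In particular, `horsepower` has 6 missing values. They are encoded as the string `?` rather than as NaN, which is why `info()` below reports 398 non-null entries for `horsepower` and gives it the `object` dtype.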
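As a side note, placeholder strings like `?` can be converted to NaN at load time. The sketch below is not part of the assignment workflow; it uses a tiny inline CSV with made-up values (the name `toy_df` and the data are assumptions for illustration) and the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for auto-mpg.csv (values are illustrative, not taken from the real file)
csv_text = """mpg,horsepower,weight
18.0,130,3504
25.0,?,2046
21.0,?,2875
"""

# Treat the "?" placeholder as a missing value while parsing
toy_df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(toy_df["horsepower"].dtype)         # numeric dtype instead of object
print(toy_df["horsepower"].isna().sum())  # number of detected missing values
```

With this option the column is parsed as numeric, and standard tools such as `isna()` and `dropna()` see the missing entries directly.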

## Loading Dataset and Basic Exploration

In [2]:
# Read in the dataset from a public repository
url = "https://raw.githubusercontent.com/Natasa127/CSI4142-A2/refs/heads/main/auto-mpg.csv"
auto_df = pd.read_csv(url)
auto_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [3]:
auto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [4]:
auto_df.shape

(398, 9)

### Basic Cleaning
Since there are only six missing values in the entire dataset (1.5% of the rows), we use listwise removal to delete these rows. The assignment requires simulating missing values (part b), which lets us use cross-validation because the true values of the simulated missing entries are known. Deleting the six rows in advance makes this cross-validation approach possible, and because so little data is removed, the deletion will not significantly impact our analysis.

**References:** <br>
Listwise removal: https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/ <br>
Series equality: https://pandas.pydata.org/docs/reference/api/pandas.Series.eq.html

In [5]:
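# Listwise removal: drop every row whose horsepower is the '?' placeholder
# (the column keeps its object dtype, since the remaining values are still strings)
auto_df = auto_df.drop(auto_df[auto_df['horsepower'] == '?'].index)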

In [6]:
# Confirm that no '?' placeholders remain
auto_df['horsepower'].eq('?').sum()

0

The six rows have successfully been removed, and we are now ready for imputation and evaluation using cross-validation with simulation.
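Dropping the rows leaves `horsepower` as a column of strings. If a numeric column is needed later, one possible alternative is to coerce the placeholder to NaN and then drop the affected rows. The sketch below runs on a toy DataFrame (the names `toy_df` and `clean_df` and the values are assumptions for illustration), not on the assignment data.

```python
import pandas as pd

# Toy data with the same '?' placeholder convention (illustrative values)
toy_df = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0, 16.0],
    "horsepower": ["130", "?", "150", "?"],
})

# Turn unparseable entries such as "?" into NaN, then drop those rows (listwise removal)
toy_df["horsepower"] = pd.to_numeric(toy_df["horsepower"], errors="coerce")
clean_df = toy_df.dropna(subset=["horsepower"]).reset_index(drop=True)

print(clean_df)
```

Resetting the index is optional; it is done here only so that the cleaned frame has consecutive row labels.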

## Imputation Tests

### Imputation Test 1: Random Sample Imputation on Acceleration (Univariate)

#### (a) Chosen Attribute

We have chosen to test random sample imputation (univariate) on the acceleration attribute.

#### (b) Simulate Missing Values

We are simulating missing acceleration values using the MCAR approach, where the missing acceleration values are chosen completely at random regardless of their own value or the values of the other attributes.

We are simulating the missing values in a copy of the original DataFrame such that we can evaluate the imputation accuracy afterwards by comparing the imputed acceleration values with the true acceleration values.

**References:** <br>
MCAR: https://www.kaggle.com/code/yassirarezki/handling-missing-data-mcar-mar-and-mnar-part-i <br>
Loc Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html <br>
Copy Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html <br>
Fixing Random Seed: https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do

In [7]:
# Create a copy of the DataFrame
missing_acceleration_df = auto_df.copy()

# Set the random seed to make the results reproducible (comment this out to try on truly random missing values)
np.random.seed(0)

# Generate a boolean NumPy array that holds True for rows that will be missing, and False for rows that will persist
# Note that missing_percent is a fraction (0.1 = 10%), applied independently to each row
missing_percent = 0.1
missing_values = np.random.choice([True, False], len(auto_df), p=[missing_percent, 1 - missing_percent])

# Replace the acceleration with NaN where the missing_values mask is True
missing_acceleration_df.loc[missing_values, "acceleration"] = np.nan

# Check to see if values are missing
missing_acceleration_df['acceleration'].isna().sum()

39
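The mask above draws each row independently with probability 0.1, so the number of missing rows is itself random (39 of 392 rows here). If an exact count is preferred, one option is to pick a fixed number of row positions without replacement. The sketch below is an alternative under that assumption; the helper `mcar_mask` and the toy data are hypothetical and not part of the assignment code.

```python
import numpy as np
import pandas as pd

def mcar_mask(n_rows, missing_fraction, seed=None):
    """Return a boolean array with exactly round(n_rows * missing_fraction) True entries."""
    rng = np.random.default_rng(seed)
    n_missing = int(round(n_rows * missing_fraction))
    mask = np.zeros(n_rows, dtype=bool)
    # Pick row positions uniformly at random, independent of any data values (MCAR)
    mask[rng.choice(n_rows, size=n_missing, replace=False)] = True
    return mask

# Toy column standing in for acceleration (illustrative values)
toy_df = pd.DataFrame({"acceleration": np.linspace(8.0, 25.0, 50)})
mask = mcar_mask(len(toy_df), 0.1, seed=0)

# Work on a copy so the true values stay available for evaluation
toy_missing_df = toy_df.copy()
toy_missing_df.loc[mask, "acceleration"] = np.nan
print(toy_missing_df["acceleration"].isna().sum())
```

Both approaches are MCAR, since the choice of rows never depends on the data; they differ only in whether the number of missing rows is fixed or random.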

#### (c) Program an Imputation Approach to Replace the Missing Values

We have chosen random sample imputation as our first imputation approach. Random sample imputation is a univariate technique.

**References:** <br>
Sample Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

In [8]:
# Save a series of the non-missing acceleration values
acceleration_values = missing_acceleration_df.loc[missing_acceleration_df['acceleration'].notna(), 'acceleration']

# Sample, with replacement, as many non-missing acceleration values as there are missing values
# Note that random_state is set to 1 to make the results reproducible. This can be removed for truly random samples
sampled_acceleration_values = acceleration_values.sample(missing_acceleration_df['acceleration'].isna().sum(), replace=True, random_state=1).values

# Replace the missing values with sampled acceleration values
missing_acceleration_df.loc[missing_values, "acceleration"] = sampled_acceleration_values

# Check to see if values are missing
missing_acceleration_df['acceleration'].isna().sum()

0
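The same steps can be wrapped in a small reusable function. This is a minimal sketch under the assumption that the input is a single numeric pandas Series; the function name `random_sample_impute` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, seed=None):
    """Fill NaN entries with values drawn (with replacement) from the observed entries."""
    filled = series.copy()
    is_missing = filled.isna()
    observed = filled[~is_missing]
    draws = observed.sample(n=int(is_missing.sum()), replace=True, random_state=seed)
    # Use .to_numpy() so values are assigned by position, not aligned on the sampled index
    filled[is_missing] = draws.to_numpy()
    return filled

# Toy column with two missing entries (illustrative values)
toy = pd.Series([12.0, np.nan, 15.5, 11.0, np.nan, 14.0])
imputed = random_sample_impute(toy, seed=1)
print(imputed.isna().sum())
```

Because the function works on a copy, the input Series keeps its NaN entries, which makes it easy to try several imputation methods on the same simulated data.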

#### (d) Evaluate to What Extent Your Approach is Finding the Missing Values

Since we are using the cross-validation with simulation approach where we simulate missing data, we can compare our imputed values to the true values.

##### Evaluation Using MSE

We will first use the Mean Squared Error (MSE), which is the average of the squared differences between each imputed value and its true value.

**References:** <br>
Mean Squared Error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
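For $n$ simulated missing entries with true values $y_i$ and imputed values $\hat{y}_i$, the MSE is $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below computes it by hand with NumPy on toy numbers (the helper `mse` and the values are assumptions for illustration); it is meant to show what the metric measures, and the notebook itself uses scikit-learn's `mean_squared_error`.

```python
import numpy as np

def mse(true_values, imputed_values):
    """Average of the squared differences between true and imputed values."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    return float(np.mean((true_values - imputed_values) ** 2))

# Toy example: the errors are 1, 0 and 4 seconds
true_toy = [12.0, 15.0, 18.0]
imputed_toy = [13.0, 15.0, 14.0]
print(mse(true_toy, imputed_toy))  # (1 + 0 + 16) / 3
```

The single 4-second error contributes 16 of the 17 squared units, which illustrates how strongly MSE penalizes large individual errors.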

In [9]:
# First determine the original values of the missing data
original_values = auto_df.loc[missing_values, 'acceleration'].values

# Then determine the imputed values of the missing data
imputed_values = missing_acceleration_df.loc[missing_values, 'acceleration'].values

# Compute MSE
mse = mean_squared_error(original_values, imputed_values)
mse

9.729999999999997

Thus, the mean squared error is 9.73 using random sample imputation. The MSE is not particularly intuitive to interpret as a standalone value since it is measured in squared units (here, seconds squared). A more useful way to interpret it is to compare it to the MSE of a baseline method, such as median imputation. In the following cell we perform median imputation on the same simulated missing values and compare the two MSE values to evaluate the effectiveness of the random sampling method.

In [10]:
# Replace the imputed values with NaN where the missing_values series is True.
missing_acceleration_df.loc[missing_values, "acceleration"] = np.nan

# Save the median of the non-missing acceleration values
acceleration_median = missing_acceleration_df.loc[missing_acceleration_df['acceleration'].notna(), 'acceleration'].median()

# Replace the missing values with median
missing_acceleration_df.loc[missing_values, "acceleration"] = acceleration_median

# Compute MSE
original_values = auto_df.loc[missing_values, 'acceleration'].values
imputed_values = missing_acceleration_df.loc[missing_values, 'acceleration'].values
median_mse = mean_squared_error(original_values, imputed_values)
median_mse

5.330512820512821

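Random sampling gave an MSE of 9.73, while default value imputation using the median gave an MSE of 5.33, so on this metric random sampling performs noticeably worse than the baseline. This is expected to some degree: each random draw adds its own variability on top of the variability of the true value, whereas the median always lies near the center of the distribution.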
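The size of the gap can be checked with a small simulation. For a value drawn independently from the same distribution as the true value, the expected squared error is about twice the variance, while a constant near the center gives about one variance, so a ratio close to 2 is expected; the observed ratio of 9.73 to 5.33 (about 1.8) is in line with that. The sketch below uses synthetic, normally distributed data as a stand-in (an assumption; it is not the real acceleration column).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an attribute (assumed roughly symmetric); not the real acceleration data
values = rng.normal(loc=15.5, scale=2.8, size=100_000)
missing = rng.random(values.size) < 0.1

observed = values[~missing]
true_missing = values[missing]

# Random sample imputation: draw replacements from the observed values
random_fill = rng.choice(observed, size=true_missing.size, replace=True)
# Median imputation: one constant for every missing entry
median_fill = np.full(true_missing.size, np.median(observed))

mse_random = np.mean((true_missing - random_fill) ** 2)
mse_median = np.mean((true_missing - median_fill) ** 2)
print(mse_random / mse_median)  # close to 2 for large samples
```

With only a few dozen missing values, as in this notebook, the ratio fluctuates more from one random seed to another.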

##### Evaluation Using MAE

A second approach to evaluation is the Mean Absolute Error (MAE), which is more intuitive since its units are the same as those of the original data. MAE is the average absolute difference between each imputed value and its true value.

**References:** <br>
Mean Absolute Error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

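In [11]:
# Note: imputed_values was overwritten in In [10], so it holds the median-imputed values here.
# The result below is therefore the MAE of the median baseline, not of random sample imputation.
mae = mean_absolute_error(original_values, imputed_values)
mae

1.8076923076923077

The output above (1.81) is the MAE of the median baseline, because `imputed_values` was last assigned in the median imputation cell. For random sample imputation, the MAE is 2.61, which means that, on average, the imputed acceleration was off by 2.61 seconds. Reproducing this figure requires passing `sampled_acceleration_values` from In [8] to `mean_absolute_error` in place of `imputed_values`. In the context of a vehicle's 0-to-60 time, an average error of 2.61 seconds is tolerable, although it is larger than the baseline's 1.81 seconds.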

##### Evaluation Conclusion
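Random sample imputation has a notably higher MSE than default value imputation using the median (9.73 versus 5.33). Even so, it remains usable: its MAE of 2.61 seconds is reasonable for an acceleration measurement, and, unlike imputing a single constant, drawing from the observed values does not shrink the spread of the attribute.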
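When several imputation methods are compared, it helps to keep each method's imputed values under its own name so that no result is overwritten before it is evaluated. The sketch below shows one possible layout on synthetic data; the dictionary `imputed`, the table `results`, and the data are assumptions for illustration, not the assignment's code or results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the true column; not the real acceleration data
true_series = pd.Series(rng.normal(15.5, 2.8, size=400))
mask = rng.random(true_series.size) < 0.1
observed = true_series[~mask]
true_missing = true_series[mask].to_numpy()

# Keep each method's imputed values under its own name so nothing is overwritten
imputed = {
    "random_sample": observed.sample(n=int(mask.sum()), replace=True, random_state=1).to_numpy(),
    "median": np.full(int(mask.sum()), observed.median()),
}

# One row per method, one column per metric
results = pd.DataFrame({
    name: {
        "MSE": np.mean((true_missing - values) ** 2),
        "MAE": np.mean(np.abs(true_missing - values)),
    }
    for name, values in imputed.items()
}).T
print(results)
```

Each metric is computed from the same true values and the same simulated missing positions, so the rows of the table are directly comparable.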