# Individual Assignment 4: **Missing Data**
The selected project was topic A.

After selecting a few datasets without missing data, we induce missing values in them using different mechanisms. We then apply different techniques for handling the missing data and compare both the mechanisms and the techniques.

The datasets chosen for the experiments were:
- ...
- ...

The missing data mechanisms were MCAR (Missing Completely At Random), MAR (Missing At Random) and MNAR (Missing Not At Random).
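
To make the difference between the three mechanisms concrete, here is a minimal sketch on synthetic data. It does not use `mdatagen`; the column names `x` and `z`, the helper functions and the quantile-based masking rule are all assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "x": rng.normal(size=n),   # hypothetical column that will receive missing values
    "z": rng.normal(size=n),   # hypothetical fully observed auxiliary column
})

def mask_mcar(df, rate, rng):
    """MCAR: every row has the same probability of being masked."""
    return rng.random(len(df)) < rate

def mask_mar(df, rate):
    """MAR: rows with the largest observed `z` are masked, regardless of `x`."""
    return (df["z"] >= df["z"].quantile(1 - rate)).to_numpy()

def mask_mnar(df, rate):
    """MNAR: rows with the largest `x` are masked, so missingness depends on the hidden value."""
    return (df["x"] >= df["x"].quantile(1 - rate)).to_numpy()

masks = {
    "MCAR": mask_mcar(toy, 0.3, rng),
    "MAR": mask_mar(toy, 0.3),
    "MNAR": mask_mnar(toy, 0.3),
}
for name, mask in masks.items():
    observed_mean = toy.loc[~mask, "x"].mean()
    print(f"{name}: {mask.mean():.0%} missing, mean of observed x = {observed_mean:.3f}")
```

Under MNAR the mean of the observed `x` values is pulled downwards because the largest values are exactly the ones that were removed, which is why this mechanism is usually the hardest to correct by imputation.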

We experimented with the following techniques:
- ...
- ...

Finally, we evaluated these techniques and compared them using these metrics:
- ...
- ...

The notebook is organized into the following sections:
1. Introduction
2. Datasets and Preprocessing
3. Missing Data Mechanisms
4. Missing Data Techniques
5. Evaluation
6. Conclusion

<!--ipykernel==6.29.5-->

In [553]:
%pip install -qU pandas==2.2.3 scikit-learn==1.6.0 seaborn==0.13.2 matplotlib==3.10.0 mdatagen==0.1.71 missingno==0.5.2 setuptools==75.6.0

Note: you may need to restart the kernel to use updated packages.


Note that the notebook was run using Python 3.12.3 and was also tested on Google Colab. <!-- using TODO -->

In [554]:
import numpy as np
import pandas as pd

# enable_iterative_imputer is a side-effect import and must come before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.univariate.uMAR import uMAR
from mdatagen.univariate.uMNAR import uMNAR
from mdatagen.multivariate.mMCAR import mMCAR
from mdatagen.multivariate.mMAR import mMAR
from mdatagen.multivariate.mMNAR import mMNAR

import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

In [555]:
def load_datasets():
    """Loads the datasets and returns a dictionary of dataset labels and dataframes"""
    mobile_price = pd.read_csv("mobile_price.csv")
    mobile_price.rename(columns={"price_range": "target"}, inplace=True)
    print(mobile_price.columns)
    return {
        "mobile_price": mobile_price,
    }

In [556]:
RANDOM_STATE = 42

In [557]:
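class MyImputer:
    """Groups the imputers that make up one missing data technique.

    Set `general` to impute every column with a single imputer, or set
    `numerical` and `binary` to impute each column type separately.
    """
    def __init__(self, general=None, numerical=None, binary=None):
        self.general = general
        self.numerical = numerical
        self.binary = binary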

In [558]:
MISSING_RATES = [10, 30, 50]
MISSING_MECHANISMS = ["MCAR", "MAR", "MNAR"]
MISSING_TECHNIQUES = {
    "Simple": MyImputer(
        numerical=SimpleImputer(strategy="mean"),
        binary=SimpleImputer(strategy="most_frequent"),
    ),
    "MICE": MyImputer(general=IterativeImputer()),
    "KNN": MyImputer(general=KNNImputer()),
}

MODELS = [LogisticRegression(), RandomForestClassifier(), KNeighborsClassifier()]
METRICS = ["accuracy", "precision", "recall", "f1"]
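
To illustrate how the three techniques configured above differ, the sketch below fills the same single missing entry with each of them. The toy array, `random_state=0` and `n_neighbors=2` are assumptions made for this example and are not the settings used in the experiments.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy data where the second column is twice the first; the last value is missing
toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, np.nan],
])

imputers = {
    "Simple (mean)": SimpleImputer(strategy="mean"),
    "MICE": IterativeImputer(random_state=0),
    "KNN": KNNImputer(n_neighbors=2),
}

# Value each imputer writes into the missing cell (row 4, column 1)
filled = {name: imp.fit_transform(toy)[4, 1] for name, imp in imputers.items()}
for name, value in filled.items():
    print(f"{name}: {value:.2f}")
```

The mean imputer ignores the other column entirely, KNN averages the two closest rows, and the iterative imputer regresses the incomplete column on the complete one, so it is the only one that can extrapolate along the linear relationship.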

In [559]:
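def count_missing_values(df: pd.DataFrame) -> pd.Series:
    """Returns the number of missing values per column, omitting complete columns"""
    count = df.isna().sum()
    return count[count > 0]


def print_missing_values(count: pd.Series) -> None:
    """Prints the per-column missing counts, or a notice when there are none"""
    if len(count) == 0:
        print("No missing values")
        return
    print(count)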

In [560]:
initial_dfs = load_datasets()

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'target'],
      dtype='object')


We selected datasets without missing values so that we fully control which missing data mechanism is present.

In [561]:
for label, df in initial_dfs.items():
    print(f"Dataset {label}")
    print_missing_values(count_missing_values(df))

Dataset mobile_price
No missing values


In [562]:
for label, df in initial_dfs.items():
    print(f"Dataset {label}")
    df.info()  # info() prints its report itself and returns None

Dataset mobile_price
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   

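The `method` argument of the generator selects which feature receives the missing values:
- `method="min"`: the feature least correlated with the target
- `method="max"`: the feature most correlated with the target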
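
The idea of picking the most or least correlated feature can be sketched with plain pandas. This is not `mdatagen`'s implementation: the use of absolute Pearson correlation, the synthetic data and the column names `strong` and `weak` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 4, size=n)
toy = pd.DataFrame({
    "strong": target + rng.normal(scale=0.1, size=n),   # closely tracks the target
    "weak": rng.normal(size=n),                          # unrelated to the target
})

# Absolute Pearson correlation of each feature with the target
abs_corr = toy.corrwith(pd.Series(target)).abs()

most_correlated = abs_corr.idxmax()    # candidate column for a "max"-style choice
least_correlated = abs_corr.idxmin()   # candidate column for a "min"-style choice
print(abs_corr)
print(most_correlated, least_correlated)
```

Removing values from the most correlated feature is the harsher test for a classifier, since that feature carries the most information about the target.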

In [563]:
missing_data_dfs = []
for dataset_label, df in initial_dfs.items():
    for missing_rate in MISSING_RATES:
        # Target variable should be named "target"
        X = df.drop("target", axis=1)
        y = df["target"].values

        # Only univariate MCAR is generated for now; the other generators
        # (uMAR, uMNAR, mMAR, mMCAR, mMNAR) are still to be wired in.
        generator = uMCAR(X=X, y=y, missing_rate=missing_rate, method="max")
        new_df = generator.random()

        # generator = mMCAR(X=X, y=y, missing_rate=missing_rate)
        # new_df = generator.random()
        # new_df["target"] = y    # mMCAR does not return the dataframe with the target

        missing_data_dfs.append({
            "mechanism": "MCAR",    # label of the mechanism used by the generator above
            "missing_rate": missing_rate,
            "dataset": dataset_label,
            "df": new_df,
        })
        print_missing_values(count_missing_values(new_df))

ram    200
dtype: int64
ram    600
dtype: int64
ram    1000
dtype: int64


In [564]:
results = []
for df_info in missing_data_dfs:
    df = df_info["df"]
    missing_rate = df_info["missing_rate"]
    mechanism = df_info["mechanism"]
    dataset = df_info["dataset"]

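    X = df.drop("target", axis=1)
    y = df["target"]

    # Positional indices are used because the general imputers return NumPy arrays.
    # A column counts as binary when it has exactly two distinct observed values.
    binary_features = [i for i, col in enumerate(X.columns) if X[col].nunique() == 2]
    # Every remaining column is treated as numerical
    numerical_features = [i for i in range(X.shape[1]) if i not in binary_features]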

    for imputer_label, imputer in MISSING_TECHNIQUES.items():
        if imputer.general is None:
            # Use separate imputers for numerical and binary columns
            preprocessor = ColumnTransformer([
                ('num', Pipeline([
                    ('imputer', imputer.numerical),         # Apply numerical imputer
                    ('scaler', StandardScaler())
                ]), numerical_features),
                ('binary', imputer.binary, binary_features) # Apply binary imputer
            ], remainder="passthrough")
        else:
            # Use just one imputer for everything
            preprocessor = Pipeline([
                ('imputer', imputer.general),               # Apply the general imputer to all columns
                ('column_transform', ColumnTransformer([
                    ('num', StandardScaler(), numerical_features)  # Scale only numerical features
                ], remainder="passthrough"))
            ])

        for model in MODELS:
            pipeline = Pipeline([
                ('preprocessor', preprocessor),
                ('classifier', model)
            ])

            scores = cross_val_score(pipeline, X=X, y=y, cv=5)
            results.append({
                'imputer': imputer_label,
                'model': model,
                "missing_rate": missing_rate,
                # "mechanism": mechanism,
                "dataset": dataset,
                'mean_accuracy': np.mean(scores),
                'std_accuracy': np.std(scores)
            })
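
The loop above records accuracy only. A minimal sketch of how several metrics could be collected in one pass with `cross_validate` follows. The synthetic data from `make_classification`, the macro-averaged scorer names (which assume a multiclass target) and the single mean-imputation pipeline are assumptions for illustration, not the notebook's final evaluation code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with roughly 10% of the entries removed at random
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Macro-averaged scorers work for multiclass targets
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv_scores = cross_validate(toy_pipeline, X_toy, y_toy, cv=5, scoring=scoring)

# cross_validate stores each metric under the key "test_<scorer name>"
summary = {metric: cv_scores[f"test_{metric}"].mean() for metric in scoring}
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

Because the imputer sits inside the pipeline, it is refit on the training folds only, so no information from the held-out fold leaks into the imputed values.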


In [565]:
print(len(results))
results_df = pd.DataFrame(results)
results_df

27


Unnamed: 0,imputer,model,missing_rate,dataset,mean_accuracy,std_accuracy
0,Simple,LogisticRegression(),10,mobile_price,0.8475,0.01084
1,Simple,RandomForestClassifier(),10,mobile_price,0.8045,0.015443
2,Simple,KNeighborsClassifier(),10,mobile_price,0.5395,0.010536
3,MICE,LogisticRegression(),10,mobile_price,0.8475,0.01084
4,MICE,RandomForestClassifier(),10,mobile_price,0.807,0.018802
5,MICE,KNeighborsClassifier(),10,mobile_price,0.5395,0.010536
6,KNN,LogisticRegression(),10,mobile_price,0.8535,0.009165
7,KNN,RandomForestClassifier(),10,mobile_price,0.809,0.006245
8,KNN,KNeighborsClassifier(),10,mobile_price,0.5365,0.008602
9,Simple,LogisticRegression(),30,mobile_price,0.667,0.023738
