# **Autoencoder for Missing Data Imputation**
This notebook demonstrates how to preprocess a dataset, then train and evaluate an autoencoder that imputes missing values. We will cover the following steps:

1. **Data Loading and Preprocessing**
2. **Data Transformation and Encoding**
3. **Missing Data Simulation**
4. **Autoencoder Implementation**
5. **Model Training with Early Stopping**
6. **Evaluation and Imputation Analysis**


## **1️⃣ Data Loading**
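In this section, we load the dataset and split it into training and test sets. The split is stratified on binned age so that both sets have matching age distributions. The helper column `Age_cat` and the target `y` are dropped after the split.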

In [ ]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

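def load_data(path):
    df = pd.read_csv(path, sep=';')
    # Bin age into groups so the split can be stratified on age group
    df['Age_cat'] = pd.cut(df['age'], bins=[10, 20, 30, 40, 50, 60, np.inf], labels=[1, 2, 3, 4, 5, 6])
    train_set, test_set = train_test_split(df, test_size=0.1, stratify=df['Age_cat'], random_state=42)
    # Drop the helper column and the target; drop() returns new frames, which
    # avoids the SettingWithCopyWarning raised by in-place drops on split slices
    train_set = train_set.drop(columns=['Age_cat', 'y'])
    test_set = test_set.drop(columns=['Age_cat', 'y'])
    return train_set, test_set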

# Example usage:
# path = 'path/to/bank-full.csv'
# train_set, test_set = load_data(path)
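
To see what stratifying on binned age buys, the following minimal sketch repeats the same binning and split on made-up ages and compares the age-group shares of the full data and the test set. The `toy` frame and its values are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ages (18 to 89); the real notebook reads them from the CSV.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 90, size=2000)})
toy["Age_cat"] = pd.cut(toy["age"], bins=[10, 20, 30, 40, 50, 60, np.inf],
                        labels=[1, 2, 3, 4, 5, 6])

train, test = train_test_split(toy, test_size=0.1, stratify=toy["Age_cat"],
                               random_state=42)

# Share of each age group in the full data and in the test set.
full_share = toy["Age_cat"].astype(int).value_counts(normalize=True).sort_index()
test_share = test["Age_cat"].astype(int).value_counts(normalize=True).sort_index()

# Largest absolute difference between the two distributions.
max_gap = float((full_share - test_share).abs().max())
print(f"largest share difference: {max_gap:.4f}")
```

With stratification the largest difference stays small, because each age group contributes to the test set in proportion to its size.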

## **2️⃣ Data Transformation and Encoding**
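In this section, we transform each group of columns: a log transform for `duration`, a power transform for `balance`, standardization for the remaining numeric columns, ordinal encoding for `education` and `month`, and one-hot encoding for the nominal columns.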

In [ ]:
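from sklearn.preprocessing import (PowerTransformer, OneHotEncoder, StandardScaler,
                                   OrdinalEncoder, FunctionTransformer)
from sklearn.compose import ColumnTransformer

# Columns not listed below are dropped (ColumnTransformer defaults to remainder='drop').
preprocessor = ColumnTransformer([
    # log1p = log(1 + x); assumes duration is non-negative and handles zeros
    ('log', FunctionTransformer(np.log1p), ['duration']),
    ('pt', PowerTransformer(), ['balance']),
    ('scaler', StandardScaler(), ['age', 'campaign', 'pdays', 'previous']),
    # Without an explicit categories= list, OrdinalEncoder orders categories alphabetically
    ('ordinal', OrdinalEncoder(), ['education', 'month']),
    ('nominal', OneHotEncoder(), ['job', 'marital', 'default', 'housing', 'loan', 'contact', 'poutcome'])
])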

# Example usage:
# preprocessed_train_set = preprocessor.fit_transform(train_set)
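
The log transform for a right-skewed column such as `duration` can be sketched with scikit-learn's `FunctionTransformer`. This is a minimal illustration on made-up values. Using `np.log1p` rather than `np.log` is an assumption made here so that zero durations stay finite.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up call durations in seconds, including a zero and a long outlier.
durations = np.array([[0.0], [30.0], [180.0], [600.0], [3600.0]])

# log1p(x) = log(1 + x); expm1 is its exact inverse.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

logged = log_tf.fit_transform(durations)       # compresses the long right tail
restored = log_tf.inverse_transform(logged)    # maps back to seconds

print(logged.ravel())
print(restored.ravel())
```

Supplying `inverse_func` makes it possible to map imputed values back to the original scale later.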

## **3️⃣ Missing Data Simulation**
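We simulate missing data by masking entries of the preprocessed matrix in a controlled way. The function below implements MCAR (Missing Completely At Random): each entry is masked independently with probability `missing_threshold`, regardless of its value. MNAR (Missing Not At Random), where the chance of being masked depends on the value itself, is what the `mechanism` argument is reserved for, and it needs its own masking rule.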

In [ ]:
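def apply_missingness(data, mechanism='mcar', missing_threshold=0.2, random_state=42):
    if mechanism != 'mcar':
        # Fail loudly instead of silently applying MCAR under another name
        raise NotImplementedError(f"mechanism '{mechanism}' is not implemented; only 'mcar' is")
    rng = np.random.default_rng(random_state)  # local generator, global NumPy state untouched
    data = np.array(data, dtype=float)  # float copy so that NaN can be stored
    # MCAR: mask each entry independently with probability missing_threshold
    mask = rng.random(data.shape) < missing_threshold
    data[mask] = np.nan
    return data, mask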

# Example usage:
# corrupted_data, missing_mask = apply_missingness(preprocessed_train_set, 'mcar', 0.2)
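
Under MNAR, whether a value goes missing depends on the value itself. One simple way to simulate this is to mask the largest values in each column. The sketch below does that; `apply_mnar` and its quantile rule are hypothetical choices for illustration, not the notebook's own mechanism.

```python
import numpy as np

def apply_mnar(data, missing_threshold=0.2):
    """Hypothetical MNAR rule: mask the top `missing_threshold` share of each column."""
    data = np.array(data, dtype=float)  # float copy so that NaN can be stored
    # Per-column cutoff: values above the (1 - p) quantile are removed.
    cutoffs = np.quantile(data, 1 - missing_threshold, axis=0)
    mask = data > cutoffs
    data[mask] = np.nan
    return data, mask

# Made-up continuous data standing in for the preprocessed matrix.
rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 3))
corrupted, mask = apply_mnar(clean, missing_threshold=0.2)

print(f"share of masked entries: {mask.mean():.3f}")
```

Because only large values are removed, the observed mean of each column is biased downward, which is what makes MNAR harder to impute than MCAR.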

## **4️⃣ Autoencoder Implementation**
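We define a simple autoencoder in PyTorch that learns to reconstruct its input, which is what lets it fill in missing entries. The encoder maps the input to a 128-unit hidden layer with a ReLU activation, and the decoder is a single linear layer that maps back to `input_dim`. Note that the hidden layer only reduces dimensionality when `input_dim` is larger than 128.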

In [ ]:
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(128, input_dim)
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
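
A linear layer cannot consume NaN values: a single NaN in a row makes every output of that layer NaN. The sketch below shows one common workaround, offered as an assumption rather than the notebook's method: fill the gaps with a placeholder (zero here), compute the reconstruction error only on observed entries, and take the model's output where data was missing. The `nn.Sequential` stand-in mirrors the layer layout of `AutoEncoder` so that the block is self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch: 2 rows, 4 features, with NaN marking the missing entries.
x = torch.tensor([[1.0, float("nan"), 3.0, 4.0],
                  [float("nan"), 2.0, 0.5, 1.5]])

observed = ~torch.isnan(x)               # True where a value is present
x_filled = torch.nan_to_num(x, nan=0.0)  # zero placeholder (an assumption)

# Stand-in with the same layout as AutoEncoder: Linear -> ReLU -> Linear.
model = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 4))

nan_output = model[0](x)   # first layer on raw input: every entry is NaN
recon = model(x_filled)    # finite once the gaps are filled

# Mean squared error over observed entries only, so placeholders are not targets.
squared_error = (recon - x_filled) ** 2
masked_loss = (squared_error * observed).sum() / observed.sum()

# Imputation: keep observed values, use the reconstruction for missing ones.
imputed = torch.where(observed, x_filled, recon.detach())
print(masked_loss.item())
print(imputed)
```

An untrained model gives arbitrary imputations, so the printed values only demonstrate the mechanics.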

## **5️⃣ Training the Autoencoder**
We train the Autoencoder with early stopping. If the validation loss does not improve for a set number of epochs, training stops early.

In [ ]:
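def train_autoencoder(model, train_loader, epochs=50, learning_rate=1e-3,
                      val_loader=None, patience=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.MSELoss()
    best_loss, stale_epochs = float('inf'), 0
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for batch_data, in train_loader:
            optimizer.zero_grad()
            output = model(batch_data)
            loss = loss_fn(output, batch_data)  # reconstruct the input itself
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # Report the mean loss per batch so values are comparable across batch counts
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {total_loss / len(train_loader):.4f}')
        if val_loader is None:
            continue  # no validation data, so early stopping is skipped
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(b), b).item() for b, in val_loader) / len(val_loader)
        if val_loss < best_loss:
            best_loss, stale_epochs = val_loss, 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            print(f'Early stopping at epoch {epoch+1} (best validation loss {best_loss:.4f})')
            break

# Example usage (loaders must yield 1-tuples of NaN-free float tensors):
# model = AutoEncoder(input_dim=preprocessed_train_set.shape[1])
# train_autoencoder(model, train_loader)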
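
The `train_loader` passed to `train_autoencoder` is not constructed in this section. A minimal sketch of one way to build it follows, together with a held-out `val_loader`, since early stopping needs a validation loss. The stand-in matrix, the 80/20 split, the batch size of 32, and the zero fill are all assumptions for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the corrupted, preprocessed matrix (hypothetical: 100 rows, 8 features).
rng = np.random.default_rng(0)
corrupted = rng.normal(size=(100, 8))
corrupted[rng.random(corrupted.shape) < 0.2] = np.nan

# Fill gaps with zero (an assumption) and convert to float32, PyTorch's default dtype.
features = torch.from_numpy(np.nan_to_num(corrupted, nan=0.0)).float()

# TensorDataset yields 1-tuples, matching `for batch_data, in train_loader`.
dataset = TensorDataset(features)
train_part, val_part = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_part, batch_size=32, shuffle=True)
val_loader = DataLoader(val_part, batch_size=32)

batch, = next(iter(train_loader))
print(batch.shape, batch.dtype)
```

Shuffling is enabled only for the training loader; the validation loader keeps a fixed order so its loss is computed on the same batches every epoch.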