# Model Distillation - Additive Cyclic Boosted Tree

This notebook applies tree-based model distillation. A teacher (a feed-forward neural network) is distilled into a student (an additive model built from cyclic gradient-boosted decision trees) to produce an interpretable model. The methodology is based on Tan, S., Caruana, R., Hooker, G., Koch, P., and Gordo, A. (2018).

Notes on the paper:

https://www.notion.so/Model-Distillation-Tree-Based-LEARNING-GLOBAL-ADDITIVE-EXPLANATIONS-FOR-NEURAL-NETS-USING-MODEL-10ec2f4dbaa0800eb2ffeb76d1b8f744?pvs=4

Outline of Method:

1. Initialize Teacher Model: Create a feed-forward neural network (FNN) with ReLU activations; its output is logits.
2. Initialize Student Model: Initialize a decision tree model with a learning rate and a number of cycles.
3. Optimise Decision Trees: Using bagging and cyclic gradient boosting, train each new tree $h_m$ on the residual between the teacher output $F(x)$ and the current student $\hat{F}_{m-1}(x)$: $r_m = F(x) - \hat{F}_{m-1}(x)$
   - Cycle through feature subsets: Train each tree on a different feature group, sequentially.
   - Cycle through loss functions: Use different loss functions (e.g., Mean Squared Error, KL Divergence) in alternating iterations.
4. Combine the Models: At the end of the iterations, the final student model is an additive combination of all the trees trained during the process. Each tree corrects the residuals left by the previous ones.
5. Use gSHAP to visualise feature importance.
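Steps 2 to 4 of the outline can be sketched as follows. This is a minimal illustration rather than the notebook's final student model: it assumes the teacher logits are already available as a NumPy array (for a torch tensor, something like `tensor.numpy()`), it cycles over single features only, and it uses squared-error residuals throughout instead of alternating loss functions. The function names `fit_cyclic_additive_trees` and `predict_additive` and all default hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_cyclic_additive_trees(X, teacher_logits, n_cycles=50,
                              learning_rate=0.1, max_leaf_nodes=4):
    """Fit one shallow tree per feature per cycle on the current residuals."""
    X = np.asarray(X, dtype=float)
    teacher_logits = np.asarray(teacher_logits, dtype=float)
    n_samples, n_features = X.shape

    # Start from a constant prediction (the mean teacher logit).
    intercept = float(teacher_logits.mean())
    prediction = np.full(n_samples, intercept)
    # trees[j] collects every tree fitted on feature j; their shrunken sum
    # is the shape function for that feature.
    trees = [[] for _ in range(n_features)]

    for _ in range(n_cycles):
        for j in range(n_features):
            # r_m = F(x) - F_hat_{m-1}(x)
            residual = teacher_logits - prediction
            column = X[:, [j]]  # keep 2-D shape for scikit-learn
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(column, residual)
            prediction += learning_rate * tree.predict(column)
            trees[j].append(tree)
    return intercept, trees


def predict_additive(X, intercept, trees, learning_rate=0.1):
    """Sum the intercept and the shrunken output of every per-feature tree."""
    X = np.asarray(X, dtype=float)
    out = np.full(X.shape[0], intercept)
    for j, feature_trees in enumerate(trees):
        column = X[:, [j]]
        for tree in feature_trees:
            out += learning_rate * tree.predict(column)
    return out
```

Because every tree sees exactly one feature, the fitted model stays additive: the contribution of feature `j` can be plotted on its own by summing the trees in `trees[j]`. The same `learning_rate` must be passed to both functions.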

# Creating DataLoader and Dataset Class to load Data 

In [41]:
try:
    import torch 
    from torch.nn import Module
    from torch.utils.data import Dataset, DataLoader
    import numpy as np 
    import scipy 
    import matplotlib.pyplot as plt
    import pandas as pd
    from pathlib import Path
except ImportError as e:
    print(f"Import not found: {e}")

In [103]:
# Get data into DataFrame 
try: 
    print("Trying to get data....")
    DATA_PATH = '../../data/01_encoded_no_transformations/01_encoded_no_transformations.csv'
    df = pd.read_csv(DATA_PATH)
    print("Data is loaded and in df....")
except FileNotFoundError as e:
    # pd.read_csv raises FileNotFoundError (not LookupError) for a missing path
    print(f"Couldn't find {e}")

# Create Labels 
try:
    print("Trying to label data...")
    X = df.drop(columns=['fraud_reported'])
    y = df['fraud_reported']
    print("Dataframe is now labelled...")
    
    # Convert to Tensors
    print("Converting features and labels from dataframe to tensors..")
    X = torch.from_numpy(X.to_numpy()).type(torch.float)
    y = torch.from_numpy(y.to_numpy())
    print("Completed converting features and labels to tensors...")
    print(f"Shape of X : {X.shape}")
    print(f"Shape of y : {y.shape}") 
except Exception as e:
    print(f"Failed {e}")
    
# Splitting Data into train, test 
try:
    print("Converting into training and testing split...")
    import sklearn
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print("Finished converting into training and testing split...")
    print(f"Length X_train: {len(X_train)} | X_test : {len(X_test)} | y_train : {len(y_train)} | y_test : {len(y_test)} ")
    print("All data loaded")
except Exception as e:
    print(f"Failed {e}")

Trying to get data....
Data is loaded and in df....
Trying to label data...
Dataframe is now labelled...
Converting features and labels from dataframe to tensors..
Completed converting features and labels to tensors...
Shape of X : torch.Size([2964, 124])
Shape of y : torch.Size([2964])
Converting into training and testing split...
Finished converting into training and testing split...
Length X_train: 2371 | X_test : 593 | y_train : 2371 | y_test : 593 
All data loaded


In [129]:
import torch
from torch.utils.data import Dataset

# Subclass Dataset rather than nn.Module: this class only stores and indexes data
class CustomDataset(Dataset):
    def __init__(self, features: torch.Tensor, labels: torch.Tensor):
        super().__init__()
        self.features = features
        self.labels = labels 

    # Get length of data
    def __len__(self):
        features_len = len(self.features)
        label_len = len(self.labels)
        print(f"Length of features: {features_len} and labels: {label_len}")
        return features_len  # Ensure to return the length

    def __getitem__(self, idx): 
        try:
            feature = self.features[idx]
            label = self.labels[idx]
            print(f"Features shape: {feature.shape}")
            print(f"Label shape: {label.shape}")
            return feature, label
        except IndexError as e:
            print(f"This index is not in features: {e}")
            raise

# Example usage:
try:
    print("Creating Custom Dataset...")
    train_dataset = CustomDataset(features=X_train, labels=y_train)
    test_dataset = CustomDataset(features=X_test, labels=y_test)
    print("Created Custom Dataset...")

    print("Testing len function...")
    print(f"Train dataset length: {len(train_dataset)}")
    print(f"Test dataset length: {len(test_dataset)}")
    print("Finished Testing Length...")

    print("Testing get function for train ...")
    train_index_0_feature, train_index_0_label = train_dataset[0]  
    print("Testing get function finished train...")

    print("Testing get function for test ...")
    test_index_0_feature, test_index_0_label = test_dataset[0] 
    print("Testing get function Finished for test ...")   
except Exception as e:
    print(f"Error occurred in using CustomDataset class: {e}")


Creating Custom Dataset...
Created Custom Dataset...
Testing len function...
Length of features: 2371 and labels: 2371
Train dataset length: 2371
Length of features: 593 and labels: 593
Test dataset length: 593
Finished Testing Length...
Testing get function for train ...
Features shape: torch.Size([124])
Label shape: torch.Size([])
0.0
Testing get function finished train...
Testing get function for test ...
Features shape: torch.Size([124])
Label shape: torch.Size([])
Testing get function Finished for test ...


In [131]:
# Create DataLoader
try:
    print("Starting DataLoader for Test dataset")
    test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=True)
    print("Finished DataLoader for Test dataset")
    print("Starting DataLoader for Training dataset")
    train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    print("Finished DataLoader for Training dataset")
except Exception as e:
    print(f"Error occurred in DataLoader: {e}")

Starting DataLoader for Test dataset
Length of features: 593 and labels: 593
Length of features: 593 and labels: 593
Finished DataLoader for Test dataset
Starting DataLoader for Training dataset
Length of features: 2371 and labels: 2371
Length of features: 2371 and labels: 2371
Finished DataLoader for Training dataset

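A quick way to confirm what the loaders yield is to pull a single batch and inspect its shapes. The sketch below is self-contained, so it uses random stand-in tensors with the same shapes as `X_train` and `y_train` above and wraps them in `TensorDataset` (an assumption made for brevity; the notebook itself uses `CustomDataset`).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins with the same shapes as X_train and y_train
features = torch.randn(2371, 124)
labels = torch.randint(0, 2, (2371,))

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# Pull one batch and report its shapes plus the number of batches per epoch
batch_features, batch_labels = next(iter(loader))
print(batch_features.shape, batch_labels.shape, len(loader))
```

The same `next(iter(...))` check can be run on `train_dataloader` directly; with `CustomDataset` it will also trigger the debug prints inside `__getitem__` once per sample.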

# Load Teacher Model
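The outline describes the teacher as an FNN with ReLU activations that outputs logits. A minimal sketch of such a network is given below, assuming 124 input features (the shape of `X` above) and a single output logit for the binary `fraud_reported` label. The class name `TeacherFNN`, the helper `teacher_logits`, and the hidden layer sizes are hypothetical; the notebook does not specify the architecture or a checkpoint to load.

```python
import torch
from torch import nn


class TeacherFNN(nn.Module):
    """Feed-forward network with ReLU activations; returns raw logits."""

    def __init__(self, n_features: int, hidden_sizes=(64, 32)):
        super().__init__()
        layers = []
        in_size = n_features
        for hidden in hidden_sizes:
            layers += [nn.Linear(in_size, hidden), nn.ReLU()]
            in_size = hidden
        # No sigmoid at the end: the student is trained on logits.
        layers.append(nn.Linear(in_size, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


@torch.no_grad()
def teacher_logits(model: nn.Module, X: torch.Tensor) -> torch.Tensor:
    """Run the teacher in eval mode and return logits without gradients."""
    model.eval()
    return model(X)
```

Once the teacher has been trained, `teacher_logits(model, X_train)` would give the regression targets for the student.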

# Create and Bag Student Model - Decision Tree
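Bagging the student can be sketched as fitting several students on bootstrap resamples of the training set and averaging their predictions. The example below is a simplified stand-in: each student is a single `DecisionTreeRegressor` fitted on teacher logits, not the cyclic boosted model from the outline. The function names and default values are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_students(X, teacher_logits, n_bags=10, max_depth=3, seed=0):
    """Fit one tree per bootstrap resample of (X, teacher_logits)."""
    X = np.asarray(X, dtype=float)
    teacher_logits = np.asarray(teacher_logits, dtype=float)
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    students = []
    for _ in range(n_bags):
        # Sample row indices with replacement (a bootstrap resample).
        idx = rng.integers(0, n_samples, size=n_samples)
        student = DecisionTreeRegressor(max_depth=max_depth, random_state=seed)
        student.fit(X[idx], teacher_logits[idx])
        students.append(student)
    return students


def predict_bagged(students, X):
    """Average the predictions of all bagged students."""
    X = np.asarray(X, dtype=float)
    return np.mean([s.predict(X) for s in students], axis=0)
```

Averaging over resamples reduces the variance of the individual trees; the spread of the per-student predictions can also serve as a rough uncertainty band.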

# Use gSHAP to create feature importance graphs