# Highly Adaptive Models Comparison Project

## Overview

This project empirically compares two statistical models, the Highly Adaptive Lasso (HAL) and the Highly Adaptive Ridge (HAR), on high-dimensional data. Both are regularized regression methods: each adds a penalty to the loss function to prevent overfitting (an L1 penalty for HAL, an L2 penalty for HAR). The comparison focuses on computational efficiency, prediction accuracy, and the effectiveness of regularization under various conditions.
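
For reference, both estimators can be written as penalized least-squares problems over the same basis expansion. The notation below ($\phi_j$ for the basis functions, $\lambda$ for the penalty weight) is assumed here for illustration, and the exact scaling used in this project's code may differ.

$$
\hat{\beta}^{\mathrm{HAL}} = \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j}\beta_j\,\phi_j(x_i)\Big)^2 + \lambda \sum_{j}\lvert\beta_j\rvert
$$

$$
\hat{\beta}^{\mathrm{HAR}} = \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j}\beta_j\,\phi_j(x_i)\Big)^2 + \lambda \sum_{j}\beta_j^{2}
$$

The two objectives differ only in the penalty. The L1 term tends to set many coefficients exactly to zero, while the squared L2 term shrinks all coefficients without zeroing them.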

## Goals

1. **Empirical Comparison**: Conduct a thorough empirical comparison between HAL and HAR models to assess their performance on high-dimensional datasets.
2. **Computation Time**: Evaluate and compare the computation times required by each model to fit the data, providing insights into their efficiency.
3. **Prediction Accuracy**: Use metrics such as Mean Squared Error (MSE) to compare the prediction accuracy of the models across different dataset sizes and conditions.
4. **Regularization Effectiveness**: Examine how cross-validation selects the regularization parameter in each model, and in particular whether it implicitly controls the L1 norm of the HAR coefficients even though HAR penalizes the L2 norm explicitly.
5. **Scalability**: Assess how each model scales with increasing data dimensions, offering insights into their applicability to real-world, high-dimensional datasets.
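
Goals 2 and 3 come down to two measurements per fit: wall-clock training time and test-set MSE. A minimal sketch of both follows. The helper names `mean_squared_error` and `timed_fit` are hypothetical and are not taken from this project's `run_trials` module.

```python
import time
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def timed_fit(fit_fn, *args, **kwargs):
    """Call fit_fn and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage: any callable can stand in for a model's fit routine.
total, seconds = timed_fit(sum, [1, 2, 3])
error = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic, high-resolution clock intended for measuring short durations.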

## Methodology

The project utilizes a simulation-based approach to generate synthetic datasets with controllable features such as the number of samples, number of features, and the level of noise. The simulation involves:

- Generating datasets using the DGP classes in `data_generators.py`.
- Fitting both HAL and HAR models to these datasets.
- Evaluating model performance using cross-validation and calculating MSE on a test set.
- Repeating the process for various dataset sizes to gather comprehensive performance data.
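
The steps above can be sketched end to end for the one-dimensional case. This is an illustration under stated assumptions, not the project's implementation: it builds a zero-order indicator basis with one knot per training point, fits it with closed-form ridge regression at a fixed penalty (no cross-validation), and uses a made-up sinusoidal target. The functions `hal_basis_1d`, `fit_ridge`, and `run_trial` are hypothetical names.

```python
import time
import numpy as np

def hal_basis_1d(x, knots):
    """Indicator basis: entry (i, j) is 1.0 if x[i] >= knots[j], else 0.0."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    knots = np.asarray(knots, dtype=float).reshape(1, -1)
    return (x >= knots).astype(float)

def fit_ridge(H, y, lam):
    """Closed-form ridge fit; the intercept is left unpenalized via centering."""
    H_mean, y_mean = H.mean(axis=0), y.mean()
    Hc = H - H_mean
    gram = Hc.T @ Hc + lam * np.eye(H.shape[1])
    beta = np.linalg.solve(gram, Hc.T @ (y - y_mean))
    return y_mean - H_mean @ beta, beta

def run_trial(n, rng, noise_sd=0.1, lam=1.0):
    """Generate data, fit, and return (training seconds, test MSE)."""
    def truth(x):
        # Assumed target function, chosen only for this sketch.
        return np.sin(2 * np.pi * x)

    x_train, x_test = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
    y_train = truth(x_train) + rng.normal(0, noise_sd, n)
    y_test = truth(x_test) + rng.normal(0, noise_sd, n)

    start = time.perf_counter()
    intercept, beta = fit_ridge(hal_basis_1d(x_train, x_train), y_train, lam)
    elapsed = time.perf_counter() - start

    y_hat = intercept + hal_basis_1d(x_test, x_train) @ beta
    return elapsed, float(np.mean((y_test - y_hat) ** 2))

# Repeat over several sample sizes and trials, collecting one record per fit.
rng = np.random.default_rng(0)
records = []
for n in (100, 200, 400):
    for _ in range(3):
        elapsed, mse = run_trial(n, rng)
        records.append({"Sample Size": n, "Method": "ridge-sketch",
                        "training_time": elapsed, "MSE": mse})
```

Each record uses the column names that the plotting cells in this notebook group on (`Sample Size`, `Method`, `training_time`, `MSE`), so a list like `records` can be passed straight to `pd.DataFrame`.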

## Results Analysis

The simulation results will be analyzed and visualized to compare the computation time and MSE of HAL and HAR models. This analysis aims to provide a clear understanding of each model's strengths and limitations, particularly regarding their efficiency and accuracy in handling high-dimensional data.
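
The analysis relies on collapsing per-trial records into a mean and standard deviation for each sample size and method. A small sketch of that aggregation follows. The numbers are made up purely to show the table shape and are not results from this project.

```python
import pandas as pd

# Illustrative per-trial records (made-up values, not project results).
records = [
    {"Sample Size": 100, "Method": "HAL", "training_time": 1.0, "MSE": 0.20},
    {"Sample Size": 100, "Method": "HAL", "training_time": 3.0, "MSE": 0.40},
    {"Sample Size": 100, "Method": "HAR", "training_time": 0.5, "MSE": 0.30},
    {"Sample Size": 100, "Method": "HAR", "training_time": 1.5, "MSE": 0.10},
]
df = pd.DataFrame(records)

# One row per (sample size, method), with summary statistics across trials.
summary = df.groupby(["Sample Size", "Method"]).agg(
    mean_training_time=("training_time", "mean"),
    std_training_time=("training_time", "std"),
    mean_mse=("MSE", "mean"),
).reset_index()
```

Note that pandas computes `std` with `ddof=1` (the sample standard deviation), so a group with a single trial yields `NaN` for that column.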



In [2]:
# # Performance metric R^2 as defined in HAL paper 
# def calculate_r_squared(Y, Y_hat):
#     ss_res = np.sum((Y - Y_hat) ** 2)
#     ss_tot = np.sum((Y - np.mean(Y)) ** 2)
#     r_squared = 1 - (ss_res / ss_tot)
#     return r_squared

In [1]:
import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

In [4]:
## BASIC PLOTTING TEST

import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")
# Dimensionality d: the full runs use 1, 3, and 5; this quick test uses d = 1
d = 1

# Sample sizes at regular intervals: 100, 200, ..., 900 (stop is exclusive)
sample_sizes = np.arange(start=100, stop=1000, step=100)

# Number of trials to run for each sample size, dgp, model
num_trials = 3

# Create a data generator
dgp = SmoothDataGenerator()

results = RunTrials.run_trials(d, sample_sizes, num_trials, dgp)

# Convert results to DataFrame
df = pd.DataFrame(results)

KeyboardInterrupt: 

In [3]:
from train_time_plotter import TrainTimePlotter

# Plot the training time
TrainTimePlotter.plot(df, d, dgp.name)

In [4]:
import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings
from train_time_plotter import TrainTimePlotter

# Suppress warnings
warnings.filterwarnings("ignore")

In [13]:
## SMALL TRAINING RUN

# d_sizes = [1, 3, 5]
# num_trials = 3
# sample_sizes = np.arange(start=100, stop=1000, step=100)
# data_generators = [SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator]
# data_frames = []
# all_plots = []

# # Run trials for all combinations of d sizes, sample sizes, and data generators
# for d in d_sizes:
#     for dgp in data_generators:
#         results = RunTrials.run_trials(d, sample_sizes, num_trials, dgp)
#         df = pd.DataFrame(results)
#         all_plots.append(TrainTimePlotter.plot(df, d, dgp.name))
#         display(TrainTimePlotter.plot(df, d, dgp.name))
#         data_frames.append(df)
        

# Main training run for 3x3 grid of plots 

## For 100-2000 samples

In [5]:
## Run trials for all combinations of d sizes, sample sizes, and data generators

# IMPORTS REQUIRED FOR TRAINING, PLOTTING, AND SAVING DATAFRAMES
import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings
from train_time_plotter import TrainTimePlotter
import os
import pickle

# Suppress warnings
warnings.filterwarnings("ignore")

## Setup: Define the parameters for the experiment

ALL_DF_FILE_NAMES = "Training_df_files/all_file_names.pickle"

d_sizes = [1, 3, 5]     # Dimensionality of the data (same as HAL paper)
num_trials = 5          # Number of trials to run for each sample size, dgp, model
sample_sizes = np.arange(start=100, stop=2100, step=100) # Sample sizes to test: 100, 200, ..., 2000
data_generators = [SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator]
data_frames = [] # List to store all DataFrames for each combination of d, sample size, and data generator
all_plots = [] # List to store all plots (might as well!)
all_file_names = [] # List to store all file names for the saved DataFrames

# Run trials for all combinations of d sizes, sample sizes, and data generators
for d in d_sizes:
    for dgp in data_generators:

        # Run trials for the current combination of d, sample size, and data generator
        results = RunTrials.run_trials(d, sample_sizes, num_trials, dgp)

        # Convert results to DataFrame and append to the list of all DataFrames
        df = pd.DataFrame(results)
        data_frames.append(df)

        # Generate a descriptive file name, save to pickle format, and append to the list of all file names
        file_name = f"dataframe_d{d}_dgp_{dgp.name}.pickle"
        df.to_pickle(file_name)
        all_file_names.append(file_name)
        print(f"Saved DataFrame to {file_name}")

        # Append the plot to the list of all plots, and display the plot
        plot = TrainTimePlotter.plot(df, d, dgp.name)
        all_plots.append(plot)
        display(plot)
        
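# Save the file names to a pickle file (create the output directory if needed)
os.makedirs(os.path.dirname(ALL_DF_FILE_NAMES), exist_ok=True)
with open(ALL_DF_FILE_NAMES, "wb") as f:
    pickle.dump(all_file_names, f)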

Saved DataFrame to dataframe_d1_dgp_Smooth.pickle


Saved DataFrame to dataframe_d1_dgp_Jump.pickle


Saved DataFrame to dataframe_d1_dgp_Sinusoidal.pickle


Saved DataFrame to dataframe_d3_dgp_Smooth.pickle


Saved DataFrame to dataframe_d3_dgp_Jump.pickle


Saved DataFrame to dataframe_d3_dgp_Sinusoidal.pickle


Saved DataFrame to dataframe_d5_dgp_Smooth.pickle


Saved DataFrame to dataframe_d5_dgp_Jump.pickle


Saved DataFrame to dataframe_d5_dgp_Sinusoidal.pickle


# Plotting Train time results for N = 100-2000


In [None]:
## GRID OF PLOTS 1: TRAIN TIME VS SAMPLE SIZE FOR ALL DGPS AND D SIZES
import pickle  # needed below when loading saved DataFrames in a fresh kernel
import pandas as pd
import altair as alt

ALL_DF_FILE_NAMES = "Training_df_files/all_file_names.pickle"

reshaped_data = []
d_sizes = [1, 3, 5] 
dgp_types = ['Smooth', 'Jump', 'Sinusoidal']  # The order of DGP types for each 'd'

# --------------------------------------------------------------------------------- # 
# if data_frames is not defined (new kernel), load file names from ALL_DF_FILE_NAMES
# --------------------------------------------------------------------------------- # 
if 'data_frames' not in locals():
    # Load the file names from the pickle file
    with open(ALL_DF_FILE_NAMES, "rb") as f:
        all_file_names = pickle.load(f)
    # Load the dataframes from the pickle files
    data_frames = [pd.read_pickle(file_name) for file_name in all_file_names]
# --------------------------------------------------------------------------------- # 

## RESHAPING DATAFRAMES FOR PLOTTING
# Assumes data_frames holds 9 DataFrames ordered by d (1, 3, 5), then by DGP (Smooth, Jump, Sinusoidal) within each d
for i, data in enumerate(data_frames):
    # Calculate the index for d_sizes and dgp_types
    d_index = i // len(dgp_types)
    dgp_index = i % len(dgp_types)

    # Aggregate results by sample size and method
    aggregated_df = data.groupby(['Sample Size', 'Method']).agg(
        mean_training_time=pd.NamedAgg(column='training_time', aggfunc='mean'),
        std_training_time=pd.NamedAgg(column='training_time', aggfunc='std')
    ).reset_index()

    # Add 'd' and 'dgp' columns
    aggregated_df['d'] = d_sizes[d_index]
    aggregated_df['dgp_type'] = dgp_types[dgp_index]

    reshaped_data.append(aggregated_df)

# Function to create a line plot from an aggregated DataFrame
def create_plot_from_df(df, d, dgp_type):
    """Generate a line plot from an aggregated DataFrame."""
    line_chart = alt.Chart(df).mark_line(point=True).encode(
        x='Sample Size:Q',
        y=alt.Y('mean_training_time:Q', title=f"Mean Training Time (s), d={d}"),
        color='Method:N',
        tooltip=['Sample Size', 'Method', 'mean_training_time', 'std_training_time']
    ).properties(
        title=f"DGP: {dgp_type}",
        width=400,
        height=200
    )
    return line_chart

# Function to arrange plots in a grid (the number of rows follows from num_cols)
def arrange_plots_in_grid(plots, num_cols=3):
    # Create rows of charts
    rows = [alt.hconcat(*plots[i:i+num_cols]) for i in range(0, len(plots), num_cols)]
    # Combine rows into a single chart
    grid = alt.vconcat(*rows)
    return grid

# Generate all individual line plots, now using the 'd' and 'dgp_type' directly from the dataframes
line_plots = [create_plot_from_df(df, df['d'].iloc[0], df['dgp_type'].iloc[0]) for df in reshaped_data]

# Arrange the line plots into a 3x3 grid
line_grid_chart = arrange_plots_in_grid(line_plots, num_cols=3)

# Display the grid chart
line_grid_chart.display()

In [16]:
import pandas as pd
import altair as alt
import pickle

ALL_DF_FILE_NAMES = "Training_df_files/all_file_names.pickle"  # same path the training cell writes to

reshaped_data = []
d_sizes = [1, 3, 5]
dgp_types = ['Smooth', 'Jump', 'Sinusoidal']

if 'data_frames' not in locals():
    with open(ALL_DF_FILE_NAMES, "rb") as f:
        all_file_names = pickle.load(f)
    data_frames = [pd.read_pickle(file_name) for file_name in all_file_names]

for i, data in enumerate(data_frames):
    d_index = i // len(dgp_types)
    dgp_index = i % len(dgp_types)
    aggregated_df = data.groupby(['Sample Size', 'Method']).agg(
        mean_training_time=pd.NamedAgg(column='training_time', aggfunc='mean'),
        std_training_time=pd.NamedAgg(column='training_time', aggfunc='std')
    ).reset_index()
    aggregated_df['d'] = d_sizes[d_index]
    aggregated_df['dgp_type'] = dgp_types[dgp_index]
    reshaped_data.append(aggregated_df)

def create_plot_from_df(df, d, dgp_type):
    line_chart = alt.Chart(df).mark_line(point=True).encode(
        x='Sample Size:Q',
        y=alt.Y('mean_training_time:Q', title=f"Mean Training Time (s), d={d}"),
        color='Method:N',
        tooltip=['Sample Size', 'Method', 'mean_training_time', 'std_training_time']
    ).properties(
        title=f"DGP: {dgp_type}",
        width=400,
        height=200
    )
    return line_chart

def arrange_plots_in_grid(plots, num_cols=3):
    rows = [alt.hconcat(*plots[i:i+num_cols]) for i in range(0, len(plots), num_cols)]
    grid = alt.vconcat(*rows).configure_title(
        color='white'
    ).configure_axis(
        gridColor='white',
        titleColor='white',
        labelColor='white',
        domainColor='white',
        tickColor='white',
        gridWidth=0.5
    ).configure_legend(
        labelColor='white',
        titleColor='white',
        labelFontSize=14,
        titleFontSize=16
    ).properties(
        background='black'
    )
    return grid

line_plots = [create_plot_from_df(df, df['d'].iloc[0], df['dgp_type'].iloc[0]) for df in reshaped_data]
line_grid_chart = arrange_plots_in_grid(line_plots, num_cols=3)
line_grid_chart.display()


# Plotting MSE results for N = 100-2000

In [24]:
import pandas as pd
import altair as alt

# Assumes data_frames holds 9 DataFrames ordered by d (1, 3, 5), then by DGP (Smooth, Jump, Sinusoidal) within each d
reshaped_data_mse = []
for i, data in enumerate(data_frames):
    # Calculate the index for d_sizes and dgp_types
    d_index = i // len(dgp_types)
    dgp_index = i % len(dgp_types)

    # Aggregate results by sample size and method for MSE
    aggregated_df_mse = data.groupby(['Sample Size', 'Method']).agg(
        mean_mse=pd.NamedAgg(column='MSE', aggfunc='mean'),
        std_mse=pd.NamedAgg(column='MSE', aggfunc='std')
    ).reset_index()

    # Add 'd' and 'dgp' columns
    aggregated_df_mse['d'] = d_sizes[d_index]
    aggregated_df_mse['dgp_type'] = dgp_types[dgp_index]

    reshaped_data_mse.append(aggregated_df_mse)

def create_mse_plot_from_df(df, d, dgp_type):
    """Generate a line plot from an aggregated DataFrame for MSE."""
    mse_chart = alt.Chart(df).mark_line(point=True).encode(
        x='Sample Size:Q',
        y=alt.Y('mean_mse:Q', title=f"Mean MSE, d={d}"),
        color='Method:N',
        tooltip=['Sample Size', 'Method', 'mean_mse', 'std_mse']
    ).properties(
        title=f"DGP: {dgp_type}",
        width=400,
        height=200
    )
    return mse_chart

def arrange_plots_in_grid(plots, num_cols=3):
    rows = [alt.hconcat(*plots[i:i+num_cols]) for i in range(0, len(plots), num_cols)]
    grid = alt.vconcat(*rows).configure_title(
        color='white'
    ).configure_axis(
        gridColor='white',
        titleColor='white',
        labelColor='white',
        domainColor='white',
        tickColor='white',
        gridWidth=0.5
    ).configure_legend(
        labelColor='white',
        titleColor='white',
        labelFontSize=14,
        titleFontSize=16
    ).properties(
        background='black'
    )
    return grid

# Generate all individual MSE line plots
mse_line_plots = [create_mse_plot_from_df(df, df['d'].iloc[0], df['dgp_type'].iloc[0]) for df in reshaped_data_mse]

# Arrange the MSE line plots into a 3x3 grid
mse_line_grid_chart = arrange_plots_in_grid(mse_line_plots, num_cols=3)

# Display the grid chart for MSE
mse_line_grid_chart.display()


# Repeat training for 2000-10000 samples

In [26]:
## Run trials for all combinations of d sizes, sample sizes, and data generators

# IMPORTS REQUIRED FOR TRAINING, PLOTTING, AND SAVING DATAFRAMES
import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings
from train_time_plotter import TrainTimePlotter
import os
import pickle

# Suppress warnings
warnings.filterwarnings("ignore")

## Setup: Define the parameters for the experiment

ALL_DF_FILE_NAMES_LARGE_N = "Training_df_files/all_file_names_large_n.pickle"

d_sizes = [1, 3, 5]     # Dimensionality of the data (same as HAL paper)
num_trials = 5          # Number of trials to run for each sample size, dgp, model
sample_sizes = np.arange(start=2000, stop=10100, step=500) # Sample sizes to test: 2000, 2500, ..., 10000
data_generators = [SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator]
data_frames = [] # List to store all DataFrames for each combination of d, sample size, and data generator
all_plots = [] # List to store all plots (might as well!)
all_file_names_large_n = [] # List to store all file names for the saved DataFrames

# Run trials for all combinations of d sizes, sample sizes, and data generators
for d in d_sizes:
    for dgp in data_generators:

        # Run trials for the current combination of d, sample size, and data generator
        results = RunTrials.run_trials(d, sample_sizes, num_trials, dgp)

        # Convert results to DataFrame and append to the list of all DataFrames
        df = pd.DataFrame(results)
        data_frames.append(df)

        # Generate a descriptive file name, save to pickle format, and append to the list of all file names
        file_name = f"large_n_df{d}_{dgp.name}_dgp.pickle"
        df.to_pickle(file_name)
        all_file_names_large_n.append(file_name)
        print(f"Saved DataFrame to {file_name}")

        # Append the plot to the list of all plots, and display the plot
        plot = TrainTimePlotter.plot(df, d, dgp.name)
        all_plots.append(plot)
        display(plot)
        
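# Save the file names to a pickle file (create the output directory if needed)
os.makedirs(os.path.dirname(ALL_DF_FILE_NAMES_LARGE_N), exist_ok=True)
with open(ALL_DF_FILE_NAMES_LARGE_N, "wb") as f:
    pickle.dump(all_file_names_large_n, f)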

Saved DataFrame to large_n_df1_Smooth_dgp.pickle


Saved DataFrame to large_n_df1_Jump_dgp.pickle


KeyboardInterrupt: 

In [None]:
import numpy as np
from data_generators import DataGenerator, SmoothDataGenerator, JumpDataGenerator, SinusoidalDataGenerator
import pandas as pd 
from run_trials import RunTrials
import warnings
from train_time_plotter import TrainTimePlotter
from kernel_HAR import kernel_HAR

