# Exploring the Configs in TabRepo

This notebook explores the configs used in TabRepo. The goal is to find parameters that can simulate, emulate, or approximate the following hardware-aware metrics:

- FLOPs (~ Inference Time)
- Latency (~ Inference Time)
- Energy Consumption
- Memory Footprint
- Area


## Additional Notes to each Metric

### FLOPs
--

### Latency
This is already mostly covered by TabRepo's measured inference times. 
It could be improved by comparing them with actual inference times on CPU vs. GPU. 
I would need different hardware to create meaningful data. 

> Is there research I could use to find good estimates? 
>
> What hardware was used for TabRepo?

Further, there are apparently simulators or hardware emulators that I could use to measure inference times on specific hardware. 
At the moment I do not know how good the resulting data would be.
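
As a local baseline for the CPU comparison mentioned above, wall-clock latency could be measured roughly as follows. This is a minimal sketch: `measure_latency` and `dummy_predict` are hypothetical names, the dummy predictor only stands in for a real model's `predict`, and the warm-up and repeat counts are arbitrary assumptions.

```python
import statistics
import time

import numpy as np


def measure_latency(predict_fn, X, n_warmup=3, n_repeats=20):
    """Time predict_fn(X) repeatedly and return summary statistics in seconds."""
    # Warm-up runs so one-off costs (caches, lazy init) do not skew the timings
    for _ in range(n_warmup):
        predict_fn(X)

    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        predict_fn(X)
        timings.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


# Hypothetical stand-in for a trained model's predict method
rng = np.random.default_rng(0)
weights = rng.random(10)


def dummy_predict(X):
    return (X @ weights > 2.5).astype(int)


X = rng.random((1000, 10))
latency = measure_latency(dummy_predict, X)
print(latency)
```

Reporting the median rather than the mean keeps a single slow run (for example due to background load) from dominating the result.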

### Energy Consumption
One option here is to estimate the energy consumption of a model based on existing profiling tools. 
I am not sure how this would work and what tools there are, so this would require a lot of research.

If I have enough data for a couple of representative models, I could fit an energy model through regression. 
With this model I could estimate the other models' energy consumption. 
The inference and training times from TabRepo could serve as inputs to the model.
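
The regression idea above could look roughly like the sketch below. It assumes a linear relationship between energy and the two timing features, which is only a working hypothesis. All numbers are synthetic placeholders, not measurements, and the variable names `time_train_s`, `time_infer_s`, and `energy_j` are chosen for illustration.

```python
import numpy as np

# Synthetic placeholder data for a few "representative" models (NOT real measurements)
rng = np.random.default_rng(0)
n_models = 8
time_train_s = rng.uniform(1.0, 100.0, size=n_models)
time_infer_s = rng.uniform(0.01, 2.0, size=n_models)

# Made-up "measured" energy in joules, generated from an arbitrary linear rule
energy_j = 30.0 * time_train_s + 50.0 * time_infer_s + 5.0

# Design matrix: two timing features plus an intercept column
A = np.column_stack([time_train_s, time_infer_s, np.ones(n_models)])

# Ordinary least squares fit of the energy model
coef, *_ = np.linalg.lstsq(A, energy_j, rcond=None)


def predict_energy(train_s, infer_s):
    """Estimate energy (joules) for a model that was not profiled."""
    return coef[0] * train_s + coef[1] * infer_s + coef[2]


estimate = predict_energy(40.0, 0.5)
print(f"Fitted coefficients: {coef}")
print(f"Estimated energy: {estimate:.1f} J")
```

With real profiling data the fit will not be exact, so the residuals on held-out models would show whether a linear model is adequate.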

### Memory Footprint
--

### Area
--

## Exploring Configs

In [None]:
import json
import pandas as pd

# Load the JSON configuration files
config_path = "../extern/tabrepo/data/configs/"
files = {
    "xt": "configs_xt.json",
    "xgboost": "configs_xgboost.json",
    "tabpfn": "configs_tabpfn.json",
    "rf": "configs_rf.json",
    "nn_torch": "configs_nn_torch.json",
    "lr": "configs_lr.json",
    "lightgbm": "configs_lightgbm.json",
    "knn": "configs_knn.json",
    "ftt": "configs_ftt.json",
    "fastai": "configs_fastai.json",
    "catboost": "configs_catboost.json"
}

# Function to load JSON data
def load_json(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)

# Load all data
data = {name_id: load_json(config_path + filename) for name_id, filename in files.items()}

# Function to convert JSON data to DataFrame
def json_to_df(data, model_name):
    records = []
    for config_name, config_data in data.items():
        record = {"model": model_name, "config": config_name}
        record.update(config_data["hyperparameters"])
        records.append(record)
    return pd.DataFrame(records)


# Convert all JSON data to DataFrames
dfs = [json_to_df(config_data, model_name) for model_name, config_data in data.items()]
for i, df in enumerate(dfs):
    print(f"Shape of ({i}): {df.shape}")
    print(f"\tColumns: {list(df.columns)}")
    # Calculate total entries
    total_entries = df.shape[0] * df.shape[1]

    # Count non-NaN values
    non_nan_count = df.count().sum()

    # Count NaN values
    nan_count = df.isna().sum().sum()

    # Calculate sparsity metrics
    percent_non_nan = (non_nan_count / total_entries) * 100
    percent_nan = (nan_count / total_entries) * 100

    print(f'\tTotal entries: {total_entries}')
    print(f'\tNon-NaN entries: {non_nan_count} ({percent_non_nan:.2f}%)')
    print(f'\tNaN entries: {nan_count} ({percent_nan:.2f}%)')

Looking for NaNs in the data...

In [None]:
import numpy as np

def find_nans(df: pd.DataFrame):
    for column in df.columns:
        if column != 'config':  # Skip the 'config' identifier column
            # Find rows where the current column is NaN
            nan_rows = df[column].isna()

            # Print the column name, row index, and config name for each NaN
            for index, is_nan in nan_rows.items():
                if is_nan:
                    print(f'Column: {column}, Row Index: {index}, Config: {df.loc[index, "config"]}')

In [None]:
for i, df in enumerate(dfs):
    print(f"Finding NaNs for ({i})")
    find_nans(df)
    print()

In [None]:
from tabrepo import load_repository
repo = load_repository("D244_F3_C1530_100", cache=True)
metrics = repo.metrics(datasets=repo.datasets(), configs=repo.configs())
metrics.info()

In [None]:
dataset = "Australian"
model = "XGBoost_r97_BAG_L1"
fold = 0
metrics.loc[(dataset, fold, model)]

NaN values usually appear in the first few rows of each dataframe. 
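
The observation about where the NaNs sit could be checked without the per-cell loop. The sketch below runs on a hypothetical toy frame with the same `model` and `config` columns as above; the hyperparameter columns `depth` and `lr` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking one of the per-model config DataFrames
df = pd.DataFrame({
    "model": ["m", "m", "m"],
    "config": ["c1", "c2", "c3"],
    "depth": [np.nan, 4.0, 6.0],
    "lr": [np.nan, np.nan, 0.1],
})

# Boolean NaN mask over the hyperparameter columns only
mask = df.drop(columns=["model", "config"]).isna()

# Row positions and column positions of every NaN cell
rows, cols = np.nonzero(mask.to_numpy())
nan_locations = pd.DataFrame({
    "row": rows,
    "config": df["config"].to_numpy()[rows],
    "column": mask.columns.to_numpy()[cols],
})

# Number of NaNs per row, to see whether they cluster in the first rows
nan_per_row = mask.sum(axis=1)

print(nan_locations)
print(nan_per_row)
```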

## Instancing Models from Configs

### V1

In [None]:
import os
import numpy as np
import pandas as pd
import tracemalloc
from catboost import CatBoostClassifier

# Create the directory if it doesn't exist
model_dir = "../models"
os.makedirs(model_dir, exist_ok=True)

# Generate random data
num_samples = 1000
num_features = 10
X_random = np.random.random((num_samples, num_features))
y_random = np.random.randint(2, size=num_samples)

train_data = pd.DataFrame(X_random, columns=[f'feature_{i}' for i in range(num_features)])
train_data['target'] = y_random

# Initialize tracemalloc to measure memory usage
tracemalloc.start()

# Initialize and train the CatBoost model with minimal output
model = CatBoostClassifier(
    depth=8,
    grow_policy='Depthwise',
    l2_leaf_reg=3.860757465489678,
    learning_rate=0.030421683021409185,
    max_ctr_complexity=4,
    one_hot_max_size=10,
    verbose=0  # Reduce output
)

model.fit(train_data.drop(columns='target'), train_data['target'])

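# Measure memory usage after training.
# Note: tracemalloc only sees allocations made through Python's allocators,
# so memory allocated inside native code may be missing from this number.
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()
top_stats = snapshot.statistics('lineno')
memory_usage = sum(stat.size for stat in top_stats)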

# Save the model to the specified directory
model_path = os.path.join(model_dir, 'random_model.cbm')
model.save_model(model_path)

# Measure the disk usage of the saved model
disk_usage = os.path.getsize(model_path)

# Output the memory and disk usage
print(f"Memory used by the model: {memory_usage} bytes")
print(f"Disk space used by the model: {disk_usage} bytes")
print(f"Model saved at: {model_path}")
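
A snapshot taken after training only shows what is still allocated at that moment. If the peak during training is of interest, `tracemalloc.get_traced_memory()` returns it directly. The sketch below wraps this in a hypothetical helper, `traced_peak_bytes`, and uses a plain `bytearray` as a stand-in workload. As an assumption to keep in mind, allocations made by native code outside Python's allocators (likely a large share of what CatBoost allocates) would not show up here either.

```python
import tracemalloc


def traced_peak_bytes(fn, *args, **kwargs):
    """Run fn and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        # Stop tracing even if fn raises, so later measurements start clean
        tracemalloc.stop()
    return result, peak


def make_buffer(n_bytes):
    # Stand-in workload: allocates n_bytes through Python's allocator
    return bytearray(n_bytes)


buffer, peak = traced_peak_bytes(make_buffer, 1_000_000)
print(f"Peak traced allocation: {peak} bytes")
```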


### V2

This has been moved to a containerized version for reduced variance.

In [None]:
import json
import os
import numpy as np
import pandas as pd
import tracemalloc
from autogluon.tabular import TabularPredictor
from tqdm.notebook import tqdm

# Load the configurations from the JSON files
config_path = "../extern/tabrepo/data/configs/"
files = {
    "xt": "configs_xt.json",
    "xgboost": "configs_xgboost.json",
    "tabpfn": "configs_tabpfn.json",
    "rf": "configs_rf.json",
    "nn_torch": "configs_nn_torch.json",
    "lr": "configs_lr.json",
    "lightgbm": "configs_lightgbm.json",
    "knn": "configs_knn.json",
    "ftt": "configs_ftt.json",
    "fastai": "configs_fastai.json",
    "catboost": "configs_catboost.json"
}

# Combine configurations into a single dictionary
raw_hyperparameters = {}
for filename in files.values():
    with open(os.path.join(config_path, filename), 'r') as file:
        raw_hyperparameters.update(json.load(file))

# Adjust the configurations using the provided adjustment code logic
adjusted_hyperparameters = {}
configs_hps = raw_hyperparameters.copy()  # Assuming repo._zeroshot_context.configs_hyperparameters equivalent
portfolio_configs = list(raw_hyperparameters.keys())  # Assuming portfolio_configs is a list of all config names

for _config_prio, config in enumerate(portfolio_configs):
    tabrepo_config_name = config.replace("_BAG_L1", "")
    new_config = configs_hps[tabrepo_config_name].copy()
    model_type = new_config.pop("model_type")
    # Copy the nested dicts so adding ag_args does not mutate raw_hyperparameters
    new_config = dict(new_config["hyperparameters"])

    if model_type not in adjusted_hyperparameters:
        adjusted_hyperparameters[model_type] = []
    new_config["ag_args"] = dict(new_config.get("ag_args", {}))
    new_config["ag_args"]["priority"] = -_config_prio
    adjusted_hyperparameters[model_type].append(new_config)

# Create dummy data
num_samples = 100
num_features = 10
X_dummy = pd.DataFrame(np.random.random((num_samples, num_features)), columns=[f'feature_{i}' for i in range(num_features)])
X_dummy['target'] = np.random.randint(2, size=num_samples)

# Prepare a list to collect results
results = []

# Measure memory and disk usage for each model type
for model_name, configs in tqdm(adjusted_hyperparameters.items(), desc="Model Types", leave=False):
    for i, config in enumerate(tqdm(configs, desc=f"{model_name} Configurations", leave=False)):
        try:
            tracemalloc.start()
            
            # Initialize and train the model on dummy data
            predictor = TabularPredictor(label='target', problem_type='binary', verbosity=0)
            predictor.fit(
                train_data=X_dummy,
                hyperparameters={model_name: config},
                time_limit=60,  # short time limit for fitting on dummy data
                verbosity=0,
            )
            
            # Measure memory usage
            snapshot = tracemalloc.take_snapshot()
            top_stats = snapshot.statistics('lineno')
            memory_usage = sum(stat.size for stat in top_stats)

            # Save the model
            predictor.save()
            
            # Measure disk usage
            predictor_file = os.path.join(predictor.path, "predictor.pkl")
            learner_file = os.path.join(predictor.path, "learner.pkl")
            model_files = []
            # Use distinct names so the `files` config dict above is not shadowed
            for root, _dirs, filenames in os.walk(predictor.path):
                for filename in filenames:
                    if filename == "model.pkl":
                        model_files.append(os.path.join(root, filename))
                        
            predictor_size = os.path.getsize(predictor_file)
            learner_size = os.path.getsize(learner_file)
            models_size = sum(os.path.getsize(f) for f in model_files)
            total_deployed_size = predictor_size + learner_size + models_size

            # Append the results to the list
            results.append({
                "Model": f"{model_name}_{i}",
                "Memory used (bytes)": memory_usage,
                "Predictor size (bytes)": predictor_size,
                "Learner size (bytes)": learner_size,
                "Models size (bytes)": models_size,
                "Total deployment size (bytes)": total_deployed_size
            })
            
        except KeyError as e:
            print(f"Error with model {model_name}_{i}: {e}")
        except Exception as e:
            print(f"General error with model {model_name}_{i}: {e}")
        finally:
            # Always stop tracing so a failed fit does not leak into the next measurement
            tracemalloc.stop()

# Convert the results list to a DataFrame
df_results = pd.DataFrame(results)

# Save the DataFrame to a CSV file
df_results.to_csv('model_memory_and_disk_usage.csv', index=False)

# Output the first few rows for debugging purposes
print(df_results.head())
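
The loop above counts only `predictor.pkl`, `learner.pkl`, and the `model.pkl` files. As a cross-check, the size of the whole predictor directory could be summed instead. The sketch below uses a hypothetical helper, `directory_size_bytes`, and a temporary directory with made-up file contents in place of a real AutoGluon output folder.

```python
import tempfile
from pathlib import Path


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (recursively)."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


# Hypothetical directory layout standing in for a saved predictor
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "Dummy").mkdir(parents=True)
    (root / "predictor.pkl").write_bytes(b"x" * 100)
    (root / "models" / "Dummy" / "model.pkl").write_bytes(b"y" * 250)

    total = directory_size_bytes(root)

print(f"Total directory size: {total} bytes")
```

A gap between this total and the sum of the three pickle categories would point to extra artifacts that a deployment might also need.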


## cPickle

Inspecting the pkl files AutoGluon creates.

In [None]:
import pickle
file_path = './AutogluonModels/ag-20240812_092137/models/ExtraTrees_c1/model.pkl'
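# Only unpickle files you created yourself: pickle.load can execute arbitrary code
with open(file_path, 'rb') as file:
    model_obj = pickle.load(file)  # separate name so the config `data` dict is not overwritten
print(model_obj)
dir(model_obj)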
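
To see which attributes of an unpickled object dominate its size, each attribute could be pickled on its own and measured. This is a rough sketch: `DummyModel` is a hypothetical stand-in for the loaded model object, `pickled_attribute_sizes` is a made-up helper, and it assumes the object exposes its state through `__dict__`.

```python
import pickle


class DummyModel:
    """Hypothetical stand-in for an object loaded from model.pkl."""

    def __init__(self):
        self.name = "Dummy"
        self.params = {"n_estimators": 300}
        self.weights = list(range(10_000))


def pickled_attribute_sizes(obj):
    """Return {attribute: pickled size in bytes}, largest first."""
    sizes = {}
    for attr, value in vars(obj).items():
        try:
            sizes[attr] = len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        except Exception:
            # Some attributes may not be picklable on their own
            sizes[attr] = None
    return dict(sorted(sizes.items(), key=lambda kv: kv[1] or 0, reverse=True))


sizes = pickled_attribute_sizes(DummyModel())
for attr, size in sizes.items():
    print(f"{attr}: {size} bytes")
```

The per-attribute sizes will not add up exactly to the size of the full pickle, since shared objects and framing overhead are counted differently, but the ranking is usually informative.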