# Exploring the Configs in TabRepo

This notebook is an exploratory analysis of the configs used in TabRepo. The goal is to find parameters that are well suited to simulating, emulating, or approximating the following hardware-aware metrics:

- FLOP
- Latency
- Energy Consumption
- Memory Footprint
- Area


## Additional Notes to each Metric

### FLOP
--

### Latency
This is already mostly covered by TabRepo's measured inference times. 
It could be improved by comparing those times against actual inference times on CPU vs. GPU. 
I would need different hardware to create meaningful data. 

> Is there research I could use to find good estimates? 
>
> What hardware was used for TabRepo?

Furthermore, there are apparently simulators and hardware emulators that I could use to measure inference times on specific hardware. 
I do not know yet how reliable the resulting data would be.
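
To make the CPU vs. GPU comparison concrete, here is a minimal sketch of how wall-clock inference latency could be measured locally. The names `measure_latency` and `dummy_predict` are hypothetical, and the stand-in model is a plain Python function, not a TabRepo model. On a GPU, asynchronous execution typically means the device has to be synchronized before the clock is read, which this sketch does not handle.

```python
import statistics
import time

def measure_latency(predict_fn, batch, warmup=3, repeats=20):
    """Return the median wall-clock time (seconds) of predict_fn(batch)."""
    # Warm-up calls so caches and lazy initialisation do not distort the timings.
    for _ in range(warmup):
        predict_fn(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(timings)

# Hypothetical stand-in for a model's predict function.
def dummy_predict(batch):
    return [sum(row) for row in batch]

# Hypothetical input batch: 100 rows with 10 features each.
batch = [[float(i + j) for j in range(10)] for i in range(100)]
latency_s = measure_latency(dummy_predict, batch)
print(f"Median latency: {latency_s:.6f} s")
```

Repeating the same measurement on different machines would give numbers that can be set against TabRepo's recorded inference times.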

### Energy Consumption
One option here is to estimate the energy consumption of a model based on existing profiling tools. 
I am not sure how this would work and what tools there are, so this would require a lot of research.

If I have enough data for a couple of representative models, I could fit an energy model through regression. 
With this model I could estimate the energy consumption of the remaining models. 
I could use the inference and training times from TabRepo as inputs to the model.
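
The regression idea could look roughly like the sketch below, which fits a linear model with ordinary least squares. Everything here is an assumption: the linear form, the helper names `fit_energy_model` and `predict_energy`, and all numbers, which are synthetic placeholders and not measurements.

```python
import numpy as np

# Synthetic placeholder values, NOT real measurements.
# Columns: training time (s), inference time (s) of a few profiled models.
times = np.array([
    [10.0, 0.5],
    [20.0, 1.0],
    [40.0, 1.5],
    [80.0, 4.0],
])
# Synthetic placeholder energy readings (J) for the same models.
energy = np.array([35.0, 65.0, 115.0, 245.0])

def fit_energy_model(times, energy):
    """Least-squares fit of energy ~ w_train * t_train + w_infer * t_infer + b."""
    # Append a column of ones so the last coefficient acts as the intercept b.
    X = np.column_stack([times, np.ones(len(times))])
    coef, *_ = np.linalg.lstsq(X, energy, rcond=None)
    return coef

def predict_energy(coef, times):
    """Apply the fitted coefficients to new (t_train, t_infer) rows."""
    X = np.column_stack([times, np.ones(len(times))])
    return X @ coef

coef = fit_energy_model(times, energy)
# Estimate the energy of an unprofiled model from its two times.
estimate = predict_energy(coef, np.array([[30.0, 1.2]]))
print(coef, estimate)
```

Whether a linear relationship between runtime and energy is adequate would have to be checked against real profiling data.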

### Memory Footprint
--

### Area
--

In [16]:
import json
import pandas as pd

# Load the JSON configuration files
config_path = "../extern/tabrepo/data/configs/"
files = {
    "xt": "configs_xt.json",
    "xgboost": "configs_xgboost.json",
    "tabpfn": "configs_tabpfn.json",
    "rf": "configs_rf.json",
    "nn_torch": "configs_nn_torch.json",
    "lr": "configs_lr.json",
    "lightgbm": "configs_lightgbm.json",
    "knn": "configs_knn.json",
    "ftt": "configs_ftt.json",
    "fastai": "configs_fastai.json",
    "catboost": "configs_catboost.json"
}

# Function to load JSON data
def load_json(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)

# Load all data
data = {name_id: load_json(config_path + filename) for name_id, filename in files.items()}

# Function to convert JSON data to DataFrame
def json_to_df(data, model_name):
    records = []
    for config_name, config_data in data.items():
        record = {"model": model_name, "config": config_name}
        record.update(config_data["hyperparameters"])
        records.append(record)
    return pd.DataFrame(records)


# Convert all JSON data to DataFrames
dfs = [json_to_df(config_data, model_name) for model_name, config_data in data.items()]
for i, df in enumerate(dfs):
    print(f"Shape of ({i}): {df.shape}")
    print(f"\tColumns: {list(df.columns)}")
    # Calculate total entries
    total_entries = df.shape[0] * df.shape[1]

    # Count non-NaN values
    non_nan_count = df.count().sum()

    # Count NaN values
    nan_count = df.isna().sum().sum()

    # Calculate sparsity metrics
    percent_non_nan = (non_nan_count / total_entries) * 100
    percent_nan = (nan_count / total_entries) * 100

    print(f'\tTotal entries: {total_entries}')
    print(f'\tNon-NaN entries: {non_nan_count} ({percent_non_nan:.2f}%)')
    print(f'\tNaN entries: {nan_count} ({percent_nan:.2f}%)')

Shape of (0): (201, 6)
	Columns: ['model', 'config', 'ag_args', 'max_features', 'max_leaf_nodes', 'min_samples_leaf']
	Total entries: 1206
	Non-NaN entries: 1203 (99.75%)
	NaN entries: 3 (0.25%)
Shape of (1): (204, 8)
	Columns: ['model', 'config', 'ag_args', 'learning_rate', 'enable_categorical', 'colsample_bytree', 'max_depth', 'min_child_weight']
	Total entries: 1632
	Non-NaN entries: 1616 (99.02%)
	NaN entries: 16 (0.98%)
Shape of (2): (3, 4)
	Columns: ['model', 'config', 'ag_args', 'N_ensemble_configurations']
	Total entries: 12
	Non-NaN entries: 11 (91.67%)
	NaN entries: 1 (8.33%)
Shape of (3): (201, 6)
	Columns: ['model', 'config', 'ag_args', 'max_features', 'max_leaf_nodes', 'min_samples_leaf']
	Total entries: 1206
	Non-NaN entries: 1203 (99.75%)
	NaN entries: 3 (0.25%)
Shape of (4): (204, 10)
	Columns: ['model', 'config', 'ag_args', 'use_batchnorm', 'num_layers', 'activation', 'dropout_prob', 'hidden_size', 'learning_rate', 'weight_decay']
	Total entries: 2040
	Non-NaN entries:

In [21]:
import numpy as np

def find_nans(df: pd.DataFrame):
    for column in df.columns:
        if column != 'config':  # Exclude the 'config' column from the NaN check
            # Find rows where the current column is NaN
            nan_rows = df[column].isna()

            # Print the column name, row index, and 'config' field for each NaN
            for index, is_nan in nan_rows.items():
                if is_nan:
                    print(f'Column: {column}, Row Index: {index}, Config: {df.loc[index, "config"]}')

In [22]:
for i, df in enumerate(dfs):
    print(f"Finding NaNs for ({i})")
    find_nans(df)
    print()

Finding NaNs for (0)
Column: max_features, Row Index: 0, Config: ExtraTrees_c1
Column: max_leaf_nodes, Row Index: 0, Config: ExtraTrees_c1
Column: min_samples_leaf, Row Index: 0, Config: ExtraTrees_c1

Finding NaNs for (1)
Column: learning_rate, Row Index: 0, Config: XGBoost_c1
Column: learning_rate, Row Index: 2, Config: XGBoost_c3
Column: enable_categorical, Row Index: 0, Config: XGBoost_c1
Column: enable_categorical, Row Index: 1, Config: XGBoost_c2
Column: colsample_bytree, Row Index: 0, Config: XGBoost_c1
Column: colsample_bytree, Row Index: 1, Config: XGBoost_c2
Column: colsample_bytree, Row Index: 2, Config: XGBoost_c3
Column: colsample_bytree, Row Index: 3, Config: XGBoost_c4
Column: max_depth, Row Index: 0, Config: XGBoost_c1
Column: max_depth, Row Index: 1, Config: XGBoost_c2
Column: max_depth, Row Index: 2, Config: XGBoost_c3
Column: max_depth, Row Index: 3, Config: XGBoost_c4
Column: min_child_weight, Row Index: 0, Config: XGBoost_c1
Column: min_child_weight, Row Index: 1, Co

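NaN values usually appear in the first few rows of each DataFrame. In the output above, ExtraTrees has them only in row 0, and the visible part of the XGBoost output has them in rows 0 to 3.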
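
This observation can be checked more compactly than by reading the printed list. The sketch below collects the positions of all rows that contain at least one NaN. The helper `nan_row_positions` and the toy frame are hypothetical; the toy values only mimic the pattern seen above and are not taken from the TabRepo configs.

```python
import pandas as pd

def nan_row_positions(df: pd.DataFrame) -> list:
    """Return the positional indices of rows containing at least one NaN."""
    mask = df.isna().any(axis=1).to_numpy()
    return [pos for pos, has_nan in enumerate(mask) if has_nan]

# Hypothetical toy frame with NaNs only in the first rows.
toy = pd.DataFrame({
    "config": ["c1", "c2", "c3", "c4"],
    "max_depth": [None, None, 6, 8],
    "learning_rate": [None, 0.1, 0.05, 0.2],
})
positions = nan_row_positions(toy)
print(positions)
```

Applied to the notebook's data, `[nan_row_positions(df) for df in dfs]` would show for each model family whether the NaNs are confined to the leading rows.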