# Tutorial 5: Neural Architecture Search (NAS) with Mase and Optuna

In this tutorial, we'll see how Mase can be integrated with Optuna, the popular hyperparameter optimization framework, to search for a BERT model optimized for sequence classification on the IMDb dataset. We'll then import the model found by Optuna into Mase and run the `CompressionPipeline` to prepare it for edge deployment by quantizing and pruning its weights.

As we'll see, running Architecture Search with Mase/Optuna involves the following steps.

1. **Define the search space**: this is a dictionary containing the range of values for each parameter at each layer in the model.

2. **Write the model constructor**: this is a function which uses Optuna utilities to sample a model from the search space, and constructs the model using the `from_config` class method from `transformers`.

3. **Write the objective function**: this function calls on the model constructor defined in Step 2 and defines the training/evaluation setup for each search iteration.

4. **Go!** Choose an Optuna sampler, create a study and launch the search.
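The four steps can be sketched end to end without Mase or Optuna. The snippet below is only an illustration under stated assumptions: `toy_search_space`, `sample_config`, `toy_objective` and the scoring rule are hypothetical stand-ins, and plain random sampling takes the place of an Optuna sampler.

```python
import random

# Step 1: search space (hypothetical toy values, not the tutorial's space).
toy_search_space = {
    "num_layers": [2, 4, 8],
    "hidden_size": [128, 256, 512],
}


# Step 2: "model constructor". Here it only samples a configuration.
def sample_config(rng):
    return {name: rng.choice(values) for name, values in toy_search_space.items()}


# Step 3: objective. A made-up score stands in for evaluation accuracy.
def toy_objective(config):
    return config["num_layers"] * 0.01 + config["hidden_size"] / 10_000


# Step 4: run the search and keep the best configuration seen so far.
def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = toy_objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(n_trials=20)
print(best_config, round(best_score, 4))
```

In the rest of the tutorial, Optuna's `trial` object and `study.optimize` take over the roles of `sample_config` and `random_search`.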

In [1]:
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

First, fetch the dataset using the `get_tokenized_dataset` utility.

In [None]:
from chop.tools import get_tokenized_dataset

dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

## 1. Defining the Search Space

We'll start by defining a search space, i.e. enumerating the possible combinations of hyperparameters that Optuna can choose during search. We'll explore the following range of values for the model's hidden size, intermediate size, number of layers and number of heads, inspired by the [NAS-BERT paper](https://arxiv.org/abs/2105.14444).

In [3]:
import torch.nn as nn
from chop.nn.modules import Identity

search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": [
        nn.Linear,
        Identity,
    ],
}
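To get a feel for how large this space is, and to check that it contains no invalid combinations, one can count the configurations directly. This is a minimal sketch, assuming only the four architecture parameters are counted (the per-layer `linear_layer_choices` multiply the total further) and that the hidden size must be divisible by the number of attention heads, as BERT's self-attention requires. The names `arch_space`, `n_configs` and `invalid_pairs` are introduced here for illustration.

```python
from itertools import product
from math import prod

# Architecture parameters copied from the search space above.
arch_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
}

# Number of distinct architecture configurations (layer-type choices excluded).
n_configs = prod(len(values) for values in arch_space.values())

# Multi-head attention splits hidden_size evenly across the heads, so every
# (hidden_size, num_heads) pair must divide without remainder.
invalid_pairs = [
    (hidden, heads)
    for hidden, heads in product(arch_space["hidden_size"], arch_space["num_heads"])
    if hidden % heads != 0
]

print(n_configs)      # 300
print(invalid_pairs)  # []
```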

## 2. Writing a Model Constructor

We define the following function, which is called in each iteration of the search. It receives a `trial` argument, an Optuna object that exposes the sampling API (see the [Trial documentation](https://optuna.readthedocs.io/en/stable/reference/trial.html) for details). Here, we use `trial.suggest_int` and `trial.suggest_categorical` to have the chosen sampler pick parameter values and layer types. Each suggested integer is an index into the corresponding list in the search space defined in the previous cell.

In [4]:
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr


def construct_model(trial):
    config = AutoConfig.from_pretrained(checkpoint)

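    # Update the parameters in the config. The search-space keys are mapped to
    # the attribute names that the BERT config actually reads.
    config_attr = {
        "num_layers": "num_hidden_layers",
        "num_heads": "num_attention_heads",
        "hidden_size": "hidden_size",
        "intermediate_size": "intermediate_size",
    }
    for param, attr in config_attr.items():
        chosen_idx = trial.suggest_int(param, 0, len(search_space[param]) - 1)
        setattr(config, attr, search_space[param][chosen_idx])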

    trial_model = AutoModelForSequenceClassification.from_config(config)

    for name, layer in trial_model.named_modules():
        if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
            new_layer_cls = trial.suggest_categorical(
                f"{name}_type",
                search_space["linear_layer_choices"],
            )

            if new_layer_cls == nn.Linear:
                continue
            elif new_layer_cls == Identity:
                new_layer = Identity()
                deepsetattr(trial_model, name, new_layer)
            else:
                raise ValueError(f"Unknown layer type: {new_layer_cls}")

    return trial_model
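Because `trial.suggest_int` returns an index, the parameters Optuna records for each trial are indices rather than the values themselves. A small helper can translate them back. This is a sketch: `decode_params`, `arch_space` and `example_params` are hypothetical names, and the dictionary merely mimics what `trial.params` would hold for the four integer parameters.

```python
# Architecture part of the search space defined in Section 1.
arch_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
}


def decode_params(params, space):
    """Map recorded indices back to the values they select in `space`.

    Keys that are not in `space` (such as per-layer type choices) are skipped.
    """
    return {name: space[name][idx] for name, idx in params.items() if name in space}


# Hypothetical recorded parameters for one trial.
example_params = {
    "num_layers": 1,
    "num_heads": 3,
    "hidden_size": 0,
    "intermediate_size": 4,
}

decoded = decode_params(example_params, arch_space)
print(decoded)
# {'num_layers': 4, 'num_heads': 16, 'hidden_size': 128, 'intermediate_size': 2048}
```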

## 3. Defining the Objective Function

Next, we define the objective function for the search, which gets called on each trial. In each trial, we create a new model instance with hyperparameters chosen by the sampler. We then use the `get_trainer` utility in Mase to run a training loop on the IMDb dataset for a number of epochs. Finally, we use `evaluate` to report back the classification accuracy on the test split.

In [5]:
from chop.tools import get_trainer


def objective(trial):

    # Define the model
    model = construct_model(trial)

    trainer = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,
    )

    trainer.train()
    eval_results = trainer.evaluate()

    # Set the model as an attribute so we can fetch it later
    trial.set_user_attr("model", model)

    return eval_results["eval_accuracy"]

## 4. Launching the Search

Optuna provides a number of samplers, for example:

* **GridSampler**: iterates through every possible combination of hyperparameters in the search space
* **RandomSampler**: chooses a random combination of hyperparameters in each iteration
* **TPESampler**: uses the Tree-structured Parzen Estimator algorithm to choose hyperparameter values
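The difference between the first two strategies can be illustrated in plain Python. This is a sketch on a hypothetical two-parameter space (`small_space`), using `itertools.product` and `random.Random` as stand-ins. It does not reproduce Optuna's samplers, and TPE is omitted because it needs a history of scored trials.

```python
import random
from itertools import product

# Hypothetical toy space, smaller than the tutorial's.
small_space = {"num_layers": [2, 4, 8], "num_heads": [2, 4]}
names = list(small_space)

# Grid-style: enumerate every combination exactly once.
grid_points = [dict(zip(names, combo)) for combo in product(*small_space.values())]

# Random-style: draw each parameter independently; repeats are possible.
rng = random.Random(0)
random_points = [
    {name: rng.choice(values) for name, values in small_space.items()}
    for _ in range(len(grid_points))
]

print(len(grid_points))  # 6
unique_random = {tuple(p.items()) for p in random_points}
print(len(unique_random) <= len(grid_points))  # True
```

With the same budget, the grid is guaranteed to cover every combination once, while random sampling may revisit some points and miss others.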

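You can select a sampler by importing it from `optuna.samplers`, as below. Note that `GridSampler`, unlike the other two, must be given the search space explicitly when it is constructed.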

In [6]:
from optuna.samplers import GridSampler, RandomSampler, TPESampler

sampler = RandomSampler()

With all the pieces in place, we can launch the search as follows. The number of trials is set to 1 so you can go get a coffee for 10 minutes, then proceed with the tutorial. However, this will essentially be a random model - for better results, set this to 100 and leave it running overnight!

In [None]:
import optuna

study = optuna.create_study(
    direction="maximize",
    study_name="bert-tiny-nas-study",
    sampler=sampler,
)

study.optimize(
    objective,
    n_trials=1,
    timeout=60 * 60 * 24,
)

Fetch the model associated with the best trial as follows, and export to be used in future tutorials. In Tutorial 6, we'll see how to run mixed-precision quantization search on top of the model we've just found through NAS to further find the optimal quantization mapping.

In [8]:
from pathlib import Path
import dill

model = study.best_trial.user_attrs["model"].cpu()

with open(f"{Path.home()}/tutorial_5_best_model.pkl", "wb") as f:
    dill.dump(model, f)

## Deploying the Optimized Model with CompressionPipeline

Now, we can use the `CompressionPipeline` in Mase to apply uniform quantization and pruning to the searched model.

In [None]:
from chop.pipelines import CompressionPipeline
from chop import MaseGraph

mg = MaseGraph(model)
pipe = CompressionPipeline()

quantization_config = {
    "by": "type",
    "default": {
        "config": {
            "name": None,
        }
    },
    "linear": {
        "config": {
            "name": "integer",
            # data
            "data_in_width": 8,
            "data_in_frac_width": 4,
            # weight
            "weight_width": 8,
            "weight_frac_width": 4,
            # bias
            "bias_width": 8,
            "bias_frac_width": 4,
        }
    },
}

pruning_config = {
    "weight": {
        "sparsity": 0.5,
        "method": "l1-norm",
        "scope": "local",
    },
    "activation": {
        "sparsity": 0.5,
        "method": "l1-norm",
        "scope": "local",
    },
}

mg, _ = pipe(
    mg,
    pass_args={
        "quantize_transform_pass": quantization_config,
        "prune_transform_pass": pruning_config,
    },
)
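To make the two configurations more concrete, the sketch below mimics them on plain Python floats. It rests on assumptions: that the `integer` quantizer behaves like signed fixed-point with `width` total bits and `frac_width` fractional bits, and that `l1-norm` pruning with `local` scope zeroes the smallest-magnitude entries of each tensor. `fixed_point_quantize` and `prune_by_magnitude` are hypothetical helpers, not Mase functions.

```python
def fixed_point_quantize(x, width=8, frac_width=4):
    """Round x to a signed fixed-point grid and clamp to the representable range."""
    scale = 2 ** frac_width
    q_min, q_max = -(2 ** (width - 1)), 2 ** (width - 1) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale


def prune_by_magnitude(values, sparsity=0.5):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    n_prune = int(len(values) * sparsity)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else v for i, v in enumerate(values)]


print(fixed_point_quantize(0.3))    # 0.3125 (step size is 1/16)
print(fixed_point_quantize(100.0))  # 7.9375 (largest representable value)
print(prune_by_magnitude([0.9, -0.1, 0.4, -0.7]))  # [0.9, 0.0, 0.0, -0.7]
```

Under these assumptions, an 8-bit width with 4 fractional bits covers the range [-8, 7.9375] in steps of 1/16, and a sparsity of 0.5 removes half of the entries in each tensor.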

Finally, export the MaseGraph for the compressed checkpoint to be used in future tutorials for hardware generation and distributed deployment.

In [None]:
mg.export(f"{Path.home()}/tutorial_5_nas_compressed")

In [None]:
### Part 5a, full code for random method ###

import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_tokenized_dataset, get_trainer
import optuna
from optuna.samplers import GridSampler, TPESampler, RandomSampler
import matplotlib.pyplot as plt
import dill
from pathlib import Path
import pandas as pd

# -----------------------------
# 1. Define your checkpoints & dataset
# -----------------------------
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

# Load tokenized dataset and tokenizer
dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

# -----------------------------
# 2. Define search space
# -----------------------------
grid_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}

# Same discrete sets for TPE/Random:
tpe_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}


# -----------------------------
# 3. Construct the model given a trial
# -----------------------------
def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", tpe_search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", tpe_search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", tpe_search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", tpe_search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", tpe_search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model


# -----------------------------
# 4. Define the objective function
# -----------------------------
def objective(trial):
    # Build the model given the trial
    trial_model = construct_model(trial)

    # Create a Trainer (from chop) that handles fine-tuning
    trainer = get_trainer(
        model=trial_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=3,
    )

    # Train and evaluate
    trainer.train()
    eval_results = trainer.evaluate()

    # Save the model to the trial user attributes for later retrieval if needed
    trial.set_user_attr("model", trial_model)

    # Return the metric you want to optimize
    return eval_results["eval_accuracy"]


# -----------------------------
# 5. Helper to run a study and collect "running best accuracies"
# -----------------------------
def run_study_and_get_curve(sampler, n_trials=None, study_name="study"):
    """
    Runs an Optuna study with the provided sampler and returns:
      - the study object
      - a list of best accuracies up to each trial (running max)
    """
    study = optuna.create_study(
        direction="maximize",
        study_name=study_name,
        sampler=sampler,
    )

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=60 * 60 * 24,  # or specify a time limit
        show_progress_bar=False,
    )

    # Build a list of max accuracies up to each trial
    running_max_accuracies = []
    current_max = 0.0
    for t in study.trials:
        if t.value is not None and t.value > current_max:
            current_max = t.value
        running_max_accuracies.append(current_max)

    return study, running_max_accuracies

# -----------------------------
# 6. Function to save trials data to CSV
# -----------------------------
def save_study_results_to_csv(study, filename):
    """
    Saves each trial's results into a CSV, including:
      - trial number
      - objective value (accuracy)
      - parameters
      - model config parameters
    """
    rows = []
    for t in study.trials:
        row = {
            "trial_number": t.number,
            "accuracy": t.value,
        }
        # Merge in parameter key-value pairs directly
        row.update(t.params)

        # Add model config if it exists in user attributes
        if "model" in t.user_attrs:
            model_config = t.user_attrs["model"].config.to_dict()
            for key, value in model_config.items():
                row[f"config_{key}"] = value

        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(filename, index=False)
    print(f"Saved {len(rows)} trials with model configs to {filename}.")

if __name__ == "__main__":
    random_sampler = RandomSampler()
    random_study, random_max_curve = run_study_and_get_curve(
        sampler=random_sampler,
        n_trials=10,
        study_name="bert-random-study",
    )
    print(f"[RandomSampler] Number of trials: {len(random_study.trials)}")
    print(f"[RandomSampler] Best accuracy: {random_study.best_value:.4f}")

    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(random_max_curve)+1), random_max_curve, label="RandomSampler")
    plt.xlabel("Number of Trials")
    plt.ylabel("Max Accuracy So Far")
    plt.title("RandomSampler: Best Accuracy vs Number of Trials")
    plt.legend()
    plt.show()

    best_random_model = random_study.best_trial.user_attrs["model"].cpu()
    with open("best_random_model.pkl", "wb") as f:
        dill.dump(best_random_model, f)

    save_study_results_to_csv(random_study, "random_study_trials.csv")
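The running-best curve built inside `run_study_and_get_curve` can also be expressed with `itertools.accumulate`. This is a minimal sketch, assuming a hypothetical list of per-trial accuracies (`trial_values`) in which `None` marks a trial that produced no value.

```python
from itertools import accumulate

# Hypothetical per-trial accuracies; None stands for a failed trial.
trial_values = [0.62, 0.58, None, 0.71, 0.69]

# Treat failed trials as 0.0 so they never raise the running maximum,
# which mirrors starting the loop from current_max = 0.0.
running_max = list(
    accumulate((v if v is not None else 0.0 for v in trial_values), max)
)
print(running_max)  # [0.62, 0.62, 0.62, 0.71, 0.71]
```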

In [None]:
### Part 5a, full code for grid method ###

import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_tokenized_dataset, get_trainer
import optuna
from optuna.samplers import GridSampler, TPESampler, RandomSampler
import matplotlib.pyplot as plt
import dill
from pathlib import Path
import pandas as pd

# -----------------------------
# 1. Define your checkpoints & dataset
# -----------------------------
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

# Load tokenized dataset and tokenizer
dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

# -----------------------------
# 2. Define search space
# -----------------------------
grid_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}

# Same discrete sets for TPE/Random:
tpe_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}


# -----------------------------
# 3. Construct the model given a trial
# -----------------------------
def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", grid_search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", grid_search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", grid_search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", grid_search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", grid_search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model


# -----------------------------
# 4. Define the objective function
# -----------------------------
def objective(trial):
    # Build the model given the trial
    trial_model = construct_model(trial)

    # Create a Trainer (from chop) that handles fine-tuning
    trainer = get_trainer(
        model=trial_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=3,
    )

    # Train and evaluate
    trainer.train()
    eval_results = trainer.evaluate()

    # Save the model to the trial user attributes for later retrieval if needed
    trial.set_user_attr("model", trial_model)

    # Return the metric you want to optimize
    return eval_results["eval_accuracy"]


# -----------------------------
# 5. Helper to run a study and collect "running best accuracies"
# -----------------------------
def run_study_and_get_curve(sampler, n_trials=None, study_name="study"):
    """
    Runs an Optuna study with the provided sampler and returns:
      - the study object
      - a list of best accuracies up to each trial (running max)
    """
    study = optuna.create_study(
        direction="maximize",
        study_name=study_name,
        sampler=sampler,
    )

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=60 * 60 * 24,  # or specify a time limit
        show_progress_bar=False,
    )

    # Build a list of max accuracies up to each trial
    running_max_accuracies = []
    current_max = 0.0
    for t in study.trials:
        if t.value is not None and t.value > current_max:
            current_max = t.value
        running_max_accuracies.append(current_max)

    return study, running_max_accuracies

# -----------------------------
# 6. Function to save trials data to CSV
# -----------------------------
def save_study_results_to_csv(study, filename):
    """
    Saves each trial's results into a CSV, including:
      - trial number
      - objective value (accuracy)
      - parameters
      - model config parameters
    """
    rows = []
    for t in study.trials:
        row = {
            "trial_number": t.number,
            "accuracy": t.value,
        }
        # Merge in parameter key-value pairs directly
        row.update(t.params)

        # Add model config if it exists in user attributes
        if "model" in t.user_attrs:
            model_config = t.user_attrs["model"].config.to_dict()
            for key, value in model_config.items():
                row[f"config_{key}"] = value

        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(filename, index=False)
    print(f"Saved {len(rows)} trials with model configs to {filename}.")



# -----------------------------
# 7. Compare GridSampler vs TPESampler vs RandomSampler
# -----------------------------
if __name__ == "__main__":
    # For GridSampler, define discrete search space
    grid_sampler = GridSampler(
       search_space={
           "num_layers": grid_search_space["num_layers"],
           "num_heads": grid_search_space["num_heads"],
           "hidden_size": grid_search_space["hidden_size"],
           "intermediate_size": grid_search_space["intermediate_size"],
           "linear_layer_choices": grid_search_space["linear_layer_choices"],
       }
    )

    #tpe_sampler = TPESampler()
    # random_sampler = RandomSampler()

    # ---------------------------------------
    # 7a. Run with GridSampler
    # ---------------------------------------
    grid_study, grid_max_curve = run_study_and_get_curve(
       sampler=grid_sampler,
       n_trials=10,  # evaluates 10 of the 3*4*5*5*2 = 600 grid points; raise this to cover more of the grid
       study_name="bert-grid-study",
    )
    print(f"[GridSampler] Number of trials: {len(grid_study.trials)}")
    print(f"[GridSampler] Best accuracy: {grid_study.best_value:.4f}")

    # ---------------------------------------
    # 7b. Run with TPESampler
    # ---------------------------------------
    #tpe_study, tpe_max_curve = run_study_and_get_curve(
    #    sampler=tpe_sampler,
    #    n_trials=10,  # pick more if desired
    #    study_name="bert-tpe-study",
    #)
    #print(f"[TPESampler] Number of trials: {len(tpe_study.trials)}")
    #print(f"[TPESampler] Best accuracy: {tpe_study.best_value:.4f}")

    # ---------------------------------------
    # 7c. Run with RandomSampler
    # ---------------------------------------
    # random_study, random_max_curve = run_study_and_get_curve(
    #     sampler=random_sampler,
    #     n_trials=10,  # pick more if desired
    #     study_name="bert-random-study",
    # )
    # print(f"[RandomSampler] Number of trials: {len(random_study.trials)}")
    # print(f"[RandomSampler] Best accuracy: {random_study.best_value:.4f}")

    # -----------------------------
    # 8. Plot the results
    # -----------------------------
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(grid_max_curve)+1), grid_max_curve, label="GridSampler")
    #plt.plot(range(1, len(tpe_max_curve)+1), tpe_max_curve, label="TPESampler")
    # plt.plot(range(1, len(random_max_curve)+1), random_max_curve, label="RandomSampler")
    plt.xlabel("Number of Trials")
    plt.ylabel("Max Accuracy So Far")
    plt.title("GridSampler: Best Accuracy vs Number of Trials")
    plt.legend()
    plt.show()

    # -----------------------------
    # 9. (Optional) Save best models
    # -----------------------------
    best_grid_model = grid_study.best_trial.user_attrs["model"].cpu()
    with open("best_grid_model.pkl", "wb") as f:
        dill.dump(best_grid_model, f)

    #best_tpe_model = tpe_study.best_trial.user_attrs["model"].cpu()
    #with open("best_tpe_model.pkl", "wb") as f:
        #dill.dump(best_tpe_model, f)

    # best_random_model = random_study.best_trial.user_attrs["model"].cpu()
    # with open("best_random_model.pkl", "wb") as f:
    #     dill.dump(best_random_model, f)

    # -----------------------------
    # 10. Save all trials to CSV
    # -----------------------------
    save_study_results_to_csv(grid_study, "grid_study_trials.csv")
    # save_study_results_to_csv(tpe_study, "tpe_study_trials.csv")
    # save_study_results_to_csv(random_study, "random_study_trials.csv")

In [None]:
### Part 5a, full code for TPE method ###

import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_tokenized_dataset, get_trainer
import optuna
from optuna.samplers import GridSampler, TPESampler, RandomSampler
import matplotlib.pyplot as plt
import dill
from pathlib import Path
import pandas as pd

# -----------------------------
# 1. Define your checkpoints & dataset
# -----------------------------
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

# Load tokenized dataset and tokenizer
dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

# -----------------------------
# 2. Define search space
# -----------------------------
grid_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}

# Same discrete sets for TPE/Random:
tpe_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}


# -----------------------------
# 3. Construct the model given a trial
# -----------------------------
def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", tpe_search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", tpe_search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", tpe_search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", tpe_search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", tpe_search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model


# -----------------------------
# 4. Define the objective function
# -----------------------------
def objective(trial):
    # Build the model given the trial
    trial_model = construct_model(trial)

    # Create a Trainer (from chop) that handles fine-tuning
    trainer = get_trainer(
        model=trial_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=3,
    )

    # Train and evaluate
    trainer.train()
    eval_results = trainer.evaluate()

    # Save the model to the trial user attributes for later retrieval if needed
    trial.set_user_attr("model", trial_model)

    # Return the metric you want to optimize
    return eval_results["eval_accuracy"]


# -----------------------------
# 5. Helper to run a study and collect "running best accuracies"
# -----------------------------
def run_study_and_get_curve(sampler, n_trials=None, study_name="study"):
    """
    Runs an Optuna study with the provided sampler and returns:
      - the study object
      - a list of best accuracies up to each trial (running max)
    """
    study = optuna.create_study(
        direction="maximize",
        study_name=study_name,
        sampler=sampler,
    )

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=60 * 60 * 24,  # or specify a time limit
        show_progress_bar=False,
    )

    # Build a list of max accuracies up to each trial
    running_max_accuracies = []
    current_max = 0.0
    for t in study.trials:
        if t.value is not None and t.value > current_max:
            current_max = t.value
        running_max_accuracies.append(current_max)

    return study, running_max_accuracies

# -----------------------------
# 6. Function to save trials data to CSV
# -----------------------------
def save_study_results_to_csv(study, filename):
    """
    Saves each trial's results into a CSV, including:
      - trial number
      - objective value (accuracy)
      - parameters
      - model config parameters
    """
    rows = []
    for t in study.trials:
        row = {
            "trial_number": t.number,
            "accuracy": t.value,
        }
        # Merge in parameter key-value pairs directly
        row.update(t.params)

        # Add model config if it exists in user attributes
        if "model" in t.user_attrs:
            model_config = t.user_attrs["model"].config.to_dict()
            for key, value in model_config.items():
                row[f"config_{key}"] = value

        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(filename, index=False)
    print(f"Saved {len(rows)} trials with model configs to {filename}.")



# -----------------------------
# 7. Compare GridSampler vs TPESampler vs RandomSampler
# -----------------------------
if __name__ == "__main__":
    # For GridSampler, define discrete search space
    # grid_sampler = GridSampler(
    #    search_space={
    #        "num_layers": grid_search_space["num_layers"],
    #        "num_heads": grid_search_space["num_heads"],
    #        "hidden_size": grid_search_space["hidden_size"],
    #        "intermediate_size": grid_search_space["intermediate_size"],
    #        "linear_layer_choices": grid_search_space["linear_layer_choices"],
    #    }
    # )

    tpe_sampler = TPESampler()
    # random_sampler = RandomSampler()

    # ---------------------------------------
    # 7a. Run with GridSampler
    # ---------------------------------------
    # grid_study, grid_max_curve = run_study_and_get_curve(
    #    sampler=grid_sampler,
    #    n_trials=10,  # product of all combos if you set this None or a bigger number
    #    study_name="bert-grid-study",
    # )
    # print(f"[GridSampler] Number of trials: {len(grid_study.trials)}")
    # print(f"[GridSampler] Best accuracy: {grid_study.best_value:.4f}")

    # ---------------------------------------
    # 7b. Run with TPESampler
    # ---------------------------------------
    tpe_study, tpe_max_curve = run_study_and_get_curve(
       sampler=tpe_sampler,
       n_trials=10,  # pick more if desired
       study_name="bert-tpe-study",
    )
    print(f"[TPESampler] Number of trials: {len(tpe_study.trials)}")
    print(f"[TPESampler] Best accuracy: {tpe_study.best_value:.4f}")

    # ---------------------------------------
    # 7c. Run with RandomSampler
    # ---------------------------------------
    # random_study, random_max_curve = run_study_and_get_curve(
    #     sampler=random_sampler,
    #     n_trials=10,  # pick more if desired
    #     study_name="bert-random-study",
    # )
    # print(f"[RandomSampler] Number of trials: {len(random_study.trials)}")
    # print(f"[RandomSampler] Best accuracy: {random_study.best_value:.4f}")

    # -----------------------------
    # 8. Plot the results
    # -----------------------------
    plt.figure(figsize=(8, 6))
    # plt.plot(range(1, len(grid_max_curve)+1), grid_max_curve, label="GridSampler")
    plt.plot(range(1, len(tpe_max_curve)+1), tpe_max_curve, label="TPESampler")
    # plt.plot(range(1, len(random_max_curve)+1), random_max_curve, label="RandomSampler")
    plt.xlabel("Number of Trials")
    plt.ylabel("Max Accuracy So Far")
    plt.title("TPESampler: Best Accuracy vs Number of Trials")
    plt.legend()
    plt.show()

    # -----------------------------
    # 9. (Optional) Save best models
    # -----------------------------
    # best_grid_model = grid_study.best_trial.user_attrs["model"].cpu()
    # with open("best_grid_model.pkl", "wb") as f:
    #     dill.dump(best_grid_model, f)

    best_tpe_model = tpe_study.best_trial.user_attrs["model"].cpu()
    with open("best_tpe_model.pkl", "wb") as f:
        dill.dump(best_tpe_model, f)

    # best_random_model = random_study.best_trial.user_attrs["model"].cpu()
    # with open("best_random_model.pkl", "wb") as f:
    #     dill.dump(best_random_model, f)

    # -----------------------------
    # 10. Save all trials to CSV
    # -----------------------------
    # save_study_results_to_csv(grid_study, "grid_study_trials.csv")
    save_study_results_to_csv(tpe_study, "tpe_study_trials.csv")
    # save_study_results_to_csv(random_study, "random_study_trials.csv")
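
Before choosing `n_trials`, it helps to know how large the search space is, since a full grid search visits every combination. The sketch below counts the combinations, assuming the version of the search space where a single `linear_layer_choices` value applies to the whole model (as in the sampler-comparison cells). `count_configurations` is a hypothetical helper, not part of Mase or Optuna.

```python
import math

# Candidate values copied from this tutorial's search space.
search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}


def count_configurations(space):
    """Number of distinct points in a fully discrete search space."""
    return math.prod(len(choices) for choices in space.values())


total = count_configurations(search_space)
n_trials = 10
print(f"{total} configurations; {n_trials} trials cover {n_trials / total:.1%} of them")
```

With only a handful of trials, any sampler sees a small fraction of the space, so differences between samplers over 10 trials should be read with caution.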

In [None]:
### Part 5b Combined ###

from chop.tools import get_tokenized_dataset
import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_trainer
import optuna
from optuna.samplers import GridSampler, RandomSampler, TPESampler
from pathlib import Path
import dill
from chop.pipelines import CompressionPipeline
from chop import MaseGraph

def construct_model(trial):
    config = AutoConfig.from_pretrained(checkpoint)

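    # Update the parameters in the config. The search-space keys "num_layers" and
    # "num_heads" map to BertConfig's num_hidden_layers and num_attention_heads.
    config_attr = {"num_layers": "num_hidden_layers", "num_heads": "num_attention_heads"}
    for param in [
        "num_layers",
        "num_heads",
        "hidden_size",
        "intermediate_size",
    ]:
        chosen_idx = trial.suggest_int(param, 0, len(search_space[param]) - 1)
        setattr(config, config_attr.get(param, param), search_space[param][chosen_idx])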

    trial_model = AutoModelForSequenceClassification.from_config(config)

    for name, layer in trial_model.named_modules():
        if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
            new_layer_cls = trial.suggest_categorical(
                f"{name}_type",
                search_space["linear_layer_choices"],
            )

            if new_layer_cls == nn.Linear:
                continue
            elif new_layer_cls == Identity:
                new_layer = Identity()
                deepsetattr(trial_model, name, new_layer)
            else:
                raise ValueError(f"Unknown layer type: {new_layer_cls}")

    return trial_model

def objective(trial):

    # Define the model
    model = construct_model(trial)

    trainer = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,
    )

    trainer.train()

    # Quantize and prune the model
    mg = MaseGraph(model)
    pipe = CompressionPipeline()

    quantization_config = {
        "by": "type",
        "default": {
            "config": {
                "name": None,
            }
        },
        "linear": {
            "config": {
                "name": "integer",
                "data_in_width": 8,
                "data_in_frac_width": 4,
                "weight_width": 8,
                "weight_frac_width": 4,
                "bias_width": 8,
                "bias_frac_width": 4,
            }
        },
    }

    pruning_config = {
        "weight": {
            "sparsity": 0.5,
            "method": "l1-norm",
            "scope": "local",
        },
        "activation": {
            "sparsity": 0.5,
            "method": "l1-norm",
            "scope": "local",
        },
    }

    mg, _ = pipe(
        mg,
        pass_args={
            "quantize_transform_pass": quantization_config,
            "prune_transform_pass": pruning_config,
        },
    )

    # Replace the original model with the compressed model
    model = mg.model

    trainer = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,  # Post-compression training phase
    )

    trainer.train()
    eval_results = trainer.evaluate()

    # Set the model as an attribute so we can fetch it later
    trial.set_user_attr("model", model)

    return eval_results["eval_accuracy"]

def main_run():
    global checkpoint, tokenizer, dataset, search_space

    checkpoint = "prajjwal1/bert-tiny"
    tokenizer_checkpoint = "bert-base-uncased"
    dataset_name = "imdb"

    dataset, tokenizer = get_tokenized_dataset(
        dataset=dataset_name,
        checkpoint=tokenizer_checkpoint,
        return_tokenizer=True,
    )

    search_space = {
        "num_layers": [2, 4, 8],
        "num_heads": [2, 4, 8, 16],
        "hidden_size": [128, 192, 256, 384, 512],
        "intermediate_size": [512, 768, 1024, 1536, 2048],
        "linear_layer_choices": [
            nn.Linear,
            Identity,
        ],
    }

    sampler = TPESampler()

    study = optuna.create_study(
        direction="maximize",
        study_name="bert-tiny-compression-aware-study",
        sampler=sampler,
    )

    study.optimize(
        objective,
        n_trials=3,
        timeout=60 * 60 * 24,
    )

    # Save the best model
    model = study.best_trial.user_attrs["model"].cpu()

    with open(f"{Path.home()}/tutorial_5_best_model.pkl", "wb") as f:
        dill.dump(model, f)

main_run()
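
The quantization config above uses 8-bit values with 4 fractional bits. As a rough illustration of what those two numbers mean, here is a minimal pure-Python sketch of signed fixed-point quantization. It assumes round-to-nearest and saturation at the range limits; Mase's `integer` quantizer may differ in its rounding details, and `quantize_fixed_point` is a hypothetical name used only for this example.

```python
def quantize_fixed_point(x, width=8, frac_width=4):
    """Round x onto a signed fixed-point grid and saturate at the range limits."""
    scale = 2 ** frac_width          # 16 steps per unit when frac_width=4
    q_min = -(2 ** (width - 1))      # smallest integer code: -128 when width=8
    q_max = 2 ** (width - 1) - 1     # largest integer code: 127 when width=8
    q = round(x * scale)             # nearest integer code (Python rounds ties to even)
    q = max(q_min, min(q_max, q))    # saturate instead of wrapping around
    return q / scale


for value in (0.3, 1.0, 100.0, -100.0):
    print(value, "->", quantize_fixed_point(value))
```

Under these assumptions the representable range is roughly [-8, 8) in steps of 1/16, which is why a short post-compression training phase is used to recover accuracy.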

In [None]:
### Download all csvs ###

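import glob

try:
    from google.colab import files  # only available when running inside Google Colab
except ImportError:
    files = None  # outside Colab, each download attempt below reports an error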

def download_all_csvs():
    """
    Find all CSV files in the directory structure and download them to the local machine.
    Handles errors gracefully and provides feedback for each file.
    """
    try:
        # Find all CSV files (searches recursively)
        csv_files = glob.glob("**/*.csv", recursive=True)

        if not csv_files:
            print("No CSV files found.")
            return

        print(f"Found {len(csv_files)} CSV file(s):")
        for file in csv_files:
            print(f"- {file}")

        # Download each CSV file
        for file in csv_files:
            try:
                print(f"Downloading {file}...")
                files.download(file)
            except Exception as e:
                print(f"Error downloading {file}: {e}")

        print("All CSV files processed.")

    except Exception as e:
        print(f"An error occurred: {e}")

# Call the function
download_all_csvs()


In [None]:
### Print Forward Info ###

import inspect

def print_model_forward_info(model, label="Model"):
    print(f"[INFO] === {label} ===")
    print(f"[INFO] Class: {model.__class__}")
    # Attempt to show the forward signature
    try:
        sig = inspect.signature(model.forward)
        print(f"[INFO] forward signature: {sig}")
    except ValueError:
        print("[INFO] Could not retrieve a signature (might be a built-in / GraphModule).")

    # Try printing the docstring if available
    doc = inspect.getdoc(model.forward)
    print(f"[INFO] forward docstring:\n{doc}\n")

In [None]:
### Complete pass of compression-aware training ###

import torch
from chop.tools import get_tokenized_dataset
import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_trainer
import optuna
from optuna.samplers import GridSampler, RandomSampler, TPESampler
from pathlib import Path
import dill
from chop.pipelines import CompressionPipeline
from chop import MaseGraph
import chop.passes as passes
import inspect


def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model

def objective(trial):
    # Define the model and ensure it is on the correct device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"[INFO] Model is being initialized on: {device}")

    model = construct_model(trial).to(device)
    print(f"[INFO] Model moved to: {next(model.parameters()).device}")

    original_forward = model.forward
    model.config.use_cache = False

    # Set up the trainer
    trainer = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,
    )

    print("[INFO] Starting initial training...")
    trainer.train()

    # Move model to CPU after training
    model.cpu()
    print(f"[INFO] Model moved to: {next(model.parameters()).device} after training")

    # print_model_forward_info(model, label="Model BEFORE compression")

    # MaseGraph processing
    # mg = MaseGraph(model)  # Create MaseGraph on the same device
    mg = MaseGraph(
        model,
        hf_input_names=[
            "input_ids",
            "attention_mask",
            "labels",
        ],
    )

    mg, _ = passes.init_metadata_analysis_pass(mg)
    mg, _ = passes.add_common_metadata_analysis_pass(mg)
    pipe = CompressionPipeline()

    quantization_config = {
        "by": "type",
        "default": {"config": {"name": None}},
        "linear": {
            "config": {
                "name": "integer",
                "data_in_width": 8,
                "data_in_frac_width": 4,
                "weight_width": 8,
                "weight_frac_width": 4,
                "bias_width": 8,
                "bias_frac_width": 4,
            }
        },
    }

    pruning_config = {
        "weight": {"sparsity": 0.5, "method": "l1-norm", "scope": "local"},
        "activation": {"sparsity": 0.5, "method": "l1-norm", "scope": "local"},
    }

    print("[INFO] Applying compression pipeline (quantization & pruning)...")
    mg, _ = pipe(
        mg,
        pass_args={
            "quantize_transform_pass": quantization_config,
            "prune_transform_pass": pruning_config,
        },
    )

    compressed_model = mg.model.to(device)
    print(f"[INFO] Compressed model class: {compressed_model.__class__}")
    print(f"[INFO] Compressed model moved to: {next(compressed_model.parameters()).device}")

    # The compressed GraphModule keeps its own traced forward; the original
    # forward is intentionally not reassigned.
    # compressed_model.forward = original_forward

    print_model_forward_info(compressed_model, label="Model AFTER compression")

    # Continue training the compressed model
    trainer = get_trainer(
        model=compressed_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,  # Post-compression training phase
    )

    print("[INFO] Starting post-compression training...")
    trainer.train()

    # Evaluate the final model
    eval_results = trainer.evaluate()
    print(f"[INFO] Final evaluation accuracy: {eval_results['eval_accuracy']:.4f}")

    # Set the compressed, post-trained model as an attribute for later use,
    # so the saved model matches the accuracy returned below
    trial.set_user_attr("model", compressed_model)

    return eval_results["eval_accuracy"]


def main_run():
    global checkpoint, tokenizer, dataset, search_space

    checkpoint = "prajjwal1/bert-tiny"
    tokenizer_checkpoint = "bert-base-uncased"
    dataset_name = "imdb"

    dataset, tokenizer = get_tokenized_dataset(
        dataset=dataset_name,
        checkpoint=tokenizer_checkpoint,
        return_tokenizer=True,
    )

    sampler = TPESampler()

    # search_space = {
    #     "num_layers": [2, 4, 8],
    #     "num_heads": [2, 4, 8, 16],
    #     "hidden_size": [128, 192, 256, 384, 512],
    #     "intermediate_size": [512, 768, 1024, 1536, 2048],
    #     "linear_layer_choices": [
    #         nn.Linear,
    #         Identity,
    #     ],
    # }

    search_space = {
        "num_layers": [2, 4, 8],
        "num_heads": [2, 4, 8, 16],
        "hidden_size": [128, 192, 256, 384, 512],
        "intermediate_size": [512, 768, 1024, 1536, 2048],
        "linear_layer_choices": ["linear", "identity"],
    }

    study = optuna.create_study(
        direction="maximize",
        study_name="bert-tiny-compression-aware-study",
        sampler=sampler,
    )

    print("[INFO] Optimisation started...")
    study.optimize(
        objective,
        n_trials=3,
        timeout=60 * 60 * 24,
    )

    # Save the best model
    try:
        model = study.best_trial.user_attrs["model"].cpu()
    except AttributeError:  # If `.cpu()` fails, fetch without it
        model = study.best_trial.user_attrs["model"]


    with open(f"{Path.home()}/tutorial_5_best_model.pkl", "wb") as f:
        dill.dump(model, f)

main_run()
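
The pruning config above asks for 50% sparsity with the `l1-norm` method and `local` scope. A minimal sketch of that idea on a flat list of weights is shown below: rank the entries by absolute value and zero out the smallest half. It assumes that `local` means the ranking is done within one tensor rather than across the whole model; `prune_by_magnitude` is a hypothetical helper, and Mase's pass works on tensors and masks rather than Python lists.

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of entries with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


layer_weights = [0.1, -0.5, 0.3, -0.05]
print(prune_by_magnitude(layer_weights, sparsity=0.5))
```

Because the ranking is per tensor in this sketch, every layer ends up with the same fraction of zeros, regardless of how large its weights are compared with other layers.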

In [None]:
### Part 5b Combined ###

import torch
from chop.tools import get_tokenized_dataset
import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_trainer
import optuna
from optuna.samplers import TPESampler
from pathlib import Path
import dill
from chop.pipelines import CompressionPipeline
from chop import MaseGraph
import chop.passes as passes
import inspect
import csv

def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model

def objective(trial):
    # Define the model and ensure it is on the correct device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"[INFO] Model is being initialized on: {device}")

    model = construct_model(trial).to(device)
    print(f"[INFO] Model moved to: {next(model.parameters()).device}")

    # Keep a copy of the original forward if needed later
    original_forward = model.forward
    model.config.use_cache = False

    # === 1) Training Only (No Compression) ===
    print("[INFO] Starting initial training (no compression)...")
    trainer_no_comp = get_trainer(
        model=model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=3,
    )
    trainer_no_comp.train()
    eval_results_no_comp = trainer_no_comp.evaluate()
    no_comp_acc = eval_results_no_comp["eval_accuracy"]
    print(f"[INFO] Accuracy (no compression): {no_comp_acc:.4f}")

    # Move model to CPU after initial training if you prefer
    model.cpu()
    print(f"[INFO] Model moved to: {next(model.parameters()).device} after initial training")

    # === 2) Apply Compression, Evaluate with No Additional Training ===
    print("[INFO] Building MaseGraph and applying compression pipeline...")
    mg = MaseGraph(
        model,
        hf_input_names=["input_ids", "attention_mask", "labels"],
    )
    mg, _ = passes.init_metadata_analysis_pass(mg)
    mg, _ = passes.add_common_metadata_analysis_pass(mg)

    pipe = CompressionPipeline()
    quantization_config = {
        "by": "type",
        "default": {"config": {"name": None}},
        "linear": {
            "config": {
                "name": "integer",
                "data_in_width": 8,
                "data_in_frac_width": 4,
                "weight_width": 8,
                "weight_frac_width": 4,
                "bias_width": 8,
                "bias_frac_width": 4,
            }
        },
    }
    pruning_config = {
        "weight": {"sparsity": 0.5, "method": "l1-norm", "scope": "local"},
        "activation": {"sparsity": 0.5, "method": "l1-norm", "scope": "local"},
    }

    mg, _ = pipe(
        mg,
        pass_args={
            "quantize_transform_pass": quantization_config,
            "prune_transform_pass": pruning_config,
        },
    )

    compressed_model = mg.model.to(device)
    print(f"[INFO] Compressed model moved to: {next(compressed_model.parameters()).device}")

    print("[INFO] Evaluating compression result (no post-training)...")
    trainer_comp_no_post = get_trainer(
        model=compressed_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,  # We'll just create a trainer; not calling train() means no extra training
    )
    # Just evaluate directly (no further training call):
    eval_results_no_post = trainer_comp_no_post.evaluate()
    compression_no_post_acc = eval_results_no_post["eval_accuracy"]
    print(f"[INFO] Accuracy (compression, no post-training): {compression_no_post_acc:.4f}")

    # === 3) Continue Training the Compressed Model (Post-Compression Training) ===
    print("[INFO] Starting post-compression training...")
    trainer_comp_post = get_trainer(
        model=compressed_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=3,  # Post-compression training phase
    )
    trainer_comp_post.train()
    eval_results_post_train = trainer_comp_post.evaluate()
    compression_post_acc = eval_results_post_train["eval_accuracy"]
    print(f"[INFO] Accuracy (compression + post-training): {compression_post_acc:.4f}")

    # Set the model as an attribute for later use
    # (the final compressed & post-trained model)
    trial.set_user_attr("model", compressed_model)

    # Store scenario accuracies for CSV logging
    trial.set_user_attr("no_compression_acc", no_comp_acc)
    trial.set_user_attr("compression_no_post_acc", compression_no_post_acc)
    trial.set_user_attr("compression_post_acc", compression_post_acc)

    # Return the final accuracy after compression + post-training
    return compression_post_acc

def main_run():
    global checkpoint, tokenizer, dataset, search_space

    checkpoint = "prajjwal1/bert-tiny"
    tokenizer_checkpoint = "bert-base-uncased"
    dataset_name = "imdb"

    dataset, tokenizer = get_tokenized_dataset(
        dataset=dataset_name,
        checkpoint=tokenizer_checkpoint,
        return_tokenizer=True,
    )

    sampler = TPESampler()

    search_space = {
        "num_layers": [2, 4, 8],
        "num_heads": [2, 4, 8, 16],
        "hidden_size": [128, 192, 256, 384, 512],
        "intermediate_size": [512, 768, 1024, 1536, 2048],
        "linear_layer_choices": ["linear", "identity"],
    }

    study = optuna.create_study(
        direction="maximize",
        study_name="bert-tiny-compression-aware-study",
        sampler=sampler,
    )

    print("[INFO] Optimisation started...")
    study.optimize(
        objective,
        n_trials=10,       # Adjust as needed
        timeout=60 * 60 * 24,  # 24 hours; adjust as needed
    )

    # Save the best model
    try:
        best_model = study.best_trial.user_attrs["model"].cpu()
    except AttributeError:
        best_model = study.best_trial.user_attrs["model"]

    with open(f"{Path.home()}/tutorial_5_best_model.pkl", "wb") as f:
        dill.dump(best_model, f)

    # === Write the results of all trials into a CSV file ===
    results_path = "results.csv"  # Will be saved in your Colab working directory
    with open(results_path, mode="w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trial_number", "no_compression_acc", "compression_no_post_acc", "compression_post_acc"])
        for t in study.trials:
            writer.writerow([
                t.number,
                t.user_attrs.get("no_compression_acc", None),
                t.user_attrs.get("compression_no_post_acc", None),
                t.user_attrs.get("compression_post_acc", None),
            ])

    print(f"[INFO] Results written to {results_path}")

main_run()
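
Once `results.csv` has been written, the three accuracy columns can be compared per trial. The sketch below assumes the column layout written by the cell above and uses made-up numbers in place of a real file; `summarise` is a hypothetical helper. It reports how much accuracy is lost right after compression and how much of a gap remains after post-compression training.

```python
import csv
import io

# Made-up numbers in the same column layout as results.csv; these are NOT real results.
sample_csv = """trial_number,no_compression_acc,compression_no_post_acc,compression_post_acc
0,0.80,0.50,0.78
1,0.82,0.60,0.83
2,,,
"""


def summarise(rows):
    """Per-trial accuracy drop after compression and remaining gap after retraining."""
    summary = []
    for row in rows:
        values = [row["no_compression_acc"], row["compression_no_post_acc"], row["compression_post_acc"]]
        if any(v == "" for v in values):
            continue  # trials without recorded accuracies are written as empty cells
        base, compressed, recovered = (float(v) for v in values)
        summary.append({
            "trial": int(row["trial_number"]),
            "drop_after_compression": base - compressed,
            "gap_after_post_training": base - recovered,
        })
    return summary


rows = list(csv.DictReader(io.StringIO(sample_csv)))
results = summarise(rows)
for item in results:
    print(item)
```

To run this on real data, replace `io.StringIO(sample_csv)` with an open file handle for `results.csv`. A negative gap means the compressed model ended up more accurate than the uncompressed baseline for that trial.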


In [None]:
import torch.nn as nn
from chop.nn.modules import Identity
from transformers import AutoConfig, AutoModelForSequenceClassification
from chop.tools.utils import deepsetattr
from chop.tools import get_tokenized_dataset, get_trainer
import optuna
from optuna.samplers import GridSampler, TPESampler, RandomSampler
import matplotlib.pyplot as plt
import dill
from pathlib import Path
import pandas as pd

# -----------------------------
# 1. Define your checkpoints & dataset
# -----------------------------
checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"

# Load tokenized dataset and tokenizer
dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

# -----------------------------
# 2. Define search space
# -----------------------------
grid_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}

# Same discrete sets for TPE/Random:
tpe_search_space = {
    "num_layers": [2, 4, 8],
    "num_heads": [2, 4, 8, 16],
    "hidden_size": [128, 192, 256, 384, 512],
    "intermediate_size": [512, 768, 1024, 1536, 2048],
    "linear_layer_choices": ["linear", "identity"],
}


# -----------------------------
# 3. Construct the model given a trial
# -----------------------------
def construct_model(trial):
    """
    Creates a BERT model with the hyperparameters suggested by the trial.
    We manually map each parameter to the correct config attribute or layer override.
    """
    # Load the default config from a small BERT checkpoint
    config = AutoConfig.from_pretrained(checkpoint)

    # Map parameters to config
    config.num_hidden_layers = trial.suggest_categorical(
        "num_layers", tpe_search_space["num_layers"]
    )
    config.num_attention_heads = trial.suggest_categorical(
        "num_heads", tpe_search_space["num_heads"]
    )
    config.hidden_size = trial.suggest_categorical(
        "hidden_size", tpe_search_space["hidden_size"]
    )
    config.intermediate_size = trial.suggest_categorical(
        "intermediate_size", tpe_search_space["intermediate_size"]
    )

    model = AutoModelForSequenceClassification.from_config(config)

    # Handle linear layer choice
    linear_choice = trial.suggest_categorical(
        "linear_layer_choices", tpe_search_space["linear_layer_choices"]
    )
    if linear_choice == "identity":
        # For each linear layer that is square (in_features == out_features),
        # replace it with Identity
        for name, layer in model.named_modules():
            if isinstance(layer, nn.Linear) and layer.in_features == layer.out_features:
                deepsetattr(model, name, Identity())

    return model


# -----------------------------
# 4. Define the objective function
# -----------------------------
def objective(trial):
    # Build the model given the trial
    trial_model = construct_model(trial)

    # Create a Trainer (from chop) that handles fine-tuning
    trainer = get_trainer(
        model=trial_model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy",
        num_train_epochs=1,  # For demonstration only
    )

    # Train and evaluate
    trainer.train()
    eval_results = trainer.evaluate()

    # Save the model to the trial user attributes for later retrieval if needed
    trial.set_user_attr("model", trial_model)

    # Return the metric you want to optimize
    return eval_results["eval_accuracy"]


# -----------------------------
# 5. Helper to run a study and collect "running best accuracies"
# -----------------------------
def run_study_and_get_curve(sampler, n_trials=None, study_name="study"):
    """
    Runs an Optuna study with the provided sampler and returns:
      - the study object
      - a list of best accuracies up to each trial (running max)
    """
    study = optuna.create_study(
        direction="maximize",
        study_name=study_name,
        sampler=sampler,
    )

    study.optimize(
        objective,
        n_trials=n_trials,
        timeout=60 * 60 * 24,  # or specify a time limit
        show_progress_bar=False,
    )

    # Build a list of max accuracies up to each trial
    running_max_accuracies = []
    current_max = 0.0
    for t in study.trials:
        if t.value is not None and t.value > current_max:
            current_max = t.value
        running_max_accuracies.append(current_max)

    return study, running_max_accuracies

# -----------------------------
# 6. Function to save trials data to CSV
# -----------------------------
def save_study_results_to_csv(study, filename):
    """
    Saves each trial's results into a CSV, including:
      - trial number
      - objective value (accuracy)
      - parameters
      - model config parameters
    """
    rows = []
    for t in study.trials:
        row = {
            "trial_number": t.number,
            "accuracy": t.value,
        }
        # Merge in parameter key-value pairs directly
        row.update(t.params)

        # Add model config if it exists in user attributes
        if "model" in t.user_attrs:
            model_config = t.user_attrs["model"].config.to_dict()
            for key, value in model_config.items():
                row[f"config_{key}"] = value

        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(filename, index=False)
    print(f"Saved {len(rows)} trials with model configs to {filename}.")



# -----------------------------
# 7. Compare GridSampler vs TPESampler vs RandomSampler
# -----------------------------
if __name__ == "__main__":
    # For GridSampler, define discrete search space
    #grid_sampler = GridSampler(
    #    search_space={
    #        "num_layers": grid_search_space["num_layers"],
    #        "num_heads": grid_search_space["num_heads"],
    #        "hidden_size": grid_search_space["hidden_size"],
    #        "intermediate_size": grid_search_space["intermediate_size"],
    #        "linear_layer_choices": grid_search_space["linear_layer_choices"],
    #    }
    #)

    #tpe_sampler = TPESampler()
    random_sampler = RandomSampler()

    # ---------------------------------------
    # 7a. Run with GridSampler
    # ---------------------------------------
    #grid_study, grid_max_curve = run_study_and_get_curve(
    #    sampler=grid_sampler,
    #    n_trials=10,  # set to None to let GridSampler run every combination in the grid
    #    study_name="bert-grid-study",
    #)
    #print(f"[GridSampler] Number of trials: {len(grid_study.trials)}")
    #print(f"[GridSampler] Best accuracy: {grid_study.best_value:.4f}")

    # ---------------------------------------
    # 7b. Run with TPESampler
    # ---------------------------------------
    #tpe_study, tpe_max_curve = run_study_and_get_curve(
    #    sampler=tpe_sampler,
    #    n_trials=10,  # pick more if desired
    #    study_name="bert-tpe-study",
    #)
    #print(f"[TPESampler] Number of trials: {len(tpe_study.trials)}")
    #print(f"[TPESampler] Best accuracy: {tpe_study.best_value:.4f}")

    # ---------------------------------------
    # 7c. Run with RandomSampler
    # ---------------------------------------
    random_study, random_max_curve = run_study_and_get_curve(
        sampler=random_sampler,
        n_trials=10,  # pick more if desired
        study_name="bert-random-study",
    )
    print(f"[RandomSampler] Number of trials: {len(random_study.trials)}")
    print(f"[RandomSampler] Best accuracy: {random_study.best_value:.4f}")

    # -----------------------------
    # 8. Plot the results
    # -----------------------------
    plt.figure(figsize=(8, 6))
    #plt.plot(range(1, len(grid_max_curve)+1), grid_max_curve, label="GridSampler")
    #plt.plot(range(1, len(tpe_max_curve)+1), tpe_max_curve, label="TPESampler")
    plt.plot(range(1, len(random_max_curve)+1), random_max_curve, label="RandomSampler")
    plt.xlabel("Number of Trials")
    plt.ylabel("Max Accuracy So Far")
    plt.title("Comparison of GridSampler vs TPESampler vs RandomSampler")
    plt.legend()
    plt.show()

    # -----------------------------
    # 9. (Optional) Save best models
    # -----------------------------
    #best_grid_model = grid_study.best_trial.user_attrs["model"].cpu()
    #with open("best_grid_model.pkl", "wb") as f:
        #dill.dump(best_grid_model, f)

    #best_tpe_model = tpe_study.best_trial.user_attrs["model"].cpu()
    #with open("best_tpe_model.pkl", "wb") as f:
        #dill.dump(best_tpe_model, f)

    best_random_model = random_study.best_trial.user_attrs["model"].cpu()
    with open("best_random_model.pkl", "wb") as f:
        dill.dump(best_random_model, f)

    # -----------------------------
    # 10. Save all trials to CSV
    # -----------------------------
    #save_study_results_to_csv(grid_study, "grid_study_trials.csv")
    #save_study_results_to_csv(tpe_study, "tpe_study_trials.csv")
    save_study_results_to_csv(random_study, "random_study_trials.csv")
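
The "max accuracy so far" curve built in step 5 is a running maximum over the trial values. As a compact alternative to the explicit loop, the same curve can be sketched with `itertools.accumulate`. `running_best` is a hypothetical helper; like the loop above, it starts from 0.0 and treats trials without a value as contributing nothing.

```python
from itertools import accumulate


def running_best(values, floor=0.0):
    """Best value seen up to and including each trial; None entries are ignored."""
    cleaned = [floor if v is None else v for v in values]
    # `initial=floor` seeds the running maximum; drop that seed from the output.
    return list(accumulate(cleaned, max, initial=floor))[1:]


# Illustrative values only; in the notebook these would be [t.value for t in study.trials].
trial_values = [0.5, None, 0.7, 0.6]
print(running_best(trial_values))
```

The curve never decreases, so a flat stretch means the sampler did not find a better model during those trials.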