# World of Quantum Hands-on: QisK-I-T

This notebook covers:
- Code for automated, supervised-learning prediction of the best configuration in the optimization stage (RQ2)
    - Data-Preparation & Pass Selection and Experimentation
    - Model-Training and Experimentation
    - Evaluation

## Technical: Before you begin
- `conda create -n [env_name]`
- `conda activate [env_name]`
- `pip install -r ./requirements.txt`
- `conda update --all`

## Imports

In [1]:
import ast
import numpy as np
import pandas as pd
import optuna
import joblib
from xgboost import XGBClassifier
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import make_scorer, hamming_loss
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from pathlib import Path
from datetime import datetime
from scipy.stats import skew, kurtosis




# 2. Code for automated supervised learning best configuration Prediction in Optimization stage (RQ2)

The pipeline is visualized below (the values shown in the diagram are not accurate).

![My diagram](Resources/graph.png)

## Model-Training and Experimentation

### Data preprocessing
We explored aggregating the top 10 configurations for each instance by applying a pointwise **OR** operation. However, this approach diluted the distance from the optimal solution and led to unreliable results.

This outcome was expected. Configurations in the top results can differ significantly from each other. Applying a pointwise OR often resulted in a configuration vector that enabled nearly **all** optimizations. This, in turn, increased runtime and introduced noise, as some optimization passes can cancel each other out or interfere negatively.
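The effect of the pointwise OR can be illustrated with a minimal sketch. The three configuration vectors below are made up for illustration (they are not taken from the dataset), and the 8-bit length is an assumption chosen for readability.

```python
import numpy as np

# Hypothetical top-3 configurations for one circuit (1 = pass enabled).
top_configs = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],   # best configuration
    [0, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
])

# Pointwise OR: a pass is enabled if ANY of the top configurations enables it.
merged = np.bitwise_or.reduce(top_configs, axis=0)

# Hamming distance between the merged vector and the best configuration.
distance_to_best = int((merged != top_configs[0]).sum())

print("passes enabled per configuration:", top_configs.sum(axis=1))
print("passes enabled after OR:", int(merged.sum()), "of", merged.size)
print("bits that differ from the best configuration:", distance_to_best)
```

Each individual configuration enables only two or three passes, but the merged vector enables almost all of them and no longer resembles the best configuration.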

We also tried the following dataset variants:
- Keeping all top 10 results per circuit in the dataset
- Keeping the top 3 results per circuit in the dataset
- Keeping only the top 1 result per circuit

Among these, using **only the top 1 configuration per circuit** yielded the best predictive performance (best F1-micro and F1-macro).


In [2]:
dataset = pd.read_csv('data/training_data.csv', header=None)

dataset.rename(
    columns={
      dataset.columns[0]:  "ID",
      dataset.columns[-1]: "configuration"
    },
    inplace=True
)

dataset = (
    dataset
    .groupby("ID", sort=False)
    .head(1)
    .reset_index(drop=True)
)

dataset.head()

Unnamed: 0,ID,1,2,3,4,5,6,7,8,configuration
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,10,183,20,375,90,0,90,0,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..."
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,11,204,22,441,110,0,110,0,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,12,225,24,512,132,0,132,0,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,13,246,26,588,156,0,156,0,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,14,267,28,669,182,0,182,0,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."


In [3]:
df_configurations_with_id_with_duplicates = dataset[['ID', 'configuration']].copy()
df_features_with_id = dataset.drop(['configuration'], axis=1).copy()

In [4]:
df_configurations_with_id_with_duplicates.head()

Unnamed: 0,ID,configuration
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..."
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."


In [5]:
df_features_with_id.head()

Unnamed: 0,ID,1,2,3,4,5,6,7,8
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,10,183,20,375,90,0,90,0
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,11,204,22,441,110,0,110,0
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,12,225,24,512,132,0,132,0
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,13,246,26,588,156,0,156,0
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,14,267,28,669,182,0,182,0


In [6]:
df_configurations_with_id_with_duplicates['config_arr'] = (
    df_configurations_with_id_with_duplicates['configuration']
      .apply(ast.literal_eval)          # "[1,0,1,...]" → [1,0,1,...]
)

df_configurations = df_configurations_with_id_with_duplicates[['ID','config_arr']].copy()

In [7]:
df_configurations.head()

Unnamed: 0,ID,config_arr
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..."
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."


In [8]:
df_features_with_id.head()

Unnamed: 0,ID,1,2,3,4,5,6,7,8
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,10,183,20,375,90,0,90,0
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,11,204,22,441,110,0,110,0
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,12,225,24,512,132,0,132,0
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,13,246,26,588,156,0,156,0
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,14,267,28,669,182,0,182,0


In [9]:
single_value_cols = df_features_with_id.columns[df_features_with_id.nunique(dropna=False) == 1]
single_value_cols

Index([6], dtype='object')

In [10]:
df_feat = df_features_with_id.drop(single_value_cols, axis=1).copy()
df_feat.head()

Unnamed: 0,ID,1,2,3,4,5,7,8
0,ae_nativegates_ibm_qiskit_opt0_10.qasm,10,183,20,375,90,90,0
1,ae_nativegates_ibm_qiskit_opt0_11.qasm,11,204,22,441,110,110,0
2,ae_nativegates_ibm_qiskit_opt0_12.qasm,12,225,24,512,132,132,0
3,ae_nativegates_ibm_qiskit_opt0_13.qasm,13,246,26,588,156,156,0
4,ae_nativegates_ibm_qiskit_opt0_14.qasm,14,267,28,669,182,182,0


In [11]:
df_feat.drop(['ID'], inplace=True, axis=1)
df_configurations.drop(['ID'], inplace=True, axis=1)

In [12]:
df_feat

Unnamed: 0,1,2,3,4,5,7,8
0,10,183,20,375,90,90,0
1,11,204,22,441,110,110,0
2,12,225,24,512,132,132,0
3,13,246,26,588,156,156,0
4,14,267,28,669,182,182,0
...,...,...,...,...,...,...,...
212,5,38,10,135,30,30,0
213,6,42,12,171,45,45,0
214,7,46,14,210,63,63,0
215,8,50,16,252,84,84,0


In [13]:
df_configurations

Unnamed: 0,config_arr
0,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..."
1,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
3,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
4,"[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
...,...
212,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
213,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
214,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
215,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [14]:
df_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       217 non-null    int64
 1   2       217 non-null    int64
 2   3       217 non-null    int64
 3   4       217 non-null    int64
 4   5       217 non-null    int64
 5   7       217 non-null    int64
 6   8       217 non-null    int64
dtypes: int64(7)
memory usage: 12.0 KB


In [15]:
stats = pd.DataFrame({
    'mean':        df_feat.mean(),
    'std':         df_feat.std(),
    'range':       df_feat.max() - df_feat.min(),
    'IQR':         df_feat.quantile(0.75) - df_feat.quantile(0.25),
    'CV':          df_feat.std() / df_feat.mean().abs().replace(0, np.nan),
    'skewness':    df_feat.apply(skew, nan_policy='omit'),
    'kurtosis':    df_feat.apply(lambda x: kurtosis(x, nan_policy='omit')),
    'outlier_%':   df_feat.apply(lambda x: ((x < x.mean() - 3*x.std()) | (x > x.mean() + 3*x.std())).mean() * 100)
})

# Sort by standard deviation, largest spread first.
stats_sorted = stats.sort_values(by='std', ascending=False)
print(stats_sorted)


         mean          std  range    IQR        CV   skewness    kurtosis  \
4  948.654378  2691.495716  38101  910.0  2.837172  12.269458  166.051440   
2  266.036866  1297.019849  18741  169.0  4.875339  13.505387  188.860029   
5  336.400922  1027.815952  14532  387.0  3.055330  12.279628  166.321343   
7  336.447005  1027.800933  14531  387.0  3.054867  12.280046  166.329026   
3   28.041475    16.682024     77   24.0  0.594905   0.680163    0.261058   
1   13.976959     8.326910     38   12.0  0.595760   0.715896    0.293709   
8    0.046083     0.210150      1    0.0  4.560245   4.329932   16.748309   

   outlier_%  
4   0.460829  
2   0.460829  
5   0.460829  
7   0.460829  
3   0.460829  
1   0.921659  
8   4.608295  


The output above can be read as follows:
- Skewness measures the asymmetry of a distribution, with 0 meaning perfectly symmetric. Values of 12 to 13 are very high and indicate a few extremely large values (a long right tail).
- The mean, the standard deviation, and the interquartile range differ by orders of magnitude between features (for example, a std of about 2691 for feature 4 versus 0.21 for feature 8). Without rescaling, the large-scale features can completely overshadow the others.
- outlier_% is the share of rows, per feature, that lie more than 3 standard deviations from that feature's mean. This flags extreme points. Feature 1 (0.92 %) and especially feature 8 (4.61 %) stand out. A value of 4.61 % means that roughly one in twenty rows is beyond 3 standard deviations. Since feature 8 only takes the values 0 and 1 (range 1, mean 0.046), the flagged rows are simply the few rows where it equals 1.
- CV (coefficient of variation) is the standard deviation divided by the mean. The larger it is, the more dispersed the feature is relative to its typical size. Intuitively, a column with many rows at 0 and a few at 1000 has a chaotic, unstable scale.
- Kurtosis measures how much probability mass sits in the tails versus the shoulders of the distribution. `scipy.stats.kurtosis` returns excess kurtosis by default, so a normal distribution (the classic bell curve, with moderate tails and rare outliers) scores 0, which corresponds to 3 under the Pearson definition. Feature 4 has a kurtosis of 166, which means extreme values are not occasional but a structural part of the data. (The word "outlier" becomes somewhat counterintuitive here.)
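To make the skewness and kurtosis readings above more concrete, here is a small sketch that computes both from central moments. The helper `moments_summary` and the two synthetic samples are assumptions for illustration only. It uses biased (population) moments and the excess-kurtosis convention, which to my understanding matches the scipy defaults used in the cell above.

```python
import numpy as np

def moments_summary(x):
    """Return (skewness, excess kurtosis) computed from biased central moments."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis

rng = np.random.default_rng(0)
bell = rng.normal(size=100_000)                      # synthetic, roughly Gaussian
heavy = np.append(np.arange(1.0, 100.0), 10_000.0)   # 99 small values plus one extreme

bell_skew, bell_kurt = moments_summary(bell)
heavy_skew, heavy_kurt = moments_summary(heavy)

print(f"bell : skewness={bell_skew:.2f}, excess kurtosis={bell_kurt:.2f}")
print(f"heavy: skewness={heavy_skew:.2f}, excess kurtosis={heavy_kurt:.2f}")
```

A single extreme value among 100 rows is enough to push both statistics far away from 0, which is the same pattern seen for features 2, 4, 5, and 7.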

This could explain the very low macro-F1 of 0.26 in the first prototype. Extreme feature values may correspond to rare passes, so minority labels (passes that are rarely enabled) are overwhelmed by two things: class imbalance (some labels have far fewer positive examples than others) and feature scale imbalance (for instance, a column that is mostly 0 with a handful of very large values, e.g. [0, 0, 0, ..., 1243, 0, ..., 0]).


Solution: I first tried standard scaling, but as the name suggests it only rescales the features and does not fix skewness or outliers. The next candidates are PowerTransformer, QuantileTransformer, RobustScaler, and clipping extreme values before scaling.

API descriptions from scikit-learn:
- PowerTransformer: Apply a power transform featurewise to make data more Gaussian-like.
- RobustScaler: Scale features using statistics that are robust to outliers.
- QuantileTransformer: Transform features using quantiles information.

The main difference is PowerTransformer() being parametric and QuantileTransformer() being non-parametric.
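As a rough comparison of the three candidates, the sketch below applies each of them to a synthetic right-skewed feature and reports the skewness before and after. The log-normal sample, the helper `skewness`, and the choice of `n_quantiles=100` are assumptions for illustration; this is not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, RobustScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal), shaped as one column.
x = rng.lognormal(mean=3.0, sigma=1.5, size=(500, 1))

def skewness(a):
    """Biased sample skewness of a single column."""
    a = np.asarray(a, dtype=float).ravel()
    c = a - a.mean()
    return float(np.mean(c ** 3) / np.mean(c ** 2) ** 1.5)

transformers = {
    "RobustScaler": RobustScaler(),
    "PowerTransformer": PowerTransformer(method="yeo-johnson", standardize=True),
    "QuantileTransformer": QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0
    ),
}

results = {"raw": skewness(x)}
for name, transformer in transformers.items():
    results[name] = skewness(transformer.fit_transform(x))

for name, value in results.items():
    print(f"{name:>20s}: skewness = {value:6.2f}")
```

RobustScaler is a linear rescaling, so it leaves the shape (and therefore the skewness) unchanged, while the two non-linear transforms pull the distribution toward a symmetric, Gaussian-like shape.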

In [16]:
df_feat.describe()  # Much of what was computed manually above is already available via this method.

Unnamed: 0,1,2,3,4,5,7,8
count,217.0,217.0,217.0,217.0,217.0,217.0,217.0
mean,13.976959,266.036866,28.041475,948.654378,336.400922,336.447005,0.046083
std,8.32691,1297.019849,16.682024,2691.495716,1027.815952,1027.800933,0.21015
min,2.0,4.0,3.0,6.0,0.0,1.0,0.0
25%,7.0,35.0,14.0,160.0,21.0,21.0,0.0
50%,13.0,90.0,26.0,450.0,135.0,135.0,0.0
75%,19.0,204.0,38.0,1070.0,408.0,408.0,0.0
max,40.0,18745.0,80.0,38107.0,14532.0,14532.0,1.0


In [17]:
#scaler = StandardScaler()
#X_scaled = scaler.fit_transform(df_feat)
#df_scaled = pd.DataFrame(X_scaled, columns=df_feat.columns, index=df_feat.index)
#df_scaled

# PowerTransformer with standardize=True already applies zero-mean, unit-variance scaling to its output, so a separate StandardScaler is not needed here.

pw_transformer = PowerTransformer(method='yeo-johnson', standardize=True)
X_scaled = pw_transformer.fit_transform(df_feat)
df_scaled = pd.DataFrame(X_scaled, columns=df_feat.columns, index=df_feat.index)
df_scaled

Unnamed: 0,1,2,3,4,5,7,8
0,-0.334151,0.609500,-0.339833,-0.089958,-0.119033,-0.113486,-0.219793
1,-0.197490,0.695883,-0.205040,0.032579,-0.002527,0.004616,-0.219793
2,-0.067511,0.772809,-0.076580,0.146505,0.105096,0.113369,-0.219793
3,0.056591,0.842058,0.046317,0.253070,0.205185,0.214217,-0.219793
4,0.175474,0.904959,0.164281,0.353260,0.298800,0.308290,-0.219793
...,...,...,...,...,...,...,...
212,-1.165106,-0.772691,-1.154966,-0.833917,-0.719977,-0.729116,-0.219793
213,-0.972454,-0.677319,-0.966419,-0.666112,-0.505596,-0.508220,-0.219793
214,-0.795713,-0.591474,-0.793311,-0.518153,-0.321048,-0.319210,-0.219793
215,-0.631800,-0.513507,-0.632543,-0.385198,-0.158616,-0.153701,-0.219793


In [18]:
df_scaled.describe()

Unnamed: 0,1,2,3,4,5,7,8
count,217.0,217.0,217.0,217.0,217.0,217.0,217.0
mean,3.2743900000000003e-17,-1.135804e-16,2.455793e-16,1.094874e-16,-2.455793e-16,9.823171000000001e-17,-8.185976e-18
std,1.002312,1.002312,1.002312,1.002312,1.002312,1.002312,1.002312
min,-1.898785,-3.08415,-2.039368,-2.76276,-2.330681,-2.123736,-0.2197935
25%,-0.7957134,-0.8517751,-0.7933111,-0.7135753,-0.9009688,-0.9167594,-0.2197935
50%,0.05659055,0.01630947,0.04631745,0.04793566,0.1184778,0.1268686,-0.2197935
75%,0.7080499,0.6958832,0.6956566,0.724581,0.8089339,0.8168126,-0.2197935
max,2.337068,3.413501,2.351389,3.921266,3.504715,3.404701,4.549725


In [19]:
y = np.vstack(df_configurations['config_arr'].values)
X = df_scaled.values

In [20]:
# Configuration
HPO_NUM_TRIALS  = 60
N_OUTER_FOLDS   = 10
N_INNER_FOLDS   = 5
RANDOM_STATE    = 43

BASELINE_PARAMS = {
    "n_estimators":       1000,
    "objective":          "binary:logistic",
    "tree_method":        "hist",
    "eval_metric":        "logloss",
    "multi_strategy":     "multi_output_tree",
    "random_state":       RANDOM_STATE,
    "base_score":         0.5
}

In [21]:
pos = y.sum(axis=0)              # positives for each label
neg = y.shape[0] - pos           # negatives for each label
ratios = neg / np.maximum(pos, 1)

overall_spw = np.median(ratios)

Some notes:
- An earlier version wrapped XGBoost in scikit-learn's MultiOutputClassifier, which trains one binary model per label. I have since dropped the wrapper because XGBoost has supported multi-label targets natively since version 1.6: https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html. Each label is still a binary decision, which is why the binary:logistic objective makes sense here (instead of softmax or softprob).
    - The objective is the loss function to be minimized. It is the main driver that actually updates the parameters during training.
    - Do not confuse this multi-label task with a multi-class task, where each sample gets exactly one label out of several classes (for instance, deciding whether an object is a pear, an apple, or a strawberry). There, multi:softmax outputs hard labels (the index of the most likely class) and multi:softprob outputs a vector of probabilities over all classes.
- Our evaluation metric, logloss, is reasonable for multi-label tasks, but I think there is a mismatch between XGBoost's internal score (cross-entropy via logloss) and the Optuna cross-validation search, which is meant to optimise macro-F1.
    - Macro-F1, which we try to improve, gives every label equal weight, no matter how rare the label is.
    - Log-loss weights each example equally, so rare labels contribute much less and the optimizer can largely ignore them.
    - F1 depends only on the hard 0/1 predictions after thresholding at the 0.5 cutoff, so probabilities of 0.51 and 0.49, although close, can flip a prediction and swing the F1 score. In log-loss the same difference is only a small change. Optimising one therefore does not necessarily optimise the other.
    - Possible solutions:
        - Customize the weighting of log-loss by adapting scale_pos_weight per label. Currently there is only one global scale_pos_weight, found by the hyperparameter search.
        - Write a custom objective that matches macro-F1 more closely.
        - Use "aucpr" as the evaluation metric, which computes the area under the precision-recall curve (I have seen it used for classification, but I do not understand the details yet).
        - Use a focal loss (written as "binary:focal_loss" in my notes; to my knowledge this is not one of XGBoost's built-in objectives, so it would have to be supplied as a custom objective). This loss generalizes binary cross-entropy by introducing a hyperparameter 𝛾 (gamma), the focusing parameter, which penalizes hard-to-classify examples more heavily than easy ones.
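The mismatch between log-loss and F1 described in the notes above can be reproduced with a few lines of NumPy. The label vector and probabilities are made-up numbers for a single rare label, and the helpers `log_loss` and `f1_at_threshold` are written here for illustration rather than taken from the notebook.

```python
import numpy as np

def log_loss(y_true, p):
    """Mean binary cross-entropy."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def f1_at_threshold(y_true, p, threshold=0.5):
    """F1 of the hard 0/1 predictions obtained by thresholding the probabilities."""
    y_pred = (p >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# One rare label: a single positive among ten samples.
y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
p_below = np.array([0.49] + [0.05] * 9)   # positive scored just below the cutoff
p_above = np.array([0.51] + [0.05] * 9)   # positive scored just above the cutoff

ll_below, ll_above = log_loss(y_true, p_below), log_loss(y_true, p_above)
f1_below, f1_above = f1_at_threshold(y_true, p_below), f1_at_threshold(y_true, p_above)

print(f"log-loss: {ll_below:.4f} -> {ll_above:.4f}")
print(f"F1      : {f1_below:.2f} -> {f1_above:.2f}")
```

Moving one probability from 0.49 to 0.51 barely changes the log-loss but takes the F1 for this label from 0 to 1, which is why a model tuned on log-loss can still score poorly on macro-F1.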

### Testing approach

1. Split the data 80 / 20 into train and test sets -> run the cross-validated hyperparameter search on the 80 % training set -> fit the model with the best hyperparameters on the training set and evaluate it on the 20 % test set -> refit the model on the whole dataset and save it.
2. Like approach 1, but repeated 5 times with different seeds. This repeats the whole train/validation/test procedure and reports the mean ± std of the five test scores.
3. Because the data is scarce, perform nested 10x5 cross-validation to make maximum use of the available data. This approach unfortunately leaves no separate hold-out set for a final evaluation of the model's performance. This is the approach used in this file.
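The index bookkeeping behind approach 3 can be sketched without any model. The example below uses plain shuffled splits via `np.array_split` as a simplifying assumption; the notebook itself uses `MultilabelStratifiedKFold`, which additionally balances the labels across folds.

```python
import numpy as np

N_SAMPLES, N_OUTER, N_INNER = 217, 10, 5   # sizes taken from the notebook

rng = np.random.default_rng(43)
indices = rng.permutation(N_SAMPLES)
outer_folds = np.array_split(indices, N_OUTER)

seen_as_test = []
for i, test_idx in enumerate(outer_folds):
    # Everything outside the current outer test fold is available for tuning.
    train_idx = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
    inner_folds = np.array_split(train_idx, N_INNER)
    for k, val_idx in enumerate(inner_folds):
        fit_idx = np.concatenate([f for j, f in enumerate(inner_folds) if j != k])
        # Hyperparameters would be scored on val_idx after fitting on fit_idx;
        # the outer test fold is never touched during tuning.
        assert np.intersect1d(val_idx, test_idx).size == 0
        assert np.intersect1d(fit_idx, test_idx).size == 0
    seen_as_test.append(test_idx)

all_test = np.concatenate(seen_as_test)
fold_sizes = [len(f) for f in outer_folds]
tested_once = np.array_equal(np.sort(all_test), np.arange(N_SAMPLES))

print("outer test fold sizes:", fold_sizes)
print("every sample is tested exactly once:", tested_once)
```

Every sample ends up in exactly one outer test fold, so all of the data contributes to the reported scores, but no single model is ever evaluated on data that was fully held out from the whole procedure.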

In [22]:
outer_cv = MultilabelStratifiedKFold(
    n_splits=N_OUTER_FOLDS,
    shuffle=True,
    random_state=RANDOM_STATE,
)

inner_cv = MultilabelStratifiedKFold(
    n_splits=N_INNER_FOLDS,
    shuffle=True,
    random_state=RANDOM_STATE,
)

outer_scores  = {"f1_micro": [], "f1_macro": [], "hamming": []}

for train_idx, test_idx in outer_cv.split(X, y):
    X_training, X_test = X[train_idx], X[test_idx]
    y_training, y_test = y[train_idx], y[test_idx]

    def objective(trial):
        param = {
            "n_estimators":       trial.suggest_int("n_estimators", 50, 1000),
            "max_depth":          trial.suggest_int("max_depth", 3, 10),
            "eta":                trial.suggest_float("eta", 1e-3, 0.2, log=True),
            "subsample":          trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree":   trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "gamma":              trial.suggest_float("gamma", 1e-3, 0.1, log=True),
            "scale_pos_weight":   trial.suggest_float("scale_pos_weight", 1, overall_spw), # most important parameter, it adds "sample weight" making rare passes count more and highly reducing class imbalance
            "objective":          "binary:logistic",
            "tree_method":        "hist",
            "eval_metric":        "logloss",
            "multi_strategy":     "multi_output_tree",
            "random_state":       RANDOM_STATE,
            "base_score":         0.5
        }

        classifier = XGBClassifier(
        **param,
        )

        scores = cross_val_score(
            classifier,
            X_training,
            y_training,
            cv=inner_cv,
            #scoring=make_scorer(f1_score, average='macro', zero_division=0),
            n_jobs=-1
        )
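        # With `scoring` commented out above, cross_val_score uses the estimator's
        # default score (subset accuracy for multilabel targets), not macro-F1.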
        return scores.mean()

    sampler = optuna.samplers.TPESampler(seed=RANDOM_STATE)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=HPO_NUM_TRIALS)
    best_params = study.best_params

    final_model = XGBClassifier(**best_params,
                                objective="binary:logistic",
                                tree_method="hist",
                                multi_strategy="multi_output_tree",
                                random_state=RANDOM_STATE,
                                base_score=0.5,
                                #scoring=make_scorer(f1_score, average='macro', zero_division=0),
                                eval_metric="logloss",
                                )
    final_model.fit(X_training, y_training)
    y_pred = final_model.predict(X_test)
    outer_scores["f1_micro"].append(f1_score(y_test, y_pred, average="micro", zero_division=0))
    outer_scores["f1_macro"].append(f1_score(y_test, y_pred, average="macro", zero_division=0))
    outer_scores["hamming"].append(hamming_loss(y_test, y_pred))

for metric, values in outer_scores.items():
    print(f"{metric:>9s}: {np.mean(values):.3f} ± {np.std(values):.3f}")

[I 2025-06-27 20:38:00,104] A new study created in memory with name: no-name-59e49fc3-62bc-40d9-9ad3-f31e9b4e1b49
[I 2025-06-27 20:38:01,309] Trial 0 finished with value: 0.06594269444718738 and parameters: {'n_estimators': 159, 'max_depth': 7, 'eta': 0.0020273867775857284, 'subsample': 0.6202948099826744, 'colsample_bytree': 0.6635695279055699, 'gamma': 0.052272705892251366, 'scale_pos_weight': 144.87548602917357}. Best is trial 0 with value: 0.06594269444718738.
[I 2025-06-27 20:38:02,234] Trial 1 finished with value: 0.2383539054014022 and parameters: {'n_estimators': 564, 'max_depth': 3, 'eta': 0.04879517040529174, 'subsample': 0.6974750092155029, 'colsample_bytree': 0.9010235593143332, 'gamma': 0.0032273216524928163, 'scale_pos_weight': 13.287146316648787}. Best is trial 1 with value: 0.2383539054014022.
[I 2025-06-27 20:38:03,428] Trial 2 finished with value: 0.08646226259833448 and parameters: {'n_estimators': 874, 'max_depth': 4, 'eta': 0.00854855810346696, 'subsample': 0.65804

 f1_micro: 0.787 ± 0.048
 f1_macro: 0.283 ± 0.031
  hamming: 0.063 ± 0.014


## Evaluation Metrics

### F1-micro

* **Measures:** Overall accuracy by pooling all label decisions.
* **Formula:**

  $$
  \frac{2 \cdot \text{TP}_{\text{total}}}
       {2 \cdot \text{TP}_{\text{total}} + \text{FP}_{\text{total}} + \text{FN}_{\text{total}}}
  $$

---

### F1-macro

* **Measures:** Average F1 across labels, treating each label equally.
* **Formula:**

  $$
  \frac{1}{L}\sum_{i=1}^{L}
  \frac{2 \cdot \text{TP}_i}
       {2 \cdot \text{TP}_i + \text{FP}_i + \text{FN}_i}
  $$

---

### Hamming Loss

* **Measures:** Fraction of incorrect label decisions.
* **Formula:**

  $$
  \frac{\text{FP}_{\text{total}} + \text{FN}_{\text{total}}}
       {(\text{number of labels}) \times (\text{number of samples})}
  $$
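The three formulas above can be checked on a toy example. The label matrices below are invented for illustration (4 samples, 3 labels), with a frequent label that is always predicted correctly and a rare label that is always missed.

```python
import numpy as np

# Rows are samples, columns are labels.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 0]])

# Per-label confusion counts.
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

# Micro: pool the counts over all labels, then compute one F1.
f1_micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

# Macro: compute F1 per label (0 where undefined), then average.
denom = 2 * tp + fp + fn
per_label_f1 = np.divide(2 * tp, denom, out=np.zeros(len(denom)), where=denom > 0)
f1_macro = per_label_f1.mean()

# Hamming loss: share of wrong label decisions.
hamming = (fp.sum() + fn.sum()) / y_true.size

print(f"F1-micro: {f1_micro:.3f}")
print(f"F1-macro: {f1_macro:.3f}")
print(f"Hamming : {hamming:.3f}")
```

The frequent label dominates the pooled counts, so F1-micro stays high, while the missed rare label contributes a per-label F1 of 0 and pulls F1-macro down.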

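Interpretation of the final test-set metrics (these values differ from the nested cross-validation output above and appear to come from the earlier hold-out prototype with a macro-F1 of 0.26):
 F1-micro: 0.691
 F1-macro: 0.260
 Hamming : 0.075

- Micro-F1 & Hamming: A Hamming loss of 0.075 means we misclassify only 7–8 % of the label bits on average, i.e. about 92 % are correct. The micro-F1 of 0.69 is lower because F1 ignores true negatives and only scores how well the enabled passes (the positives) are recovered.
- Macro-F1: 0.26 is low, indicating that we struggle on rare passes. A likely reason is insufficient data preprocessing, in particular the under-representation of rare passes, which was not addressed due to limited time. This could possibly be greatly improved by better preprocessing or by weighting the data with custom weight functions.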


In [23]:
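# Note: best_params holds only the tuned hyperparameters from the last outer fold.
# The fixed settings used during evaluation (multi_strategy, tree_method, base_score,
# eval_metric, random_state) are not passed here and fall back to their defaults.
best_model = XGBClassifier(**best_params)
best_model.fit(X, y)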

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,0.874068155057337
,device,
,early_stopping_rounds,
,enable_categorical,False


In [24]:
data_dir = Path("models")
data_dir.mkdir(parents=True, exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

model_filename  = f"qiskit_pass_predictor_{timestamp}.joblib"

model_path  = data_dir / model_filename


joblib.dump(best_model, model_path)
print(f"Model saved to {model_path}")

Model saved to models/qiskit_pass_predictor_20250627_212722.joblib


## Evaluation
We use the trained model to predict the optimal configuration for new circuits.
Then, we evaluate performance based on the number of qubits in the transpiled output, comparing it against Qiskit's default PassManager strategies: fast, normal, and slow.

In [25]:
# TODO: Add evaluation code against default optimization routine