# WoAG Coder Capability - HiClass Model Builder
## An Executable Notebook to Try Different HiClass Features 

## Najmeh Samadiani (MINDS/DESMB/MDSD)

This notebook collects user inputs and passes them to the different HiClass-based modules. Three base classifiers are supported: LogisticRegression, Random Forest, and a linear Support Vector Machine (SGDClassifier).

You can choose one of two real-world datasets for your tests: the product reviews from Kaggle and the household data from Saudi Arabia Statistics (the latter is still under development).

Feel free to raise any issues/questions through my email: najmeh.samadiani@abs.gov.au

---

#### DGB cell - running a Processing Job
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html

In [None]:
!python --version
# !pip install thinc
# × Failed to build installable wheels for some pyproject.toml based projects
#       ╰─> thinc

###___________________________________________________________________________________________________________###

DGB comment - created virtual environment before installing requirements.txt
```
virtualenv venv
source venv/bin/activate
```

DGB comment - Upgraded pip
```
pip install --upgrade pip
```

In [20]:
import os
os.getcwd()

'/home/sagemaker-user/git_codecommit/minds_coder_capability_hiclass'

In [26]:
%%writefile install_rqts.py
#Check the requirements file to ensure all necessary libraries are installed
import sys
sys.path.append("./helper")
from helper import install_requirements #helper.helper
# find the requirements.txt file in the main folder
install_requirements('./requirements.txt')

Overwriting install_rqts.py


---

## 🧭 Model Configuration Panel
Please use the configuration form below to define how you'd like your hierarchical classification model to behave. This panel allows you to control every major component of the pipeline, including:

🧾 **Dataset Selection**
- Choose Dataset: Select the dataset to use (`dataset1` or `dataset2`). Note that `dataset2` is still under development.
- Label Columns: Enter the column names that represent the hierarchical labels for your selected dataset (e.g., Cat1, Cat2, Cat3).

📝 **Text Feature Selection**

Text Column - Choose which column(s) to use for feature extraction:
- Title: Use only the title column.
- Text: Use only the text/content column.
- Both: Concatenate Title and Text for richer input.
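
The three options can be pictured with a small helper. This is only a sketch: the function name `select_text`, the dictionary layout of a record, and the single-space separator are assumptions for illustration and are not taken from the project's `helper` module.

```python
def select_text(row: dict, mode: str) -> str:
    """Return the text used for feature extraction from one record.

    `row` is assumed to have "Title" and "Text" keys (hypothetical layout).
    """
    title = (row.get("Title") or "").strip()
    text = (row.get("Text") or "").strip()
    if mode == "Title":
        return title
    if mode == "Text":
        return text
    if mode == "Both":
        # Concatenate, skipping whichever part is empty
        return " ".join(part for part in (title, text) if part)
    raise ValueError(f"Unknown text column option: {mode!r}")


example = {"Title": "Great blender", "Text": "Crushes ice easily."}
print(select_text(example, "Both"))  # Great blender Crushes ice easily.
```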

✨ **TF-IDF Preprocessing Settings**

Configure the text-to-numeric transformation:

- Minimum / Maximum Document Frequency: Control rare or overly common term filtering.
- N-gram Range: Specify the size of text chunks to analyze (e.g., unigrams or bigrams).
- Stop Words: Choose a stop-word list (e.g., English) or disable it.
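
As a rough sketch of how these settings could be turned into vectorizer arguments, the helper below builds a keyword dictionary. The input key names (`min_df`, `ngram_min`, and so on) and the function `tfidf_kwargs` are assumptions; the project's own mapping may differ.

```python
def tfidf_kwargs(values: dict) -> dict:
    """Translate widget-style settings into TfidfVectorizer keyword arguments.

    The input key names (min_df, max_df, ...) are assumed for illustration.
    """
    ngram_min, ngram_max = values["ngram_min"], values["ngram_max"]
    if ngram_min > ngram_max:
        raise ValueError("ngram_min must not exceed ngram_max")
    return {
        "min_df": values["min_df"],          # int: minimum number of documents
        "max_df": values["max_df"],          # float: maximum fraction of documents
        "max_features": values["max_features"],
        "ngram_range": (ngram_min, ngram_max),
        "stop_words": values["stop_words"],  # "english" or None
    }


settings = {"min_df": 1, "max_df": 0.5, "max_features": 50000,
            "ngram_min": 1, "ngram_max": 2, "stop_words": "english"}
kwargs = tfidf_kwargs(settings)
print(kwargs["ngram_range"])  # (1, 2)
```

With scikit-learn installed, the resulting dictionary can be passed on as `TfidfVectorizer(**kwargs)`.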

---

🧠 **HiClass Strategies**

HiClass provides multiple strategies for building hierarchical classifiers. Each strategy controls how base classifiers are positioned relative to the hierarchy structure:

🔹 `lcppn`: Local Classifier Per Parent Node (default)
- Trains a separate classifier at **each parent node** to distinguish between its direct children.
- **Advantages**: Efficient and focused. Classifiers learn from local child distributions only.
- **Recommended for**: Large or deep hierarchies where different parent nodes have distinct structures.

🔹 `lcpn`: Local Classifier Per Node
- Trains a classifier at **every node**, including leaf nodes. This can result in many small classifiers, increasing complexity.
- **Advantages**: Fine-grained control and potentially better local precision.
- **Recommended for**: Irregular hierarchies or custom tree control.

🔹 `lcpl`: Local Classifier Per Level
- Trains one classifier per **hierarchical level** (e.g., one for top-level, one for mid-level, etc.).
- **Advantages**: Simple to train. Less memory intensive. Easy to interpret.
- **Recommended for**: Shallow hierarchies or where level-wise accuracy is prioritized.
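
To get a feel for how the three strategies differ in size, the sketch below counts how many local classifiers each one would need for a tiny, made-up hierarchy. It is independent of HiClass: `count_local_classifiers` is a hypothetical helper, nodes are identified by their path prefix for simplicity, and the library may skip trivial nodes, so treat the numbers as an upper bound.

```python
def count_local_classifiers(paths):
    """Count classifiers each local strategy would train for a label hierarchy.

    `paths` is a list of root-to-leaf label tuples. Nodes are identified by
    their full path prefix, which is a simplification made for this sketch.
    """
    nodes = set()    # every node below the (implicit) root
    parents = {()}   # the root is the parent of all top-level labels
    depth = 0
    for path in paths:
        depth = max(depth, len(path))
        for i in range(1, len(path) + 1):
            nodes.add(tuple(path[:i]))
            if i < len(path):
                parents.add(tuple(path[:i]))  # a non-leaf node has children
    return {
        "lcppn": len(parents),  # one multi-class classifier per parent node
        "lcpn": len(nodes),     # one binary classifier per node
        "lcpl": depth,          # one multi-class classifier per level
    }


# Made-up example labels (Cat1, Cat2, Cat3)
labels = [
    ("grocery", "snacks", "chips"),
    ("grocery", "snacks", "nuts"),
    ("grocery", "drinks", "tea"),
    ("toys", "games", "puzzles"),
]
print(count_local_classifiers(labels))  # {'lcppn': 6, 'lcpn': 9, 'lcpl': 3}
```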

---

🧠 **Classifier Settings**

Choose which base classifier(s) you'd like to use (**you can select all three at once**):

- SGD: Linear model with SVM-style or logistic loss
- LogisticRegression: Classical log-linear model
- RandomForest: Ensemble-based non-linear classifier

Each classifier section includes parameters such as:

- Number of estimators, maximum depth, and criterion for Random Forest
- Maximum iterations, loss function (`hinge` or `log_loss`), and class weight for SGD

🧪 **Model Calibration & Probabilities**

Calibration/Probability Control:
- none: No probability output or calibration
- calibration only: Apply the chosen calibration method (isotonic, beta, ivap, or cvap) to the raw model scores
- probability only: Enable probability combination (e.g., geometric)
- both: Use both calibrated scores and probability combination

Related settings:
- Calibration Method: Choose from isotonic, beta, ivap, or cvap.
- Probability Combiner: Choose how probabilities are merged across hierarchy levels (e.g., geometric, arithmetic, or multiply).
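
The combiner names can be illustrated on a single predicted path. The sketch below is a simplified view, assuming one probability per level for the chosen path; `combine_path_probability` is a hypothetical helper, and HiClass itself works on full per-level probability matrices rather than a single path.

```python
import math


def combine_path_probability(level_probs, method="geometric"):
    """Combine the per-level probabilities of one predicted path into one score."""
    if not level_probs:
        raise ValueError("level_probs must not be empty")
    if method == "multiply":
        return math.prod(level_probs)
    if method == "arithmetic":
        return sum(level_probs) / len(level_probs)
    if method == "geometric":
        return math.prod(level_probs) ** (1.0 / len(level_probs))
    raise ValueError(f"Unknown combiner: {method!r}")


probs = [0.9, 0.8, 0.5]  # made-up probabilities for Cat1, Cat2, Cat3
for name in ("multiply", "arithmetic", "geometric"):
    print(name, round(combine_path_probability(probs, name), 3))
```

Multiplication penalises deep paths most strongly, while the two means keep the score on a comparable scale regardless of depth.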

---

## Break into 3 scripts
1. `install_rqts.py`: loads `requirements.txt`
2. `dynamic_vals.py`: creates the dynamic values (DGB: we could add a piece that feeds the dynamic values into the processing job script)
3. `processjob.py`: runs the selected workflows and evaluates the trained models

### DGB cell - problems with the openpyxl module: add it to requirements.txt, and possibly reinstall it from the command line
```
pip install openpyxl
pip show openpyxl
```

In [2]:
# %%writefile dynamic_vals.py
import sys
# Adding the relevant folders to the python path
sys.path.append("./modules")

from helper.display_utils import build_model_configuration_widgets
from IPython.display import display, HTML

display(HTML("<h2>🛠️ Model Configuration Inputs</h2><p>Please fill in the configuration below to build and train your HiClass model.</p>"))
# print("Please fill in the configuration below to build and train your HiClass model.")

dynamic_values = build_model_configuration_widgets()

# DGB - Save dynamic values as json 
# import yaml
# config_filepath = "./dynamic_values.yaml"
# config=open("dynamic_values.yaml","w")
# yaml.dump(dynamic_values,config)
# print("YAML config file saved.")
# config.close()

# import json
# dv_filepath = "./dynamic_values.json"
# with open(dv_filepath, "w") as json_file:
#     json.dump(dynamic_values, json_file, indent=4)

VBox(children=(HTML(value='<b>Choose Dataset</b>'), Dropdown(description='Dataset', options=('dataset1', 'data…

VBox(children=(Text(value='Cat1,Cat2,Cat3', description='Labels for dataset1'),))

VBox(children=(HTML(value='<b>Text Column</b>'), Dropdown(options=('Title', 'Text', 'Both'), value='Title')))

VBox(children=(HTML(value='<b>Min Df</b>'), IntText(value=1)))

VBox(children=(HTML(value='<b>Max Df</b>'), FloatText(value=0.5)))

VBox(children=(HTML(value='<b>Max Features</b>'), IntText(value=50000)))

VBox(children=(HTML(value='<b>Stop Words</b>'), Dropdown(options=('english', None), value='english')))

VBox(children=(HTML(value='<b>Ngram Min</b>'), IntText(value=1)))

VBox(children=(HTML(value='<b>Ngram Max</b>'), IntText(value=2)))

VBox(children=(HTML(value='<b>Select Base Classifiers (multi-select)</b>'), SelectMultiple(index=(0, 1, 2), op…

VBox(children=(HTML(value='<b>Calibration/Probability Control</b>'), Dropdown(index=3, options=('none', 'proba…

VBox(children=(VBox(children=(HTML(value='<b>Calibration Method</b>'), Dropdown(options=('isotonic', 'beta', '…

VBox(children=(HTML(value='<b>Hiclass Strategy</b>'), Dropdown(options=('lcppn', 'lcpn', 'lcpl'), value='lcppn…

VBox(children=(HTML(value='<b>Sgd Loss</b>'), Dropdown(options=('hinge', 'log_loss'), value='hinge')))

VBox(children=(HTML(value='<b>Sgd Class Weight</b>'), Dropdown(options=('balanced', None), value='balanced')))

VBox(children=(HTML(value='<b>Sgd Max Iter</b>'), IntText(value=1000)))

VBox(children=(HTML(value='<b>Random State</b>'), IntText(value=0)))

VBox(children=(HTML(value='<b>Rf Estimators</b>'), IntText(value=100)))

VBox(children=(HTML(value='<b>Rf Max Depth</b>'), IntText(value=0)))

VBox(children=(HTML(value='<b>Rf Criterion</b>'), Dropdown(options=('gini', 'entropy'), value='gini')))

In [3]:
# Save the dynamic values as JSON. Keep this in its own cell: in the cell above it
# would run before the widgets are edited, so newly entered values would not be captured.
import json
with open("./dynamic_values.json", "w") as json_file:
    json.dump(dynamic_values, json_file, indent=4)
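
One thing to keep in mind when the values are read back later: JSON has no tuple type and stores `None` as `null`. The sketch below shows the round trip on a hypothetical subset of the widget values, written to a temporary directory so it does not touch the real `dynamic_values.json`.

```python
import json
import os
import tempfile

# Hypothetical subset of the widget values; the real dictionary has more keys
dynamic_values = {
    "base_classifiers": ("SGD", "LogisticRegression"),  # tuple, as a multi-select returns
    "stop_words": None,
    "max_df": 0.5,
}

path = os.path.join(tempfile.mkdtemp(), "dynamic_values.json")
with open(path, "w") as json_file:
    json.dump(dynamic_values, json_file, indent=4)

with open(path) as json_file:
    restored = json.load(json_file)

print(restored["base_classifiers"])  # ['SGD', 'LogisticRegression'] (a list, not a tuple)
print(restored["stop_words"])        # None (stored as JSON null)
```

Code that only iterates over `base_classifiers` works with either type, but equality checks against a tuple would fail after reloading.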


## Model Building
Once you've configured your model using the panel above, run the next cell to:

- Train the selected models

- Evaluate performance

- View confusion matrices

- Save trained pipelines

In [35]:
# Executing the models based on the given input
# !pip install openpyxl #hiclass 
import hiclass
from run_workflows import (
    run_logreg_workflow_from_widgets,
    run_rf_workflow_from_widgets,
    run_sgd_workflow_from_widgets 
)

# Dictionary to collect trained models
trained_models = {}

# Iterate over selected classifiers from widgets
for classifier_name in dynamic_values['base_classifiers']:
    clf = classifier_name.lower()

    if clf == 'logisticregression':
        print("⚙️ Running Logistic Regression workflow...")
        model = run_logreg_workflow_from_widgets(dynamic_values)
        trained_models['LogisticRegression'] = model

    elif clf == 'randomforest':
        print("🌲 Running Random Forest workflow...")
        model = run_rf_workflow_from_widgets(dynamic_values)
        trained_models['RandomForest'] = model

    elif clf == 'sgd':
        print("⚙️ Running SGD workflow...")
        model = run_sgd_workflow_from_widgets(dynamic_values)
        trained_models['SGD'] = model

    else:
        print(f"❓ Unknown classifier selected: {classifier_name}")

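
The `if`/`elif` chain above could also be expressed as a lookup table, which keeps the name-to-workflow mapping in one place. In the sketch below the three `run_*` functions are stand-ins so the snippet runs on its own; in the notebook they would be the functions imported from `run_workflows`. The names `WORKFLOWS` and `train_selected` are hypothetical.

```python
# Stand-in workflow functions (hypothetical); the notebook imports the real ones
def run_logreg_workflow_from_widgets(values):
    return "logreg-model"

def run_rf_workflow_from_widgets(values):
    return "rf-model"

def run_sgd_workflow_from_widgets(values):
    return "sgd-model"


# Lower-cased widget name -> (key used in trained_models, workflow function)
WORKFLOWS = {
    "logisticregression": ("LogisticRegression", run_logreg_workflow_from_widgets),
    "randomforest": ("RandomForest", run_rf_workflow_from_widgets),
    "sgd": ("SGD", run_sgd_workflow_from_widgets),
}


def train_selected(values):
    """Run the workflow for every selected classifier and collect the models."""
    trained = {}
    for classifier_name in values["base_classifiers"]:
        entry = WORKFLOWS.get(classifier_name.lower())
        if entry is None:
            print(f"Unknown classifier selected: {classifier_name}")
            continue
        key, workflow = entry
        trained[key] = workflow(values)
    return trained


models = train_selected({"base_classifiers": ["SGD", "RandomForest", "XGBoost"]})
print(sorted(models))  # ['RandomForest', 'SGD']
```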

In [34]:
%%writefile processjob.py

## DGB code
# config = yaml.safe_load(config_filepath)
# print("Config file of Dynamic values file loaded.")
import json

# Load the widget values saved by the notebook
with open("./dynamic_values.json", 'r') as file:
    dynamic_values = json.load(file)
print("the dynamic values are",dynamic_values)

# Executing the models based on the given input
# !pip install openpyxl #hiclass 
import hiclass
import sys
sys.path.append("./helper")
sys.path.append("./modules")
from modules.run_workflows import (
    run_logreg_workflow_from_widgets,
    run_rf_workflow_from_widgets,
    run_sgd_workflow_from_widgets 
)

# Dictionary to collect trained models
trained_models = {}

# Iterate over selected classifiers from widgets
for classifier_name in dynamic_values['base_classifiers']:
    clf = classifier_name.lower()

    if clf == 'logisticregression':
        print("⚙️ Running Logistic Regression workflow...")
        model = run_logreg_workflow_from_widgets(dynamic_values)
        trained_models['LogisticRegression'] = model

    elif clf == 'randomforest':
        print("🌲 Running Random Forest workflow...")
        model = run_rf_workflow_from_widgets(dynamic_values)
        trained_models['RandomForest'] = model

    elif clf == 'sgd':
        print("⚙️ Running SGD workflow...")
        model = run_sgd_workflow_from_widgets(dynamic_values)
        trained_models['SGD'] = model

    else:
        print(f"❓ Unknown classifier selected: {classifier_name}")

import pandas as pd

def evaluate_and_compare_models(trained_models, X_test_text, y_test, output_path="outputs/model_comparison.xlsx"):
    """
    Evaluates all trained models, compares their metrics, and saves results to Excel.

    Parameters:
    - trained_models: dict of {model_name: model_instance}
    - X_test_text: text input for prediction
    - y_test: true hierarchical labels
    - output_path: where to save the comparison Excel file

    Returns:
    - summary_df: DataFrame of all model metrics
    """
    all_metrics = []

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for model_name, model in trained_models.items():
            print(f"🔍 Evaluating {model_name}...")
            model.predict(X_test_text)
            metrics = model.evaluate(y_test)

            flat_metrics = {
                'model': model_name,
                'hierarchical_f1': metrics['hierarchical_f1'],
                'hierarchical_precision': metrics['hierarchical_precision'],
                'hierarchical_recall': metrics['hierarchical_recall'],
                'exact_path_accuracy': metrics['exact_path_accuracy']
            }

            # Include level-wise accuracy
            for level, acc in metrics['level_accuracy'].items():
                flat_metrics[f'accuracy_{level}'] = acc

            all_metrics.append(flat_metrics)

            # Save full metric details as a sheet
            pd.DataFrame([flat_metrics]).to_excel(writer, sheet_name=model_name, index=False)

        # Write the summary through the same writer; calling to_excel(output_path)
        # after the writer closes would overwrite the per-model sheets
        summary_df = pd.DataFrame(all_metrics)
        summary_df.to_excel(writer, sheet_name="Summary", index=False)

    print(f"✅ Evaluation comparison saved to: {output_path}")
    return summary_df

# This assumes all models were trained and test data is ready
from helper.helper import prepare_text_and_labels, build_dataset_sources_and_labels, build_dataset_map

# Step 1: Get test data
source_dict, label_dict = build_dataset_sources_and_labels(dynamic_values)
dataset_map = build_dataset_map(source_dict, label_dict)
_, X_test_text, _, y_test = prepare_text_and_labels(dynamic_values, dataset_map)

# Step 2: Evaluate all
results_df = evaluate_and_compare_models(trained_models, X_test_text, y_test)
print(results_df)  # display() exists only in IPython/Jupyter, not in a plain script

Overwriting processjob.py


In [6]:
import pandas as pd

def evaluate_and_compare_models(trained_models, X_test_text, y_test, output_path="outputs/model_comparison.xlsx"):
    """
    Evaluates all trained models, compares their metrics, and saves results to Excel.

    Parameters:
    - trained_models: dict of {model_name: model_instance}
    - X_test_text: text input for prediction
    - y_test: true hierarchical labels
    - output_path: where to save the comparison Excel file

    Returns:
    - summary_df: DataFrame of all model metrics
    """
    all_metrics = []

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for model_name, model in trained_models.items():
            print(f"🔍 Evaluating {model_name}...")
            model.predict(X_test_text)
            metrics = model.evaluate(y_test)

            flat_metrics = {
                'model': model_name,
                'hierarchical_f1': metrics['hierarchical_f1'],
                'hierarchical_precision': metrics['hierarchical_precision'],
                'hierarchical_recall': metrics['hierarchical_recall'],
                'exact_path_accuracy': metrics['exact_path_accuracy']
            }

            # Include level-wise accuracy
            for level, acc in metrics['level_accuracy'].items():
                flat_metrics[f'accuracy_{level}'] = acc

            all_metrics.append(flat_metrics)

            # Save full metric details as a sheet
            pd.DataFrame([flat_metrics]).to_excel(writer, sheet_name=model_name, index=False)

        # Write the summary through the same writer; calling to_excel(output_path)
        # after the writer closes would overwrite the per-model sheets
        summary_df = pd.DataFrame(all_metrics)
        summary_df.to_excel(writer, sheet_name="Summary", index=False)

    print(f"✅ Evaluation comparison saved to: {output_path}")
    return summary_df

# This assumes all models were trained and test data is ready
from helper import prepare_text_and_labels, build_dataset_sources_and_labels, build_dataset_map

# Step 1: Get test data
source_dict, label_dict = build_dataset_sources_and_labels(dynamic_values)
dataset_map = build_dataset_map(source_dict, label_dict)
_, X_test_text, _, y_test = prepare_text_and_labels(dynamic_values, dataset_map)

# Step 2: Evaluate all
results_df = evaluate_and_compare_models(trained_models, X_test_text, y_test)
display(results_df)

🔍 Evaluating SGD...
🔍 Evaluating LogisticRegression...
🔍 Evaluating RandomForest...
✅ Evaluation comparison saved to: outputs/model_comparison.xlsx


| | model | hierarchical_f1 | hierarchical_precision | hierarchical_recall | exact_path_accuracy | accuracy_Cat1 | accuracy_Cat2 | accuracy_Cat3 |
|---|---|---|---|---|---|---|---|---|
| 0 | SGD | 0.902904 | 0.902904 | 0.902904 | 0.863365 | 0.940747 | 0.897524 | 0.87044 |
| 1 | LogisticRegression | 0.875967 | 0.875967 | 0.875967 | 0.815167 | 0.936436 | 0.869335 | 0.822131 |
| 2 | RandomForest | 0.887096 | 0.887096 | 0.887096 | 0.847336 | 0.929582 | 0.879615 | 0.852089 |


In [None]:
# Observation: hierarchical_f1, hierarchical_precision and hierarchical_recall are identical for every model
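
A likely explanation, assuming the metrics are micro-averaged over the labels in each path: precision divides the number of correctly predicted labels by the total number of predicted labels, and recall divides it by the total number of true labels. When every predicted path and every true path has the same depth (three levels here), the two denominators are equal, so precision, recall and F1 coincide. This is consistent with the table above, where each hierarchical score equals the mean of the three level accuracies, e.g. (0.940747 + 0.897524 + 0.87044) / 3 ≈ 0.902904 for SGD. The function below is an independent illustration with made-up labels, not the project's `evaluate` method.

```python
def hierarchical_prf(y_true, y_pred):
    """Micro-averaged hierarchical precision, recall and F1 over label paths."""
    overlap = predicted = actual = 0
    for true_path, pred_path in zip(y_true, y_pred):
        true_set, pred_set = set(true_path), set(pred_path)
        overlap += len(true_set & pred_set)
        predicted += len(pred_set)
        actual += len(true_set)
    precision = overlap / predicted
    recall = overlap / actual
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [("A", "A1", "A1x"), ("B", "B1", "B1x")]

# Full-depth predictions: both denominators are 6, so all three metrics match
y_pred = [("A", "A1", "A1y"), ("B", "B2", "B2x")]
print(hierarchical_prf(y_true, y_pred))  # (0.5, 0.5, 0.5)

# Truncated predictions: fewer predicted labels, so precision and recall diverge
y_pred_short = [("A", "A1"), ("B",)]
print(hierarchical_prf(y_true, y_pred_short))
```

The metrics would only differ if the models were allowed to stop predicting above the leaf level.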