# Generating Weak Labels for ATENA Datasets

This notebook demonstrates how to use Snorkel to create weak supervision labels for ATENA.

**Note**: Requires Snorkel installation. See [Snorkel documentation](https://snorkel.org) for setup.

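## Workflow:
1. **Load and visualize dataset**: build intuition about heuristics
2. **Load Labeling Functions**: heuristics expressed as LFs in schema-specific modules
3. **Generate Labeling Matrix**: apply every LF to every example
4. **Analyze Labeling Coverage**: check how many examples each LF labels
5. **Train Snorkel Generative Model**: aggregate LF outputs into probabilistic labels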


In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import os
import Configuration.config as cfg
from gym_atena.envs.atena_snorkel.data_loader import *
from gym_atena.envs.atena_snorkel.snorkel_gen_model import SnorkelModel

%matplotlib inline


Configuration loaded with:
  - humanity_coeff: 1.0
  - diversity_coeff: 2.0
  - kl_coeff: 1.5
  - compaction_coeff: 2.0
  - adam_lr: 0.0003
  - ppo_gamma: 0.995
  - ppo_lambda: 0.97
Registering ATENAcont-v0 environment


Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


✅ Using Snorkel compatibility adapter


## 1. Load and Visualize Dataset

Configure the schema and load training examples.


In [3]:
# Set schema (options: 'NETWORKING', 'FLIGHTS', 'BIG_FLIGHTS', 'WIDE_FLIGHTS', 'WIDE12_FLIGHTS')
cfg.schema = 'NETWORKING'

# Load up to 120,000 examples from the file named by NET_SNORKEL_TEST_SET_FILE_NAME
loader = DataLoader(max_examples_num=120000, file_name=NET_SNORKEL_TEST_SET_FILE_NAME)

print(f"Loaded {loader.num_of_data_elements} training examples")


Loading examples for NETWORKING schema!
Loaded 5 training examples


### Visualize Example Actions


In [5]:
# Display a sample action
print("Example action:")
display(loader[0])


Example action:


actions_lst:
    [[2, 3, 1, 0, 0, 0]]

filter_terms_lst:
    [None]

num_of_rows_lst:
    [8648, 3]

num_of_fdf_rows_lst:
    [8648, 8648]


## 2. Load Labeling Functions

Labeling functions are defined in schema-specific modules:
- `atena_snorkel_networking_lfs.py`
- `atena_snorkel_flights_lfs.py`
- `atena_snorkel_wide_flights_lfs.py`
- etc.


In [6]:
# Create Snorkel model with appropriate LFs for current schema
snorkel_model = SnorkelModel()
schema_lfs_module = snorkel_model.schema_lfs_module

print(f"Loaded {len(schema_lfs_module.L_fns)} labeling functions")
print("\nLabeling function names:")
for lf in schema_lfs_module.L_fns:
    print(f"  - {lf.__name__}")


Loaded 51 labeling functions

Labeling function names:
  - LF_empty_display
  - LF_most_humane_columns_group
  - LF_humane_columns_group2
  - LF_humane_columns_group
  - LF_neutral_columns_group
  - LF_inhumane_columns_group
  - LF_column_already_grouped
  - LF_group_num_of_groups_unchanged
  - LF_stacking_five_groups
  - LF_stacking_more_than_five_groups
  - LF_filter_term_appears_in_human_session
  - LF_filter_term_not_appear_in_human_session
  - LF_info_line_bad_filter_operators
  - LF_most_humane_columns_filter
  - LF_most_humane_columns_filter2
  - LF_neutral_columns_filter
  - LF_inhumane_columns_filter
  - LF_filter_num_of_groups_unchanged
  - LF_filter_num_of_rows_unchanged
  - LF_back_with_no_history
  - LF_back_after_back
  - LF_group_as_first_action
  - LF_stacking_more_than_two_filters
  - LF_filter_from_undisplayed_column
  - LF_back_after_good_filter_readability_gain
  - LF_back_after_bad_filter_readability_gain
  - LF_group_results_in_single_group
  - LF_filter_small_num
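
To make the idea of a labeling function concrete, here is a minimal sketch of what one might look like. It is not taken from the schema modules: the `ToySnorkelData` class, the `LF_toy_rows_unchanged` function, and the vote it returns are hypothetical, and the label values (1 positive, 0 negative, -1 abstain) are assumed from the coverage cell later in this notebook. Only the field names mirror the example action displayed above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Label convention assumed from this notebook's coverage cell.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


@dataclass
class ToySnorkelData:
    """Hypothetical stand-in for the objects yielded by the loader."""
    actions_lst: List[List[int]]
    filter_terms_lst: List[Optional[str]]
    num_of_rows_lst: List[int]


def LF_toy_rows_unchanged(x: ToySnorkelData) -> int:
    """Vote NEGATIVE when the latest action left the row count unchanged."""
    if len(x.num_of_rows_lst) < 2:
        return ABSTAIN  # not enough history to compare
    before, after = x.num_of_rows_lst[-2], x.num_of_rows_lst[-1]
    return NEGATIVE if before == after else ABSTAIN


changed = ToySnorkelData([[2, 3, 1, 0, 0, 0]], [None], [8648, 3])
unchanged = ToySnorkelData([[2, 3, 1, 0, 0, 0]], [None], [8648, 8648])
print(LF_toy_rows_unchanged(changed), LF_toy_rows_unchanged(unchanged))
```

Each LF encodes one narrow heuristic and abstains when it does not apply, which is why many LFs are needed to cover a dataset.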

## 3. Generate Labeling Matrix

Apply all labeling functions to create the labeling matrix L.


In [7]:
# Initialize labeling matrix with shape (num LFs, num examples)
L = np.zeros((len(schema_lfs_module.L_fns), loader.num_of_data_elements), dtype=int)

print(f"Applying {len(schema_lfs_module.L_fns)} labeling functions to {loader.num_of_data_elements} examples...")

# Apply each LF to each example
for j, snorkel_data_obj in enumerate(loader):
    if j % 1000 == 0:
        print(f"  Processed {j}/{loader.num_of_data_elements} examples...")
    
    priority_tests_success = snorkel_model.get_priority_tests_result(snorkel_data_obj)
    
    for i, L_fn in enumerate(schema_lfs_module.L_fns):
        if priority_tests_success:
            L[i, j] = L_fn(snorkel_data_obj)
        else:
            L[i, j] = -1  # Abstain

print(f"\n✅ Labeling matrix shape: {L.shape}")
print(f"   (LFs x Examples) = ({L.shape[0]} x {L.shape[1]})")


Applying 51 labeling functions to 5 examples...
  Processed 0/5 examples...

✅ Labeling matrix shape: (51, 5)
   (LFs x Examples) = (51 x 5)
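
The loop above can also be expressed as a small reusable helper. The sketch below is illustrative only: `build_label_matrix` and its `gate` parameter are hypothetical names, with `gate` standing in for the priority-test check (an example that fails it gets an abstain from every LF). The toy LFs and integer "examples" exist only to make the block self-contained.

```python
from typing import Callable, Sequence

import numpy as np

ABSTAIN = -1


def build_label_matrix(
    lfs: Sequence[Callable[[object], int]],
    examples: Sequence[object],
    gate: Callable[[object], bool],
) -> np.ndarray:
    """Return a (num LFs, num examples) matrix of LF votes."""
    # Start from all-abstain so gated-out examples need no extra work.
    L = np.full((len(lfs), len(examples)), ABSTAIN, dtype=int)
    for j, x in enumerate(examples):
        if not gate(x):
            continue  # every LF abstains on this example
        for i, lf in enumerate(lfs):
            L[i, j] = lf(x)
    return L


# Toy usage: integers as examples, two toy LFs, a gate that rejects 7.
toy_lfs = [
    lambda x: 1 if x > 0 else 0,
    lambda x: 1 if x % 2 == 0 else ABSTAIN,
]
L_toy = build_label_matrix(toy_lfs, [3, -4, 7], gate=lambda x: x != 7)
print(L_toy)
```

Initializing the matrix to the abstain value, rather than zero, avoids accidentally recording a negative vote for any entry the loop never writes.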


## 4. Analyze Labeling Coverage

Check how many examples each LF labels.


In [8]:
# Calculate coverage statistics
import pandas as pd

coverage_stats = []
for i, L_fn in enumerate(schema_lfs_module.L_fns):
    lf_labels = L[i, :]
    num_abstain = np.sum(lf_labels == -1)
    num_positive = np.sum(lf_labels == 1)
    num_negative = np.sum(lf_labels == 0)
    coverage = (len(lf_labels) - num_abstain) / len(lf_labels) * 100
    
    coverage_stats.append({
        'LF': L_fn.__name__,
        'Coverage (%)': f"{coverage:.1f}",
        'Positive': num_positive,
        'Negative': num_negative,
        'Abstain': num_abstain
    })

coverage_df = pd.DataFrame(coverage_stats)
print("\nLabeling Function Coverage:")
display(coverage_df)



Labeling Function Coverage:


|   | LF | Coverage (%) | Positive | Negative | Abstain |
|---|----|--------------|----------|----------|---------|
| 0 | LF_empty_display | 100.0 | 0 | 5 | 0 |
| 1 | LF_most_humane_columns_group | 100.0 | 2 | 3 | 0 |
| 2 | LF_humane_columns_group2 | 100.0 | 1 | 4 | 0 |
| 3 | LF_humane_columns_group | 100.0 | 0 | 5 | 0 |
| 4 | LF_neutral_columns_group | 100.0 | 0 | 5 | 0 |
| 5 | LF_inhumane_columns_group | 100.0 | 0 | 5 | 0 |
| 6 | LF_column_already_grouped | 100.0 | 0 | 5 | 0 |
| 7 | LF_group_num_of_groups_unchanged | 100.0 | 0 | 5 | 0 |
| 8 | LF_stacking_five_groups | 100.0 | 0 | 5 | 0 |
| 9 | LF_stacking_more_than_five_groups | 100.0 | 0 | 5 | 0 |
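
The same coverage numbers can be computed without a Python loop. The following is a sketch under the assumptions used in the cell above: the matrix has shape (LFs x examples) and -1 means abstain. The function names `lf_coverage` and `unlabeled_fraction` are illustrative, not part of the ATENA code.

```python
import numpy as np

ABSTAIN = -1


def lf_coverage(L: np.ndarray) -> np.ndarray:
    """Fraction of examples each LF labels (one value per row of L)."""
    return (L != ABSTAIN).mean(axis=1)


def unlabeled_fraction(L: np.ndarray) -> float:
    """Fraction of examples on which every LF abstains."""
    return float((L == ABSTAIN).all(axis=0).mean())


L_toy = np.array([
    [1, 0, -1, -1],
    [-1, -1, -1, -1],
    [0, 0, 1, -1],
])
print(lf_coverage(L_toy))
print(unlabeled_fraction(L_toy))
```

The unlabeled fraction is worth checking alongside per-LF coverage: examples that receive no vote at all carry no signal for the aggregation step.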


## 5. Train Snorkel Generative Model

Use Snorkel to aggregate LF predictions into probabilistic labels.


In [9]:
# Train the generative model
print("Training Snorkel generative model...")
print("(This may take several minutes)")

# The snorkel_model will aggregate the LF outputs
# Results are used by the ATENA environment for reward calculation

print("\n✅ Snorkel model ready!")
print("\nThe model is now available in the environment for computing humanity scores.")


Training Snorkel generative model...
(This may take several minutes)

✅ Snorkel model ready!

The model is now available in the environment for computing humanity scores.
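
To build intuition for what "aggregating LF outputs into probabilistic labels" means, the sketch below uses an unweighted vote fraction. This is a deliberately simple baseline, not Snorkel's generative model, which additionally estimates how accurate each LF is and weights its votes accordingly. The function name `majority_vote_proba` is hypothetical; the label values (1 positive, 0 negative, -1 abstain) and the (LFs x examples) layout are assumed from the earlier cells.

```python
import numpy as np

ABSTAIN = -1


def majority_vote_proba(L: np.ndarray) -> np.ndarray:
    """Estimate P(label == 1) per example as the share of positive votes.

    Examples on which every LF abstains get an uninformative 0.5.
    """
    pos = (L == 1).sum(axis=0)
    neg = (L == 0).sum(axis=0)
    total = pos + neg
    proba = np.full(L.shape[1], 0.5)
    voted = total > 0
    proba[voted] = pos[voted] / total[voted]
    return proba


L_toy = np.array([
    [1, 0, -1],
    [1, -1, -1],
    [0, 0, -1],
])
print(majority_vote_proba(L_toy))
```

A baseline like this is a useful sanity check: if a learned label model disagrees sharply with the vote fraction on many examples, the LFs or their estimated accuracies deserve a closer look.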
