
 **Creating Bangla WSD Dataset from Gold_Annotated dataset**

 This script sweeps a folder of gold-annotated sentence files (CSV/XLSX),
 normalizes and validates their columns, maps each (lemma, sense_id) to a
 human-readable sense text, assigns robust sentence IDs, and writes a single
 consolidated CSV ready for Word Sense Disambiguation (WSD) / WiC-style
 dataset construction.
 
   1) Locate all "*_labelled_sentences.csv" / ".xlsx" files in INPUT_FOLDER.
   2) Read each file (CSV preferred, fallback to Excel) with strict column checks.
   3) Coerce sense labels to integer `sense_id` when possible.
   4) Attach sense glosses using SENSE_MAP (blank if unknown).
   5) Concatenate all rows, sort by (lemma, sense_id), add global running
      index + stable `sent_id`, and save as UTF-8 with BOM for Bangla.
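
Steps 3 and 4 can be illustrated in isolation. The following is a minimal sketch rather than the script itself: `TOY_SENSE_MAP`, `coerce_sense_id`, and `lookup_gloss` are hypothetical names used only here, and the toy map holds a single lemma copied from `SENSE_MAP`.

```python
# Toy stand-in for SENSE_MAP; the real dictionary covers every lemma.
TOY_SENSE_MAP = {"ফল": {0: "Fruit", 1: "Result, outcome"}}


def coerce_sense_id(raw):
    """Return an integer sense id, or None when the label cannot be parsed."""
    try:
        return int(float(raw))  # accepts 0, "1", "1.0"
    except (TypeError, ValueError, OverflowError):
        return None  # None, free text, NaN, and inf all end up here


def lookup_gloss(lemma, raw_sense):
    """Return the gloss for (lemma, sense), or '' when either part is unknown."""
    sid = coerce_sense_id(raw_sense)
    if sid is None:
        return ""
    return TOY_SENSE_MAP.get(lemma, {}).get(sid, "")


for raw in [0, "1", "1.0", "unknown", None]:
    print(repr(raw), "->", repr(lookup_gloss("ফল", raw)))
```

Returning an empty string instead of raising keeps unannotated rows in the table, so they can be filtered later rather than lost at load time.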

In [None]:
import os                               # Standard library for paths, filenames, and OS utilities
import glob                             # Standard library for filename pattern expansion
import pandas as pd                     # Pandas for data loading, cleaning, and saving


# paths 
INPUT_FOLDER = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\Gold_dataset"  # Root directory containing gold dataset files
OUTPUT_CSV   = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wsd_dataset.csv"  # Consolidated output CSV path


# Dictionary mapping lemma -> sense_id -> human-readable sense gloss
SENSE_MAP = {                           
    'অর্থ': {0:'Money, finance, wealth', 1:'Meaning, sense'},                               
    'আগুন': {0:'Passion, intensity, anger', 1:'Flame'},                                     
    'এঁটে': {0:'To silence, to seal', 1:'To fasten, stick, close'},                          
    'কপাল': {0:'Forehead', 1:'Fate, luck, destiny'},
    'কাটা': {0:'To spend (time), to pass (time)', 1:'To cut, chop'},                       
    'গভীর': {0:'Deep', 1:'Profound, intense'},                                              
    'গুলি': {0:'Being shot/firing a gun', 1:'Bullet', 2:'Shooting/noise'},                   
    'গোলা': {0:'Batter, mixture', 1:'Granary, storage bin, a projectile'},                  
    'ঘোর': {0:'State of being under the influence', 1:'To wander, roam, travel, visit'},     
    'চড়': {0:'High price,  value', 1:'Get on, to ride, to board'},                         
    'চাবি': {0:'Key', 1:'Keyring, bunch of keys'},                                        
    'চোখ': {0:'Eye', 1:'Sight, Gaze, or Attention'},                                         
    'ঝোলা': {0:'To hang or Suspend something', 1:'A bag'},                                   
    'ঠান্ডা': {0:'Feeling ill', 1:'Low temperature; Chilly, Wintry'},                        
    'ডাক': {0:'Mail or post', 1:'Call to action', 2:'call,summon'},                          
    'তাড়া': {0:'To drive away', 1:'Chase a target'},                                       
    'দম': {0:'Stamina', 1:'Steamed', 2:'Out of breath'},                                     
    'দল': {0:'Team', 1:'Party or Group'},                                                    
    'দাঁড়া': {0:'Wait', 1:'Pause for a moment', 2:'To establish', 3:'To stand'},           
    'ধন': {0:'Wealth, money, riches', 1:'Treasure, resource'},                              
    'ধরা': {0:'To assume', 1:'To catch, capture, get caught'},                               
    'পটল': {0:'Vegetable', 1:'To die,to pass away'},                                         
    'পদ': {0:'Dish,  recipe', 1:'Word, phrase', 2:'Position'},                             
    'পর': {0:'To wear,  put on', 1:'After (in time), following an event, Later'},             
    'পাকা': {0:'Experienced', 1:'Ripe (fruit, crop)'},                                      
    'পাড়া': {0:'To lay', 1:'To pick, pluck', 2:'Locality'},                                
    'পালা': {0:'A play, Act, performance', 1:'To escape run away, or Flee'},                
    'পিচ': {0:'Playing field', 1:'Road surface', 2:'Drug'},                                 
    'ফল': {0:'Fruit', 1:'Result, outcome'},                                                
    'বর্ণ': {0:'Colour', 1:'Letter, character', 2:'Caste, race'},                      
    'বল': {0:'Ball', 1:'To say, speak, state'},                                            
    'বিন্দু': {0:'A point or position (in space, geometry)', 1:'A dot, spot, drop, small particle'},  
    'মধু': {0:'Honey', 1:'Sweetness, pleasantness, delight'},                                
    'মাটি': {0:'Country, Land', 1:'Soil, earth, ground'},                                    
    'মাথা': {0:'End', 1:'In mind', 2:'Head'},                                                
    'মুখ': {0:'Confronting', 1:'Face', 2:'Speaking'},                                      
    'রূপ': {0:'Beauty, appearance', 1:'Role, capacity, function, status'},                   
    'সারা': {0:'To cure, heal', 1:'Whole, entire, all over'},                               
    'হাত': {0:'Involvement, possession, skill', 1:'Hand'},                                 
    'তুলা': {0:'To take (a photo), to capture', 1:'Cotton (processed, used as material)', 2:'Cotton (as an agricultural crop or fiber)'},
    'পড়া': {0:'To read, study', 1:'To fall, collapse'},                                   
    'পথ': {0:'Way, method, or approach', 1:'Road, route, or physical path'},               
    'বাড়ি': {0:'House or home', 1:'Household, family, or members of the household'},      
}


# Function: robust_read
# Purpose : Load a labelled sentences file from disk, accepting either CSV or Excel,
# enforce column names/availability, and return a standardized DataFrame with the
# columns: ['lemma', 'sentence', 'sense', 'start', 'end'].
# Inputs  :
#   path (str): Path to a *_labelled_sentences.csv/.xlsx file.
# Outputs :
#   pandas.DataFrame: A dataframe containing exactly the columns
#   ['lemma', 'sentence', 'sense', 'start', 'end'] with normalized types.
def robust_read(path: str) -> pd.DataFrame:
    """Read csv if possible; else excel. Ensure needed columns exist."""  # Docstring summarizing behavior
    try:
        df = pd.read_csv(path, encoding='utf-8')            # Try fast CSV read with UTF-8
    except Exception:
        df = pd.read_excel(path)                            # Fallback to Excel if CSV reading fails
    df.columns = [c.strip().lower() for c in df.columns]    # Normalize column names: strip spaces, lowercase
    rename_map = {                                          # Map possible source column names to target names
        'lemma': 'lemma',                                   # Keep 'lemma' as 'lemma'
        'sentence': 'sentence',                             # Keep 'sentence' as 'sentence'
        'sense': 'sense',                                   # If already 'sense', keep it
        'sense_id': 'sense',                                 # If source uses 'sense_id', normalize to 'sense'
        'start': 'start',                                   # Keep 'start'
        'end': 'end'                                        # Keep 'end'
    }
    df = df.rename(columns=rename_map)                      # Apply renaming map
    missing = [c for c in ['lemma', 'sentence', 'sense', 'start', 'end'] if c not in df.columns]  # Detect missing required cols
    if missing:                                             # If any required columns are missing
        raise ValueError(f"{os.path.basename(path)} is missing columns: {missing}")  # Raise a clear error
    return df[['lemma', 'sentence', 'sense', 'start', 'end']]  # Return standardized subset of columns


# Function: sense_text
# Purpose : Given a lemma and a raw sense identifier, coerce the identifier to int
# and look up a human-readable gloss from SENSE_MAP. If coercion fails or the pair
# is unknown, return an empty string (so downstream can keep blanks safely).
# Inputs  :
#   lemma (str): The lemma key used in SENSE_MAP.
#   sense_id: A value convertible to integer sense ID (e.g., '0', 0, '0.0').
# Outputs :
#   str: The human-readable gloss.
def sense_text(lemma: str, sense_id):
    try:
        i = int(float(sense_id))                            # Coerce to integer; going through float also accepts '0.0'
    except Exception:
        return ''                                           # On failure, return blank text
    return SENSE_MAP.get(lemma, {}).get(i, '')              # Lookup gloss by lemma and integer id (blank if not found)


# sweep folder, merge, format
all_rows = []                                               # Will collect per-file DataFrames
files = sorted(                                             # Build a sorted list of candidate files
    glob.glob(os.path.join(INPUT_FOLDER, "*_labelled_sentences.csv"))  # Match all CSV labelled files
    + glob.glob(os.path.join(INPUT_FOLDER, "*_labelled_sentences.xlsx"))  # And all XLSX labelled files
)

print(f"Found {len(files)} files.")                         # Report how many files we found

for fp in files:                                            # Iterate over each candidate file path
    try:
        df = robust_read(fp)                                # Load and standardize the file
    except Exception as e:
        print(f"Skip {os.path.basename(fp)} -> {e}")        # If bad, log and continue
        continue

    # Ensure sense_id is integer when possible
    def to_int_safe(x):                                     # Helper to coerce sense labels to integers robustly
        try:
            return int(float(x))                            # Accept '0', '0.0', etc.
        except Exception:
            return None                                     # Return None if not convertible

    df['sense_id'] = df['sense'].apply(to_int_safe)         # Create integer `sense_id` column
    df = df.dropna(subset=['lemma', 'sentence']).copy()     # Drop rows that lack core fields

    # Map senses (text) from the SENSE_MAP; blank if not found
    df['senses'] = [sense_text(lem, sid) for lem, sid in zip(df['lemma'], df['sense_id'])]  # Attach gloss text

    # Keep requested columns only
    out = df[['lemma', 'sentence', 'sense_id', 'start', 'end', 'senses']].copy()  # Final per-file subset
    all_rows.append(out)                                      # Stash for later concatenation

if not all_rows:                                              # If nothing was read successfully
    raise SystemExit("No valid files read. Check INPUT_FOLDER and filenames.")  # Exit early with message



# combine, sort within each lemma by sense_id, then assign global running sent_id
final_df = pd.concat(all_rows, ignore_index=True)            # Concatenate all per-file DataFrames into one

# Sort: for each lemma, sense 0 first, then 1,2,3,...
# (NaN sense_id will be placed at the end by default)
final_df = final_df.sort_values(by=['lemma', 'sense_id'], kind='mergesort').reset_index(drop=True)  # Stable sort for reproducibility

# Global running index that NEVER resets per lemma/file
final_df['global_idx'] = range(len(final_df))                # Add a monotonic global index
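final_df['sense_id'] = final_df['sense_id'].astype('Int64')  # Nullable int, so IDs read "bn_0_12" rather than "bn_0.0_12" when some labels are missing
final_df['sent_id'] = [f"bn_{sid}_{idx}" for sid, idx in zip(final_df['sense_id'], final_df['global_idx'])]  # Deterministic sentence IDs for a fixed set of input files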

# Reorder columns
final_df = final_df[['lemma', 'sentence', 'sense_id', 'start', 'end', 'sent_id', 'senses']]  # Column order as required

# Save
final_df.to_csv(OUTPUT_CSV, index=False, encoding='utf-8-sig')  # Write consolidated CSV with BOM for Bangla compatibility
print(f"Saved {len(final_df)} rows to {OUTPUT_CSV}")            # Report success with row count

Found 43 files.
Saved 4427 rows to C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\bangla_wsd.csv
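
Because the file is written with `encoding='utf-8-sig'`, any later reader should decode it the same way; otherwise the byte-order mark stays attached to the first header name. The sketch below shows this on a two-row toy table built in memory (the rows, spans, and IDs are invented for illustration, not taken from the dataset), and adds a quick uniqueness check on `sent_id`.

```python
import csv
import io
from collections import Counter

# Invented toy rows in the same column layout as the consolidated CSV.
header = "lemma,sentence,sense_id,start,end,sent_id,senses\n"
rows = (
    'ফল,গাছে ফল ধরেছে,0,5,7,bn_0_0,Fruit\n'
    'ফল,পরীক্ষার ফল ভালো,1,9,11,bn_1_1,"Result, outcome"\n'
)
raw = (header + rows).encode("utf-8-sig")  # BOM + UTF-8, as to_csv does

# Decoding as plain UTF-8 leaves the BOM glued to the first column name.
first_plain = next(csv.reader(io.StringIO(raw.decode("utf-8"))))[0]
first_aware = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))[0]
print(repr(first_plain))
print(repr(first_aware))

# Sanity checks on the decoded table.
records = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
id_counts = Counter(r["sent_id"] for r in records)
duplicates = [sid for sid, n in id_counts.items() if n > 1]
per_sense = Counter((r["lemma"], int(r["sense_id"])) for r in records)
print("duplicate sent_ids:", duplicates)
print("rows per (lemma, sense_id):", dict(per_sense))
```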


**Creating Bangla WiC Dataset  — Capped Pairing (MAX_PARTNERS_PER_SENT = 32)**

This script transforms a gold-annotated Bangla WSD table into a Word-in-Context (WiC) benchmark. It validates and normalizes the input schema, generates positive (same-sense) and negative (cross-sense) sentence pairs per lemma under a strict per-sentence partner cap, assigns sentences deterministically to train/dev/test using seeded hashing to prevent sentence overlap across splits, balances labels within each split, and writes the results to WiC-formatted JSON files along with a summary report. All randomness is seeded for reproducibility.

   1) Read the WSD CSV (lemma, sentence, sense_id, start, end, sent_id[, senses]).
   2) Normalize column names and coerce types; keep only rows with valid sense_id.
   3) Group rows by lemma to build pairs within each lemma’s examples.
   4) Within a lemma, form all unique pairs among sentences sharing the same sense_id.
   5) Within a lemma, form pairs across different sense_id values (cross-sense).
   6) Canonicalize pair order by sent_id and deduplicate using unordered pair keys.
   7) Enforce MAX_PARTNERS_PER_SENT so no sentence exceeds the allowed partner count.
   8) Hash each sent_id with the global SEED and map to train/dev/test proportions.
   9) Keep only pairs whose two sentences fall in the same split, ensuring no overlap.
   10) Re-apply the per-sentence cap independently within each split.
   11) Downsample the larger class (if needed) to target ~50/50 positives/negatives per split.
   12) Shuffle each split with the seeded RNG.
   13) Write wic_train.json, wic_dev.json, wic_test.json (WiC entries include sentences, spans, labels, sense ids, and any sense text).
   14) Emit summary.json with totals and class balance per split.
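
Steps 8 and 9 are the part that prevents leakage between splits. As a minimal sketch, assuming IDs of the form `bn_<sense>_<index>` (the toy IDs and toy pairs below are invented), each sentence gets its split from a random generator seeded with the global seed and its own ID, and a pair survives only when both sentences receive the same split.

```python
import random
from collections import Counter

SEED = 42
TRAIN_P, DEV_P, TEST_P = 0.70, 0.15, 0.15


def sentence_bin(sent_id: str) -> str:
    """Map a sentence id to a split using an RNG seeded with SEED and the id."""
    x = random.Random(f"{SEED}|{sent_id}").random()  # uniform in [0, 1)
    if x < TRAIN_P:
        return "train"
    if x < TRAIN_P + DEV_P:
        return "dev"
    return "test"


# Invented ids, only to exercise the function.
toy_ids = [f"bn_0_{i}" for i in range(200)]
bins = {sid: sentence_bin(sid) for sid in toy_ids}
print(Counter(bins.values()))

# Keep a pair only when both of its sentences share a split.
toy_pairs = [(toy_ids[i], toy_ids[i + 1]) for i in range(len(toy_ids) - 1)]
kept = [(a, b) for a, b in toy_pairs if bins[a] == bins[b]]
print(f"kept {len(kept)} of {len(toy_pairs)} toy pairs")
```

Since the assignment depends only on the seed and the ID, a sentence keeps its split no matter which pairs it takes part in or in what order the pairs are processed.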

In [None]:
import os                                             # Filesystem paths, directory creation
import json                                           # JSON serialization for output files
import random                                         # Deterministic shuffling and hashing for splits
from collections import defaultdict, Counter          # Grouping by sense and tracking partner caps
import pandas as pd                                   # Tabular data loading and processing

# Configuration
WSD_CSV = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wsd_dataset.csv'    # your WSD-style file with: lemma,sentence,sense_id,start,end,sent_id[,senses]
OUT_DIR = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_capped'  # Output directory for WiC JSONs and summary
os.makedirs(OUT_DIR, exist_ok=True)                  # Create output directory if it does not exist

SEED = 42                                            # Global seed for reproducible randomness
TRAIN_P, DEV_P, TEST_P = 0.70, 0.15, 0.15            # Proportions for train/dev/test splits (sum to 1)

# WiC constraints
MAX_PARTNERS_PER_SENT = 32   # Cap: each sentence appears with at most 32 partners (across the whole dataset)
GLOBAL_BALANCE = True        # Make each split 50/50 by downsampling the larger class

rng = random.Random(SEED)                            # Dedicated RNG instance seeded for determinism


# Function: load_wsd
# Purpose : Load the consolidated WSD CSV, validate required columns, normalize names,
# coerce data types, keep only annotated rows, and return a clean DataFrame.
# Inputs  :
#   path (str): Path to the WSD CSV file.
# Returns :
#   pandas.DataFrame: Cleaned dataframe with required columns and types.
def load_wsd(path):                                   # Define loader for WSD CSV
    df = pd.read_csv(path, encoding='utf-8-sig')      # Read CSV (UTF-8 with BOM handling)
    cols = {c.lower(): c for c in df.columns}         # Map lowercase column names to originals

    required = ['lemma','sentence','sense_id','start','end','sent_id']  # Required schema
    missing = [c for c in required if c not in cols]  # Identify missing required columns
    if missing:                                       # If any required column is missing
        raise ValueError(f"Missing required columns: {missing}")  # Fail fast with clear message

    df = df.rename(columns={                          # Normalize to canonical column names
        cols['lemma']: 'lemma',
        cols['sentence']: 'sentence',
        cols['sense_id']: 'sense_id',
        cols['start']: 'start',
        cols['end']: 'end',
        cols['sent_id']: 'sent_id'
    })
    # optional sense text column
    if 'senses' in cols:                              # If a senses column exists in source
        df['senses'] = df[cols['senses']].fillna('').astype(str)  # Keep it as text; blank glosses stay '' instead of becoming 'nan'
    else:                                             # Otherwise
        df['senses'] = ''                             # Initialize a blank senses column

    # types + keep only annotated
    df['sense_id'] = pd.to_numeric(df['sense_id'], errors='coerce').astype('Int64')  # Coerce to nullable int
    df = df[df['sense_id'].notna()].copy()            # Keep rows with a valid sense_id
    df['sense_id'] = df['sense_id'].astype(int)       # Cast to concrete int
    df['start'] = pd.to_numeric(df['start'], errors='coerce').fillna(0).astype(int)  # Start index as int
    df['end']   = pd.to_numeric(df['end'],   errors='coerce').fillna(0).astype(int)  # End index as int
    df['sent_id'] = df['sent_id'].astype(str)         # Sentence ID as string
    df['lemma']   = df['lemma'].astype(str)           # Lemma as string
    df['sentence']= df['sentence'].astype(str)        # Sentence text as string
    return df                                         # Return cleaned dataframe


# Sentence-based hashing -> split (no sentence overlap)

# Function: sentence_bin
# Purpose : Assign a sentence deterministically to 'train', 'dev', or 'test' using a
# seeded random draw keyed by (SEED|sent_id). Ensures no sentence appears
# in more than one split.
# Inputs  :
#   sent_id (str): Unique sentence identifier.
# Returns :
#   str: One of {'train', 'dev', 'test'}.
def sentence_bin(sent_id: str) -> str:                # Deterministic split assignment
    r = random.Random()                               # Fresh RNG instance
    r.seed(f"{SEED}|{sent_id}")                       # Seed with global seed and sent_id
    x = r.random()                                    # Draw uniform value in [0, 1)
    if x < TRAIN_P: return 'train'                    # Map into train range
    if x < TRAIN_P + DEV_P: return 'dev'              # Map into dev range
    return 'test'                                     # Otherwise assign to test


# Function: make_pair
# Purpose : Create a WiC pair dictionary from two sentence records plus a binary label.
# Enforces a canonical order by sent_id to avoid duplicate pair permutations.
# Inputs  :
#   r1, r2 (row-like): Records with fields used below.
#   label (int): 1 if same sense, 0 if different sense.
# Returns :
#   dict: WiC pair entry with spans, labels, and optional sense texts.
def make_pair(r1, r2, label):                         # Construct a single WiC pair
    # canonicalize order by sent_id to avoid duplicate pairs
    a, b = (r1, r2) if r1['sent_id'] <= r2['sent_id'] else (r2, r1)  # Deterministic ordering
    return {                                          # Build pair payload
        "lemma": a['lemma'],                          # Lemma shared by both sentences
        "sentence1": a['sentence'],                   # First sentence text
        "sentence2": b['sentence'],                   # Second sentence text
        "sent_id1": a['sent_id'],                     # First sentence ID
        "sent_id2": b['sent_id'],                     # Second sentence ID
        "start1": int(a['start']),                    # Span start in sentence1
        "end1": int(a['end']),                        # Span end in sentence1
        "start2": int(b['start']),                    # Span start in sentence2
        "end2": int(b['end']),                        # Span end in sentence2
        "label": int(label),                          # 1 = same sense, 0 = different sense
        "sense_id1": int(a['sense_id']),              # Sense ID for sentence1
        "sense_id2": int(b['sense_id']),              # Sense ID for sentence2
        "sense1": a.get('senses', ''),                # Optional sense text for sentence1
        "sense2": b.get('senses', '')                 # Optional sense text for sentence2
    }


# Function: build_pairs_capped
# Purpose : For a given lemma subset, generate positive pairs (within-sense) and
# negative pairs (cross-sense) while enforcing a per-sentence partner cap and
# preventing duplicate unordered pairs.
# Inputs  :
#   df_lem (DataFrame): Rows for a single lemma.
#   max_partners (int or None): Partner cap per sentence; None disables the cap.
# Returns :
#   (list, list): Lists of positive and negative pair dicts.
def build_pairs_capped(df_lem, max_partners=MAX_PARTNERS_PER_SENT):  # Pair generation with cap
    # group rows by sense
    by_sense = defaultdict(list)                    # Map sense_id → list of rows
    for _, r in df_lem.iterrows():                  # Iterate lemma rows
        by_sense[int(r['sense_id'])].append(r)      # Bucket row under its sense

    senses = sorted(by_sense.keys())                # Sorted list of senses for determinism
    for s in senses:                                # For each sense bucket
        rng.shuffle(by_sense[s])                    # Shuffle rows to diversify pairings

    used_pairs = set()        # Unordered (sent_id, sent_id) keys already emitted, for deduplication
    partner_count = Counter() # Number of partners assigned so far to each sentence
    pos_pairs, neg_pairs = [], []                   # Accumulators for outputs

    unlimited = (max_partners is None)              # Flag indicating no cap

    # helper: try adding a pair if both sentences under cap (or unlimited)
    def try_add_pair(r1, r2, label):                # Local helper to add a pair under constraints
        if r1['sent_id'] == r2['sent_id']:          # Skip pairing a sentence with itself
            return                                  # No action
        key = tuple(sorted((r1['sent_id'], r2['sent_id'])))  # Unordered key for deduplication
        if key in used_pairs:                       # If already paired
            return                                  # Skip duplicate
        if not unlimited:                           # Enforce partner caps when enabled
            if partner_count[r1['sent_id']] >= max_partners: return  # Respect cap for r1
            if partner_count[r2['sent_id']] >= max_partners: return  # Respect cap for r2
        p = make_pair(r1, r2, label)                # Create WiC pair
        used_pairs.add(key)                         # Record the pair key
        partner_count[r1['sent_id']] += 1           # Increment partner count for r1
        partner_count[r2['sent_id']] += 1           # Increment partner count for r2
        (pos_pairs if label == 1 else neg_pairs).append(p)  # Append to positive or negative list

    # positives
    for s in senses:                                # For each sense bucket
        rows = by_sense[s]                          # Rows within the same sense
        for i in range(len(rows)):                  # Pairwise iteration (upper triangle)
            for j in range(i+1, len(rows)):         # Avoid symmetric duplicates
                try_add_pair(rows[i], rows[j], label=1)  # Add positive pair

    # negatives (cross-sense)
    for s in senses:                                # For each sense bucket
        rows_s = by_sense[s]                        # Rows in the current sense
        others = [r for t in senses if t != s for r in by_sense[t]]  # All rows in other senses
        rng.shuffle(others)                          # Shuffle cross-sense candidates
        for r1 in rows_s:                            # For each row in current sense
            if not unlimited and partner_count[r1['sent_id']] >= max_partners:  # Cap check for r1
                continue                             # Skip if capped
            for r2 in others:                        # Iterate cross-sense rows
                if not unlimited and partner_count[r1['sent_id']] >= max_partners:  # Re-check during loop
                    break                             # Stop if r1 reached cap
                try_add_pair(r1, r2, label=0)        # Add negative pair

    return pos_pairs, neg_pairs                      # Return both lists


# Function: split_and_balance
# Purpose : Assign pairs to train/dev/test based on deterministic per-sentence hashing
# (both sentences must land in the same split), enforce an optional per-sentence
# cap within each split, and balance labels per split if requested.
# Inputs  :
#   pairs (list): All WiC pair dicts.
#   per_sentence_cap (int or None): Cap per sentence inside each split; None disables it.
#   global_balance (bool): Whether to balance labels per split.
# Returns :
#   dict: {'train': [...], 'dev': [...], 'test': [...]} with processed pairs.
def split_and_balance(pairs, per_sentence_cap=MAX_PARTNERS_PER_SENT, global_balance=True):  # Split and balance
    # keep only pairs whose two sentences hash to the same split
    tmp = {'train': [], 'dev': [], 'test': []}       # Containers for intermediate splits
    for p in pairs:                                  # Iterate all pairs
        b1, b2 = sentence_bin(p['sent_id1']), sentence_bin(p['sent_id2'])  # Compute bins for both sentences
        if b1 == b2:                                 # Only keep if both map to the same split
            tmp[b1].append(p)                        # Append to that split

    # enforce per-split cap (skip if None)
    def enforce_split_cap(arr):                      # Local helper to apply per-sentence cap
        if per_sentence_cap is None:                 # If cap is disabled
            return list(arr)                         # Return all pairs unmodified
        counts = Counter()                           # Per-sentence counts inside this split
        out = []                                     # Output list after capping
        for p in rng.sample(arr, len(arr)):          # Shuffle then greedily keep under caps
            a, b = p['sent_id1'], p['sent_id2']      # Sentence IDs
            if counts[a] < per_sentence_cap and counts[b] < per_sentence_cap:  # Check both caps
                out.append(p)                        # Keep the pair
                counts[a] += 1                       # Increment for a
                counts[b] += 1                       # Increment for b
        return out                                   # Return capped list

    for split in tmp:                                # For each split key
        tmp[split] = enforce_split_cap(tmp[split])   # Apply per-sentence cap

    if global_balance:                               # If balancing is enabled
        balanced = {}                                # Output dict for balanced splits
        for split in ['train', 'dev', 'test']:       # Process splits in fixed order
            arr = tmp[split]                         # Pairs in the current split
            pos = [p for p in arr if p['label'] == 1]  # Positive pairs
            neg = [p for p in arr if p['label'] == 0]  # Negative pairs
            rng.shuffle(pos); rng.shuffle(neg)       # Shuffle class lists
            m = min(len(pos), len(neg))              # Target balanced size
            balanced[split] = pos[:m] + neg[:m]      # Truncate to balance
            rng.shuffle(balanced[split])             # Shuffle final split
        return balanced                              # Return balanced splits

    return tmp                                       # Return unbalanced splits if balancing disabled


# Function: main
# Purpose : Orchestrate the full pipeline: load WSD rows, generate capped WiC pairs,
# split and balance them, write train/dev/test JSON files, and save summary.
# Inputs  :
#   uses constants from the config section.
# Returns :
#   writes files to OUT_DIR and prints a summary.
def main():                                          # Program entry point
    print("Loading WSD")                             # Status message
    df = load_wsd(WSD_CSV)                           # Load and clean WSD dataframe

    all_pairs = []                                   # Accumulator for all pairs across lemmas
    lemmas = sorted(df['lemma'].unique())            # Unique lemmas in deterministic order
    print(f"Found {len(lemmas)} lemmas.")            # Report number of lemmas discovered

    # build pairs per lemma with per-sentence cap
    for lem in lemmas:                               # Iterate each lemma
        sub = df[df['lemma'] == lem].copy()          # Subset rows for this lemma
        if sub.shape[0] < 2:                         # Skip lemmas with fewer than 2 examples
            continue                                 # Proceed to next lemma
        pos_pairs, neg_pairs = build_pairs_capped(sub, max_partners=MAX_PARTNERS_PER_SENT)  # Generate pairs
        if not pos_pairs and not neg_pairs:          # If no pairs were produced
            continue                                 # Proceed to next lemma
        pairs = pos_pairs + neg_pairs                # Merge positive and negative pairs
        all_pairs.extend(pairs)                      # Accumulate

    # split & enforce constraints
    splits = split_and_balance(all_pairs, per_sentence_cap=MAX_PARTNERS_PER_SENT, global_balance=GLOBAL_BALANCE)  # Split and balance
    train, dev, test = splits['train'], splits['dev'], splits['test']  # Unpack splits

    # final shuffle & dump
    rng.shuffle(train); rng.shuffle(dev); rng.shuffle(test)  # Shuffle each split before saving

    def dump_json(filename, data):                  # Helper to save a split to JSON
        path = os.path.join(OUT_DIR, filename)      # Compose output path
        with open(path, 'w', encoding='utf-8') as f:  # Open file for writing
            json.dump(data, f, ensure_ascii=False, indent=2)  # Write JSON with indentation
        print(f"Saved {filename}: {len(data)} pairs")  # Report save status

    dump_json('wic_train.json', train)              # Save training split
    dump_json('wic_dev.json',   dev)                # Save development split
    dump_json('wic_test.json',  test)               # Save test split

    # small summary
    def stats(arr):                                  # Helper to compute (pos, neg, pos_ratio)
        if not arr: return (0,0,0.0)                 # Handle empty splits
        pos = sum(1 for x in arr if x['label']==1)   # Count positives
        neg = len(arr) - pos                         # Count negatives
        return (pos, neg, round(pos/len(arr), 3))    # Return counts and positive ratio

    summary = {                                      # Build summary information
        'train': {'total': len(train), 'pos_neg': stats(train)},  # Train totals and balance
        'dev':   {'total': len(dev),   'pos_neg': stats(dev)},    # Dev totals and balance
        'test':  {'total': len(test),  'pos_neg': stats(test)},   # Test totals and balance
        'caps':  {'MAX_PARTNERS_PER_SENT': MAX_PARTNERS_PER_SENT} # Cap configuration
    }
    with open(os.path.join(OUT_DIR, 'summary.json'), 'w', encoding='utf-8') as f:  # Open summary path
        json.dump(summary, f, ensure_ascii=False, indent=2)                         # Write summary JSON
    print("Summary:", summary)                    # Print summary to console

if __name__ == "__main__":                        # Standard Python entry guard
    main()                                        # Execute the pipeline


Loading WSD
Found 43 lemmas.
Saved wic_train.json: 5106 pairs
Saved wic_dev.json: 218 pairs
Saved wic_test.json: 194 pairs
Summary: {'train': {'total': 5106, 'pos_neg': (2553, 2553, 0.5)}, 'dev': {'total': 218, 'pos_neg': (109, 109, 0.5)}, 'test': {'total': 194, 'pos_neg': (97, 97, 0.5)}, 'caps': {'MAX_PARTNERS_PER_SENT': 32}}
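
A quick sanity check on the saved splits, offered as a minimal sketch rather than part of the pipeline: assuming the capped script keeps a pair only when both of its sentences are assigned to the same split (as the uncapped script below does), no `sent_id` should occur in more than one file. The helper names `split_sentence_ids` and `find_leaks` and the toy IDs are hypothetical; a real check would load `wic_train.json`, `wic_dev.json`, and `wic_test.json` with `json.load`.

```python
from itertools import combinations

def split_sentence_ids(pairs):
    """Collect every sentence ID that appears in a list of WiC pair dicts."""
    ids = set()
    for p in pairs:
        ids.add(p["sent_id1"])
        ids.add(p["sent_id2"])
    return ids

def find_leaks(splits):
    """Return {(split_a, split_b): shared_ids} for each pair of splits sharing a sentence."""
    ids = {name: split_sentence_ids(pairs) for name, pairs in splits.items()}
    leaks = {}
    for a, b in combinations(sorted(ids), 2):
        shared = ids[a] & ids[b]
        if shared:
            leaks[(a, b)] = shared
    return leaks

# Toy pairs with made-up IDs; "s1" deliberately appears in both train and test
toy = {
    "train": [{"sent_id1": "s1", "sent_id2": "s2"}],
    "dev":   [{"sent_id1": "s3", "sent_id2": "s4"}],
    "test":  [{"sent_id1": "s5", "sent_id2": "s1"}],
}
print(find_leaks(toy))
```

An empty dictionary means the splits share no sentences; any entry lists the overlapping IDs for that pair of splits.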


**Creating Bangla WiC Dataset: Uncapped Pairing (No Partner Cap, MAX_PARTNERS_PER_SENT = None)**
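
With the cap disabled, the script below enumerates every within-sense pair as a positive and every cross-sense pair as a negative, so the number of candidate pairs per lemma grows quadratically with the number of sentences. The sketch below counts candidates before splitting and balancing, assuming each row of a lemma has a distinct `sent_id`; `uncapped_pair_counts` and the sense sizes are illustrative and not taken from the dataset.

```python
from math import comb

def uncapped_pair_counts(sense_sizes):
    """Return (positives, negatives) for one lemma given its sentences-per-sense counts."""
    positives = sum(comb(n, 2) for n in sense_sizes)   # same-sense pairs
    total = sum(sense_sizes)
    negatives = comb(total, 2) - positives             # all pairs minus same-sense pairs
    return positives, negatives

# Hypothetical lemma with two senses of 30 and 20 sentences
print(uncapped_pair_counts([30, 20]))
```

For two senses the negative count reduces to the product of the two sense sizes, which is why a few frequent lemmas can dominate the uncapped dataset.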

In [None]:
import os                                             # Filesystem paths, directory creation
import json                                           # JSON serialization for output files
import random                                         # Deterministic shuffling and hashing for splits
from collections import defaultdict, Counter          # Grouping by sense and tracking partner caps
import pandas as pd                                   # Tabular data loading and processing

# Configuration
WSD_CSV = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wsd_dataset.csv'    # your WSD-style file with: lemma,sentence,sense_id,start,end,sent_id[,senses]
OUT_DIR = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_uncapped'  # Output directory for WiC JSONs and summary
os.makedirs(OUT_DIR, exist_ok=True)                  # Create output directory if it does not exist

SEED = 42                                            # Global seed for reproducible randomness
TRAIN_P, DEV_P, TEST_P = 0.70, 0.15, 0.15            # Proportions for train/dev/test splits (sum to 1)

# WiC constraints
MAX_PARTNERS_PER_SENT = None   # None disables the per-sentence partner cap (the capped run used 32)
GLOBAL_BALANCE = True          # Make each split 50/50 by downsampling the larger class

rng = random.Random(SEED)                            # Dedicated RNG instance seeded for determinism


# Function: load_wsd
# Purpose : Load the consolidated WSD CSV, validate required columns, normalize names,
# coerce data types, keep only annotated rows, and return a clean DataFrame.
# Inputs  :
#   path (str): Path to the WSD CSV file.
# Returns :
#   pandas.DataFrame: Cleaned dataframe with required columns and types.
def load_wsd(path):                                   # Define loader for WSD CSV
    df = pd.read_csv(path, encoding='utf-8-sig')      # Read CSV (UTF-8 with BOM handling)
    cols = {c.lower(): c for c in df.columns}         # Map lowercase column names to originals

    required = ['lemma','sentence','sense_id','start','end','sent_id']  # Required schema
    missing = [c for c in required if c not in cols]  # Identify missing required columns
    if missing:                                       # If any required column is missing
        raise ValueError(f"Missing required columns: {missing}")  # Fail fast with clear message

    df = df.rename(columns={                          # Normalize to canonical column names
        cols['lemma']: 'lemma',
        cols['sentence']: 'sentence',
        cols['sense_id']: 'sense_id',
        cols['start']: 'start',
        cols['end']: 'end',
        cols['sent_id']: 'sent_id'
    })
    # optional sense text column
    if 'senses' in cols:                              # If a senses column exists in source
        df['senses'] = df[cols['senses']].fillna('').astype(str)  # Keep it as string; blank glosses stay '' instead of 'nan'
    else:                                             # Otherwise
        df['senses'] = ''                             # Initialize a blank senses column

    # types + keep only annotated
    df['sense_id'] = pd.to_numeric(df['sense_id'], errors='coerce').astype('Int64')  # Coerce to nullable int
    df = df[df['sense_id'].notna()].copy()            # Keep rows with a valid sense_id
    df['sense_id'] = df['sense_id'].astype(int)       # Cast to concrete int
    df['start'] = pd.to_numeric(df['start'], errors='coerce').fillna(0).astype(int)  # Start index as int
    df['end']   = pd.to_numeric(df['end'],   errors='coerce').fillna(0).astype(int)  # End index as int
    df['sent_id'] = df['sent_id'].astype(str)         # Sentence ID as string
    df['lemma']   = df['lemma'].astype(str)           # Lemma as string
    df['sentence']= df['sentence'].astype(str)        # Sentence text as string
    return df                                         # Return cleaned dataframe


# Sentence-based hashing -> split (no sentence overlap)

# Function: sentence_bin
# Purpose : Assign a sentence deterministically to 'train', 'dev', or 'test' using a
# seeded random draw keyed by (SEED|sent_id). Ensures no sentence appears
# in more than one split.
# Inputs  :
#   sent_id (str): Unique sentence identifier.
# Returns :
#   str: One of {'train', 'dev', 'test'}.
def sentence_bin(sent_id: str) -> str:                # Deterministic split assignment
    r = random.Random()                               # Fresh RNG instance
    r.seed(f"{SEED}|{sent_id}")                       # Seed with global seed and sent_id
    x = r.random()                                    # Draw uniform value in [0, 1)
    if x < TRAIN_P: return 'train'                    # Map into train range
    if x < TRAIN_P + DEV_P: return 'dev'              # Map into dev range
    return 'test'                                     # Otherwise assign to test


# Function: make_pair
# Purpose : Create a WiC pair dictionary from two sentence records plus a binary label.
# Enforces a canonical order by sent_id to avoid duplicate pair permutations.
# Inputs  :
#   r1, r2 (row-like): Records with fields used below.
#   label (int): 1 if same sense, 0 if different sense.
# Returns :
#   dict: WiC pair entry with spans, labels, and optional sense texts.
def make_pair(r1, r2, label):                         # Construct a single WiC pair
    # canonicalize order by sent_id to avoid duplicate pairs
    a, b = (r1, r2) if r1['sent_id'] <= r2['sent_id'] else (r2, r1)  # Deterministic ordering
    return {                                          # Build pair payload
        "lemma": a['lemma'],                          # Lemma shared by both sentences
        "sentence1": a['sentence'],                   # First sentence text
        "sentence2": b['sentence'],                   # Second sentence text
        "sent_id1": a['sent_id'],                     # First sentence ID
        "sent_id2": b['sent_id'],                     # Second sentence ID
        "start1": int(a['start']),                    # Span start in sentence1
        "end1": int(a['end']),                        # Span end in sentence1
        "start2": int(b['start']),                    # Span start in sentence2
        "end2": int(b['end']),                        # Span end in sentence2
        "label": int(label),                          # 1 = same sense, 0 = different sense
        "sense_id1": int(a['sense_id']),              # Sense ID for sentence1
        "sense_id2": int(b['sense_id']),              # Sense ID for sentence2
        "sense1": a.get('senses', ''),                # Optional sense text for sentence1
        "sense2": b.get('senses', '')                 # Optional sense text for sentence2
    }


# Function: build_pairs_capped
# Purpose : For a given lemma subset, generate positive pairs (within-sense) and
# negative pairs (cross-sense) while enforcing a per-sentence partner cap and
# preventing duplicate unordered pairs.
# Inputs  :
#   df_lem (DataFrame): Rows for a single lemma.
#   max_partners (int or None): Partner cap per sentence; None disables the cap.
# Returns :
#   (list, list): Lists of positive and negative pair dicts.
def build_pairs_capped(df_lem, max_partners=MAX_PARTNERS_PER_SENT):  # Pair generation with cap
    # group rows by sense
    by_sense = defaultdict(list)                    # Map sense_id → list of rows
    for _, r in df_lem.iterrows():                  # Iterate lemma rows
        by_sense[int(r['sense_id'])].append(r)      # Bucket row under its sense

    senses = sorted(by_sense.keys())                # Sorted list of senses for determinism
    for s in senses:                                # For each sense bucket
        rng.shuffle(by_sense[s])                    # Shuffle rows to diversify pairings

    used_pairs = set()                              # Unordered (sent_id, sent_id) keys already emitted
    partner_count = Counter()                       # Number of partners assigned to each sentence
    pos_pairs, neg_pairs = [], []                   # Accumulators for outputs

    unlimited = (max_partners is None)              # Flag indicating no cap

    # helper: try adding a pair if both sentences under cap (or unlimited)
    def try_add_pair(r1, r2, label):                # Local helper to add a pair under constraints
        if r1['sent_id'] == r2['sent_id']:          # Skip pairing a sentence with itself
            return                                  # No action
        key = tuple(sorted((r1['sent_id'], r2['sent_id'])))  # Unordered key for deduplication
        if key in used_pairs:                       # If already paired
            return                                  # Skip duplicate
        if not unlimited:                           # Enforce partner caps when enabled
            if partner_count[r1['sent_id']] >= max_partners: return  # Respect cap for r1
            if partner_count[r2['sent_id']] >= max_partners: return  # Respect cap for r2
        p = make_pair(r1, r2, label)                # Create WiC pair
        used_pairs.add(key)                         # Record the pair key
        partner_count[r1['sent_id']] += 1           # Increment partner count for r1
        partner_count[r2['sent_id']] += 1           # Increment partner count for r2
        (pos_pairs if label == 1 else neg_pairs).append(p)  # Append to positive or negative list

    # positives
    for s in senses:                                # For each sense bucket
        rows = by_sense[s]                          # Rows within the same sense
        for i in range(len(rows)):                  # Pairwise iteration (upper triangle)
            for j in range(i+1, len(rows)):         # Avoid symmetric duplicates
                try_add_pair(rows[i], rows[j], label=1)  # Add positive pair

    # negatives (cross-sense)
    for s in senses:                                # For each sense bucket
        rows_s = by_sense[s]                        # Rows in the current sense
        others = [r for t in senses if t != s for r in by_sense[t]]  # All rows in other senses
        rng.shuffle(others)                          # Shuffle cross-sense candidates
        for r1 in rows_s:                            # For each row in current sense
            if not unlimited and partner_count[r1['sent_id']] >= max_partners:  # Cap check for r1
                continue                             # Skip if capped
            for r2 in others:                        # Iterate cross-sense rows
                if not unlimited and partner_count[r1['sent_id']] >= max_partners:  # Re-check during loop
                    break                             # Stop if r1 reached cap
                try_add_pair(r1, r2, label=0)        # Add negative pair

    return pos_pairs, neg_pairs                      # Return both lists


# Function: split_and_balance
# Purpose : Assign pairs to train/dev/test based on deterministic per-sentence hashing
# (both sentences must land in the same split), enforce an optional per-sentence
# cap within each split, and balance labels per split if requested.
# Inputs  :
#   pairs (list): All WiC pair dicts.
#   per_sentence_cap (int or None): Cap per sentence inside each split; None disables it.
#   global_balance (bool): Whether to balance labels per split.
# Returns :
#   dict: {'train': [...], 'dev': [...], 'test': [...]} with processed pairs.
def split_and_balance(pairs, per_sentence_cap=MAX_PARTNERS_PER_SENT, global_balance=True):  # Split and balance
    # keep only pairs whose two sentences hash to the same split
    tmp = {'train': [], 'dev': [], 'test': []}       # Containers for intermediate splits
    for p in pairs:                                  # Iterate all pairs
        b1, b2 = sentence_bin(p['sent_id1']), sentence_bin(p['sent_id2'])  # Compute bins for both sentences
        if b1 == b2:                                 # Only keep if both map to the same split
            tmp[b1].append(p)                        # Append to that split

    # enforce per-split cap (skip if None)
    def enforce_split_cap(arr):                      # Local helper to apply per-sentence cap
        if per_sentence_cap is None:                 # If cap is disabled
            return list(arr)                         # Return all pairs unmodified
        counts = Counter()                           # Per-sentence counts inside this split
        out = []                                     # Output list after capping
        for p in rng.sample(arr, len(arr)):          # Shuffle then greedily keep under caps
            a, b = p['sent_id1'], p['sent_id2']      # Sentence IDs
            if counts[a] < per_sentence_cap and counts[b] < per_sentence_cap:  # Check both caps
                out.append(p)                        # Keep the pair
                counts[a] += 1                       # Increment for a
                counts[b] += 1                       # Increment for b
        return out                                   # Return capped list

    for split in tmp:                                # For each split key
        tmp[split] = enforce_split_cap(tmp[split])   # Apply per-sentence cap

    if global_balance:                               # If balancing is enabled
        balanced = {}                                # Output dict for balanced splits
        for split in ['train', 'dev', 'test']:       # Process splits in fixed order
            arr = tmp[split]                         # Pairs in the current split
            pos = [p for p in arr if p['label'] == 1]  # Positive pairs
            neg = [p for p in arr if p['label'] == 0]  # Negative pairs
            rng.shuffle(pos); rng.shuffle(neg)       # Shuffle class lists
            m = min(len(pos), len(neg))              # Target balanced size
            balanced[split] = pos[:m] + neg[:m]      # Truncate to balance
            rng.shuffle(balanced[split])             # Shuffle final split
        return balanced                              # Return balanced splits

    return tmp                                       # Return unbalanced splits if balancing disabled


# Function: main
# Purpose : Orchestrate the full pipeline: load WSD rows, generate WiC pairs (uncapped
# in this run, since MAX_PARTNERS_PER_SENT is None), split and balance them,
# write train/dev/test JSON files, and save a summary.
# Inputs  :
#   uses constants from the config section.
# Returns : 
#   writes files to OUT_DIR and prints a summary.
def main():                                          # Program entry point
    print("Loading WSD")                             # Status message
    df = load_wsd(WSD_CSV)                           # Load and clean WSD dataframe

    all_pairs = []                                   # Accumulator for all pairs across lemmas
    lemmas = sorted(df['lemma'].unique())            # Unique lemmas in deterministic order
    print(f"Found {len(lemmas)} lemmas.")            # Report number of lemmas discovered

    # build pairs per lemma (no per-sentence cap is applied when MAX_PARTNERS_PER_SENT is None)
    for lem in lemmas:                               # Iterate each lemma
        sub = df[df['lemma'] == lem].copy()          # Subset rows for this lemma
        if sub.shape[0] < 2:                         # Skip lemmas with fewer than 2 examples
            continue                                 # Proceed to next lemma
        pos_pairs, neg_pairs = build_pairs_capped(sub, max_partners=MAX_PARTNERS_PER_SENT)  # Generate pairs
        if not pos_pairs and not neg_pairs:          # If no pairs were produced
            continue                                 # Proceed to next lemma
        pairs = pos_pairs + neg_pairs                # Merge positive and negative pairs
        all_pairs.extend(pairs)                      # Accumulate

    # split & enforce constraints
    splits = split_and_balance(all_pairs, per_sentence_cap=MAX_PARTNERS_PER_SENT, global_balance=GLOBAL_BALANCE)  # Split and balance
    train, dev, test = splits['train'], splits['dev'], splits['test']  # Unpack splits

    # final shuffle & dump
    rng.shuffle(train); rng.shuffle(dev); rng.shuffle(test)  # Shuffle each split before saving

    def dump_json(filename, data):                  # Helper to save a split to JSON
        path = os.path.join(OUT_DIR, filename)      # Compose output path
        with open(path, 'w', encoding='utf-8') as f:  # Open file for writing
            json.dump(data, f, ensure_ascii=False, indent=2)  # Write JSON with indentation
        print(f"Saved {filename}: {len(data)} pairs")  # Report save status

    dump_json('wic_train.json', train)              # Save training split
    dump_json('wic_dev.json',   dev)                # Save development split
    dump_json('wic_test.json',  test)               # Save test split

    # small summary
    def stats(arr):                                  # Helper to compute (pos, neg, pos_ratio)
        if not arr: return (0,0,0.0)                 # Handle empty splits
        pos = sum(1 for x in arr if x['label']==1)   # Count positives
        neg = len(arr) - pos                         # Count negatives
        return (pos, neg, round(pos/len(arr), 3))    # Return counts and positive ratio

    summary = {                                      # Build summary information
        'train': {'total': len(train), 'pos_neg': stats(train)},  # Train totals and balance
        'dev':   {'total': len(dev),   'pos_neg': stats(dev)},    # Dev totals and balance
        'test':  {'total': len(test),  'pos_neg': stats(test)},   # Test totals and balance
        'caps':  {'MAX_PARTNERS_PER_SENT': MAX_PARTNERS_PER_SENT} # Cap configuration
    }
    with open(os.path.join(OUT_DIR, 'summary.json'), 'w', encoding='utf-8') as f:  # Open summary path
        json.dump(summary, f, ensure_ascii=False, indent=2)                         # Write summary JSON
    print("Summary:", summary)                    # Print summary to console

if __name__ == "__main__":                        # Standard Python entry guard
    main()                                        # Execute the pipeline


Loading WSD
Found 43 lemmas.
Saved wic_train.json: 103714 pairs
Saved wic_dev.json: 4714 pairs
Saved wic_test.json: 4270 pairs
Summary: {'train': {'total': 103714, 'pos_neg': (51857, 51857, 0.5)}, 'dev': {'total': 4714, 'pos_neg': (2357, 2357, 0.5)}, 'test': {'total': 4270, 'pos_neg': (2135, 2135, 0.5)}, 'caps': {'MAX_PARTNERS_PER_SENT': None}}
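
The dev and test files are much smaller than the 70/15/15 sentence proportions might suggest. A rough explanation, sketched below: a pair survives only if both sentences fall in the same split, so under the assumption that the two per-sentence draws are independent, the chance of a pair landing in a split is the square of that split's proportion. The comparison is approximate because label balancing also removes pairs; the `observed` totals are copied from the run output above.

```python
TRAIN_P, DEV_P, TEST_P = 0.70, 0.15, 0.15   # same proportions as in the configuration

# Probability that both sentences of a pair land in the same given split
same_split = {"train": TRAIN_P ** 2, "dev": DEV_P ** 2, "test": TEST_P ** 2}
kept = sum(same_split.values())             # fraction of candidate pairs that survive
expected_share = {k: v / kept for k, v in same_split.items()}

# Totals reported in the uncapped run
observed = {"train": 103714, "dev": 4714, "test": 4270}
total = sum(observed.values())
observed_share = {k: v / total for k, v in observed.items()}

print(f"kept fraction: {kept:.3f}")
for k in ("train", "dev", "test"):
    print(f"{k}: expected {expected_share[k]:.3f}, observed {observed_share[k]:.3f}")
```

Under this assumption only about 53.5% of candidate pairs are kept, and roughly 92% of the kept pairs fall in train, which is close to the reported totals.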


**Zero-Shot WiC Evaluator for Bangla Polysemy (No Training) on the Capped WiC Dataset**

This script evaluates Word-in-Context (WiC) pairs in a zero-shot manner.
It inserts [TGT]…[/TGT] markers around target spans, encodes each sentence
with a chosen multilingual model (SentenceTransformers backend for LaBSE
and e5; Hugging Face Transformers + mean pooling for the others), computes
cosine similarity between sentence embeddings, calibrates a single
similarity threshold on the dev split (maximizing F1), and reports
F1/Accuracy on the test split. It also saves per-pair predictions and a
run summary for all models listed in MODELS.

 1) Load WiC dev/test JSON files.
 2) Insert target markers using provided offsets.
 3) Encode sentence pairs with the specified backbone.
 4) Compute cosine similarity between embeddings.
 5) Sweep thresholds on dev to maximize F1 (calibration).
 6) Apply the best threshold to test; compute F1/Accuracy.
 7) Save predictions and a CSV summary per model.
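
Steps 2 and 4 to 6 can be sketched on toy data without loading any model. The vectors below are hand-made stand-ins for embeddings, not model output, and the helper names (`mark`, `cosine`, `f1`, `pick_threshold`) are illustrative; this simplified sweep also omits the accuracy tie-break used by the full script.

```python
import numpy as np

L_MARK, R_MARK = "[TGT]", "[/TGT]"

def mark(text, start, end):
    """Wrap text[start:end] in target markers (start inclusive, end exclusive)."""
    return text[:start] + L_MARK + text[start:end] + R_MARK + text[end:]

def cosine(a, b):
    """Row-wise cosine similarity of two [N, D] arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def f1(y, pred):
    """Binary F1 for the positive class; 0.0 when there are no true positives."""
    tp = int(((y == 1) & (pred == 1)).sum())
    fp = int(((y == 0) & (pred == 1)).sum())
    fn = int(((y == 1) & (pred == 0)).sum())
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def pick_threshold(sims, y):
    """Sweep midpoints between sorted unique similarities; the first best F1 wins."""
    uniq = np.unique(sims)
    cands = [uniq[0] - 1e-6] + list((uniq[:-1] + uniq[1:]) / 2) + [uniq[-1] + 1e-6]
    scored = [(f1(y, (sims >= t).astype(int)), float(t)) for t in cands]
    best_f1, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_f1

# Step 2: character offsets, derived here with str.index for the toy sentence
sentence, lemma = "আমার অর্থ দরকার", "অর্থ"
start = sentence.index(lemma)
marked = mark(sentence, start, start + len(lemma))
print(marked)

# Steps 4-5: cosine similarity on toy dev vectors, then threshold calibration
dev_a = np.array([[1, 0], [1, 0], [1, 0], [1, 0]], dtype=float)
dev_b = np.array([[1, 0.1], [1, 0.5], [0.2, 1], [0, 1]], dtype=float)
dev_y = np.array([1, 1, 0, 0])
thr, dev_f1 = pick_threshold(cosine(dev_a, dev_b), dev_y)

# Step 6: apply the dev threshold unchanged to toy test vectors
test_a = np.array([[1, 0], [1, 0]], dtype=float)
test_b = np.array([[1, 0.2], [0.1, 1]], dtype=float)
test_y = np.array([1, 0])
test_pred = (cosine(test_a, test_b) >= thr).astype(int)
test_acc = float((test_pred == test_y).mean())
print(f"threshold={thr:.3f} dev_f1={dev_f1:.3f} test_acc={test_acc:.3f}")
```

The threshold is chosen on dev only and then frozen, so the test labels never influence the decision rule.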

In [8]:
import os                              # OS utilities for paths and directories
import json                            # JSON I/O for WiC files and summaries
import math                            # math.isclose used in threshold tie-breaking
import time                            # simple wall-clock timing
from collections import defaultdict    # imported (not strictly used); safe to keep

import numpy as np                     # vector ops and cosine components
import pandas as pd                    # tabular export of predictions/summary
import torch                           # device detection and no-grad inference
from sklearn.metrics import f1_score, accuracy_score  # evaluation metrics

# Backends
from sentence_transformers import SentenceTransformer  # ST models (LaBSE/e5)
from transformers import AutoTokenizer, AutoModel      # HF base models

# Configuration
WIC_DEV  = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_capped\wic_dev.json'   # path to dev WiC JSON
WIC_TEST = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_capped\wic_test.json'  # path to test WiC JSON

OUT_DIR = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\zero_shot_result_capped'    # output directory
os.makedirs(OUT_DIR, exist_ok=True)   # ensure output directory exists

MODELS = [
    ('sahajbert',  'neuropark/sahajBERT'),               # HF base model -> Transformers mean pooling
    ('muril',      'google/muril-base-cased'),           # HF base model -> Transformers mean pooling
    ('labse',      'sentence-transformers/LaBSE'),       # ST model      -> SentenceTransformer
    ('e5',         'intfloat/multilingual-e5-base'),     # ST model      -> SentenceTransformer (with "query:" prefix)
    ('banglabert', 'sagorsarker/bangla-bert-base'),      # HF base model -> Transformers mean pooling
]  # list of (short_name, HF model id)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # pick GPU if available
BATCH_SIZE = 64                                          # batch size for encoding
MAX_LEN = 256                                            # max sequence length for HF tokenizers

# Use the same target markers as Code 2
L_MARK, R_MARK = '[TGT]', '[/TGT]'                       # target span markers


# Function: maybe_prefix
# Purpose : Add the "query:" prefix required by e5 models (kept symmetric on both sides).
# Inputs  : 
#   model_id (str) -> HF repo id; 
#   text (str) -> input sentence (with markers).
# Outputs : 
#   (str) -> possibly prefixed text.
def maybe_prefix(model_id, text):
    return f"query: {text}" if 'e5' in model_id.lower() else text  # add "query:" for e5; no change otherwise


# Helpers

# Function: load_wic
# Purpose : Load a WiC JSON file into a Python list of dicts.
# Inputs  : 
#   path (str) -> filesystem path to a WiC JSON file.
# Outputs : 
#   (list[dict]) -> each dict contains WiC fields including sentences,
#   offsets, labels, and IDs.
def load_wic(path):
    with open(path, 'r', encoding='utf-8') as f:  # open the JSON file with UTF-8
        return json.load(f)                       # parse and return as Python objects


# Function: insert_markers
# Purpose : Surround the target span in a sentence with [TGT]…[/TGT] based on offsets.
# Inputs  : 
#   text (str) -> sentence string;
#   start (int), end (int) -> character offsets (start inclusive, end exclusive);
#   l_mark (str), r_mark (str) -> left/right marker tokens.
# Outputs : 
#   (str) -> sentence with markers inserted or original text if offsets invalid.
def insert_markers(text, start, end, l_mark=L_MARK, r_mark=R_MARK):
    """Insert [TGT] .. [/TGT] around the target span; fallback to raw text if indices invalid."""
    try:                                                # bounds safety
        if 0 <= start <= end <= len(text):              # ensure valid offsets
            return text[:start] + l_mark + text[start:end] + r_mark + text[end:]  # splice in markers
    except Exception:                                   # any unexpected error falls back
        pass
    return text                                         # fallback: return original sentence


# Function: cosine_sim
# Purpose : Compute cosine similarity for aligned rows of two embedding arrays.
# Inputs  : 
#   a (np.ndarray), b (np.ndarray) -> shape [N, D] embeddings.
# Outputs : 
#   (np.ndarray) -> shape [N] array of cosine similarities.
def cosine_sim(a, b):
    a = np.asarray(a, dtype=np.float32)                            # cast to float32
    b = np.asarray(b, dtype=np.float32)                            # cast to float32
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)  # row-wise L2 normalize
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)  # row-wise L2 normalize
    return np.sum(a_norm * b_norm, axis=1)                         # cosine = dot of normalized rows


# Function: best_threshold
# Purpose : Select the similarity threshold that maximizes F1 on dev, tie-break by accuracy.
# Inputs  : 
#   sims (array-like[float]) -> similarity scores; labels (array-like[int]) -> gold 0/1.
# Outputs : 
#   (tuple) -> (best_threshold: float, best_f1: float, best_acc: float).
def best_threshold(sims, labels):
    """Pick the similarity threshold that maximizes F1 on dev (break ties by higher accuracy)."""
    sims = np.asarray(sims, dtype=float)                     # vectorize similarities
    y = np.asarray(labels, dtype=int)                        # vectorize labels
    uniq = np.unique(sims)                                   # distinct sims to form sweep points
    if len(uniq) == 1:                                       # degenerate: all identical scores
        t_candidates = [uniq[0]]                             # only that one threshold
    else:
        mids = (uniq[:-1] + uniq[1:]) / 2.0                  # midpoints between sorted neighbors
        t_candidates = [uniq[0]-1e-6] + list(mids) + [uniq[-1]+1e-6]  # include small margins
    best_t, best_f1, best_acc = None, -1.0, -1.0             # initialize bests
    for t in t_candidates:                                   # sweep thresholds
        pred = (sims >= t).astype(int)                       # predict same-sense if sim ≥ t
        f1  = f1_score(y, pred, zero_division=0)             # compute F1 (0 when no positives are predicted)
        acc = accuracy_score(y, pred)                        # compute Accuracy
        if (f1 > best_f1) or (math.isclose(f1, best_f1) and acc > best_acc):  # tie-break by Acc
            best_t, best_f1, best_acc = float(t), float(f1), float(acc)       # store new best
    return best_t, best_f1, best_acc                         # return optimal threshold and scores



# Unified encoding backend (ST for ST repos, HF+mean-pooling otherwise)
_st_cache = {}                                               # cache for SentenceTransformer models
_hf_cache = {}                                               # cache for (tokenizer, HF model) tuples


# Function: _is_sentence_transformers_repo
# Purpose : Decide whether to use SentenceTransformers or HF+mean pooling for a repo id.
# Inputs  : 
#   model_id (str) -> HF repository identifier.
# Outputs : 
#   (bool) -> True if SentenceTransformers API should be used.
def _is_sentence_transformers_repo(model_id: str) -> bool:
    mid = model_id.lower()                                   # normalize case
    return ('sentence-transformers/' in mid) or ('/e5' in mid) or mid.startswith('intfloat/')  # heuristic


# Function: _get_st_encoder
# Purpose : Lazy-load and cache a SentenceTransformer encoder callable.
# Inputs  : 
#   model_id (str) -> SentenceTransformers-compatible repo id.
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int) -> np.ndarray [N, D]
def _get_st_encoder(model_id):
    if model_id not in _st_cache:                                            # if not cached
        _st_cache[model_id] = SentenceTransformer(model_id, device=DEVICE)   # load ST model
    st_model = _st_cache[model_id]                                          # fetch cached model
    def encode(texts, batch_size=BATCH_SIZE):                               # encoder closure
        with torch.inference_mode():                                        # no gradients
            return st_model.encode(                                         # SentenceTransformers encode
                texts,
                batch_size=batch_size,
                convert_to_numpy=True,
                show_progress_bar=False,
                normalize_embeddings=False  # leave cosine normalization to cosine_sim()
            )
    return encode                                                            # return callable


# Function: _get_hf_encoder
# Purpose : Lazy-load and cache a Hugging Face base model + tokenizer, returning
# a callable that encodes texts via mean pooling of last hidden states.
# Inputs  : 
#   model_id (str) -> Hugging Face repo id (non-ST).
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int, max_length: int) -> np.ndarray [N, D]
def _get_hf_encoder(model_id):
    if model_id not in _hf_cache:                                           # if not cached
        tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)        # tokenizer
        mdl = AutoModel.from_pretrained(model_id).to(DEVICE)                # base model
        mdl.eval()                                                          # eval mode (disables dropout)
        _hf_cache[model_id] = (tok, mdl)                                    # cache tuple
    tok, mdl = _hf_cache[model_id]                                          # unpack cache

    def mean_pool(last_hidden_state, attention_mask):                       # pooling helper
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)     # [B,T,1] mask cast to the hidden-state dtype
        summed = (last_hidden_state * mask).sum(dim=1)                      # sum masked states
        counts = mask.sum(dim=1).clamp(min=1e-9)                            # token counts per row
        return (summed / counts)                                            # mean = sum / count

    def encode(texts, batch_size=BATCH_SIZE, max_length=MAX_LEN):           # encoder closure
        embs = []                                                           # accumulator
        with torch.inference_mode():                                        # no gradients
            for i in range(0, len(texts), batch_size):                      # mini-batches
                batch = texts[i:i+batch_size]                               # slice texts
                inputs = tok(                                               # tokenize
                    batch, padding=True, truncation=True,
                    max_length=max_length, return_tensors='pt'
                ).to(DEVICE)
                outputs = mdl(**inputs)                                     # forward pass
                pooled = mean_pool(outputs.last_hidden_state, inputs['attention_mask'])  # mean pool
                embs.append(pooled.detach().cpu().numpy())                  # to CPU numpy
        return np.vstack(embs)                                              # stack to [N, D]
    return encode                                                           # return callable


# Function: get_encoder
# Purpose : Factory that returns a text->embedding encoder function for a repo id.
# Inputs  : 
#   model_id (str) -> HF repository identifier.
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int, [max_length]) -> np.ndarray [N, D]
def get_encoder(model_id):
    return _get_st_encoder(model_id) if _is_sentence_transformers_repo(model_id) else _get_hf_encoder(model_id)  # pick backend


# Evaluation

# Function: eval_model
# Purpose : End-to-end calibrated zero-shot evaluation for one backbone.
# Inputs  : 
#   model_name (str) -> short label for reports/filenames;
#   model_id (str)   -> HF repo id;
#   dev_data (list[dict])  -> WiC dev examples;
#   test_data (list[dict]) -> WiC test examples;
#   out_dir (str)     -> directory to write predictions.
# Outputs : 
#   (dict) -> summary with threshold, dev/test F1/Acc, and path to predictions CSV.
def eval_model(model_name, model_id, dev_data, test_data, out_dir=OUT_DIR):
    """
    Calibrated zero-shot:
      - insert [TGT]…[/TGT] around spans (same as Code 2’s logic),
      - encode each side with the same backbone as Code 2,
      - cosine similarity,
      - choose threshold on dev (max F1),
      - evaluate on test,
      - save predictions and a summary row.
    """
    print(f"\n{model_name} | {model_id}")          # header for the current model
    encoder = get_encoder(model_id)                         # obtain encoding callable for this repo

    # DEV
    dev_left, dev_right, dev_labels = [], [], []           # containers for left/right texts and labels
    for ex in dev_data:                                     # iterate dev examples
        s1 = insert_markers(ex['sentence1'], ex['start1'], ex['end1'])  # mark target in sentence1
        s2 = insert_markers(ex['sentence2'], ex['start2'], ex['end2'])  # mark target in sentence2
        dev_left.append(maybe_prefix(model_id, s1))         # prefix (e5) or leave as is
        dev_right.append(maybe_prefix(model_id, s2))        # prefix (e5) or leave as is
        dev_labels.append(int(ex['label']))                 # store gold label

    dev_emb1 = encoder(dev_left)                            # encode left dev sentences
    dev_emb2 = encoder(dev_right)                           # encode right dev sentences
    dev_sims = cosine_sim(dev_emb1, dev_emb2)               # cosine similarities for dev
    thr, dev_f1, dev_acc = best_threshold(dev_sims, dev_labels)  # pick best threshold on dev
    print(f"Dev : best_threshold={thr:.4f} | F1={dev_f1:.4f} | Acc={dev_acc:.4f}")  # report dev calibration

    # TEST
    test_left, test_right, test_labels = [], [], []         # prepare test containers
    for ex in test_data:                                    # iterate test examples
        s1 = insert_markers(ex['sentence1'], ex['start1'], ex['end1'])  # mark sentence1
        s2 = insert_markers(ex['sentence2'], ex['start2'], ex['end2'])  # mark sentence2
        test_left.append(maybe_prefix(model_id, s1))        # prefix if e5
        test_right.append(maybe_prefix(model_id, s2))       # prefix if e5
        test_labels.append(int(ex['label']))                # gold label

    test_emb1 = encoder(test_left)                          # encode left test sentences
    test_emb2 = encoder(test_right)                         # encode right test sentences
    test_sims = cosine_sim(test_emb1, test_emb2)            # cosine similarities
    test_pred = (test_sims >= thr).astype(int)              # predict using calibrated threshold

    test_f1  = f1_score(test_labels, test_pred)             # test F1
    test_acc = accuracy_score(test_labels, test_pred)       # test Accuracy
    print(f"Test : F1={test_f1:.4f} | Acc={test_acc:.4f}")  # report test metrics

    # Save per-pair predictions
    rows = []                                               # per-pair rows to write
    for ex, sim, pred in zip(test_data, test_sims, test_pred):  # iterate results
        rows.append({
            'model': model_name,                            # short model name
            'lemma': ex.get('lemma', ''),                   # lemma (if present)
            'sent_id1': ex.get('sent_id1', ''),             # sentence id 1
            'sent_id2': ex.get('sent_id2', ''),             # sentence id 2
            'sim': float(sim),                              # cosine similarity
            'pred': int(pred),                              # predicted label
            'label': int(ex['label']),                      # gold label (required; same access as test_labels above)
        })
    pred_path = os.path.join(out_dir, f'{model_name}_zeroshot_test_predictions.csv')  # CSV path
    pd.DataFrame(rows).to_csv(pred_path, index=False, encoding='utf-8-sig')          # write CSV with BOM

    return {
        'model': model_name,                                # echo model name
        'thresh': thr,                                      # chosen threshold
        'dev_f1': dev_f1, 'dev_acc': dev_acc,               # dev metrics
        'test_f1': test_f1, 'test_acc': test_acc,           # test metrics
        'pred_path': pred_path                              # where predictions were saved
    }


# Function: main
# Purpose : Orchestrate the full zero-shot evaluation across configured models.
# Inputs  : 
#   None (uses global config for paths/models).
# Outputs : 
#   None (prints metrics, writes predictions and a summary CSV).
def main():
    dev_data  = load_wic(WIC_DEV)                           # load dev split
    test_data = load_wic(WIC_TEST)                          # load test split

    summaries = []                                          # accumulate per-model summaries
    t0 = time.time()                                        # start timer
    for name, mid in MODELS:                                # iterate configured models
        try:
            s = eval_model(name, mid, dev_data, test_data, out_dir=OUT_DIR)  # evaluate one model
            summaries.append(s)                             # store summary
        except Exception as e:                              # robust loop: continue on error
            print(f"[WARN] {name} failed: {e}")             # report failure

    if summaries:                                           # if we collected any results
        sum_df = pd.DataFrame(summaries)                    # tabularize summaries
        sum_csv = os.path.join(OUT_DIR, 'zeroshot_calibrated_summary.csv')  # summary path
        sum_df.to_csv(sum_csv, index=False, encoding='utf-8-sig')           # write summary CSV
        print("\nCalibrated zero-shot summary")       # header
        print(sum_df.to_string(index=False))                # pretty-print table
        print(f"\nSaved summary to: {sum_csv}")             # location info
    print(f"\nDone in {time.time()-t0:.1f}s")               # total elapsed time

if __name__ == '__main__':                                  # script entry point
    main()                                                  # run main



sahajbert | neuropark/sahajBERT
Dev : best_threshold=0.8805 | F1=0.7672 | Acc=0.7523
Test : F1=0.6354 | Acc=0.6392

muril | google/muril-base-cased
Dev : best_threshold=0.9955 | F1=0.7480 | Acc=0.7156
Test : F1=0.6878 | Acc=0.6443

labse | sentence-transformers/LaBSE
Dev : best_threshold=0.2792 | F1=0.6730 | Acc=0.5275
Test : F1=0.6739 | Acc=0.5361

e5 | intfloat/multilingual-e5-base
Dev : best_threshold=0.7857 | F1=0.6885 | Acc=0.5642
Test : F1=0.6494 | Acc=0.5103

banglabert | sagorsarker/bangla-bert-base
Dev : best_threshold=0.5664 | F1=0.6974 | Acc=0.5780
Test : F1=0.6544 | Acc=0.5155

Calibrated zero-shot summary
     model   thresh   dev_f1  dev_acc  test_f1  test_acc                                                                                                                                              pred_path
 sahajbert 0.880528 0.767241 0.752294 0.635417  0.639175  C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\zero_shot_result_capp

**Zero-Shot WiC Evaluator for Bangla Polysemy (No Training) on the Uncapped WiC Dataset**

In [10]:
import os                              # OS utilities for paths and directories
import json                            # JSON I/O for WiC files and summaries
import math                            # math.isclose used in threshold tie-breaking
import time                            # simple wall-clock timing
from collections import defaultdict    # imported (not strictly used); safe to keep

import numpy as np                     # vector ops and cosine components
import pandas as pd                    # tabular export of predictions/summary
import torch                           # device detection and no-grad inference
from sklearn.metrics import f1_score, accuracy_score  # evaluation metrics

# Backends
from sentence_transformers import SentenceTransformer  # ST models (LaBSE/e5)
from transformers import AutoTokenizer, AutoModel      # HF base models

# Configuration
WIC_DEV  = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_uncapped\wic_dev.json'   # path to dev WiC JSON
WIC_TEST = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_uncapped\wic_test.json'  # path to test WiC JSON

OUT_DIR = r'C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\zero_shot_result_uncapped'    # output directory
os.makedirs(OUT_DIR, exist_ok=True)   # ensure output directory exists

MODELS = [
    ('sahajbert',  'neuropark/sahajBERT'),               # HF base model -> Transformers mean pooling
    ('muril',      'google/muril-base-cased'),           # HF base model -> Transformers mean pooling
    ('labse',      'sentence-transformers/LaBSE'),       # ST model      -> SentenceTransformer
    ('e5',         'intfloat/multilingual-e5-base'),     # ST model      -> SentenceTransformer (with "query:" prefix)
    ('banglabert', 'sagorsarker/bangla-bert-base'),      # HF base model -> Transformers mean pooling
]  # list of (short_name, HF model id)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # pick GPU if available
BATCH_SIZE = 64                                          # batch size for encoding
MAX_LEN = 256                                            # max sequence length for HF tokenizers

# Use the same target markers as Code 2
L_MARK, R_MARK = '[TGT]', '[/TGT]'                       # target span markers


# Function: maybe_prefix
# Purpose : Add the "query:" prefix required by e5 models (kept symmetric on both sides).
# Inputs  : 
#   model_id (str) -> HF repo id; 
#   text (str) -> input sentence (with markers).
# Outputs : 
#   (str) -> possibly prefixed text.
def maybe_prefix(model_id, text):
    return f"query: {text}" if 'e5' in model_id.lower() else text  # add "query:" for e5; no change otherwise


# Helpers

# Function: load_wic
# Purpose : Load a WiC JSON file into a Python list of dicts.
# Inputs  : 
#   path (str) -> filesystem path to a WiC JSON file.
# Outputs : 
#   (list[dict]) -> each dict contains WiC fields including sentences,
#   offsets, labels, and IDs.
def load_wic(path):
    with open(path, 'r', encoding='utf-8') as f:  # open the JSON file with UTF-8
        return json.load(f)                       # parse and return as Python objects


# Function: insert_markers
# Purpose : Surround the target span in a sentence with [TGT]…[/TGT] based on offsets.
# Inputs  : 
#   text (str) -> sentence string;
#   start (int), end (int) -> character offsets (start inclusive, end exclusive);
#   l_mark (str), r_mark (str) -> left/right marker tokens.
# Outputs : 
#   (str) -> sentence with markers inserted or original text if offsets invalid.
def insert_markers(text, start, end, l_mark=L_MARK, r_mark=R_MARK):
    """Insert [TGT] .. [/TGT] around the target span; fallback to raw text if indices invalid."""
    try:                                                # bounds safety
        if 0 <= start <= end <= len(text):              # ensure valid offsets
            return text[:start] + l_mark + text[start:end] + r_mark + text[end:]  # splice in markers
    except Exception:                                   # any unexpected error falls back
        pass
    return text                                         # fallback: return original sentence


# Function: cosine_sim
# Purpose : Compute cosine similarity for aligned rows of two embedding arrays.
# Inputs  : 
#   a (np.ndarray), b (np.ndarray) -> shape [N, D] embeddings.
# Outputs : 
#   (np.ndarray) -> shape [N] array of cosine similarities.
def cosine_sim(a, b):
    a = np.asarray(a, dtype=np.float32)                            # cast to float32
    b = np.asarray(b, dtype=np.float32)                            # cast to float32
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)  # row-wise L2 normalize
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)  # row-wise L2 normalize
    return np.sum(a_norm * b_norm, axis=1)                         # cosine = dot of normalized rows


# Function: best_threshold
# Purpose : Select the similarity threshold that maximizes F1 on dev, tie-break by accuracy.
# Inputs  : 
#   sims (array-like[float]) -> similarity scores; labels (array-like[int]) -> gold 0/1.
# Outputs : 
#   (tuple) -> (best_threshold: float, best_f1: float, best_acc: float).
def best_threshold(sims, labels):
    """Pick the similarity threshold that maximizes F1 on dev (break ties by higher accuracy)."""
    sims = np.asarray(sims, dtype=float)                     # vectorize similarities
    y = np.asarray(labels, dtype=int)                        # vectorize labels
    uniq = np.unique(sims)                                   # distinct sims to form sweep points
    if len(uniq) == 1:                                       # degenerate: all identical scores
        t_candidates = [uniq[0]]                             # only that one threshold
    else:
        mids = (uniq[:-1] + uniq[1:]) / 2.0                  # midpoints between sorted neighbors
        t_candidates = [uniq[0]-1e-6] + list(mids) + [uniq[-1]+1e-6]  # include small margins
    best_t, best_f1, best_acc = None, -1.0, -1.0             # initialize bests
    for t in t_candidates:                                   # sweep thresholds
        pred = (sims >= t).astype(int)                       # predict same-sense if sim ≥ t
        f1  = f1_score(y, pred)                              # compute F1
        acc = accuracy_score(y, pred)                        # compute Accuracy
        if (f1 > best_f1) or (math.isclose(f1, best_f1) and acc > best_acc):  # tie-break by Acc
            best_t, best_f1, best_acc = float(t), float(f1), float(acc)       # store new best
    return best_t, best_f1, best_acc                         # return optimal threshold and scores



# Unified encoding backend (ST for ST repos, HF+mean-pooling otherwise)
_st_cache = {}                                               # cache for SentenceTransformer models
_hf_cache = {}                                               # cache for (tokenizer, HF model) tuples


# Function: _is_sentence_transformers_repo
# Purpose : Decide whether to use SentenceTransformers or HF+mean pooling for a repo id.
# Inputs  : 
#   model_id (str) -> HF repository identifier.
# Outputs : 
#   (bool) -> True if SentenceTransformers API should be used.
def _is_sentence_transformers_repo(model_id: str) -> bool:
    mid = model_id.lower()                                   # normalize case
    return ('sentence-transformers/' in mid) or ('/e5' in mid) or mid.startswith('intfloat/')  # heuristic


# Function: _get_st_encoder
# Purpose : Lazy-load and cache a SentenceTransformer encoder callable.
# Inputs  : 
#   model_id (str) -> SentenceTransformers-compatible repo id.
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int) -> np.ndarray [N, D]
def _get_st_encoder(model_id):
    if model_id not in _st_cache:                                            # if not cached
        _st_cache[model_id] = SentenceTransformer(model_id, device=DEVICE)   # load ST model
    st_model = _st_cache[model_id]                                          # fetch cached model
    def encode(texts, batch_size=BATCH_SIZE):                               # encoder closure
        with torch.inference_mode():                                        # no gradients
            return st_model.encode(                                         # SentenceTransformers encode
                texts,
                batch_size=batch_size,
                convert_to_numpy=True,
                show_progress_bar=False,
                normalize_embeddings=False  # leave cosine normalization to cosine_sim()
            )
    return encode                                                            # return callable


# Function: _get_hf_encoder
# Purpose : Lazy-load and cache a Hugging Face base model + tokenizer, returning
# a callable that encodes texts via mean pooling of last hidden states.
# Inputs  : 
#   model_id (str) -> Hugging Face repo id (non-ST).
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int, max_length: int) -> np.ndarray [N, D]
def _get_hf_encoder(model_id):
    if model_id not in _hf_cache:                                           # if not cached
        tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)        # tokenizer
        mdl = AutoModel.from_pretrained(model_id).to(DEVICE)                # base model
        mdl.eval()                                                          # eval mode (disables dropout)
        _hf_cache[model_id] = (tok, mdl)                                    # cache tuple
    tok, mdl = _hf_cache[model_id]                                          # unpack cache

    def mean_pool(last_hidden_state, attention_mask):                       # pooling helper
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)     # [B,T,1] mask, cast to float dtype
        summed = (last_hidden_state * mask).sum(dim=1)                      # sum masked states
        counts = mask.sum(dim=1).clamp(min=1e-9)                            # token counts per row
        return (summed / counts)                                            # mean = sum / count

    def encode(texts, batch_size=BATCH_SIZE, max_length=MAX_LEN):           # encoder closure
        embs = []                                                           # accumulator
        with torch.inference_mode():                                        # no gradients
            for i in range(0, len(texts), batch_size):                      # mini-batches
                batch = texts[i:i+batch_size]                               # slice texts
                inputs = tok(                                               # tokenize
                    batch, padding=True, truncation=True,
                    max_length=max_length, return_tensors='pt'
                ).to(DEVICE)
                outputs = mdl(**inputs)                                     # forward pass
                pooled = mean_pool(outputs.last_hidden_state, inputs['attention_mask'])  # mean pool
                embs.append(pooled.detach().cpu().numpy())                  # to CPU numpy
        return np.vstack(embs)                                              # stack to [N, D]
    return encode                                                           # return callable


# Function: get_encoder
# Purpose : Factory that returns a text->embedding encoder function for a repo id.
# Inputs  : 
#   model_id (str) -> HF repository identifier.
# Outputs : 
#   (callable) -> encode(texts: list[str], batch_size: int, [max_length]) -> np.ndarray [N, D]
def get_encoder(model_id):
    return _get_st_encoder(model_id) if _is_sentence_transformers_repo(model_id) else _get_hf_encoder(model_id)  # pick backend


# Evaluation

# Function: eval_model
# Purpose : End-to-end calibrated zero-shot evaluation for one backbone.
# Inputs  : 
#   model_name (str) -> short label for reports/filenames;
#   model_id (str)   -> HF repo id;
#   dev_data (list[dict])  -> WiC dev examples;
#   test_data (list[dict]) -> WiC test examples;
#   out_dir (str)     -> directory to write predictions.
# Outputs : 
#   (dict) -> summary with threshold, dev/test F1/Acc, and path to predictions CSV.
def eval_model(model_name, model_id, dev_data, test_data, out_dir=OUT_DIR):
    """
    Calibrated zero-shot:
      - insert [TGT]…[/TGT] around spans (same as Code 2’s logic),
      - encode each side with the same backbone as Code 2,
      - cosine similarity,
      - choose threshold on dev (max F1),
      - evaluate on test,
      - save predictions and a summary row.
    """
    print(f"\n{model_name} | {model_id}")          # header for the current model
    encoder = get_encoder(model_id)                         # obtain encoding callable for this repo

    # DEV
    dev_left, dev_right, dev_labels = [], [], []           # containers for left/right texts and labels
    for ex in dev_data:                                     # iterate dev examples
        s1 = insert_markers(ex['sentence1'], ex['start1'], ex['end1'])  # mark target in sentence1
        s2 = insert_markers(ex['sentence2'], ex['start2'], ex['end2'])  # mark target in sentence2
        dev_left.append(maybe_prefix(model_id, s1))         # prefix (e5) or leave as is
        dev_right.append(maybe_prefix(model_id, s2))        # prefix (e5) or leave as is
        dev_labels.append(int(ex['label']))                 # store gold label

    dev_emb1 = encoder(dev_left)                            # encode left dev sentences
    dev_emb2 = encoder(dev_right)                           # encode right dev sentences
    dev_sims = cosine_sim(dev_emb1, dev_emb2)               # cosine similarities for dev
    thr, dev_f1, dev_acc = best_threshold(dev_sims, dev_labels)  # pick best threshold on dev
    print(f"Dev : best_threshold={thr:.4f} | F1={dev_f1:.4f} | Acc={dev_acc:.4f}")  # report dev calibration

    # TEST
    test_left, test_right, test_labels = [], [], []         # prepare test containers
    for ex in test_data:                                    # iterate test examples
        s1 = insert_markers(ex['sentence1'], ex['start1'], ex['end1'])  # mark sentence1
        s2 = insert_markers(ex['sentence2'], ex['start2'], ex['end2'])  # mark sentence2
        test_left.append(maybe_prefix(model_id, s1))        # prefix if e5
        test_right.append(maybe_prefix(model_id, s2))       # prefix if e5
        test_labels.append(int(ex['label']))                # gold label

    test_emb1 = encoder(test_left)                          # encode left test sentences
    test_emb2 = encoder(test_right)                         # encode right test sentences
    test_sims = cosine_sim(test_emb1, test_emb2)            # cosine similarities
    test_pred = (test_sims >= thr).astype(int)              # predict using calibrated threshold

    test_f1  = f1_score(test_labels, test_pred)             # test F1
    test_acc = accuracy_score(test_labels, test_pred)       # test Accuracy
    print(f"Test : F1={test_f1:.4f} | Acc={test_acc:.4f}")  # report test metrics

    # Save per-pair predictions
    rows = []                                               # per-pair rows to write
    for ex, sim, pred in zip(test_data, test_sims, test_pred):  # iterate results
        rows.append({
            'model': model_name,                            # short model name
            'lemma': ex.get('lemma', ''),                   # lemma (if present)
            'sent_id1': ex.get('sent_id1', ''),             # sentence id 1
            'sent_id2': ex.get('sent_id2', ''),             # sentence id 2
            'sim': float(sim),                              # cosine similarity
            'pred': int(pred),                              # predicted label
            'label': int(ex['label']),                      # gold label (required; same access as test_labels above)
        })
    pred_path = os.path.join(out_dir, f'{model_name}_zeroshot_test_predictions.csv')  # CSV path
    pd.DataFrame(rows).to_csv(pred_path, index=False, encoding='utf-8-sig')          # write CSV with BOM

    return {
        'model': model_name,                                # echo model name
        'thresh': thr,                                      # chosen threshold
        'dev_f1': dev_f1, 'dev_acc': dev_acc,               # dev metrics
        'test_f1': test_f1, 'test_acc': test_acc,           # test metrics
        'pred_path': pred_path                              # where predictions were saved
    }


# Function: main
# Purpose : Orchestrate the full zero-shot evaluation across configured models.
# Inputs  : 
#   None (uses global config for paths/models).
# Outputs : 
#   None (prints metrics, writes predictions and a summary CSV).
def main():
    dev_data  = load_wic(WIC_DEV)                           # load dev split
    test_data = load_wic(WIC_TEST)                          # load test split

    summaries = []                                          # accumulate per-model summaries
    t0 = time.time()                                        # start timer
    for name, mid in MODELS:                                # iterate configured models
        try:
            s = eval_model(name, mid, dev_data, test_data, out_dir=OUT_DIR)  # evaluate one model
            summaries.append(s)                             # store summary
        except Exception as e:                              # robust loop: continue on error
            print(f"[WARN] {name} failed: {e}")             # report failure

    if summaries:                                           # if we collected any results
        sum_df = pd.DataFrame(summaries)                    # tabularize summaries
        sum_csv = os.path.join(OUT_DIR, 'zeroshot_calibrated_summary.csv')  # summary path
        sum_df.to_csv(sum_csv, index=False, encoding='utf-8-sig')           # write summary CSV
        print("\nCalibrated zero-shot summary")       # header
        print(sum_df.to_string(index=False))                # pretty-print table
        print(f"\nSaved summary to: {sum_csv}")             # location info
    print(f"\nDone in {time.time()-t0:.1f}s")               # total elapsed time

if __name__ == '__main__':                                  # script entry point
    main()                                                  # run main



sahajbert | neuropark/sahajBERT
Dev : best_threshold=0.8781 | F1=0.7309 | Acc=0.7092
Test : F1=0.7008 | Acc=0.6740

muril | google/muril-base-cased
Dev : best_threshold=0.9957 | F1=0.7140 | Acc=0.6805
Test : F1=0.6975 | Acc=0.6541

labse | sentence-transformers/LaBSE
Dev : best_threshold=0.2902 | F1=0.6737 | Acc=0.5350
Test : F1=0.6627 | Acc=0.5206

e5 | intfloat/multilingual-e5-base
Dev : best_threshold=0.7939 | F1=0.6753 | Acc=0.5545
Test : F1=0.6690 | Acc=0.5412

banglabert | sagorsarker/bangla-bert-base
Dev : best_threshold=0.6041 | F1=0.6725 | Acc=0.5602
Test : F1=0.6715 | Acc=0.5614

Calibrated zero-shot summary
     model   thresh   dev_f1  dev_acc  test_f1  test_acc                                                                                                                                                pred_path
 sahajbert 0.878064 0.730913 0.709164 0.700774  0.674005  C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\zero_shot_result_un

In [None]:
# Installs/updates Hugging Face Transformers (>=4.30) and Accelerate (>=0.20)
%pip install -U "transformers>=4.30" "accelerate>=0.20"

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers>=4.30
  Downloading transformers-4.55.0-py3-none-any.whl.metadata (39 kB)
Collecting accelerate>=0.20
  Downloading accelerate-1.10.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers>=4.30)
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Downloading transformers-4.55.0-py3-none-any.whl (11.3 MB)

**Few-/Full-Shot Fine-Tuning for Bangla WiC with a Trainer-Free PyTorch Loop (on the Uncapped WiC Dataset)**

This script fine-tunes multiple Hugging Face transformer backbones on a Bangla Word-in-Context (WiC) dataset using a trainer-free PyTorch loop. It injects [TGT] ... [/TGT] markers around the target spans, supports few-shot regimes (e.g., 5%, 10%, 20%, 30%) and full-shot (100%), performs early stopping on dev F1, and saves the best checkpoint plus CSV summaries. It runs each (model × regime) combination reproducibly (seeded), evaluates on the dev set per epoch and on the test set at the end, and writes consolidated results and a pivot table under the output directory.

 1) Define input/output folders, model IDs, regimes, and training hyperparameters.
 2) Select device (cuda if available) and set random seeds for Python/NumPy/Transformers for reproducibility.
 3) Read wic_train.json, wic_dev.json, and wic_test.json from DATA_DIR via load_json_array.
 4) Convert raw JSON lists to tidy DataFrames with standardized columns using to_dataframe.
 5) Wrap target spans with [TGT] and [/TGT] using with_markers.
 6) Tokenize paired sentences with a Hugging Face tokenizer in tokenize_pairs (no padding here; a collator pads dynamically).
 7) For each regime fraction, draw a stratified subset of the training DataFrame (preserving label balance) using stratified_fraction.
 8) Wrap encodings/labels into a lightweight WiCDataset.
 9) Build DataLoaders for train/dev/test with DataCollatorWithPadding for efficient dynamic padding.
 10) Load AutoModelForSequenceClassification (binary head) and tokenizer per model.
 11) Add special tokens ([TGT], [/TGT]) and resize embeddings accordingly.
 12) Train with AdamW and a linear warmup + decay schedule (get_linear_schedule_with_warmup).
 13) Compute total and warmup steps from loader size and MAX_EPOCHS.
 14) For each epoch:
    - Forward pass with labels → compute cross-entropy loss.
    - Backprop, gradient clipping, optimizer and scheduler steps.
    - Evaluate on the dev set using epoch_eval to get accuracy and F1. 
 15) Track best dev F1; when improved, save model + tokenizer and a small JSON with dev metrics.
 16) Stop early after PATIENCE epochs without dev-F1 improvement.
 17) Reload the best checkpoint (if available) and evaluate on the test set (accuracy, F1).
 18) Save per-run train_history.csv and summary.csv in a run-specific folder.
 19) Aggregate all runs into ALL_results.csv and a sorted pivot ALL_results_pivot.csv showing test metrics by model and regime.
 20) Print the final pivot table to the console.
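
Steps 7 and 13 above can be illustrated with a small, self-contained sketch. It is not the notebook's own implementation: `stratified_fraction_sketch` and `schedule_sizes` are hypothetical names, the `label` column name and the toy DataFrame are assumptions, and the defaults simply mirror the BATCH_SIZE, MAX_EPOCHS, WARMUP_RATIO, and SEED values configured below. The sketch samples the same fraction from each label group, then derives the total and warmup step counts that a linear warmup schedule would need.

```python
import math

import pandas as pd


def stratified_fraction_sketch(df, frac, label_col="label", seed=42):
    """Hypothetical helper: keep `frac` of the rows within each label group."""
    if frac >= 1.0:
        # Full-shot regime: keep everything
        return df.reset_index(drop=True)
    parts = [
        # Sample the same proportion from every label; keep at least one row
        group.sample(n=max(1, round(len(group) * frac)), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the labels are interleaved rather than grouped
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


def schedule_sizes(n_examples, batch_size=16, max_epochs=5, warmup_ratio=0.06):
    """Hypothetical helper: total and warmup optimizer steps for one run."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # batches per epoch (last one may be partial)
    total_steps = steps_per_epoch * max_epochs            # upper bound; early stopping may end sooner
    warmup_steps = int(warmup_ratio * total_steps)        # linear warmup portion
    return total_steps, warmup_steps


# Toy data (assumed shape): 60 negative and 40 positive pairs
toy = pd.DataFrame({"label": [0] * 60 + [1] * 40, "pair_id": range(100)})
subset = stratified_fraction_sketch(toy, 0.10)
print(subset["label"].value_counts().to_dict())
print(schedule_sizes(len(subset)))
```

Because the step count is computed from MAX_EPOCHS, a run that stops early never reaches the end of the decay schedule; that is expected with this kind of setup.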

In [None]:
import os, json, random, time                               # Std: filesystem, JSON I/O, RNG seeding, timing
from dataclasses import dataclass                           # For lightweight dataset container
from typing import Dict, List                               # Type hints for clarity
import numpy as np                                          # Numerical ops for logits → metrics
import pandas as pd                                         # Tabular wrangling of WiC lists
import torch                                                # Core PyTorch
from torch.utils.data import DataLoader                     # Mini-batching utilities
from sklearn.metrics import accuracy_score, f1_score        # Standard classification metrics

import transformers                                         # HF Transformers library (version printed below)
from transformers import (                                  # Selected utilities/classes from Transformers
    AutoTokenizer, AutoConfig, AutoModelForSequenceClassification,
    DataCollatorWithPadding, get_linear_schedule_with_warmup, set_seed
)

print("[info] transformers version:", transformers.__version__)  # Log Transformers version for reproducibility

# ---------------- CONFIG ----------------
DATA_DIR = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_uncapped"  # Root folder containing wic_train/dev/test.json
OUT_DIR  = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\WIC_finetuned_result"   # Output root for checkpoints & CSVs

MODELS = {     # HF model id per short name to iterate over
    "sahajbert": "neuropark/sahajBERT",
    "muril"    : "google/muril-base-cased",
    "labse"    : "sentence-transformers/LaBSE",
    "e5"       : "intfloat/multilingual-e5-base",
    "banglabert":"sagorsarker/bangla-bert-base",
    # Note: all are fine-tuned as sequence classifiers
}

REGIMES = [0.05, 0.10, 0.20, 0.30, 1.00]                     # Few-shot fractions and full-shot (100%)

SEED = 42                                                    # Global reproducibility seed
BATCH_SIZE = 16                                              # Per-step batch size
MAX_EPOCHS = 5                                               # Maximum epochs before stopping
LR = 2e-5                                                    # AdamW learning rate
WARMUP_RATIO = 0.06                                          # Linear warmup proportion of total steps
PATIENCE = 2                                                 # Early-stopping: epochs without dev-F1 improvement
MAX_LEN = 256                                                # Max sequence length for tokenization

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Prefer GPU if available


# IO & UTILS

# Function: load_json_array
# Purpose: Read a WiC split from disk (JSON list of dicts) and return it as a Python list.
# Inputs:
#   path (str): filesystem path to a JSON file containing the WiC examples.
# Outputs:
#   List[Dict]: a Python list where each element is a dictionary for one WiC example.
def load_json_array(path):
    with open(path, "r", encoding="utf-8") as f:            # Open the JSON file in UTF-8 mode
        return json.load(f)                                  # Parse and return as Python list


# Function: with_markers
# Purpose: Insert explicit [TGT] and [/TGT] markers into a sentence around a target span.
# Inputs:
#   sentence (str): the full sentence text.
#   start (int): character start index (inclusive) of the target span.
#   end (int): character end index (exclusive) of the target span.
#   open_tok (str): left marker token (default "[TGT]").
#   close_tok (str): right marker token (default "[/TGT]").
# Outputs:
#   str: the sentence with markers injected around the specified span (clamped to safe range).
def with_markers(sentence: str, start: int, end: int, open_tok="[TGT]", close_tok="[/TGT]"):
    start = max(0, int(start))                               # Clamp start to be non-negative
    end = max(start, int(end))                               # Enforce start ≤ end
    return sentence[:start] + open_tok + sentence[start:end] + close_tok + sentence[end:]  # Return marked sentence


# Function: to_dataframe
# Purpose: Convert a raw WiC JSON list into a tidy DataFrame with standardized columns for training.
# Inputs:
#   wic_list (List[Dict]): list of WiC examples read from JSON.
# Outputs:
#   pd.DataFrame: columns = lemma, s1, s2, sid1, sid2, start1, end1, start2, end2, label (int).
def to_dataframe(wic_list: List[Dict]):
    rows = []                                                # Accumulator for row dicts
    for x in wic_list:                                       # Iterate over raw examples
        rows.append({
            "lemma": x["lemma"],                             # Lemma string
            "s1": x["sentence1"],                            # Left sentence
            "s2": x["sentence2"],                            # Right sentence
            "sid1": x["sent_id1"],                           # Left sentence id
            "sid2": x["sent_id2"],                           # Right sentence id
            "start1": int(x["start1"]), "end1": int(x["end1"]),  # Left target span (char offsets)
            "start2": int(x["start2"]), "end2": int(x["end2"]),  # Right target span (char offsets)
            "label": int(x["label"]),                        # Gold label (1 = same sense, 0 = different)
        })
    return pd.DataFrame(rows)                                # Build DataFrame from rows


# Function: stratified_fraction
# Purpose: Take a stratified sample of the DataFrame by label to achieve a given fraction for few-shot regimes.
# Inputs:
#   df (pd.DataFrame): full training DataFrame with a 'label' column.
#   frac (float): fraction to sample; if ≥ 1.0, return a full shuffle.
#   seed (int): RNG seed for reproducibility.
# Outputs:
#   pd.DataFrame: sampled DataFrame with near-constant label proportions.
def stratified_fraction(df: pd.DataFrame, frac: float, seed: int):
    if frac >= 1.0:                                          # Full-shot case
        return df.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # Return shuffled copy
    parts = []                                               # Accumulate per-class samples
    for y, sub in df.groupby("label"):                       # Split by class label
        k = max(1, int(round(frac * len(sub))))              # Class-size specific sample count
        parts.append(sub.sample(n=k, random_state=seed))     # Sample that many rows from this class
    return pd.concat(parts, axis=0).sample(frac=1.0, random_state=seed).reset_index(drop=True)  # Recombine & shuffle


# Dataclass: WiCDataset
# Purpose: Minimal Dataset wrapper holding token encodings and labels for the Trainer-free loop.
# Fields:
#   encodings (Dict[str, List[List[int]]]): tokenized features (input_ids, attention_mask, etc.) as lists.
#   labels (List[int]): gold labels aligned to encodings.
@dataclass
class WiCDataset(torch.utils.data.Dataset):
    encodings: Dict[str, List[List[int]]]                    # Token features dictionary
    labels: List[int]                                        # Gold labels per example
    def __len__(self): return len(self.labels)               # Return dataset size
    def __getitem__(self, idx):                              # Retrieve one item by index
        # return python lists; collator will pad & convert to tensors
        item = {k: v[idx] for k, v in self.encodings.items()}  # Slice each feature list
        item["labels"] = self.labels[idx]                    # Attach the corresponding label
        return item                                          # Return a dict expected by the model


# Function: tokenize_pairs
# Purpose: Insert markers into sentence pairs and tokenize them with a given tokenizer.
# Inputs:
#   df (pd.DataFrame): must include s1, s2, start1, end1, start2, end2, and label.
#   tokenizer: a Hugging Face tokenizer instance.
# Outputs:
#   (encodings, labels): encodings is a dict of token lists; labels is a list[int].
def tokenize_pairs(df: pd.DataFrame, tokenizer):
    OPEN, CLOSE = "[TGT]", "[/TGT]"                          # Consistent marker tokens
    s1 = [with_markers(a, s, e, OPEN, CLOSE) for a, s, e in zip(df["s1"], df["start1"], df["end1"])]  # Left marked
    s2 = [with_markers(b, s, e, OPEN, CLOSE) for b, s, e in zip(df["s2"], df["start2"], df["end2"])]  # Right marked
    enc = tokenizer(                                         # Tokenize sentence pairs
        s1, s2,
        truncation=True,                                     # Truncate to MAX_LEN
        max_length=MAX_LEN,
        padding=False                                        # Dynamic padding handled by collator
    )
    labels = df["label"].astype(int).tolist()                # Extract labels as ints
    return enc, labels                                       # Return features and labels


# Function: compute_metrics_from_logits
# Purpose: Convert raw model logits to predictions and compute accuracy and F1.
# Inputs:
#   logits (np.ndarray): shape [N, 2] for binary classification.
#   labels (List[int]): gold labels (0/1) of length N.
# Outputs:
#   (acc, f1): tuple of floats with accuracy and F1 score.
def compute_metrics_from_logits(logits: np.ndarray, labels: List[int]):
    preds = logits.argmax(axis=-1)                           # Predicted class = argmax over logits
    acc = accuracy_score(labels, preds)                      # Compute accuracy
    f1  = f1_score(labels, preds)                            # Compute F1 (default average='binary', positive class = 1)
    return acc, f1                                           # Return both metrics


# TRAIN / EVAL

# Function: epoch_eval
# Purpose: Evaluate a model over a dataloader and compute accuracy and F1 from accumulated logits.
# Inputs:
#   model (nn.Module): sequence classification model.
#   dataloader (DataLoader): batched dataset iterator.
#   device (torch.device): 'cuda' or 'cpu' to run inference on.
# Outputs:
#   (acc, f1): floats with accuracy and F1 on the provided dataloader.
def epoch_eval(model, dataloader, device):
    model.eval()                                             # Switch to eval mode (no dropout, etc.)
    all_logits = []                                          # Accumulator for logits
    all_labels = []                                          # Accumulator for gold labels
    with torch.no_grad():                                    # Disable gradient tracking
        for batch in dataloader:                             # Iterate over batches
            labels = batch.pop("labels").to(device)          # Move labels to device and remove from features
            batch = {k: v.to(device) for k, v in batch.items()}  # Move all features to device
            outputs = model(**batch)                         # Forward pass
            logits = outputs.logits.detach().cpu().numpy()   # Collect logits on CPU as NumPy
            all_logits.append(logits)                        # Append batch logits
            all_labels.append(labels.cpu().numpy())          # Append batch labels
    all_logits = np.concatenate(all_logits, axis=0) if all_logits else np.zeros((0,2))  # Stack logits
    all_labels = np.concatenate(all_labels, axis=0).astype(int) if all_labels else np.zeros((0,), dtype=int)  # Stack labels
    return compute_metrics_from_logits(all_logits, all_labels)  # Compute and return metrics


# Function: train_eval_one
# Purpose: Fine-tune one backbone at a given regime, with early stopping on dev F1; save best; evaluate on test.
# Inputs:
#   model_id (str): Hugging Face model identifier.
#   model_name (str): short alias for naming outputs.
#   df_train, df_dev, df_test (pd.DataFrame): split dataframes from WiC JSONs.
#   regime_frac (float): fraction of the training set to sample (few-/full-shot).
#   out_root (str): output directory root for this model/regime run.
# Outputs:
#   Dict: a single summary row containing sizes, best dev F1, and test metrics.
def train_eval_one(model_id, model_name, df_train, df_dev, df_test, regime_frac, out_root):
    set_seed(SEED)                                           # Set HF/torch/random seeds

    # Tokenizer & model
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)  # Load tokenizer
    tokenizer.add_special_tokens({"additional_special_tokens": ["[TGT]", "[/TGT]"]})  # Add markers to vocab

    config = AutoConfig.from_pretrained(model_id, num_labels=2)  # Binary classification head
    model = AutoModelForSequenceClassification.from_pretrained(model_id, config=config)  # Load base weights
    model.resize_token_embeddings(len(tokenizer))            # Resize embeddings due to added tokens
    model.to(DEVICE)                                        # Move model to GPU/CPU

    # Few-shot sample
    df_subtrain = stratified_fraction(df_train, regime_frac, seed=SEED)  # Stratified downsample of train

    # Tokenize
    enc_tr, y_tr = tokenize_pairs(df_subtrain, tokenizer)    # Tokenize sampled train
    enc_dv, y_dv = tokenize_pairs(df_dev, tokenizer)         # Tokenize dev
    enc_ts, y_ts = tokenize_pairs(df_test, tokenizer)        # Tokenize test

    # Datasets / Loaders
    collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if torch.cuda.is_available() else None)  # Dynamic pad
    ds_tr = WiCDataset(enc_tr, y_tr)                         # Train dataset
    ds_dv = WiCDataset(enc_dv, y_dv)                         # Dev dataset
    ds_ts = WiCDataset(enc_ts, y_ts)                         # Test dataset

    train_loader = DataLoader(ds_tr, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator, num_workers=0)   # Train loader
    dev_loader   = DataLoader(ds_dv, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator, num_workers=0)  # Dev loader
    test_loader  = DataLoader(ds_ts, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator, num_workers=0)  # Test loader

    # Optim / Scheduler
    total_steps = max(1, len(train_loader) * MAX_EPOCHS)     # Total train steps for scheduler
    warmup_steps = int(WARMUP_RATIO * total_steps)           # Num warmup steps
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR) # AdamW optimizer
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)  # Linear warmup/decay

    # Output dir
    tag = f"{model_name}_frac{int(round(regime_frac * 100))}"  # Tag includes model alias and regime (rounded to avoid float truncation)
    out_dir = os.path.join(out_root, tag)                    # Run-specific output folder
    os.makedirs(out_dir, exist_ok=True)                      # Create it if missing

    best_dev_f1 = -1.0                                       # Track best dev F1 for early stopping
    no_improve = 0                                           # Counter for patience
    history = []                                             # Per-epoch logs

    print(f"\n[run] {model_name} | frac={regime_frac} | train={len(ds_tr)} dev={len(ds_dv)} test={len(ds_ts)}")  # Run header

    for epoch in range(1, MAX_EPOCHS+1):                     # Epoch loop
        model.train()                                        # Train mode
        t0 = time.time()                                     # Epoch timer
        total_loss = 0.0                                     # Reset loss accumulator
        for batch in train_loader:                           # Iterate training batches
            labels = batch.pop("labels").to(DEVICE)          # Extract and move labels
            batch  = {k: v.to(DEVICE) for k, v in batch.items()}  # Move features to device

            outputs = model(**batch, labels=labels)          # Forward pass with labels
            loss = outputs.loss                              # Cross-entropy loss

            optimizer.zero_grad(set_to_none=True)            # Clear grads
            loss.backward()                                  # Backprop
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Gradient clipping for stability
            optimizer.step()                                 # Optimizer step
            scheduler.step()                                 # Scheduler step

            total_loss += loss.item()                        # Accumulate batch loss

        # Eval on dev each epoch
        dev_acc, dev_f1 = epoch_eval(model, dev_loader, DEVICE)  # Validate on dev
        avg_loss = total_loss / max(1, len(train_loader))    # Compute mean train loss for epoch
        history.append({"epoch": epoch, "train_loss": avg_loss, "dev_acc": dev_acc, "dev_f1": dev_f1})  # Log row
        print(f"  epoch {epoch:02d} | loss {avg_loss:.4f} | dev_acc {dev_acc:.4f} | dev_f1 {dev_f1:.4f} | {time.time()-t0:.1f}s")  # Epoch summary

        # Early stopping on dev F1
        if dev_f1 > best_dev_f1:                             # If improved dev F1
            best_dev_f1 = dev_f1                             # Update best
            no_improve = 0                                   # Reset patience counter
            # Save best
            model.save_pretrained(out_dir)                   # Persist model weights/config
            tokenizer.save_pretrained(out_dir)               # Persist tokenizer artifacts
            with open(os.path.join(out_dir, "dev_best.json"), "w", encoding="utf-8") as f:  # Save best dev metrics
                json.dump({"epoch": epoch, "dev_acc": dev_acc, "dev_f1": dev_f1}, f, ensure_ascii=False, indent=2)
        else:
            no_improve += 1                                  # No improvement this epoch
            if no_improve >= PATIENCE:                       # Hit patience threshold
                print(f"  early stop: no dev F1 improvement for {PATIENCE} epoch(s).")  # Log early stop
                break                                        # Exit training loop

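    # Reload the best checkpoint saved above and evaluate it on the test set.
    # If reloading fails, fall back to the in-memory model (weights from the last epoch run).
    try:
        best_model = AutoModelForSequenceClassification.from_pretrained(out_dir).to(DEVICE)  # Reload best if present
    except Exception as e:
        print(f"  [WARN] could not reload best checkpoint from {out_dir}: {e}; using last-epoch model")  # Log the fallback
        best_model = model                                  # Otherwise use last-epoch model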
    test_acc, test_f1 = epoch_eval(best_model, test_loader, DEVICE)  # Final test evaluation

    # Save summary row
    row = {
        "model": model_name, "hf_id": model_id, "regime_frac": regime_frac,  # Identity fields
        "train_size": len(ds_tr), "dev_size": len(ds_dv), "test_size": len(ds_ts),  # Dataset sizes
        "best_dev_f1": float(best_dev_f1), "test_acc": float(test_acc), "test_f1": float(test_f1)  # Metrics
    }
    pd.DataFrame(history).to_csv(os.path.join(out_dir, "train_history.csv"), index=False, encoding="utf-8-sig")  # Persist per-epoch log
    pd.DataFrame([row]).to_csv(os.path.join(out_dir, "summary.csv"), index=False, encoding="utf-8-sig")          # Persist run summary
    return row                                               # Return summary dict to caller


# MAIN 

# Function: main
# Purpose: End-to-end orchestration: load splits, prepare DataFrames, iterate model × regime runs, and collect summaries.
# Inputs: 
#   None (uses module-level config paths and constants).
# Outputs: 
#   None (Files written under OUT_DIR; prints run tables).
def main():
    random.seed(SEED); np.random.seed(SEED); set_seed(SEED)  # Seed Python, NumPy, and Transformers

    train = load_json_array(os.path.join(DATA_DIR, "wic_train.json"))  # Read train split JSON
    dev   = load_json_array(os.path.join(DATA_DIR, "wic_dev.json"))    # Read dev split JSON
    test  = load_json_array(os.path.join(DATA_DIR, "wic_test.json"))   # Read test split JSON

    df_train = to_dataframe(train)                          # Convert train JSON → DataFrame
    df_dev   = to_dataframe(dev)                            # Convert dev JSON → DataFrame
    df_test  = to_dataframe(test)                           # Convert test JSON → DataFrame

    runs_dir = os.path.join(OUT_DIR, "runs"); os.makedirs(runs_dir, exist_ok=True)  # Ensure runs folder exists
    results = []                                            # Accumulate per-run summaries
    for model_name, model_id in MODELS.items():             # Iterate models
        for frac in REGIMES:                                # Iterate data regimes
            try:
                res = train_eval_one(model_id, model_name, df_train, df_dev, df_test, frac, runs_dir)  # Run a job
                results.append(res)                         # Keep summary row
            except Exception as e:                          # Robustness: catch & log errors per run
                print(f"[WARN] Run failed for {model_name} (frac={frac}): {e}")

    if results:                                             # If at least one successful run
        df_res = pd.DataFrame(results)                      # Build a table of summaries
        df_res.to_csv(os.path.join(OUT_DIR, "ALL_results.csv"), index=False, encoding="utf-8-sig")  # Save summaries CSV
        piv = df_res.pivot_table(index=["model","regime_frac"], values=["test_acc","test_f1"], aggfunc="mean")  # Pivot by model/regime
        piv = piv.reset_index().sort_values(["model","regime_frac"])  # Sort for readability
        piv.to_csv(os.path.join(OUT_DIR, "ALL_results_pivot.csv"), index=False, encoding="utf-8-sig")  # Save pivot CSV
        print("\n== Final (test) results ==\n", piv)        # Print final pivot table
    else:
        print("No successful runs to summarize.")           # Inform if nothing completed

if __name__ == "__main__":                                  # Script entry point
    main()                                                  # Launch the pipeline


Code cell output from Google Colab:

[info] transformers version: 4.55.1

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at neuropark/sahajBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] sahajbert | frac=0.05 | train=5186 dev=4714 test=4270
-  epoch 01 | loss 0.6360 | dev_acc 0.7567 | dev_f1 0.7624 | 133.6s
-  epoch 02 | loss 0.4427 | dev_acc 0.7804 | dev_f1 0.7946 | 134.0s
-  epoch 03 | loss 0.2491 | dev_acc 0.7885 | dev_f1 0.7977 | 133.8s
-  epoch 04 | loss 0.1333 | dev_acc 0.8046 | dev_f1 0.8066 | 132.9s
-  epoch 05 | loss 0.0686 | dev_acc 0.8089 | dev_f1 0.8086 | 131.7s

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at neuropark/sahajBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] sahajbert | frac=0.1 | train=10372 dev=4714 test=4270
-  epoch 01 | loss 0.5921 | dev_acc 0.7683 | dev_f1 0.7762 | 234.0s
-  epoch 02 | loss 0.2871 | dev_acc 0.8318 | dev_f1 0.8361 | 231.7s
-  epoch 03 | loss 0.1438 | dev_acc 0.8437 | dev_f1 0.8464 | 232.7s
-  epoch 04 | loss 0.0582 | dev_acc 0.8498 | dev_f1 0.8526 | 234.2s
-  epoch 05 | loss 0.0194 | dev_acc 0.8477 | dev_f1 0.8510 | 233.5s

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at neuropark/sahajBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] sahajbert | frac=0.2 | train=20742 dev=4714 test=4270
-  epoch 01 | loss 0.4881 | dev_acc 0.8210 | dev_f1 0.8281 | 434.5s
-  epoch 02 | loss 0.1722 | dev_acc 0.8471 | dev_f1 0.8483 | 434.9s
-  epoch 03 | loss 0.0590 | dev_acc 0.8581 | dev_f1 0.8597 | 437.0s
-  epoch 04 | loss 0.0122 | dev_acc 0.8572 | dev_f1 0.8595 | 434.0s
-  epoch 05 | loss 0.0033 | dev_acc 0.8579 | dev_f1 0.8625 | 435.4s

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at neuropark/sahajBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] sahajbert | frac=0.3 | train=31114 dev=4714 test=4270
-  epoch 01 | loss 0.3978 | dev_acc 0.8475 | dev_f1 0.8531 | 635.6s
-  epoch 02 | loss 0.0974 | dev_acc 0.8729 | dev_f1 0.8768 | 637.6s
-  epoch 03 | loss 0.0202 | dev_acc 0.8649 | dev_f1 0.8722 | 635.0s
-  epoch 04 | loss 0.0077 | dev_acc 0.8782 | dev_f1 0.8816 | 636.5s
-  epoch 05 | loss 0.0011 | dev_acc 0.8793 | dev_f1 0.8839 | 632.9s

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at neuropark/sahajBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] sahajbert | frac=1.0 | train=103714 dev=4714 test=4270
-  epoch 01 | loss 0.2277 | dev_acc 0.8681 | dev_f1 0.8764 | 2033.9s
-  epoch 02 | loss 0.0230 | dev_acc 0.8608 | dev_f1 0.8655 | 2029.9s
-  epoch 03 | loss 0.0094 | dev_acc 0.8808 | dev_f1 0.8855 | 2029.7s
-  epoch 04 | loss 0.0029 | dev_acc 0.8840 | dev_f1 0.8883 | 2036.9s
-  epoch 05 | loss 0.0000 | dev_acc 0.8863 | dev_f1 0.8894 | 2037.5s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] muril | frac=0.05 | train=5186 dev=4714 test=4270
-  epoch 01 | loss 0.6872 | dev_acc 0.6860 | dev_f1 0.6043 | 40.5s
-  epoch 02 | loss 0.5551 | dev_acc 0.7811 | dev_f1 0.7625 | 40.8s
-  epoch 03 | loss 0.3982 | dev_acc 0.8065 | dev_f1 0.8080 | 40.3s
-  epoch 04 | loss 0.2852 | dev_acc 0.8004 | dev_f1 0.8092 | 40.6s
-  epoch 05 | loss 0.2208 | dev_acc 0.8125 | dev_f1 0.8131 | 40.7s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] muril | frac=0.1 | train=10372 dev=4714 test=4270
-  epoch 01 | loss 0.6618 | dev_acc 0.7359 | dev_f1 0.7354 | 72.0s
-  epoch 02 | loss 0.4744 | dev_acc 0.8010 | dev_f1 0.7971 | 72.2s
-  epoch 03 | loss 0.2978 | dev_acc 0.8254 | dev_f1 0.8217 | 72.2s
-  epoch 04 | loss 0.2062 | dev_acc 0.8422 | dev_f1 0.8448 | 72.5s
-  epoch 05 | loss 0.1516 | dev_acc 0.8403 | dev_f1 0.8402 | 72.4s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] muril | frac=0.2 | train=20742 dev=4714 test=4270
-  epoch 01 | loss 0.5644 | dev_acc 0.7951 | dev_f1 0.8168 | 136.7s
-  epoch 02 | loss 0.2661 | dev_acc 0.8564 | dev_f1 0.8579 | 136.0s
-  epoch 03 | loss 0.1479 | dev_acc 0.8570 | dev_f1 0.8598 | 136.8s
-  epoch 04 | loss 0.0813 | dev_acc 0.8619 | dev_f1 0.8633 | 135.5s
-  epoch 05 | loss 0.0476 | dev_acc 0.8647 | dev_f1 0.8665 | 136.0s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] muril | frac=0.3 | train=31114 dev=4714 test=4270
-  epoch 01 | loss 0.4953 | dev_acc 0.8377 | dev_f1 0.8300 | 200.9s
-  epoch 02 | loss 0.1802 | dev_acc 0.8691 | dev_f1 0.8666 | 199.8s
-  epoch 03 | loss 0.0691 | dev_acc 0.8664 | dev_f1 0.8689 | 199.9s
-  epoch 04 | loss 0.0290 | dev_acc 0.8721 | dev_f1 0.8721 | 199.0s
-  epoch 05 | loss 0.0132 | dev_acc 0.8746 | dev_f1 0.8733 | 199.3s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] muril | frac=1.0 | train=103714 dev=4714 test=4270
-  epoch 01 | loss 0.2852 | dev_acc 0.8729 | dev_f1 0.8709 | 641.1s
-  epoch 02 | loss 0.0216 | dev_acc 0.8602 | dev_f1 0.8553 | 641.9s
-  epoch 03 | loss 0.0069 | dev_acc 0.8753 | dev_f1 0.8773 | 641.1s
-  epoch 04 | loss 0.0027 | dev_acc 0.8801 | dev_f1 0.8809 | 642.4s
-  epoch 05 | loss 0.0006 | dev_acc 0.8829 | dev_f1 0.8849 | 641.6s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] labse | frac=0.05 | train=5186 dev=4714 test=4270
-  epoch 01 | loss 0.6548 | dev_acc 0.7183 | dev_f1 0.7002 | 49.9s
-  epoch 02 | loss 0.4340 | dev_acc 0.7862 | dev_f1 0.7784 | 49.8s
-  epoch 03 | loss 0.2442 | dev_acc 0.7991 | dev_f1 0.7975 | 50.2s
-  epoch 04 | loss 0.1357 | dev_acc 0.8021 | dev_f1 0.7988 | 50.2s
-  epoch 05 | loss 0.0691 | dev_acc 0.8112 | dev_f1 0.8103 | 49.7s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] labse | frac=0.1 | train=10372 dev=4714 test=4270
-  epoch 01 | loss 0.5804 | dev_acc 0.8023 | dev_f1 0.8131 | 89.8s
-  epoch 02 | loss 0.2603 | dev_acc 0.8426 | dev_f1 0.8437 | 90.0s
-  epoch 03 | loss 0.1340 | dev_acc 0.8551 | dev_f1 0.8587 | 90.4s
-  epoch 04 | loss 0.0724 | dev_acc 0.8555 | dev_f1 0.8504 | 89.9s
-  epoch 05 | loss 0.0323 | dev_acc 0.8583 | dev_f1 0.8585 | 89.4s
-  early stop: no dev F1 improvement for 2 epoch(s).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] labse | frac=0.2 | train=20742 dev=4714 test=4270
-  epoch 01 | loss 0.4618 | dev_acc 0.8424 | dev_f1 0.8365 | 171.0s
-  epoch 02 | loss 0.1545 | dev_acc 0.8606 | dev_f1 0.8612 | 170.7s
-  epoch 03 | loss 0.0608 | dev_acc 0.8727 | dev_f1 0.8705 | 170.7s
-  epoch 04 | loss 0.0242 | dev_acc 0.8717 | dev_f1 0.8724 | 170.7s
-  epoch 05 | loss 0.0079 | dev_acc 0.8731 | dev_f1 0.8749 | 171.1s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] labse | frac=0.3 | train=31114 dev=4714 test=4270
-  epoch 01 | loss 0.3962 | dev_acc 0.8511 | dev_f1 0.8598 | 251.3s
-  epoch 02 | loss 0.1028 | dev_acc 0.8740 | dev_f1 0.8748 | 251.4s
-  epoch 03 | loss 0.0293 | dev_acc 0.8757 | dev_f1 0.8774 | 249.8s
-  epoch 04 | loss 0.0129 | dev_acc 0.8689 | dev_f1 0.8710 | 250.3s
-  epoch 05 | loss 0.0051 | dev_acc 0.8731 | dev_f1 0.8744 | 251.6s
-  early stop: no dev F1 improvement for 2 epoch(s).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] labse | frac=1.0 | train=103714 dev=4714 test=4270
-  epoch 01 | loss 0.2199 | dev_acc 0.9005 | dev_f1 0.8997 | 814.5s
-  epoch 02 | loss 0.0187 | dev_acc 0.8871 | dev_f1 0.8915 | 811.8s
-  epoch 03 | loss 0.0070 | dev_acc 0.8931 | dev_f1 0.8936 | 812.9s
-  early stop: no dev F1 improvement for 2 epoch(s).

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] e5 | frac=0.05 | train=5186 dev=4714 test=4270
-  epoch 01 | loss 0.6743 | dev_acc 0.6655 | dev_f1 0.6992 | 50.3s
-  epoch 02 | loss 0.4945 | dev_acc 0.7484 | dev_f1 0.7641 | 50.4s
-  epoch 03 | loss 0.2965 | dev_acc 0.7679 | dev_f1 0.7799 | 50.1s
-  epoch 04 | loss 0.1942 | dev_acc 0.7707 | dev_f1 0.7856 | 50.6s
-  epoch 05 | loss 0.1405 | dev_acc 0.7811 | dev_f1 0.7851 | 50.2s

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] e5 | frac=0.1 | train=10372 dev=4714 test=4270
-  epoch 01 | loss 0.5921 | dev_acc 0.7650 | dev_f1 0.7465 | 89.2s
-  epoch 02 | loss 0.3399 | dev_acc 0.8036 | dev_f1 0.8049 | 89.6s
-  epoch 03 | loss 0.2139 | dev_acc 0.8065 | dev_f1 0.7966 | 89.7s
-  epoch 04 | loss 0.1459 | dev_acc 0.8127 | dev_f1 0.8077 | 89.5s
-  epoch 05 | loss 0.1049 | dev_acc 0.8127 | dev_f1 0.8083 | 90.0s

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] e5 | frac=0.2 | train=20742 dev=4714 test=4270
-  epoch 01 | loss 0.5105 | dev_acc 0.7966 | dev_f1 0.7822 | 169.1s
-  epoch 02 | loss 0.2292 | dev_acc 0.8095 | dev_f1 0.8067 | 169.3s
-  epoch 03 | loss 0.1411 | dev_acc 0.8303 | dev_f1 0.8285 | 169.5s
-  epoch 04 | loss 0.0729 | dev_acc 0.8339 | dev_f1 0.8383 | 169.8s
-  epoch 05 | loss 0.0404 | dev_acc 0.8360 | dev_f1 0.8387 | 169.3s

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] e5 | frac=0.3 | train=31114 dev=4714 test=4270
-  epoch 01 | loss 0.4667 | dev_acc 0.8154 | dev_f1 0.8099 | 248.1s
-  epoch 02 | loss 0.1847 | dev_acc 0.8261 | dev_f1 0.8285 | 249.1s
-  epoch 03 | loss 0.0857 | dev_acc 0.8375 | dev_f1 0.8443 | 249.6s
-  epoch 04 | loss 0.0386 | dev_acc 0.8502 | dev_f1 0.8553 | 248.4s
-  epoch 05 | loss 0.0133 | dev_acc 0.8464 | dev_f1 0.8499 | 248.3s

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] e5 | frac=1.0 | train=103714 dev=4714 test=4270
-  epoch 01 | loss 0.2843 | dev_acc 0.8541 | dev_f1 0.8583 | 801.1s
-  epoch 02 | loss 0.0379 | dev_acc 0.8513 | dev_f1 0.8512 | 803.1s
-  epoch 03 | loss 0.0118 | dev_acc 0.8430 | dev_f1 0.8521 | 802.4s
-  early stop: no dev F1 improvement for 2 epoch(s).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] banglabert | frac=0.05 | train=5186 dev=4714 test=4270
-  epoch 01 | loss 0.6050 | dev_acc 0.7367 | dev_f1 0.7250 | 43.9s
-  epoch 02 | loss 0.3134 | dev_acc 0.7552 | dev_f1 0.7602 | 43.9s
-  epoch 03 | loss 0.1632 | dev_acc 0.7563 | dev_f1 0.7608 | 43.9s
-  epoch 04 | loss 0.0815 | dev_acc 0.7592 | dev_f1 0.7631 | 43.7s
-  epoch 05 | loss 0.0368 | dev_acc 0.7650 | dev_f1 0.7610 | 43.8s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] banglabert | frac=0.1 | train=10372 dev=4714 test=4270
-  epoch 01 | loss 0.5391 | dev_acc 0.7603 | dev_f1 0.7354 | 77.7s
-  epoch 02 | loss 0.2399 | dev_acc 0.7764 | dev_f1 0.7846 | 77.4s
-  epoch 03 | loss 0.1205 | dev_acc 0.7720 | dev_f1 0.7623 | 77.4s
-  epoch 04 | loss 0.0671 | dev_acc 0.7813 | dev_f1 0.7676 | 77.5s
-  early stop: no dev F1 improvement for 2 epoch(s).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] banglabert | frac=0.2 | train=20742 dev=4714 test=4270
-  epoch 01 | loss 0.4503 | dev_acc 0.7864 | dev_f1 0.7847 | 145.5s
-  epoch 02 | loss 0.1583 | dev_acc 0.7951 | dev_f1 0.7934 | 145.8s
-  epoch 03 | loss 0.0587 | dev_acc 0.8055 | dev_f1 0.8031 | 146.0s
-  epoch 04 | loss 0.0210 | dev_acc 0.7995 | dev_f1 0.8047 | 145.6s
-  epoch 05 | loss 0.0073 | dev_acc 0.8036 | dev_f1 0.8088 | 145.1s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] banglabert | frac=0.3 | train=31114 dev=4714 test=4270
-  epoch 01 | loss 0.3921 | dev_acc 0.7864 | dev_f1 0.7712 | 214.2s
-  epoch 02 | loss 0.1022 | dev_acc 0.8038 | dev_f1 0.8103 | 213.1s
-  epoch 03 | loss 0.0309 | dev_acc 0.8108 | dev_f1 0.8151 | 213.5s
-  epoch 04 | loss 0.0150 | dev_acc 0.8127 | dev_f1 0.8156 | 213.7s
-  epoch 05 | loss 0.0037 | dev_acc 0.8159 | dev_f1 0.8195 | 213.4s

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[run] banglabert | frac=1.0 | train=103714 dev=4714 test=4270
-  epoch 01 | loss 0.2225 | dev_acc 0.8023 | dev_f1 0.8169 | 686.4s
-  epoch 02 | loss 0.0230 | dev_acc 0.8063 | dev_f1 0.8066 | 687.4s
-  epoch 03 | loss 0.0071 | dev_acc 0.8129 | dev_f1 0.8201 | 686.8s
-  epoch 04 | loss 0.0025 | dev_acc 0.8017 | dev_f1 0.8176 | 688.1s
-  epoch 05 | loss 0.0009 | dev_acc 0.8106 | dev_f1 0.8218 | 686.3s

Final (test) results

         model  regime_frac  test_acc   test_f1
0   banglabert         0.05  0.740281  0.745584
1   banglabert         0.10  0.763700  0.779550
2   banglabert         0.20  0.798829  0.805788
3   banglabert         0.30  0.799063  0.804467
4   banglabert         1.00  0.791335  0.807684
5           e5         0.05  0.775176  0.790026
6           e5         0.10  0.804450  0.803391
7           e5         0.20  0.832084  0.837966
8           e5         0.30  0.842623  0.849462
9           e5         1.00  0.846136  0.852658
10       labse         0.05  0.797892  0.798506
11       labse         0.10  0.836534  0.843146
12       labse         0.20  0.867681  0.870204
13       labse         0.30  0.888993  0.891085
14       labse         1.00  0.884543  0.885800
15       muril         0.05  0.797892  0.796510
16       muril         0.10  0.826932  0.828658
17       muril         0.20  0.869087  0.870452
18       muril         0.30  0.876347  0.875413
19       muril         1.00  0.881499  0.883141
20   sahajbert         0.05  0.784543  0.783936
21   sahajbert         0.10  0.830679  0.838436
22   sahajbert         0.20  0.854333  0.861470
23   sahajbert         0.30  0.863700  0.872424
24   sahajbert         1.00  0.885714  0.891169
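
To read the table at a glance, the sketch below picks, for each model, the training fraction with the highest test F1. It is only an illustration: the `labse` and `sahajbert` rows are copied by hand from the table above, and the names `rows` and `best_frac` are ours, not part of the notebook.

```python
# (model, regime_frac, test_f1) rows copied from the results table above.
rows = [
    ("labse", 0.05, 0.798506),
    ("labse", 0.10, 0.843146),
    ("labse", 0.20, 0.870204),
    ("labse", 0.30, 0.891085),
    ("labse", 1.00, 0.885800),
    ("sahajbert", 0.05, 0.783936),
    ("sahajbert", 0.10, 0.838436),
    ("sahajbert", 0.20, 0.861470),
    ("sahajbert", 0.30, 0.872424),
    ("sahajbert", 1.00, 0.891169),
]

# Keep, per model, the (fraction, F1) pair with the largest F1.
best_frac = {}
for model, frac, f1 in rows:
    if model not in best_frac or f1 > best_frac[model][1]:
        best_frac[model] = (frac, f1)

for model, (frac, f1) in best_frac.items():
    print(f"{model}: best test_f1={f1:.4f} at regime_frac={frac}")
```

On these rows LaBSE peaks at 30% of the training data while sahajBERT peaks at 100%, so additional data does not help every backbone equally.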

**Few-/Full-Shot Fine-Tuning for Bangla WiC with a Trainer-Free PyTorch Loop (on the Capped WiC Dataset)**
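
Each sentence in a WiC pair reaches the classifier with its target word wrapped in `[TGT]` and `[/TGT]` markers, located by character offsets. A minimal sketch of that step follows, assuming the offsets are Python string indices (code points) with an exclusive end. The helper name `mark_target` and the example sentence are illustrative only.

```python
def mark_target(sentence: str, start: int, end: int,
                open_tok: str = "[TGT]", close_tok: str = "[/TGT]") -> str:
    """Wrap sentence[start:end] in marker tokens (end is exclusive)."""
    start = min(max(0, start), len(sentence))   # keep start inside the string
    end = min(max(start, end), len(sentence))   # keep end at or after start
    return sentence[:start] + open_tok + sentence[start:end] + close_tok + sentence[end:]

sentence = "আমার অর্থ দরকার"     # illustrative sentence containing the lemma 'অর্থ'
target = "অর্থ"
start = sentence.find(target)    # derived here; the dataset stores offsets per example
end = start + len(target)
marked = mark_target(sentence, start, end)
print(marked)
```

The two marked sentences of a pair are then passed to the tokenizer together, so the model sees both target occurrences in one input sequence.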

In [None]:
import os, json, random, time                               # Std: filesystem, JSON I/O, RNG seeding, timing
from dataclasses import dataclass                           # For lightweight dataset container
from typing import Dict, List                               # Type hints for clarity
import numpy as np                                          # Numerical ops for logits → metrics
import pandas as pd                                         # Tabular wrangling of WiC lists
import torch                                                # Core PyTorch
from torch.utils.data import DataLoader                     # Mini-batching utilities
from sklearn.metrics import accuracy_score, f1_score        # Standard classification metrics

import transformers                                         # HF Transformers library (version printed below)
from transformers import (                                  # Selected utilities/classes from Transformers
    AutoTokenizer, AutoConfig, AutoModelForSequenceClassification,
    DataCollatorWithPadding, get_linear_schedule_with_warmup, set_seed
)

print("[info] transformers version:", transformers.__version__)  # Log Transformers version for reproducibility

# ---------------- CONFIG ----------------
DATA_DIR = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Dataset\bangla_wic_dataset_capped"  # Root folder containing wic_train/dev/test.json
OUT_DIR  = r"C:\Users\Student\Downloads\projecting_sentences-main\projecting_sentences-main\Result\WIC_finetuned_result_capped"   # Output root for checkpoints & CSVs

MODELS = {     # HF model id per short name to iterate over
    "sahajbert": "neuropark/sahajBERT",
    "muril"    : "google/muril-base-cased",
    "labse"    : "sentence-transformers/LaBSE",
    "e5"       : "intfloat/multilingual-e5-base",
    "banglabert":"sagorsarker/bangla-bert-base",
    # Note: all are fine-tuned as sequence classifiers
}

REGIMES = [0.05, 0.10, 0.20, 0.30, 1.00]                     # Few-shot fractions and full-shot (100%)

SEED = 42                                                    # Global reproducibility seed
BATCH_SIZE = 16                                              # Per-step batch size
MAX_EPOCHS = 5                                               # Maximum epochs before stopping
LR = 2e-5                                                    # AdamW learning rate
WARMUP_RATIO = 0.06                                          # Linear warmup proportion of total steps
PATIENCE = 2                                                 # Early-stopping: epochs without dev-F1 improvement
MAX_LEN = 256                                                # Max sequence length for tokenization

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Prefer GPU if available


# IO & UTILS

# Function: load_json_array
# Purpose: Read a WiC split from disk (JSON list of dicts) and return it as a Python list.
# Inputs:
#   path (str): filesystem path to a JSON file containing the WiC examples.
# Outputs:
#   List[Dict]: a Python list where each element is a dictionary for one WiC example.
def load_json_array(path):
    with open(path, "r", encoding="utf-8") as f:            # Open the JSON file in UTF-8 mode
        return json.load(f)                                  # Parse and return as Python list


# Function: with_markers
# Purpose: Insert explicit [TGT] and [/TGT] markers into a sentence around a target span.
# Inputs:
#   sentence (str): the full sentence text.
#   start (int): character start index (inclusive) of the target span.
#   end (int): character end index (exclusive) of the target span.
#   open_tok (str): left marker token (default "[TGT]").
#   close_tok (str): right marker token (default "[/TGT]").
# Outputs:
#   str: the sentence with markers injected around the specified span (clamped to safe range).
def with_markers(sentence: str, start: int, end: int, open_tok="[TGT]", close_tok="[/TGT]"):
    start = min(max(0, int(start)), len(sentence))           # Clamp start into [0, len(sentence)]
    end = min(max(start, int(end)), len(sentence))           # Clamp end into [start, len(sentence)]
    return sentence[:start] + open_tok + sentence[start:end] + close_tok + sentence[end:]  # Return marked sentence


# Function: to_dataframe
# Purpose: Convert a raw WiC JSON list into a tidy DataFrame with standardized columns for training.
# Inputs:
#   wic_list (List[Dict]): list of WiC examples read from JSON.
# Outputs:
#   pd.DataFrame: columns = lemma, s1, s2, sid1, sid2, start1, end1, start2, end2, label (int).
def to_dataframe(wic_list: List[Dict]):
    rows = []                                                # Accumulator for row dicts
    for x in wic_list:                                       # Iterate over raw examples
        rows.append({
            "lemma": x["lemma"],                             # Lemma string
            "s1": x["sentence1"],                            # Left sentence
            "s2": x["sentence2"],                            # Right sentence
            "sid1": x["sent_id1"],                           # Left sentence id
            "sid2": x["sent_id2"],                           # Right sentence id
            "start1": int(x["start1"]), "end1": int(x["end1"]),  # Left target span (char offsets)
            "start2": int(x["start2"]), "end2": int(x["end2"]),  # Right target span (char offsets)
            "label": int(x["label"]),                        # Gold label (1 = same sense, 0 = different)
        })
    return pd.DataFrame(rows)                                # Build DataFrame from rows


# Function: stratified_fraction
# Purpose: Take a stratified sample of the DataFrame by label to achieve a given fraction for few-shot regimes.
# Inputs:
#   df (pd.DataFrame): full training DataFrame with a 'label' column.
#   frac (float): fraction to sample; if ≥ 1.0, return a full shuffle.
#   seed (int): RNG seed for reproducibility.
# Outputs:
#   pd.DataFrame: sampled DataFrame with near-constant label proportions.
def stratified_fraction(df: pd.DataFrame, frac: float, seed: int):
    if frac >= 1.0:                                          # Full-shot case
        return df.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # Return shuffled copy
    parts = []                                               # Accumulate per-class samples
    for y, sub in df.groupby("label"):                       # Split by class label
        k = max(1, int(round(frac * len(sub))))              # Class-size specific sample count
        parts.append(sub.sample(n=k, random_state=seed))     # Sample that many rows from this class
    return pd.concat(parts, axis=0).sample(frac=1.0, random_state=seed).reset_index(drop=True)  # Recombine & shuffle


# Dataclass: WiCDataset
# Purpose: Minimal Dataset wrapper holding token encodings and labels for the Trainer-free loop.
# Fields:
#   encodings (Dict[str, List[List[int]]]): tokenized features (input_ids, attention_mask, etc.) as lists.
#   labels (List[int]): gold labels aligned to encodings.
@dataclass
class WiCDataset(torch.utils.data.Dataset):
    encodings: Dict[str, List[List[int]]]                    # Token features dictionary
    labels: List[int]                                        # Gold labels per example
    def __len__(self): return len(self.labels)               # Return dataset size
    def __getitem__(self, idx):                              # Retrieve one item by index
        # return python lists; collator will pad & convert to tensors
        item = {k: v[idx] for k, v in self.encodings.items()}  # Slice each feature list
        item["labels"] = self.labels[idx]                    # Attach the corresponding label
        return item                                          # Return a dict expected by the model


# Function: tokenize_pairs
# Purpose: Insert markers into sentence pairs and tokenize them with a given tokenizer.
# Inputs:
#   df (pd.DataFrame): must include s1, s2, start1, end1, start2, end2, and label.
#   tokenizer: a Hugging Face tokenizer instance.
# Outputs:
#   (encodings, labels): encodings is a dict of token lists; labels is a list[int].
def tokenize_pairs(df: pd.DataFrame, tokenizer):
    OPEN, CLOSE = "[TGT]", "[/TGT]"                          # Consistent marker tokens
    s1 = [with_markers(a, s, e, OPEN, CLOSE) for a, s, e in zip(df["s1"], df["start1"], df["end1"])]  # Left marked
    s2 = [with_markers(b, s, e, OPEN, CLOSE) for b, s, e in zip(df["s2"], df["start2"], df["end2"])]  # Right marked
    enc = tokenizer(                                         # Tokenize sentence pairs
        s1, s2,
        truncation=True,                                     # Truncate to MAX_LEN
        max_length=MAX_LEN,
        padding=False                                        # Dynamic padding handled by collator
    )
    labels = df["label"].astype(int).tolist()                # Extract labels as ints
    return enc, labels                                       # Return features and labels


# Function: compute_metrics_from_logits
# Purpose: Convert raw model logits to predictions and compute accuracy and F1.
# Inputs:
#   logits (np.ndarray): shape [N, 2] for binary classification.
#   labels (List[int]): gold labels (0/1) of length N.
# Outputs:
#   (acc, f1): tuple of floats with accuracy and F1 score.
def compute_metrics_from_logits(logits: np.ndarray, labels: List[int]):
    preds = logits.argmax(axis=-1)                           # Predicted class = argmax over logits
    acc = accuracy_score(labels, preds)                      # Compute accuracy
    f1  = f1_score(labels, preds)                            # Compute F1 for the positive class (default average='binary')
    return acc, f1                                           # Return both metrics


# TRAIN / EVAL

# Function: epoch_eval
# Purpose: Evaluate a model over a dataloader and compute accuracy and F1 from accumulated logits.
# Inputs:
#   model (nn.Module): sequence classification model.
#   dataloader (DataLoader): batched dataset iterator.
#   device (torch.device): 'cuda' or 'cpu' to run inference on.
# Outputs:
#   (acc, f1): floats with accuracy and F1 on the provided dataloader.
def epoch_eval(model, dataloader, device):
    model.eval()                                             # Switch to eval mode (no dropout, etc.)
    all_logits = []                                          # Accumulator for logits
    all_labels = []                                          # Accumulator for gold labels
    with torch.no_grad():                                    # Disable gradient tracking
        for batch in dataloader:                             # Iterate over batches
            labels = batch.pop("labels").to(device)          # Move labels to device and remove from features
            batch = {k: v.to(device) for k, v in batch.items()}  # Move all features to device
            outputs = model(**batch)                         # Forward pass
            logits = outputs.logits.detach().cpu().numpy()   # Collect logits on CPU as NumPy
            all_logits.append(logits)                        # Append batch logits
            all_labels.append(labels.cpu().numpy())          # Append batch labels
    all_logits = np.concatenate(all_logits, axis=0) if all_logits else np.zeros((0,2))  # Stack logits
    all_labels = np.concatenate(all_labels, axis=0).astype(int) if all_labels else np.zeros((0,), dtype=int)  # Stack labels
    return compute_metrics_from_logits(all_logits, all_labels)  # Compute and return metrics


# Function: train_eval_one
# Purpose: Fine-tune one backbone at a given regime, with early stopping on dev F1; save best; evaluate on test.
# Inputs:
#   model_id (str): Hugging Face model identifier.
#   model_name (str): short alias for naming outputs.
#   df_train, df_dev, df_test (pd.DataFrame): split dataframes from WiC JSONs.
#   regime_frac (float): fraction of the training set to sample (few-/full-shot).
#   out_root (str): output directory root for this model/regime run.
# Outputs:
#   Dict: a single summary row containing sizes, best dev F1, and test metrics.
def train_eval_one(model_id, model_name, df_train, df_dev, df_test, regime_frac, out_root):
    set_seed(SEED)                                           # Set HF/torch/random seeds

    # Tokenizer & model
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)  # Load tokenizer
    tokenizer.add_special_tokens({"additional_special_tokens": ["[TGT]", "[/TGT]"]})  # Add markers to vocab

    config = AutoConfig.from_pretrained(model_id, num_labels=2)  # Binary classification head
    model = AutoModelForSequenceClassification.from_pretrained(model_id, config=config)  # Load base weights
    model.resize_token_embeddings(len(tokenizer))            # Resize embeddings due to added tokens
    model.to(DEVICE)                                        # Move model to GPU/CPU

    # Few-shot sample
    df_subtrain = stratified_fraction(df_train, regime_frac, seed=SEED)  # Stratified downsample of train

    # Tokenize
    enc_tr, y_tr = tokenize_pairs(df_subtrain, tokenizer)    # Tokenize sampled train
    enc_dv, y_dv = tokenize_pairs(df_dev, tokenizer)         # Tokenize dev
    enc_ts, y_ts = tokenize_pairs(df_test, tokenizer)        # Tokenize test

    # Datasets / Loaders
    collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if torch.cuda.is_available() else None)  # Dynamic pad
    ds_tr = WiCDataset(enc_tr, y_tr)                         # Train dataset
    ds_dv = WiCDataset(enc_dv, y_dv)                         # Dev dataset
    ds_ts = WiCDataset(enc_ts, y_ts)                         # Test dataset

    train_loader = DataLoader(ds_tr, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator, num_workers=0)   # Train loader
    dev_loader   = DataLoader(ds_dv, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator, num_workers=0)  # Dev loader
    test_loader  = DataLoader(ds_ts, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator, num_workers=0)  # Test loader

    # Optim / Scheduler
    total_steps = max(1, len(train_loader) * MAX_EPOCHS)     # Total train steps for scheduler
    warmup_steps = int(WARMUP_RATIO * total_steps)           # Num warmup steps
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR) # AdamW optimizer
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)  # Linear warmup/decay

    # Output dir
    tag = f"{model_name}_frac{int(round(regime_frac*100))}"  # Tag includes model alias and regime (round avoids float truncation, e.g. 0.29*100)
    out_dir = os.path.join(out_root, tag); os.makedirs(out_dir, exist_ok=True)  # Make output folder

    best_dev_f1 = -1.0                                       # Track best dev F1 for early stopping
    no_improve = 0                                           # Counter for patience
    history = []                                             # Per-epoch logs

    print(f"\n[run] {model_name} | frac={regime_frac} | train={len(ds_tr)} dev={len(ds_dv)} test={len(ds_ts)}")  # Run header

    for epoch in range(1, MAX_EPOCHS+1):                     # Epoch loop
        model.train()                                        # Train mode
        t0 = time.time()                                     # Epoch timer
        total_loss = 0.0                                     # Reset loss accumulator
        for batch in train_loader:                           # Iterate training batches
            labels = batch.pop("labels").to(DEVICE)          # Extract and move labels
            batch  = {k: v.to(DEVICE) for k, v in batch.items()}  # Move features to device

            outputs = model(**batch, labels=labels)          # Forward pass with labels
            loss = outputs.loss                              # Cross-entropy loss

            optimizer.zero_grad(set_to_none=True)            # Clear grads
            loss.backward()                                  # Backprop
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Gradient clipping for stability
            optimizer.step()                                 # Optimizer step
            scheduler.step()                                 # Scheduler step

            total_loss += loss.item()                        # Accumulate batch loss

        # Eval on dev each epoch
        dev_acc, dev_f1 = epoch_eval(model, dev_loader, DEVICE)  # Validate on dev
        avg_loss = total_loss / max(1, len(train_loader))    # Compute mean train loss for epoch
        history.append({"epoch": epoch, "train_loss": avg_loss, "dev_acc": dev_acc, "dev_f1": dev_f1})  # Log row
        print(f"  epoch {epoch:02d} | loss {avg_loss:.4f} | dev_acc {dev_acc:.4f} | dev_f1 {dev_f1:.4f} | {time.time()-t0:.1f}s")  # Epoch summary

        # Early stopping on dev F1
        if dev_f1 > best_dev_f1:                             # If improved dev F1
            best_dev_f1 = dev_f1                             # Update best
            no_improve = 0                                   # Reset patience counter
            # Save best
            model.save_pretrained(out_dir)                   # Persist model weights/config
            tokenizer.save_pretrained(out_dir)               # Persist tokenizer artifacts
            with open(os.path.join(out_dir, "dev_best.json"), "w", encoding="utf-8") as f:  # Save best dev metrics
                json.dump({"epoch": epoch, "dev_acc": dev_acc, "dev_f1": dev_f1}, f, ensure_ascii=False, indent=2)
        else:
            no_improve += 1                                  # No improvement this epoch
            if no_improve >= PATIENCE:                       # Hit patience threshold
                print(f"  early stop: no dev F1 improvement for {PATIENCE} epoch(s).")  # Log early stop
                break                                        # Exit training loop

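    # Reload the best checkpoint saved above and evaluate it on test.
    # (The in-memory 'model' holds the last-epoch weights, which may be worse than the best epoch.)
    try:
        best_model = AutoModelForSequenceClassification.from_pretrained(out_dir).to(DEVICE)  # Reload best checkpoint
    except Exception as e:
        print(f"  [warn] could not reload best checkpoint ({e}); evaluating last-epoch model instead.")  # Make the fallback visible
        best_model = model                                  # Fall back to the last-epoch model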
    test_acc, test_f1 = epoch_eval(best_model, test_loader, DEVICE)  # Final test evaluation

    # Save summary row
    row = {
        "model": model_name, "hf_id": model_id, "regime_frac": regime_frac,  # Identity fields
        "train_size": len(ds_tr), "dev_size": len(ds_dv), "test_size": len(ds_ts),  # Dataset sizes
        "best_dev_f1": float(best_dev_f1), "test_acc": float(test_acc), "test_f1": float(test_f1)  # Metrics
    }
    pd.DataFrame(history).to_csv(os.path.join(out_dir, "train_history.csv"), index=False, encoding="utf-8-sig")  # Persist per-epoch log
    pd.DataFrame([row]).to_csv(os.path.join(out_dir, "summary.csv"), index=False, encoding="utf-8-sig")          # Persist run summary
    return row                                               # Return summary dict to caller


# MAIN 

# Function: main
# Purpose: End-to-end orchestration—load splits, prepare DataFrames, and iterate model × regime runs; collect summaries.
# Inputs: 
#   None (uses module-level config paths and constants).
# Outputs: 
#   None (Files written under OUT_DIR; prints run tables).
def main():
    random.seed(SEED); np.random.seed(SEED); set_seed(SEED)  # Seed Python, NumPy, and Transformers

    train = load_json_array(os.path.join(DATA_DIR, "wic_train.json"))  # Read train split JSON
    dev   = load_json_array(os.path.join(DATA_DIR, "wic_dev.json"))    # Read dev split JSON
    test  = load_json_array(os.path.join(DATA_DIR, "wic_test.json"))   # Read test split JSON

    df_train = to_dataframe(train)                          # Convert train JSON → DataFrame
    df_dev   = to_dataframe(dev)                            # Convert dev JSON → DataFrame
    df_test  = to_dataframe(test)                           # Convert test JSON → DataFrame

    runs_dir = os.path.join(OUT_DIR, "runs"); os.makedirs(runs_dir, exist_ok=True)  # Ensure runs folder exists
    results = []                                            # Accumulate per-run summaries
    for model_name, model_id in MODELS.items():             # Iterate models
        for frac in REGIMES:                                # Iterate data regimes
            try:
                res = train_eval_one(model_id, model_name, df_train, df_dev, df_test, frac, runs_dir)  # Run a job
                results.append(res)                         # Keep summary row
            except Exception as e:                          # Robustness: catch & log errors per run
                print(f"[WARN] Run failed for {model_name} (frac={frac}): {e}")

    if results:                                             # If at least one successful run
        df_res = pd.DataFrame(results)                      # Build a table of summaries
        df_res.to_csv(os.path.join(OUT_DIR, "ALL_results.csv"), index=False, encoding="utf-8-sig")  # Save summaries CSV
        piv = df_res.pivot_table(index=["model","regime_frac"], values=["test_acc","test_f1"], aggfunc="mean")  # Pivot by model/regime
        piv = piv.reset_index().sort_values(["model","regime_frac"])  # Sort for readability
        piv.to_csv(os.path.join(OUT_DIR, "ALL_results_pivot.csv"), index=False, encoding="utf-8-sig")  # Save pivot CSV
        print("\n== Final (test) results ==\n", piv)        # Print final pivot table
    else:
        print("No successful runs to summarize.")           # Inform if nothing completed

if __name__ == "__main__":                                  # Script entry point
    main()                                                  # Launch the pipeline


[info] transformers version: 4.55.0
[skip] already finished: sahajbert_frac1 (test_f1=0.4329896907216495)
[skip] already finished: sahajbert_frac5 (test_f1=0.6220095693779905)
[skip] already finished: sahajbert_frac10 (test_f1=0.5806451612903226)
[skip] already finished: sahajbert_frac20 (test_f1=0.7302904564315352)
[skip] already finished: sahajbert_frac30 (test_f1=0.6857142857142857)
[skip] already finished: sahajbert_frac100 (test_f1=0.734375)
[skip] already finished: muril_frac1 (test_f1=0.6666666666666666)
[skip] already finished: muril_frac5 (test_f1=0.662020905923345)
[skip] already finished: muril_frac10 (test_f1=0.6666666666666666)
[skip] already finished: muril_frac20 (test_f1=0.6890756302521008)
[skip] already finished: muril_frac30 (test_f1=0.7083333333333334)
[skip] already finished: muril_frac100 (test_f1=0.7175572519083969)
[skip] already finished: labse_frac1 (test_f1=0.6167400881057269)
[skip] already finished: labse_frac5 (test_f1=0.5520361990950227)
[skip] already fi

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/LaBSE and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`



[run] labse | frac=1.0 | train=5106 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.5212 | dev_acc 0.5321 | dev_f1 0.6792 | 2268.9s
  epoch 02 | loss 0.1671 | dev_acc 0.6697 | dev_f1 0.7097 | 2576.0s
  epoch 03 | loss 0.0549 | dev_acc 0.6789 | dev_f1 0.7222 | 2340.7s
  epoch 04 | loss 0.0221 | dev_acc 0.6468 | dev_f1 0.7200 | 2497.6s
  epoch 05 | loss 0.0103 | dev_acc 0.6468 | dev_f1 0.7200 | 2516.1s
  early stop: no dev F1 improvement for 2 epoch(s).
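
The "early stop" line above is consistent with a patience-based rule on dev F1: training halts once two consecutive epochs fail to beat the best score so far, and otherwise runs for at most five epochs. The following is a minimal sketch of such a rule, not the notebook's actual implementation. The function name `run_with_early_stopping`, the `patience=2` default, and the strict `>` comparison are assumptions inferred from the log; the log alone does not show how ties are treated.

```python
def run_with_early_stopping(dev_f1_per_epoch, patience=2):
    """Replay per-epoch dev F1 scores under a patience-based stopping rule.

    Hypothetical helper: returns (epochs_run, best_epoch, best_f1).
    """
    best_f1 = float("-inf")   # best dev F1 seen so far
    best_epoch = 0            # epoch (1-based) at which best_f1 was reached
    stale = 0                 # consecutive epochs without improvement
    epochs_run = 0

    for epoch, f1 in enumerate(dev_f1_per_epoch, start=1):
        epochs_run = epoch
        if f1 > best_f1:      # assumption: only a strict improvement resets the counter
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break         # stop: `patience` epochs in a row without improvement

    return epochs_run, best_epoch, best_f1


# Dev F1 values copied from the labse frac=1.0 run logged above.
labse_full = [0.6792, 0.7097, 0.7222, 0.7200, 0.7200]
print(run_with_early_stopping(labse_full))
```

Under these assumptions the replay stops after epoch 5 with the best dev F1 at epoch 3, which matches the logged run. A real training loop would also save the checkpoint from the best epoch so that test metrics come from that checkpoint rather than from the last epoch.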


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



[run] e5 | frac=0.01 | train=52 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.7019 | dev_acc 0.5000 | dev_f1 0.6175 | 45.9s
  epoch 02 | loss 0.6851 | dev_acc 0.5367 | dev_f1 0.5121 | 44.9s
  epoch 03 | loss 0.6500 | dev_acc 0.4771 | dev_f1 0.1618 | 45.6s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] e5 | frac=0.05 | train=256 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.7178 | dev_acc 0.4817 | dev_f1 0.2098 | 143.2s
  epoch 02 | loss 0.6744 | dev_acc 0.5138 | dev_f1 0.6558 | 130.8s
  epoch 03 | loss 0.6332 | dev_acc 0.5138 | dev_f1 0.4804 | 140.4s
  epoch 04 | loss 0.5858 | dev_acc 0.5092 | dev_f1 0.5023 | 145.7s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] e5 | frac=0.1 | train=510 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.7058 | dev_acc 0.4587 | dev_f1 0.3295 | 250.8s
  epoch 02 | loss 0.6717 | dev_acc 0.5046 | dev_f1 0.5424 | 241.0s
  epoch 03 | loss 0.6184 | dev_acc 0.4817 | dev_f1 0.5498 | 241.7s
  epoch 04 | loss 0.5576 | dev_acc 0.5183 | dev_f1 0.5783 | 242.8s
  epoch 05 | loss 0.4955 | dev_acc 0.5138 | dev_f1 0.6015 | 239.2s


[run] e5 | frac=0.2 | train=1022 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6873 | dev_acc 0.5046 | dev_f1 0.6687 | 456.2s
  epoch 02 | loss 0.6039 | dev_acc 0.5688 | dev_f1 0.6643 | 444.5s
  epoch 03 | loss 0.4523 | dev_acc 0.6193 | dev_f1 0.6029 | 442.7s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] e5 | frac=0.3 | train=1532 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6889 | dev_acc 0.5275 | dev_f1 0.5422 | 645.3s
  epoch 02 | loss 0.5338 | dev_acc 0.5642 | dev_f1 0.6468 | 625.0s
  epoch 03 | loss 0.3177 | dev_acc 0.6055 | dev_f1 0.6884 | 636.9s
  epoch 04 | loss 0.1915 | dev_acc 0.6606 | dev_f1 0.7218 | 621.5s
  epoch 05 | loss 0.1052 | dev_acc 0.6606 | dev_f1 0.7109 | 618.8s


[run] e5 | frac=1.0 | train=5106 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.5925 | dev_acc 0.6101 | dev_f1 0.6352 | 2102.8s
  epoch 02 | loss 0.2400 | dev_acc 0.6101 | dev_f1 0.6863 | 2136.0s
  epoch 03 | loss 0.1160 | dev_acc 0.6147 | dev_f1 0.7000 | 2108.6s
  epoch 04 | loss 0.0691 | dev_acc 0.6284 | dev_f1 0.7216 | 2146.3s
  epoch 05 | loss 0.0293 | dev_acc 0.6376 | dev_f1 0.7228 | 2127.2s


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



[run] banglabert | frac=0.01 | train=52 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6991 | dev_acc 0.4908 | dev_f1 0.5277 | 39.6s
  epoch 02 | loss 0.5935 | dev_acc 0.4908 | dev_f1 0.5394 | 37.0s
  epoch 03 | loss 0.5742 | dev_acc 0.5092 | dev_f1 0.5244 | 37.2s
  epoch 04 | loss 0.4696 | dev_acc 0.5183 | dev_f1 0.4670 | 37.7s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] banglabert | frac=0.05 | train=256 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6948 | dev_acc 0.5046 | dev_f1 0.6687 | 107.8s
  epoch 02 | loss 0.6173 | dev_acc 0.6193 | dev_f1 0.6103 | 92.8s
  epoch 03 | loss 0.4496 | dev_acc 0.6239 | dev_f1 0.6940 | 107.9s
  epoch 04 | loss 0.3373 | dev_acc 0.6789 | dev_f1 0.7083 | 125.8s
  epoch 05 | loss 0.2548 | dev_acc 0.6606 | dev_f1 0.7016 | 122.4s


[run] banglabert | frac=0.1 | train=510 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.7182 | dev_acc 0.5596 | dev_f1 0.6757 | 202.0s
  epoch 02 | loss 0.5454 | dev_acc 0.6881 | dev_f1 0.7444 | 208.4s
  epoch 03 | loss 0.3634 | dev_acc 0.6743 | dev_f1 0.7300 | 179.2s
  epoch 04 | loss 0.2017 | dev_acc 0.6927 | dev_f1 0.7173 | 159.9s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] banglabert | frac=0.2 | train=1022 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6670 | dev_acc 0.6422 | dev_f1 0.6100 | 333.4s
  epoch 02 | loss 0.3805 | dev_acc 0.6743 | dev_f1 0.7149 | 330.1s
  epoch 03 | loss 0.1623 | dev_acc 0.6514 | dev_f1 0.6960 | 373.5s
  epoch 04 | loss 0.0702 | dev_acc 0.6606 | dev_f1 0.7063 | 379.8s
  early stop: no dev F1 improvement for 2 epoch(s).


[run] banglabert | frac=0.3 | train=1532 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.6394 | dev_acc 0.6422 | dev_f1 0.6977 | 514.5s
  epoch 02 | loss 0.3249 | dev_acc 0.6560 | dev_f1 0.7148 | 512.0s
  epoch 03 | loss 0.1454 | dev_acc 0.6606 | dev_f1 0.6992 | 522.5s
  epoch 04 | loss 0.0549 | dev_acc 0.6835 | dev_f1 0.7206 | 504.5s
  epoch 05 | loss 0.0183 | dev_acc 0.6697 | dev_f1 0.7000 | 516.2s


[run] banglabert | frac=1.0 | train=5106 dev=218 test=194 (start_epoch=1)
  epoch 01 | loss 0.4836 | dev_acc 0.6560 | dev_f1 0.7331 | 1779.0s
  epoch 02 | loss 0.1595 | dev_acc 0.6743 | dev_f1 0.7171 | 1776.1s
  epoch 03 | loss 0.0585 | dev_acc 0.6927 | dev_f1 0.7491 | 1766.5s
  epoch 04 | loss 0.0198 | dev_acc 0.6927 | dev_f1 0.7452 | 1778.6s
  epoch 05 | loss 0.0088 | dev_acc 0.6881 | dev_f1 0.7500 | 1779.6s

== Final (test) results ==
          model  regime_frac  test_acc   test_f1
0   banglabert         0.01  0.505155  0.596639
1   banglabert         0.05  0.623711  0.675556
2   banglabert         0.10  0.613402  0.683544
3   banglabert         0.20  0.664948  0.716157
4   banglabert         0.30  0.685567  0.726457
5   banglabert         1.00  0.634021  0.719368
6           e5         0.01  0.469072  0.579592
7           e5         0.05  0.510309  0.649446
8           e5         0.10  0.577320  0.637168
9           e5         0.20  0.494845  0.662069
10          e5         0.30 