# Prepare MGSM Dataset for CCL

This notebook prepares the MGSM (Multilingual Grade School Math) dataset for use with CCL (Causal-aware In-context Learning).

## Overview

MGSM is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057). The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) are each translated via human annotators in 10 languages:
- Spanish (es)
- French (fr)
- German (de)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Thai (th)
- Swahili (sw)
- Bengali (bn)
- Telugu (te)

The data is available as `.tsv` files in the `existing_work/url_nlp/mgsm/` directory. Each file contains question-answer pairs (tab-separated), where the answer is a numeric value.

The CCL framework requires data in a specific format with embeddings:
- **X**: Question embeddings
- **E**: Environment embeddings (language)
- **T**: Task embeddings  
- **Y**: Answer embeddings
- **Y_eq**: Equation format answer embeddings
- **Index_E**: Environment index (language ID)
- **Index_T**: Task index
- **answer**: Actual answer string

We'll use OpenAI's `text-embedding-3-small` model to generate embeddings as specified in the CCL README.

**Data Source**: https://github.com/google-research/url-nlp/tree/main/mgsm


In [1]:
import os
import sys
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from tqdm import tqdm
import time
from openai import OpenAI
from collections import defaultdict

# Configuration
EMB_MODEL = "text-embedding-3-small"  # As specified in CCL README

# OpenAI API setup (only needed for in-context learning, not for loading embeddings)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", 'xx')
print(OPENAI_API_KEY)

xx


In [2]:

if OPENAI_API_KEY and OPENAI_API_KEY != "xxxxx":
    client = OpenAI(api_key=OPENAI_API_KEY)
    print("✓ OpenAI API client initialized (for in-context learning)")
else:
    client = None
    print("⚠ OpenAI API key not set - embeddings will be loaded from saved files")
    print("  Set OPENAI_API_KEY environment variable if you need to use API for in-context learning")

# Path to MGSM data directory (TSV files)
MGSM_DATA_DIR = Path("existing_work/url_nlp/mgsm")

# Path to saved embeddings
EMB_DIR = Path("data/mgsm/embs")
ID_EMB_PATH = EMB_DIR / "dataset_gpt_emb_ID.pkl"
OOD_EMB_PATH = EMB_DIR / "dataset_gpt_emb_OOD.pkl"

# Language mapping for MGSM
# Based on the 10 languages from the MGSM paper plus English
LANGUAGES = {
    'en': 0,  # English (ID)
    'es': 1,  # Spanish (OOD)
    'fr': 2,  # French (OOD)
    'de': 3,  # German (OOD)
    'ru': 4,  # Russian (OOD)
    'zh': 5,  # Chinese (OOD)
    'ja': 6,  # Japanese (OOD)
    'th': 7,  # Thai (OOD)
    'sw': 8,  # Swahili (OOD)
    'bn': 9,  # Bengali (OOD)
    'te': 10, # Telugu (OOD)
}

print(f"Embedding model: {EMB_MODEL}")
print(f"Languages to process: {list(LANGUAGES.keys())}")
print(f"MGSM data directory: {MGSM_DATA_DIR}")
print(f"Embeddings directory: {EMB_DIR}")


✓ OpenAI API client initialized (for in-context learning)
Embedding model: text-embedding-3-small
Languages to process: ['en', 'es', 'fr', 'de', 'ru', 'zh', 'ja', 'th', 'sw', 'bn', 'te']
MGSM data directory: existing_work/url_nlp/mgsm
Embeddings directory: data/mgsm/embs


In [3]:
# API functions for in-context learning (only used when querying LLMs)
# These are NOT used for loading embeddings - embeddings are loaded from saved files

def get_embedding(text, model=EMB_MODEL, max_retries=3):
    """
    Get embedding for a text using OpenAI API with retry logic.
    
    NOTE: This function is only used for in-context learning (when querying LLMs).
    For loading dataset embeddings, use the saved pickle files instead.
    """
    if client is None:
        raise ValueError("OpenAI API client not initialized. Set OPENAI_API_KEY environment variable.")
    
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                model=model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
                continue
            else:
                print(f"Error getting embedding for text: {text[:50]}... Error: {e}")
                raise
    
def get_embeddings_batch(texts, model=EMB_MODEL, batch_size=100):
    """
    Get embeddings for a batch of texts using OpenAI API.
    
    NOTE: This function is only used for in-context learning (when querying LLMs).
    For loading dataset embeddings, use the saved pickle files instead.
    """
    if client is None:
        raise ValueError("OpenAI API client not initialized. Set OPENAI_API_KEY environment variable.")
    
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Getting embeddings"):
        batch = texts[i:i+batch_size]
        try:
            response = client.embeddings.create(
                model=model,
                input=batch
            )
            batch_embeddings = [item.embedding for item in response.data]
            embeddings.extend(batch_embeddings)
            time.sleep(0.1)  # Rate limiting
        except Exception as e:
            print(f"Error in batch {i//batch_size}: {e}")
            # Fallback to individual requests
            for text in batch:
                embeddings.append(get_embedding(text, model))
    return embeddings

print("✓ API functions defined (for in-context learning only)")
print("  Embeddings will be loaded from saved pickle files, not generated via API")


✓ API functions defined (for in-context learning only)
  Embeddings will be loaded from saved pickle files, not generated via API


In [4]:
# Load MGSM data from TSV files
# The data is available as .tsv files in existing_work/url_nlp/mgsm/
# Each file is named mgsm_{lang}.tsv and contains question-answer pairs (tab-separated)

def load_mgsm_tsv(lang_code):
    """Load MGSM data from TSV file for a given language."""
    tsv_file = MGSM_DATA_DIR / f"mgsm_{lang_code}.tsv"
    
    if not tsv_file.exists():
        print(f"Warning: File not found: {tsv_file}")
        return []
    
    # Read TSV file (tab-separated: question \t answer)
    df = pd.read_csv(tsv_file, sep='\t', header=None, names=['question', 'answer'])
    
    # Convert answer to string (it's a numeric value)
    df['answer'] = df['answer'].astype(str)
    
    return df

# Load data for all languages
all_data = []
for lang in LANGUAGES.keys():
    lang_df = load_mgsm_tsv(lang)
    if len(lang_df) > 0:
        lang_df['language'] = lang
        lang_df['Index_E'] = LANGUAGES[lang]
        lang_df['Index_T'] = 0  # Single task for MGSM
        all_data.append(lang_df)
        print(f"Loaded {len(lang_df)} samples from {lang} ({MGSM_DATA_DIR / f'mgsm_{lang}.tsv'})")
    else:
        print(f"No data loaded for {lang}")

# Combine all data
if all_data:
    df = pd.concat(all_data, ignore_index=True)
    print(f"\nTotal samples loaded: {len(df)}")
    print(f"Languages: {df['language'].unique()}")
    print(f"\nSample data:")
    print(df[['question', 'answer', 'language', 'Index_E']].head())
else:
    raise ValueError("No data loaded! Please check that TSV files exist in the MGSM data directory.")


Loaded 250 samples from en (existing_work/url_nlp/mgsm/mgsm_en.tsv)
Loaded 250 samples from es (existing_work/url_nlp/mgsm/mgsm_es.tsv)
Loaded 250 samples from fr (existing_work/url_nlp/mgsm/mgsm_fr.tsv)
Loaded 250 samples from de (existing_work/url_nlp/mgsm/mgsm_de.tsv)
Loaded 250 samples from ru (existing_work/url_nlp/mgsm/mgsm_ru.tsv)
Loaded 250 samples from zh (existing_work/url_nlp/mgsm/mgsm_zh.tsv)
Loaded 250 samples from ja (existing_work/url_nlp/mgsm/mgsm_ja.tsv)
Loaded 250 samples from th (existing_work/url_nlp/mgsm/mgsm_th.tsv)
Loaded 250 samples from sw (existing_work/url_nlp/mgsm/mgsm_sw.tsv)
Loaded 250 samples from bn (existing_work/url_nlp/mgsm/mgsm_bn.tsv)
Loaded 250 samples from te (existing_work/url_nlp/mgsm/mgsm_te.tsv)

Total samples loaded: 2750
Languages: ['en' 'es' 'fr' 'de' 'ru' 'zh' 'ja' 'th' 'sw' 'bn' 'te']

Sample data:
                                            question answer language  Index_E
0  Janet’s ducks lay 16 eggs per day. She eats th...     18     

In [5]:
# Format questions and answers according to CCL template
# Based on icl/prompts/templates.json: "Question: {}\n" and "Answer: {}\n\n"
# The answer from TSV is already a numeric value (as string)
df['answer_number'] = df['answer']  # Keep the numeric answer
df['question_formatted'] = df['question'].apply(lambda x: f"Question: {x}\n")
df['answer_formatted'] = df['answer_number'].apply(lambda x: f"Answer: {x}\n\n")
df['answer_eq'] = df['answer_number']  # Equation format (just the number)

print(f"DataFrame shape: {df.shape}")
print(f"\nSample formatted data:")
print(df[['question', 'answer_number', 'language', 'Index_E']].head())


DataFrame shape: (2750, 9)

Sample formatted data:
                                            question answer_number language  \
0  Janet’s ducks lay 16 eggs per day. She eats th...            18       en   
1  A robe takes 2 bolts of blue fiber and half th...             3       en   
2  Josh decides to try flipping a house.  He buys...         70000       en   
3  James decides to run 3 sprints 3 times a week....           540       en   
4  Every day, Wendi feeds each of her chickens th...            20       en   

   Index_E  
0        0  
1        0  
2        0  
3        0  
4        0  


In [12]:
# Generate embeddings using OpenAI API
# This cell generates embeddings for all the data loaded from TSV files

if client is None:
    print("⚠ Error: OpenAI API client not initialized!")
    print("Please set OPENAI_API_KEY environment variable before running this cell.")
    print("\nYou can set it by running:")
    print("  import os")
    print("  os.environ['OPENAI_API_KEY'] = 'your-api-key-here'")
    print("  Then re-run Cell 1 to initialize the client.")
else:
    print("=" * 80)
    print("Generating Embeddings Using OpenAI API")
    print("=" * 80)
    print(f"Model: {EMB_MODEL}")
    print(f"Total samples to process: {len(df)}")
    print(f"\nThis will generate embeddings for:")
    print(f"  - Questions (X): {len(df)} samples")
    print(f"  - Answers (Y): {len(df)} samples")
    print(f"  - Equation answers (Y_eq): {len(df)} samples")
    print(f"  - Languages (E): {len(df['language'].unique())} unique languages")
    print(f"  - Task (T): 1 task")
    print(f"\nThis may take several minutes depending on API rate limits...")
    print("=" * 80)
    
    # Generate question embeddings (X)
    print("\n[1/5] Generating question embeddings (X)...")
    df['X'] = get_embeddings_batch(df['question_formatted'].tolist(), batch_size=100)
    
    # Generate answer embeddings (Y)
    print("\n[2/5] Generating answer embeddings (Y)...")
    df['Y'] = get_embeddings_batch(df['answer_formatted'].tolist(), batch_size=100)
    
    # Generate equation answer embeddings (Y_eq)
    print("\n[3/5] Generating equation answer embeddings (Y_eq)...")
    df['Y_eq'] = get_embeddings_batch(df['answer_eq'].apply(lambda x: f"Answer: {x}").tolist(), batch_size=100)
    
    # Generate environment embeddings (E) - one per language
    print("\n[4/5] Generating environment embeddings (E)...")
    unique_languages = df['language'].unique()
    lang_embeddings = {}
    for lang in tqdm(unique_languages, desc="  Languages"):
        lang_text = f"Language: {lang}"
        lang_embeddings[lang] = get_embedding(lang_text)
    df['E'] = df['language'].apply(lambda x: lang_embeddings[x])
    
    # Generate task embeddings (T) - single task for MGSM
    print("\n[5/5] Generating task embeddings (T)...")
    task_text = "Task: Grade School Math Problem Solving"
    task_embedding = get_embedding(task_text)
    df['T'] = [task_embedding] * len(df)
    
    # Convert embeddings to numpy arrays for consistency
    print("\nConverting embeddings to numpy arrays...")
    for col in ['X', 'E', 'T', 'Y', 'Y_eq']:
        df[col] = df[col].apply(np.array)
    
    # Verify embedding shapes
    print("\n" + "=" * 80)
    print("✓ Embeddings generated successfully!")
    print("=" * 80)
    print(f"  Question embedding (X) shape: {np.array(df['X'].iloc[0]).shape}")
    print(f"  Answer embedding (Y) shape: {np.array(df['Y'].iloc[0]).shape}")
    print(f"  Environment embedding (E) shape: {np.array(df['E'].iloc[0]).shape}")
    print(f"  Task embedding (T) shape: {np.array(df['T'].iloc[0]).shape}")
    print(f"  Equation answer embedding (Y_eq) shape: {np.array(df['Y_eq'].iloc[0]).shape}")
    print(f"\nTotal samples with embeddings: {len(df)}")
    print("=" * 80)


Generating Embeddings Using OpenAI API
Model: text-embedding-3-small
Total samples to process: 2750

This will generate embeddings for:
  - Questions (X): 2750 samples
  - Answers (Y): 2750 samples
  - Equation answers (Y_eq): 2750 samples
  - Languages (E): 11 unique languages
  - Task (T): 1 task

This may take several minutes depending on API rate limits...

[1/5] Generating question embeddings (X)...


Getting embeddings:   0%|          | 0/28 [00:00<?, ?it/s]

Error in batch 0: Error code: 401 - {'error': {'message': 'Incorrect API key provided: xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}


Getting embeddings:   0%|          | 0/28 [00:03<?, ?it/s]

Error getting embedding for text: Question: Janet’s ducks lay 16 eggs per day. She e... Error: Error code: 401 - {'error': {'message': 'Incorrect API key provided: xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}





AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

Okay

In [13]:
# Load embeddings from data/mgsm/embs directory
# This cell loads embeddings from the saved pickle files in data/mgsm/embs

print("=" * 80)
print("Loading Embeddings from data/mgsm/embs")
print("=" * 80)

# Check if embeddings are already in the dataframe
if 'X' in df.columns and df['X'].iloc[0] is not None:
    print("✓ Embeddings are already present in the dataframe!")
    print(f"  Total samples: {len(df)}")
    print(f"  Embedding columns: {[col for col in ['X', 'Y', 'E', 'T', 'Y_eq'] if col in df.columns]}")
else:
    # Load from pickle files in data/mgsm/embs
    id_path = EMB_DIR / "dataset_gpt_emb_ID.pkl"
    ood_path = EMB_DIR / "dataset_gpt_emb_OOD.pkl"
    
    if id_path.exists() and ood_path.exists():
        print(f"Loading ID embeddings from: {id_path}")
        with open(id_path, 'rb') as f:
            df_id_loaded = pickle.load(f)
        
        print(f"Loading OOD embeddings from: {ood_path}")
        with open(ood_path, 'rb') as f:
            df_ood_loaded = pickle.load(f)
        
        # Combine ID and OOD data
        df = pd.concat([df_id_loaded, df_ood_loaded], ignore_index=True)
        
        print(f"\n✓ Successfully loaded embeddings from data/mgsm/embs!")
        print(f"  ID samples: {len(df_id_loaded)}")
        print(f"  OOD samples: {len(df_ood_loaded)}")
        print(f"  Total samples: {len(df)}")
        
        # Verify embeddings are present
        emb_cols = ['X', 'Y', 'E', 'T', 'Y_eq']
        present_cols = [col for col in emb_cols if col in df.columns and df[col].iloc[0] is not None]
        missing_cols = [col for col in emb_cols if col not in present_cols]
        
        if present_cols:
            print(f"\n✓ Embedding columns found: {present_cols}")
            if 'X' in present_cols:
                print(f"  Question embedding (X) shape: {np.array(df.iloc[0]['X']).shape}")
        if missing_cols:
            print(f"\n⚠ Missing embedding columns: {missing_cols}")
            print("  These embeddings may need to be generated.")
    else:
        print(f"\n⚠ Embedding files not found in data/mgsm/embs!")
        print(f"  Expected ID file: {id_path}")
        print(f"  Expected OOD file: {ood_path}")
        print(f"\nPlease ensure the files exist, or run the embedding generation cells first.")
        print(f"Current working directory: {Path.cwd()}")

print("=" * 80)


Loading Embeddings from data/mgsm/embs
Loading ID embeddings from: data/mgsm/embs/dataset_gpt_emb_ID.pkl
Loading OOD embeddings from: data/mgsm/embs/dataset_gpt_emb_OOD.pkl

✓ Successfully loaded embeddings from data/mgsm/embs!
  ID samples: 250
  OOD samples: 2500
  Total samples: 2750

✓ Embedding columns found: ['X', 'Y', 'E', 'T', 'Y_eq']
  Question embedding (X) shape: (1536,)


In [15]:
df_ood_loaded.columns

Index(['question', 'answer', 'language', 'Index_E', 'Index_T', 'answer_number',
       'question_formatted', 'answer_formatted', 'answer_eq', 'X', 'Y', 'Y_eq',
       'E', 'T'],
      dtype='object')

In [16]:
# Prepare data in CCL format
# Split into ID (English) and OOD (other languages)
df_id = df[df['language'] == 'en'].copy()
df_ood = df[df['language'] != 'en'].copy()

print(f"ID (English) samples: {len(df_id)}")
print(f"OOD (Other languages) samples: {len(df_ood)}")

# Ensure all required columns are present
required_cols = ['X', 'E', 'T', 'Y', 'Y_eq', 'Index_E', 'Index_T', 'answer', 'question']
for col in required_cols:
    if col not in df_id.columns:
        print(f"Warning: Missing column {col} in ID data")
    if col not in df_ood.columns:
        print(f"Warning: Missing column {col} in OOD data")

# Rename answer_number to answer for consistency (answer already exists, but ensure it's the numeric value)
df_id['answer'] = df_id['answer_number']
df_ood['answer'] = df_ood['answer_number']

# Convert embeddings to numpy arrays
for col in ['X', 'E', 'T', 'Y', 'Y_eq']:
    df_id[col] = df_id[col].apply(np.array)
    df_ood[col] = df_ood[col].apply(np.array)

print("\nData prepared in CCL format!")


ID (English) samples: 250
OOD (Other languages) samples: 2500

Data prepared in CCL format!


In [17]:
# Save data to expected CCL location
# Expected: data/mgsm/embs/dataset_{emb_model}_emb_ID.pkl and _OOD.pkl
output_dir = Path("data/mgsm/embs")
output_dir.mkdir(parents=True, exist_ok=True)

# Note: CCL expects 'gpt' as emb_model name in the filename
# Based on the code, it looks for: dataset_gpt_emb_ID.pkl
id_path = output_dir / "dataset_gpt_emb_ID.pkl"
ood_path = output_dir / "dataset_gpt_emb_OOD.pkl"

print(f"Saving ID data to: {id_path}")
with open(id_path, 'wb') as f:
    pickle.dump(df_id, f)

print(f"Saving OOD data to: {ood_path}")
with open(ood_path, 'wb') as f:
    pickle.dump(df_ood, f)

# Automatically create backup copies for safety
print(f"\n" + "=" * 80)
print("Creating Backup Copies")
print("=" * 80)

# Create backup directory
backup_dir = output_dir / "backup"
backup_dir.mkdir(parents=True, exist_ok=True)

# Create timestamped backup filenames
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
id_backup_path = backup_dir / f"dataset_gpt_emb_ID_backup_{timestamp}.pkl"
ood_backup_path = backup_dir / f"dataset_gpt_emb_OOD_backup_{timestamp}.pkl"

# Also create a "latest" backup (overwrites previous latest)
id_latest_backup = backup_dir / "dataset_gpt_emb_ID_backup_latest.pkl"
ood_latest_backup = backup_dir / "dataset_gpt_emb_OOD_backup_latest.pkl"

# Save timestamped backups
print(f"Creating timestamped backup: {id_backup_path.name}")
with open(id_backup_path, 'wb') as f:
    pickle.dump(df_id, f)

print(f"Creating timestamped backup: {ood_backup_path.name}")
with open(ood_backup_path, 'wb') as f:
    pickle.dump(df_ood, f)

# Save "latest" backups (for easy access to most recent backup)
print(f"\nCreating 'latest' backup: {id_latest_backup.name}")
with open(id_latest_backup, 'wb') as f:
    pickle.dump(df_id, f)

print(f"Creating 'latest' backup: {ood_latest_backup.name}")
with open(ood_latest_backup, 'wb') as f:
    pickle.dump(df_ood, f)

print(f"\n✓ Backup copies created successfully!")
print(f"  Backup directory: {backup_dir}")
print(f"  Timestamped backups: {timestamp}")
print(f"  Latest backups: latest (always points to most recent)")
print(f"\nTo restore from backup, copy files from {backup_dir} to {output_dir}")
print("=" * 80)

print("\nData saved successfully!")
print(f"ID samples: {len(df_id)}")
print(f"OOD samples: {len(df_ood)}")
print(f"\nYou can now use this data with CCL training scripts.")


Saving ID data to: data/mgsm/embs/dataset_gpt_emb_ID.pkl
Saving OOD data to: data/mgsm/embs/dataset_gpt_emb_OOD.pkl

Creating Backup Copies
Creating timestamped backup: dataset_gpt_emb_ID_backup_20260104_054813.pkl
Creating timestamped backup: dataset_gpt_emb_OOD_backup_20260104_054813.pkl

Creating 'latest' backup: dataset_gpt_emb_ID_backup_latest.pkl
Creating 'latest' backup: dataset_gpt_emb_OOD_backup_latest.pkl

✓ Backup copies created successfully!
  Backup directory: data/mgsm/embs/backup
  Timestamped backups: 20260104_054813
  Latest backups: latest (always points to most recent)

To restore from backup, copy files from data/mgsm/embs/backup to data/mgsm/embs

Data saved successfully!
ID samples: 250
OOD samples: 2500

You can now use this data with CCL training scripts.


## Next Steps

After running this notebook:

1. **Train CCL model**: Use the training script from `ccl/scripts/mgsm/run_gym_mgms.sh`
   ```bash
   cd ccl/scripts/mgsm
   bash run_gym_mgms.sh
   ```

2. **Generate prompts**: Use the prompt generation script
   ```bash
   cd icl/scripts
   bash prompt_gen_script_mgsm.sh <task> <icl_method> <n_shots>
   ```

3. **Run evaluation**: Use the ICL evaluation scripts

## Notes

- **Data Source**: The MGSM dataset is loaded from TSV files in `existing_work/url_nlp/mgsm/`
  - Each file contains 250 question-answer pairs (tab-separated)
  - Files are named `mgsm_{lang}.tsv` where `{lang}` is the language code
  - Reference: https://github.com/google-research/url-nlp/tree/main/mgsm
- **Embeddings**: Generated using OpenAI's `text-embedding-3-small` model
- **Language Split**: 
  - English (en) is treated as ID (In-Distribution)
  - All other 10 languages are treated as OOD (Out-of-Distribution)
- **Data Format**: Matches what CCL expects based on `ccl/utils/utils_data.py`
- **File Naming**: The saved files use `gpt` as the embedding model name (as expected by CCL code)
- **Dataset Size**: 250 problems × 11 languages = 2,750 total samples


## Mixing Embeddings and Decoding to Text

This section demonstrates how to:
1. Mix two embeddings (e.g., weighted average, concatenation)
2. Decode the mixed embedding back to text using nearest neighbor search

**Note**: OpenAI embeddings don't have a direct decoder. We use nearest neighbor search to find the closest text in the dataset.

## Load Previously Saved Embeddings

If you've already generated and saved embeddings, you can load them here instead of regenerating them.

In [18]:
# Load previously saved embeddings from pickle files
import pickle
from pathlib import Path

# Paths to saved embedding files
emb_dir = Path("data/mgsm/embs")
id_path = emb_dir / "dataset_gpt_emb_ID.pkl"
ood_path = emb_dir / "dataset_gpt_emb_OOD.pkl"

# Check if files exist
if id_path.exists() and ood_path.exists():
    print(f"Loading ID embeddings from: {id_path}")
    with open(id_path, 'rb') as f:
        df_id = pickle.load(f)
    
    print(f"Loading OOD embeddings from: {ood_path}")
    with open(ood_path, 'rb') as f:
        df_ood = pickle.load(f)
    
    # Combine for full dataset (optional, if you need both)
    df = pd.concat([df_id, df_ood], ignore_index=True)
    
    print(f"\n✓ Successfully loaded embeddings!")
    print(f"  ID samples: {len(df_id)}")
    print(f"  OOD samples: {len(df_ood)}")
    print(f"  Total samples: {len(df)}")
    print(f"\nAvailable columns: {list(df_id.columns)}")
    print(f"\nSample data from ID:")
    print(df_id[['question', 'answer', 'language', 'Index_E']].head())
    
    # Verify embeddings are present
    if 'X' in df_id.columns:
        print(f"\n✓ Question embeddings (X) loaded - shape: {np.array(df_id.iloc[0]['X']).shape}")
    if 'Y' in df_id.columns:
        print(f"✓ Answer embeddings (Y) loaded - shape: {np.array(df_id.iloc[0]['Y']).shape}")
    if 'E' in df_id.columns:
        print(f"✓ Environment embeddings (E) loaded - shape: {np.array(df_id.iloc[0]['E']).shape}")
    if 'T' in df_id.columns:
        print(f"✓ Task embeddings (T) loaded - shape: {np.array(df_id.iloc[0]['T']).shape}")
    
else:
    print(f"⚠ Warning: Embedding files not found!")
    print(f"  Expected ID file: {id_path}")
    print(f"  Expected OOD file: {ood_path}")
    print(f"\nPlease run the embedding generation cells above first, or check the file paths.")
    print(f"Current working directory: {Path.cwd()}")

Loading ID embeddings from: data/mgsm/embs/dataset_gpt_emb_ID.pkl
Loading OOD embeddings from: data/mgsm/embs/dataset_gpt_emb_OOD.pkl

✓ Successfully loaded embeddings!
  ID samples: 250
  OOD samples: 2500
  Total samples: 2750

Available columns: ['question', 'answer', 'language', 'Index_E', 'Index_T', 'answer_number', 'question_formatted', 'answer_formatted', 'answer_eq', 'X', 'Y', 'Y_eq', 'E', 'T']

Sample data from ID:
                                            question answer language  Index_E
0  Janet’s ducks lay 16 eggs per day. She eats th...     18       en        0
1  A robe takes 2 bolts of blue fiber and half th...      3       en        0
2  Josh decides to try flipping a house.  He buys...  70000       en        0
3  James decides to run 3 sprints 3 times a week....    540       en        0
4  Every day, Wendi feeds each of her chickens th...     20       en        0

✓ Question embeddings (X) loaded - shape: (1536,)
✓ Answer embeddings (Y) loaded - shape: (1536,)
✓ Env

In [19]:
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

def mix_embeddings(emb1, emb2, method='weighted_avg', weight1=0.5, weight2=0.5):
    """
    Mix two embeddings using different methods.
    
    Args:
        emb1: First embedding (numpy array)
        emb2: Second embedding (numpy array)
        method: 'weighted_avg', 'concat', or 'add'
        weight1: Weight for first embedding (for weighted_avg)
        weight2: Weight for second embedding (for weighted_avg)
    
    Returns:
        Mixed embedding (numpy array)
    """
    emb1 = np.array(emb1)
    emb2 = np.array(emb2)
    
    if method == 'weighted_avg':
        # Weighted average (normalized)
        mixed = weight1 * emb1 + weight2 * emb2
        # Normalize to unit length (optional, but often helpful)
        mixed = mixed / np.linalg.norm(mixed)
        return mixed
    elif method == 'concat':
        # Concatenation
        return np.concatenate([emb1, emb2])
    elif method == 'add':
        # Simple addition (then normalize)
        mixed = emb1 + emb2
        mixed = mixed / np.linalg.norm(mixed)
        return mixed
    else:
        raise ValueError(f"Unknown method: {method}")

def decode_embedding_to_text(mixed_emb, reference_embeddings, reference_texts, 
                             top_k=5, metric='cosine'):
    """
    Decode an embedding back to text by finding nearest neighbors.
    
    Args:
        mixed_emb: The mixed embedding to decode (numpy array)
        reference_embeddings: List/array of reference embeddings to search in
        reference_texts: List of corresponding texts
        top_k: Number of nearest neighbors to return
        metric: 'cosine' or 'euclidean'
    
    Returns:
        List of tuples: (text, similarity_score, index)
    """
    mixed_emb = np.array(mixed_emb).reshape(1, -1)
    ref_embs = np.array(reference_embeddings)
    
    if metric == 'cosine':
        # Cosine similarity
        similarities = cosine_similarity(mixed_emb, ref_embs)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = [(reference_texts[i], similarities[i], i) for i in top_indices]
    else:
        # Euclidean distance
        nbrs = NearestNeighbors(n_neighbors=top_k, metric='euclidean', n_jobs=-1)
        nbrs.fit(ref_embs)
        distances, indices = nbrs.kneighbors(mixed_emb)
        results = [(reference_texts[i], 1.0 / (1.0 + d), i) 
                   for d, i in zip(distances[0], indices[0])]
    
    return results

# Example: Mix a question embedding with an answer embedding
print("Example: Mixing Question and Answer Embeddings")
print("=" * 60)

# Get sample embeddings from the dataset
sample_idx = 0
question_emb = np.array(df_id.iloc[sample_idx]['X'])
answer_emb = np.array(df_id.iloc[sample_idx]['Y'])
question_text = df_id.iloc[sample_idx]['question']
answer_text = df_id.iloc[sample_idx]['answer']

print(f"\nOriginal Question: {question_text[:100]}...")
print(f"Original Answer: {answer_text}")

# Method 1: Weighted average (90% question, 10% answer)
mixed_emb_weighted = mix_embeddings(question_emb, answer_emb, 
                                     method='weighted_avg', 
                                     weight1=0.1, weight2=0.9)
print(f"\n1. Weighted Average (90% question, 10% answer):")
print(f"   Mixed embedding shape: {mixed_emb_weighted.shape}")

# Method 2: Simple addition
mixed_emb_add = mix_embeddings(question_emb, answer_emb, method='add')
print(f"\n2. Addition (then normalized):")
print(f"   Mixed embedding shape: {mixed_emb_add.shape}")

# Method 3: Concatenation
mixed_emb_concat = mix_embeddings(question_emb, answer_emb, method='concat')
print(f"\n3. Concatenation:")
print(f"   Mixed embedding shape: {mixed_emb_concat.shape}")

# Decode the mixed embedding back to text
print("\n" + "=" * 60)
print("Decoding Mixed Embedding to Text (using nearest neighbors)")
print("=" * 60)

# Use all question embeddings as reference
reference_embs = np.stack(df_id['X'].values)
reference_texts = df_id['question'].tolist()

# Decode the weighted average mixed embedding
decoded_results = decode_embedding_to_text(
    mixed_emb_weighted, 
    reference_embs, 
    reference_texts, 
    top_k=3,
    metric='cosine'
)

print(f"\nTop 3 nearest questions to the mixed embedding:")
for i, (text, similarity, idx) in enumerate(decoded_results, 1):
    print(f"\n{i}. Similarity: {similarity:.4f}")
    print(f"   Text: {text[:150]}...")
    print(f"   Original answer: {df_id.iloc[idx]['answer']}")

# Also try mixing question with environment (language) embedding
print("\n" + "=" * 60)
print("Example: Mixing Question with Language Embedding")
print("=" * 60)

lang_emb = np.array(df_id.iloc[sample_idx]['E'])
lang = df_id.iloc[sample_idx]['language']
mixed_emb_lang = mix_embeddings(question_emb, lang_emb, 
                                 method='weighted_avg', 
                                 weight1=0.8, weight2=0.2)

decoded_lang_results = decode_embedding_to_text(
    mixed_emb_lang,
    reference_embs,
    reference_texts,
    top_k=3,
    metric='cosine'
)

print(f"\nTop 3 nearest questions (mixing with {lang} language embedding):")
for i, (text, similarity, idx) in enumerate(decoded_lang_results, 1):
    print(f"\n{i}. Similarity: {similarity:.4f}")
    print(f"   Text: {text[:150]}...")



Example: Mixing Question and Answer Embeddings

Original Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
Original Answer: 18

1. Weighted Average (90% question, 10% answer):
   Mixed embedding shape: (1536,)

2. Addition (then normalized):
   Mixed embedding shape: (1536,)

3. Concatenation:
   Mixed embedding shape: (3072,)

Decoding Mixed Embedding to Text (using nearest neighbors)

Top 3 nearest questions to the mixed embedding:

1. Similarity: 0.4536
   Text: In a candy machine, there are 22 more than four times the number of pink gumballs as there are blue gumballs. If there are 12 blue gumballs how many p...
   Original answer: 70

2. Similarity: 0.4522
   Text: Dan plants 3 rose bushes. Each rose bush has 25 roses. Each rose has 8 thorns. How many thorns are there total?...
   Original answer: 600

3. Similarity: 0.4465
   Text: A raspberry bush has 6 clusters of 20 fruit each and 67 individual fruit scattered acr

## Decoder Models for Embedding-to-Text

While OpenAI embeddings don't have a direct decoder, we can use several approaches:

1. **LLM-based Decoding**: Use an LLM to generate text conditioned on the embedding (via similar texts)
2. **Retrieval-Augmented Generation (RAG)**: Retrieve similar texts using the embedding, then generate
3. **Hybrid Approach**: Combine nearest neighbor search with LLM generation

**Note**: Direct embedding-to-text decoders are still research-level and not widely available. The approaches below use LLMs that can work with semantic information.

In [20]:
# Advanced example: Mix multiple embeddings (similar to knn_mix in the codebase)
# This mimics the mixing strategy used in icl/utils/utils.py

def mix_multiple_embeddings(embeddings_dict, weights_dict):
    """
    Mix multiple embeddings with specified weights.
    
    Args:
        embeddings_dict: Dictionary of embedding names to numpy arrays
        weights_dict: Dictionary of embedding names to weights
    
    Returns:
        Mixed embedding (numpy array)
    """
    mixed = None
    total_weight = 0
    
    for name, emb in embeddings_dict.items():
        weight = weights_dict.get(name, 0.0)
        if weight > 0:
            emb_array = np.array(emb)
            if mixed is None:
                mixed = weight * emb_array
            else:
                mixed += weight * emb_array
            total_weight += weight
    
    if total_weight > 0:
        mixed = mixed / total_weight
        # Normalize to unit length
        mixed = mixed / np.linalg.norm(mixed)
    
    return mixed

# Example: Mix question (X), task (T), and environment (E) embeddings
print("Advanced Example: Mixing Multiple Embeddings")
print("=" * 60)

sample_idx = 0
embeddings_to_mix = {
    'X': np.array(df_id.iloc[sample_idx]['X']),  # Question
    'T': np.array(df_id.iloc[sample_idx]['T']),  # Task
    'E': np.array(df_id.iloc[sample_idx]['E']),  # Environment (language)
}

# Different mixing strategies
mixing_strategies = {
    'question_focused': {'X': 0.8, 'T': 0.15, 'E': 0.05},
    'task_focused': {'X': 0.4, 'T': 0.5, 'E': 0.1},
    'balanced': {'X': 0.5, 'T': 0.3, 'E': 0.2},
    'ccl_style': {'X': 0.9, 'T': 0.05, 'E': 0.05},  # Similar to knn_mix style
}

for strategy_name, weights in mixing_strategies.items():
    mixed_emb = mix_multiple_embeddings(embeddings_to_mix, weights)
    print(f"\n{strategy_name} (weights: {weights}):")
    print(f"  Mixed embedding shape: {mixed_emb.shape}")
    print(f"  Embedding norm: {np.linalg.norm(mixed_emb):.4f}")
    
    # Decode to text
    decoded = decode_embedding_to_text(
        mixed_emb,
        reference_embs,
        reference_texts,
        top_k=1,
        metric='cosine'
    )
    print(f"  Closest question: {decoded[0][0][:100]}...")
    print(f"  Similarity: {decoded[0][1]:.4f}")

print("\n" + "=" * 60)
print("Note: The quality of decoding depends on:")
print("  1. The mixing weights used")
print("  2. The size and diversity of the reference corpus")
print("  3. The semantic relationship between the mixed embeddings")
print("=" * 60)

Advanced Example: Mixing Multiple Embeddings

question_focused (weights: {'X': 0.8, 'T': 0.15, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.9825

task_focused (weights: {'X': 0.4, 'T': 0.5, 'E': 0.1}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.7173

balanced (weights: {'X': 0.5, 'T': 0.3, 'E': 0.2}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.8470

ccl_style (weights: {'X': 0.9, 'T': 0.05, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She 

In [21]:
# Example: Using LLM-based decoders
print("Example: Decoding Mixed Embeddings with LLM")
print("=" * 60)

# Get a sample mixed embedding
sample_idx = 0
question_emb = np.array(df_id.iloc[sample_idx]['X'])
answer_emb = np.array(df_id.iloc[sample_idx]['Y'])
mixed_emb = mix_embeddings(question_emb, answer_emb, 
                          method='weighted_avg', 
                          weight1=0.9, weight2=0.1)

print(f"\nOriginal Question: {df_id.iloc[sample_idx]['question'][:150]}...")
print(f"Original Answer: {df_id.iloc[sample_idx]['answer']}")

# Prepare reference data
reference_embs = np.stack(df_id['X'].values)
reference_texts = df_id['question'].tolist()

# Method 1: RAG-based decoding
print("\n" + "-" * 60)
print("Method 1: Retrieval-Augmented Generation (RAG)")
print("-" * 60)
try:
    decoded_rag = decode_with_llm_rag(
        mixed_emb, 
        reference_embs, 
        reference_texts,
        top_k=3,
        max_tokens=150
    )
    if decoded_rag:
        print(f"Generated text: {decoded_rag}")
except Exception as e:
    print(f"Error: {e}")
    print("(This requires OpenAI API access)")

# Method 2: Direct LLM decoding
print("\n" + "-" * 60)
print("Method 2: Direct LLM Generation")
print("-" * 60)
try:
    decoded_direct = decode_with_llm_direct(
        mixed_emb,
        reference_embs,
        reference_texts,
        max_tokens=150
    )
    if decoded_direct:
        print(f"Generated text: {decoded_direct}")
except Exception as e:
    print(f"Error: {e}")
    print("(This requires OpenAI API access)")

# Method 3: Hybrid approach
print("\n" + "-" * 60)
print("Method 3: Hybrid (Nearest Neighbor + LLM)")
print("-" * 60)
try:
    decoded_hybrid, nn_results = decode_hybrid(
        mixed_emb,
        reference_embs,
        reference_texts,
        top_k=3,
        max_tokens=150
    )
    if decoded_hybrid:
        print(f"Generated text: {decoded_hybrid}")
        print(f"\nNearest neighbors used:")
        for i, (text, sim, _) in enumerate(nn_results[:3], 1):
            print(f"  {i}. Similarity: {sim:.3f} - {text[:100]}...")
except Exception as e:
    print(f"Error: {e}")
    print("(This requires OpenAI API access)")

print("\n" + "=" * 60)
print("Note: These methods require OpenAI API access.")
print("For local models, you can use transformers library with models like:")
print("  - GPT-2, GPT-Neo, GPT-J (smaller, local)")
print("  - LLaMA, Mistral (larger, requires more resources)")
print("=" * 60)

Example: Decoding Mixed Embeddings with LLM

Original Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the rem...
Original Answer: 18

------------------------------------------------------------
Method 1: Retrieval-Augmented Generation (RAG)
------------------------------------------------------------
Error: name 'decode_with_llm_rag' is not defined
(This requires OpenAI API access)

------------------------------------------------------------
Method 2: Direct LLM Generation
------------------------------------------------------------
Error: name 'decode_with_llm_direct' is not defined
(This requires OpenAI API access)

------------------------------------------------------------
Method 3: Hybrid (Nearest Neighbor + LLM)
------------------------------------------------------------
Error: name 'decode_hybrid' is not defined
(This requires OpenAI API access)

Note: These methods requi

In [22]:
# Advanced example: Mix multiple embeddings (similar to knn_mix in the codebase)
# This mimics the mixing strategy used in icl/utils/utils.py

def mix_multiple_embeddings(embeddings_dict, weights_dict):
    """
    Mix multiple embeddings with specified weights.
    
    Args:
        embeddings_dict: Dictionary of embedding names to numpy arrays
        weights_dict: Dictionary of embedding names to weights
    
    Returns:
        Mixed embedding (numpy array)
    """
    mixed = None
    total_weight = 0
    
    for name, emb in embeddings_dict.items():
        weight = weights_dict.get(name, 0.0)
        if weight > 0:
            emb_array = np.array(emb)
            if mixed is None:
                mixed = weight * emb_array
            else:
                mixed += weight * emb_array
            total_weight += weight
    
    if total_weight > 0:
        mixed = mixed / total_weight
        # Normalize to unit length
        mixed = mixed / np.linalg.norm(mixed)
    
    return mixed

# Example: Mix question (X), task (T), and environment (E) embeddings
print("Advanced Example: Mixing Multiple Embeddings")
print("=" * 60)

sample_idx = 0
embeddings_to_mix = {
    'X': np.array(df_id.iloc[sample_idx]['X']),  # Question
    'T': np.array(df_id.iloc[sample_idx]['T']),  # Task
    'E': np.array(df_id.iloc[sample_idx]['E']),  # Environment (language)
}

# Different mixing strategies
mixing_strategies = {
    'question_focused': {'X': 0.8, 'T': 0.15, 'E': 0.05},
    'task_focused': {'X': 0.4, 'T': 0.5, 'E': 0.1},
    'balanced': {'X': 0.5, 'T': 0.3, 'E': 0.2},
    'ccl_style': {'X': 0.9, 'T': 0.05, 'E': 0.05},  # Similar to knn_mix style
}

for strategy_name, weights in mixing_strategies.items():
    mixed_emb = mix_multiple_embeddings(embeddings_to_mix, weights)
    print(f"\n{strategy_name} (weights: {weights}):")
    print(f"  Mixed embedding shape: {mixed_emb.shape}")
    print(f"  Embedding norm: {np.linalg.norm(mixed_emb):.4f}")
    
    # Decode to text
    decoded = decode_embedding_to_text(
        mixed_emb,
        reference_embs,
        reference_texts,
        top_k=1,
        metric='cosine'
    )
    print(f"  Closest question: {decoded[0][0][:100]}...")
    print(f"  Similarity: {decoded[0][1]:.4f}")

print("\n" + "=" * 60)
print("Note: The quality of decoding depends on:")
print("  1. The mixing weights used")
print("  2. The size and diversity of the reference corpus")
print("  3. The semantic relationship between the mixed embeddings")
print("=" * 60)

Advanced Example: Mixing Multiple Embeddings

question_focused (weights: {'X': 0.8, 'T': 0.15, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.9825

task_focused (weights: {'X': 0.4, 'T': 0.5, 'E': 0.1}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.7173

balanced (weights: {'X': 0.5, 'T': 0.3, 'E': 0.2}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.8470

ccl_style (weights: {'X': 0.9, 'T': 0.05, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She 

In [23]:
# Advanced example: Mix multiple embeddings (similar to knn_mix in the codebase)
# This mimics the mixing strategy used in icl/utils/utils.py

def mix_multiple_embeddings(embeddings_dict, weights_dict):
    """
    Mix multiple embeddings with specified weights.
    
    Args:
        embeddings_dict: Dictionary of embedding names to numpy arrays
        weights_dict: Dictionary of embedding names to weights
    
    Returns:
        Mixed embedding (numpy array)
    """
    mixed = None
    total_weight = 0
    
    for name, emb in embeddings_dict.items():
        weight = weights_dict.get(name, 0.0)
        if weight > 0:
            emb_array = np.array(emb)
            if mixed is None:
                mixed = weight * emb_array
            else:
                mixed += weight * emb_array
            total_weight += weight
    
    if total_weight > 0:
        mixed = mixed / total_weight
        # Normalize to unit length
        mixed = mixed / np.linalg.norm(mixed)
    
    return mixed

# Example: Mix question (X), task (T), and environment (E) embeddings
print("Advanced Example: Mixing Multiple Embeddings")
print("=" * 60)

sample_idx = 0
embeddings_to_mix = {
    'X': np.array(df_id.iloc[sample_idx]['X']),  # Question
    'T': np.array(df_id.iloc[sample_idx]['T']),  # Task
    'E': np.array(df_id.iloc[sample_idx]['E']),  # Environment (language)
}

# Different mixing strategies
mixing_strategies = {
    'question_focused': {'X': 0.8, 'T': 0.15, 'E': 0.05},
    'task_focused': {'X': 0.4, 'T': 0.5, 'E': 0.1},
    'balanced': {'X': 0.5, 'T': 0.3, 'E': 0.2},
    'ccl_style': {'X': 0.9, 'T': 0.05, 'E': 0.05},  # Similar to knn_mix style
}

for strategy_name, weights in mixing_strategies.items():
    mixed_emb = mix_multiple_embeddings(embeddings_to_mix, weights)
    print(f"\n{strategy_name} (weights: {weights}):")
    print(f"  Mixed embedding shape: {mixed_emb.shape}")
    print(f"  Embedding norm: {np.linalg.norm(mixed_emb):.4f}")
    
    # Decode to text
    decoded = decode_embedding_to_text(
        mixed_emb,
        reference_embs,
        reference_texts,
        top_k=1,
        metric='cosine'
    )
    print(f"  Closest question: {decoded[0][0][:100]}...")
    print(f"  Similarity: {decoded[0][1]:.4f}")

print("\n" + "=" * 60)
print("Note: The quality of decoding depends on:")
print("  1. The mixing weights used")
print("  2. The size and diversity of the reference corpus")
print("  3. The semantic relationship between the mixed embeddings")
print("=" * 60)

Advanced Example: Mixing Multiple Embeddings

question_focused (weights: {'X': 0.8, 'T': 0.15, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.9825

task_focused (weights: {'X': 0.4, 'T': 0.5, 'E': 0.1}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.7173

balanced (weights: {'X': 0.5, 'T': 0.3, 'E': 0.2}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Similarity: 0.8470

ccl_style (weights: {'X': 0.9, 'T': 0.05, 'E': 0.05}):
  Mixed embedding shape: (1536,)
  Embedding norm: 1.0000
  Closest question: Janet’s ducks lay 16 eggs per day. She 

## In-Context Learning: Using Similar Examples to Answer Target Queries

This section demonstrates how to use similar examples for in-context learning, similar to how CCL works.

The process:
1. **Find Similar Examples**: Use embeddings to find the most similar examples from ID data
2. **Construct Prompt**: Format examples as "Question: ...\nAnswer: ...\n\n" 
3. **Add Target Query**: Append the target question to the prompt
4. **Answer**: Use the prompt with an LLM to get the answer

This mimics the approach used in `icl/prompt_generation.py` and `icl/utils/utils.py`.


In [24]:
def find_similar_examples(target_emb, demo_embs, demo_data, n_examples=3, metric='cosine'):
    """
    Find similar examples for in-context learning.
    
    Args:
        target_emb: Embedding of the target query (numpy array)
        demo_embs: Embeddings of demo examples (numpy array, shape: [n_demos, emb_dim])
        demo_data: DataFrame with demo examples (must have 'question' and 'answer' columns)
        n_examples: Number of similar examples to retrieve
        metric: 'cosine' or 'euclidean'
    
    Returns:
        List of tuples: (question, answer, similarity_score, index)
    """
    target_emb = np.array(target_emb).reshape(1, -1)
    demo_embs = np.array(demo_embs)
    
    if metric == 'cosine':
        # Cosine similarity
        similarities = cosine_similarity(target_emb, demo_embs)[0]
        top_indices = np.argsort(similarities)[::-1][:n_examples]
        results = [
            (
                demo_data.iloc[i]['question'],
                demo_data.iloc[i]['answer'],
                similarities[i],
                i
            )
            for i in top_indices
        ]
    else:
        # Euclidean distance
        nbrs = NearestNeighbors(n_neighbors=n_examples, metric='euclidean', n_jobs=-1)
        nbrs.fit(demo_embs)
        distances, indices = nbrs.kneighbors(target_emb)
        results = [
            (
                demo_data.iloc[i]['question'],
                demo_data.iloc[i]['answer'],
                1.0 / (1.0 + d),
                i
            )
            for d, i in zip(distances[0], indices[0])
        ]
    
    return results

def construct_icl_prompt(similar_examples, target_question, template_x="Question: {}\n", 
                        template_y="Answer: {}\n\n", template_answer="Answer: "):
    """
    Construct an in-context learning prompt with similar examples.
    
    Args:
        similar_examples: List of (question, answer, similarity, index) tuples
        target_question: The target question to answer
        template_x: Template for questions
        template_y: Template for answers
        template_answer: Template for the answer prompt
    
    Returns:
        Formatted prompt string
    """
    prompt = ""
    
    # Add similar examples
    for question, answer, similarity, idx in similar_examples:
        prompt += template_x.format(question)
        prompt += template_y.format(answer)
    
    # Add target question
    prompt += template_x.format(target_question)
    prompt += template_answer
    
    return prompt

# Example: Find similar examples for a target query
print("=" * 80)
print("In-Context Learning Example: Using Similar Examples")
print("=" * 80)

# Select a target query from OOD data (or ID data)
target_idx = 0  # First example from OOD
target_question = df_ood.iloc[target_idx]['question']
target_answer = df_ood.iloc[target_idx]['answer']
target_emb = np.array(df_ood.iloc[target_idx]['X'])

print(f"\nTarget Query:")
print(f"Question: {target_question}")
print(f"True Answer: {target_answer}")

# Find similar examples from ID data
print(f"\nFinding similar examples from ID data...")
demo_embs = np.stack(df_id['X'].values)
similar_examples = find_similar_examples(
    target_emb, 
    demo_embs, 
    df_id, 
    n_examples=3,
    metric='cosine'
)

print(f"\nTop {len(similar_examples)} Similar Examples:")
print("-" * 80)
for i, (question, answer, similarity, idx) in enumerate(similar_examples, 1):
    print(f"\nExample {i} (Similarity: {similarity:.4f}):")
    print(f"  Question: {question[:100]}...")
    print(f"  Answer: {answer}")

# Construct the ICL prompt
prompt = construct_icl_prompt(similar_examples, target_question)

print("\n" + "=" * 80)
print("Constructed In-Context Learning Prompt:")
print("=" * 80)
print(prompt)
print("=" * 80)


In-Context Learning Example: Using Similar Examples

Target Query:
Question: Los patos de Janet ponen 16 huevos por día. Ella come tres en el desayuno todas las mañanas y usa cuatro para hornear magdalenas para sus amigos todos los días. Vende lo que sobra en el mercado de productores diariamente a $2 el huevo fresco de pato. ¿Cuánto gana en dólares todos los días en el mercado de productores?
True Answer: 18

Finding similar examples from ID data...

Top 3 Similar Examples:
--------------------------------------------------------------------------------

Example 1 (Similarity: 0.7552):
  Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for ...
  Answer: 18

Example 2 (Similarity: 0.5582):
  Question: Lloyd has an egg farm. His chickens produce 252 eggs per day and he sells them for $2 per dozen. How...
  Answer: 294

Example 3 (Similarity: 0.4697):
  Question: Claire makes a 3 egg omelet every morning for breakfast.  How many do

## Using CCL-style Similarity (Causal Representations)

Instead of using raw question embeddings (X), CCL uses causal representations (C_hat) 
to find similar examples. This helps find examples that are similar at the problem level,
not just the surface level.

Let's demonstrate this if we have C_hat embeddings available:


In [26]:
# Compare different embedding types for finding similar examples
# This requires CCL model outputs (C_hat, S_hat) which are generated during training

print("=" * 80)
print("Comparing Different Embedding Types for Similarity Search")
print("=" * 80)

# Method 1: Using raw question embeddings (X) - Standard ICL
print("\n1. Using Raw Question Embeddings (X) - Standard ICL:")
demo_embs_x = np.stack(df_id['X'].values)
target_emb_x = np.array(df_ood.iloc[target_idx]['X'])
similar_x = find_similar_examples(target_emb_x, demo_embs_x, df_id, n_examples=3, metric='cosine')
for i, (q, a, sim, idx) in enumerate(similar_x, 1):
    print(f"   {i}. Similarity: {sim:.4f} - Answer: {a}")

# Note: To use C_hat or S_hat embeddings, you would need to:
# 1. Train a CCL model (using run_ccl.py)
# 2. Run inference to get C_hat and S_hat embeddings
# 3. Load the results and use them here

# Example code structure (commented out since we don't have C_hat yet):
"""
# Method 2: Using causal representations (C_hat) - CCL method
# results_id = pd.read_feather(f"./results/mgsm/results_ccl_gpt_emb_000_ID.feather")
# demo_embs_c = np.stack(results_id['C_hat'].values)
# target_emb_c = np.array(results_ood.iloc[target_idx]['C_hat'])
# similar_c = find_similar_examples(target_emb_c, demo_embs_c, df_id, n_examples=3, metric='cosine')

# Method 3: Using knn_mix (mixing C and S embeddings) - CCL variant
# demo_embs_c = np.stack(results_id['C_hat'].values)
# demo_embs_s = np.stack(results_id['S_hat'].values)
# target_emb_c = np.array(results_ood.iloc[target_idx]['C_hat'])
# target_emb_s = np.array(results_ood.iloc[target_idx]['S_hat'])
# 
# # Mix embeddings: 0.9 * C + 0.1 * S
# demo_embs_mix = 0.9 * demo_embs_c + 0.1 * demo_embs_s
# target_emb_mix = 0.9 * target_emb_c + 0.1 * target_emb_s
# similar_mix = find_similar_examples(target_emb_mix, demo_embs_mix, df_id, n_examples=3, metric='cosine')
"""

print("\n" + "=" * 80)
print("Note: To use CCL-style similarity (C_hat embeddings), you need to:")
print("  1. Train a CCL model: python ccl/run_ccl.py --exp_name mgsm ...")
print("  2. Run inference: python ccl/run_ccl.py --deploy --exp_name mgsm ...")
print("  3. Load the results and use C_hat embeddings for similarity search")
print("=" * 80)


Comparing Different Embedding Types for Similarity Search

1. Using Raw Question Embeddings (X) - Standard ICL:
   1. Similarity: 0.7552 - Answer: 18
   2. Similarity: 0.5582 - Answer: 294
   3. Similarity: 0.4697 - Answer: 7

Note: To use CCL-style similarity (C_hat embeddings), you need to:
  1. Train a CCL model: python ccl/run_ccl.py --exp_name mgsm ...
  2. Run inference: python ccl/run_ccl.py --deploy --exp_name mgsm ...
  3. Load the results and use C_hat embeddings for similarity search


## Perform In-Context Learning with API

Now we'll use the OpenAI API to actually answer the target queries using the constructed prompts with similar examples.


In [27]:
def query_llm_with_prompt(prompt, model="gpt-3.5-turbo", temperature=0.0, max_tokens=50):
    """
    Query an LLM with a prompt to get an answer.
    
    Args:
        prompt: The prompt string (with examples and target question)
        model: OpenAI model to use (gpt-3.5-turbo, gpt-4, etc.)
        temperature: Sampling temperature (0.0 for deterministic)
        max_tokens: Maximum tokens in response
    
    Returns:
        Generated answer text
    """
    if client is None:
        raise ValueError("OpenAI API client not initialized. Set OPENAI_API_KEY environment variable.")
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error querying LLM: {e}")
        return None

def extract_answer_from_response(response_text):
    """
    Extract the numeric answer from LLM response.
    For MGSM, answers are numeric values.
    """
    import re
    # Try to find numbers in the response
    numbers = re.findall(r'\d+', response_text)
    if numbers:
        # Return the first number found (usually the answer)
        return numbers[0]
    return response_text.strip()

# Example: Perform in-context learning for a target query
print("=" * 80)
print("In-Context Learning: Querying LLM with API")
print("=" * 80)

# Use the same target query from before
target_idx = 0
target_question = df_ood.iloc[target_idx]['question']
target_answer = df_ood.iloc[target_idx]['answer']
target_emb = np.array(df_ood.iloc[target_idx]['X'])

print(f"\nTarget Query:")
print(f"Question: {target_question}")
print(f"True Answer: {target_answer}")

# Find similar examples (reuse from previous cell)
demo_embs = np.stack(df_id['X'].values)
similar_examples = find_similar_examples(
    target_emb, 
    demo_embs, 
    df_id, 
    n_examples=3,
    metric='cosine'
)

print(f"\nUsing {len(similar_examples)} similar examples for in-context learning")

# Construct the prompt
prompt = construct_icl_prompt(similar_examples, target_question)

print(f"\n" + "=" * 80)
print("Querying LLM with constructed prompt...")
print("=" * 80)

# Query the LLM
if client is not None:
    print(f"Using model: gpt-3.5-turbo")
    llm_response = query_llm_with_prompt(prompt, model="gpt-3.5-turbo", temperature=0.0, max_tokens=50)
    
    if llm_response:
        extracted_answer = extract_answer_from_response(llm_response)
        
        print(f"\nLLM Response: {llm_response}")
        print(f"Extracted Answer: {extracted_answer}")
        print(f"True Answer: {target_answer}")
        
        # Check if correct
        is_correct = str(extracted_answer) == str(target_answer)
        print(f"\n{'✓ CORRECT!' if is_correct else '✗ INCORRECT'}")
        if not is_correct:
            print(f"  Expected: {target_answer}, Got: {extracted_answer}")
    else:
        print("Failed to get response from LLM")
else:
    print("⚠ OpenAI API client not initialized!")
    print("Set OPENAI_API_KEY environment variable and re-run Cell 1")


In-Context Learning: Querying LLM with API

Target Query:
Question: Los patos de Janet ponen 16 huevos por día. Ella come tres en el desayuno todas las mañanas y usa cuatro para hornear magdalenas para sus amigos todos los días. Vende lo que sobra en el mercado de productores diariamente a $2 el huevo fresco de pato. ¿Cuánto gana en dólares todos los días en el mercado de productores?
True Answer: 18

Using 3 similar examples for in-context learning

Querying LLM with constructed prompt...
Using model: gpt-3.5-turbo
Error querying LLM: Error code: 401 - {'error': {'message': 'Incorrect API key provided: xx. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
Failed to get response from LLM


## Batch Evaluation: Test Multiple Queries

Let's evaluate in-context learning on multiple target queries to see the performance:


In [None]:
def evaluate_icl_batch(target_queries, target_embs, target_answers, demo_embs, demo_data,
                       n_shots=3, model="gpt-3.5-turbo", n_samples=None):
    """
    Evaluate in-context learning on a batch of queries.
    
    Args:
        target_queries: List of target question strings
        target_embs: Array of target embeddings
        target_answers: List of true answers
        demo_embs: Array of demo embeddings
        demo_data: DataFrame with demo examples
        n_shots: Number of examples to include
        model: LLM model to use
        n_samples: Number of samples to evaluate (None = all)
    
    Returns:
        Dictionary with results
    """
    if client is None:
        print("⚠ OpenAI API client not initialized!")
        return None
    
    if n_samples is not None:
        indices = np.random.choice(len(target_queries), min(n_samples, len(target_queries)), replace=False)
        target_queries = [target_queries[i] for i in indices]
        target_embs = target_embs[indices]
        target_answers = [target_answers[i] for i in indices]
    
    results = {
        'correct': 0,
        'incorrect': 0,
        'failed': 0,
        'predictions': [],
        'true_answers': [],
        'questions': [],
        'similar_examples': []  # Store similar examples for each query
    }
    
    print(f"Evaluating {len(target_queries)} queries with {n_shots}-shot in-context learning...")
    print(f"Model: {model}")
    print("=" * 80)
    
    for i, (query, emb, true_answer) in enumerate(tqdm(zip(target_queries, target_embs, target_answers), 
                                                         total=len(target_queries), 
                                                         desc="Processing queries")):
        # Find similar examples - get top 5 for display, but use n_shots for prompt
        similar_examples_all = find_similar_examples(
            emb, demo_embs, demo_data, n_examples=5, metric='cosine'
        )
        
        # Use only n_shots for the actual prompt
        similar_examples_for_prompt = similar_examples_all[:n_shots]
        
        # Store all similar examples (top 5) for later display
        results['similar_examples'].append(similar_examples_all)
        
        # Construct prompt using n_shots examples
        prompt = construct_icl_prompt(similar_examples_for_prompt, query)
        
        # Query LLM
        try:
            llm_response = query_llm_with_prompt(prompt, model=model, temperature=0.0, max_tokens=50)
            
            if llm_response:
                extracted_answer = extract_answer_from_response(llm_response)
                is_correct = str(extracted_answer) == str(true_answer)
                
                if is_correct:
                    results['correct'] += 1
                else:
                    results['incorrect'] += 1
                
                results['predictions'].append(extracted_answer)
                results['true_answers'].append(true_answer)
                results['questions'].append(query)
            else:
                results['failed'] += 1
                results['predictions'].append(None)
                results['true_answers'].append(true_answer)
                results['questions'].append(query)
                # similar_examples already stored above
                
        except Exception as e:
            print(f"\nError processing query {i}: {e}")
            results['failed'] += 1
            results['predictions'].append(None)
            results['true_answers'].append(true_answer)
            results['questions'].append(query)
            # Store empty list if error occurred before finding examples
            if i >= len(results['similar_examples']):
                results['similar_examples'].append([])
        
        # Small delay to respect rate limits
        time.sleep(0.1)
    
    # Calculate accuracy
    total = results['correct'] + results['incorrect'] + results['failed']
    accuracy = results['correct'] / total if total > 0 else 0
    
    return results, accuracy

# Run evaluation on a sample of OOD queries
print("=" * 80)
print("Batch Evaluation: In-Context Learning Performance")
print("=" * 80)

# Select a sample of target queries from OOD data
n_eval_samples = 10  # Evaluate on 10 samples (adjust as needed)
eval_indices = np.random.choice(len(df_ood), min(n_eval_samples, len(df_ood)), replace=False)

target_queries_eval = [df_ood.iloc[i]['question'] for i in eval_indices]
target_answers_eval = [df_ood.iloc[i]['answer'] for i in eval_indices]
target_embs_eval = np.stack([np.array(df_ood.iloc[i]['X']) for i in eval_indices])

print(f"\nEvaluating on {len(target_queries_eval)} OOD queries...")
print(f"Using {len(df_id)} ID examples for similarity search")
print(f"Number of shots: 3")

if client is not None:
    results, accuracy = evaluate_icl_batch(
        target_queries_eval,
        target_embs_eval,
        target_answers_eval,
        demo_embs,
        df_id,
        n_shots=3,
        model="gpt-3.5-turbo"
    )
    
    if results:
        print("\n" + "=" * 80)
        print("Evaluation Results")
        print("=" * 80)
        print(f"Total queries: {len(target_queries_eval)}")
        print(f"Correct: {results['correct']}")
        print(f"Incorrect: {results['incorrect']}")
        print(f"Failed: {results['failed']}")
        print(f"\nAccuracy: {accuracy*100:.2f}%")
        print("=" * 80)
        
        # Show only incorrect queries
        incorrect_indices = [i for i in range(len(results['questions'])) 
                            if results['predictions'][i] is not None and 
                            str(results['predictions'][i]) != str(results['true_answers'][i])]
        
        if incorrect_indices:
            print(f"\nIncorrect Queries ({len(incorrect_indices)}) with In-Context Learning Examples:")
            print("=" * 80)
            for idx, i in enumerate(incorrect_indices, 1):
                print(f"\nIncorrect Query {idx} (Index {i}):")
                print(f"  Target Question: {results['questions'][i]}")
                print(f"  True Answer: {results['true_answers'][i]}")
                print(f"  Predicted: {results['predictions'][i]}")
                print(f"  ✗ INCORRECT")
                
                # Show the examples used for in-context learning (top 5, sorted by similarity)
                if i < len(results['similar_examples']) and results['similar_examples'][i]:
                    print(f"\n  Examples Used for In-Context Learning (Top 5, sorted by similarity):")
                    print(f"  {'-' * 76}")
                    # Sort by similarity (descending) - they should already be sorted, but ensure it
                    sorted_examples = sorted(results['similar_examples'][i], key=lambda x: x[2], reverse=True)
                    for ex_idx, (ex_question, ex_answer, similarity, ex_orig_idx) in enumerate(sorted_examples[:5], 1):
                        print(f"    {ex_idx}. Similarity: {similarity:.4f}")
                        print(f"       Question: {ex_question[:120]}...")
                        print(f"       Answer: {ex_answer}")
                
                print("-" * 80)
        else:
            print(f"\n✓ All queries were answered correctly!")
            print("  No incorrect queries to display.")
else:
    print("\n⚠ OpenAI API client not initialized!")
    print("Set OPENAI_API_KEY environment variable and re-run Cell 1")


## Summary: In-Context Learning Workflow

The complete in-context learning workflow:

1. **Load Embeddings** (Cell 17): Load saved embeddings from disk (no API needed)
2. **Find Similar Examples** (Cell 24): Use embeddings to find similar examples from ID data
3. **Construct Prompts** (Cell 24): Format examples + target query into a prompt
4. **Query LLM** (Cell 29): Use OpenAI API to get answers from the LLM
5. **Evaluate** (Cell 31): Test on multiple queries and calculate accuracy

**Key Functions:**
- `find_similar_examples()`: Finds similar examples using cosine similarity
- `construct_icl_prompt()`: Builds formatted prompts
- `query_llm_with_prompt()`: Queries OpenAI API with the prompt
- `evaluate_icl_batch()`: Evaluates on multiple queries
