# Polynomial Decomposition with Transformers

This notebook demonstrates the use of transformer models to recognize and decompose polynomial substructures. The research explores how transformers can learn to identify polynomial-in-polynomial substitutions and reverse the expansion process to recover the original structure.

## Problem Overview

Given an expanded polynomial, can a transformer model learn to identify that it was formed by substituting one polynomial into another? This is the core challenge we address through supervised learning, evaluation, and reinforcement learning fine-tuning.

**Example Problem:**
- Given: `-2816*a^4 -7040*a^3 -7168*a^2 -3460*a -692` (expanded polynomial)
- Find: Outer polynomial `P(b)` and inner polynomial `Q(a)` such that `P(Q(a)) = -2816*a^4 -7040*a^3 -7168*a^2 -3460*a -692`
- (Answer: `Q(a) = -16*a^2 -20*a -7` and `P(b) = -11*b^2 +19*b -20` )
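
The example above can be checked by hand or with a few lines of plain Python. The sketch below is illustrative only and is not the repository's generation code: polynomials are stored as coefficient lists in ascending order, and the helper names `poly_mul` and `compose` are introduced here for the illustration.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as ascending coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def compose(outer, inner):
    """Return the coefficients of outer(inner(a)) using Horner's rule."""
    result = [outer[-1]]
    for coeff in reversed(outer[:-1]):
        result = poly_mul(result, inner)
        result[0] += coeff
    return result


inner = [-7, -20, -16]   # Q(a) = -16*a^2 - 20*a - 7
outer = [-20, 19, -11]   # P(b) = -11*b^2 + 19*b - 20
expanded = compose(outer, inner)
print(expanded)  # [-692, -3460, -7168, -7040, -2816]
```

Read from the highest power down, the printed list is the expanded polynomial `-2816*a^4 -7040*a^3 -7168*a^2 -3460*a -692`.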

## Notebook Structure

This notebook is organized into four main sections:

### 1. Data Generation

**Objective:** Generate synthetic training data for polynomial decomposition tasks.

**What we do:**
- Generate outer polynomials with degrees 2, 3, or 4 and coefficients in range [-20, 20]
- Generate inner polynomials under the same constraints (degree 2, 3, or 4; coefficients in range [-20, 20])
- Perform polynomial substitution: substitute inner polynomial into outer polynomial
- Expand the result to create the target polynomial
- Convert all polynomials to prefix notation tokens for transformer input
- Create training examples in format: `expanded_polynomial ⁇ inner_polynomial`
  - Note: For single variable polynomial decomposition, once we know the inner polynomial, we can uniquely determine the outer polynomial through polynomial division. Therefore, our model only needs to predict the inner polynomial.
  - For multi-variable cases, we would need the full format: `expanded_polynomial ⁇ outer_polynomial & inner_polynomial` as the model needs to predict both polynomials.

**Key Features:**
- Prefix notation tokenization (e.g., `+ * P 2 ^ a P 3 P 4` for `2*a^3 + 4`)
- Systematic generation across different degree combinations
- Validation of generated expressions

### 2. Supervised Learning

**Objective:** Train a transformer model to learn polynomial decomposition through supervised learning.

**What we do:**
- Implement a GPT-style transformer architecture optimized for mathematical expressions
- Train the model on generated polynomial decomposition data
- Use teacher forcing during training with the target decomposition as supervision
- Implement custom tokenization for mathematical symbols and operations
- Track training metrics and convergence

**Model Architecture:**
- Multi-head self-attention layers
- Causal masking for autoregressive generation
- Custom vocabulary for mathematical expressions
- Configurable depth, attention heads, and embedding dimensions

### 3. Evaluation and Beam Search Scaling

**Objective:** Evaluate model performance and analyze the effectiveness of beam search.

**What we do:**
- Test the trained model on held-out polynomial decomposition problems
- Implement beam search for generating multiple candidate solutions
- Compare greedy decoding vs. beam search performance
- Analyze how beam width affects solution quality and computational cost
- Validate generated decompositions by symbolic expansion and comparison
- Measure accuracy across different polynomial degrees and complexity levels

**Evaluation Metrics:**
- Exact match accuracy (symbolic equivalence)
- Beam search hit rate (any beam contains correct answer)
- Computational efficiency analysis

### 4. Rank-Aware Beam GRPO Fine-tuning

**Objective:** Further improve model performance using reinforcement learning with rank-aware beam search.

**What we do:**
- Implement Group Relative Policy Optimization (GRPO) for fine-tuning
- Use rank-aware rewards that consider the position of correct solutions in beam search
- Fine-tune the pre-trained supervised model using RL to improve beam search effectiveness
- Compare performance before and after RL fine-tuning
- Analyze how RL training affects the distribution of correct solutions in beam outputs

**Key Innovations:**
- Rank-aware reward function that incentivizes correct solutions appearing early in beam search
- Integration of symbolic validation into the RL reward signal
- Stable training procedures for mathematical reasoning tasks
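
To make the idea of a rank-aware reward combined with group-relative advantages concrete, here is a minimal sketch. The reward shape (`1/rank` for a correct candidate, `0` otherwise) and both function names are assumptions made for illustration; the reward actually used in Section 4 may differ. The advantage normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO formulation.

```python
from statistics import mean, pstdev


def rank_aware_rewards(correct_by_rank):
    """Hypothetical reward: 1/rank for a correct candidate, 0 for an incorrect one.

    correct_by_rank[i] says whether the candidate at beam rank i+1 is a valid
    decomposition.
    """
    return [1.0 / rank if ok else 0.0
            for rank, ok in enumerate(correct_by_rank, start=1)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (here: the beam of a single problem)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = rank_aware_rewards([False, True, False, True])
advantages = group_relative_advantages(rewards)
print(rewards)  # [0.0, 0.5, 0.0, 0.25]
```

Under this assumed reward, a correct candidate at rank 2 receives a larger advantage than a correct candidate at rank 4, and incorrect candidates receive negative advantages.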

# 0. Setup

## 0.1. Environment Setup

**Prerequisites:**
Before running this notebook, you must set up the environment in the repository:

1. **Clone the repository** (if not already done)
2. **Run setup script** from the repository root:
   ```bash
   # Option 1: Using uv (faster, recommended)
   ./setup_with_uv.sh
   
   # Option 2: Using pip (standard)
   ./setup.sh
   ```
3. **Select the virtual environment as your Jupyter kernel**:
   - The setup creates a `.venv` directory in the repository root
   - In Jupyter, select the kernel: `.venv/bin/python`
   - Or install the kernel: `.venv/bin/python -m ipykernel install --user --name polynomial-decomp`

**Note:** This notebook assumes you're running it with the `.venv` kernel from the repository. All dependencies should already be installed.

In [12]:
# =============================================================================
# SETUP: Import modules and configure paths
# =============================================================================
# This cell sets up the Python environment for the notebook.
# It assumes you have already run setup.sh or setup_with_uv.sh in the repository.

import os
import sys
from pathlib import Path

# Ensure we're in the repository root
repo_root = Path.cwd()
if not (repo_root / 'requirements.txt').exists():
    print("⚠️ Warning: Not in repository root. Attempting to find it...")
    for parent in repo_root.parents:
        if (parent / 'requirements.txt').exists():
            os.chdir(parent)
            repo_root = parent
            print(f"✅ Changed to repository root: {repo_root}")
            break
    else:
        raise RuntimeError("❌ Could not find repository root. Please run this notebook from the PolynomialDecomposition directory.")

# Add project paths to Python path for imports
sys.path.insert(0, str(repo_root / 'Training' / 'mingpt'))
sys.path.insert(0, str(repo_root / 'Data_Generation' / 'Using_Sympy'))

# Disable wandb to avoid authentication issues
os.environ['WANDB_DISABLED'] = 'true'
os.environ['WANDB_MODE'] = 'disabled'
os.environ['WANDB_SILENT'] = 'true'

print("✅ Environment configured successfully!")
print(f"📁 Repository root: {repo_root}")
print(f"🐍 Python version: {sys.version.split()[0]}")
print("📊 Wandb logging: Disabled")

✅ Environment configured successfully!
📁 Repository root: /workspace/PolynomialDecomposition
🐍 Python version: 3.11.11
📊 Wandb logging: Disabled


## 0.2. Import Required Modules

This cell ensures all the necessary functions and classes are properly imported and available for use throughout the notebook.

In [13]:
# Import all required modules and functions
import importlib

# Import core modules
try:
    from model import GPTConfig, GPT
    from dataset import SymbolicDataset
    from trainer import Trainer, TrainerConfig
    from using_sympy import (
        generate_dataset_line, 
        polynomial_to_prefix_tokens, 
        parse_prefix_to_sympy, 
        generate_all_datasets_parallel,
        generate_multivariate_dataset_line,
        generate_multivariate_datasets_parallel
    )
    print("✅ All required modules imported successfully!")
    print("✅ Single-variable functions loaded!")
    print("✅ Multi-variable functions loaded!")
    print("🎯 Ready to use polynomial decomposition functions!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please ensure you've run the setup script and are using the .venv kernel")

✅ All required modules imported successfully!
✅ Single-variable functions loaded!
✅ Multi-variable functions loaded!
🎯 Ready to use polynomial decomposition functions!


## 0.3. Ready to Go!

The setup is now complete! Your environment is ready with:

- **Repository**: Cloned/updated with latest code
- **Virtual Environment**: Isolated Python environment in `.venv/`
- **Dependencies**: All packages installed by the setup script (uv or pip)
- **Modules**: All required classes imported and ready to use
- **Configuration**: wandb disabled, paths configured

You can now proceed with the data generation, training, and evaluation sections below.

In [14]:
# Verify setup is working correctly
import sys
from pathlib import Path

print("🔍 Verifying setup...")
print(f"📁 Current directory: {Path.cwd()}")
print(f"🐍 Python executable: {sys.executable}")

# Test that we can import the main modules
try:
    from model import GPT
    from dataset import SymbolicDataset
    from using_sympy import generate_multivariate_dataset_line
    print("✅ Core modules imported successfully")
    print("✅ Multi-variable generation available")
except ImportError as e:
    print(f"❌ Import verification failed: {e}")

print("\n🚀 Everything is ready! You can now run the rest of the notebook.")

🔍 Verifying setup...
📁 Current directory: /workspace/PolynomialDecomposition
🐍 Python executable: /workspace/PolynomialDecomposition/.venv/bin/python
✅ Core modules imported successfully
✅ Multi-variable generation available

🚀 Everything is ready! You can now run the rest of the notebook.


# 1. Data Generation

## Overview
This section demonstrates how to generate single variable polynomial datasets using existing functions. We create training data for polynomial decomposition tasks where the model learns to reverse polynomial expansion.

## Dataset Structure
We generate three types of datasets:
- **1 Training Dataset**: Random degree combinations (outer: [2,3,4], inner: [2,3,4]) - 300,000 samples
- **9 Test Datasets**: Each corresponds to a specific degree combination (2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4) - 3,000 samples each
- **1 Validation Dataset**: Random degree combinations - 128 samples

## Key Parameters
- **Polynomial Degrees**: Both inner and outer polynomials have degrees randomly chosen from [2, 3, 4]
- **Coefficient Range**: All coefficients are integers in the range [-20, 20]
- **Output Format**: `(expanded_polynomial) ⁇ (inner_polynomial)`
  - Since this is the single variable case, knowing the inner polynomial uniquely determines the outer polynomial through polynomial division
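
The claim that the inner polynomial determines the outer one can be illustrated with repeated polynomial division: dividing the expanded polynomial by the inner polynomial leaves the constant coefficient of the outer polynomial as remainder, and the quotient is processed the same way. The sketch below is a standalone illustration in plain Python, not the repository's `is_valid_expression_sympy_single`; the names `poly_divmod` and `recover_outer` are introduced here, and polynomials are ascending coefficient lists.

```python
from fractions import Fraction


def poly_divmod(num, den):
    """Divide ascending coefficient lists; return (quotient, remainder)."""
    num = [Fraction(c) for c in num]
    den = [Fraction(c) for c in den]
    quot = [Fraction(0)] * max(len(num) - len(den) + 1, 1)
    while len(num) >= len(den):
        shift = len(num) - len(den)
        factor = num[-1] / den[-1]
        quot[shift] = factor
        for i, c in enumerate(den):
            num[shift + i] -= factor * c
        num.pop()  # the leading coefficient is now zero
    return quot, num


def recover_outer(expanded, inner):
    """Return outer coefficients (ascending), or None if no decomposition exists."""
    outer = []
    current = list(expanded)
    while any(current):
        current, rem = poly_divmod(current, inner)
        if any(rem[1:]):  # remainder still depends on the variable
            return None
        outer.append(rem[0] if rem else Fraction(0))
    return outer


expanded = [-692, -3460, -7168, -7040, -2816]
inner = [-7, -20, -16]
outer = recover_outer(expanded, inner)
print([int(c) for c in outer])  # [-20, 19, -11]
```

For the example from the introduction this recovers `P(b) = -11*b^2 + 19*b - 20`. When a remainder keeps a non-constant term, as for `a^3` with inner polynomial `a^2`, the function returns `None`.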

## Data Generation Process

**Step 1: Generate Polynomials**
- Generate outer polynomial P_outer(b) with degree ∈ {2,3,4} and coefficients ∈ [-20,20]
- Generate inner polynomial P_inner(a) with degree ∈ {2,3,4} and coefficients ∈ [-20,20]

**Step 2: Polynomial Substitution & Expansion**
- Substitute: P_outer(P_inner(a)) 
- Expand the result to get the target polynomial

**Step 3: Tokenization**
- Convert all polynomials to prefix notation with special tokens
- Format: `(expanded_polynomial) ⁇ (inner_polynomial)`
  - We only include the inner polynomial since it uniquely determines the decomposition

**Tokenization Rules:**
- **Prefix Notation**: Operators come before operands (e.g., `+ a b` instead of `a + b`)
- **Number Tokenization**: 
  - Positive numbers: `P` followed by space-separated digits (e.g., `23` → `P 2 3`)
  - Negative numbers: `N` followed by space-separated digits (e.g., `-15` → `N 1 5`)
- **Polynomial Terms**:
  - Constant: Just the tokenized number (e.g., `5` → `P 5`)
  - Linear term: `* coefficient variable` (e.g., `3a` → `* P 3 a`)
  - Higher powers: `* coefficient ^ variable power` (e.g., `5a³` → `* P 5 ^ a P 3`)
- **Multiple Terms**: Use nested right-associative addition with `+`
  - Example: `a² + 2a + 3` → `+ * P 1 ^ a P 2 + * P 2 a P 3`
- **Special Symbols**: `⁇` separates input from target, `&` would separate outer from inner (in multi-variable case)
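
The tokenization rules above can be written as a short encoder. This is a sketch of the rules as stated in this section, not the repository's `polynomial_to_prefix_tokens`; the function names are introduced here, terms are passed as `(coefficient, power)` pairs in the order they should appear, and encoding zero as `P 0` is an assumption.

```python
def encode_number(n):
    """Sign token followed by space-separated digits, e.g. -15 -> 'N 1 5'."""
    sign = "N" if n < 0 else "P"  # assumption: zero is encoded as 'P 0'
    return " ".join([sign, *str(abs(n))])


def encode_term(coeff, power, var="a"):
    """Encode a single term: constant, linear, or higher power."""
    if power == 0:
        return encode_number(coeff)
    if power == 1:
        return f"* {encode_number(coeff)} {var}"
    return f"* {encode_number(coeff)} ^ {var} {encode_number(power)}"


def encode_polynomial(terms, var="a"):
    """Join terms with right-nested '+': + t1 + t2 ... t_last."""
    encoded = [encode_term(c, p, var) for c, p in terms]
    return " ".join(["+ " + t for t in encoded[:-1]] + [encoded[-1]])


print(encode_polynomial([(1, 2), (2, 1), (3, 0)]))
# + * P 1 ^ a P 2 + * P 2 a P 3
```

The output reproduces the `a² + 2a + 3` example from the rules. Because the term order is left to the caller, the same function covers both orderings a generator might choose.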

## Example

Generate a single polynomial decomposition sample. This demonstrates the core data generation process for one training example.

In [None]:
# Example: Generate a single polynomial decomposition sample
# This demonstrates the core data generation process for one training example

import random

# Ensure required functions are imported
try:
    from using_sympy import generate_dataset_line
except ImportError:
    print("⚠️ Error: generate_dataset_line not imported. Please run cell 6 first.")
    raise

# Randomly select degrees for outer and inner polynomials from {2, 3, 4}
degree1 = random.choice([2, 3, 4])  # Degree of outer polynomial P_outer(b)
degree2 = random.choice([2, 3, 4])  # Degree of inner polynomial P_inner(a)

print(f"Generating sample with outer degree {degree1} and inner degree {degree2}")

# Generate one training example with debugging output
# Returns: (tokenized_string, (outer_poly, inner_poly, expanded_result))
_, _ = generate_dataset_line(degree1=degree1, degree2=degree2, debug=True)

print("\n" + "="*50)
print("EXPLANATION:")
print("• Outer polynomial: P_outer(b) with coefficients in [-20, 20]")
print("• Inner polynomial: P_inner(a) with coefficients in [-20, 20]")
print("• Substituted: P_outer(P_inner(a)) - substitute inner into outer")
print("• Expanded: Fully expanded form of the substituted polynomial")
print("• Result: Tokenized format for transformer training")
print("  Format: [expanded] ⁇ [outer] & [inner]")

## Generate Dataset

With more than 128 CPUs, dataset generation takes about 15 minutes. With fewer CPUs it can take much longer; in that case, reduce `num_train`.

In [None]:
# =============================================================================
# EXECUTE IMPROVED PARALLEL DATASET GENERATION
# =============================================================================
# This cell runs the improved parallel dataset generation with proper worker control
# 
# Improvements in this version:
#   - Uses exactly 128 workers (not 257+ batches)
#   - Real-time progress tracking for training data generation  
#   - Proper multiprocessing context for Jupyter compatibility
#   - No more "can only test a child process" errors
#   - Detailed per-worker progress reporting
#
# Generated files:
#   - training_dataset.txt      (300,000 samples, mixed degrees)
#   - validation_dataset.txt    (128 samples, mixed degrees)
#   - test_dataset_2_2.txt      (3,000 samples, degree (2,2))
#   - test_dataset_2_3.txt      (3,000 samples, degree (2,3))
#   - test_dataset_2_4.txt      (3,000 samples, degree (2,4))
#   - test_dataset_3_2.txt      (3,000 samples, degree (3,2))
#   - test_dataset_3_3.txt      (3,000 samples, degree (3,3))
#   - test_dataset_3_4.txt      (3,000 samples, degree (3,4))
#   - test_dataset_4_2.txt      (3,000 samples, degree (4,2))
#   - test_dataset_4_3.txt      (3,000 samples, degree (4,3))
#   - test_dataset_4_4.txt      (3,000 samples, degree (4,4))
# =============================================================================

import time
import os
import sys
import importlib

# Reload the improved multiprocessing functions
if 'using_sympy' in sys.modules:
    importlib.reload(sys.modules['using_sympy'])

from using_sympy import generate_all_datasets_parallel

# Detect system capabilities
total_cpus = os.cpu_count()
print(f"🖥️  System Detection:")
print(f"   Total CPU threads: {total_cpus}")

# Determine optimal settings based on system
if total_cpus >= 128:
    optimal_workers = 128
    print(f"   Large system detected: Using {optimal_workers} workers (optimal for very large systems)")
elif total_cpus >= 64:
    optimal_workers = min(64, total_cpus // 2)
    print(f"   Medium-large system detected: Using {optimal_workers} workers")
elif total_cpus >= 8:
    optimal_workers = total_cpus - 1
    print(f"   Medium system detected: Using {optimal_workers} workers (leaving 1 for system)")
else:
    optimal_workers = max(1, total_cpus)
    print(f"   Small system detected: Using {optimal_workers} workers")

print(f"   Expected speedup: ~{optimal_workers}x over single-threaded")
print()

# Start timing the generation process
start_time = time.perf_counter()

# Create output directory if it doesn't exist
file_directory = "data_storage/dataset/single_variable"
if not os.path.exists(file_directory):
    os.makedirs(file_directory)
    print(f"📁 Created directory: {file_directory}")

# Dataset generation parameters
num_train = 300000  # Training samples (reduced for demo)
num_test = 3000     # Test samples per degree combination
num_valid = 128     # Validation samples

print("🚀 Starting IMPROVED PARALLEL dataset generation...")
print(f"🎯 Target: {num_train:,} training + {9*num_test:,} test + {num_valid} validation = {num_train + 9*num_test + num_valid:,} total samples")
print(f"📐 All polynomials: degrees ∈ {{2,3,4}}, coefficients ∈ [-20,20]")
print(f"📝 Format: expanded_polynomial ⁇ inner_polynomial (inner_only=True)")
print(f"✨ New features: Real-time progress tracking, proper worker control, error prevention")
print("-" * 80)

# Run improved parallel dataset generation
generate_all_datasets_parallel(
    file_directory=file_directory, 
    num_train=num_train,
    num_test=num_test,
    num_valid=num_valid,
    inner_only=True,      # Only include inner polynomial (single variable case)
    num_cpus=None         # Auto-detect optimal CPU usage
)

# Calculate and display generation time
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print("-" * 80)
print(f"🎉 Improved parallel dataset generation completed!")
print(f"⚡ Total time: {elapsed_time:.2f} seconds ({elapsed_time/60:.1f} minutes)")

# Estimate speedup (based on previous runs)
estimated_sequential_time = 7924  # seconds from previous sequential run
if elapsed_time > 0:
    speedup = estimated_sequential_time / elapsed_time
    print(f"🚀 Estimated speedup: ~{speedup:.1f}x faster than sequential generation")

print(f"📁 Output directory: {file_directory}/")
print(f"💾 All datasets saved and ready for training!")

# Show system efficiency
if elapsed_time > 0:
    samples_per_second = (num_train + 9*num_test + num_valid) / elapsed_time
    print(f"🏃 Generation rate: {samples_per_second:.0f} samples/second")
    print(f"🔧 Worker efficiency: {samples_per_second/optimal_workers:.0f} samples/second/worker")

## Verification

In [None]:
# =============================================================================
# DATASET VERIFICATION
# =============================================================================
# This cell examines the generated training dataset to verify format and content
# It validates the decomposition using recursive polynomial division

import sys
import importlib

# Add path if not already added
if 'Training/mingpt' not in sys.path:
    sys.path.append('Training/mingpt')

# Force reload of utils module to get latest changes
if 'utils' in sys.modules:
    importlib.reload(sys.modules['utils'])
else:
    import utils

from utils import is_valid_expression_sympy_single

# Load and examine one sample from the training dataset
file_directory = "data_storage/dataset/single_variable"
dataset_file = file_directory + "/training_dataset.txt"
# dataset_file = file_directory + "/test_dataset_4_4.txt"
print(f"📁 Examining: {dataset_file}")
print("-" * 60)

# Read the first line as an example
with open(dataset_file, encoding="utf-8") as f:
    example = f.readline().strip()

print(f"📝 Raw training example:")
print(f"   {example}")
print(f"")
print(f"📏 Character length: {len(example)}")
print(f"")

# Parse the input and target
if " ⁇ " in example:
    input_tokens, target_tokens = example.split(" ⁇ ")
    print(f"🎯 INPUT (expanded polynomial):")
    print(f"   {input_tokens}")
    print(f"")
    print(f"🎯 TARGET (inner polynomial):")
    print(f"   {target_tokens}")
    print(f"")
    
    # Validate the decomposition using recursive division
    print(f"🔍 VALIDATING DECOMPOSITION WITH RECURSIVE DIVISION:")
    is_valid, outer_poly, _ = is_valid_expression_sympy_single(
        input_tokens, target_tokens, return_details=True
    )
    
    if is_valid:
        print(f"   ✅ Valid decomposition!")
        print(f"   📐 Found outer polynomial P(b): {outer_poly}")
        print(f"   🎯 Given inner polynomial Q(a): {target_tokens}")
        print(f"   ✔️  Recursive division eliminated all 'a' variables from outer polynomial")
        print(f"   ✔️  Verification: P(Q(a)) = expanded polynomial")
    else:
        print(f"   ❌ Invalid decomposition!")
        print(f"   📐 Attempted outer polynomial: {outer_poly}")
        print(f"   ❌ Recursive division did not eliminate all 'a' variables")
        print(f"   ❌ This indicates the inner polynomial cannot produce the expanded form")
else:
    print("⚠️  Format error: Expected ' ⁇ ' separator not found")

print(f"")
print(f"✅ Dataset format verification complete!")
print(f"📊 The model will learn: Expanded → Inner polynomial decomposition")
print(f"📋 Algorithm: Recursive polynomial division until no 'a' variables remain in outer polynomial")

# 2. Supervised Learning on Training Dataset

**Objective:** Train a transformer model to learn polynomial decomposition through supervised learning.

In this section, we implement and train a GPT-style transformer model specifically designed for mathematical reasoning tasks. The model learns to perform polynomial decomposition by training on the synthetic dataset generated in Section 1.

## Why Supervised Learning for Polynomial Decomposition?

Polynomial decomposition is a **sequence-to-sequence** problem where:
- **Input**: An expanded polynomial in prefix notation (e.g., `+ * P 3 ^ a P 2 + * P 2 a P 1`)
- **Output**: The decomposed form `outer_polynomial & inner_polynomial` or `inner_polynomial` for a single variable case.
- **Challenge**: The model must learn to recognize patterns and reverse the expansion process

**Key Learning Objectives:**
1. **Pattern Recognition**: Identify when a polynomial can be decomposed
2. **Inverse Operation**: Learn to "undo" polynomial expansion
3. **Mathematical Reasoning**: Understand polynomial structure and relationships

## Model Architecture Details

Our transformer is based on the GPT architecture, with a vocabulary and context window tailored to mathematical expressions:

### Architecture Specifications
- **Model Type**: Decoder-only transformer (GPT-style)
- **Layers**: 6 transformer blocks
- **Attention Heads**: 8 multi-head attention heads per layer
- **Embedding Dimension**: 512
- **Context Window**: 300 tokens (block_size)
- **Vocabulary Size**: 31 tokens (mathematical symbols + digits)
- **Parameters**: ~19M parameters
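
The parameter count can be reproduced from the specifications with a little arithmetic. The sketch below assumes a minGPT-style layout (learned positional embeddings, biased linear layers inside each block, a 4x MLP expansion, and an output head without bias); these layout details are assumptions and are not taken from the repository's `model.py`. The values `vocab_size=31` and `block_size=300` are the ones set in the configuration cell in Section 2.1.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Count parameters of a decoder-only transformer (assumed minGPT-style layout)."""
    tok_emb = vocab_size * n_embd                  # token embedding table
    pos_emb = block_size * n_embd                  # learned positional embeddings
    layer_norm = 2 * n_embd                        # scale and shift
    attention = 4 * (n_embd * n_embd + n_embd)     # query, key, value, output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    final_norm = 2 * n_embd
    head = n_embd * vocab_size                     # output projection, no bias
    return tok_emb + pos_emb + n_layer * block + final_norm + head


total = gpt_param_count(vocab_size=31, block_size=300, n_layer=6, n_embd=512)
print(f"{total:,}")  # 19,100,672
```

Almost all parameters sit in the six transformer blocks (about 3.15M each); the embeddings and output head contribute under 200K because the vocabulary is so small.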

### Key Components
1. **Token Embeddings**: Map mathematical symbols to dense vectors
2. **Positional Encodings**: Help model understand token order in expressions
3. **Multi-Head Attention**: Learn relationships between different parts of polynomials
4. **Causal Masking**: Ensure autoregressive generation (left-to-right)
5. **Layer Normalization**: Stabilize training
6. **Residual Connections**: Enable deep network training
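
Causal masking is the component that makes left-to-right generation possible, so a small illustration may help. The sketch below uses plain Python lists instead of tensors and is not the repository's attention implementation; `causal_attention_weights` is a name introduced here.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the raw attention score of query position i for key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i lie in the future and are masked with -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


uniform = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(uniform):
    print([round(w, 3) for w in row])
```

With uniform scores, the first position attends only to itself, the second splits its attention over the first two positions, and the third over all three; no weight ever falls on a later token.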

## Training Process Overview

The training proceeds in two main steps:

### 2.1 Model Configuration and Dataset Loading
- Set up GPU device and configure model parameters
- Initialize transformer with mathematical-optimized settings
- Load training and validation datasets with proper tokenization
- Prepare data pipelines for efficient batch processing

### 2.2 Supervised Training Loop
- Configure training hyperparameters and optimization strategy
- Run supervised training with teacher forcing
- Monitor training progress and save best checkpoints
- Implement learning rate scheduling and regularization

**Training Strategy:**
- **Teacher Forcing**: During training, provide ground truth target sequence
- **Cross-Entropy Loss**: Measure prediction accuracy at each token position
- **Adam Optimizer**: Adaptive learning rates for different parameters
- **Learning Rate Scheduling**: Warmup followed by cosine decay
- **Checkpointing**: Save the best model based on validation loss
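
The warmup-then-cosine schedule can be sketched as a function of the number of tokens processed. The form below follows the schedule used in minGPT, including a floor at 10% of the base learning rate; whether this repository's `Trainer` uses exactly this form is an assumption. The default values mirror the `TrainerConfig` cell in Section 2.2 (`warmup_tokens=512 * 20`, `final_tokens=batch_size * 3000 * block_size` with `batch_size=256` and `block_size=300`).

```python
import math


def lr_at(tokens_seen, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=256 * 3000 * 300, floor=0.1):
    """Learning rate after `tokens_seen` tokens (assumed minGPT-style schedule)."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to base_lr.
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        # Cosine decay, clamped so the rate never drops below floor * base_lr.
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        progress = min(1.0, progress)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult


for t in (0, 5_120, 10_240, 115_000_000, 230_400_000):
    print(f"{t:>12,} tokens -> lr = {lr_at(t):.2e}")
```

The rate rises linearly to 6e-4 over the first 10,240 tokens, then decays along a half cosine and settles at the floor once `final_tokens` is reached.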

**Training Hyperparameters:**
- **Epochs**: 15 (`max_epochs` in the training configuration below)
- **Batch Size**: 256 (balances memory usage and gradient stability; if you run out of GPU memory, reducing the batch size is the simplest fix)
- **Learning Rate**: 6e-4 (peak value; a common default for small GPT-style models)
- **Weight Decay**: 0.1 (L2 regularization)
- **Warmup Tokens**: 10,240 (512 × 20; gradual learning rate increase)
- **Final Tokens**: ~230M (256 × 3,000 × 300; the point at which the learning rate decay ends)

The trained model will then be evaluated in Section 3 using greedy search and beam search inference methods.

## 2.1. Model Configuration and Dataset Loading

**What we do in this section:**
- Set up the computing environment (GPU/CPU detection)
- Define the mathematical vocabulary for tokenization
- Initialize the transformer model with optimal parameters
- Load and prepare training and validation datasets
- Convert text data to tokenized sequences


In [None]:
import torch

# Ensure required modules are imported
try:
    from model import GPTConfig, GPT
    from dataset import SymbolicDataset
except ImportError:
    print("⚠️ Error: Model classes not imported. Please run cell 6 first.")
    raise

device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
block_size = 300
tokens = [
    "□",
    "a","b","c","d","e","x","y","z",
    "⁇","?",
    "a0","a1","b0","b1",
    "N","P","&","+","*","^",
] + [str(i) for i in range(0, 10)]
vocab_size = len(tokens)

In [None]:
# Load model and datasets

# Ensure required modules are imported
try:
    from model import GPTConfig, GPT
    from dataset import SymbolicDataset
except ImportError:
    print("⚠️ Error: Model classes not imported. Please run cells 6 and 20 first.")
    raise

# Load model
model_cfg = GPTConfig(
    vocab_size, block_size, n_layer=6, n_head=8, n_embd=512
)
gpt = GPT(model_cfg)
gpt.to(device)

# Load and encode dataset
file_directory = "data_storage/dataset/single_variable"
train_data_path = file_directory + "/training_dataset.txt"
valid_data_path = file_directory + "/validation_dataset.txt"
test_data_path = [[None,None,None],[None,None,None],[None,None,None]]
for i, deg1 in enumerate([2,3,4]):
    for j, deg2 in enumerate([2,3,4]):
        test_data_path[i][j] = file_directory + f"/test_dataset_{deg1}_{deg2}.txt"

print("Load training dataset")
train_dataset = SymbolicDataset(
    block_size,
    tokens,
    open(train_data_path, encoding="utf-8").read(),
)

valid_dataset = SymbolicDataset(
    block_size,
    tokens,
    open(valid_data_path, encoding="utf-8").read(),
)

## 2.2. Supervised Training Process

**What we do in this section:**
- Configure training hyperparameters and optimization strategy
- Set up model checkpointing and monitoring
- Run the training loop with teacher forcing
- Track training metrics and save the best model


### Model Persistence

**Checkpoint Strategy:**
- Keep best model based on validation performance
- Enable resume training if interrupted
- Preserve optimizer state for consistent training

**Output Artifacts:**
- `single_variable_model.pt`: Final trained model
- `single_variable_model_best.pt`: Best validation performance
- Training metrics and logs for analysis

After training completion, the model is ready for evaluation in Section 3 using both greedy search and beam search inference methods.

In [None]:
#### To continue training from a previously trained model, run this cell ####
############# To train from scratch, skip this cell #######################

import torch

model_directory = "data_storage/model"
reading_params_path = f"{model_directory}/single_variable_model_best.pt"

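# map_location="cpu" lets a checkpoint saved on a GPU load on CPU-only machines;
# load_state_dict then copies the weights onto the model's current device.
gpt.load_state_dict(torch.load(reading_params_path, map_location="cpu"))
print("Pretrained weights loaded")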
############################################################################

In [None]:
# Training configuration
import os
import time

# Ensure required modules are imported
try:
    from trainer import TrainerConfig
except ImportError:
    print("⚠️ Error: TrainerConfig not imported. Please run cell 6 first.")
    raise

model_directory = "data_storage/model"
model_path = f"{model_directory}/single_variable_model.pt"
best_model_path = f"{model_directory}/single_variable_model_best.pt"
if not os.path.exists(model_directory):
    os.makedirs(model_directory)

batch_size = 256

tconf = TrainerConfig(
            max_epochs=15,
            batch_size=batch_size,
            learning_rate=6e-4,
            lr_decay=True,
            warmup_tokens=512 * 20,
            final_tokens= batch_size * 3000 * block_size,
            num_workers=4,
            ckpt_path=model_path,
            shuffle = True,
            weight_decay = 0.1,
        )

### Skip supervised training process
To skip training, which usually takes 2-3 hours, do not run the following cell. A trained model is already included in the repository.

In [None]:
import time
import torch

# Ensure required modules are imported
try:
    from trainer import Trainer
except ImportError:
    print("⚠️ Error: Trainer not imported. Please run cell 6 first.")
    raise

trainer = Trainer(gpt, train_dataset, valid_dataset, tconf)

# Train the model
start_time = time.perf_counter()
trainer.train()
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Total time taken: {elapsed_time:.2f} seconds")

# Save the trained model
resulting_model = gpt.module if hasattr(gpt, "module") else gpt
torch.save(resulting_model.state_dict(), model_path)
print(f"Model saved to {model_path}")

# 3. Model Evaluation and Inference Methods

**Objective:** Evaluate the trained transformer model using two different inference strategies: greedy search and beam search.

In this section, we systematically evaluate our trained polynomial decomposition model to understand its performance, strengths, and limitations. We compare two different inference methods to analyze how the search strategy affects the quality of generated polynomial decompositions.

## Evaluation Framework

### Test Dataset Organization

Our evaluation uses stratified test sets based on polynomial degrees:
- **test_dataset_2_2.txt**: Degree 2 outer, Degree 2 inner polynomials
- **test_dataset_2_3.txt**: Degree 2 outer, Degree 3 inner polynomials
- **test_dataset_2_4.txt**: Degree 2 outer, Degree 4 inner polynomials
- **test_dataset_3_2.txt**: Degree 3 outer, Degree 2 inner polynomials
- **test_dataset_3_3.txt**: Degree 3 outer, Degree 3 inner polynomials
- **test_dataset_3_4.txt**: Degree 3 outer, Degree 4 inner polynomials
- **test_dataset_4_2.txt**: Degree 4 outer, Degree 2 inner polynomials
- **test_dataset_4_3.txt**: Degree 4 outer, Degree 3 inner polynomials
- **test_dataset_4_4.txt**: Degree 4 outer, Degree 4 inner polynomials

**Why Stratified Evaluation?**
- Different degree combinations have varying complexity
- Helps identify model strengths and weaknesses
- Enables targeted analysis of challenging cases

### Evaluation Metrics
**1. Symbolic Equivalence:**
- Use SymPy to verify mathematical equivalence
- Account for different but equivalent representations
- More flexible than exact string matching

**2. Beam Search Hit Rate:**
- Does any candidate in the beam contain the correct answer?
- Measures the potential of beam search with perfect ranking
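
The two metrics differ only in how much of the beam they inspect, which a small sketch makes explicit. The function name `beam_metrics` and the `is_correct(candidate, target)` callback are hypothetical; in this notebook correctness is established by symbolic validation, for which the toy example below substitutes plain string equality.

```python
def beam_metrics(problems, is_correct):
    """Compute top-1 accuracy and beam hit rate.

    problems: list of (target, ranked_candidates) pairs, best candidate first.
    is_correct: callable(candidate, target) -> bool (hypothetical checker).
    """
    top1 = hits = 0
    for target, candidates in problems:
        flags = [is_correct(c, target) for c in candidates]
        top1 += bool(flags and flags[0])  # only the highest-ranked candidate counts
        hits += any(flags)                # any candidate in the beam counts
    n = len(problems)
    return {"top1_accuracy": top1 / n, "beam_hit_rate": hits / n}


toy_problems = [
    ("q1", ["q1", "x"]),   # correct at rank 1
    ("q2", ["x", "q2"]),   # correct at rank 2
    ("q3", ["x", "y"]),    # not in the beam
]
metrics = beam_metrics(toy_problems, lambda cand, target: cand == target)
print(metrics)
```

In the toy data one of three problems is solved at rank 1 and two of three somewhere in the beam, so the hit rate is an upper bound on what re-ranking the same beam could achieve.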

## Inference Methods Comparison

### 3.1 Greedy Search Evaluation

**How Greedy Search Works:**
1. **Single Path**: At each step, select the token with highest probability
2. **Deterministic**: Same input always produces same output
3. **Fast**: One forward pass per generated token, so O(n) decoding steps for an output of length n

**Advantages:**
- **Speed**: Very fast inference, suitable for real-time applications
- **Simplicity**: Easy to implement and debug
- **Consistency**: Reproducible results
- **Memory Efficient**: Low memory overhead

**Limitations:**
- **Myopic**: Cannot recover from early mistakes
- **Single Solution**: Only explores one path
- **Local Optima**: May get stuck in suboptimal solutions
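
Greedy decoding reduces to a short loop. The sketch below runs on a toy next-token table rather than the trained model: `toy_model`, its probabilities, and the `<eos>` token are all made up for illustration, and a real model would also condition on the expanded polynomial given as prompt.

```python
def greedy_decode(next_token_probs, eos="<eos>", max_new_tokens=10):
    """Repeatedly append the single most probable next token."""
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(tuple(generated))
        token = max(probs, key=probs.get)
        generated.append(token)
        if token == eos:
            break
    return generated


def toy_model(prefix):
    """Hypothetical next-token distribution, keyed by the tokens generated so far."""
    table = {
        (): {"P": 0.6, "N": 0.4},
        ("P",): {"1": 0.55, "2": 0.45},
        ("N",): {"7": 0.9, "8": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


print(greedy_decode(toy_model))  # ['P', '1', '<eos>']
```

The returned sequence has probability 0.6 × 0.55 = 0.33, while `N 7` has probability 0.4 × 0.9 = 0.36. Greedy decoding misses it because it commits to `P` at the first step, which is the myopia described above.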

### 3.2 Beam Search Evaluation

**How Beam Search Works:**
1. **Multiple Paths**: Maintain top-k most probable sequences (beam width)
2. **Score-Based**: Ranks multiple candidate solutions by cumulative log-probability
3. **Pruning**: Keeps only most promising paths at each step
4. **Global View**: Better chance of finding optimal solution

**Key Parameters:**
- **Beam Width**: Number of parallel hypotheses (e.g., 30)
- **Length Penalty**: Bias toward longer/shorter sequences
- **Temperature**: Controls randomness in probability distribution

**Advantages:**
- **Better Solutions**: Higher chance of finding correct decomposition
- **Multiple Candidates**: Provides alternative solutions
- **Flexibility**: Can adjust beam width based on requirements

**Limitations:**
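- **Computational Cost**: Roughly k times the cost of greedy search: about O(k*n) model evaluations for beam width k and sequence length n, plus the per-step cost of selecting the top-k candidates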
- **Memory Usage**: Stores multiple sequences simultaneously
- **Diminishing Returns**: Larger beams don't always improve results
```
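
To make the contrast between greedy and beam search concrete, here is a minimal, dependency-free sketch. The next-token table `TOY_MODEL` and the helper names are hypothetical stand-ins for the transformer; this is not the decoding code in `run.py`. The table is chosen so that the most probable first token leads to a less probable complete sequence.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token probabilities.
# A prefix that is missing from the table is treated as a finished sequence.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    """Return log-probabilities of the possible next tokens for a prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL.get(prefix, {}).items()}

def greedy_decode(max_len=2):
    """Follow the single most probable token at every step."""
    seq, score = (), 0.0
    for _ in range(max_len):
        options = next_log_probs(seq)
        if not options:
            break
        tok = max(options, key=options.get)
        seq, score = seq + (tok,), score + options[tok]
    return seq, score

def beam_decode(beam_width, max_len=2):
    """Keep the beam_width best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            options = next_log_probs(seq)
            if not options:
                candidates.append((seq, score))  # finished sequence is carried over
                continue
            for tok, log_p in options.items():
                candidates.append((seq + (tok,), score + log_p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]  # pruning step
    return beams

print("greedy:", greedy_decode())
print("beam (k=2):", beam_decode(2)[0])
```

Greedy search commits to `A` and cannot reach the sequence `B x`, which has the higher joint probability (0.36 versus 0.33). With `beam_width=1` the beam search in this sketch reduces to greedy search.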


In [53]:
# Greedy search: evaluate every (outer degree n, inner degree m) pair with n, m in [2, 3, 4]
for n in [2, 3, 4]:
    for m in [2, 3, 4]:
        print(f"\nTesting decomposition from degree outer:{n}, inner:{m}")
        !python Training/mingpt/run.py inequality_evaluate4 \
           --block_size 300 \
           --max_output_length 150 \
           --n_embd 512 \
           --n_head 8 \
           --n_layer 6 \
           --sympy 1 \
           --max_test 300 \
           --evaluate_corpus_path data_storage/dataset/single_variable/test_dataset_{n}_{m}.txt \
           --reading_params_path data_storage/model/single_variable_model_best.pt \
           --outputs_path data_storage/predictions/single_variable/example_{n}_{m}.txt
   


Testing decomposition from degree outer:2, inner:2


block size: 300
number of parameters: 19100672
data has 360947 characters, 31 unique.
100%|███████████████████████████████████████████| 20/20 [00:23<00:00,  1.17s/it]
--------------------------------
Example three lines
expanded forms: ['+ N 2 4 0 + * P 2 1 5 a + * P 6 3 8 ^ a P 2 + * N 3 2 0 ^ a P 3 * N 5 1 2 ^ a P 4 ', '+ P 1 0 8 + * N 5 5 2 a + * N 5 1 2 ^ a P 2 + * P 2 9 6 4 ^ a P 3 * P 3 2 1 1 ^ a P 4 ', '+ N 1 0 2 4 + * P 2 0 5 2 a + * N 2 8 8 9 ^ a P 2 + * P 1 8 4 8 ^ a P 3 * N 8 4 7 ^ a P 4 ']
predicted substitutions: ['+ P 1 0 + * N 5 a * N 1 6 ^ a P 2', '+ N 2 + * N 6 a * N 1 3 ^ a P 2', '+ P 1 2 + * N 1 2 a * N 1 1 ^ a P 2']
--------------------------------
[SymPy Valid Single] Expanded: -512*a**4 - 320*a**3 + 638*a**2 + 215*a - 240
                     Inner: -16*a**2 - 5*a + 10
                     Outer: -2*b**2 - 3*b - 10
                     Has 'a' vars: False
                     Valid: True
[SymPy Valid Single] Expanded: 3211*a**4 + 2964*a**3 - 512*a**2 - 552*a + 108
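
The `Has 'a' vars` and `Valid` fields in the log above reflect whether an outer polynomial in `b` alone can be recovered from the expanded polynomial and the predicted inner polynomial. The following is a dependency-free sketch of that idea using repeated polynomial division. It is an assumption about how such a check can work, not the SymPy-based code in `run.py`, and the function names are hypothetical. Polynomials are coefficient lists with the lowest degree first.

```python
from fractions import Fraction

def poly_divmod(num, den):
    """Divide two polynomials given as coefficient lists (lowest degree first)."""
    den = [Fraction(c) for c in den]
    rem = [Fraction(c) for c in num]
    quot = [Fraction(0)] * max(len(rem) - len(den) + 1, 1)
    while len(rem) >= len(den) and any(rem):
        shift = len(rem) - len(den)
        factor = rem[-1] / den[-1]
        quot[shift] = factor
        for i, c in enumerate(den):
            rem[shift + i] -= factor * c
        rem.pop()  # the leading coefficient is now zero
    return quot, rem

def recover_outer(expanded, inner):
    """Return the outer coefficients, or None if the remainders still depend on 'a'."""
    if len(inner) < 2 or inner[-1] == 0:
        return None  # sketch assumes a non-constant inner polynomial
    outer, current = [], list(expanded)
    while any(current):
        current, rem = poly_divmod(current, inner)
        if any(rem[1:]):  # remainder is not a constant: 'a' would remain in the outer
            return None
        outer.append(rem[0] if rem else Fraction(0))
    return outer

# First validated example from the log above.
expanded = [-240, 215, 638, -320, -512]   # -512*a**4 - 320*a**3 + 638*a**2 + 215*a - 240
inner = [10, -5, -16]                     # -16*a**2 - 5*a + 10
print(recover_outer(expanded, inner))     # coefficients of -2*b**2 - 3*b - 10, lowest degree first
```

If any intermediate remainder is not a constant, no outer polynomial in `b` alone exists for that inner polynomial, which corresponds to the `Has 'a' vars: True` case.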

In [55]:
# Beam Search evaluation on test_dataset_2_4.txt
!python Training/mingpt/run.py debug_beam \
   --block_size 300 \
   --max_output_length 150 \
   --n_embd 512 \
   --n_layer 6 \
   --n_head 8 \
   --beam_width 10 \
   --max_test 100 \
   --sympy 1 \
   --evaluate_corpus_path data_storage/dataset/single_variable/test_dataset_2_4.txt \
   --reading_params_path data_storage/model/single_variable_model_best.pt \
   --outputs_path data_storage/predictions/single_variable/example_beam_search.txt

block size: 300
number of parameters: 19100672
data has 718211 characters, 31 unique.
0it [00:00, ?it/s][DEBUG] input_str: + P 8 2 9 + * P 5 4 6 a + * N 1 5 4 8 ^ a P 2 + * N 2 7 2 4 ^ a P 3 + * N 3 3 6 8 ^ a P 4 + * P 1 0 2 0 ^ a P 5 + * P 4 8 6 0 ^ a P 6 + * P 4 5 6 0 ^ a P 7 * P 3 6 1 0 ^ a P 8 
[DEBUG] pred:  + P 9 + * N 3 a + * N 9 ^ a P 2 + * N 1 2 ^ a P 3 * N 1 9 ^ a P 4 
[SymPy Valid Single] Expanded: 3610*a**8 + 4560*a**7 + 4860*a**6 + 1020*a**5 - 3368*a**4 - 2724*a**3 - 1548*a**2 + 546*a + 829
                     Inner: -19*a**4 - 12*a**3 - 9*a**2 - 3*a + 9
                     Outer: 360*a**2 + 12*a + 10*b**2 + b*(120*a + 2) + 1
                     Has 'a' vars: True
                     Valid: False
Beam 0 : False. Len : 123. LogProb : -3.983269238033671. AverageLogP : -0.03238430274824123 

[DEBUG] input_str: + P 8 2 9 + * P 5 4 6 a + * N 1 5 4 8 ^ a P 2 + * N 2 7 2 4 ^ a P 3 + * N 3 3 6 8 ^ a P 4 + * P 1 0 2 0 ^ a P 5 + * P 4 8 6 0 ^ a P 6 + * P 4 5 6 0 ^ a P 7 * P 3 6 
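
The per-beam lines in this output (`Beam 0 : False. Len : ... LogProb : ... AverageLogP : ...`) are the raw material for the two metrics from Section 3. As a rough sketch, assuming each test example yields an ordered list of correctness flags (the function names and the sample flags are hypothetical), top-1 accuracy and beam hit rate can be computed as follows. The last line reproduces the `AverageLogP` value from the log as log-probability divided by length.

```python
def beam_metrics(beam_flags):
    """beam_flags: one list of booleans per test example, ordered from beam 0 downward."""
    n = len(beam_flags)
    top1 = sum(1 for flags in beam_flags if flags and flags[0]) / n
    hit_rate = sum(1 for flags in beam_flags if any(flags)) / n
    return top1, hit_rate

def average_log_prob(log_prob, length):
    """Length-normalized score of one beam candidate."""
    return log_prob / length

# Hypothetical correctness flags for three examples with beam width 4.
flags = [
    [True, False, False, False],   # correct at beam 0
    [False, False, True, False],   # correct only at beam 2
    [False, False, False, False],  # no correct candidate
]
top1, hit_rate = beam_metrics(flags)
print(f"top-1 accuracy: {top1:.3f}, beam hit rate: {hit_rate:.3f}")

# Values taken from the "Beam 0" line in the log above.
print(average_log_prob(-3.983269238033671, 123))
```

The gap between hit rate and top-1 accuracy is the headroom that better ranking could recover.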

In [84]:
# Beam Search evaluation on test_dataset_4_4.txt
!python Training/mingpt/run.py debug_beam \
   --block_size 300 \
   --max_output_length 150 \
   --n_embd 512 \
   --n_layer 6 \
   --n_head 8 \
   --beam_width 10 \
   --max_test 100 \
   --sympy 1 \
   --evaluate_corpus_path data_storage/dataset/single_variable/test_dataset_4_4.txt \
   --reading_params_path data_storage/model/single_variable_model_best.pt \
   --outputs_path data_storage/predictions/single_variable/example_beam_search.txt

block size: 300
number of parameters: 19100672
data has 1537021 characters, 31 unique.
0it [00:00, ?it/s][DEBUG] input_str: + P 3 4 7 9 0 + * N 1 4 0 0 2 2 a + * P 5 2 4 2 8 ^ a P 2 + * P 3 5 7 7 6 2 ^ a P 3 + * N 5 1 9 3 0 9 ^ a P 4 + * P 4 1 7 0 8 4 ^ a P 5 + * P 3 4 3 7 4 6 ^ a P 6 + * N 1 4 5 9 9 3 2 ^ a P 7 + * P 8 9 0 6 4 8 ^ a P 8 + * N 3 3 6 0 7 0 ^ a P 9 + * N 9 6 4 9 3 4 ^ a P 1 0 + * P 1 7 1 4 2 4 4 ^ a P 1 1 + * N 3 1 8 0 5 8 ^ a P 1 2 + * P 5 3 4 8 2 0 ^ a P 1 3 + * P 1 0 1 7 2 8 0 ^ a P 1 4 + * N 1 9 6 5 2 0 ^ a P 1 5 * P 4 1 7 6 0 5 ^ a P 1 6 
[DEBUG] pred:  + P 9 + * P 9 a + * P 1 2 ^ a P 2 + * N 2 ^ a P 3 * N 1 7 ^ a P 4 
[SymPy Valid Single] Expanded: 417605*a**16 - 196520*a**15 + 1017280*a**14 + 534820*a**13 - 318058*a**12 + 1714244*a**11 - 964934*a**10 - 336070*a**9 + 890648*a**8 - 1459932*a**7 + 343746*a**6 + 417084*a**5 - 519309*a**4 + 357762*a**3 + 52428*a**2 - 140022*a + 34790
                     Inner: -17*a**4 - 2*a**3 + 12*a**2 + 9*a + 9
                    

# 4. Rank-Aware Beam GRPO Fine-tuning

**Objective:** Further improve model performance using reinforcement learning with rank-aware beam search optimization.

In this section, we implement **Beam Group Relative Policy Optimization (BGRPO)**, a reinforcement learning technique specifically designed to improve the effectiveness of beam search in mathematical reasoning tasks. Building on the evaluation insights from Section 3, we fine-tune our pre-trained model to better rank correct solutions within beam search results.

## Why Reinforcement Learning for Polynomial Decomposition?

**Limitations of Supervised Learning:**
While supervised learning teaches the model to generate correct decompositions, it doesn't optimize for:
- **Beam Search Ranking**: Ensuring correct solutions appear early in beam search
- **Confidence Calibration**: Assigning higher probabilities to correct solutions
- **Search Efficiency**: Reducing the beam width needed to find correct answers

**Reinforcement Learning Solutions:**
RL fine-tuning addresses these issues by:
- **Reward-Based Learning**: Direct optimization for mathematical correctness
- **Exploration**: Discovering alternative solution paths
- **Ranking Optimization**: Training the model to rank correct solutions higher

## Understanding BGRPO (Beam Group Relative Policy Optimization)

### Core Concept

BGRPO is a specialized RL algorithm that:
1. **Groups Beam Results**: Organizes beam search outputs by correctness
2. **Relative Ranking**: Compares solutions within and across groups
3. **Policy Optimization**: Updates model to prefer correct solutions
4. **Rank-Aware Rewards**: Gives higher rewards to correct solutions that appear early

**Training Process:**
1. **Generate Beams**: Run beam search on training examples
2. **Validate Solutions**: Use SymPy to check mathematical correctness
3. **Group Results**: Separate correct vs incorrect solutions
4. **Compute Loss**: Maximize log-probability of correct group
5. **Update Policy**: Backpropagate gradients to improve model
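
As an illustration of the "relative" part of the training process above, the sketch below normalizes rewards within one group of beam candidates, in the style of standard GRPO advantages. Treating the candidates of one beam search as the group, the reward values, and the function name are assumptions made for this example; the actual BGRPO loss in `Training/BGRPO` may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of candidates."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four beam candidates (1.0 = validated as correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct candidates receive positive advantages and incorrect ones negative advantages. When every candidate in a group gets the same reward, all advantages are zero and that group contributes no learning signal.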

### Two BGRPO Variants

#### 4.1 Simple BGRPO (Without Rank Information)

**Approach:**
- Treats all correct solutions equally regardless of beam position
- Simpler to implement and understand

**Limitations:**
- **No Ranking Preference**: Doesn't prioritize early beam positions
- **Efficiency**: May not improve beam search efficiency

#### 4.2 Rank-Aware BGRPO (With Rank Information)

**Approach:**
- Incorporates beam position into reward calculation
- Higher rewards for correct solutions appearing earlier
- Optimizes both correctness and search efficiency

**Advantages:**
- **Ranking Optimization**: Correct solutions appear earlier in beam
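
The difference between the two variants comes down to the reward function. The sketch below contrasts them under a stated assumption: the linear decay with beam position is a hypothetical choice made for illustration, and the reward actually used in the BGRPO scripts is not shown in this notebook.

```python
def simple_reward(is_correct, rank, beam_width):
    """Variant 4.1: every correct candidate gets the same reward."""
    return 1.0 if is_correct else 0.0

def rank_aware_reward(is_correct, rank, beam_width):
    """Variant 4.2: correct candidates earn more when they appear earlier.

    rank is the 0-based beam position. The linear decay is an assumption.
    """
    if not is_correct:
        return 0.0
    return 1.0 - rank / beam_width

beam_width = 10
for rank in (0, 5, 9):
    print(rank, simple_reward(True, rank, beam_width), rank_aware_reward(True, rank, beam_width))
```

Under the simple reward, a correct answer at beam position 9 is as good as one at position 0, so nothing pushes correct answers toward the top of the beam. The rank-aware reward adds that pressure.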


## Implementation Details

### 4.1 Model Conversion and Setup

**What happens in this subsection:**
- Convert the trained model to HuggingFace format for RL training
- Set up the GRPO training environment and dependencies
- Configure hyperparameters for reinforcement learning
- Prepare datasets for RL fine-tuning

### 4.2 BGRPO Training Execution

**Training Hyperparameters:**
- **Learning Rate**: Typically lower than supervised learning (1e-6 to 1e-5)
- **Batch Size**: Smaller batches due to beam search overhead
- **Beam Width**: Consistent with evaluation (e.g., 30)
- **Training Episodes**: Number of RL training iterations
- **Reward Scaling**: Normalization for stable learning


### 4.3 Post-BGRPO Evaluation

**Comprehensive Assessment:**
After BGRPO fine-tuning, we evaluate improvements in:


**Ranking Quality**
- Distribution of correct solutions across beam positions
- Correlation between model confidence and correctness
- Calibration of probability scores



## Expected BGRPO Outcomes

**Performance Improvements:**
- **Higher Accuracy**: Better overall correctness rates
- **Better Ranking**: Correct solutions appear earlier in beam search
- **Improved Confidence**: Model assigns higher probabilities to correct solutions
- **Efficiency Gains**: Requires smaller beam widths for same performance

In [None]:
import os
from pathlib import Path

print(f"📍 Current directory: {Path.cwd()}")
!pwd

Running the cell below applies BGRPO to our trained model. Checkpoints are saved periodically during training; the conversion log further down lists checkpoints at iterations 10, 20, 30, and 40.

In [135]:
# Run BGRPO
!cd Training/BGRPO && bash run_single_variable_model.sh

Starting single variable model with rank reward on GPU 0
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Project Root: /workspace/PolynomialDecomposition/Training
Successfully imported custom nanogpt model, loader, and utils.
Using Config Path: /workspace/PolynomialDecomposition/Training/../data_storage/model/model_configurations/model_configuration.json
Using Model Dir: /workspace/PolynomialDecomposition/Training/../data_storage/model
Using Device: cuda
Loading configuration from: /workspace/PolynomialDecomposition/Training/../data_storage/model/model_configurations/model_configuration.json
Model name to use: single_variable_model_best.pt
Co

The BGRPO checkpoints are saved as `.safetensors` files, so they must be converted to `.pt` format before evaluation.

In [136]:
# Convert the BGRPO checkpoints from .safetensors to .pt
!cd Training/BGRPO && bash changing.sh

Loading state dict from: ../../data_storage/outputs/_BGRPO/checkpoint-10/model.safetensors
State dict loaded successfully.
Saving state dict to: ../../data_storage/model/BGRPO/_BGRPO/pytorch_model__BGRPO_10.pt
State dict saved successfully as pytorch_model__BGRPO_10.pt.
Loading state dict from: ../../data_storage/outputs/_BGRPO/checkpoint-20/model.safetensors
State dict loaded successfully.
Saving state dict to: ../../data_storage/model/BGRPO/_BGRPO/pytorch_model__BGRPO_20.pt
State dict saved successfully as pytorch_model__BGRPO_20.pt.
Loading state dict from: ../../data_storage/outputs/_BGRPO/checkpoint-30/model.safetensors
State dict loaded successfully.
Saving state dict to: ../../data_storage/model/BGRPO/_BGRPO/pytorch_model__BGRPO_30.pt
State dict saved successfully as pytorch_model__BGRPO_30.pt.
Loading state dict from: ../../data_storage/outputs/_BGRPO/checkpoint-40/model.safetensors
State dict loaded successfully.
Saving state dict to: ../../data_storage/model/BGRPO/_BGRPO/pyto

Run beam search again on test_dataset_4_4.txt. Note that running BGRPO for too long causes the model to break down, so we use pytorch_model__BGRPO_70.pt (the model after 70 iterations) to show the improvement from BGRPO.

In [139]:
# Test the BGRPO model
# Beam Search
!python Training/mingpt/run.py debug_beam \
   --block_size 300 \
   --max_output_length 150 \
   --n_embd 512 \
   --n_layer 6 \
   --n_head 8 \
   --beam_width 10 \
   --max_test 100 \
   --sympy 1 \
   --evaluate_corpus_path data_storage/dataset/single_variable/test_dataset_4_4.txt \
   --reading_params_path data_storage/model/BGRPO/_BGRPO/pytorch_model__BGRPO_70.pt \
   --outputs_path data_storage/predictions/single_variable/after_BGRPO_beam_search.txt

block size: 300
number of parameters: 19100672
data has 1537021 characters, 31 unique.
0it [00:00, ?it/s][DEBUG] input_str: + P 3 4 7 9 0 + * N 1 4 0 0 2 2 a + * P 5 2 4 2 8 ^ a P 2 + * P 3 5 7 7 6 2 ^ a P 3 + * N 5 1 9 3 0 9 ^ a P 4 + * P 4 1 7 0 8 4 ^ a P 5 + * P 3 4 3 7 4 6 ^ a P 6 + * N 1 4 5 9 9 3 2 ^ a P 7 + * P 8 9 0 6 4 8 ^ a P 8 + * N 3 3 6 0 7 0 ^ a P 9 + * N 9 6 4 9 3 4 ^ a P 1 0 + * P 1 7 1 4 2 4 4 ^ a P 1 1 + * N 3 1 8 0 5 8 ^ a P 1 2 + * P 5 3 4 8 2 0 ^ a P 1 3 + * P 1 0 1 7 2 8 0 ^ a P 1 4 + * N 1 9 6 5 2 0 ^ a P 1 5 * P 4 1 7 6 0 5 ^ a P 1 6 
[DEBUG] pred:  + P 9 + * P 9 a + * P 1 2 ^ a P 2 + * P 2 ^ a P 3 * P 1 7 ^ a P 4 
[SymPy Valid Single] Expanded: 417605*a**16 - 196520*a**15 + 1017280*a**14 + 534820*a**13 - 318058*a**12 + 1714244*a**11 - 964934*a**10 - 336070*a**9 + 890648*a**8 - 1459932*a**7 + 343746*a**6 + 417084*a**5 - 519309*a**4 + 357762*a**3 + 52428*a**2 - 140022*a + 34790
                     Inner: 17*a**4 + 2*a**3 + 12*a**2 + 9*a + 9
                     

# 5. Multi-Variable Polynomial Decomposition

**Objective:** Extend the polynomial decomposition problem to handle multiple variables.

In this section, we outline how the polynomial decomposition problem extends to multiple variables, where instead of a single variable 'a', we have multiple variables (a0, a1, a2) and each gets its own inner polynomial substitution.

## Multi-Variable Problem Setup

### Key Differences from Single Variable Case:

**Single Variable:**
- One variable 'a' in the expanded polynomial
- One inner polynomial Q(a)
- One outer polynomial P(b)
- Substitution: P(Q(a))
- Format: `expanded_poly ⁇ inner_poly`

**Multi-Variable (3 variables):**
- Three variables (a0, a1, a2) in the expanded polynomial
- Three inner polynomials: Q0(a0), Q1(a1), Q2(a2)
- One outer polynomial P(b0, b1, b2) with 3 variables
- Substitution: P(Q0(a0), Q1(a1), Q2(a2))
- Format: `expanded_poly ? outer_poly & inner_poly0 & inner_poly1 & inner_poly2`

### Example Problem:
Given an expanded polynomial in variables (a0, a1, a2), find:
- Outer polynomial P(b0, b1, b2)
- Inner polynomial Q0(a0)
- Inner polynomial Q1(a1)  
- Inner polynomial Q2(a2)

Such that: P(Q0(a0), Q1(a1), Q2(a2)) = expanded polynomial
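
A dataset line in the multi-variable format can be split back into its parts with plain string operations. The helper below is a hypothetical sketch: it assumes a single ` ? ` separator and ` & ` between the target polynomials, and it is not taken from the repository. The sample line is the one printed by the generation example in Section 5.1.

```python
def parse_multivariable_line(line):
    """Split 'expanded ? outer & inner0 & inner1 & ...' into its components."""
    expanded, sep, target = line.partition(" ? ")
    if not sep:
        raise ValueError("line has no ' ? ' separator")
    outer, *inners = [part.strip() for part in target.split(" & ")]
    return expanded.strip(), outer, inners

sample = (
    "+ P 2 1 0 9 + * P 9 9 9 a1 + * P 1 5 9 6 a0 * P 7 5 6 * a0 a1 "
    "? + b1 * P 7 * b0 b1 & + P 1 6 * P 1 2 a0 & + N 1 9 * P 9 a1 "
    "& + P 1 7 + * P 5 a2 * N 1 1 ^ a2 P 2"
)
expanded, outer, inners = parse_multivariable_line(sample)
print("outer:", outer)
for i, inner in enumerate(inners):
    print(f"inner {i}:", inner)
```

The number of inner polynomials is not hard-coded, so the same helper would work for a different number of variables.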



## 5.1 Multi-Variable Data Generation

Generate multi-variable polynomial decomposition datasets using the new parallel generation functions.

### Configuration:
- **Inner variables**: 3 (a0, a1, a2)
- **Outer variables**: 3 (b0, b1, b2)
- **Max degree (inner)**: 2 (each inner polynomial has degree 1 or 2)
- **Max degree (outer)**: 2 (outer polynomial has degree 1 or 2)
- **Dataset sizes**: 300k training, 3k test, 128 validation

The format will be: `expanded_polynomial ? outer_polynomial & inner0 & inner1 & inner2`

In [15]:
# Example: Generate a single multi-variable polynomial decomposition sample

import random
import sys
import importlib

# Reload the module to get the new multivariate functions
if 'using_sympy' in sys.modules:
    importlib.reload(sys.modules['using_sympy'])

# Import the multivariate generation function
from using_sympy import generate_multivariate_dataset_line

# Generate one example with debug output
print("🔍 Generating multi-variable polynomial decomposition example:")
print("="*60)

line, (outer_poly, inner_polys, expanded_result) = generate_multivariate_dataset_line(
    num_inner_vars=3,  # a0, a1, a2
    num_outer_vars=3,  # b0, b1, b2
    max_degree_inner=2,
    max_degree_outer=2,
    debug=True
)

print("\n" + "="*60)
print("EXPLANATION:")
print("• Outer polynomial: P(b0, b1, b2) with multivariate terms")
print("• Inner polynomials: Q0(a0), Q1(a1), Q2(a2) each with degree 1 or 2")
print("• Substitution: P(Q0(a0), Q1(a1), Q2(a2))")
print("• Format: [expanded] ? [outer] & [inner0] & [inner1] & [inner2]")

🔍 Generating multi-variable polynomial decomposition example:
Outer polynomial (degree 2): -7*b0*b1 + b1
Inner polynomial 0 (degree 1): 12*a0 + 16
Inner polynomial 1 (degree 1): 9*a1 - 19
Inner polynomial 2 (degree 2): -11*a2**2 + 5*a2 + 17
Substituted: 9*a1 - 7*(12*a0 + 16)*(9*a1 - 19) - 19
Expanded: -756*a0*a1 + 1596*a0 - 999*a1 + 2109
Result: + P 2 1 0 9 + * P 9 9 9 a1 + * P 1 5 9 6 a0 * P 7 5 6 * a0 a1 ? + b1 * P 7 * b0 b1 & + P 1 6 * P 1 2 a0 & + N 1 9 * P 9 a1 & + P 1 7 + * P 5 a2 * N 1 1 ^ a2 P 2

EXPLANATION:
• Outer polynomial: P(b0, b1, b2) with multivariate terms
• Inner polynomials: Q0(a0), Q1(a1), Q2(a2) each with degree 1 or 2
• Substitution: P(Q0(a0), Q1(a1), Q2(a2))
• Format: [expanded] ? [outer] & [inner0] & [inner1] & [inner2]
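
The sample printed above can be cross-checked without SymPy by evaluating both sides of `P(Q0(a0), Q1(a1), Q2(a2)) = expanded` on a grid of integer points. This is only an illustrative sketch with hypothetical function names. The polynomials are copied from the `Outer polynomial`, `Inner polynomial` and `Expanded` lines of the output.

```python
from itertools import product

def outer(b0, b1, b2):
    # -7*b0*b1 + b1 (b2 does not appear in this particular sample)
    return -7 * b0 * b1 + b1

def q0(a0):
    return 12 * a0 + 16

def q1(a1):
    return 9 * a1 - 19

def q2(a2):
    return -11 * a2**2 + 5 * a2 + 17

def expanded(a0, a1, a2):
    return -756 * a0 * a1 + 1596 * a0 - 999 * a1 + 2109

def matches_on_grid(points=range(-3, 4)):
    """Compare substitution and expansion at every grid point."""
    return all(
        outer(q0(a0), q1(a1), q2(a2)) == expanded(a0, a1, a2)
        for a0, a1, a2 in product(points, repeat=3)
    )

print(matches_on_grid())
```

With degree at most 2 for both the outer and inner polynomials, the composition has degree at most 4 in each variable. Agreement on 7 points per variable is therefore enough to conclude that the two polynomials are identical.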


In [9]:
# =============================================================================
# EXECUTE MULTI-VARIABLE PARALLEL DATASET GENERATION
# =============================================================================
# This cell runs parallel dataset generation for multi-variable polynomial decomposition
# 
# Generated files:
#   - training_dataset.txt      (300,000 samples)
#   - test_dataset.txt          (3,000 samples)
#   - validation_dataset.txt    (128 samples)
# =============================================================================

import time
import os
import sys
import importlib

# Reload the improved multiprocessing functions
if 'using_sympy' in sys.modules:
    importlib.reload(sys.modules['using_sympy'])

from using_sympy import generate_multivariate_datasets_parallel

# Detect system capabilities
total_cpus = os.cpu_count()
print(f"🖥️  System Detection:")
print(f"   Total CPU threads: {total_cpus}")

# Determine optimal settings based on system
if total_cpus >= 128:
    optimal_workers = 128
    print(f"   Large system detected: Using {optimal_workers} workers (optimal for very large systems)")
elif total_cpus >= 64:
    optimal_workers = min(64, total_cpus // 2)
    print(f"   Medium-large system detected: Using {optimal_workers} workers")
elif total_cpus >= 8:
    optimal_workers = total_cpus - 1
    print(f"   Medium system detected: Using {optimal_workers} workers (leaving 1 for system)")
else:
    optimal_workers = max(1, total_cpus)
    print(f"   Small system detected: Using {optimal_workers} workers")

print(f"   Expected speedup: ~{optimal_workers}x over single-threaded")
print()

# Start timing the generation process
start_time = time.perf_counter()

# Create output directory if it doesn't exist
file_directory = "data_storage/dataset/multi_variable"
if not os.path.exists(file_directory):
    os.makedirs(file_directory)
    print(f"📁 Created directory: {file_directory}")

# Dataset generation parameters
num_train = 300000  # Training samples
num_test = 3000     # Test samples
num_valid = 128     # Validation samples

# Multi-variable configuration
num_inner_vars = 3  # a0, a1, a2
num_outer_vars = 3  # b0, b1, b2
max_degree_inner = 2  # Each inner poly has degree 1 or 2
max_degree_outer = 2  # Outer poly has degree 1 or 2

print("🚀 Starting MULTI-VARIABLE PARALLEL dataset generation...")
print(f"🎯 Target: {num_train:,} training + {num_test:,} test + {num_valid} validation = {num_train + num_test + num_valid:,} total samples")
print(f"📐 Configuration:")
print(f"   Inner: {num_inner_vars} variables (a0..a{num_inner_vars-1}), max degree {max_degree_inner}")
print(f"   Outer: {num_outer_vars} variables (b0..b{num_outer_vars-1}), max degree {max_degree_outer}")
print(f"   Coefficients: ∈ [-20, 20]")
print(f"📝 Format: expanded_polynomial ? outer_polynomial & inner0 & inner1 & inner2")
print("-" * 80)

# Run parallel dataset generation for multi-variable case
generate_multivariate_datasets_parallel(
    file_directory=file_directory,
    num_inner_vars=num_inner_vars,
    num_outer_vars=num_outer_vars,
    max_degree_inner=max_degree_inner,
    max_degree_outer=max_degree_outer,
    num_train=num_train,
    num_test=num_test,
    num_valid=num_valid,
    num_cpus=None  # Auto-detect optimal CPU usage
)

# Calculate and display generation time
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print("-" * 80)
print("🎉 Multi-variable dataset generation completed!")
print(f"⚡ Total time: {elapsed_time:.2f} seconds ({elapsed_time/60:.1f} minutes)")

print(f"📁 Output directory: {file_directory}/")
print("💾 All datasets saved and ready for training!")

# Show system efficiency
if elapsed_time > 0:
    samples_per_second = (num_train + num_test + num_valid) / elapsed_time
    print(f"🏃 Generation rate: {samples_per_second:.0f} samples/second")
    print(f"🔧 Worker efficiency: {samples_per_second/optimal_workers:.0f} samples/second/worker")

🖥️  System Detection:
   Total CPU threads: 255
   Large system detected: Using 128 workers (optimal for very large systems)
   Expected speedup: ~128x over single-threaded

🚀 Starting MULTI-VARIABLE PARALLEL dataset generation...
🎯 Target: 300,000 training + 3,000 test + 128 validation = 303,128 total samples
📐 Configuration:
   Inner: 3 variables (a0..a2), max degree 2
   Outer: 3 variables (b0..b2), max degree 2
   Coefficients: ∈ [-20, 20]
📝 Format: expanded_polynomial ? outer_polynomial & inner0 & inner1 & inner2
--------------------------------------------------------------------------------
🚀 Starting multivariate dataset generation...
📊 Configuration:
   Inner variables: 3 (a0...a2)
   Outer variables: 3 (b0...b2)
   Max degree (inner): 2
   Max degree (outer): 2
   Dataset sizes: train=300,000, test=3,000, valid=128
💻 System has 255 total CPU threads, using 128 workers
🔧 Multiprocessing method: spawn

📝 Generating 312,221 samples (3% overhead for deduplication)...
🎯 Each worke

    Worker 1: 244/2440 samples
    Worker 15: 244/2440 samples
    Worker 5: 244/2440 samples
    Worker 39: 244/2440 samples
    Worker 11: 244/2440 samples
    Worker 21: 244/2440 samples
    Worker 4: 244/2440 samples
    Worker 25: 244/2440 samples
    Worker 6: 244/2440 samples
    Worker 7: 244/2440 samples
    Worker 2: 244/2440 samples
    Worker 23: 244/2440 samples
    Worker 10: 244/2440 samples
    Worker 27: 244/2440 samples
    Worker 20: 244/2440 samples
    Worker 16: 244/2440 samples
    Worker 26: 244/2440 samples
    Worker 9: 244/2440 samples
    Worker 12: 244/2440 samples
    Worker 18: 244/2440 samples
    Worker 22: 244/2440 samples
    Worker 13: 244/2440 samples
    Worker 41: 244/2440 samples
    Worker 28: 244/2440 samples
    Worker 17: 244/2440 samples
    Worker 3: 244/2440 samples
    Worker 31: 244/2440 samples
    Worker 24: 244/2440 samples
    Worker 35: 244/2440 samples
    Worker 30: 244/2440 samples
    Worker 59: 244/2440 samples
    Worker 62: 2

## 5.2 Multi-Variable Model Training

Train a transformer model on the multi-variable polynomial decomposition dataset with extended vocabulary.

### Key Configuration Changes:
- **Extended Vocabulary**: Includes a0-a18, b0-b18, n1-n18 tokens for multi-variable support
- **Separator Token**: Uses '?' instead of '⁇' for multi-variable cases
- **Larger Block Size**: 800 tokens to accommodate longer multi-variable expressions
- **Model Architecture**: Same as single-variable but with extended vocabulary size

In [16]:
# Configure multi-variable training environment
import torch
import os

# Ensure required modules are imported
try:
    from model import GPTConfig, GPT
    from dataset import SymbolicDataset
except ImportError:
    print("⚠️ Error: Model classes not imported. Please run cell 6 first.")
    raise

device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
print(f"🖥️  Using device: {device}")

# Extended vocabulary for multi-variable support
block_size = 800  # Larger block size for multi-variable expressions
max_number_token = 101  # Support numbers 0-100

# Build extended vocabulary
extended_tokens = [
    "□",  # PAD token
    "a","b","c","d","e","x","y","z",  # Base variables
    "⁇","?",  # Separators (? for multi-variable)
    # Extended variable tokens
    "a0","a1","a2","a3","a4","a5","a6","a7","a8","a9","a10",
    "a11","a12","a13","a14","a15","a16","a17","a18",
    "b0","b1","b2","b3","b4","b5","b6","b7","b8","b9",
    "b10","b11","b12","b13","b14","b15","b16","b17","b18",
    "n1","n2","n3","n4","n5","n6","n7","n8","n9",
    "n10","n11","n12","n13","n14","n15","n16","n17","n18",
    "N","P","&","+","*","^",  # Operators
] + [str(i) for i in range(0, max_number_token)]

vocab_size = len(extended_tokens)

print(f"📚 Extended vocabulary configured:")
print(f"   Vocabulary size: {vocab_size} tokens")
print(f"   Block size: {block_size} tokens")
print(f"   Number range: 0-{max_number_token-1}")
print(f"   Multi-variable tokens: a0-a18, b0-b18, n1-n18")

🖥️  Using device: 0
📚 Extended vocabulary configured:
   Vocabulary size: 174 tokens
   Block size: 800 tokens
   Number range: 0-100
   Multi-variable tokens: a0-a18, b0-b18, n1-n18
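
For readers who want to see how such a vocabulary turns a prefix-notation string into model input, here is a small sketch. It rebuilds the same token list and maps whitespace-separated tokens to integer ids. The names `stoi`, `itos`, `encode` and `decode` are hypothetical; the internals of `SymbolicDataset` are not shown in this notebook and may differ.

```python
# Rebuild the extended vocabulary in the same order as the cell above.
tokens = (
    ["□"]                                   # PAD token
    + list("abcdexyz")                      # base variables
    + ["⁇", "?"]                            # separators
    + [f"a{i}" for i in range(19)]          # a0..a18
    + [f"b{i}" for i in range(19)]          # b0..b18
    + [f"n{i}" for i in range(1, 19)]       # n1..n18
    + ["N", "P", "&", "+", "*", "^"]        # sign markers and operators
    + [str(i) for i in range(101)]          # numbers 0..100
)

stoi = {tok: idx for idx, tok in enumerate(tokens)}  # token -> id
itos = {idx: tok for tok, idx in stoi.items()}       # id -> token

def encode(text):
    """Map a whitespace-separated token string to a list of ids."""
    return [stoi[tok] for tok in text.split()]

def decode(ids):
    """Inverse of encode."""
    return " ".join(itos[i] for i in ids)

sample = "+ P 1 6 * P 1 2 a0"
ids = encode(sample)
print(len(tokens), ids)
print(decode(ids))
```

Splitting on whitespace keeps multi-character tokens such as `a0` intact, which is why they need their own vocabulary entries.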


In [17]:
# Load multi-variable datasets and initialize model

# Load model with extended vocabulary
model_cfg = GPTConfig(
    vocab_size, block_size, n_layer=6, n_head=8, n_embd=512
)
multi_var_gpt = GPT(model_cfg)
multi_var_gpt.to(device)

# Load multi-variable datasets
file_directory = "data_storage/dataset/multi_variable"
train_data_path = file_directory + "/training_dataset.txt"
valid_data_path = file_directory + "/validation_dataset.txt"
test_data_path = file_directory + "/test_dataset.txt"

print("📂 Loading multi-variable datasets...")

# Check if datasets exist
import os
if not os.path.exists(train_data_path):
    print(f"⚠️ Training dataset not found at {train_data_path}")
    print("Please run the multi-variable data generation cell first (cell 42)")
else:
    # Load training dataset with extended vocabulary
    multi_train_dataset = SymbolicDataset(
        block_size,
        extended_tokens,
        open(train_data_path, encoding="utf-8").read(),
        use_extended_vocab=True  # Enable multi-variable support
    )
    
    # Load validation dataset
    multi_valid_dataset = SymbolicDataset(
        block_size,
        extended_tokens,
        open(valid_data_path, encoding="utf-8").read(),
        use_extended_vocab=True
    )
    
    print(f"✅ Datasets loaded successfully!")
    print(f"   Training samples: {len(multi_train_dataset)}")
    print(f"   Validation samples: {len(multi_valid_dataset)}")
    print(f"   Using extended vocabulary with {vocab_size} tokens")

number of parameters: 19503104
📂 Loading multi-variable datasets...
data has 89047487 characters, 174 unique.
data has 38765 characters, 174 unique.
✅ Datasets loaded successfully!
   Training samples: 300000
   Validation samples: 128
   Using extended vocabulary with 174 tokens


### Configure Training Hyperparameters

Set up training configuration optimized for multi-variable polynomial decomposition:

In [18]:
# Training configuration for multi-variable model
import os
import time

# Ensure required modules are imported
try:
    from trainer import TrainerConfig
except ImportError:
    print("⚠️ Error: TrainerConfig not imported. Please run cell 6 first.")
    raise

# Create model directory for multi-variable models
model_directory = "data_storage/model/multi_variable"
model_path = f"{model_directory}/multi_variable_model.pt"
best_model_path = f"{model_directory}/multi_variable_model_best.pt"

if not os.path.exists(model_directory):
    os.makedirs(model_directory)
    print(f"📁 Created directory: {model_directory}")

# Batch size - keeping original size
batch_size = 128  # Original batch size

# Configure training; num_workers=0 avoids the memory overhead of parallel data-loading workers
multi_var_tconf = TrainerConfig(
    max_epochs=15,
    batch_size=batch_size,
    learning_rate=6e-4,
    lr_decay=True,
    warmup_tokens=512 * 20,
    final_tokens=batch_size * 3000 * block_size,
    num_workers=0,  # Set to 0 to avoid memory overhead from parallel workers
    validation_interval=50,  # Validate every 50 iterations (our optimization)
    ckpt_path=model_path,
    shuffle=True,
    weight_decay=0.1,
)

print(f"🎯 Training configuration:")
print(f"   Epochs: {multi_var_tconf.max_epochs}")
print(f"   Batch size: {batch_size}")
print(f"   Learning rate: {multi_var_tconf.learning_rate}")
print(f"   Block size: {block_size}")
print(f"   Num workers: 0 (to avoid memory overhead)")
print(f"   Validation interval: 50 (faster training)")
print(f"   Model path: {model_path}")

🎯 Training configuration:
   Epochs: 15
   Batch size: 128
   Learning rate: 0.0006
   Block size: 800
   Num workers: 0 (to avoid memory overhead)
   Validation interval: 50 (faster training)
   Model path: data_storage/model/multi_variable/multi_variable_model.pt
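
The `warmup_tokens`, `final_tokens` and `lr_decay` settings describe a token-based learning-rate schedule. The sketch below assumes that the trainer follows the schedule of the original minGPT trainer (linear warmup, then cosine decay with a floor at 10% of the base rate). That is an assumption about `trainer.py`, which is not shown here, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(tokens_seen, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=128 * 3000 * 800, min_mult=0.1):
    """Learning rate after a given number of processed tokens."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to base_lr.
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        # Cosine decay from base_lr down to min_mult * base_lr.
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        progress = min(progress, 1.0)
        mult = max(min_mult, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

for tokens_seen in (0, 5_120, 10_240, 150_000_000, 307_200_000):
    print(f"{tokens_seen:>12,} tokens -> lr = {lr_at(tokens_seen):.2e}")
```

With these values the warmup covers only 10,240 tokens, so almost the entire run is spent in the decay phase.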


### Optional: Load Pre-trained Multi-Variable Model

If you have a previously trained multi-variable model, you can load it here:

In [19]:
# OPTIONAL: Load pre-trained multi-variable model
# Skip this cell if training from scratch

import os
import torch

model_directory = "data_storage/model/multi_variable"
reading_params_path = f"{model_directory}/multi_variable_model_best.pt"

if os.path.exists(reading_params_path):
    multi_var_gpt.load_state_dict(torch.load(reading_params_path))
    print(f"✅ Pre-trained multi-variable model loaded from {reading_params_path}")
else:
    print(f"ℹ️ No pre-trained model found at {reading_params_path}")
    print("Training from scratch...")

✅ Pre-trained multi-variable model loaded from data_storage/model/multi_variable/multi_variable_model_best.pt


### Train the Multi-Variable Model

**Note:** Training can take 3-5 hours depending on your hardware. The model needs to learn more complex patterns due to:
- Multiple variables and their interactions
- Longer sequences (800 tokens vs 300)
- Larger vocabulary (174 tokens vs 31)

You can skip this cell if you want to use a pre-trained model.

In [21]:
# Train the multi-variable model
import time
import torch

# Ensure required modules are imported
try:
    from trainer import Trainer
except ImportError:
    print("⚠️ Error: Trainer not imported. Please run cell 6 first.")
    raise

# Check if datasets are loaded
if 'multi_train_dataset' not in locals():
    print("❌ Datasets not loaded. Please run the dataset loading cell first.")
else:
    print("🚀 Starting multi-variable model training...")
    print("⏱️ This may take 3-5 hours. You can interrupt and resume later.")
    print("-" * 60)
    
    # Create trainer
    multi_var_trainer = Trainer(multi_var_gpt, multi_train_dataset, multi_valid_dataset, multi_var_tconf)
    
    # Train the model
    start_time = time.perf_counter()
    multi_var_trainer.train()
    end_time = time.perf_counter()
    elapsed_time = end_time - start_time
    
    print("-" * 60)
    print(f"✅ Training completed!")
    print(f"⏱️ Total time: {elapsed_time:.2f} seconds ({elapsed_time/3600:.1f} hours)")
    
    # Save the trained model
    resulting_model = multi_var_gpt.module if hasattr(multi_var_gpt, "module") else multi_var_gpt
    torch.save(resulting_model.state_dict(), model_path)
    print(f"💾 Model saved to {model_path}")

🚀 Starting multi-variable model training...
⏱️ This may take 3-5 hours. You can interrupt and resume later.
------------------------------------------------------------


  0%|          | 0/2344 [00:00<?, ?it/s]



Training failed with error: CUDA out of memory. Tried to allocate 2.44 GiB. GPU 0 has a total capacity of 79.14 GiB of which 2.09 GiB is free. Process 4017528 has 77.04 GiB memory in use. Of the allocated memory 76.17 GiB is allocated by PyTorch, and 382.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Force cleanup initiated...
Force cleanup completed.
Performing cleanup...
Error during cleanup: 'Trainer' object has no attribute 'data_loaders'


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.44 GiB. GPU 0 has a total capacity of 79.14 GiB of which 2.09 GiB is free. Process 4017528 has 77.04 GiB memory in use. Of the allocated memory 76.17 GiB is allocated by PyTorch, and 382.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
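
The out-of-memory failure above is commonly addressed by lowering the per-step batch size. The following is a minimal sketch of a retry loop. It assumes a hypothetical `train_fn(batch_size)` callable that wraps trainer construction and the `train()` call, and it assumes the failure surfaces as a `RuntimeError` whose message contains "out of memory", as the logged message suggests. Neither the callable nor the batch sizes come from this notebook.

```python
def train_with_batch_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each OOM error.

    train_fn is a hypothetical callable wrapping trainer creation and
    training; it is not part of this notebook's Trainer API.
    Returns the batch size that completed without an OOM error.
    """
    while True:
        try:
            train_fn(batch_size)
            return batch_size
        except RuntimeError as err:
            # Assumption: CUDA OOM is reported as a RuntimeError with this text.
            if "out of memory" not in str(err).lower():
                raise  # unrelated failure: do not mask it
            if batch_size // 2 < min_batch_size:
                raise  # cannot shrink any further
            batch_size //= 2
            print(f"OOM detected; retrying with batch_size={batch_size}")
```

In a real run, `train_fn` would also need to release the failed trainer and cached GPU memory before the next attempt; that step is omitted here because it depends on how the trainer holds its resources.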

## 5.3 Multi-Variable Model Evaluation

Evaluate the trained multi-variable model on test data:

In [None]:
# Evaluate the multi-variable model using the command-line script
# The --extended_vocab flag below selects the extended multi-variable vocabulary

model_directory = "data_storage/model/multi_variable"
test_data_path = "data_storage/dataset/multi_variable/test_dataset.txt"
output_path = "data_storage/predictions/multi_variable"

# Create output directory
import os
os.makedirs(output_path, exist_ok=True)

print("🔍 Evaluating multi-variable model with greedy search...")
print("-" * 60)

# Run evaluation with extended vocabulary flag
!python Training/mingpt/run.py inequality_evaluate4 \
   --extended_vocab \
   --block_size 800 \
   --max_output_length 400 \
   --n_embd 512 \
   --n_head 8 \
   --n_layer 6 \
   --max_number_token 101 \
   --sympy 1 \
   --max_test 100 \
   --evaluate_corpus_path {test_data_path} \
   --reading_params_path {model_directory}/multi_variable_model_best.pt \
   --outputs_path {output_path}/multi_var_greedy.txt

In [None]:
# Beam search evaluation for multi-variable model
print("🔍 Evaluating multi-variable model with beam search...")
print("-" * 60)

!python Training/mingpt/run.py debug_beam \
   --extended_vocab \
   --block_size 800 \
   --max_output_length 400 \
   --n_embd 512 \
   --n_layer 6 \
   --n_head 8 \
   --max_number_token 101 \
   --beam_width 10 \
   --max_test 50 \
   --sympy 1 \
   --evaluate_corpus_path {test_data_path} \
   --reading_params_path {model_directory}/multi_variable_model_best.pt \
   --outputs_path {output_path}/multi_var_beam.txt
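
The greedy and beam search cells share most of their flags. One way to keep them in sync is to build the argument list in Python. The sketch below uses only the flags that already appear in the cells above; the helper name `build_eval_command` is hypothetical and not part of the repository.

```python
def build_eval_command(mode, test_data_path, model_path, outputs_path,
                       max_test, beam_width=None):
    """Return the argument list for Training/mingpt/run.py.

    Only flags that appear in the evaluation cells are used; the helper
    itself is illustrative.
    """
    cmd = [
        "python", "Training/mingpt/run.py", mode,
        "--extended_vocab",
        "--block_size", "800",
        "--max_output_length", "400",
        "--n_embd", "512",
        "--n_head", "8",
        "--n_layer", "6",
        "--max_number_token", "101",
        "--sympy", "1",
        "--max_test", str(max_test),
        "--evaluate_corpus_path", test_data_path,
        "--reading_params_path", model_path,
        "--outputs_path", outputs_path,
    ]
    if beam_width is not None:
        # Only the beam search mode takes a beam width in the cells above.
        cmd += ["--beam_width", str(beam_width)]
    return cmd


greedy_cmd = build_eval_command(
    "inequality_evaluate4",
    "data_storage/dataset/multi_variable/test_dataset.txt",
    "data_storage/model/multi_variable/multi_variable_model_best.pt",
    "data_storage/predictions/multi_variable/multi_var_greedy.txt",
    max_test=100,
)
print(" ".join(greedy_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` instead of using the `!python` shell escape, which makes a failed evaluation raise an exception in the notebook.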

## Summary

You've successfully extended the polynomial decomposition problem to multiple variables! Here's what we accomplished:

### 🎯 Key Achievements:

1. **Data Generation**: Created parallel generation functions for multi-variable polynomial decomposition with 3 inner and 3 outer variables

2. **Extended Vocabulary**: Implemented support for multi-variable tokens (a0-a18, b0-b18, n1-n18) and numbers 0-100

3. **Model Training**: Configured a transformer model and its training run with:
   - 800-token context window for longer expressions
   - Extended vocabulary of 150+ tokens
   - Multi-variable dataset compatibility

4. **Evaluation**: Set up greedy and beam search evaluation of the model on multi-variable test data

### 📊 Comparison: Single vs Multi-Variable

| Aspect | Single Variable | Multi-Variable |
|--------|----------------|----------------|
| Variables | 1 (a) | 3+ (a0, a1, a2, ...) |
| Inner Polynomials | 1 | Multiple (one per variable) |
| Vocabulary Size | ~31 tokens | 150+ tokens |
| Block Size | 300 | 800 |
| Training Time | 2-3 hours | 3-5 hours |
| Problem Format | `expanded ⁇ inner` | `expanded ⁇ outer & inner0 & inner1 & inner2` |
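
To make the multi-variable problem format concrete, here is a minimal sketch that splits one example line into its parts. It assumes `⁇` separates the expanded polynomial from the answer and `&` separates the answer polynomials; both are parameters in case the dataset files use different separator tokens. The function name and the placeholder strings are illustrative and not taken from the dataset.

```python
def split_multi_var_example(line, sep="⁇", poly_sep="&"):
    """Split 'expanded ⁇ outer & inner0 & ...' into its components."""
    expanded, found, answer = line.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found")
    parts = [p.strip() for p in answer.split(poly_sep)]
    # The first answer polynomial is the outer one; the rest are inner ones.
    outer, inners = parts[0], parts[1:]
    return expanded.strip(), outer, inners


# Placeholder strings stand in for real prefix-notation token sequences.
expanded, outer, inners = split_multi_var_example(
    "EXPANDED ⁇ OUTER & INNER0 & INNER1 & INNER2"
)
print(outer, inners)
```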

### 🚀 Next Steps:

1. **BGRPO Fine-tuning**: Apply reinforcement learning to improve multi-variable model performance
2. **Scaling**: Experiment with more variables (4-6) or higher-degree polynomials
3. **Architecture**: Try larger models or different attention mechanisms
4. **Analysis**: Compare single-variable and multi-variable model performance and generalization