# Polynomial GCD Dataset — Minimal Example

This notebook shows how **easy** it is to plug a custom *problem generator* into the
`transformer_algebra` data pipeline.  
Instead of the built‑in `SumProblemGenerator`, we define our own `GCDProblemGenerator`
directly in the notebook, import the rest of the library, and instantly obtain a toy dataset.

## 1. Imports

In [None]:
from sage.all import PolynomialRing, GF
from transformer_algebra import PolynomialSampler, DatasetGenerator
# We will define GCDProblemGenerator below

## 2. Define a Polynomial Ring

In [20]:
# GF(7) with 2 variables x0, x1
ring = PolynomialRing(GF(7), 2, "x", order="degrevlex")
ring

Multivariate Polynomial Ring in x0, x1 over Finite Field of size 7

## 3. Build a Polynomial Sampler

In [21]:
sampler = PolynomialSampler(
    ring=ring,
    max_num_terms=6,
    max_degree=4,
    min_degree=1,
    degree_sampling="uniform",
    term_sampling="uniform",
    nonzero_instance=True,
)

## 4. Write a **custom** `GCDProblemGenerator`

In [22]:
from sage.misc import randstate
from sage.misc.prandom import randint

class GCDProblemGenerator:
    """Generate pairs of polynomials and their greatest common divisor."""
    def __init__(self, sampler):
        self.sampler = sampler
        self.ring = sampler.ring

    def __call__(self, seed: int):
        randstate.set_random_seed(seed)

        # Draw three polynomials: gcd, q1, q2
        gcd_poly, q1, q2 = self.sampler.sample(num_samples=3)

        # Make q1 and q2 coprime by moving their common factor into gcd_poly,
        # so that gcd_poly is exactly the gcd of the two inputs (up to a unit)
        _gcd = q1.gcd(q2)
        gcd_poly, q1, q2 = gcd_poly * _gcd, self.ring(q1 / _gcd), self.ring(q2 / _gcd)

        F = [gcd_poly * q1, gcd_poly * q2]           # Inputs
        g = self.ring(gcd_poly / gcd_poly.lc())      # Normalised GCD (monic)

        return F, g
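
The coprimality step is easier to see on plain integers. The sketch below is an integer analogue of the construction above, not part of `transformer_algebra`; the function name `make_gcd_instance` and the sampling range are hypothetical. It uses only the Python standard library.

```python
import math
import random


def make_gcd_instance(seed: int, low: int = 2, high: int = 50):
    """Integer analogue of GCDProblemGenerator.__call__ (illustrative only)."""
    rng = random.Random(seed)  # local RNG, so the seed fully determines the sample
    g, q1, q2 = (rng.randint(low, high) for _ in range(3))

    # Move any factor shared by q1 and q2 into g, leaving q1 and q2 coprime.
    d = math.gcd(q1, q2)
    g, q1, q2 = g * d, q1 // d, q2 // d

    inputs = [g * q1, g * q2]
    return inputs, g


inputs, target = make_gcd_instance(seed=123)
# gcd(g*q1, g*q2) = g * gcd(q1, q2) = g, because q1 and q2 are now coprime.
assert math.gcd(*inputs) == target
```

Without the division by `d`, the true gcd of the inputs would be `g * d` while the returned target would still be `g`, so the label would be wrong whenever the two cofactors happen to share a factor.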

**Key idea:** the generator is just a *callable* that returns `(inputs, target)`.  
If it follows that contract, `DatasetGenerator` can parallel‑generate samples automatically.
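
To illustrate why a seed-driven callable is enough for parallel generation, here is a minimal sketch in plain Python. Both `IntegerGCDProblem` and `generate` are hypothetical stand-ins; how `DatasetGenerator` actually derives per-sample seeds and schedules workers is an assumption here, not something this notebook shows.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


class IntegerGCDProblem:
    """Hypothetical stand-in generator: callable(seed) -> (inputs, target)."""

    def __init__(self, max_value: int = 100):
        self.max_value = max_value

    def __call__(self, seed: int):
        rng = random.Random(seed)  # no shared global state between calls
        a = rng.randint(1, self.max_value)
        b = rng.randint(1, self.max_value)
        return [a, b], math.gcd(a, b)


def generate(problem_generator, num_samples: int, root_seed: int, workers: int = 4):
    """Hypothetical driver: one seed per sample, mapped over a worker pool."""
    # Assumption: per-sample seeds are simply root_seed + index.
    seeds = [root_seed + i for i in range(num_samples)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(problem_generator, seeds))  # results keep seed order


samples = generate(IntegerGCDProblem(), num_samples=5, root_seed=2025)
for inputs, target in samples:
    print(inputs, "->", target)
```

Because each call depends only on its seed, the same `root_seed` yields the same samples regardless of the number of workers.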

## 5. Create Data & Inspect a Sample

In [23]:
problem_generator = GCDProblemGenerator(sampler)
dataset_generator = DatasetGenerator(ring=ring, n_jobs=1, verbose=False, root_seed=2025)

# Single sample
F, g = problem_generator(seed=123)
print("Inputs F:", F)
print("Target g:", g)

Inputs F: [-2*x0*x1^2 + 2*x0*x1, x0*x1^2 - x1^3 - x0*x1 + x1^2]
Target g: x1^2 - x1
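
The sample printed above can be checked by hand: the first input factors as `-2*x0 * (x1^2 - x1)` and the second as `(x0 - x1) * (x1^2 - x1)`. The cofactors `-2*x0` and `x0 - x1` share no non-constant factor, so `x1^2 - x1` is the monic gcd. The sketch below confirms the two factorizations over GF(7) in plain Python, without Sage; the dictionary representation and the helper names `poly_mul` and `mod_p` are illustrative choices only.

```python
P = 7  # field characteristic, matching GF(7) in the notebook


def poly_mul(a, b):
    """Multiply polynomials in x0, x1 over GF(P); keys are exponent pairs (e0, e1)."""
    out = {}
    for (a0, a1), ca in a.items():
        for (b0, b1), cb in b.items():
            key = (a0 + b0, a1 + b1)
            out[key] = (out.get(key, 0) + ca * cb) % P
    return {k: c for k, c in out.items() if c}  # drop zero coefficients


def mod_p(poly):
    """Reduce coefficients into 0..P-1 so that, e.g., -2 and 5 compare equal."""
    return {k: c % P for k, c in poly.items() if c % P}


g = {(0, 2): 1, (0, 1): -1}                           # x1^2 - x1
F0 = {(1, 2): -2, (1, 1): 2}                          # -2*x0*x1^2 + 2*x0*x1
F1 = {(1, 2): 1, (0, 3): -1, (1, 1): -1, (0, 2): 1}   # x0*x1^2 - x1^3 - x0*x1 + x1^2

q0 = {(1, 0): -2}               # cofactor -2*x0
q1 = {(1, 0): 1, (0, 1): -1}    # cofactor x0 - x1

assert poly_mul(g, q0) == mod_p(F0)
assert poly_mul(g, q1) == mod_p(F1)
```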


## 6. Generate a Tiny Dataset

In [24]:
samples, stats = dataset_generator.run(
    num_samples=20,
    train=True,
    problem_generator=problem_generator
)
stats

{'total_time': 0.009609460830688477,
 'samples_per_second': 2081.2822230492493,
 'num_samples': 20,
 'generation_time': {'mean': 0.00045791864395141604,
  'std': 0.00012987409280385738,
  'min': 0.00032973289489746094,
  'max': 0.0008177757263183594},
 'input_polynomials_overall': {'num_polynomials': {'mean': 2.0,
   'std': 0.0,
   'min': 2.0,
   'max': 2.0},
  'total_degree': {'mean': 7.2,
   'std': 2.4617067250182343,
   'min': 4.0,
   'max': 12.0},
  'total_terms': {'mean': 10.0,
   'std': 7.063993204979744,
   'min': 2.0,
   'max': 27.0},
  'max_degree': {'mean': 3.95,
   'std': 1.2835497652993437,
   'min': 2.0,
   'max': 6.0},
  'min_degree': {'mean': 3.25,
   'std': 1.2599603168354152,
   'min': 2.0,
   'max': 6.0},
  'max_terms': {'mean': 6.1,
   'std': 3.9736632972611057,
   'min': 1.0,
   'max': 14.0},
  'min_terms': {'mean': 3.9,
   'std': 3.3600595232822887,
   'min': 1.0,
   'max': 13.0},
  'max_coeff': {'mean': 5.55,
   'std': 0.5894913061275798,
   'min': 4.0,
   'max': 

In [25]:
# Show first three examples
for i, (F_i, g_i) in enumerate(samples[:3]):
    print(f"--- Sample {i} ---")
    print("F:", F_i)
    print("g:", g_i)

--- Sample 0 ---
F: [-x0*x1^2 + 2*x0*x1, -2*x0*x1^2]
g: x0*x1
--- Sample 1 ---
F: [-3*x0^3 + 3*x0^2 + 3*x0*x1 - 3*x1, 3*x0*x1 - 2*x0 - 3*x1 + 2]
g: x0 - 1
--- Sample 2 ---
F: [-3*x0*x1^3 + x0^2*x1 - 3*x0*x1^2 - x1^3 - x0*x1 - x1^2 - 2*x0 - 2*x1 - 3, -x1^5 - 2*x0*x1^3 - x1^4 - 2*x1^3 - 3*x1^2]
g: x1^3 + 2*x0*x1 + x1^2 + 2*x1 + 3


Change the ring, adjust the sampler hyper‑parameters, or swap in a different generator class,
and you get a new task‑specific dataset; **no other code changes are needed**.