## Part 1: Hugging Face Datasets

### 1.1 Loading Pre-existing Datasets

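The `datasets` library loads a dataset from the Hugging Face Hub with a single call and returns it as an Arrow-backed `Dataset` object that can be sliced and inspected for training and evaluation.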

In [1]:
from datasets import load_dataset

# Load the first 100 training examples of GLUE SST-2 (sentence-level sentiment).
# This is only a small stand-in to demonstrate the API; it contains no graph problems.
dataset = load_dataset("glue", "sst2", split="train[:100]")

print(f"Dataset type: {type(dataset)}")
print(f"Number of examples: {len(dataset)}")
print(f"Features: {dataset.features}")

# View first example
print("\nFirst example:")
print(dataset[0])

Dataset type: <class 'datasets.arrow_dataset.Dataset'>
Number of examples: 100
Features: {'sentence': Value('string'), 'label': ClassLabel(names=['negative', 'positive']), 'idx': Value('int32')}

First example:
{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}

### 1.2 Custom Datasets

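Graph problems are not part of a general benchmark such as GLUE, so we load our own dataset. The local helper `load_nlgraph` returns a `DatasetDict` with `train` and `test` splits; each record has a `query`, an `answer`, and a `task` field.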

In [7]:
from nlgraph_loader import load_nlgraph

dataset = load_nlgraph()
print(dataset)
print("\n--- First Example Record (train[0]) ---")
example = dataset['train'][0]
for key, value in example.items():
    # Truncate long values for readability
    value_str = str(value)
    if len(value_str) > 200:
        value_str = value_str[:200] + "..."
    print(f"  {key}: {value_str}")


DatasetDict({
    train: Dataset({
        features: ['query', 'answer', 'task'],
        num_rows: 4821
    })
    test: Dataset({
        features: ['query', 'answer', 'task'],
        num_rows: 961
    })
})

--- First Example Record (train[0]) ---
  query: Determine if there is a path between two nodes in the graph. Note that (i,j) means that node i and node j are connected with an undirected edge.
Graph: (0,12) (0,13) (0,2) (0,14) (0,23) (0,8) (0,1) (0...
  answer: The answer is yes.
  task: connectivity
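
The `query` field encodes the graph as a list of `(i,j)` pairs after the word `Graph:`. A minimal sketch of how such a record could be turned into an adjacency structure and checked for connectivity is shown here. The helper names `parse_edges` and `is_connected` are hypothetical and not part of `nlgraph_loader`, the sample query is made up for illustration, and the sketch assumes that every `(digits,digits)` pair after `Graph:` is an edge. The source and target nodes are passed in by hand because the question part of the real query is truncated in the output above.

```python
import re
from collections import defaultdict, deque

# Matches pairs such as "(0,12)"; the literal "(i,j)" in the preamble is not matched
EDGE_PATTERN = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_edges(query):
    """Extract (u, v) integer pairs from the text that follows 'Graph:'."""
    _, _, graph_part = query.partition("Graph:")
    return [(int(u), int(v)) for u, v in EDGE_PATTERN.findall(graph_part)]


def is_connected(edges, source, target):
    """Breadth-first search over an undirected graph given as an edge list."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Made-up query in the same style as the record above (assumption, not real data)
sample_query = (
    "Determine if there is a path between two nodes in the graph. "
    "Note that (i,j) means that node i and node j are connected with an undirected edge.\n"
    "Graph: (0,1) (1,2) (3,4)"
)

edges = parse_edges(sample_query)
print(edges)                      # [(0, 1), (1, 2), (3, 4)]
print(is_connected(edges, 0, 2))  # True
print(is_connected(edges, 0, 4))  # False
```

A symbolic checker like this gives a ground truth that an agent's answer can later be compared against.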


### 1.3 Splitting Datasets

The dataset already ships with `train` and `test` splits, so we only need to select them. The test split is held out for evaluating agent performance.

In [8]:
train_set = dataset['train']
test_set = dataset['test']

print(f"\nTrain set: {len(train_set)} examples")
print(f"Test set: {len(test_set)} examples")


Train set: 4821 examples
Test set: 961 examples
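
To evaluate an agent on the test split, its free-text output has to be compared with the reference `answer`. For yes/no tasks such as connectivity, one rough way to do this is sketched below. The function names `extract_yes_no` and `yes_no_accuracy` are hypothetical, the prediction strings are invented, and the heuristic simply takes the first standalone "yes" or "no" in the text, so it is an assumption that may not suit other task types in the dataset.

```python
import re


def extract_yes_no(text):
    """Return True for 'yes', False for 'no', None if neither word appears."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def yes_no_accuracy(predictions, references):
    """Fraction of examples whose predicted verdict matches the reference."""
    if not references:
        return 0.0
    correct = 0
    for prediction, reference in zip(predictions, references):
        verdict = extract_yes_no(prediction)
        # An unparseable prediction counts as wrong
        if verdict is not None and verdict == extract_yes_no(reference):
            correct += 1
    return correct / len(references)


# Reference strings follow the format shown in the dataset; predictions are invented
references = ["The answer is yes.", "The answer is no.", "The answer is yes."]
predictions = ["Yes, a path exists.", "The answer is yes.", "I am not sure."]

accuracy = yes_no_accuracy(predictions, references)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.33
```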


## Part 2: Hugging Face Transformers

### 2.1 Loading a Pre-trained Model

Load a small causal language model (about 170M parameters) together with its tokenizer to show how the library works.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "crumb/nano-mistral"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f"Model loaded: {model.__class__.__name__}")
print(f"Number of parameters: {model.num_parameters():,}")
print(f"Tokenizer vocabulary size: {len(tokenizer)}")

Model loaded: MistralForCausalLM
Number of parameters: 170,082,048
Tokenizer vocabulary size: 32000


### 2.2 Tokenization - Converting Text to Model Input

The tokenizer converts a text prompt into the sequence of integer token IDs that the model takes as input.

In [7]:
# Example prompt for Proposer agent
prompt = """You are navigating a graph. Current position: node 2.
Visited nodes: [1, 2].
Available neighbors: [3, 4].
Goal: reach node 5.
Next move:"""

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt")

print("Input text:")
print(prompt)
print(f"\nTokenized (first 20 tokens): {inputs['input_ids'][0][:20].tolist()}")
print(f"Total tokens: {len(inputs['input_ids'][0])}")


Input text:
You are navigating a graph. Current position: node 2.
Visited nodes: [1, 2].
Available neighbors: [3, 4].
Goal: reach node 5.
Next move:

Tokenized (first 20 tokens): [1, 995, 460, 27555, 1077, 264, 5246, 28723, 10929, 2840, 28747, 3179, 28705, 28750, 28723, 13, 5198, 1345, 9249, 28747]
Total tokens: 49
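
The prompt above is a fixed template filled in with the agent's current state. A small helper can build it from variables instead of hard-coding the text, which keeps prompts consistent across steps. This is only a sketch: `build_proposer_prompt` and its parameter names are hypothetical and not taken from the notebook.

```python
def build_proposer_prompt(current, visited, neighbors, goal):
    """Fill the Proposer prompt template with the current navigation state."""
    return (
        f"You are navigating a graph. Current position: node {current}.\n"
        f"Visited nodes: {list(visited)}.\n"
        f"Available neighbors: {list(neighbors)}.\n"
        f"Goal: reach node {goal}.\n"
        "Next move:"
    )


# Reproduces the prompt used in this section
rebuilt_prompt = build_proposer_prompt(current=2, visited=[1, 2], neighbors=[3, 4], goal=5)
print(rebuilt_prompt)
```

The resulting string can be passed to `tokenizer(...)` exactly like the hand-written prompt.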


### 2.3 Generating Agent Responses

Use the model to generate text. This is how agents produce their suggestions and validations.

In [None]:
import torch

# Generate response from Proposer agent
prompt = "In a graph with nodes 1,2,3,4,5 and edges (1,2),(2,3),(3,4),(4,5), the shortest path from 1 to 5 is:"

inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 50 new tokens; **inputs forwards the attention mask along with the input IDs
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt tokens)
prompt_length = inputs['input_ids'].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print("Prompt:")
print(prompt)
print("\nGenerated response:")
print(generated_text)
print("\nThis only demonstrates the API; a larger model is needed for a meaningful response.")

Prompt:
In a graph with nodes 1,2,3,4,5 and edges (1,2),(2,3),(3,4),(4,5), the shortest path from 1 to 5 is:

Generated response:


  (1,2) 
    (3,4)    (3,4)    (4,4)    (4,5)    (4,5)    (4,5)    (4,

This only demonstrates the API; a larger model is needed for a meaningful response.

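
Because the small model's output is not a valid path, it helps to have a reference answer to compare against. A minimal sketch of computing the shortest path for the graph in the prompt with breadth-first search follows. The function name `shortest_path` is hypothetical, and the sketch assumes the edges are undirected, which the prompt does not state explicitly.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    adjacency = {}
    for u, v in edges:
        # Assumption: edges are undirected, so add both directions
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Graph from the prompt above
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
reference_path = shortest_path(edges, 1, 5)
print(reference_path)  # [1, 2, 3, 4, 5]
```

A larger model's generated answer could then be parsed and compared with `reference_path`.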