---

#### $Load$ $Libraries$

---

In [None]:
import json
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModel
from huggingface_hub import notebook_login
import textwrap
import re
from generate import generate
# !pip install -U transformers accelerate bitsandbytes datasets

---

#### $Load$ $Model$

---

##### $Model$ $Access$

To access the model:
1. Visit the model's GitHub repository: https://github.com/ML-GSAI/LLaDA?tab=readme-ov-file
2. Clone it.
3. Create this notebook inside the cloned folder, so that `from generate import generate` resolves, and run the model from there.
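
If the notebook cannot live inside the cloned folder, one alternative is to put the clone on Python's import path. The sketch below is only an illustration under that assumption: `add_repo_to_path` is a hypothetical helper, not part of the LLaDA repository, and the folder you pass to it is whatever location you cloned into.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Put a local clone on sys.path so its top-level modules can be imported."""
    repo_path = Path(repo_dir).expanduser().resolve()
    if not repo_path.is_dir():
        raise FileNotFoundError(f"Repository folder not found: {repo_path}")
    # Insert at the front so the clone's modules are found first,
    # and avoid adding the same folder twice.
    if str(repo_path) not in sys.path:
        sys.path.insert(0, str(repo_path))
    return repo_path
```

After calling the helper with the clone's location, `from generate import generate` should resolve in the same way as when the notebook sits inside the folder.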

In [None]:
# Initialize the model name 
model_name = 'GSAI-ML/LLaDA-8B-Base'

##### $Tokenizer$ $Set-up$


This cell loads the pre-trained `tokenizer` that ships with the language model.
`trust_remote_code=True` allows the custom code stored in the model repository to run.

In [None]:
# Load the tokenizer. The library will handle the chat template automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

print(f"Tokenizer for {model_name} loaded successfully.")


Tokenizer for GSAI-ML/LLaDA-8B-Base loaded successfully.


In [None]:
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16  # or torch.bfloat16 if supported
)


A new version of the following files was downloaded from https://huggingface.co/GSAI-ML/LLaDA-8B-Base:
- configuration_llada.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/GSAI-ML/LLaDA-8B-Base:
- modeling_llada.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Some parameters are on the meta device because they were offloaded to the cpu.


In [None]:
print(chat_template := tokenizer.chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
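
The template printed above is dense, so here is a plain-Python sketch of what it appears to do: wrap each message in header tokens, trim its content, prepend the BOS token to the first message only, and finish with an open assistant header. `render_chat` is a hypothetical helper written for illustration, and the `bos_token` argument stands in for whatever `tokenizer.bos_token` holds.

```python
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def render_chat(messages, bos_token=""):
    """Mirror the Jinja chat template with ordinary string formatting."""
    parts = []
    for index, message in enumerate(messages):
        # Each message becomes: header, blank line, trimmed content, end-of-turn.
        segment = (
            f"{START_HEADER}{message['role']}{END_HEADER}\n\n"
            f"{message['content'].strip()}{EOT}"
        )
        if index == 0:
            segment = bos_token + segment  # BOS is added once, before the first message
        parts.append(segment)
    # The open assistant header tells the model where its reply starts.
    parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


demo_prompt = render_chat(
    [{"role": "user", "content": "  Hello  "}],
    bos_token="<BOS>",  # placeholder value, not the model's real BOS token
)
```

In practice `tokenizer.apply_chat_template` renders the template directly; the sketch is only meant to make the structure readable.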


---

####  $Zero$ $Shots$ $vs.$ $Few$ $Shots$

---

In [None]:
# Create the folder paths for results and the folders if they do not exist
base_results_folder = '/home/lathanasopoulou/capstone/search-in-ai/prompt-decomposition/HotpotQA/llm_predictions/'
zero_shot_folder = os.path.join(base_results_folder, 'zero_shot') # Zero-shot predictions
few_shots_folder = os.path.join(base_results_folder, 'few_shot')  # Few-shot predictions
os.makedirs(zero_shot_folder, exist_ok=True)
os.makedirs(few_shots_folder, exist_ok=True)

##### $Zero$ $Shots$ $Function$

For each question, the function constructs a prompt instructing the model to break the complex question into smaller, step-by-step sub-questions. It then uses the tokenizer to convert the prompt into input IDs, passes them to LLaDA's `generate` function, decodes only the newly generated tokens, and stores the original question together with its zero-shot decomposition. The function returns a list of dictionaries containing these results.

The parameters below are the ones that usually appear in a standard Hugging Face `generate` call. Of these, only `input_ids` is passed to the LLaDA call in this notebook; the others are described for reference.

* **`input_ids`**: This is the prompt, but translated into a numerical format (a list of numbers) that the model can understand. Each number corresponds to a word or part of a word.

* **`attention_mask`**: This is a list of 1s and 0s that has the same length as the input_ids. It tells the model which tokens are real words to pay attention to (a 1) and which ones are just padding that should be ignored (a 0). 

* **`pad_token_id`**: This specifies the ID of the special token that is used to "fill up" shorter sentences when we process multiple sentences at once (batching). It's directly related to the attention_mask.

* **`max_new_tokens`**: This sets the maximum length of the generated response. max_new_tokens=512 tells the model, "Do not write more than 512 new tokens after the prompt." This prevents it from writing forever.

* **`early_stopping`**: This applies to beam search. When set to True, generation stops as soon as enough complete candidate sequences have been found; when set to False, the search may continue while better candidates still seem likely.


* **`do_sample`**:
	* **do_sample=False**: This forces the model to be deterministic. Every time it generates a new word, it chooses the single word that it calculates as being the most statistically likely to come next. When we compare the models, we want to compare their "best, most probable" attempt at the problem.
	* **do_sample=True**: This tells the model to be creative and less predictable. Instead of always picking the #1 most likely word, it might pick the #2 or #3 most likely word, based on a random sample (controlled by parameters like temperature).
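
The difference between the two `do_sample` settings can be sketched on a toy example. The code below is an illustration only: the scores are made-up values for three candidate tokens, and `pick_greedy` and `pick_sampled` are hypothetical helpers, not functions from any library used in this notebook.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """do_sample=False: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, temperature=1.0, rng=random):
    """do_sample=True: draw an index at random, weighted by probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for three candidate tokens.
toy_logits = [2.0, 1.5, 0.1]

# Greedy selection gives the same index on every call.
greedy_choices = {pick_greedy(toy_logits) for _ in range(20)}

# Sampling visits several indices over repeated draws (seeded for repeatability).
seeded_rng = random.Random(0)
sampled_choices = {pick_sampled(toy_logits, rng=seeded_rng) for _ in range(200)}
```

Greedy selection always lands on the same token, which is what makes repeated runs comparable.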

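Since the task is analytical (decomposing a problem) and not creative, the deterministic approach is better. We choose do_sample=False for all the baseline experiments to ensure our results are stable and repeatable. The LLaDA `generate` call below has no `do_sample` argument; it is called with `temperature=0.0` instead, which we take to be the deterministic setting.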
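
Returning to `attention_mask` and `pad_token_id` from the list above, the following sketch shows how the two relate when several prompts of different lengths are batched. It assumes right-padding and uses made-up token IDs; `pad_batch` is a hypothetical helper, and real tokenizers do this work internally when asked to pad.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the remainder is filled with the pad ID.
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two made-up prompts of different lengths, padded with ID 0.
batch_ids, batch_mask = pad_batch([[5, 8, 13], [7]], pad_token_id=0)
```

With a single prompt per call, as in this notebook, no padding is needed and the mask would be all ones.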

*For the prompt, we use the prompt structure the model was trained on.*

In [None]:
def run_zero_shot_experiment(model, tokenizer, questions):
    results = []

    for item in questions:
        print(f"Processing (Zero-Shot) ID: {item['id']}")

        # Build the instruction text that is sent as the user message
        system_prompt = (
            "You are a helpful assistant. Your task is to break down the following question "
            "into a few smaller questions that contribute to solving the overall problem.\n\n"
            f"Complex Question: {item['question']}\n\n"
            "Step-by-Step plan:"
        )

        # Manually format chat messages using the template provided
        bos_token = tokenizer.bos_token or "<|begin_of_text|>"
        eot_token = "<|eot_id|>"
        start_header = "<|start_header_id|>"
        end_header = "<|end_header_id|>"

        prompt_text = (
            f"{bos_token}"
            f"{start_header}user{end_header}\n\n{system_prompt.strip()}{eot_token}"
            f"{start_header}assistant{end_header}\n\n"
        )

        # Tokenize and convert to tensor
        input_ids = tokenizer(prompt_text, return_tensors="pt")['input_ids'].to(model.device)

        # Run generation (assuming a LLaDA-specific generate function)
        output_ids = generate(
            model=model,
            prompt=input_ids,
            steps=128,
            gen_length=128,
            block_length=32,
            temperature=0.0,
            cfg_scale=0.0,
            remasking='low_confidence'
        )

        # Decode only the newly generated tokens
        generated_text = tokenizer.batch_decode(
            output_ids[:, input_ids.shape[1]:],
            skip_special_tokens=True
        )[0].strip()

        # Store the result
        results.append({
            "id": item['id'],
            "question": item['question'],
            "decomposition": item.get('decomposition', ''),
            "zero_shot_decomposition": generated_text
        })

    return results


In [None]:
# def run_zero_shot_experiment(model, tokenizer, questions):
#     results = []

#     for item in questions:
#         print(f"Processing (Zero-Shot) ID: {item['id']}")

#         # Construct the raw string prompt
#         prompt_text = (
#             "You are a helpful assistant. Your task is to break down the following question "
#             "into a few smaller questions that contribute to solving the overall problem.\n\n"
#             f"Complex Question: {item['question']}\n\n"
#             "Step-by-Step plan:"
#         )

#         # Apply chat template if needed (LLaDA-Instruct models require it)
#         messages = [{"role": "user", "content": prompt_text}]
#         formatted_prompt = tokenizer.apply_chat_template(
#             messages,
#             add_generation_prompt=True,
#             tokenize=False
#         )

#         # Tokenize into input IDs
#         input_ids = tokenizer(formatted_prompt)['input_ids']
#         input_ids = torch.tensor(input_ids).to(model.device).unsqueeze(0)  # Shape: (1, L)

#         # Run LLaDA's generation function
#         output_ids = generate(
#             model=model,
#             prompt=input_ids,
#             steps=128,
#             gen_length=128,
#             block_length=32,
#             temperature=0.0,
#             cfg_scale=0.0,
#             remasking='low_confidence'
#         )

#         # Decode the generated output after the prompt
#         generated_text = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0].strip()

#         # Store the result
#         results.append({
#             "id": item['id'],
#             "question": item['question'],
#             "decomposition": item.get('decomposition', ''),  # fallback in case not provided
#             "zero_shot_decomposition": generated_text
#         })

#     return results


##### $Few$ $Shots$ $Function$

This function runs the few-shot experiment.

It takes a list of questions, a list of high-quality example question/decomposition pairs (`shot_examples`), and the number of examples to use (`num_shots`).

For each question, the function builds a prompt that starts with a system instruction, continues with the specified number of examples showing the model how to decompose complex questions into simpler sub-questions, and ends with the question itself. It then sends the prompt to the language model, generates a decomposition, and stores the result. The examples help the model pick up the desired output format and style.

In [None]:
# from generate import generate  # LLaDA's custom generate function

# def run_few_shot_experiment(model, tokenizer, data, shot_examples, num_shots=3):
#     if num_shots > len(shot_examples):
#         raise ValueError(f"You asked for {num_shots} shots, but only {len(shot_examples)} are available.")

#     results = []

#     for item in data:
#         print(f"Processing ({num_shots}-Shot) ID: {item['id']}")

#         # Build the chat-style prompt using few-shot examples
#         messages = [
#             {
#                 "role": "system",
#                 "content": "You are an expert assistant. Your task is to break down the following question into a few smaller questions that contribute to solving the overall problem. Give only the decomposition steps, not the final answer."
#             }
#         ]

#         # Add few-shot examples
#         for example in shot_examples[:num_shots]:
#             decomposition_text = example['decomposition']
#             if isinstance(decomposition_text, list):
#                 decomposition_text = "\n".join(decomposition_text)
#             messages.append({"role": "user", "content": example['question']})
#             messages.append({"role": "assistant", "content": decomposition_text})

#         # Add the actual question to be decomposed
#         messages.append({"role": "user", "content": item['question']})

#         # Convert chat messages into a prompt string (not tokenized yet)
#         prompt_text = tokenizer.apply_chat_template(
#             messages,
#             add_generation_prompt=True,
#             tokenize=False
#         )

#         # Tokenize to tensor (batch of size 1)
#         input_ids = tokenizer(prompt_text)['input_ids']
#         input_ids = torch.tensor(input_ids).to(model.device).unsqueeze(0)

#         # Generate output using LLaDA
#         output_ids = generate(
#             model=model,
#             prompt=input_ids,
#             steps=128,
#             gen_length=128,
#             block_length=32,
#             temperature=0.0,
#             cfg_scale=0.0,
#             remasking='low_confidence'
#         )

#         # Decode only the generated portion
#         generated_text = tokenizer.batch_decode(
#             output_ids[:, input_ids.shape[1]:],
#             skip_special_tokens=True
#         )[0].strip()

#         results.append({
#             "id": item['id'],
#             "question": item['question'],
#             "decomposition": item.get('decomposition', ''),
#             f"{num_shots}_shot_decomposition": generated_text
#         })

#     return results


In [None]:
def run_few_shot_experiment(model, tokenizer, data, shot_examples, num_shots=3):
    if num_shots > len(shot_examples):
        raise ValueError(f"You asked for {num_shots} shots, but only {len(shot_examples)} are available.")

    results = []

    # Special tokens
    bos_token = tokenizer.bos_token or "<|begin_of_text|>"
    eot_token = "<|eot_id|>"
    start_header = "<|start_header_id|>"
    end_header = "<|end_header_id|>"

    for item in data:
        print(f"Processing ({num_shots}-Shot) ID: {item['id']}")

        # Begin with system message
        messages = [
            {
                "role": "system",
                "content": "You are an expert assistant. Your task is to break down the following question into a few smaller questions that contribute to solving the overall problem. Give only the decomposition steps, not the final answer."
            }
        ]

        # Add few-shot examples
        for example in shot_examples[:num_shots]:
            decomposition_text = example['decomposition']
            if isinstance(decomposition_text, list):
                decomposition_text = "\n".join(decomposition_text)

            messages.append({"role": "user", "content": example['question']})
            messages.append({"role": "assistant", "content": decomposition_text})

        # Add the new user question to decompose
        messages.append({"role": "user", "content": item['question']})

        # Manually build the prompt
        prompt_parts = []
        for idx, message in enumerate(messages):
            role = message["role"]
            content = message["content"].strip()
            segment = f"{start_header}{role}{end_header}\n\n{content}{eot_token}"
            if idx == 0:
                segment = bos_token + segment  # Only add BOS once
            prompt_parts.append(segment)

        # Add assistant header for generation
        prompt_parts.append(f"{start_header}assistant{end_header}\n\n")
        prompt_text = "".join(prompt_parts)

        # Tokenize
        input_ids = tokenizer(prompt_text, return_tensors="pt")['input_ids'].to(model.device)

        # Generate output using LLaDA
        output_ids = generate(
            model=model,
            prompt=input_ids,
            steps=128,
            gen_length=128,
            block_length=32,
            temperature=0.0,
            cfg_scale=0.0,
            remasking='low_confidence'
        )

        # Decode generated part
        generated_text = tokenizer.batch_decode(
            output_ids[:, input_ids.shape[1]:],
            skip_special_tokens=True
        )[0].strip()

        # Store the result
        results.append({
            "id": item['id'],
            "question": item['question'],
            "decomposition": item.get('decomposition', ''),
            f"{num_shots}_shot_decomposition": generated_text
        })

    return results


##### $Save$ $the$ $results$

*Once the model has processed the questions, its predictions for the zero-shot and few-shot experiments are saved in the corresponding folders.*

In [None]:
def save_results_to_json(results, folder, filename):

    # Make sure the folder exists
    os.makedirs(folder, exist_ok=True)

    # Construct the full path for the file
    full_path = os.path.join(folder, filename)

    # Save the file
    with open(full_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=4)

    print("Results have been saved!")
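
A quick way to confirm that a results file was written correctly is to read it back and compare it with the in-memory list. The sketch below illustrates that round trip with a temporary folder and a made-up record; `save_and_reload` is a hypothetical helper that repeats the same `json.dump` settings as the function above.

```python
import json
import os
import tempfile


def save_and_reload(results, folder, filename):
    """Write results as UTF-8 JSON, then read them back from disk."""
    os.makedirs(folder, exist_ok=True)
    full_path = os.path.join(folder, filename)
    with open(full_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(results, f, ensure_ascii=False, indent=4)
    with open(full_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up record in the same shape as the experiment results.
sample_results = [
    {
        "id": "demo-1",
        "question": "Where is the café?",
        "decomposition": ["Which café?", "Where is it located?"],
        "zero_shot_decomposition": "1. Which café? 2. Where is it located?",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(sample_results, tmp_dir, "demo.json")
```

If `reloaded` equals `sample_results`, the file on disk holds exactly what was generated.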


---

#### $HotpotQA$ $Dataset$ $Predictions$

---

In [None]:
hotpot_dataset_path = "/home/lathanasopoulou/capstone/search-in-ai/prompt-decomposition/HotpotQA/HotpotQA_dataset/hotpot_dataset.json" # Define the path to the hotpot dataset

# Load the dataset
with open(hotpot_dataset_path, "r") as file:
    hotpot_data = json.load(file)
    print(f"Loaded {len(hotpot_data)} questions.")


Loaded 5 questions.


In [None]:
# Display the first question
print("First question details:")
print("ID:", hotpot_data[0]["id"])
print("Question:", hotpot_data[0]["question"])
print("Answer:", hotpot_data[0]["answer"])
print("Supporting sentences:", hotpot_data[0]["supporting_sentences"])
print("-"*150)
print("Decomposition:", hotpot_data[0]["decomposition"])


First question details:
ID: 5a8b57f25542995d1e6f1371
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Answer: yes
Supporting sentences: ['Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.', 'Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.', 'Aggregating the above we conclude that the answer is: yes']
------------------------------------------------------------------------------------------------------------------------------------------------------
Decomposition: ['What nationality Scott Derrickson had?', 'What nationality Ed Wood had?', 'Was the nationality the same?']
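
Before running the experiments it can help to check that every record has the fields used in this notebook. The sketch below is one possible check, not part of the original pipeline: `find_invalid_records` is a hypothetical helper, the field list is inferred from the record printed above, and the second sample record is deliberately broken to show a failure.

```python
REQUIRED_FIELDS = ("id", "question", "answer", "supporting_sentences", "decomposition")


def find_invalid_records(records):
    """Return (index, reason) pairs for records that do not match the expected shape."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((index, f"missing fields: {missing}"))
        elif not isinstance(record["decomposition"], (list, str)):
            # A list of sub-questions or a single string are both accepted.
            problems.append((index, "decomposition is neither a list nor a string"))
    return problems


sample_records = [
    {
        # Values copied from the first record shown above (one supporting sentence kept).
        "id": "5a8b57f25542995d1e6f1371",
        "question": "Were Scott Derrickson and Ed Wood of the same nationality?",
        "answer": "yes",
        "supporting_sentences": [
            "Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer."
        ],
        "decomposition": [
            "What nationality Scott Derrickson had?",
            "What nationality Ed Wood had?",
            "Was the nationality the same?",
        ],
    },
    {"id": "broken-example", "question": "Incomplete record?"},
]

problems = find_invalid_records(sample_records)
```

An empty `problems` list for the real dataset would mean every record can be passed to the experiment functions safely.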


In [None]:
# Define the file name used for this model's results
model_file_name = "LLaDA-8B-Base_Hotpot_results.json"

In [None]:
few_shot_hotpot_examples_path = "/home/lathanasopoulou/capstone/search-in-ai/prompt-decomposition/HotpotQA/HotpotQA_dataset/hotpot_few_shot.json"
with open(few_shot_hotpot_examples_path, "r") as file:
    shot_examples = json.load(file)
    print(f"Loaded {len(shot_examples)} few-shot examples.")

Loaded 5 questions.


##### $Experiments$ $Run$


In [None]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")
zero_shot_results = run_zero_shot_experiment(model, tokenizer, hotpot_data)
save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Processing (Zero-Shot) ID: 5a8b57f25542995d1e6f1371
Processing (Zero-Shot) ID: 5a8c7595554299585d9e36b6
Processing (Zero-Shot) ID: 5a85ea095542994775f606a8
Processing (Zero-Shot) ID: 5adbf0a255429947ff17385a
Processing (Zero-Shot) ID: 5a8e3ea95542995a26add48d


In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_few_shot_experiment(model, tokenizer, hotpot_data, shot_examples, num_shots=3)
save_results_to_json(three_shot_results, few_shots_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: 5a8b57f25542995d1e6f1371
Processing (3-Shot) ID: 5a8c7595554299585d9e36b6
Processing (3-Shot) ID: 5a85ea095542994775f606a8
Processing (3-Shot) ID: 5adbf0a255429947ff17385a
Processing (3-Shot) ID: 5a8e3ea95542995a26add48d
Results have been saved!
