---

#### $Load$ $Libraries$

---

In [1]:
import json
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from huggingface_hub import notebook_login
import textwrap
import re
# !pip install -U transformers accelerate bitsandbytes datasets

---

#### $Load$ $Model$

---

In [2]:
# Initialize the model name 
model_name = "microsoft/Phi-4-mini-instruct"

##### $Bnb$ $Configuration$


The `bnb_config` cell defines a `BitsAndBytesConfig` that loads the language model with 4-bit quantized weights.

* *Compresses the model:* The weights are stored in the 4-bit NF4 format (`bnb_4bit_quant_type="nf4"`) instead of 16-bit.
* *Saves memory:* The weights take roughly a quarter of their 16-bit size, so the model can run on computers with less memory (RAM and VRAM).
* *Maintains performance:* The actual math is done in `bfloat16` (`bnb_4bit_compute_dtype`), and double quantization (`bnb_4bit_use_double_quant`) also compresses the quantization constants, which keeps the loss in quality small.

Essentially, it is a set of instructions to make a large model fit on the computer with minimal loss in quality. The configuration is commented out in this notebook; the model is loaded further down in 16-bit precision instead.
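
The "about 4 times smaller" figure follows from simple arithmetic on the weight storage. The sketch below is a rough estimate only: the parameter count is an assumed value for illustration, `weight_memory_gb` is a hypothetical helper, and the result ignores quantization constants, activations, and the KV cache.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed parameter count, for illustration only.
NUM_PARAMS = 3_800_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {nf4_gb:.2f} GB")
print(f"Reduction factor: {fp16_gb / nf4_gb:.1f}x")
```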

In [3]:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

##### $Tokenizer$ $Set-up$


This code sets up the `tokenizer` for the language model.
It loads the pre-trained tokenizer and makes sure a padding token is defined, falling back to the end-of-sequence token if none is set.

In [4]:
# Load the tokenizer. The library will handle the chat template automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    print("pad_token not set. Setting it to eos_token for this model.")
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer for {model_name} loaded successfully.")


Tokenizer for microsoft/Phi-4-mini-instruct loaded successfully.


In [5]:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config,
#     device_map="auto"
# )

In [6]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16  # or torch.bfloat16 if supported
)


---

####  $Zero$ $Shots$ $vs.$ $Few$ $Shots$

---

##### $Zero$ $Shots$ $Function$

For each question, the function builds a chat prompt: a system message instructing the model to decompose a complex question into a numbered list of simple, sequential sub-questions, followed by the question itself as the user message. It converts these messages into input IDs with the tokenizer's chat template, adds an attention mask, and generates a response from the model. Finally, it decodes only the newly generated tokens and stores them alongside the original question and its reference decomposition. The function returns a list of dictionaries containing these results.

* **`input_ids`**: This is the prompt, but translated into a numerical format (a list of numbers) that the model can understand. Each number corresponds to a word or part of a word.

* **`attention_mask`**: This is a list of 1s and 0s that has the same length as the input_ids. It tells the model which tokens are real words to pay attention to (a 1) and which ones are just padding that should be ignored (a 0). 

* **`pad_token_id`**: This is the ID of the special token used to pad shorter sequences when several are processed at once (batching). Padded positions are the ones marked 0 in the `attention_mask`. The code below passes `tokenizer.eos_token_id`.

* **`max_new_tokens`**: This sets the maximum length of the generated response. `max_new_tokens=512` tells the model, "Do not write more than 512 new tokens after the prompt." This prevents it from writing forever.

* **`early_stopping`**: This option only applies to beam search, where it controls when the search stops once enough finished candidates exist. The functions below use greedy decoding and do not pass it; generation ends when the model emits the end-of-sequence token (`eos_token_id`) or reaches `max_new_tokens`.

* **`repetition_penalty`**: Values above 1 lower the scores of tokens that have already appeared, which discourages the model from looping. The code below uses 1.2.


* **`do_sample`**:
	* **do_sample=False**: This forces the model to be deterministic. Every time it generates a new word, it chooses the single word that it calculates as being the most statistically likely to come next. When we compare the models, we want to compare their "best, most probable" attempt at the problem.
	* **do_sample=True**: This tells the model to be creative and less predictable. Instead of always picking the #1 most likely word, it might pick the #2 or #3 most likely word, based on a random sample (controlled by parameters like temperature).

Since the task is analytical (decomposing a problem) rather than creative, the deterministic approach is the better fit. We use `do_sample=False` for all the baseline experiments so that our results are stable and repeatable.
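
The difference between the two settings can be sketched on a toy next-token distribution. This is a minimal pure-Python illustration, not what `generate` runs internally: the four logits are made up, and `softmax`, `pick_greedy`, and `pick_sampled` are hypothetical helper names.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """do_sample=False: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, temperature, rng):
    """do_sample=True: draw a token at random according to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.2, -1.0]  # made-up scores for four candidate tokens
rng = random.Random(0)

greedy_picks = {pick_greedy(toy_logits) for _ in range(100)}
sampled_picks = {pick_sampled(toy_logits, 1.0, rng) for _ in range(100)}

print("Greedy choices: ", greedy_picks)           # always the same token
print("Sampled choices:", sorted(sampled_picks))  # several different tokens
```

Greedy selection returns the same token on every call, which is why repeated runs of the baseline produce identical outputs.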

*The prompt follows the chat format the model was trained on, applied through the tokenizer's chat template.*
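
To make the idea of a chat format concrete, the sketch below flattens a list of messages into a single prompt string. The role markers are placeholders chosen for illustration and `render_chat` is a hypothetical helper; the real template ships with the tokenizer and may differ, which is why the notebook relies on `apply_chat_template` instead of building the string by hand.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string using placeholder role markers."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|assistant|>")
    return "".join(parts)


toy_messages = [
    {"role": "system", "content": "Decompose the question into sub-questions."},
    {"role": "user", "content": "Were A and B born in the same country?"},
]

prompt = render_chat(toy_messages)
print(prompt)
```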

In [7]:
def run_zero_shot_experiment(model, tokenizer, questions):
    results = []
    for item in questions:
        print(f"Processing (Zero-Shot) ID: {item['id']}")

        system_instructions = (
            "You are an expert system that decomposes complex user questions "
            "into a numbered list of simple, sequential sub-questions."
        )
        user_question = item['question']

        messages = [
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": user_question}
        ]

        input_ids = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(model.device)

        # Add attention mask
        attention_mask = torch.ones_like(input_ids)

        # Generate the response
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.2,
            eos_token_id=tokenizer.eos_token_id
        )

        response = outputs[0][input_ids.shape[-1]:]
        decomposition_result = tokenizer.decode(response, skip_special_tokens=True)

        # Store the result
        results.append({
            "id": item['id'],
            "question": item['question'],
            "decomposition": item['decomposition'],
            "zero_shot_decomposition": decomposition_result
        })

    return results
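
The function above processes one prompt at a time, so an attention mask of all ones is correct: there is no padding to hide. If several prompts were batched together, shorter ones would need padding and zeros in the mask. A minimal sketch of that case, with made-up token IDs, an assumed `pad_id` of 0, and a hypothetical `pad_batch` helper (padding on the left is the usual choice when generating with decoder-only models):

```python
def pad_batch(sequences, pad_id):
    """Left-pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))       # padding first, real tokens last
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 = ignore, 1 = attend
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(batch_ids)   # [[11, 12, 13], [0, 0, 21]]
print(batch_mask)  # [[1, 1, 1], [0, 0, 1]]
```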


##### $Few$ $Shots$ $Function$

This function performs few-shot prompting.

It takes a list of questions, a list of high-quality example question/decomposition pairs (`shot_examples`), and the number of examples (`num_shots`) to use.

For each question, the function builds a prompt that presents the first `num_shots` examples as user/assistant turns, showing the model how to decompose complex questions into simpler sub-questions, and ends with the new question. It then sends the prompt to the language model, generates a decomposition, and stores the result. The examples help the model pick up the desired output format and style.

*The prompt follows the chat format the model was trained on, applied through the tokenizer's chat template.*
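
The shape of a few-shot conversation is easiest to see on toy data. The sketch below isolates the message-building step in a hypothetical helper, `build_few_shot_messages`; the example questions and decompositions are invented for illustration and are not taken from the datasets.

```python
def build_few_shot_messages(system_prompt, shot_examples, question, num_shots):
    """Return chat messages: system prompt, num_shots worked examples, then the new question."""
    if num_shots > len(shot_examples):
        raise ValueError(f"Asked for {num_shots} shots, only {len(shot_examples)} available.")
    messages = [{"role": "system", "content": system_prompt}]
    for example in shot_examples[:num_shots]:
        steps = example["decomposition"]
        if isinstance(steps, list):
            steps = "\n".join(steps)  # one sub-question per line
        messages.append({"role": "user", "content": example["question"]})
        messages.append({"role": "assistant", "content": steps})
    messages.append({"role": "user", "content": question})
    return messages


# Invented examples, for illustration only.
toy_shots = [
    {"question": "Which is taller, A or B?",
     "decomposition": ["How tall is A?", "How tall is B?", "Which height is greater?"]},
    {"question": "Who directed the film that won prize X?",
     "decomposition": "1. Which film won prize X?\n2. Who directed that film?"},
]

messages = build_few_shot_messages(
    "Decompose the question into sub-questions.",
    toy_shots,
    "Were P and Q born in the same year?",
    num_shots=2,
)
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```

Ending on a user turn matters: with `add_generation_prompt=True`, the model is asked to write the next assistant turn, which is the decomposition of the new question.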

In [8]:
def run_few_shot_experiment(model, tokenizer, data, shot_examples, num_shots=3):
    
    if num_shots > len(shot_examples):
        raise ValueError(f"You asked for {num_shots} shots, but only {len(shot_examples)} are available.")

    results = []
    for item in data:
        print(f"Processing ({num_shots}-Shot) ID: {item['id']}")
        
        messages = [
            {
                "role": "system",
                "content": "You are an expert assistant. Your task is to break down the following question into a few smaller questions that contribute to solving the overall problem. Give only the decomposition steps, not the final answer."
            }
        ]

        # Add the few-shot examples as a series of user/assistant turns
        for example in shot_examples[:num_shots]:
            decomposition_content = example['decomposition']
            if isinstance(decomposition_content, list):
                decomposition_content = "\n".join(decomposition_content)

            messages.append({"role": "user", "content": example['question']})
            messages.append({"role": "assistant", "content": decomposition_content})

        # Finally, add the actual user question we want the model to answer now
        messages.append({"role": "user", "content": item['question']})

        # Generate the Response
        input_ids = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(model.device)

        # Add attention mask
        attention_mask = torch.ones_like(input_ids)

        # Explicitly set pad_token_id
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.2,  
            eos_token_id=tokenizer.eos_token_id
        )

        generated_tokens = outputs[0][input_ids.shape[-1]:]
        decomposition_result = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

        # Store the result 
        results.append({
            "id": item['id'],
            "question": item['question'],
            "decomposition": item['decomposition'],
            f"{num_shots}_shot_decomposition": decomposition_result
        })

    return results
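
Both experiment functions slice the output at `input_ids.shape[-1]` before decoding. For a decoder-only model, the generated sequence starts with the prompt tokens, so the slice keeps only the continuation. A minimal sketch with made-up token IDs and a hypothetical `strip_prompt` helper:

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [101, 7, 8, 9]               # made-up prompt tokens
full_output = prompt_ids + [42, 43, 2]    # prompt followed by the continuation

new_tokens = strip_prompt(full_output, len(prompt_ids))
print(new_tokens)  # [42, 43, 2]
```

Without this step, the decoded text would repeat the system instructions and the question in front of every decomposition.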


##### $Save$ $the$ $results$

*Once the model has processed the questions, its predictions for the zero-shot and few-shot experiments are saved in the corresponding folders.*

In [9]:
def save_results_to_json(results, folder, filename):

    # Make sure the folder exists
    os.makedirs(folder, exist_ok=True)

    # Construct the full path for the file
    full_path = os.path.join(folder, filename)

    # Save the file
    with open(full_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=4)

    print("Results have been saved!")
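
A saved file can be read back with the mirror-image call. The sketch below checks the round trip in a temporary directory; `load_results_from_json` is a hypothetical helper that is not part of the notebook, and the sample record is invented.

```python
import json
import os
import tempfile


def load_results_from_json(folder, filename):
    """Read a results file written with json.dump back into Python objects."""
    full_path = os.path.join(folder, filename)
    with open(full_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Invented record, for illustration only.
sample_results = [
    {"id": "toy-1", "question": "Où est Paris ?", "zero_shot_decomposition": "1. Where is Paris?"}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_results.json")
    # Same settings as the save function: UTF-8, readable non-ASCII, indented.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_results, f, ensure_ascii=False, indent=4)
    reloaded = load_results_from_json(tmp_dir, "toy_results.json")

print(reloaded == sample_results)  # True
```

Opening the file with `encoding="utf-8"` on both sides keeps non-ASCII characters intact, which matters because `ensure_ascii=False` writes them unescaped.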


In [10]:
# Define the name of the model for our file names
model_file_name = "Phi-4-Mini-Instruct_Hotpot_results.json"

---

#### $HotpotQA$ $Dataset$ $Predictions$

---

In [None]:
# Create the folder paths for results and the folders if they do not exist
base_results_folder = '../HotpotQA/llm_predictions/'
zero_shot_folder = os.path.join(base_results_folder, 'zero_shot') # Zero-shot predictions
few_shots_folder = os.path.join(base_results_folder, 'few_shot')  # Few-shot predictions
os.makedirs(zero_shot_folder, exist_ok=True)
os.makedirs(few_shots_folder, exist_ok=True)

In [12]:
hotpot_dataset_path = "../HotpotQA/HotpotQA_dataset/hotpot_dataset.json" # Define the path to the hotpot dataset

# Load the dataset
with open(hotpot_dataset_path, "r") as file:
    hotpot_data = json.load(file)
    print(f"Loaded {len(hotpot_data)} questions.")


Loaded 5 questions.


In [13]:
# Display the first question
print("First question details:")
print("ID:", hotpot_data[0]["id"])
print("Question:", hotpot_data[0]["question"])
print("Answer:", hotpot_data[0]["answer"])
print("Supporting sentences:", hotpot_data[0]["supporting_sentences"])
print("-"*150)
print("Decomposition:", hotpot_data[0]["decomposition"])


First question details:
ID: 5a8b57f25542995d1e6f1371
Question: Were Scott Derrickson and Ed Wood of the same nationality?
Answer: yes
Supporting sentences: ['Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer.', 'Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.', 'Aggregating the above we conclude that the answer is: yes']
------------------------------------------------------------------------------------------------------------------------------------------------------
Decomposition: ['What nationality Scott Derrickson had?', 'What nationality Ed Wood had?', 'Was the nationality the same?']


In [15]:
few_shot_hotpot_examples_path = "../HotpotQA/HotpotQA_dataset/hotpot_few_shot.json"
with open(few_shot_hotpot_examples_path, "r") as file:
    shot_examples = json.load(file)
    # Report the number of few-shot examples, not the size of hotpot_data
    print(f"Loaded {len(shot_examples)} questions.")

Loaded 5 questions.


##### $Experiments$ $Run$


In [16]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")
zero_shot_results = run_zero_shot_experiment(model, tokenizer, hotpot_data)
save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Processing (Zero-Shot) ID: 5a8b57f25542995d1e6f1371
Processing (Zero-Shot) ID: 5a8c7595554299585d9e36b6
Processing (Zero-Shot) ID: 5a85ea095542994775f606a8
Processing (Zero-Shot) ID: 5adbf0a255429947ff17385a
Processing (Zero-Shot) ID: 5a8e3ea95542995a26add48d
Results have been saved!


In [17]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_few_shot_experiment(model, tokenizer, hotpot_data, shot_examples, num_shots=3)
save_results_to_json(three_shot_results, few_shots_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: 5a8b57f25542995d1e6f1371
Processing (3-Shot) ID: 5a8c7595554299585d9e36b6
Processing (3-Shot) ID: 5a85ea095542994775f606a8
Processing (3-Shot) ID: 5adbf0a255429947ff17385a
Processing (3-Shot) ID: 5a8e3ea95542995a26add48d
Results have been saved!


---

#### $QDMR$ $Dataset$ $Predictions$

---

In [11]:
# Create the folder paths for results and the folders if they do not exist
base_results_folder = '../QDMR/llm_predictions/'
zero_shot_folder = os.path.join(base_results_folder, 'zero_shot') # Zero-shot predictions
few_shots_folder = os.path.join(base_results_folder, 'few_shot')  # Few-shot predictions
os.makedirs(zero_shot_folder, exist_ok=True)
os.makedirs(few_shots_folder, exist_ok=True)

In [12]:
qdmr_dataset_path = "../QDMR/QDMR_dataset/qdmr_dataset.json" # Define the path to the qdmr dataset

# Load the dataset
with open(qdmr_dataset_path, "r") as file:
    qdmr_data = json.load(file)
    print(f"Loaded {len(qdmr_data)} questions.")


Loaded 5 questions.


In [13]:
# Display the first question
print("First question details:")
print("ID:", qdmr_data[0]["id"])
print("Question:", qdmr_data[0]["question"])
print("Lexicon tokens:", qdmr_data[0]["lexicon tokens"])
print("-"*150)
print("Decomposition:", qdmr_data[0]["decomposition"])


First question details:
ID: CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b
Question: What Tibetan speaking countries have a population of less than 993885000?
Lexicon tokens: ['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', 'countries', 'equal', 'hundred', 'those', 'sorted by', 'elevation', 'which ', '@@6@@', '993885000', 'was ', 'did ', 'population', 'height', 'one', 'that ', 'on', 'did', 'who', 'true', '@@2@@', '100', 'false', 'and', 'was', 'speaking', 'populations', 'who ', 'a ', 'the', 'number of ', '@@16@@', 'if ', 'where', '@@18@@', 'how', 'larger than', 'is ', 'from ', 'a', 'less', 'for each', 'are ', '@@19@@', '@@4@@', '@@11@@', 'distinct ', 'to', 'not ', 'objects', 'with ', ', ', 'lowest', 'in', 'has ', 'zero', 'in ', 'there ', 'lower than', 'highest', '@@9@@', 'than', 'size', 'multiplication', 'with', 'besides ', ',', '@@1@@', 'what', 'have', 'those ', 'of', '@@3@@', 'that', 'there', '@@10@@

In [14]:
few_shot_qdmr_examples_path = "../QDMR/QDMR_dataset/qdmr_few_shot.json"
with open(few_shot_qdmr_examples_path, "r") as file:
    shot_examples = json.load(file)
    # Report the number of few-shot examples, not the size of qdmr_data
    print(f"Loaded {len(shot_examples)} questions.")

Loaded 5 questions.


##### $Experiments$ $Run$


In [15]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")
zero_shot_results = run_zero_shot_experiment(model, tokenizer, qdmr_data)
save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Processing (Zero-Shot) ID: CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b


Processing (Zero-Shot) ID: CWQ_dev_WebQTest-1011_edc922a0faa1e47614eb7e6effe2d1a1
Processing (Zero-Shot) ID: CWQ_dev_WebQTest-1036_0b5333d98ef87008aa02d1fbc1554b05
Processing (Zero-Shot) ID: CWQ_dev_WebQTest-1036_4e73509d14bda62590480b655eee8751
Processing (Zero-Shot) ID: CWQ_dev_WebQTest-1081_1ecabf57357cb4abd089a4af52154854
Results have been saved!


In [16]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_few_shot_experiment(model, tokenizer, qdmr_data, shot_examples, num_shots=3)
save_results_to_json(three_shot_results, few_shots_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b
Processing (3-Shot) ID: CWQ_dev_WebQTest-1011_edc922a0faa1e47614eb7e6effe2d1a1
Processing (3-Shot) ID: CWQ_dev_WebQTest-1036_0b5333d98ef87008aa02d1fbc1554b05
Processing (3-Shot) ID: CWQ_dev_WebQTest-1036_4e73509d14bda62590480b655eee8751
Processing (3-Shot) ID: CWQ_dev_WebQTest-1081_1ecabf57357cb4abd089a4af52154854
Results have been saved!
