---

#### $Load$ $Libraries$

---

In [None]:
import json
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, AutoModel
from huggingface_hub import notebook_login
import textwrap
from generate import generate
import re
from shots_type import run_experiment
from manage_folders import save_results_to_json, dataset_folders
from retriever import DynamicRetriever
from accelerate import infer_auto_device_map, dispatch_model
# !pip install -U transformers accelerate bitsandbytes datasets

---

#### $Load$ $Model$

---

##### $Model$ $Access$

To access the model, we need to:
1. Visit the model page: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
2. Log in to our HF account
3. Read and accept Meta's use policies
4. Enter our contact information (email and username)
5. Submit the access request
6. Wait for the validation email
7. Run `from huggingface_hub import notebook_login` and `notebook_login()`
8. Follow the link shown. It leads to HF, where we create a new token with `read` rights
9. Copy the token, paste it into the notebook, and press `Enter`

 *It might take some minutes*...


In [2]:
notebook_login()


In [3]:
# Initialize the model name and configuration
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

##### $Bnb$ $Configuration$


The `bnb_config` object tells `bitsandbytes` how to quantize the language model while it is being loaded.

* *Compresses the model:* The weights are loaded in a 4-bit NF4 format (`load_in_4bit=True`, `bnb_4bit_quant_type="nf4"`) instead of the usual 16-bit format.
* *Saves memory:* 4-bit weights take roughly a quarter of the space of 16-bit weights, so the model can run on computers with less memory (RAM and VRAM). Double quantization (`bnb_4bit_use_double_quant=True`) also compresses the quantization constants, which saves a little more.
* *Maintains performance:* The weights are dequantized on the fly and the actual math runs in 16-bit `bfloat16` (`bnb_4bit_compute_dtype`), which keeps the loss in accuracy small.

Essentially, it is a set of instructions for fitting a large model on limited hardware with minimal loss in quality.
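As a rough check on the "about 4 times smaller" figure, the weight memory can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: the round figure of 8 billion parameters is an assumption, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and the KV cache. The function name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: a round 8 billion parameters for an "8B" model.
NUM_PARAMS = 8_000_000_000

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB")
print(f"ratio: {fp16_gib / nf4_gib:.1f}x")
```

In practice the real footprint is somewhat larger than the 4-bit estimate, so the saving is usually a bit less than a full factor of four.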

In [4]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

##### $Tokenizer$ $Set-up$


This code sets up the `tokenizer` for the language model.
It loads the pre-trained tokenizer and, if no padding token is defined, reuses the end-of-sequence token for padding (padding brings all inputs in a batch to the same length). If the tokenizer ships without a chat template, the code sets one manually in the Llama 3.x format so that conversations are formatted correctly.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Set chat_template if missing.
# The fallback also handles the "system" role, because model_config
# declares uses_system_prompt=True.
if not hasattr(tokenizer, "chat_template") or tokenizer.chat_template is None:
    print("Chat template not found. Setting LLaMA 3.1 chat template manually.")
    tokenizer.chat_template = (
        "<|begin_of_text|>{% for message in messages %}"
        "{% if message['role'] == 'system' %}"
        "<|start_header_id|>system<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>"
        "{% elif message['role'] == 'user' %}"
        "<|start_header_id|>user<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>"
        "{% elif message['role'] == 'assistant' %}"
        "<|start_header_id|>assistant<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>"
        "{% endif %}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        "{% endif %}"
    )
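
To make the fallback template easier to read, the sketch below builds the same prompt layout in plain Python. `format_llama3_chat` is a hypothetical helper written only for this illustration and is not part of `transformers`; for user and assistant turns it is meant to produce the same string as the template above.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Llama 3.x chat layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Who directed Jaws?"}]
prompt = format_llama3_chat(messages)
print(prompt)
```

With a real tokenizer, the equivalent call is `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which renders whatever template the tokenizer carries.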


In [6]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)


##### $Parameters$ $for$ $the$ $model$ $configuration:$ 

* **`max_new_tokens`**: This sets the maximum length of the generated response. `max_new_tokens=256` tells the model, "Do not write more than 256 new tokens after the prompt." This prevents it from writing forever.

* **`do_sample`**:
	* **do_sample=False**: This makes generation deterministic (greedy decoding). At every step, the model picks the single token it scores as most likely to come next. When we compare models, we want to compare their "best, most probable" attempt at the problem.
	* **do_sample=True**: This makes the output more varied and less predictable. Instead of always picking the most likely token, the model draws the next token at random from its probability distribution (shaped by parameters such as `temperature` and `top_p`), so it may pick the second or third most likely token.

	Since the task is analytical (decomposing a problem) and not creative, the deterministic approach is better. We choose `do_sample=False` for all the baseline experiments so that our results are stable and repeatable.

* **`repetition_penalty`**:
   * Discourages the model from repeating the same words or phrases during text generation.
   * Default value: 1.0 (no penalty)
   * Typical range: 1.1 to 2.0
     * 1.2 = light penalty
     * 1.5+ = strong penalty
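
The effect of `do_sample` and `repetition_penalty` can be illustrated on a toy next-token distribution. The sketch below is a simplified, pure-Python stand-in for what happens inside `generate`: the three-token vocabulary and its logits are invented, and the helper names are hypothetical. The penalty rule used here (divide positive logits by the penalty, multiply negative ones) follows the usual formulation of this parameter.

```python
import math
import random


def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    top = max(logits.values())
    exps = {tok: math.exp(val - top) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_repetition_penalty(logits, generated, penalty):
    """Lower the score of every token that was already generated."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out


def pick_next(logits, do_sample, rng=None):
    """Greedy argmax when do_sample is False, otherwise a random draw."""
    if not do_sample:
        return max(logits, key=logits.get)
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return (rng or random).choices(tokens, weights=weights, k=1)[0]


# Invented logits for a three-token vocabulary.
logits = {"the": 2.0, "river": 1.8, "flows": 0.5}

greedy = pick_next(logits, do_sample=False)
penalized = apply_repetition_penalty(logits, generated=["the"], penalty=1.2)
greedy_after_penalty = pick_next(penalized, do_sample=False)
sampled = pick_next(logits, do_sample=True, rng=random.Random(42))

print(greedy, greedy_after_penalty, sampled)
```

Greedy decoding always returns the highest-scoring token, so repeated runs give the same text. Once "the" has been generated, a penalty of 1.2 lowers its logit from 2.0 to about 1.67, which is enough for "river" to overtake it.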
	

In [7]:
# Define the model specific configuration
model_config = {
    "model_id": model_name,
    "uses_system_prompt": True,  # Llama supports a system prompt
    # "system_prompt": "You are an expert system that decomposes complex user questions into a numbered list of simple, sequential sub-questions. Each sub-question should be a direct, answerable query that contributes to a logical plan. Your output should ONLY be the sub-questions. Do not provide any explanations or other answers.",
    "generation_params": {
        "max_new_tokens": 256,
        "do_sample": False,
        "repetition_penalty": 1.2
    }
}



In [8]:
# Define the name of the model for our file names
model_file_name = "Llama-3.1-8B_results.json"


---

#### $QDMR$ $Dataset$ $Predictions$

---

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/static'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


##### $Zero$ $Shot$ $Experiment$

In [None]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)
save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting Zero-Shot Experiment
Processing (0-Shot) ID: CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b




Processing (0-Shot) ID: CWQ_dev_WebQTest-1011_edc922a0faa1e47614eb7e6effe2d1a1




Processing (0-Shot) ID: CWQ_dev_WebQTest-1036_0b5333d98ef87008aa02d1fbc1554b05




Processing (0-Shot) ID: CWQ_dev_WebQTest-1036_4e73509d14bda62590480b655eee8751




Processing (0-Shot) ID: CWQ_dev_WebQTest-1081_1ecabf57357cb4abd089a4af52154854
Results saved successfully to '../QDMR/llm_predictions/static/zero_shot/Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$ $-$ $Static$ $Shots$

In [None]:
# Run the few-shot experiment with 3 static shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="static",  # "static" | "random" | "dynamic"
    retriever=None,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting 3-Shot Experiment




Results saved successfully to '../QDMR/llm_predictions/static/few_shot/3shot_Llama-3.2-3B_results.json'
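
The three `few_shot_type` options differ only in how the shots are chosen for each question. The sketch below is one possible reading of them, not the actual logic inside `run_experiment`: `select_shots` is a hypothetical function, each example is assumed to be a dict with a `"question"` key, and the "dynamic" branch uses a simple bag-of-words cosine similarity as a stand-in for the embedding-based `DynamicRetriever`.

```python
import math
import random
from collections import Counter


def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_shots(question, shot_examples, num_shots, few_shot_type="static", seed=42):
    """Pick the few-shot examples to place in front of `question`."""
    if num_shots == 0:
        return []
    if few_shot_type == "static":
        # Same examples, same order, for every question.
        return shot_examples[:num_shots]
    if few_shot_type == "random":
        # Seeded sample so that a run can be reproduced.
        return random.Random(seed).sample(shot_examples, num_shots)
    if few_shot_type == "dynamic":
        # Most similar examples first (toy similarity, see lead-in).
        query = Counter(question.lower().split())
        ranked = sorted(
            shot_examples,
            key=lambda ex: _cosine(query, Counter(ex["question"].lower().split())),
            reverse=True,
        )
        return ranked[:num_shots]
    raise ValueError(f"Unknown few_shot_type: {few_shot_type!r}")


# Invented examples for illustration only.
examples = [
    {"question": "What is the capital of France?"},
    {"question": "Who wrote Hamlet?"},
    {"question": "Which river flows through Paris?"},
]
print(select_shots("Which river flows through London?", examples, 1, "dynamic"))
```

Static and seeded-random selection give the same prompt prefix on every run, while dynamic selection changes the shots per question.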


##### $Few$ $Shot$ $Experiment$ $-$ $Random$ $Shots$

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/random'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


In [None]:
# Run the few-shot experiment with 3 randomly sampled shots
print("\nStarting Random-3-Shot Experiment")
three_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="random",  # "static" | "random" | "dynamic"
    retriever=None,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting Random-3-Shot Experiment




Results saved successfully to '../QDMR/llm_predictions/random/few_shot/3shot_Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$ $-$ $Dynamic$ $Shots$

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/dynamic'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting Dynamic-3-Shot Experiment")

retriever_instance = DynamicRetriever(shot_examples)

# Run a 3-shot experiment
three_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="dynamic",  # "static" | "random" | "dynamic"
    retriever=retriever_instance,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting Dynamic-3-Shot Experiment
Initializing DynamicRetriever...



The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Retriever initialized and example embeddings are pre-computed.




Results saved successfully to '../QDMR/llm_predictions/dynamic/few_shot/3shot_Llama-3.2-3B_results.json'



---

#### $HotpotQA$ $Dataset$ $Predictions$

---

We use a helper function `dataset_folders(base_results_folder, dataset_path, few_shot_examples_path)` that streamlines the setup process for running experiments on a new dataset. It handles three key tasks:

1.  **Creates Output Directories:** It takes a base folder path and automatically creates the `zero_shot` and `few_shot` subdirectories where the model's predictions will be saved.
2.  **Loads Evaluation Data:** It reads the main dataset file (e.g., `hotpot_dataset.json`) containing the questions to be evaluated.
3.  **Loads Few-Shot Examples:** It reads the corresponding file containing the high-quality examples for few-shot prompting.

The function returns a single, convenient dictionary containing all these assets (the loaded data and the output folder paths), which can then be easily used by the main experiment functions.
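
The three tasks above can be sketched as follows. This is a minimal illustration of what such a helper might look like, not the actual code in `manage_folders`: the name `dataset_folders_sketch` is made up, and the assumption is that both input files are JSON lists. The dictionary keys and the two printed messages match what the notebook uses and shows.

```python
import json
import os


def dataset_folders_sketch(base_results_folder, dataset_path, few_shot_examples_path):
    """Create the output folders and load the evaluation and few-shot data."""
    # 1. Create the output directories (no error if they already exist).
    zero_shot_folder = os.path.join(base_results_folder, "zero_shot")
    few_shot_folder = os.path.join(base_results_folder, "few_shot")
    os.makedirs(zero_shot_folder, exist_ok=True)
    os.makedirs(few_shot_folder, exist_ok=True)

    # 2. Load the evaluation questions.
    with open(dataset_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # 3. Load the few-shot examples.
    with open(few_shot_examples_path, "r", encoding="utf-8") as f:
        shot_examples = json.load(f)

    print(f"Loaded {len(data)} evaluation questions.")
    print(f"Loaded {len(shot_examples)} shot examples.")

    return {
        "data": data,
        "shot_examples": shot_examples,
        "zero_shot_folder": zero_shot_folder,
        "few_shot_folder": few_shot_folder,
    }
```

Returning one dictionary keeps the calling cell short: the same four unpacking lines work for every dataset, and only the three paths change.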

In [None]:
# Define the paths for the dataset we want to use
hotpot_base_folder = '../HotpotQA/llm_predictions/static/'
hotpot_dataset_file = '../HotpotQA/HotpotQA_examples/hotpot_evaluation.json'
hotpot_fewshot_file = '../HotpotQA/HotpotQA_examples/hotpot_few_shot.json'

# Call the function to get everything we need in one go
data_assets = dataset_folders(hotpot_base_folder, hotpot_dataset_file, hotpot_fewshot_file)

# Unpack the dictionary into these variables for easy access
hotpot_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


##### $Zero$ $Shot$ $Experiment$

In [10]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=hotpot_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)

save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting Zero-Shot Experiment
Processing (0-Shot) ID: 5a8b57f25542995d1e6f1371




Processing (0-Shot) ID: 5a8c7595554299585d9e36b6




Processing (0-Shot) ID: 5a85ea095542994775f606a8




Processing (0-Shot) ID: 5adbf0a255429947ff17385a




Processing (0-Shot) ID: 5a8e3ea95542995a26add48d
Results saved successfully to '../HotpotQA/llm_predictions/zero_shot/Llama-3.1-8B_results.json'


##### $Few$ $Shot$ $Experiment$

In [10]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=hotpot_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: 5a8b57f25542995d1e6f1371
Processing (3-Shot) ID: 5a8c7595554299585d9e36b6
Processing (3-Shot) ID: 5a85ea095542994775f606a8
Processing (3-Shot) ID: 5adbf0a255429947ff17385a
Processing (3-Shot) ID: 5a8e3ea95542995a26add48d
Results saved successfully to '../HotpotQA/llm_predictions/few_shot/3shot_Llama-3.1-8B_results.json'


---

#### $StrategyQA$ $Dataset$ $Predictions$

---

In [None]:
# Create the folder paths for results and the folders if they do not exist
strategyqa_base_folder = '../StrategyQA/llm_predictions/static'
strategyqa_dataset_file = "../StrategyQA/StrategyQA_examples/strategyqa_evaluation.json" # Define the path to the strategyqa dataset
strategyqa_fewshot_file = "../StrategyQA/StrategyQA_examples/strategyqa_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(strategyqa_base_folder, strategyqa_dataset_file, strategyqa_fewshot_file)

# Unpack the dictionary into these variables for easy access
strategyqa_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


##### $Zero$ $Shot$ $Experiment$

In [15]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=strategyqa_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)

save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Processing (0-Shot) ID: 7a0e419ffb6009156828
Processing (0-Shot) ID: 427fe3968e32005479b9
Processing (0-Shot) ID: 06b9ed3f803e3d5796ed
Processing (0-Shot) ID: 5090f573b09ac3050824
Processing (0-Shot) ID: d912709b7341dd86ba39
Results saved successfully to '../StrategyQA/llm_predictions/zero_shot/Llama-3.1-8B_results.json'


##### $Few$ $Shot$ $Experiment$

In [16]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
three_shot_results = run_experiment(
    model=model,
    tokenizer=tokenizer,
    data=strategyqa_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: 7a0e419ffb6009156828
Processing (3-Shot) ID: 427fe3968e32005479b9
Processing (3-Shot) ID: 06b9ed3f803e3d5796ed
Processing (3-Shot) ID: 5090f573b09ac3050824
Processing (3-Shot) ID: d912709b7341dd86ba39
Results saved successfully to '../StrategyQA/llm_predictions/few_shot/3shot_Llama-3.1-8B_results.json'
