# Llama 3 8B Inference on the ARC Dataset

In this notebook, we will demonstrate how to use a fine-tuned version of Llama 3 8B to solve ARC tasks. You can also experiment with the original version or other models compatible with Hugging Face’s infrastructure. We utilize Hugging Face libraries due to their extensive documentation and ease of use. However, feel free to adapt this notebook using different packages to suit your needs.

Note that this is not intended to be a sample solution and will likely not solve any tasks in the hidden test set. It’s just meant as a starting point for you to quickly get started, submit something, and receive a score.

If you have any additional sections you’d like to enhance or specific details to include, let us know!

## 1. Add Datasets and Model

We will be using the following datasets:

1. ‘Abstraction and Reasoning Challenge’: This is the official dataset containing the ARC tasks to be solved.
2. ‘Llama-3-8b-chat-ARC’: This dataset contains adapters for our Llama 3 model fine-tuned on the ARC dataset. (Note: this dataset is not public.)
3. (Optional) ‘Llama-3-ARC-deps’: This dataset contains the wheel files for additional packages not available in the Kaggle Kernel. Note that this dataset is required if you plan to submit this notebook to the competition, as no internet access is allowed during the competition. You can find it [here](https://www.kaggle.com/datasets/hansuelijud/llama-3-arc-deps).

Additionally, we need to add the original Llama 3 8B model:

1. ‘Llama 3 8B-chat-hf’ (framework: transformers, variation: 8b-chat-hf, version: V1)

Please note that to access the Llama 3 model on Kaggle, you need to obtain access from Meta. Instructions on how to do this can be found [here](https://www.kaggle.com/models/metaresearch/llama-3).

## 2. Install and import Packages

As mentioned, we will be using Hugging Face libraries, and most of the necessary packages are already available in Kaggle kernels. However, a few packages are not included by default. If you are not submitting to the competition, you can download these packages directly.

For competition submissions, since internet access is restricted, we will use a Kaggle dataset containing the required wheel files. This allows us to install the packages without needing internet access during the submission process.

### 2.1 With internet access:

If we have internet access we can just directly install the packages:

In [None]:
# !pip install -q -U -i https://pypi.org/simple/ bitsandbytes
# !pip install -q -U trl
# !pip install -q -U peft

### 2.2 Without internet access (use for submission):

If you don't have internet access, you can either:
1. Add the dataset we prepared [llama-3-arc-deps](https://www.kaggle.com/datasets/hansuelijud/llama-3-arc-deps).
2. Create your own dataset. You can find the explanation [here](https://www.kaggle.com/code/hansuelijud/llama-3-8b-arc-additional-dependencies).

In [None]:
deps_path = '/kaggle/input/llama-3-arc-deps'
! pip install --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt
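
Since a failed offline install only surfaces later as an `ImportError`, it can help to check right away that the extra packages are importable. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of the original notebook, and the package list simply mirrors the three packages installed in Section 2.1.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages installed in Section 2.1 (assumed to be the only extra requirements)
required = ["bitsandbytes", "trl", "peft"]
missing = missing_packages(required)

if missing:
    print(f"Missing packages: {missing}")
else:
    print("All required packages are available.")
```

If anything is reported as missing, re-check that the dependency dataset is attached to the notebook before submitting.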

### 2.3 Import Packages

Now, let’s import the necessary packages:

In [None]:
# For dataset
import pandas as pd
import json
import os
import ast
import re
import numpy as np
from datasets import Dataset

# For LLM
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
    pipeline
)
from trl import setup_chat_format

import torch
from time import time

# Set seed
set_seed(42)

In [None]:
# implement another way to load the model

## 3. Load the data

Next, let’s load the ARC tasks. Note that we will split tasks containing more than one test input into separate tasks, as this makes the pipeline easier. In the end, we will combine them again to create a valid submission file.

In [None]:
# Function to split the tasks that have multiple test input/output pairs.
# This makes the handling easier, we will combine it again at the end for the submission
def split_dictionary(data):
    """
    Splits the tasks that have multiple test input/output pairs into separate entries.

    Args:
    data (dict): The original dictionary containing tasks with 'test' and 'train' fields.

    Returns:
    tuple: A tuple containing:
        - result (dict): The dictionary with tasks split into separate entries if they have multiple test pairs.
        - split_files (list): A list of keys for the tasks that were split.
    """
    result = {}
    split_files = []
    for key, value in data.items():
        test_list = value.get("test", [])
        train_list = value.get("train", [])
        if len(test_list) > 1:
            for idx, test_item in enumerate(test_list):
                new_key = f"{key}_{idx}"
                result[new_key] = {
                    "test": [test_item],
                    "train": train_list
                }
                split_files.append(new_key)
        else:
            result[key] = value
    return result, split_files
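
To illustrate what the splitting does, here is a small self-contained example. It repeats a condensed copy of `split_dictionary` so that it runs on its own; the task IDs and grids are made up for illustration.

```python
def split_dictionary(data):
    # Condensed copy of the function above, repeated so this example is runnable
    result, split_files = {}, []
    for key, value in data.items():
        test_list = value.get("test", [])
        if len(test_list) > 1:
            for idx, test_item in enumerate(test_list):
                new_key = f"{key}_{idx}"
                result[new_key] = {"test": [test_item], "train": value.get("train", [])}
                split_files.append(new_key)
        else:
            result[key] = value
    return result, split_files


# Hypothetical toy tasks: one with a single test input, one with two
toy = {
    "aaaa0001": {
        "train": [{"input": [[1]], "output": [[2]]}],
        "test": [{"input": [[3]]}],
    },
    "bbbb0002": {
        "train": [{"input": [[1]], "output": [[2]]}],
        "test": [{"input": [[4]]}, {"input": [[5]]}],
    },
}

result, split_files = split_dictionary(toy)
print(sorted(result))  # ['aaaa0001', 'bbbb0002_0', 'bbbb0002_1']
print(split_files)     # ['bbbb0002_0', 'bbbb0002_1']
```

Each split entry keeps the full list of training pairs, so every new task can be solved independently of its siblings.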

In [None]:
# Set test_run variable: False: create submission file for private test set, True: Evaluate on public tasks
test_run = False

# Prepare data for DataFrame

# Load JSON data from the files
if test_run:
    with open('/kaggle/input/arc-prize-2024/arc-agi_evaluation_challenges.json') as f:
        challenges = json.load(f)
        # Split tasks with multiple test inputs
        challenges, split_files = split_dictionary(challenges) 

    with open('/kaggle/input/arc-prize-2024/arc-agi_evaluation_solutions.json') as f:
        solutions = json.load(f)
else:
    with open('/kaggle/input/arc-prize-2024/arc-agi_test_challenges.json') as f:
        challenges = json.load(f)
    # Split tasks with multiple test inputs
    challenges, split_files = split_dictionary(challenges) 

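# Print how many tasks have been split and the names of the resulting entries
# (count unique base names, since a task may have more than two test inputs)
split_file_count = len({name.split('_')[0] for name in split_files})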

print(f"Number of files split: {split_file_count}")
print("File names:")
for name in split_files:
    print(name)

# Prepare data
data = []
        
for file_name, grids in challenges.items():
    train_grids = grids.get('train', [])
    test_inputs = grids.get('test', [])
    if test_run:
        # Handle files with multiple test inputs
        parts = file_name.split('_')
        if len(parts) > 1:
            test_nr = int(parts[1])
        else:
            test_nr = 0
        test_outputs = solutions.get(parts[0], [])
        # Transform test grids to lists of dicts with 'output' key
        test_outputs_transformed = [{'output': test_outputs[test_nr]}]
        # Combine test inputs and outputs in alternating manner
        combined_tests = [{'input': test_inputs[0]['input'], 'output': test_outputs_transformed[0]['output']}]
    data.append({
            'file_name': file_name,
            'train': train_grids,
            'test_input': test_inputs,
            'test_output': test_outputs_transformed if test_run else [[0, 0]],
            'test': combined_tests if test_run else test_inputs
    })

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

## 4. Load finetuned Llama-3 Model

Next, we will load our fine-tuned Llama 3 model. We are using a 4-bit quantized version to reduce memory requirements. Ensure that you have selected an appropriate accelerator (e.g., P100) for the session to enable smooth functioning.

In [None]:
# Define a template for formatting chat messages with the Llama 3 model
# This is model specific. Change it if you e.g. use Google's Gemma instead of Llama
LLAMA_3_CHAT_TEMPLATE = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"""

# Use float16 for computations; bfloat16 is not supported on T4/P100
compute_dtype = getattr(torch, "float16")

# Configure the BitsAndBytes settings for 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save additional memory
    bnb_4bit_quant_type="nf4",  # Specify the quantization type
    bnb_4bit_compute_dtype=compute_dtype,  # Set the computation data type
)

# Specify the model ID for loading the fine-tuned Llama 3 model
# You can also test other models by replacing this line.
# For the original non-finetuned model use
# model_id = "/kaggle/input/llama-3/transformers/8b-chat-hf/1"
model_id = "/kaggle/input/llama-3-8b-chat-hf-arc-finetune/"

# Record the start time to measure the loading duration
time_start = time()
print("Loading model")
# Load the pre-trained model with specified configurations
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True, # Allow the model to use custom code from the repository
    quantization_config=bnb_config, # Apply the 4-bit quantization configuration
    attn_implementation='sdpa', # Use scaled dot-product attention for better performance
    torch_dtype=compute_dtype, # Set the data type for the model
    use_cache=False, # Disable caching to save memory
    device_map='auto', # Automatically map the model to available devices (e.g., GPUs)
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE # Apply the chat message template

# Record the end time and print the duration for preparing the model and tokenizer
time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")
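
To give a feeling for why 4-bit quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the model weights alone. It assumes a round figure of 8 billion parameters and ignores quantization constants, activations, and any cache, so the real footprint is somewhat higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 8e9  # assumption: round figure for an "8B" model

print(f"float16: {weight_memory_gib(N_PARAMS, 16):.1f} GiB")  # float16: 14.9 GiB
print(f"4-bit:   {weight_memory_gib(N_PARAMS, 4):.1f} GiB")   # 4-bit:   3.7 GiB
```

Under these assumptions, float16 weights would nearly fill a 16 GB GPU on their own, whereas the 4-bit weights leave room for the prompt and the generated tokens.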

## 5. Create Prompts and filter the dataset

Next, we will create the prompts that will be used to evaluate the model on the ARC dataset.

### 5.1 Create Prompts


In [None]:
# The system_prompt defines the initial instructions for the model, setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''

# User message template is a template for creating user prompts. It includes placeholders for training data and test input data, guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data.:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''

def preprocess(task, test_run, train_mode=False):
    """
    Preprocess a single ARC task to create the prompt and solution for the model.

    This function formats the system and user messages using a predefined template and the task's training and test data.
    If in training mode, it also includes the assistant's message with the expected output.

    Parameters:
    task (dict): The ARC task data containing training and test examples.
    test_run (bool): If True, the task includes the expected test output, which is returned as the solution.
    train_mode (bool): If True, includes the assistant's message with the expected output for training purposes.

    Returns:
    dict: A dictionary containing the formatted text prompt, the file name, and (if test_run is True) the solution.
    """
    # System message
    system_message = {"role": "system", "content": system_prompt}

    # Extract training data and input grid from the task
    training_data = task['train']
    input_test_data = task['test'][0]['input']
    if test_run:
        output_test_data = task['test'][0]['output']
    else:
        output_test_data = [[0 ,0]]

    # Format the user message with training data and input test data
    user_message_content = user_message_template.format(training_data=training_data, input_test_data=input_test_data)
    user_message = {
        "role": "user",
        "content": user_message_content
    }

    # Include the assistant message with the expected output if in training mode
    if train_mode:
        assistant_message = {
            "role": "assistant",
            "content": str(output_test_data)
        }

        # Combine system, user, and assistant messages
        messages = [system_message, user_message, assistant_message]
    else:
        messages = [system_message, user_message]
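    # Convert messages using the chat template for use with the instruction finetuned version of Llama.
    # At inference time (no assistant message), append the assistant header so the model starts its answer directly.
    messages = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=not train_mode)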
    if test_run:
        return {"text": messages, "solution": output_test_data, "file_name": task['file_name']}
    else:
        return {"text": messages, "file_name": task['file_name']}

# Convert the loaded data to a Hugging Face Dataset object
dataset = Dataset.from_pandas(df)

# Apply the preprocess function to each task in the dataset
dataset = dataset.map(lambda x: preprocess(x, test_run), batched=False, remove_columns=dataset.column_names)
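
To make the effect of `LLAMA_3_CHAT_TEMPLATE` more concrete, here is a plain-Python sketch that mirrors what the Jinja template does to a list of messages. It is only an illustration: the function `render_llama3_chat` is hypothetical, and the notebook itself relies on `tokenizer.apply_chat_template`.

```python
BOS_TOKEN = "<|begin_of_text|>"  # Llama 3 beginning-of-sequence token


def render_llama3_chat(messages, add_generation_prompt=False):
    """Render chat messages in the Llama 3 format, mirroring the Jinja template."""
    parts = []
    for i, message in enumerate(messages):
        # Each message is wrapped in a role header and terminated by <|eot_id|>
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if i == 0:
            # Only the very first message is prefixed with the BOS token
            content = BOS_TOKEN + content
        parts.append(content)
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its answer
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a puzzle solving wizard."},
    {"role": "user", "content": "What is the output grid?"},
]
print(render_llama3_chat(example, add_generation_prompt=True))
```

Printing a rendered prompt like this is a quick way to check that the role headers and end-of-turn tokens appear where you expect them.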

### 5.2 Filter the Dataset

To ensure that the model’s context window is not exceeded, we keep only the tasks whose prompts stay below a specified maximum number of tokens. Note that, due to memory limitations, the model has been fine-tuned only on tasks with up to 2048 tokens, so longer prompts fall outside the range it saw during fine-tuning.

In [None]:
# Define the maximum number of tokens allowed
max_tokens = 8000  # Adjust this value as needed


# Function to calculate the number of tokens
def count_tokens(text):
    """
    Calculate the number of tokens in a given text using the tokenizer.

    This function uses the tokenizer to encode the input text and returns the
    number of tokens. It is useful for ensuring that the text length stays
    within the model's context window.

    Parameters:
    text (str): The input text to be tokenized.

    Returns:
    int: The number of tokens in the input text.
    """
    return len(tokenizer.encode(text))

# Filter the dataset to include only tasks with a number of tokens within the allowed limit
filtered_dataset = dataset.filter(lambda x: count_tokens(x['text']) <= max_tokens)

# Print the number of tasks filtered out and the remaining tasks
print(f'{len(dataset)-len(filtered_dataset)} tasks contain too many tokens if we set max_tokens to {max_tokens}')
print(f'The dataset contains {len(filtered_dataset)} tasks to evaluate the model')
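
When choosing `max_tokens`, keep in mind that the prompt and the generated tokens share the same context window. The sketch below checks this budget, assuming the 8192-token context length of Llama 3 8B and the `max_new_tokens=512` default used for generation in this notebook. The helper `fits_context` is hypothetical.

```python
CONTEXT_WINDOW = 8192  # context length of Llama 3 8B, in tokens


def fits_context(n_prompt_tokens, max_new_tokens=512, context_window=CONTEXT_WINDOW):
    """Check whether the prompt plus the generated tokens fit into the context window."""
    return n_prompt_tokens + max_new_tokens <= context_window


print(fits_context(8000))  # False: 8000 + 512 exceeds 8192
print(fits_context(7680))  # True: 7680 + 512 equals 8192
```

Under these assumptions, a prompt limit of about 7680 tokens would leave enough room for a full-length answer.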

## 6. Evaluate the model

Now, let’s query the Large Language Model (LLM) to determine its ability to successfully solve ARC tasks:

### 6.1 Generate the Outputs


In [None]:
# Define your LLM pipeline
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define terminators for the pipeline
terminators = [
    text_gen_pipeline.tokenizer.eos_token_id,
    text_gen_pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Function to generate outputs
def generate_solution(task, max_new_tokens=512, do_sample=True, temperature=0.1, top_p=0.1):
    """
    Generate a solution for an ARC task using the language model.

    This function takes a task prompt, generates a solution using the text generation pipeline,
    and extracts the generated solution from the model's output.

    Parameters:
    task (dict): The ARC task data containing the prompt and other relevant information.
    max_new_tokens (int, optional): The maximum number of new tokens to generate. Default is 512.
    do_sample (bool, optional): Whether to use sampling; if False, greedy decoding is used. Default is True.
    temperature (float, optional): The sampling temperature. Lower values make the model more conservative. Default is 0.1.
    top_p (float, optional): The cumulative probability for nucleus sampling. Lower values make the model more conservative. Default is 0.1.

    Returns:
    dict: A dictionary containing the generated solution.
    """
    # Extract the prompt from the task
    prompt = task['text']
    
    # Generate the model's output based on the prompt
    outputs = text_gen_pipeline(
        prompt, 
        max_new_tokens=max_new_tokens, 
        eos_token_id=terminators, 
        do_sample=do_sample, 
        temperature=temperature, 
        top_p=top_p
    )
    
    # Extract the generated solution from the model's output
    generated_solutions = outputs[0]["generated_text"][len(prompt):]
    return {'generated_solution': generated_solutions}

# Generate solutions
print("Generating solutions")
filtered_dataset = filtered_dataset.map(generate_solution, batched=False)

In [None]:
print(filtered_dataset[:5]['generated_solution'])
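
With `temperature=0.1` and `top_p=0.1`, sampling is close to greedy decoding. The toy example below shows why: it applies temperature scaling and nucleus (top-p) filtering to a made-up logit vector and reports which token indices remain candidates. This is a simplified illustration in pure Python, not the implementation used by `transformers`; the function `nucleus_candidates` is hypothetical.

```python
import math


def nucleus_candidates(logits, temperature=0.1, top_p=0.1):
    """Return the token indices that survive temperature scaling and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5]  # made-up logits for three tokens
print(nucleus_candidates(toy_logits, temperature=0.1, top_p=0.1))  # [0]
print(nucleus_candidates(toy_logits, temperature=1.0, top_p=0.9))  # [0, 1, 2]
```

With the notebook's settings only the most likely token survives in this example, so the outputs are nearly deterministic. Raising both values lets more tokens through, which may be useful if you want diverse candidates for the second attempt.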

### 6.2 Extract Solutions and Verify Validity

Next, we will extract the generated solutions from the model’s output and verify their validity.

In [None]:
def extract_solution(text):
    """
    Extract the solution array from the generated text.

    Parameters:
    text (str): The text containing the generated solution.

    Returns:
    list: A list of lists representing the extracted solution array.
          Returns [[0]] if no valid solution is found.
    """
    try:
        # Find the part of the text that looks like a nested list
        start = text.index('[[')
        end = text.index(']]', start) + 2
        array_str = text[start:end]
        
        # Use ast.literal_eval to safely evaluate the string as a Python expression
        array = ast.literal_eval(array_str)
        
        # Check if the result is a list of lists
        if all(isinstance(i, list) for i in array):
            return array
        else:
            return [[0]]
    except (ValueError, SyntaxError):
        return [[0]]

def pad_array_with_value(array, target_shape, pad_value):
    """
    Pad the given array to the target shape with the specified pad value.

    This function pads the original array to fit the target shape by adding additional
    pixels at the ends. This method ensures that the smaller array is placed at the
    top-left corner of the target shape, making sense of the number of correct pixels
    during comparison.

    Note:
    Depending on how you pad the arrays, the number of correct pixels might vary.
    For example, placing the smaller array in the center versus adding pixels at the ends
    can yield different results. Here, we pad by adding pixels at the ends.

    Parameters:
    array (list): The original array to be padded.
    target_shape (tuple): The desired shape of the padded array (rows, columns).
    pad_value (int): The value to use for padding the array.

    Returns:
    np.ndarray: A padded array with the specified target shape and pad value.
    """
    padded_array = np.full(target_shape, pad_value, dtype=int)
    original_shape = np.array(array).shape
    padded_array[:original_shape[0], :original_shape[1]] = array
    return padded_array

def compare_solutions_with_padding(generated_output, correct_output, pad_value=-1):
    """
    Compare the generated output with the correct output, using padding to align their shapes.

    Parameters:
    generated_output (list): The generated solution array.
    correct_output (list): The correct solution array.
    pad_value (int, optional): The value to use for padding. Default is -1. The colour value -1 should not be present in the solutions.

    Returns:
    tuple: A tuple containing:
        - is_correct (bool): True if the solutions match exactly, False otherwise.
        - correct_percentage (float): The percentage of correctly matched pixels.
    """
    max_rows = max(len(generated_output), len(correct_output))
    max_cols = max(len(generated_output[0]), len(correct_output[0]))
    target_shape = (max_rows, max_cols)
    
    padded_generated = pad_array_with_value(generated_output, target_shape, pad_value)
    padded_correct = pad_array_with_value(correct_output, target_shape, pad_value)
    
    total_pixels = max_rows * max_cols
    correct_pixels = np.sum((padded_generated == padded_correct) & (padded_generated != pad_value) & (padded_correct != pad_value))
    correct_percentage = (correct_pixels / total_pixels) * 100
    
    is_correct = (correct_pixels == total_pixels)
    
    return is_correct, correct_percentage

if test_run:
    # Lists to store results of task evaluation
    solved_tasks = []
    failed_tasks = []
    accuracy_list = []

    for i, task in enumerate(filtered_dataset):
        true_solution = task['solution']
        file_name = task['file_name']
        generated_text = task["generated_solution"]

        # Extract the solution generated by the model
        gen_solution = extract_solution(generated_text)

        # Compare the generated solution with the true solution
        is_correct, correct_percentage = compare_solutions_with_padding(gen_solution, true_solution)

        # Append results to respective lists based on correctness
        if is_correct:
            solved_tasks.append({
                'file_name': file_name,
                'llm_output': generated_text,
                'solution': gen_solution
            })
        else:
            failed_tasks.append({
                'file_name': file_name,
                'llm_output': generated_text,
                'solution': gen_solution
            })

        # Store the "pixel accuracy" for each task
        accuracy_list.append({
            'file_name': file_name,
            'correct_percentage': correct_percentage
        })

    # Create a dictionary to store results
    results = {'file_name': [], 'solved': [], 'accuracy': []}

    # Add solved tasks to the results
    for task in solved_tasks:
        results['file_name'].append(task['file_name'])
        results['solved'].append(True)
        results['accuracy'].append(next((item['correct_percentage'] for item in accuracy_list if item['file_name'] == task['file_name']), None))

    # Add failed tasks to the results
    for task in failed_tasks:
        results['file_name'].append(task['file_name'])
        results['solved'].append(False)
        results['accuracy'].append(next((item['correct_percentage'] for item in accuracy_list if item['file_name'] == task['file_name']), None))

    # Create a DataFrame
    df_results = pd.DataFrame(results)

    # Display the DataFrame as a table
    print(df_results)

    # Calculate and print the average correct percentage
    average_correct_percentage = df_results['accuracy'].mean()
    print(f"Average 'Pixel Accuracy' of attempted tasks: {average_correct_percentage:.2f}%")

    # Calculate and print the number of solved tasks out of the total number of tasks
    total_tasks = len(df)
    solved_tasks_count = df_results['solved'].sum()
    print(f"Solved {solved_tasks_count} out of {total_tasks} tasks ({(solved_tasks_count / total_tasks) * 100:.2f}%)")
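
Note that `extract_solution` only checks that the result is a list of lists. A ragged or otherwise malformed grid can still slip through and trip up the padding and comparison steps. The following sketch adds a stricter check, assuming the usual ARC conventions of rectangular grids with at most 30 rows and columns and integer colour values from 0 to 9. The helper `is_valid_grid` is hypothetical and not part of the original notebook.

```python
def is_valid_grid(grid, max_size=30):
    """Check that `grid` is a non-empty rectangular list of lists with values 0-9."""
    if not isinstance(grid, list) or not (1 <= len(grid) <= max_size):
        return False
    if not all(isinstance(row, list) for row in grid):
        return False
    width = len(grid[0])
    if not (1 <= width <= max_size):
        return False
    for row in grid:
        # Every row must have the same length as the first one
        if len(row) != width:
            return False
        # Every cell must be an integer colour value between 0 and 9 (bools excluded)
        if not all(isinstance(c, int) and not isinstance(c, bool) and 0 <= c <= 9 for c in row):
            return False
    return True


print(is_valid_grid([[0, 1], [2, 3]]))  # True
print(is_valid_grid([[0, 1], [2]]))     # False: rows have different lengths
print(is_valid_grid([[10]]))            # False: value outside 0-9
```

One possible use is to fall back to the `[[0]]` placeholder whenever the extracted grid fails this check.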

## 7. Submit

For the submission, you must create a file called 'submission.json' in the format explained [here](https://www.kaggle.com/competitions/arc-prize-2024/overview/evaluation):

```
{"00576224": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "009d5c81": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 "12997ef3": [{"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]},
              {"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}],
 ...
}
```

In [None]:
solution_dict = {}

for i, task in enumerate(filtered_dataset):
    file_name = task['file_name']
    generated_text = task["generated_solution"]
    # Extract the solution generated by the model
    gen_solution = extract_solution(generated_text)
    # For now we only do one attempt
    solution_dict[file_name] = [
        {
            "attempt_1": gen_solution,
            "attempt_2": [[0, 0], [0, 0]]
        }
    ]

# Recombining the solutions for split files
combined_solution_dict = {}
combined_files = {}

for file_name, attempts in solution_dict.items():
    base_name = file_name.split('_')[0]
    if base_name not in combined_solution_dict:
        combined_solution_dict[base_name] = []
        combined_files[base_name] = []
    combined_solution_dict[base_name].extend(attempts)
    if '_' in file_name:
        combined_files[base_name].append(file_name)
        
# Printing which file names have been combined
print("Files that have been combined:")
for base_name, files in combined_files.items():
    if files:  # Print only if there are files that were combined
        print(f"{base_name}: {', '.join(files)}")

# We still need to fill in dummy solutions for the tasks we did not consider to make a valid submission:
# Load the sample submission file
with open('/kaggle/input/arc-prize-2024/sample_submission.json') as f:
    sample_submission = json.load(f)
# Fill in all entries that are still missing from the sample_submission file
for key, value in sample_submission.items():
    if key not in combined_solution_dict:
        combined_solution_dict[key] = value

# Create submission
with open("submission.json", "w") as json_file:
    json.dump(combined_solution_dict, json_file) 
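
One caveat of the recombination above: if only some parts of a split task survive the token filter (for example `_1` but not `_0`), the combined list ends up too short and the remaining attempt lands at the wrong position. The sketch below shows an index-aware alternative. It assumes that the sample submission contains one placeholder entry per test input, in order; the function `recombine` is hypothetical, and the task IDs in the example are made up.

```python
import copy


def recombine(solution_dict, sample_submission):
    """Place each attempt at the index encoded in its file name, keeping placeholders elsewhere."""
    combined = copy.deepcopy(sample_submission)
    for file_name, attempts in solution_dict.items():
        # "abc_1" -> base name "abc", index 1; "abc" -> base name "abc", index 0
        base_name, _, suffix = file_name.partition("_")
        idx = int(suffix) if suffix else 0
        if base_name in combined and idx < len(combined[base_name]):
            combined[base_name][idx] = attempts[0]
    return combined


placeholder = {"attempt_1": [[0, 0], [0, 0]], "attempt_2": [[0, 0], [0, 0]]}
sample = {"aaaa0001": [dict(placeholder)], "bbbb0002": [dict(placeholder), dict(placeholder)]}

# Only the second test input of "bbbb0002" was attempted
solutions = {
    "aaaa0001": [{"attempt_1": [[1]], "attempt_2": [[0, 0], [0, 0]]}],
    "bbbb0002_1": [{"attempt_1": [[2]], "attempt_2": [[0, 0], [0, 0]]}],
}

combined = recombine(solutions, sample)
print(combined["bbbb0002"][0]["attempt_1"])  # [[0, 0], [0, 0]]
print(combined["bbbb0002"][1]["attempt_1"])  # [[2]]
```

Starting from a copy of the sample submission also guarantees that every task ID is present, so the separate fill-in step is no longer needed.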

## Closing Remarks

The fine-tuned Llama 3 model used in this notebook has not been optimized and will likely not successfully solve any of the hidden test tasks. The primary purpose of this fine-tuning is to ensure that the model’s responses are specific to the task, making it easier to extract the solution array. It is unlikely that the model has significantly improved in solving ARC tasks through this fine-tuning process. It is your task to change that ;).

It’s worth noting that Large Language Model (LLM) performance might improve with techniques like chain-of-thought prompting, where the model is asked to explain its reasoning process. While this approach can enhance understanding, it may also complicate the extraction of the solution array and take more time. You can find an overview of recent papers and repositories [here](https://docs.google.com/spreadsheets/d/1fR4cgjY1kNKN_dxiidBQbyT6Gv7_Ko7daKOjlYojwTY/edit#gid=756763742).

For further insights and state-of-the-art performance on ARC using a fine-tuned LLM, you can refer to the work of Jack Cole and Mohamed Osman. The notebook they submitted to the past challenge can be found [here](https://www.kaggle.com/code/jcole75/mindsai-nlp-arc), and a recent interview is available on the [lab42.global](https://lab42.global/community-2023-july-arc-sota/) webpage. Their approach operates on a significantly different scale and showcases advanced techniques and optimizations.

For more inspiration, head over to the official [arcprize.org](https://arcprize.org/) webpage and have a look at the community [resources](https://arcprize.org/guide#resources) list.