For this model we used the following resources:
https://huggingface.co/docs/transformers/index
https://huggingface.co/docs/transformers/tasks/question_answering
https://huggingface.co/docs/transformers/model_doc/llama3
https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

We want to fine-tune Llama-3.3-70B-Instruct on our data.
As far as we understand right now, a traditional (extractive) question-answering transformer needs:
	- context
	- question
	- answer found in context
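
To make the three ingredients concrete, here is a minimal sketch of one extractive QA sample in the SQuAD-style layout used by the question answering task guide linked above. The context and question text are made up for illustration; the point is only that the answer is stored as a character span inside the context.

```python
# Hypothetical sample; the sentences are illustrative and not taken from our data.
context = "Mintaka is a question answering dataset annotated with Wikidata entities."
answer_text = "Wikidata entities"

sample = {
    "context": context,
    "question": "What is Mintaka annotated with?",
    # The answer is a span of the context: its text plus its character offset.
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

# Recover the answer from the context using only the stored offset.
start = sample["answers"]["answer_start"][0]
span = sample["context"][start:start + len(answer_text)]
print(start, repr(span))
```

Without a context there is no span to point at, which is why this format does not fit Mintaka directly.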

With Mintaka we have no context, only Wikidata entities.
Can we find a transformer that takes question entities and answer entities separately?
What kind of transformer do we need for this?

We need to convert all inputs to strings (at least the numerical answers, maybe more).
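
A minimal sketch of that string conversion, assuming answers can be numbers, booleans, strings, or missing. The helper name `answer_to_text` and the lowercase "true"/"false" rendering are our own choices for illustration, not something the dataset prescribes.

```python
def answer_to_text(value):
    """Render an answer value as a string; a missing answer becomes ''."""
    if value is None:
        return ""
    # bool must be checked before numbers, because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    # Whole-number floats such as 3.0 are written without the trailing ".0"
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


for raw in [16, 3.0, 2.5, True, None, "Paris"]:
    print(repr(raw), "->", repr(answer_to_text(raw)))
```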

In [None]:
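# transformers 2.5.1 predates Llama support; the Llama 3.3 model card asks for transformers >= 4.45.0
%pip install --cache-dir=/home/user/tmp torch "transformers>=4.45.0" datasets accelerate --break-system-packages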
#transformers datasets accelerate torch deepspeed

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
# import torch


Here we load the dataset from the Mintaka data files.

The JSON manipulation is WIP; we still need to transform the original data properly.

In [27]:
import json

# Load the JSON data
with open('../dataset-generation/data/mintaka_dev_extended_preprocessed.json', 'r', encoding='utf-8') as file:
    data = json.load(file)


formatted_data = []
for entry in data:
    if 'answer' in entry:
        if isinstance(entry['answer'], list):
            for i, ans in enumerate(entry['answer']):
                if isinstance(ans, int):
                    entry['answer'][i] = {"answerType": "numerical", "answer": ans, "mention": str(ans)}
                elif isinstance(ans, str):
                    entry['answer'][i] = {"answerType": "text", "answer": ans, "mention": ans}
        elif isinstance(entry['answer'], int):
            entry['answer'] = {"answerType": "numerical", "answer": entry['answer'], "mention": str(entry['answer'])}
        elif isinstance(entry['answer'], str):
            entry['answer'] = {"answerType": "text", "answer": entry['answer'], "mention": entry['answer']}
        elif not isinstance(entry['answer'], dict):
            entry['answer'] = {"answerType": None, "answer": [], "mention": None}
    else:
        entry['answer'] = {"answerType": None, "answer": [], "mention": None}

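    # Guard against a scalar in "answer" (e.g. a bare int) before iterating over it
    answer_items = entry["answer"]["answer"]
    if not isinstance(answer_items, list):
        answer_items = []

    formatted_entry = {
        "id": entry["id"],
        "question": entry["question"],
        "translations": entry["translations"],
        "answer": entry["answer"]["mention"],
        # Only dict items carrying a "label" contribute; other items (e.g. numbers) are skipped
        "answer_translations": [item["label"] for item in answer_items if isinstance(item, dict) and "label" in item]
    }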
    formatted_data.append(formatted_entry)
    


with open('./mintaka_dev_extended_formatted.json', 'w', encoding='utf-8') as file:
    json.dump(formatted_data, file, ensure_ascii=False, indent=4)

In [10]:
path_to_data = "../dataset-generation/data/"
train_file = path_to_data+"mintaka_train.json"
test_file = path_to_data+"mintaka_test.json"
# dev_file = path_to_data+"mintaka_dev.json"
# dev_file = path_to_data+"mintaka_dev_extended.json"
dev_file = "./mintaka_dev_extended_formatted.json"

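# Note: both splits currently point at the formatted dev file; train_file and test_file are not used yet
dataset = load_dataset("json", data_files={"train": dev_file, "validation": dev_file})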




Here we preprocess the data to prepare it for the Llama model.
The model is gated: the user must have been granted access to Llama 3.3 on Hugging Face and must be logged in on the client system (for example with huggingface-cli login).

In [None]:

pre_trained_model = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(pre_trained_model)



Preprocess our dataset

We need to split the dataset into a "questions" list and an "answers" list that share the same order, and then encode them with the Llama 3 tokenizer loaded above.
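
The splitting step on its own can be sketched in plain Python, without the tokenizer. The helper `split_pairs` and the three records are hypothetical; the records only mimic the "question" and "answer" fields of our formatted file. Skipping records without an answer is an assumption about how we want to treat them.

```python
def split_pairs(records):
    """Return two parallel lists: questions and their answers, in the same order."""
    questions, answers = [], []
    for record in records:
        # Assumption: records without an answer are dropped rather than padded
        if record.get("answer") is None:
            continue
        questions.append(record["question"])
        # The tokenizer expects text, so every answer is converted to a string
        answers.append(str(record["answer"]))
    return questions, answers


# Hypothetical records in the shape of the formatted file
records = [
    {"question": "How many moons does Mars have?", "answer": 2},
    {"question": "A question without an answer", "answer": None},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
questions, answers = split_pairs(records)
print(list(zip(questions, answers)))
```

Because both lists are appended to in the same loop iteration, index i of one list always belongs to index i of the other.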

In [None]:
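def preprocess_data(batch):
	# https://huggingface.co/transformers/v3.0.2/preprocessing.html
	# Look at 'Preprocessing pairs of sentences'
	# encoded_input = tokenizer(["How old are you?", "what's your name?"], ["I'm 6 years old", "Magnus"])
	# print(encoded_input)

	# With batched=True, `batch` maps each column name to a list of values
	batch_questions = batch["question"]

	# Same order as the questions; missing answers become empty strings
	batch_answers = ["" if answer is None else str(answer) for answer in batch["answer"]]

	# Encode each question together with its answer as one sequence pair
	encoded_inputs = tokenizer(batch_questions, batch_answers, truncation=True)

	# The returned fields (input_ids, attention_mask) are added as new columns
	return encoded_inputs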



Map and test our dataset

In [None]:
processed_dataset = dataset.map(preprocess_data, batched=True)

print(processed_dataset["train"][0])

Defining our model

In [13]:
%pip install --cache-dir=/home/user/tmp accelerate --break-system-packages


In [None]:
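from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimizer
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, DataCollatorForLanguageModeling, get_scheduler

def training_function():
    accelerator = Accelerator()

    model = AutoModelForCausalLM.from_pretrained(pre_trained_model)
    optimizer = AdamW(model.parameters(), lr=3e-5)

    # The Llama tokenizer has no padding token by default; reuse EOS so batches can be padded
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # With mlm=False the collator pads each batch and builds causal-LM labels from input_ids
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    # Keep only the tokenized columns; batch_size=1 is a placeholder, not a tuned value
    columns = ["input_ids", "attention_mask"]
    train_dataloader = DataLoader(
        processed_dataset["train"].select_columns(columns), shuffle=True, batch_size=1, collate_fn=collator
    )
    eval_dataloader = DataLoader(
        processed_dataset["validation"].select_columns(columns), batch_size=1, collate_fn=collator
    )

    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 5
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            # Forward pass; the model computes the loss from the labels in the batch
            outputs = model(**batch)
            loss = outputs.loss
            # Backward pass through accelerate so it works across devices
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()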
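
For reference, the "linear" schedule requested above scales the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero. The sketch below reproduces that factor in plain Python so the effect of `num_warmup_steps=0` is visible. The function name `linear_lr` and the value `num_training_steps=100` are hypothetical; the real step count depends on the dataloader length.

```python
def linear_lr(step, base_lr=3e-5, num_warmup_steps=0, num_training_steps=100):
    """Learning rate in effect after `step` scheduler steps (linear warmup, then linear decay)."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from base_lr to 0, never below 0
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# With no warmup, training starts at the full base learning rate
for step in (0, 25, 50, 75, 100):
    print(step, linear_lr(step))
```

With zero warmup steps the very first update already uses the full 3e-5, which is worth keeping in mind if early training turns out to be unstable.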

In [None]:
from accelerate import notebook_launcher

notebook_launcher(training_function)