# **Project #2 - Football Chatbot**

## **Fine-Tuning**

In this section, we fine-tune the Llama 3 model on PDFs from our knowledge database to improve the quality and accuracy of its responses.

### **Data Preparation**

First, we have to load the PDF into memory so that we can work with it.

In [None]:
import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_path = os.path.abspath('../docs/knowledge-database/documents/The ball is round.pdf')

loader = PyMuPDFLoader(pdf_path)
data = loader.load()

Once the PDF is in memory, we can work with its contents.

We have to select the relevant pages. In this document, the relevant information is on pages 23 to 987. The loaded pages are stored in a 0-indexed list, so page 23 sits at index 22 and the slice end of 987 is exclusive.

In [None]:
data = data[22:987]
data
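The off-by-one between page numbers and list indices is easy to get wrong, so here is a minimal sketch of the conversion. It assumes the loader returns one list entry per page, in order; `pageRangeToSlice` and `fakePages` are hypothetical names used only for illustration.

```python
def pageRangeToSlice(firstPage, lastPage):
    """Convert an inclusive, 1-indexed page range into a 0-indexed Python slice."""
    if firstPage < 1 or lastPage < firstPage:
        raise ValueError('expected 1 <= firstPage <= lastPage')
    # The slice start is shifted down by one; the slice end is exclusive,
    # so lastPage itself is the correct end value.
    return slice(firstPage - 1, lastPage)

# Stand-in for the loaded document: one entry per page, numbered from 1
fakePages = list(range(1, 1001))
selected = fakePages[pageRangeToSlice(23, 987)]

print(selected[0], selected[-1], len(selected))
```

With the range used in this notebook, the selection starts at page 23, ends at page 987, and contains 965 pages.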

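Now we have to clean the data: we drop empty lines and lines that contain only a number or a Roman numeral, and we strip digits attached to a closing quote or a full stop.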

In [None]:
import re
import roman

# Check whether a string is a valid Roman numeral
def isRomanNumeral(s):
    try:
        roman.fromRoman(s)
        return True
    except roman.InvalidRomanNumeralError:
        return False

# Extract page contents
def extractPageContents(data):
    pages = []
    for page in data:
        pages += [page.page_content]

    return pages

# Split pages by lines
def splitPagesIntoLines(pages):
    lines = []
    for page in pages:
        lines += page.split('\n')

    return lines

# Clean the lines
def cleanLines(lines):
    cleanedLines = []
    for line in lines:
        temp = line.strip()
        
        if (temp.isdigit()):
            continue
        elif (temp == ''):
            continue
        elif (isRomanNumeral(temp)):
            continue

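        # Drop a digit that directly follows a closing quote or a full stop
        # (e.g. footnote markers); note that this also alters decimals such as "1.5"
        temp = re.sub(r"’\d", "’", temp)
        temp = re.sub(r"\.\d", ".", temp)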
        
        cleanedLines += [temp]
    
    return cleanedLines

pages = extractPageContents(data)
lines = splitPagesIntoLines(pages)
cleanedLines = cleanLines(lines)

cleanedLines

# Merge all the lines into a single string
cleanedText = ' '.join(cleanedLines)
# print(cleanedText)
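To make the effect of the two substitutions concrete, the sketch below applies them to two made-up sample lines. The helper name `stripAttachedDigits` and the sample strings are hypothetical; only the regular expressions are taken from the cleaning step.

```python
import re

def stripAttachedDigits(line):
    """Remove a single digit that directly follows a closing quote or a full stop."""
    line = re.sub(r"’\d", "’", line)
    line = re.sub(r"\.\d", ".", line)
    return line

# A line with digits attached to a closing quote and to the final full stop
quoted = "It was, he said, ‘the people’s game’3 from the start.4"
cleanedQuoted = stripAttachedDigits(quoted)
print(cleanedQuoted)

# A line with a decimal number: the digit after the point is removed as well
decimal = "The stadium cost 1.5 million."
cleanedDecimal = stripAttachedDigits(decimal)
print(cleanedDecimal)
```

The second sample shows the trade-off of this simple approach: decimals lose their first fractional digit, which may or may not matter for the text at hand.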

Once the text is cleaned, we can tokenize it. In this case, we split it into sentences.

In [None]:
import nltk
nltk.download('punkt')

sentences = nltk.sent_tokenize(cleanedText)

sentences

Some of the resulting sentences consist of nothing but a full stop, so we have to filter them out.

In [None]:
# Keep every sentence except those that consist only of a full stop
cleanedSentences = []
for sentence in sentences:
    if sentence != '.':
        cleanedSentences += [sentence]

cleanedSentences

### **Datasets**

Once we have cleaned the data, we can prepare our dataset to fine tune the model.

First we have to label the sentences. In this case, each sentence is labelled with the sentence that follows it. We also wrap the data in a `Dataset` object so that it can be processed by PyTorch.

In [None]:
from datasets import Dataset, DatasetDict

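# Pair each sentence with its successor; the last sentence has no successor,
# so it is used only as a target
dataDict = {
    'inputText': cleanedSentences[:-1],
    'targetText': cleanedSentences[1:]
}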

dataset = Dataset.from_dict(dataDict)

dataset[0]
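One way to build such input/target pairs without leaving an empty target for the final sentence is to zip the list with itself shifted by one. This is a minimal sketch on toy data; `makeNextSentencePairs` and `toySentences` are hypothetical names.

```python
def makeNextSentencePairs(sentences):
    """Pair each sentence with its successor; the final sentence has no successor and is dropped as an input."""
    return list(zip(sentences[:-1], sentences[1:]))

toySentences = [
    'Football began in England.',
    'It spread quickly.',
    'Today it is global.',
]
pairs = makeNextSentencePairs(toySentences)

for inputText, targetText in pairs:
    print(inputText, '->', targetText)
```

Three sentences yield two pairs, because the last sentence can only appear as a target.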

Once we have the dataset, we can split it into three subsets to train, validate, and test the model. We hold out 30% of the data and divide that part in half, which gives roughly a 70/15/15 split.

In [None]:
trainTestSplit = dataset.train_test_split(test_size=0.3, seed=42)
trainDataset = trainTestSplit['train']
testValidationSplit = trainTestSplit['test'].train_test_split(test_size=0.5, seed=42)

datasetDict = DatasetDict({
    'train': trainDataset,
    'validation': testValidationSplit['train'],
    'test': testValidationSplit['test']
})
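The overall proportions produced by two chained splits can be checked with a little arithmetic. The sketch below only computes fractions; the exact row counts depend on how the library rounds, which is not modelled here. `splitFractions` is a hypothetical helper.

```python
def splitFractions(firstTestSize, secondTestSize):
    """Return the (train, validation, test) fractions of two chained splits.

    The first split holds out `firstTestSize` of the data; the second split
    divides that held-out part, sending `secondTestSize` of it to the test set.
    """
    trainFraction = 1.0 - firstTestSize
    validationFraction = firstTestSize * (1.0 - secondTestSize)
    testFraction = firstTestSize * secondTestSize
    return trainFraction, validationFraction, testFraction

trainFrac, validationFrac, testFrac = splitFractions(0.3, 0.5)
print(f'train={trainFrac:.2f} validation={validationFrac:.2f} test={testFrac:.2f}')
```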

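Now we have to tokenize the input sentences so that the model can process them. The labels are derived from the input token IDs, which is the usual setup for causal language modelling; the `targetText` column is not used in this step.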

In [None]:
from transformers import AutoTokenizer

modelId = 'meta-llama/Meta-Llama-3-8B'

tokenizer = AutoTokenizer.from_pretrained(modelId)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

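def tokenizeFunction(data):
    # Pad every example to the same fixed length so that batches can be stacked during training
    tokenizedData = tokenizer(data['inputText'], return_tensors='pt', padding='max_length', truncation=True, max_length=51)
    # Causal LM objective: the labels are the input ids themselves;
    # padding positions are set to -100 so that the loss ignores them
    labels = tokenizedData.input_ids.clone()
    labels[tokenizedData.attention_mask == 0] = -100
    tokenizedData['labels'] = labels
    return tokenizedData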

tokenizedDatasets = datasetDict.map(tokenizeFunction, batched=True)

In [None]:
tokenizedDatasets['train'][0]
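To see what padding and truncation do to a single example, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: it assumes right-side padding (the real tokenizer's padding side may differ), and `padOrTruncate` and the token ids are hypothetical.

```python
def padOrTruncate(tokenIds, maxLength, padId):
    """Return (inputIds, attentionMask), both exactly maxLength long."""
    inputIds = tokenIds[:maxLength]          # truncate long sequences
    attentionMask = [1] * len(inputIds)      # 1 marks a real token
    padCount = maxLength - len(inputIds)
    # Pad short sequences on the right; 0 in the mask marks padding
    return inputIds + [padId] * padCount, attentionMask + [0] * padCount

shortIds, shortMask = padOrTruncate([11, 12, 13], maxLength=5, padId=0)
longIds, longMask = padOrTruncate([1, 2, 3, 4, 5, 6, 7], maxLength=5, padId=0)

print(shortIds, shortMask)
print(longIds, longMask)
```

The attention mask is what lets later steps tell real tokens apart from padding, which matters here because the padding token is the same as the end-of-sequence token.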

### **Training**

In this section, we fine-tune the Llama 3 model on the previously created datasets.

First we have to set up the training.

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
import numpy as np
import evaluate

model = AutoModelForCausalLM.from_pretrained(modelId)

metric = evaluate.load('accuracy')

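def computeMetrics(evalPred):
    logits, labels = evalPred
    # The logits at position t score the token at position t + 1,
    # so predictions and labels are shifted by one before comparing
    predictions = np.argmax(logits, axis=-1)[:, :-1]
    labels = labels[:, 1:]
    # Skip positions labelled -100 (ignored by the loss); masking also flattens the arrays
    mask = labels != -100
    return metric.compute(predictions=predictions[mask], references=labels[mask])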

trainingArgs = TrainingArguments(
    output_dir='./fine-tuned-model',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    save_steps=100,
    eval_strategy='epoch',
    fp16=True,
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=trainingArgs,
    train_dataset=tokenizedDatasets['train'],
    eval_dataset=tokenizedDatasets['validation'],
    compute_metrics=computeMetrics
)
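Token-level accuracy for a causal language model compares the prediction made at each position with the token that actually comes next. The sketch below shows that shift on a single toy sequence in plain Python; `nextTokenAccuracy` and the ids are hypothetical, and -100 is assumed to mark positions that should be ignored.

```python
def nextTokenAccuracy(predictedIds, labelIds, ignoreIndex=-100):
    """Compare the prediction at position t with the label at position t + 1."""
    correct = 0
    total = 0
    for predicted, label in zip(predictedIds[:-1], labelIds[1:]):
        if label == ignoreIndex:
            continue  # ignored positions (e.g. padding) do not count
        total += 1
        correct += int(predicted == label)
    return correct / total if total else 0.0

toyLabels = [5, 8, 2, 9, -100]
toyPredictions = [8, 2, 7, 0, 0]
accuracy = nextTokenAccuracy(toyPredictions, toyLabels)
print(round(accuracy, 3))
```

In this toy case, two of the three counted positions are predicted correctly and the ignored position is skipped.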

In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()

### **Results**