# **Proyecto #2 - Soccer Chatbot**

## **Fine Tunning**

In this section, we are going to fine tune the Llama3 model with some pdfs from our knowledge database to improve its responses in terms of quality of the response and accuracy.

### **Data Preparation**

First, we have to load the pdf in memory to work with it.

In [15]:
import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_path = os.path.abspath('../docs/knowledge-database/documents/The ball is round.pdf')

loader = PyMuPDFLoader(pdf_path)
data = loader.load()

Once we have the pdf in memory, we can manipulate its contents to use them in a correct way.

We have to select the relevant pages. In this document the pages that contain relevant information are the pages from page 23 to page 987.

In [16]:
data = data[22:987]
data

[Document(page_content='1\nChasing Shadows: The Prehistory of Football\nFootball is as old as the world . . . People have always played some form of\nfootball, from its very basic form of kicking a ball around to the game it is\ntoday.\nSepp Blatter, FIFA President\nI\nIs it? Have we? Let us forgive the President of FIFA his hyperbole, let us\nnot take him at his word. Football, at the very least, requires feet. The\n', metadata={'source': 'c:\\Users\\sebarro04\\Dev\\Rag-System-Project\\docs\\knowledge-database\\documents\\The ball is round.pdf', 'file_path': 'c:\\Users\\sebarro04\\Dev\\Rag-System-Project\\docs\\knowledge-database\\documents\\The ball is round.pdf', 'page': 22, 'total_pages': 1193, 'format': 'PDF 1.4', 'title': 'The Ball is Round', 'author': 'David Goldblatt', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': "D:20240606005724+00'00'", 'modDate': "D:20240606005724+00'00'", 'trapped': ''}),
 Document(page_content='emergence of bipedal hominid

Now we have clean the data to remove irrelevant characters.

In [17]:
import re
import roman

# Function to check if a string is a roman number
def isRomanNumeral(s):
    try:
        roman.fromRoman(s)
        return True
    except roman.InvalidRomanNumeralError:
        return False

# Extract page contents
def extractPageContents(data):
    pages = []
    for page in data:
        pages += [page.page_content]

    return pages

# Split pages by lines
def splitPagesIntoLines(pages):
    lines = []
    for page in pages:
        lines += page.split('\n')

    return lines

# Clean the lines
def cleanLines(lines):
    cleanedLines = []
    for line in lines:
        temp = line.strip()
        
        if (temp.isdigit()):
            continue
        elif (temp == ''):
            continue
        elif (isRomanNumeral(temp)):
            continue

        temp = re.sub(r"’\d", "’", temp)
        temp = re.sub(r"\.\d", ".", temp)
        
        cleanedLines += [temp]
    
    return cleanedLines

pages = extractPageContents(data)
lines = splitPagesIntoLines(pages)
cleanedLines = cleanLines(lines)

cleanedLines

# Merge all the lines into a single string
cleanedText = ' '.join(cleanedLines)
# print(cleanedText)

Once the text is cleaned, we can tokenize it. In this case we are going to tokenize it into sentences.

In [18]:
import nltk
nltk.download('punkt')

sentences = nltk.sent_tokenize(cleanedText)

sentences

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sebarro04\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Chasing Shadows: The Prehistory of Football Football is as old as the world .',
 '.',
 '.',
 'People have always played some form of football, from its very basic form of kicking a ball around to the game it is today.',
 'Sepp Blatter, FIFA President Is it?',
 'Have we?',
 'Let us forgive the President of FIFA his hyperbole, let us not take him at his word.',
 'Football, at the very least, requires feet.',
 'The emergence of bipedal hominids, whose feet and hands are sufficiently differentiated that they can trap and kick or catch and throw rather than paw, pad or shove, can be dated to around 2 million years ago.',
 'The world is somewhat older.',
 'And the ball?',
 'Let us forgive Blatter his carelessness with the archaeological record, for there is no evidence of any human manufactured sphere that could be kicked before 2000 BCE.',
 'Perhaps those stitched ancient Egyptian balls were kicked, but the hieroglyphic and mural evidence only shows throwing.',
 'No doubt, people have bee

There are still some junk chars, so we have to clean them.

In [19]:
cleanedSentences = []
for sentece in sentences:
    if (sentece != '.'):
        cleanedSentences += [sentece]

cleanedSentences

['Chasing Shadows: The Prehistory of Football Football is as old as the world .',
 'People have always played some form of football, from its very basic form of kicking a ball around to the game it is today.',
 'Sepp Blatter, FIFA President Is it?',
 'Have we?',
 'Let us forgive the President of FIFA his hyperbole, let us not take him at his word.',
 'Football, at the very least, requires feet.',
 'The emergence of bipedal hominids, whose feet and hands are sufficiently differentiated that they can trap and kick or catch and throw rather than paw, pad or shove, can be dated to around 2 million years ago.',
 'The world is somewhat older.',
 'And the ball?',
 'Let us forgive Blatter his carelessness with the archaeological record, for there is no evidence of any human manufactured sphere that could be kicked before 2000 BCE.',
 'Perhaps those stitched ancient Egyptian balls were kicked, but the hieroglyphic and mural evidence only shows throwing.',
 'No doubt, people have been kicking fr

### **Datasets**

Once we have cleaned the data, we can prepare our dataset to fine tune the model.

First we have to label the sentences. In this case we are going to label a sentence with the next sentence. Also, we are going to create a Dataset object, so it can be processed by PyTorch.

In [20]:
from datasets import Dataset

dataDict = {
    'inputText': cleanedSentences ,
    'targetText': cleanedSentences[1:] + [None]
}

dataset = Dataset.from_dict(dataDict)

dataset[0]

{'inputText': 'Chasing Shadows: The Prehistory of Football Football is as old as the world .',
 'targetText': 'People have always played some form of football, from its very basic form of kicking a ball around to the game it is today.'}

Once we have the dataset, we can split it into different datasets to train, evaluate and test the model.

In [29]:
from sklearn.model_selection import train_test_split

trainDataset, tempDataset = train_test_split(dataset, test_size=0.3, random_state=42)
validationDataset, testDataset = train_test_split(tempDataset, test_size=0.5, random_state=42)

TypeError: '<' not supported between instances of 'int' and 'ellipsis'

Now we have to encode the input sentences, so the model can process them.

In [25]:
from transformers import AutoTokenizer

modelId = 'meta-llama/Meta-Llama-3-8B'

tokenizer = AutoTokenizer.from_pretrained(modelId)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenizeFunction(data):
    return tokenizer(data['inputText'], return_tensors='pt', padding=True, truncation=True, max_length=51)

tokenizedTrainDataset = trainDataset.map(tokenizeFunction)
tokenizedValidationDataset = validationDataset.map(tokenizeFunction)
tokenizedTestDataset = testDataset.map(tokenizeFunction)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/14723 [00:00<?, ? examples/s]

In [26]:
tokenizedTrainDataset[0]

{'inputText': 'Chasing Shadows: The Prehistory of Football Football is as old as the world .',
 'targetText': 'People have always played some form of football, from its very basic form of kicking a ball around to the game it is today.',
 'input_ids': [[128000,
   1163,
   4522,
   67549,
   25,
   578,
   5075,
   19375,
   315,
   21424,
   21424,
   374,
   439,
   2362,
   439,
   279,
   1917,
   662]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Now we have to setup the training.

In [27]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
import numpy as np
import evaluate

model = AutoModelForCausalLM.from_pretrained(modelId)

metric = evaluate.load('accuracy')

def computeMetrics(evalPred):
    logits, labels = evalPred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainingArgs = TrainingArguments(
    output_dir='./fine-tuned-model',
    num_train_epochs=3,
    per_device_eval_batch_size=8,
    save_steps=100,
    compute_metrics=computeMetrics
)

trainer = Trainer(
    model=model,
    args=trainingArgs,
    train_dataset=tokenizedDataset
)

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

In [28]:
trainer.train()

  0%|          | 0/5523 [00:00<?, ?it/s]

ValueError: expected sequence of length 51 at dim 2 (got 23)

In [None]:
eval_results = trainer.evaluate()

In [None]:
# model = 'bert-base-uncased'

# tokenizer = AutoTokenizer.from_pretrained(model)

# encodedInput = tokenizer(sentences, padding='max_length', truncation=True)

# inputsIds = encodedInput['input_ids']
# attentionMask = encodedInput['attention_mask']

Now we have to create a dataset, so pytorch can process it.

### **Training**

### **Results**