# **Project #2 - Soccer Chatbot**

## **Fine-Tuning**

In this section, we fine-tune the Llama 3 model on PDFs from our knowledge database to improve the quality and accuracy of its responses.

### **Data Preparation**

First, we load the PDF into memory so we can work with it.

In [None]:
import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_path = os.path.abspath('../docs/knowledge-database/documents/The ball is round.pdf')

loader = PyMuPDFLoader(pdf_path)
data = loader.load()

Once the PDF is in memory, we can prepare its contents for training.

First, we select the relevant pages. In this document, the relevant content runs from page 23 to page 987.

In [None]:
data = data[22:987]
data
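
Python slices are zero-based and exclude the end index, which is why pages 23 through 987 become `data[22:987]`. The conversion can be sketched as follows; `pageRangeToSlice` and `examplePages` are hypothetical names used only for illustration, not part of the notebook.

```python
def pageRangeToSlice(firstPage, lastPage):
    """Convert an inclusive, 1-based page range into a Python slice."""
    if firstPage < 1 or lastPage < firstPage:
        raise ValueError('expected 1 <= firstPage <= lastPage')
    # Page n lives at index n - 1. The slice end is exclusive, so the
    # index of the last page (lastPage - 1) turns into lastPage.
    return slice(firstPage - 1, lastPage)

# Stand-in for the loaded document: one entry per page, numbered from 1.
examplePages = list(range(1, 1001))
selected = examplePages[pageRangeToSlice(23, 987)]
```

The selection keeps 965 pages, starting at page 23 and ending at page 987.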

Now we have to clean the data: we drop blank lines and page numbers (Arabic or Roman) and strip footnote markers.

In [None]:
import re
import roman

# Check whether a string is a Roman numeral (used to drop Roman page numbers)
def isRomanNumeral(s):
    try:
        roman.fromRoman(s)
        return True
    except roman.InvalidRomanNumeralError:
        return False

# Extract the text of each page
def extractPageContents(data):
    return [page.page_content for page in data]

# Split every page into lines
def splitPagesIntoLines(pages):
    lines = []
    for page in pages:
        lines.extend(page.split('\n'))

    return lines

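# Clean the lines
def cleanLines(lines):
    cleanedLines = []
    for line in lines:
        temp = line.strip()

        # Skip blank lines and bare page numbers (Arabic or Roman)
        if temp == '' or temp.isdigit() or isRomanNumeral(temp):
            continue

        # Strip a single-digit footnote marker after a closing quote or a full stop.
        # Note: the second pattern also clips the first decimal digit of numbers such as "2.5".
        temp = re.sub(r"’\d", "’", temp)
        temp = re.sub(r"\.\d", ".", temp)

        cleanedLines.append(temp)

    return cleanedLines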

pages = extractPageContents(data)
lines = splitPagesIntoLines(pages)
cleanedLines = cleanLines(lines)

cleanedLines

# Merge all the lines into a single string
cleanedText = ' '.join(cleanedLines)
# print(cleanedText)
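
To make the effect of these cleaning rules concrete, here is a self-contained sketch that applies them to a few invented lines. It swaps the `roman` package for a regular expression so the example has no dependencies; that swap, the helper names, and the sample text are all assumptions made for illustration.

```python
import re

# Stand-in for roman.fromRoman: matches upper-case numerals from I to MMMCMXCIX.
ROMAN_PATTERN = re.compile(
    r'M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})'
)

def looksLikePageMarker(text):
    """True for blank lines, Arabic page numbers, and Roman page numbers."""
    if text == '' or text.isdigit():
        return True
    return ROMAN_PATTERN.fullmatch(text) is not None

def cleanSampleLines(lines):
    cleaned = []
    for line in lines:
        text = line.strip()
        if looksLikePageMarker(text):
            continue
        # Drop one footnote digit after a closing quote or a full stop.
        text = re.sub(r"’\d", "’", text)
        text = re.sub(r"\.\d", ".", text)
        cleaned.append(text)
    return cleaned

# Invented sample lines, not taken from the book.
sampleLines = [
    '  XIV ',
    '42',
    '',
    'The game began.3 It spread quickly.',
    'He called it ‘the people’s game’7 in print.',
]
cleanedSample = cleanSampleLines(sampleLines)
```

Only the two prose lines survive, with the footnote digits `3` and `7` removed. The same rule would also turn a decimal such as `2.5` into `2.`, which is worth checking for when the source text contains statistics.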

Once the text is cleaned, we can tokenize it. In this case, we split it into sentences.

In [None]:
import nltk
nltk.download('punkt')  # NLTK 3.9 and later look for 'punkt_tab' instead

sentences = nltk.sent_tokenize(cleanedText)

sentences

Some junk still remains, such as sentences that consist of a single period, so we filter those out.

In [None]:
cleanedSentences = []
for sentence in sentences:
    if sentence != '.':
        cleanedSentences.append(sentence)

cleanedSentences
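
The filter above removes only the exact string `'.'`. If other punctuation-only fragments show up, a slightly broader rule could keep just the sentences that contain at least one letter or digit. This is a sketch of that alternative, not what the notebook does; `hasWordCharacter` and the sample sentences are hypothetical.

```python
def hasWordCharacter(sentence):
    """True if the sentence contains at least one letter or digit."""
    return any(ch.isalnum() for ch in sentence)

# Invented sample sentences, including punctuation-only fragments.
sampleSentences = ['The match ended.', '.', '...', '’', 'Goal!']
keptSentences = [s for s in sampleSentences if hasWordCharacter(s)]
```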

### **Datasets**

Once the data is clean, we can prepare the dataset for fine-tuning.

First, we label the sentences: each sentence is labeled with the sentence that follows it. We also wrap the pairs in a Hugging Face `Dataset` object so the training code can consume them.

In [None]:
from datasets import Dataset

# The last sentence has no successor, so it appears only as a target.
dataDict = {
    'inputText': cleanedSentences[:-1],
    'targetText': cleanedSentences[1:]
}

dataset = Dataset.from_dict(dataDict)

dataset[0]
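
The shift-by-one labeling is easier to see on a toy list. The sketch below pairs each sentence with its successor and leaves out the final sentence as an input, since nothing follows it; `buildNextSentencePairs` is a hypothetical helper, not part of the notebook.

```python
def buildNextSentencePairs(sentences):
    """Pair each sentence with the one that follows it."""
    return {
        'inputText': sentences[:-1],   # every sentence except the last
        'targetText': sentences[1:],   # every sentence except the first
    }

samplePairs = buildNextSentencePairs(['A.', 'B.', 'C.'])
# Row i of the dataset is (inputText[i], targetText[i]).
sampleRows = list(zip(samplePairs['inputText'], samplePairs['targetText']))
```

Three sentences give two rows, `('A.', 'B.')` and `('B.', 'C.')`, so both columns always have the same length.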

In [None]:
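from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bosonai/Higgs-Llama-3-70B')
# Llama tokenizers often ship without a padding token; reuse EOS so padding works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenizeFunction(data):
    # Pad to a fixed length so every example has the same shape.
    # No return_tensors: Dataset.map stores plain lists.
    return tokenizer(data['inputText'], padding='max_length', truncation=True, max_length=512)

tokenizedDataset = dataset.map(tokenizeFunction, batched=True)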
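
The tokenization step encodes only `inputText`, so `targetText` never reaches the model. For a decoder-only model such as Llama, one common arrangement is to concatenate prompt and target into a single sequence and mask the prompt positions in the labels with `-100`, the value that PyTorch's cross-entropy loss ignores by default. This is an assumption about how the pairs could be used, not something the notebook implements. The sketch uses toy token ids instead of a real tokenizer, and `buildCausalExample` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def buildCausalExample(promptIds, targetIds, maxLength):
    """Join prompt and target token ids and mask the prompt part of the labels."""
    inputIds = (promptIds + targetIds)[:maxLength]
    # Only target positions contribute to the loss.
    labels = ([IGNORE_INDEX] * len(promptIds) + targetIds)[:maxLength]
    attentionMask = [1] * len(inputIds)
    return {
        'input_ids': inputIds,
        'attention_mask': attentionMask,
        'labels': labels,
    }

# Toy ids standing in for real tokenizer output.
example = buildCausalExample([11, 12, 13], [21, 22], maxLength=512)
```

With this layout the model is trained to continue the first sentence with the second, rather than to reproduce the first sentence itself.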

In [None]:
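from transformers import (
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

# 'llama' is a placeholder: replace it with the checkpoint id or local path
# that matches the tokenizer loaded above.
model = LlamaForCausalLM.from_pretrained('llama')

trainingArgs = TrainingArguments(
    output_dir='./fine-tuned-model',
    num_train_epochs=3,
    per_device_eval_batch_size=8,
    save_steps=100
)

trainer = Trainer(
    model=model,
    args=trainingArgs,
    train_dataset=tokenizedDataset,
    # Causal-LM collator (mlm=False): copies input_ids into labels so a loss can be computed.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)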
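
To get a feel for what these settings imply, the number of optimizer steps and periodic checkpoints can be estimated ahead of time. The sketch assumes a single device, no gradient accumulation, and the default train batch size of 8. The sentence count of 10,000 is a made-up figure, and both helpers are hypothetical.

```python
import math

def countTrainingSteps(numExamples, batchSize, numEpochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    stepsPerEpoch = math.ceil(numExamples / batchSize)
    return stepsPerEpoch * numEpochs

def countPeriodicCheckpoints(totalSteps, saveSteps):
    """Checkpoints written at every multiple of saveSteps."""
    return totalSteps // saveSteps

# Hypothetical dataset size; substitute len(tokenizedDataset) in the notebook.
totalSteps = countTrainingSteps(numExamples=10_000, batchSize=8, numEpochs=3)
checkpoints = countPeriodicCheckpoints(totalSteps, saveSteps=100)
```

For these illustrative numbers, training takes 3,750 steps and writes 37 periodic checkpoints, which is worth weighing against disk space for a model of this size.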

In [None]:
trainer.train()

In [None]:
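# evaluate() needs an evaluation set; reusing the training data is only a sanity check.
eval_results = trainer.evaluate(eval_dataset=tokenizedDataset)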

In [None]:
# model = 'bert-base-uncased'

# tokenizer = AutoTokenizer.from_pretrained(model)

# encodedInput = tokenizer(sentences, padding='max_length', truncation=True)

# inputsIds = encodedInput['input_ids']
# attentionMask = encodedInput['attention_mask']

Now we have to create a dataset so that PyTorch can process it.

### **Training**

### **Results**