# **Project #2 - Soccer Chatbot**

## **Fine-Tuning**

In this section, we fine-tune the Llama 3 model on PDFs from our knowledge database to improve the quality and accuracy of its responses.

### **Data Preparation**

First, we load the PDF into memory so that we can work with it.

In [None]:
import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_path = os.path.abspath('../docs/knowledge-database/documents/The ball is round.pdf')

loader = PyMuPDFLoader(pdf_path)
data = loader.load()

Once the PDF is in memory, we can process its contents into a form suitable for training.

First, we select the relevant pages. In this document, the relevant content runs from page 23 to page 987, which corresponds to the zero-based slice `22:987`.

In [None]:
data = data[22:987]
data
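
The slice bounds come from shifting a 1-based, inclusive page range onto a zero-based list. The helper below is a minimal sketch of that conversion, assuming the loader returns one document per page; `select_pages` is a hypothetical name, and the toy list only stands in for the loaded document.

```python
def select_pages(pages, first_page, last_page):
    """Return the items for 1-based pages first_page..last_page, inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Page n lives at index n - 1. The slice end is exclusive, so last_page
    # itself is already the correct stop value.
    return pages[first_page - 1:last_page]


# Toy stand-in for the loaded document: one string per page.
toy_pages = [f"page {n}" for n in range(1, 11)]
selected = select_pages(toy_pages, 3, 5)
print(selected)  # ['page 3', 'page 4', 'page 5']
```

With the real document, `select_pages(data, 23, 987)` would be equivalent to `data[22:987]`.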

Now we have to clean the data to remove irrelevant characters.

In [None]:
import re
import roman

# Check whether a string is a Roman numeral
def is_roman_numeral(s):
    try:
        roman.fromRoman(s)
        return True
    except roman.InvalidRomanNumeralError:
        return False

# Extract the text of each page
pages = [page.page_content for page in data]

# Split every page into individual lines
lines = []
for page in pages:
    lines.extend(page.split('\n'))

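# Clean the lines
cleaned_lines = []
for line in lines:
    temp = line.strip()

    # Skip empty lines and lines that hold only a page number
    # (Arabic digits or a Roman numeral)
    if temp == '' or temp.isdigit() or is_roman_numeral(temp):
        continue

    # Drop digits glued to a closing quote or a period (presumably
    # footnote markers). \d+ removes multi-digit markers in full; note
    # that the second pattern also truncates decimals such as "3.5".
    temp = re.sub(r"’\d+", "’", temp)
    temp = re.sub(r"\.\d+", ".", temp)

    cleaned_lines.append(temp)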

cleaned_lines

# Merge all the lines into a single string
cleaned_text = ' '.join(cleaned_lines)
# print(cleaned_text)
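
A pattern that removes digits after a period also matches decimals such as `3.5`. The sketch below shows one possible stricter variant, under the assumption that the stray digits are footnote markers and that a marker always follows a non-digit character. `strip_footnote_markers` is a hypothetical helper and the sample sentence is invented for illustration; neither is part of the pipeline above.

```python
import re

# A period or closing quote that follows a non-digit, non-space character and
# is itself followed by digits is treated as carrying a footnote marker.
FOOTNOTE_RE = re.compile(r"(?<=[^\d\s])([.’])\d+")


def strip_footnote_markers(text):
    """Remove digits glued to '.' or '’' while keeping decimals such as 3.5."""
    return FOOTNOTE_RE.sub(r"\1", text)


sample = "The crowd reached 3.5 million.12 He called it ‘the beautiful game’4 later."
print(strip_footnote_markers(sample))
```

The trade-off is that a marker placed directly after a number (for example a year at the end of a sentence) is left in place, so this variant favors keeping numeric data intact over removing every marker.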

Once the text is cleaned, we can tokenize it. In this case, we split it into sentences.

In [None]:
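import nltk

# Download the Punkt sentence tokenizer data.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')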

sentences = nltk.sent_tokenize(cleaned_text)

sentences

We also have to tokenize each sentence so that it can be used to fine-tune the model.

In [None]:
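# from transformers import AutoTokenizer

# model = "meta-llama/Meta-Llama-3-8B"

# tokenizer = AutoTokenizer.from_pretrained(model)
# # The Llama 3 tokenizer appears to ship without a padding token, so one
# # has to be set before padding; reusing the EOS token is a common choice.
# tokenizer.pad_token = tokenizer.eos_token

# tokenized_sentences = tokenizer(sentences, padding='max_length', truncation=True)

# tokenized_sentences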

Now we have to label the sentences. In this case, the label of each sentence is the sentence that follows it.

In [None]:


sentence_pairs = []
for i in range(len(sentences) - 1):
    sentence_pairs.append({
        'sentence': sentences[i],
        'label': sentences[i + 1]
    })

sentence_pairs
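
To make the labelling scheme concrete, the sketch below builds the same kind of pairs from three toy sentences invented for illustration. `make_pairs` is a hypothetical helper; it uses `zip` to walk each sentence alongside its successor, which is equivalent to the indexed loop.

```python
def make_pairs(sentences):
    """Pair every sentence with the one that follows it."""
    return [
        {"sentence": current, "label": following}
        for current, following in zip(sentences, sentences[1:])
    ]


toy_sentences = [
    "The ball is round.",
    "The game lasts ninety minutes.",
    "Everything else is theory.",
]
for pair in make_pairs(toy_sentences):
    print(pair)
```

A list of n sentences yields n - 1 pairs, because the last sentence has no successor to serve as its label.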

In [None]:
import pandas as pd

df = pd.DataFrame(sentence_pairs)

df

### **Datasets**

### **Training**

### **Results**