## Dependencies installation

We install the main dependencies used throughout this notebook.

In [6]:
!pip install -q -r ../requirements.txt

## Load the Google API

In [4]:
import os, getpass
from dotenv import load_dotenv

# load the variables/keys from the .env file
dotenv_loaded = load_dotenv()

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Google API Key was not set properly, please share it here: ")
    
# check that the key was loaded correctly
if os.environ["GOOGLE_API_KEY"] == "":
    print("'GOOGLE_API_KEY' wasn't set correctly. Please make sure the keys/variables are accessible")

Let's run a quick test to confirm that the API key is valid.

In [5]:
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Dime quien es el máximo goleador vasco de LaLiga en la temporada 2020-2021",
)

print(response.text)

El máximo goleador vasco de LaLiga en la temporada 2020-2021 fue **Mikel Oyarzabal**, de la Real Sociedad, con **11 goles**.


## Data Preprocessing

We will load the city data and later split it into chunks so it is more manageable. First, we write a function that splits each document into pages before chunking.

In [18]:
from PyPDF2 import PdfReader


def document_reader(path, doc_name):
    """Read a PDF and return one dict per page with its text and location."""

    # Build the full path and open the PDF
    doc_path = os.path.join(path, doc_name)
    pdf_reader = PdfReader(doc_path)

    global_text = []
    for i, page in enumerate(pdf_reader.pages, start=1):
        # fall back to an empty string if no text is extracted from the page
        text = page.extract_text() or ""

        global_text.append({
            "page_content": text.strip(),  # strip leading/trailing whitespace and line breaks
            "doc_ubication": {"document": doc_name, "page": i}
        })

    return global_text

Now, using the function above, we merge the pages of every document into a single list so that they are all together and easy to work with.

In [None]:
def load_pdfs(data_dir="../data/"):
    documents = []
    # iterate over the files in the directory and apply document_reader to each one
    for file in os.listdir(data_dir):
        docs = document_reader(data_dir, file)
        documents.extend(docs)
    return documents

documents = load_pdfs("../data/")
print(len(documents), documents[0]["doc_ubication"])
print(documents[-1]["doc_ubication"])


137 {'document': 'BARCELONA.pdf', 'page': 1}
{'document': 'VALENCIA.pdf', 'page': 16}
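
The loop above hands every entry in the data directory to `document_reader`, so a stray non-PDF file would reach `PdfReader`. As a minimal sketch, assuming the folder should only contribute files with a `.pdf` extension, a hypothetical helper `list_pdf_files` (not part of the original notebook) could filter and sort the file names first:

```python
from pathlib import Path


def list_pdf_files(data_dir):
    """Return the sorted names of the PDF files directly inside data_dir.

    Hypothetical helper, not part of the original notebook.
    """
    return sorted(
        entry.name
        for entry in Path(data_dir).iterdir()
        # keep regular files only, matching the extension case-insensitively
        if entry.is_file() and entry.suffix.lower() == ".pdf"
    )
```

Inside `load_pdfs`, the loop could then iterate over `list_pdf_files(data_dir)` instead of `os.listdir(...)`. Sorting also makes the page order reproducible across runs, since `os.listdir` returns entries in arbitrary order.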


Now we will split everything into chunks. Each document has around 40,000 characters, which adds up to a very large amount of text across the whole data folder. That much text is also inconvenient to place in a model's context window: models can struggle to find information in excessively long inputs, and every request becomes more expensive.
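
To check the size of each document on your own data, the page lengths can be summed per source file. The following is a small sketch, assuming pages are dicts in the shape returned by `document_reader`; the function name `characters_per_document` and the sample pages are made up for illustration:

```python
from collections import defaultdict


def characters_per_document(pages):
    """Sum the length of page_content for each source document."""
    totals = defaultdict(int)
    for page in pages:
        totals[page["doc_ubication"]["document"]] += len(page["page_content"])
    return dict(totals)


# Tiny made-up sample in the same shape that document_reader returns.
sample_pages = [
    {"page_content": "a" * 300, "doc_ubication": {"document": "A.pdf", "page": 1}},
    {"page_content": "b" * 200, "doc_ubication": {"document": "A.pdf", "page": 2}},
    {"page_content": "c" * 100, "doc_ubication": {"document": "B.pdf", "page": 1}},
]
print(characters_per_document(sample_pages))  # {'A.pdf': 500, 'B.pdf': 100}
```

Calling `characters_per_document(documents)` on the list built earlier would give the real totals for each PDF.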

That's why we will use `RecursiveCharacterTextSplitter`, which splits the text recursively, trying each separator in order, until the pieces fit the `chunk_size` we choose.
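
To build some intuition for the recursive strategy, here is a simplified pure-Python illustration. It is not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores `chunk_overlap` and `add_start_index`, and it assumes `chunk_size` is a positive integer. It only shows the core idea of splitting on the coarsest separator that appears, merging pieces while they fit, and retrying oversized pieces with finer separators.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified illustration of recursive splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that occurs in the text; "" always matches.
    for index, separator in enumerate(separators):
        if separator == "" or separator in text:
            break
    else:
        # No usable separator left: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    if separator == "":
        # Last resort: cut the text every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    remaining = separators[index + 1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb cccc\n\ndddd eeee", chunk_size=10))
# ['aaaa bbbb', 'cccc', 'dddd eeee']
```

In the example, the paragraph break is tried first; the first paragraph is still longer than 10 characters, so it is split again on spaces, while the second paragraph already fits and is kept whole.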

In [23]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    add_start_index=True
)

docs = [Document(page_content=d["page_content"], metadata=d["doc_ubication"]) for d in documents]

all_splits = text_splitter.split_documents(docs)

print(f"Total splits: {len(all_splits)}")

# Show the first split.
print(f"First split content:\n{all_splits[0]}\n")

Total splits: 223
First split content:
page_content='www.spain.infoBarcelona' metadata={'document': 'BARCELONA.pdf', 'page': 1, 'start_index': 0}
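
The first split shown above is only a short page header, so it can be worth checking how chunk sizes are distributed before moving on. Below is a minimal sketch, assuming each split exposes `page_content` and `metadata` attributes as LangChain `Document` objects do; the function `summarize_splits` is hypothetical and the sample objects are stand-ins with made-up content:

```python
from collections import Counter
from types import SimpleNamespace


def summarize_splits(splits):
    """Return chunk counts per document plus the shortest and longest chunk length."""
    lengths = [len(split.page_content) for split in splits]
    per_document = Counter(split.metadata["document"] for split in splits)
    return {
        "per_document": dict(per_document),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Stand-in objects with made-up content, mimicking the attributes of a Document.
sample_splits = [
    SimpleNamespace(page_content="www.spain.infoBarcelona",
                    metadata={"document": "BARCELONA.pdf", "page": 1}),
    SimpleNamespace(page_content="x" * 950,
                    metadata={"document": "BARCELONA.pdf", "page": 2}),
    SimpleNamespace(page_content="y" * 400,
                    metadata={"document": "VALENCIA.pdf", "page": 1}),
]
print(summarize_splits(sample_splits))
```

Running `summarize_splits(all_splits)` on the real splits would show whether any chunk exceeds the configured `chunk_size` and how many very short chunks (such as headers) each PDF produces.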

