## Dependencies installation

We install the main dependencies that will be used alongside this notebook.

In [10]:
%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Load the Google API

In [2]:
import os, getpass
from dotenv import load_dotenv

# cargamos las variables/claves desde el .env
dotenv_loaded = load_dotenv()

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Google API Key was not set properly, please share it here: ")
    
# comprobamos que se han cargado correctamente
if os.environ["GOOGLE_API_KEY"]=="":
    print("'GOOGLE_API_KEY' wasn't set correctly. Please make sure the keys/variables are accesible")

Let's make a little trial to ensure the API key is valid.

In [3]:
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Dime quien es el máximo goleador vasco de LaLiga en la temporada 2020-2021",
)

print(response.text)

El máximo goleador vasco de LaLiga en la temporada 2020-2021 fue **Mikel Oyarzabal**, de la Real Sociedad, con **11 goles**.


## Data Preprocessing

We will load the cities data so we can work on it and later divide it in chunks so it's more manageable. First we will code a function to help us divide each of the documents in pages before applying the chunks.

In [4]:
from PyPDF2 import PdfReader


def document_reader(path, doc_name):
    
    # Abrimos el archivo para leerlo de forma binaria
    doc_path = path + doc_name
    pdf_reader = PdfReader(doc_path)
    
    global_text = []
    for i, page in enumerate(pdf_reader.pages, start=1):
        text = page.extract_text()
        # global_text[pdf_reader.pages[i].extract_text()] = {f"Page {i+1}": f"{doc_name}"}
        
        global_text.append({
            "page_content": text.strip(), # limpia espacios sobrentes y los saltos de linea
            "doc_ubication": { "document": doc_name, "page": i }
        })
        
    return global_text
    
    

Now, by using function above we will create a list mixing everything in an only list so we have all the chunks/pages together and can operate easily.

In [5]:
def load_pdfs(data_dir="../data/"):
    documents = []
    for file in os.listdir("../data/"): # recorremos la lista de archivos en el directorio y aplicamos document_reader a cada uno de ellos
        docs = document_reader(data_dir, file)
        documents.extend(docs)
    return documents

documents = load_pdfs("../data/")
print(len(documents), documents[0]["doc_ubication"])
print(documents[-1]["doc_ubication"])


137 {'document': 'BARCELONA.pdf', 'page': 1}
{'document': 'VALENCIA.pdf', 'page': 16}


Now we will split up everything on chunks. Each document will be having around of 40.000 characters what is a extremely large quantity if we take the whole sum of characters for every document on the data folder. Furthermore, it is not very convenient for adding them to the context window of some models, it may be difficult for these models to find the information in excessively long inputs (not to mention the increased cost of each request to the model...). 

That's why we will use `RecursiveCharacterTextSplitter` to divide the format following a recursive strategy in the chunk_size we decide.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema import Document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # separators=["\n\n", "\n", ". ", " ", ""],
    add_start_index=True
)

docs = [Document(page_content=d["page_content"], metadata=d["doc_ubication"]) for d in documents]

docs_splitted = text_splitter.split_documents(docs)

print(f"Total splits: {len(docs_splitted)}")

# Mostramos el primer split.
print(f"First split content:\n{docs_splitted[0]}\n")

Total splits: 223
First split content:
page_content='www.spain.infoBarcelona' metadata={'document': 'BARCELONA.pdf', 'page': 1, 'start_index': 0}



In [7]:
print(f"Second split content:\n{docs_splitted[1].page_content}\n")

Second split content:
2
 Introducción 3
Vive Barcelona: principales zonas  4
  El Born 
  Gràcia y L´Eixample 
  Barrio Gótico 
  El Raval 
  Montjuïc 
  Plaça de Espanya 
  La Rambla 
  Basílica de la Sagrada Familia 
  Les Corts y Pedralbes 
Cultura  8
  Museos 
  Centros de exposiciones 
Saborea Barcelona  10
Barcelona en cada estación 12
  Verano 
  Otoño 
  Invierno 
  Primavera 
Playas 14
Cinco planes para disfrutar  15 
en familia
  PortAventura World 
  Parque de Atracciones Tibidabo 
  L ’Aquàrium 
  Zoo de Barcelona 
  Museu de la Música  
  Las Golondrinas La ciudad escondida 16
  Parques y jardines  
  Museos secretos 
  Monumentos  
  Los tejados de Barcelona 
Vivir la noche en Barcelona  19
Rutas y paseos por la ciudad 20
  Ruta romana 
  Ruta medieval 
  Ruta modernista 
  Ruta Gaudí 
  Ruta Miró 
  Ruta Picasso 
¿Qué visitar cerca de Barcelona? 23
  Ciudades y lugares de interés 
  Naturaleza 
¿Cómo llegar? 25
  AVE 
  Aeropuerto 
  Coche 
  Moverte por Barcelona



Now that we have all the chunks, it's time to apply the embedding to them and to save them inside a vector store. This is useful for searching similarities between vectors and to determine wich of them are the most relevant to the question asked. We will use a vector store defined in memory and to generate the embeddings we will be using google embedding models.

In [8]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [9]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
document_ids = vector_store.add_documents(documents=docs_splitted)

print(document_ids[:3])

['370b4d1e-559b-4524-8e50-4d237f7534bd', 'ca83d60f-10c9-4f52-8843-81426b09d40c', '733086af-70cc-43ed-b250-bebb3d620466']


## Retrieval and Generation

With the embeddings already obtained and saved in the vector store now we will develop the model that will bring the more relevant parts of the text and give it to the LLM in order to give a supported answer.

In [44]:
from langchain_core.prompts import PromptTemplate

rag_prompt = PromptTemplate.from_template(
    "You are a precise tourist guide. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Always cite sources as as provided in the context. \n" +
    "Question: {question}\n" +
    "Context: {context}\n" +
    "Answer: "
)

In [None]:
from langchain.chat_models import init_chat_model

tourist_guide_llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

In [51]:
question = "que puedo ver en Bilbao?"

retrieved_docs = vector_store.similarity_search(question, k=4)
# docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

print(f"Retrieved context:\n{retrieved_docs}\n")

Retrieved context:
[Document(id='3d04e712-d8d1-4920-903b-597fb6946210', metadata={'document': 'BILBAO.pdf', 'page': 1, 'start_index': 0}, page_content='www.spain.infoBilbao'), Document(id='e7a4c4d3-6341-46fe-b092-6b5f8033af90', metadata={'document': 'BARCELONA.pdf', 'page': 1, 'start_index': 0}, page_content='www.spain.infoBarcelona'), Document(id='e5e120a3-7a19-44a2-be09-ffe4ad332ae3', metadata={'document': 'TENERIFE.pdf', 'page': 2, 'start_index': 0}, page_content='o Auditorio  de Tenerife  [vídeo  - ubicación ] \n \n \no Plaza  de España [ vídeo  - ubicación ]'), Document(id='f23cd4fc-bbc5-4af3-aa4b-ad3c1ca107d5', metadata={'document': 'BILBAO.pdf', 'page': 8, 'start_index': 814}, page_content='quiat... T ambién hay espacio para escul -\nturas de artistas vascos como Eduardo \nChillida y Jorge Oteiza. \n L www.guggenheim-bilbao.eus\nSi buscas una pinacoteca más clásica, \nla colección del Museo de Bellas Artes \nes una de las más importantes de toda \nEspaña. Viaja en el tiempo a tr

In [49]:
def format_docs_with_citations(docs):
    parts = []
    for d in docs:
        src = d.metadata.get("source", "unknown").split("/")[-1]
        page = d.metadata.get("page", "NA")
        parts.append(f"{d.page_content}\n[Source: {src} Page: {page}]")
    return "\n\n---\n\n".join(parts)


In [53]:
context = format_docs_with_citations(retrieved_docs)
input = rag_prompt.invoke({"question": question, "context": context})
print(f"Input for the LLM:\n{input.text}\n")

Input for the LLM:
You are a precise tourist guide. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Always cite sources as as provided in the context. 
Question: que puedo ver en Bilbao?
Context: www.spain.infoBilbao
[Source: unknown Page: 1]

---

www.spain.infoBarcelona
[Source: unknown Page: 1]

---

o Auditorio  de Tenerife  [vídeo  - ubicación ] 
 
 
o Plaza  de España [ vídeo  - ubicación ]
[Source: unknown Page: 2]

---

quiat... T ambién hay espacio para escul -
turas de artistas vascos como Eduardo 
Chillida y Jorge Oteiza. 
 L www.guggenheim-bilbao.eus
Si buscas una pinacoteca más clásica, 
la colección del Museo de Bellas Artes 
es una de las más importantes de toda 
España. Viaja en el tiempo a través de 
piezas que cubren diversas manifesta -
ciones artísticas, desde el siglo XIII hasta 
nuestros días. Más de 10 000 objetos, 
entre los 

In [54]:
answer = tourist_guide_llm.invoke(input)
print(f"LLM answer:\n{answer.content}\n")

LLM answer:
In Bilbao, you can visit the Guggenheim Museum, which features contemporary art, and the Museo de Bellas Artes, which houses a collection of classical art (www.guggenheim-bilbao.eus, www.bilbaomuseoa.eus). The Museo de Bellas Artes contains over 10,000 objects from the 13th century to the present day [Source: unknown Page: 8].



## App Development

Now after trying each step separatedly we can build the chat by combining the retrieval part with the LLM use.

In [56]:
def ask_tguide(question: str, prompt, vector_store):
    
    retrieved_docs = vector_store.similarity_search(question, k=4)
    context = format_docs_with_citations(retrieved_docs)
    input = prompt.invoke({"question": question, "context": context})
    return tourist_guide_llm.invoke(input).content


print(ask_tguide("¿Me propones un itinerario de 1 día por Tenerife norte?", rag_prompt, vector_store))

For a day trip in northern Tenerife, I suggest visiting Puerto de la Cruz, where you can walk from the pier to Playa Martiánez, passing Plaza de Europa and Lago de Martiánez (Source: unknown Page: 11). You can also explore the Auditorio de Tenerife and Plaza de España (Source: unknown Page: 2). Remember to be cautious of the strong currents if you visit Playa del Ancón (Source: unknown Page: 11).


In [None]:
from langchain_core.documents import Document
from typing_extensions import List, TypedDict
from typing import Literal
from typing_extensions import Annotated
from langgraph.graph import START, StateGraph
from IPython.display import Image, display
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

class Search(TypedDict):
    """Search query."""
    query: Annotated[str, ..., "Search query to run."]
    
class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str

def analyze_query(state: State):
    structured_llm = llm.with_structured_output(Search)
    query = structured_llm.invoke(state["question"])
    return {"query": query}

def retrieve(state: State):
    query = state["query"]["query"]
    
    # usamos similarity search + rearnking con cross encoder porque está demostrado que para mayor precisión usar este es optimo
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 30})
    cross_encoder = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
    reranker = CrossEncoderReranker(model=cross_encoder, top_n=4)
    retriever = ContextualCompressionRetriever(base_retriever=base_retriever, base_compressor=reranker)
    retrieved_docs = retriever.get_relevant_documents(query)
    # retrieved_docs = vector_store.similarity_search(query["query"], k=4)
    
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()