# Setup

In [1]:
!pip install gradio PyMuPDF sentence-transformers transformers

In [2]:
import fitz  # PyMuPDF for PDF processing
from sentence_transformers import SentenceTransformer, util

# Load a model for creating embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

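def pdf_to_text(pdf_file):
    # Accept a path string (what gr.File passes with type="filepath")
    # or any object exposing the path as .name (e.g. an open fitz document)
    path = pdf_file if isinstance(pdf_file, str) else pdf_file.name
    text = ""
    with fitz.open(path) as doc:
        for page in doc:
            text += page.get_text()
    return text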

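def create_chunks(text, chunk_size=300):
    """Group period-delimited pieces of text into chunks of about chunk_size characters."""
    sentences = text.split('.')
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Keep adding sentences while the chunk stays under the size limit
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + "."
        else:
            # Skip the empty chunk produced when the very first sentence is too long
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + "."

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks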


def retrieve_from_chunks(query, chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    top_results = util.semantic_search(query_embedding, embeddings, top_k=3)
    return top_results[0]
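
`util.semantic_search` ranks the chunk embeddings by cosine similarity to the query embedding (its default score function) and returns, for each query, a list of dictionaries with the keys `corpus_id` and `score`. The sketch below reproduces that ranking in plain Python on made-up 2-D vectors, so the result shape is easy to inspect. The helper names `cosine_similarity` and `top_k_by_cosine` are hypothetical and are not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, corpus_vecs, top_k=3):
    """Rank corpus vectors by similarity to the query, highest score first."""
    hits = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Toy 2-D "embeddings", made up for illustration
# (all-MiniLM-L6-v2 produces 384-dimensional vectors)
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]

hits = top_k_by_cosine(query, corpus, top_k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 4))
```

The vector pointing in nearly the same direction as the query comes first, which is why `retrieve_from_chunks` can use `corpus_id` directly as an index into `chunks`.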

# 1. Retrieval

1. Find the file https://www.bsm.upf.edu/documents/BSM_Admission_Pack_es.pdf

2. Download the file and upload it to Colab.

3. Use the functions defined in the previous section to split the PDF into chunks. Explore the contents of the chunks. How many chunks are there?

4. Use the query = "Bancos que dan financiación para estudiantes". Which chunks are relevant?

5. Read the document and try other queries.
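
Before running step 3 on the full PDF, it can help to see what the chunking does on a short string. The sketch below repeats the chunking logic from the Setup section so that it runs on its own; the sample text and the small `chunk_size=25` are made up for illustration.

```python
def create_chunks(text, chunk_size=300):
    # Copy of the chunking logic from Setup, repeated so this sketch is self-contained
    sentences = text.split('.')
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + "."
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + "."
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Made-up sample text; chunk_size is measured in characters, not tokens
sample = "1.2. Grants and aid. The school offers grants. Check the website"
toy_chunks = create_chunks(sample, chunk_size=25)

print(f"Number of chunks: {len(toy_chunks)}")
for i, chunk in enumerate(toy_chunks):
    print(f"Chunk {i + 1}: {chunk!r}")
```

Two details are worth noticing. The text is split on every period, so section numbers such as `1.2.` are treated as sentence boundaries and can end up cut between chunks. A period is also re-attached to every piece, so the last chunk ends with `.` even though the sample does not.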

In [4]:
path_to_pdf = "/content/BSM_Admission_Pack_es.pdf"
pdf_file = fitz.open(path_to_pdf)

In [6]:
# Extract the full text of the PDF
text = pdf_to_text(pdf_file)

# Split the text into chunks of roughly 300 characters
chunk_size = 300
chunks = create_chunks(text=text, chunk_size=chunk_size)
print(f"Number of chunks created: {len(chunks)}")

Number of chunks created: 69


In [7]:
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i+1}: {chunk}\n")

Chunk 1: Admission Pack
The Science of Business
Barcelona
School Of 
Management
{01} 
Ayudas económicas y matriculación
1.1. Cómo matricularte
1.2. Becas y ayudas
1.2.1. Becas Talento
1.2.2. Ayudas específicas por programa 
1.2.3. Financiación externa
1.3. Descuentos especiales
1.3.1.

Chunk 2: Asociaciones de antiguos alumnos
1.3.2. Otros descuentos
1.4. Entidades financieras colaboradoras
{02}
Requisitos administrativos
2.1. Presentación de documentación original
2.2. Envío de cartas originales
{03}
Servicios al estudiante
3.1. Antes del inicio del curso
3.1.1. Servicio de Bienvenida
3.

Chunk 3: 1.1.1. Seguro médico
3.1.1.2. Trámites legales
3.1.2. Opciones de alojamiento
3.1.3. Nuestros campus
3.2. Durante el curso académico
3.2.1. Coordinación del programa
3.2.2. Aprendizaje de idiomas
3.2.3. Servicio de Carreras Profesionales
3.2.3.1. Prácticas profesionales
3.2.3.2.

Chunk 4: Programa de desarrollo 
profesional
{04}
Condiciones generales 2013-2014
El Admission Pack es un documen

In [8]:
query = "Bancos que dan financiación para estudiantes"
relevant_chunks = retrieve_from_chunks(query, chunks)

# Print each hit with its 1-based chunk number and similarity score
print("Relevant chunks:")
for result in relevant_chunks:
    index = result['corpus_id']
    score = result['score']
    print(f"Chunk {index + 1} (Score: {score}): {chunks[index]}\n")

Relevant chunks:
Chunk 30 (Score: 0.6519840359687805): > Banco Bilbao Vizcaya Argentaria (BBVA)
    Condiciones generales
> CatalunyaCaixa 
Condiciones generales 
> SabadellAtlántico
Condiciones generales 
> La Caixa
Condiciones generales 
Puedes consultar el apartado “Entidades donde 
solicitar becas y financiación” en la pestaña 
“Servicios” de nuestra web para encontrar 
información adicional sobre las opciones de 
financiación.

Chunk 28 (Score: 0.6266148686408997): Entidades financieras colaboradoras
La Barcelona School of Management colabora 
con varios bancos y entidades financieras que 
conceden a los estudiantes condiciones de préstamo 
favorables.

Chunk 9 (Score: 0.5332677364349365): docs@bsm.upf.edu. Por 
favor, indica tu nombre y apellidos y el código del 
programa en el asunto del correo electrónico.
• Tarjeta de crédito/débito
El pago también se puede efectuar con tarjeta de 
crédito o débito.



Chunk from section: 3.2.2. Aprendizaje de idiomas

In [None]:
query = "Los estudiantes que deseen realizar cursos"
relevant_chunks = retrieve_from_chunks(query, chunks)

# Print each hit with its 1-based chunk number and similarity score
print("Relevant chunks:")
for result in relevant_chunks:
    index = result['corpus_id']
    score = result['score']
    print(f"Chunk {index + 1} (Score: {score}): {chunks[index]}\n")

Chunk from section: “Vivir en Barcelona”

In [9]:
query = '“Vivir en Barcelona”'
relevant_chunks = retrieve_from_chunks(query, chunks)

# Print each hit with its 1-based chunk number and similarity score
print("Relevant chunks:")
for result in relevant_chunks:
    index = result['corpus_id']
    score = result['score']
    print(f"Chunk {index + 1} (Score: {score}): {chunks[index]}\n")

Relevant chunks:
Chunk 47 (Score: 0.5485602021217346): Para resolver 
cualquier pregunta sobre la estancia en Barcelona, 
se puede consultar el apartado “Vivir en Barcelona” 
de la pestaña “Servicios” de nuestra web o 
contactar con el Servicio de Bienvenida a través de 
la siguiente dirección: abroad@bsm.upf.edu
3.1.1.1.

Chunk 34 (Score: 0.4710226058959961): Ordenación Académica
C/ Balmes 132-134
08008 Barcelona - Spain
Es importante que recuerdes que si la titulación 
universitaria y el expediente académico están 
escritos en un idioma que no sea español, catalán, 
inglés, francés, italiano o portugués, deberán ir 
acompañados por una traducción jurada oficial 
al español o al catalán.

Chunk 54 (Score: 0.4321710467338562): Para obtener un 
visado, el estudiante deberá cumplir todos los 
requisitos de inmigración en España, que pueden 
cambiar de año en año.



# 2. Demo

Create a demo that takes a PDF file (gr.File()) and a query as input and returns the relevant chunks with their scores.
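
The demo callback has two halves: retrieving the hits and turning them into text for the output box. The formatting half can be sketched and checked without a model or a PDF. The function name `format_hits` and the example data below are hypothetical; the hit dictionaries only mimic the `corpus_id` and `score` keys returned by `retrieve_from_chunks`.

```python
def format_hits(hits, chunks):
    """Turn retrieval hits into a readable text block for a text output."""
    lines = []
    for hit in hits:
        index = hit["corpus_id"]
        # Chunk numbers are shown 1-based; scores are rounded to 4 decimals
        lines.append(f"Chunk {index + 1} (Score: {hit['score']:.4f}): {chunks[index]}")
    return "\n\n".join(lines) if lines else "No relevant chunks found."

# Hypothetical chunks and hits, made up for illustration
example_chunks = ["Enrollment steps.", "Partner banks offer student loans.", "Campus map."]
example_hits = [{"corpus_id": 1, "score": 0.65198}, {"corpus_id": 0, "score": 0.2}]

report = format_hits(example_hits, example_chunks)
print(report)
```

Keeping the formatting separate from retrieval makes it possible to test the output text on its own before wiring the function into `gr.Interface`.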

In [11]:
import fitz  # PyMuPDF for PDF processing
from sentence_transformers import SentenceTransformer, util
import gradio as gr

# Load a model for creating embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

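def pdf_to_text(pdf_file):
    # gr.File(type="filepath") passes a plain path string, which has no .name;
    # fall back to .name only for file-like objects
    path = pdf_file if isinstance(pdf_file, str) else pdf_file.name
    text = ""
    with fitz.open(path) as doc:
        for page in doc:
            text += page.get_text()
    return text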

def create_chunks(text, chunk_size=300):
    sentences = text.split('.')
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + "."
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + "."

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def retrieve_from_chunks(query, chunks):
    embeddings = model.encode(chunks, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    top_results = util.semantic_search(query_embedding, embeddings, top_k=3)
    return top_results[0]

def generate_chunks_pdf(pdf_file, query):
    text = pdf_to_text(pdf_file)
    chunks = create_chunks(text)
    relevant_chunks = retrieve_from_chunks(query, chunks)

    # Format the output to show the relevant chunks and their scores
    output = ""
    for result in relevant_chunks:
        index = result['corpus_id']
        score = result['score']
        output += f"Chunk {index + 1} (Score: {score:.4f}): {chunks[index]}\n\n"

    return output if output else "No relevant chunks found."

# Gradio interface
demo = gr.Interface(
    title="Return chunks from PDF file",
    description="Upload a PDF file and enter a query to get the most relevant chunks with their scores.",
    fn=generate_chunks_pdf,  # Function to wrap a user interface (UI) around
    inputs=[
        gr.File(label="PDF", type="filepath"),
        gr.Textbox(label="Query")
    ],
    outputs="text"
)

demo.launch()

In [12]:
import gradio as gr

# The main function for retrieval
def retrieve(pdf_file, query):
    text = pdf_to_text(pdf_file)
    chunks = create_chunks(text)
    top_results = retrieve_from_chunks(query, chunks)

    formatted_results = []
    for i, result in enumerate(top_results):
        score = result['score']
        chunk = chunks[result['corpus_id']]

        formatted_results.append(f"🎯 **Result {i+1}:** (Score: {score:.4f})\n---\n{chunk}\n---\n")

    formatted_result = f"📊 Total chunks searched: {len(chunks)}\n\n-----------------------------------------\n\n" + "\n\n-----------------------------------------\n\n".join(formatted_results)
    return formatted_result

# Define Gradio UI
demo = gr.Interface(
    fn=retrieve,
    inputs=[gr.File(label="Upload your PDF file"), gr.Textbox(label="Enter your query")],
    outputs=gr.Textbox(label="Top 3 Relevant Results"),
    title="📚 Simple RAG Demo",
    description="Upload a PDF and enter a query. The app retrieves the top 3 most relevant chunks using semantic search, the retrieval step of RAG (Retrieval-Augmented Generation)."
)

demo.launch()
