# Semantic Search and Question Answering (QA)

In this notebook, we introduce semantic search and question-answering (QA) systems using [`sentence-transformers`](https://sbert.net/), a Python library built on top of the [Hugging Face](https://huggingface.co/) ecosystem for state-of-the-art sentence, text, and image embeddings. These embeddings are useful for semantic similarity tasks such as information retrieval and question answering.

*We begin working with so-called transformers.*

In [1]:
# Install the sentence-transformers library
# !pip install -U sentence-transformers

In [2]:
import json  # Read and write JSON data
from sentence_transformers import SentenceTransformer, CrossEncoder, util
# SentenceTransformer: converts sentences into numeric vectors (embeddings).
# CrossEncoder: re-ranks results by scoring each pair of sentences jointly.
# util: helper functions for semantic search and similarity computation.
from sklearn.metrics.pairwise import cosine_similarity  # Measures how similar two vectors are (cosine of the angle between them)
import numpy as np
import pandas as pd
import time  # Measure code execution time
import gzip  # Compress or decompress files on the fly
import os  # Interact with the operating system



We will use a pre-trained Sentence Transformer model to generate sentence embeddings. Many pre-trained models are available [here](https://www.sbert.net/docs/pretrained_models.html).

In [3]:
model_name = 'all-MiniLM-L6-v2'  # Name of the pre-trained model to download from the Hugging Face Hub
model = SentenceTransformer(model_name)  # Load the model (downloaded on first use)

For our semantic search and question-answering task, we need a list of documents or paragraphs in which to look for relevant information.

In [4]:

paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
    "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra."
]  # Define a few sample paragraphs

paragraphs = np.array(paragraphs)  # Convert to a NumPy array so it can be indexed with an array of indices

In [5]:
corpus_embeddings = model.encode(paragraphs)  # Encode each paragraph into a vector
print(corpus_embeddings.shape)  # (number of paragraphs, embedding dimension)

(5, 384)


Now let's define a function that performs semantic search, given a query and the embeddings of a list of paragraphs.

In [6]:
# Semantic search: print the top_k paragraphs closest to the query
def semantic_search(query, model, corpus_embeddings, paragraphs, top_k=2):
    query_embedding = model.encode([query])[0]  # Encode the query into a vector
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]  # Cosine similarity between the query and every paragraph
    indexes = np.argpartition(similarities, -top_k)[-top_k:]  # Indices of the top_k highest scores (unordered)
    indexes = indexes[np.argsort(-similarities[indexes])]  # Sort those indices by descending similarity
    print(f"Input query: {query}")  # Print the query
    print()
    for text, sim in zip(list(paragraphs[indexes]), similarities[indexes].tolist()):
        print(f"{sim:.3f}\t{text}")  # Print each score and its paragraph


semantic_search('Where is the Colosseum', model, corpus_embeddings, paragraphs, top_k=2)

Input query: Where is the Colosseum

0.801	The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.
0.226	The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra.
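
The ranking step inside `semantic_search` comes down to cosine similarity followed by a top-k selection. Here is a minimal standard-library sketch of that idea. The two-dimensional vectors are made-up toy values, not real embeddings, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k most similar corpus vectors."""
    scores = [cosine(query_vec, v) for v in corpus_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy data (assumption): three 2-D "embeddings" and one query vector.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]

ranked = top_k(toy_query, toy_corpus, k=2)
for index, score in ranked:
    print(f"{score:.3f}\tvector {index}")
```

A full sort is used here for readability. The notebook's `np.argpartition` call gives the same result and avoids sorting the whole score array, which matters more on a large corpus.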


In [7]:
# This gives the similarity between the query and each of the previously defined paragraphs

## Multilingual models

On [Hugging Face](https://huggingface.co/) we can find more kinds of transfer-learning models and pick the one that best suits our needs.

In [8]:
# This still uses the previously loaded model, 'all-MiniLM-L6-v2'
semantic_search('¿Dónde está el Coliseo?', model, corpus_embeddings, paragraphs, top_k=2)

Input query: ¿Dónde está el Coliseo?

0.086	The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.
0.067	The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra.


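As the output above shows, the Colosseum paragraph does not even make the top 2 for the Spanish query, presumably because this model was trained mostly on English text. Multilingual models are available [here](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

We can also find other models on [Hugging Face](https://huggingface.co/).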

In [9]:
# Let's use another model from the Hugging Face Hub
model_name = 'BAAI/bge-m3'  # Name of the chosen model
multi_model = SentenceTransformer(model_name)  # Download and load the chosen model

In [10]:
# multi_model is a pre-trained SentenceTransformer model
# .encode() turns the texts in 'paragraphs' into numeric vectors
multi_corpus_embeddings = multi_model.encode(paragraphs)
# .shape gives the dimensions of the resulting NumPy array:
# how many paragraphs were encoded and the size of each vector
print(multi_corpus_embeddings.shape)

(5, 1024)


In [None]:
# 5 vectors of 1024 dimensions each

In [12]:
semantic_search(
    '¿Dónde está el Coliseo?',  # 1. The query, in natural language.
    multi_model,  # 2. The model that encodes the query into a vector.
    multi_corpus_embeddings,  # 3. The paragraph embeddings computed above.
    paragraphs,  # 4. The original texts, so the matches can be displayed.
    top_k=2  # 5. Return only the 2 best results.
    )

Input query: ¿Dónde está el Coliseo?

0.559	The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.
0.424	The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.


### Semantic search on Wikipedia

As a dataset, we use Simple English Wikipedia. Compared with the full English Wikipedia, it has only about 170,000 articles. The articles have been split into paragraphs.

In [13]:
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'  # Local path where the downloaded file is stored

if not os.path.exists(wikipedia_filepath):  # Download only if the file is not already present
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []  # Collects every text passage

with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:  # Open the gzip-compressed file in text-read mode ('rt') as UTF-8
    for line in fIn:  # Each line is one JSON-encoded article
        data = json.loads(line.strip())  # Parse the line into a Python dict
        for paragraph in data['paragraphs']:  # Iterate over the article's paragraphs
            passages.append(data['title']+':  '+ paragraph)  # Prefix each paragraph with the article title

# Show the number of passages and the first two
print("Passages:", len(passages))
print(passages[0])
print(passages[1])

Passages: 509663
Ted Cassidy:  Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".
Aileen Wuornos:  Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956 – October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.
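
When only a small sample of the corpus is needed, the passages can also be read lazily instead of loading every one into a list first. The sketch below assumes the same record layout as the file above (one JSON object per line with `title` and `paragraphs` keys). The generator name `iter_passages` and the tiny in-memory archive are hypothetical and only there to make the example self-contained.

```python
import gzip
import io
import json
from itertools import islice


def iter_passages(fileobj):
    """Yield 'title:  paragraph' strings one at a time from a JSONL text stream."""
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        record = json.loads(line)
        for paragraph in record["paragraphs"]:
            yield record["title"] + ":  " + paragraph


# Build a toy gzip-compressed JSONL archive in memory (made-up records).
toy_records = [
    {"title": "A", "paragraphs": ["first", "second"]},
    {"title": "B", "paragraphs": ["third"]},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf8") as f_out:
    for rec in toy_records:
        f_out.write(json.dumps(rec) + "\n")
buffer.seek(0)

# Read only the first two passages; the rest of the stream is never parsed.
with gzip.open(buffer, "rt", encoding="utf8") as f_in:
    first_two = list(islice(iter_passages(f_in), 2))

print(first_two)
```

With the real file, passing the handle returned by `gzip.open(wikipedia_filepath, 'rt', encoding='utf8')` to the generator and taking `islice(..., 5000)` would stop reading after 5,000 passages.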


In [14]:
reduced_passages = np.array(passages[:5000])  # Keep only the first 5,000 of the 509,663 passages
reduced_passages.shape

(5000,)

In [15]:
corpus_embeddings = model.encode(reduced_passages, show_progress_bar=True)  # Encode the reduced passages
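
A quick back-of-the-envelope calculation shows why working with a subset is convenient. The sketch below assumes the embeddings are stored as 32-bit floats (4 bytes per value) and uses the 384-dimensional size reported earlier for `all-MiniLM-L6-v2`. The helper name `embedding_bytes` is hypothetical.

```python
def embedding_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of an embedding matrix, assuming a dense array of fixed-width floats."""
    return num_vectors * dim * bytes_per_value


full_corpus = embedding_bytes(509_663, 384)  # Every passage in the file
reduced_corpus = embedding_bytes(5_000, 384)  # The 5,000-passage subset

print(f"Full corpus:    {full_corpus / 1e6:.1f} MB")
print(f"Reduced corpus: {reduced_corpus / 1e6:.1f} MB")
```

This counts only the raw storage of the matrix. Encoding time also grows with the number of passages, which is usually the more noticeable cost in a notebook session.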

In [16]:
semantic_search('Best american actor', model, corpus_embeddings, reduced_passages, top_k=2)  # Find the closest passages

Input query: Best american actor

0.539	Aaron Kwok:  Aaron won the Best Actor Award again at the forty-third Golden Horse Awards on 24 November 2006 for his role in the movie "After This Our Exile". He became the second actor in the history of the Golden Horse Awards to win the Best Actor Award year after year. Jackie Chan first achieved this back in the 1990s.
0.425	James L. Brooks:  He is best known for creating American television programs such as "The Mary Tyler Moore Show", "The Simpsons", "Rhoda" and "Taxi". His best-known movie is "Terms of Endearment", for which he received three Academy Awards in 1984.


In [17]:
semantic_search('Number countries Europe', model, corpus_embeddings, reduced_passages, top_k=2)  # Find the closest passages

Input query: Number countries Europe

0.502	European Union member state:  A European Union member state is any one of the twenty-seven countries that have joined the European Union (EU) since it was found in 1958 as the European Economic Community (EEC). From an original membership of six states, there have been five successive enlargements. The largest happened on 1 May 2004, when ten member states joined.
0.465	European Space Agency:  The member countries of ESA are Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom.


### Question 1: Load a different pre-trained Sentence Transformer model and compare its performance with the previous model, using the same set of paragraphs and queries. Which model works better?

In [21]:
# Load another sentence-transformers model; browse Hugging Face to pick one
model_name = 'sentence-transformers/all-mpnet-base-v2'
new_model = SentenceTransformer(model_name)

In [22]:
corpus_embeddings = new_model.encode(paragraphs,show_progress_bar=True)


In [23]:
semantic_search('¿Dónde está el Coliseo?', new_model, corpus_embeddings, paragraphs, top_k=2 )

Input query: ¿Dónde está el Coliseo?

0.095	The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.
0.059	The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.
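
A single query is thin evidence for deciding which model works better. One possible way to compare models more systematically is to collect each model's ranked results for several queries and measure how often the expected paragraph appears in the top k (hit rate at k). The sketch below is only an illustration: the function name `hit_rate_at_k` is hypothetical, and the rankings and expected indices are made-up toy values, not results from the models above.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k=1):
    """Fraction of queries whose relevant document appears among the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Toy data (assumption): ranked paragraph indices returned for three queries,
# and the index of the paragraph that should have been found for each query.
toy_rankings = [[3, 4], [1, 3], [0, 2]]
toy_relevant = [3, 3, 2]

hit_at_1 = hit_rate_at_k(toy_rankings, toy_relevant, k=1)
hit_at_2 = hit_rate_at_k(toy_rankings, toy_relevant, k=2)
print(f"hit@1 = {hit_at_1:.2f}, hit@2 = {hit_at_2:.2f}")
```

Running the same set of queries through each model and comparing the resulting hit rates gives a fairer picture than comparing raw similarity scores, which are not on the same scale across models.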


### Question 2: Finding duplicate texts. Try to find duplicate or near-duplicate texts in a given corpus based on their semantic similarity, using sentence-transformers.

In [27]:
corpus = [
    "e un giorno soleggiato",
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox leaps over the lazy dog.",
    "The sky is blue, and the grass is green.",
    "The grass is green, and the sky is blue.",
    "It's a sunny day today.",
    "The weather is sunny today.",
    "She was wearing a beautiful red dress.",
    "She had on a gorgeous red dress.",
    "I'm going to the supermarket to buy some groceries.",
    "I'm heading to the supermarket to purchase some groceries.",
    "He didn't like the movie because it was too long.",
    "He disliked the movie as it was too lengthy.",
    "The train was delayed due to technical issues.",
    "Technical issues caused the train to be delayed.",
    "I'll have a cup of coffee with milk and sugar, please.",
    "Can I get a coffee with milk and sugar, please?",
    "The conference was very informative and interesting.",
    "The conference turned out to be interesting and informative.",
    "He enjoys listening to classical music in his free time.",
    "In his leisure time, he likes to listen to classical music.",
    "Please make sure you turn off the lights before leaving.",
    "Before leaving, ensure that you switch off the lights.",
    "The boy was delighted with the gift he received.",
    "Receiving the present made the young lad ecstatic.",
    "She has a preference for Italian cuisine.",
    "Her favorite type of food is from Italy.",
    "The software engineer resolved the issue by modifying the code.",
    "By altering the programming, the tech expert fixed the problem.",
    "Due to the inclement weather, the baseball game was postponed.",
    "The baseball match was rescheduled because of bad weather conditions.",
    "The house was engulfed in a raging fire.",
    "Flames rapidly consumed the residence.",
    "He is constantly browsing the internet for the latest news.",
    "He frequently scours the web to stay updated on current events.",
    "The puppy was playing with a toy in the garden.",
    "In the yard, the young dog was frolicking with its plaything.",
    "The artist painted a beautiful landscape on the canvas.",
]  # Define a new corpus

In [28]:
# Select a different model
model_name = 'BAAI/bge-m3'
model = SentenceTransformer(model_name)

In [29]:
# Compute the embeddings of the corpus
embeddings = model.encode(corpus,show_progress_bar=True)

In [30]:
similarity_threshold = 0.8  # Minimum cosine similarity for two sentences to count as duplicates

duplicates = []  # Collects (sentence 1, sentence 2, similarity) tuples

for i, emb1 in enumerate(embeddings):  # Outer loop: every embedding
    for j, emb2 in enumerate(embeddings[i + 1:]):  # Inner loop: only the embeddings after position i, so each pair is compared once
        similarity = cosine_similarity([emb1], [emb2])[0][0]  # Cosine similarity of the pair
        if similarity > similarity_threshold:  # Keep the pair if it exceeds the threshold
            duplicates.append((corpus[i], corpus[i + j + 1], similarity))

In [31]:
print("Duplicate sentences:")
for sent1, sent2, sim in duplicates:
    print(f"{sent1} | {sent2} | Similarity: {sim:.2f}")
    print()  # Blank line between pairs

Duplicate sentences:
e un giorno soleggiato | It's a sunny day today. | Similarity: 0.87

e un giorno soleggiato | The weather is sunny today. | Similarity: 0.84

The quick brown fox jumps over the lazy dog. | The quick brown fox leaps over the lazy dog. | Similarity: 0.98

The sky is blue, and the grass is green. | The grass is green, and the sky is blue. | Similarity: 0.99

It's a sunny day today. | The weather is sunny today. | Similarity: 0.98

She was wearing a beautiful red dress. | She had on a gorgeous red dress. | Similarity: 0.98

I'm going to the supermarket to buy some groceries. | I'm heading to the supermarket to purchase some groceries. | Similarity: 0.99

He didn't like the movie because it was too long. | He disliked the movie as it was too lengthy. | Similarity: 0.98

The train was delayed due to technical issues. | Technical issues caused the train to be delayed. | Similarity: 0.98

I'll have a cup of coffee with milk and sugar, please. | Can I get a coffee with milk
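
The pairwise loop above can also be written by normalizing every vector once, so that each comparison is a plain dot product. The following standard-library sketch shows the idea on made-up two-dimensional vectors. The names `normalize` and `find_near_duplicates` are hypothetical, and the toy vectors are not real embeddings.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def find_near_duplicates(vectors, threshold=0.8):
    """Return (i, j, similarity) for every pair i < j whose cosine similarity exceeds the threshold."""
    unit = [normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):  # Each unordered pair exactly once
        sim = sum(a * b for a, b in zip(unit[i], unit[j]))
        if sim > threshold:
            pairs.append((i, j, sim))
    return pairs


# Toy data (assumption): vectors 0 and 1 point in almost the same direction.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_pairs = find_near_duplicates(toy_vectors, threshold=0.8)
print(toy_pairs)
```

The number of comparisons still grows quadratically with the corpus size, so for large corpora a batched matrix product or an approximate nearest-neighbour index is usually a better fit.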

## Document Clustering

K-means is a popular unsupervised machine learning algorithm that groups data points into k clusters based on their similarity. In our case, we want to group documents by their semantic similarity. The algorithm requires us to specify the number of clusters k in advance.

In [32]:
corpus = [
    "The apple is a sweet fruit",
    "Oranges are citrus fruits",
    "Bananas are rich in potassium",
    "Strawberries are red fruits",
    "Dogs are domesticated animals",
    "Cats are also pets",
    "Elephants are the largest land mammals",
    "Cows provide us with milk",
    "Sharks are marine predators",
    "Whales are the largest marine mammals",
    "Dolphins are very intelligent",
    "Artificial intelligence is the future",
    "Machine learning is a subset of AI",
    "Deep learning is a part of machine learning",
    "Neural networks are used in deep learning",
]  # New dataset

df = pd.DataFrame({'documents': corpus})  # Convert it to a DataFrame

In [33]:
model = SentenceTransformer('all-MiniLM-L6-v2')  # Load the all-MiniLM-L6-v2 model from Hugging Face again

document_embeddings = model.encode(corpus)  # Encode the corpus into vectors

In [34]:
from sklearn.cluster import KMeans  # Import the K-means algorithm
num_clusters = 3  # Number of clusters we want
clustering_model = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300, n_init=10)  # Configure K-means
clustering_model.fit(document_embeddings)  # Fit it on the corpus embeddings
cluster_assignment = clustering_model.labels_  # Cluster label assigned to each document

df['cluster'] = cluster_assignment  # Store each document's cluster label in the DataFrame

In [35]:
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(df[df['cluster'] == i]['documents'].values, "\n")  # Print the documents in cluster i

Cluster 0:
['The apple is a sweet fruit' 'Oranges are citrus fruits'
 'Bananas are rich in potassium' 'Strawberries are red fruits'] 

Cluster 1:
['Dogs are domesticated animals' 'Cats are also pets'
 'Elephants are the largest land mammals' 'Cows provide us with milk'
 'Sharks are marine predators' 'Whales are the largest marine mammals'
 'Dolphins are very intelligent'] 

Cluster 2:
['Artificial intelligence is the future'
 'Machine learning is a subset of AI'
 'Deep learning is a part of machine learning'
 'Neural networks are used in deep learning'] 
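
To make the K-means procedure less of a black box, here is a minimal standard-library sketch of its two alternating steps: assign each point to the nearest centroid, then move each centroid to the mean of its points. It is a simplification and not scikit-learn's implementation. In particular, it starts from hand-picked centroids rather than `k-means++` initialization, and the points are made-up toy values rather than sentence embeddings.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


# Toy data (assumption): two well-separated groups in 2-D.
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels, toy_centroids = kmeans(toy_points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(toy_labels, toy_centroids)
```

Because the result depends on the starting centroids, library implementations typically run several initializations and keep the best one, which is what the `n_init=10` argument in the notebook's `KMeans` call controls.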



## Community Detection with Sentence Transformers

The `sentence_transformers` library provides a community detection utility that applies a threshold to the cosine similarity score to identify distinct groups of semantically similar sentences. This method is especially useful for organizing a large corpus into meaningful groups.

The function is designed to find clusters based on meaning. Its main parameters are:

* `document_embeddings`: The embeddings of the documents in your corpus, which can be created with any Sentence Transformer model. They should be a 2D tensor or a list of 1D tensors, where each vector represents the semantic meaning of one document.

* `threshold`: A decimal value between 0 and 1 that sets the cutoff for placing two documents in the same community. It is based on cosine similarity: if the similarity between two documents is greater than the threshold, they are considered part of the same community.

* `min_community_size`: The minimum number of documents a group must contain to be valid. A community with fewer documents than this value is discarded. The library default is 10; the example in this notebook lowers it to 2 because the corpus is tiny. A larger value helps filter out "noise" and surface more relevant topics.

* `batch_size`: Because the function computes similarities between pairs of documents, it can use a lot of memory on large corpora. The computation is done in batches; a larger batch size is faster but uses more RAM, while a smaller one is slower but more memory-efficient.

The function returns a list of communities, where each community is a list of indices pointing to the original positions of the documents in your input list. Each group represents a set of documents that are semantically similar according to the configured threshold.
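
The thresholding idea described above can be sketched in plain Python. This is a simplified illustration, not the library's actual implementation: it takes a precomputed similarity matrix, builds one candidate community per document from everything at or above the threshold, and then keeps the largest candidates first while dropping documents that were already assigned. The function name `detect_communities` and the similarity values are hypothetical.

```python
def detect_communities(sim, threshold=0.5, min_community_size=2):
    """Greedy threshold-based grouping over a square similarity matrix."""
    n = len(sim)
    # One candidate community per document: everything similar enough to it.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if sim[i][j] >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Largest candidates first, then remove overlap with earlier communities.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_community_size:
            communities.append(free)
            assigned.update(free)
    return communities


# Toy similarity matrix (assumption): documents 0-2 are alike, as are 3-4.
toy_sim = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.6, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.0, 0.2],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.2, 0.9, 1.0],
]
toy_communities = detect_communities(toy_sim, threshold=0.5, min_community_size=2)
print(toy_communities)
```

Like the library function, the sketch returns lists of indices into the original document list, so the texts can be recovered by indexing the corpus.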

In [36]:
from sentence_transformers.util import community_detection
document_embeddings = model.encode(
    corpus, show_progress_bar=True, convert_to_tensor=True
)
communities = community_detection(
    document_embeddings, threshold=0.5, min_community_size=2, batch_size=1024
)
for i, comm in enumerate(communities):
    print('_'*50)
    print(f'community: {i}, size: {len(comm)}')
    print('\n'.join([corpus[ind] for ind in comm]))
    print()

__________________________________________________
community: 0, size: 4
Deep learning is a part of machine learning
Neural networks are used in deep learning
Machine learning is a subset of AI
Artificial intelligence is the future

__________________________________________________
community: 1, size: 3
Strawberries are red fruits
The apple is a sweet fruit
Oranges are citrus fruits

__________________________________________________
community: 2, size: 3
Whales are the largest marine mammals
Elephants are the largest land mammals
Sharks are marine predators



In the output, we see the communities of semantically similar sentences. Documents that do not fit into any community of at least `min_community_size` members, such as "Bananas are rich in potassium" here, are left out. Keep in mind that the choice of threshold can greatly affect the results: a lower threshold yields larger but less cohesive communities, while a higher threshold yields smaller but more tightly connected ones.

The `community_detection` function is a quick and efficient way to group similar sentences, but note that it is a fairly simple method based on thresholding cosine similarity; more sophisticated community detection methods may give better results for certain tasks or datasets.

This function is a great way to explore the semantic structure of your corpus and gain a high-level understanding of the main topics in your text data.