[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QFQQNPt83KujAKd1EmDmlfWE48hWWq11?usp=sharing)

# Semantic search

In this notebook, we'll introduce semantic search and question-answering using [`sentence-transformers`](https://www.sbert.net/), a Python library for state-of-the-art sentence, text and image embeddings. These embeddings are useful for semantic similarity tasks, such as information retrieval and question-answering systems.

<br>

To browse all available models, see the Models section of:

https://huggingface.co/sentence-transformers

In [None]:
!pip install -q sentence-transformers

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import time
import gzip
import os

We'll use a pre-trained Sentence Transformer model to generate sentence embeddings. Many pre-trained models are available [here](https://www.sbert.net/docs/pretrained_models.html).

In [None]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

For our semantic search and question-answering task, we need a list of documents or paragraphs to search through for relevant information.

In [None]:
# Sample paragraphs
paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
    "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra."
]

paragraphs = np.array(paragraphs)

In [None]:
# Generate embeddings for paragraphs
corpus_embeddings = model.encode(paragraphs)
print(corpus_embeddings.shape)
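
Retrieval over embeddings like these comes down to two steps: score every paragraph against the query with cosine similarity, then keep the best few. The sketch below illustrates this on hand-made 3-dimensional vectors that stand in for real embeddings. The vectors and the helper name `top_k_by_cosine` are illustrative assumptions, not part of `sentence-transformers`.

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, top_k=2):
    """Return (indices, scores) of the top_k corpus vectors, best first."""
    query_vec = np.asarray(query_vec, dtype=float)
    corpus_vecs = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    top_k = min(top_k, len(sims))
    # argpartition puts the top_k largest scores last, in arbitrary order
    idx = np.argpartition(sims, -top_k)[-top_k:]
    # Sort only those top_k entries by descending similarity
    idx = idx[np.argsort(-sims[idx])]
    return idx, sims[idx]

# Toy "embeddings": rows 0 and 2 point roughly in the query's direction
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [2.0, 0.1, 0.0]

idx, scores = top_k_by_cosine(toy_query, toy_corpus, top_k=2)
print(idx, scores)
```

Because the vectors are normalised first, the query's length (here about 2) does not affect the ranking; only its direction does.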

Now, let's define a function to perform semantic search, given a query and a list of paragraph embeddings.

In [None]:
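def semantic_search(query, model, corpus_embeddings, paragraphs, top_k=2):
    """Print the top_k paragraphs most similar to the query (paragraphs must be a NumPy array)."""
    query_embedding = model.encode([query])[0]
    # Cosine similarity between the query and every paragraph embedding
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    # Select the top_k highest scores (unordered), then sort them in descending order
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for text, sim in zip(list(paragraphs[indexes]), similarities[indexes].tolist()):
        print(f"{sim:.3f}\t{text}")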

semantic_search('Where is the Colosseum', model, corpus_embeddings, paragraphs, top_k=2)

## Multilingual models


In [None]:
# Let's try a query in another language
model_name = 'sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1'
model_2 = SentenceTransformer(model_name)

# Generate embeddings for paragraphs with this model
corpus_embeddings = model_2.encode(paragraphs)
print(corpus_embeddings.shape)

semantic_search('¿Dónde está el Coliseo?', model_2, corpus_embeddings, paragraphs, top_k=2)

Multilingual models are available [here](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

In [None]:
# we can use multilingual models
model_name = 'paraphrase-multilingual-MiniLM-L12-v2'
multi_model = SentenceTransformer(model_name)

In [None]:
multi_corpus_embeddings = multi_model.encode(paragraphs)
print(multi_corpus_embeddings.shape)

In [None]:
semantic_search('¿Dónde está el Coliseo?', multi_model, multi_corpus_embeddings, paragraphs, top_k=2)

## Wikipedia semantic search

As the dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
about 170k articles. We split these articles into paragraphs.

In [None]:
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # Store each passage as a single 'title:  paragraph' string
            passages.append(data['title']+':  '+ paragraph)

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))
print(passages[0])
print(passages[1])
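
To check the parsing logic without downloading the full dump, the sketch below writes a two-article file in the same shape (one JSON object per line with `title` and `paragraphs` fields, as read above) to a temporary directory and loads it back. The helper `load_passages`, its `max_passages` parameter, and the sample articles are hypothetical and only serve as an illustration.

```python
import gzip
import json
import os
import tempfile

def load_passages(path, max_passages=None):
    """Read a gzipped JSON-lines file and return 'title: paragraph' strings."""
    passages = []
    with gzip.open(path, 'rt', encoding='utf8') as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            article = json.loads(line)
            for paragraph in article['paragraphs']:
                passages.append(f"{article['title']}: {paragraph}")
                # Stop early once the optional limit is reached
                if max_passages is not None and len(passages) >= max_passages:
                    return passages
    return passages

# Made-up sample data in the same record shape as the Wikipedia file
sample_articles = [
    {"title": "Rome", "paragraphs": ["Rome is the capital of Italy.", "It is on the Tiber."]},
    {"title": "Agra", "paragraphs": ["Agra is a city in India."]},
]

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.jsonl.gz")
with gzip.open(sample_path, 'wt', encoding='utf8') as f_out:
    for article in sample_articles:
        f_out.write(json.dumps(article) + "\n")

sample_passages = load_passages(sample_path)
print(len(sample_passages), sample_passages[0])
```

Prefixing each paragraph with its article title keeps the topic visible to the embedding model even when the paragraph itself only uses pronouns such as "It".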

In [None]:
reduced_passages = np.array(passages[:5000])
reduced_passages.shape

In [None]:
corpus_embeddings = model.encode(reduced_passages, show_progress_bar=True)

In [None]:
semantic_search('Best american actor', model, corpus_embeddings, reduced_passages, top_k=2)

In [None]:
semantic_search('Number countries Europe', model, corpus_embeddings, reduced_passages, top_k=2)
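
For larger corpora, `sentence-transformers` also ships its own `util.semantic_search` helper, which scores queries against the corpus in chunks. The sketch below shows one way it might be wired in. It assumes the hits come back as one list per query of dicts with `corpus_id` and `score` keys, and that the corpus embeddings were produced by the same model as the query. The wrapper names `format_hits` and `search_with_util` are hypothetical.

```python
def format_hits(hits, passages):
    """Turn hit dicts ({'corpus_id', 'score'}) into (score, text) tuples, best first."""
    ordered = sorted(hits, key=lambda hit: hit['score'], reverse=True)
    return [(round(float(hit['score']), 3), passages[hit['corpus_id']]) for hit in ordered]

def search_with_util(query, model, corpus_embeddings, passages, top_k=2):
    """Search with util.semantic_search instead of a hand-written ranking step."""
    # Imported here so format_hits can be used without the library installed
    from sentence_transformers import util
    query_embedding = model.encode(query, convert_to_tensor=True)
    # One list of hits is returned per query; we pass a single query
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    return format_hits(hits, passages)

# Formatting step on made-up hits (no model required)
example_hits = [{'corpus_id': 1, 'score': 0.2}, {'corpus_id': 0, 'score': 0.9}]
print(format_hits(example_hits, ['first passage', 'second passage']))
```

With the variables from this notebook, a call would look like `search_with_util('Best american actor', model, corpus_embeddings, reduced_passages)`.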

## Find text duplicates

Try to find duplicate or near-duplicate texts in a given corpus based on their semantic similarity using sentence-transformers.
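
Concretely, near-duplicate detection means thresholding pairwise similarities while visiting each unordered pair exactly once. The sketch below does this on a hand-written 4×4 similarity matrix. The values, the item labels, and the helper name `find_near_duplicates` are made up for illustration.

```python
from itertools import combinations

def find_near_duplicates(similarity_matrix, items, threshold=0.9):
    """Return (item_i, item_j, score) for every pair i < j at or above the threshold."""
    pairs = []
    # combinations yields each unordered pair once and skips the diagonal
    for i, j in combinations(range(len(items)), 2):
        score = similarity_matrix[i][j]
        if score >= threshold:
            pairs.append((items[i], items[j], score))
    # Most similar pairs first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

toy_items = ["a", "b", "c", "d"]
toy_similarities = [
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.15, 0.30],
    [0.10, 0.15, 1.00, 0.92],
    [0.20, 0.30, 0.92, 1.00],
]

toy_pairs = find_near_duplicates(toy_similarities, toy_items)
print(toy_pairs)
```

Skipping the diagonal matters because every item has similarity 1.0 with itself and would otherwise always be reported as a duplicate.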

In [None]:
texts = [
    "The weather today is sunny and warm.",
    "Today's forecast calls for clear skies and pleasant temperatures.",
    "I like pizza with extra cheese and pepperoni.",
    "She enjoys salads with fresh vegetables and a light vinaigrette.",
    "Cats are known for their independent and aloof nature.",
    "Dogs are typically seen as loyal and affectionate companions.",
    "Mountains are majestic and serene in the early morning light.",
    "Cities are bustling and noisy at all hours of the day.",
]

In [None]:
# Step 1: Initialize the SentenceTransformer model
model = SentenceTransformer('paraphrase-distilroberta-base-v2')

In [None]:
# Step 2: Obtain corpus embeddings
embeddings = model.encode(texts, convert_to_tensor=False)

In [None]:
# Step 3: Calculate similarity and find duplicates

# Pairwise cosine similarity between all embeddings
similarities = cosine_similarity(embeddings, embeddings)

# Pairs with similarity at or above this threshold count as duplicates
similarity_threshold = 0.9

# Iterate over each unordered pair (i, j) with i < j and
# collect the pairs whose similarity reaches the threshold
duplicates = []

for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if similarities[i][j] >= similarity_threshold:
            duplicates.append((texts[i], texts[j], similarities[i][j]))

In [None]:
# Sort duplicates by similarity score in descending order
duplicates.sort(key=lambda x: x[2], reverse=True)

# Show the top 5 duplicates
top_5_duplicates = duplicates[:5]

# Print the top 5 duplicates
for i, (text1, text2, similarity) in enumerate(top_5_duplicates, start=1):
    print(f"Top {i} Similarity Score: {similarity:.2f}")
    print(f"Text 1: '{text1}'")
    print(f"Text 2: '{text2}'")
    print()

The results are not very good, so we could fine-tune the model. We do not have much data, but the following guide gives an idea of how it could be done.

https://huggingface.co/blog/how-to-train-sentence-transformers


# Clustering

We can use BERTopic, a topic modeling and clustering library that uses a sentence transformer model as its base to create topics/clusters.

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

- Documentation: https://maartengr.github.io/BERTopic/index.html
- Notebook example: https://colab.research.google.com/#fileId=https%3A//huggingface.co/spaces/davanstrien/blog_notebooks/blob/main/BERTopic_hub_starter.ipynb
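
To give a feel for the c-TF-IDF idea mentioned above, here is a simplified, dependency-free sketch: all documents in a cluster are treated as one document, and each word is weighted by its normalised frequency in the cluster times `log(1 + A / f_x)`, where `A` is the average number of words per cluster and `f_x` is the word's frequency across all clusters. This follows the formula as described in the BERTopic documentation to the best of my understanding. It is not BERTopic's actual implementation, and the function names and toy clusters are assumptions.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of documents (strings)."""
    # Word counts per cluster, treating the cluster's documents as one document
    term_counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }
    # f_x: frequency of each word across all clusters
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total_counts.values()) / len(clusters)
    scores = {}
    for cid, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[cid] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores

def top_words(scores, cid, n=2):
    """Highest-scoring words of a cluster; ties are broken alphabetically."""
    ranked = sorted(scores[cid].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

toy_clusters = {
    "pets": ["cats are independent", "dogs are loyal"],
    "places": ["mountains are serene", "cities are noisy"],
}
toy_scores = c_tf_idf(toy_clusters)
print(top_words(toy_scores, "pets", n=4))
```

Although "are" is the most frequent word in each toy cluster, it appears in every cluster, so it ranks below the cluster-specific words. That is the property that keeps topic descriptions interpretable.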


