It is highly recommended to use a powerful **GPU**, you can use it for free uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA_clustering.ipynb">
        <img src="https://colab.research.google.com/img/colab_favicon_256px.png"  width="50" height="50" style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA_clustering.ipynb">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png"  width="50" height="50" style="padding-bottom:5px;" />View Source on GitHub</a></td>
</table>

# Semantic search

In this notebook, we'll introduce semantic search and question-answering using [`sentence-transformers`](https://www.sbert.net/), a Python library for state-of-the-art sentence, text and image embeddings. These embeddings are useful for semantic similarity tasks, such as information retrieval and question-answering systems.

<br>

If you want to check all models availabe (in Models section):

https://huggingface.co/sentence-transformers

In [1]:
!pip install -q sentence-transformers

In [2]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import time
import gzip
import os

We'll use a pre-trained Sentence Transformer model to generate sentence embeddings. Many pre-trained models are available [here](https://www.sbert.net/docs/pretrained_models.html)

In [3]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

For our semantic search and question-answering task, we need a list of documents or paragraphs to search through for relevant information.

In [4]:
# Sample paragraphs
paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
    "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra."
]

paragraphs = np.array(paragraphs)

In [5]:
# Generate embeddings for paragraphs
corpus_embeddings = model.encode(paragraphs)
print(corpus_embeddings.shape)

(5, 384)


Now, let's define a function to perform semantic search, given a query and a list of paragraph embeddings.

In [6]:
def semantic_search(query, model, corpus_embeddings, paragraphs, top_k=2):
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for text, sim in zip(list(paragraphs[indexes]), similarities[indexes].tolist()):
        print(f"{sim:.3f}\t{text}")


semantic_search('Where is the Colosseum', model, corpus_embeddings, paragraphs, top_k=2)

Input query: Where is the Colosseum

0.801	The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.
0.226	The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra.


## Multilingual models


In [7]:
# lets try in other languages
semantic_search('¿Dónde está el Coliseo?', model, corpus_embeddings, paragraphs, top_k=2)

Input query: ¿Dónde está el Coliseo?

0.086	The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.
0.067	The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra.


We have multilinguals models available [here](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models)

In [8]:
# we can use multilingual models
model_name = 'paraphrase-multilingual-MiniLM-L12-v2'
multi_model = SentenceTransformer(model_name)

In [9]:
multi_corpus_embeddings = multi_model.encode(paragraphs)
print(multi_corpus_embeddings.shape)

(5, 384)


In [10]:
semantic_search('¿Dónde está el Coliseo?', multi_model, multi_corpus_embeddings, paragraphs, top_k=2)

Input query: ¿Dónde está el Coliseo?

0.439	The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.
0.299	The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.


## Wikipedia semantic search

As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
about 170k articles. We split these articles into paragraphs

In [11]:
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append(data['title']+':  '+ paragraph)

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))
print(passages[0])
print(passages[1])

Passages: 509663
Ted Cassidy:  Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".
Aileen Wuornos:  Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956 – October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.


In [12]:
reduced_passages = np.array(passages[:5000])
reduced_passages.shape

(5000,)

In [13]:
corpus_embeddings = model.encode(reduced_passages, show_progress_bar=True)

Batches:   0%|          | 0/157 [00:00<?, ?it/s]

In [14]:
semantic_search('Best american actor', model, corpus_embeddings, reduced_passages, top_k=2)

Input query: Best american actor

0.539	Aaron Kwok:  Aaron won the Best Actor Award again at the forty-third Golden Horse Awards on 24 November 2006 for his role in the movie "After This Our Exile". He became the second actor in the history of the Golden Horse Awards to win the Best Actor Award year after year. Jackie Chan first achieved this back in the 1990s.
0.425	James L. Brooks:  He is best known for creating American television programs such as "The Mary Tyler Moore Show", "The Simpsons", "Rhoda" and "Taxi". His best-known movie is "Terms of Endearment", for which he received three Academy Awards in 1984.


In [15]:
semantic_search('Number countries Europe', model, corpus_embeddings, reduced_passages, top_k=2)

Input query: Number countries Europe

0.502	European Union member state:  A European Union member state is any one of the twenty-seven countries that have joined the European Union (EU) since it was found in 1958 as the European Economic Community (EEC). From an original membership of six states, there have been five successive enlargements. The largest happened on 1 May 2004, when ten member states joined.
0.465	European Space Agency:  The member countries of ESA are Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom.


## Find text duplicates

Try to find duplicate or near-duplicate texts in a given corpus based on their semantic similarity using sentence-transformers.

In [16]:
texts = [
    "The weather today is sunny and warm.",
    "Today's forecast calls for clear skies and pleasant temperatures.",
    "I like pizza with extra cheese and pepperoni.",
    "She enjoys salads with fresh vegetables and a light vinaigrette.",
    "Cats are known for their independent and aloof nature.",
    "Dogs are typically seen as loyal and affectionate companions.",
    "Mountains are majestic and serene in the early morning light.",
    "Cities are bustling and noisy at all hours of the day.",
]

In [22]:
# Step 1: Initialize the SentenceTransformer model
model = SentenceTransformer('paraphrase-distilroberta-base-v2')

Downloading (…)2b9e5/.gitattributes:   0%|          | 0.00/736 [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)3c1ed2b9e5/README.md:   0%|          | 0.00/3.74k [00:00<?, ?B/s]

Downloading (…)1ed2b9e5/config.json:   0%|          | 0.00/686 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading (…)c1ed2b9e5/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/329M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)2b9e5/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Downloading (…)c1ed2b9e5/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)ed2b9e5/modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

In [42]:
# Step 2: Obtain corpus embeddings
embeddings = model.encode(texts, convert_to_tensor=False)

In [43]:
# Step 3: Calculate similarity and find duplicates

# TODO: Define similarity
similarities = cosine_similarity(embeddings, embeddings)

# TODO: Define a similarity threshold
similarity_threshold = 0.9

# TODO: Iterate over each pair of embeddings in the corpus
# Calculate the cosine similarity between the embeddings
# If the similarity is above the threshold, add the sentences to the duplicates list
duplicates = []

for i, emb1 in enumerate(embeddings):
    for j, emb2 in enumerate(embeddings[i + 1:]):
        if similarities[i][j] >= similarity_threshold:
            duplicates.append((texts[i], texts[i + j + 1], similarities[i][j]))

In [44]:
# Sort duplicates by similarity score in descending order
duplicates.sort(key=lambda x: x[2], reverse=True)

# Show the top 5 duplicates
top_5_duplicates = duplicates[:5]

# Print the top 5 duplicates
for i, (text1, text2, similarity) in enumerate(top_5_duplicates, start=1):
    print(f"Top {i} Similarity Score: {similarity:.2f}")
    print(f"Text 1: '{text1}'")
    print(f"Text 2: '{text2}'")
    print()

Top 1 Similarity Score: 1.00
Text 1: 'The weather today is sunny and warm.'
Text 2: 'Today's forecast calls for clear skies and pleasant temperatures.'

Top 2 Similarity Score: 1.00
Text 1: 'Today's forecast calls for clear skies and pleasant temperatures.'
Text 2: 'She enjoys salads with fresh vegetables and a light vinaigrette.'

Top 3 Similarity Score: 1.00
Text 1: 'She enjoys salads with fresh vegetables and a light vinaigrette.'
Text 2: 'Cities are bustling and noisy at all hours of the day.'

Top 4 Similarity Score: 1.00
Text 1: 'I like pizza with extra cheese and pepperoni.'
Text 2: 'Dogs are typically seen as loyal and affectionate companions.'



Results are no so good so we can finetune. We do not have so much data but we can see following notebook example to get an idea of how we could do it.

https://huggingface.co/blog/how-to-train-sentence-transformers


# Clustering

We can use BERTopic a clustering algorithm library that use sentence transformer model as baseline to create topics/clusters.

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

- Documentation: https://maartengr.github.io/BERTopic/index.html
- Notebook example: https://colab.research.google.com/#fileId=https%3A//huggingface.co/spaces/davanstrien/blog_notebooks/blob/main/BERTopic_hub_starter.ipynb


