A powerful **GPU** is highly recommended. You can use one for free by uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/semantic_search_QA.ipynb">
        <img src="https://i.ibb.co/xfJbPmL/github.png"  height="70px" style="padding-bottom:5px;"  />View Source on GitHub</a></td>
</table>

# Semantic search & QA

In this notebook, we'll introduce semantic search and question-answering using [`sentence-transformers`](https://www.sbert.net/), a Python library for state-of-the-art sentence, text and image embeddings. These embeddings are useful for semantic similarity tasks, such as information retrieval and question-answering systems.

In [None]:
# Install the sentence-transformers library
!pip install -U sentence-transformers

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import time
import gzip
import os

We'll use a pre-trained Sentence Transformer model to generate sentence embeddings. Many pre-trained models are listed [here](https://www.sbert.net/docs/pretrained_models.html).

In [None]:
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

For our semantic search and question-answering task, we need a list of documents or paragraphs to search through for relevant information.

In [None]:
# Sample paragraphs
paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Statue of Liberty is a colossal neoclassical sculpture on Liberty Island in New York Harbor within New York City, in the United States.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
    "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The Taj Mahal is an ivory-white marble mausoleum on the southern bank of the river Yamuna in the Indian city of Agra."
]

paragraphs = np.array(paragraphs)

In [None]:
# Generate embeddings for paragraphs
corpus_embeddings = model.encode(paragraphs)
print(corpus_embeddings.shape)

Now, let's define a function to perform semantic search, given a query and a list of paragraph embeddings.

In [None]:
def semantic_search(query, model, corpus_embeddings, paragraphs, top_k=2):
    """Print the top_k paragraphs most similar to the query, best match first."""
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    # argpartition finds the top_k largest scores without sorting the whole array
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    # Sort only those top_k candidates by descending similarity
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for text, sim in zip(list(paragraphs[indexes]), similarities[indexes].tolist()):
        print(f"{sim:.3f}\t{text}")

semantic_search('Where is the Colosseum', model, corpus_embeddings, paragraphs, top_k=2)
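
To make the ranking step inside `semantic_search` more concrete, here is a minimal sketch that uses only NumPy and hand-made 3-dimensional vectors instead of real model embeddings. The helper names `cosine_scores` and `top_k_indices` and the toy vectors are hypothetical and chosen for illustration only. The sketch also clamps `top_k` to the corpus size, since `np.argpartition` raises an error when asked for more elements than exist.

```python
import numpy as np

def cosine_scores(query_vec, corpus_vecs):
    """Cosine similarity between one query vector and each row of corpus_vecs."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    corpus_unit = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return corpus_unit @ query_unit

def top_k_indices(scores, top_k):
    """Indices of the top_k highest scores, ordered from best to worst."""
    top_k = min(top_k, len(scores))  # clamp so argpartition never gets an out-of-range k
    candidates = np.argpartition(scores, -top_k)[-top_k:]  # unordered top_k candidates
    return candidates[np.argsort(-scores[candidates])]     # order only those candidates

# Hypothetical toy "embeddings" (not produced by any model)
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([0.9, 0.1, 0.0])

scores = cosine_scores(toy_query, toy_corpus)
best = top_k_indices(scores, top_k=2)
print(best, scores[best])
```

The query points mostly along the first axis, so row 0 ranks first and the mixed vector in row 2 ranks second.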

## Multilingual models


In [None]:
# Let's try a query in another language (Spanish)
semantic_search('¿Dónde está el Coliseo?', model, corpus_embeddings, paragraphs, top_k=2)

Multilingual models are available [here](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

In [None]:
# Load a multilingual model
model_name = 'clip-ViT-B-32-multilingual-v1'
multi_model = SentenceTransformer(model_name)

In [None]:
multi_corpus_embeddings = multi_model.encode(paragraphs)
print(multi_corpus_embeddings.shape)

In [None]:
semantic_search('¿Dónde está el Coliseo?', multi_model, multi_corpus_embeddings, paragraphs, top_k=2)

## Wikipedia semantic search

As our dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
about 170k articles. We split these articles into paragraphs.

In [None]:
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # Store each passage as a single string: "title:  paragraph"
            passages.append(data['title']+':  '+ paragraph)

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))
print(passages[0])
print(passages[1])
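
The loading step above can be tried without downloading anything. The following sketch builds a tiny gzip-compressed JSON Lines file in memory and parses it the same way. The two records are made up, and the helper name `load_passages` is hypothetical. The only assumption about the real file is the one the cell above already makes: each line is a JSON object with a `title` string and a `paragraphs` list.

```python
import gzip
import io
import json

def load_passages(file_obj):
    """Read JSON Lines records and return one 'title:  paragraph' string per paragraph."""
    passages = []
    for line in file_obj:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        for paragraph in record["paragraphs"]:
            passages.append(record["title"] + ":  " + paragraph)
    return passages

# Hypothetical records, compressed in memory to mimic the downloaded file
records = [
    {"title": "Paris", "paragraphs": ["Paris is the capital of France.", "It lies on the Seine."]},
    {"title": "Rome", "paragraphs": ["Rome is the capital of Italy."]},
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf8")
compressed = io.BytesIO(gzip.compress(raw))

with gzip.open(compressed, "rt", encoding="utf8") as f_in:
    toy_passages = load_passages(f_in)

print(len(toy_passages))
print(toy_passages[0])
```

Prefixing every paragraph with its article title keeps the topic visible to the encoder even when the paragraph itself never names it.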

In [None]:
reduced_passages = np.array(passages[:5000])
reduced_passages.shape

In [None]:
corpus_embeddings = model.encode(reduced_passages, show_progress_bar=True)

In [None]:
semantic_search('Best american actor', model, corpus_embeddings, reduced_passages, top_k=2)

In [None]:
semantic_search('Number countries Europe', model, corpus_embeddings, reduced_passages, top_k=2)
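
When several queries are run against the same corpus, one possible variation is to normalize the embeddings once and score all queries with a single matrix product, because the dot product of unit-length vectors equals their cosine similarity. The sketch below illustrates the idea with made-up 2-dimensional vectors. The helper `normalize_rows` is hypothetical, and this is not how `semantic_search` above is implemented.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so that dot products equal cosine similarities."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms

# Hypothetical toy embeddings (not produced by any model)
corpus_vecs = normalize_rows(np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]))
query_vecs = normalize_rows(np.array([[1.0, 0.1], [0.2, 1.0]]))

score_matrix = query_vecs @ corpus_vecs.T    # shape: (n_queries, n_passages)
best_passage = score_matrix.argmax(axis=1)   # index of the best passage for each query
print(score_matrix.shape, best_passage)
```
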

### Question 1: Load a different pre-trained Sentence Transformer model and compare its performance to the last model on the same set of paragraphs and queries. Which model performs better?

In [None]:
# Load a different pre-trained model, generate embeddings, and test with the same queries
model_name = ...
new_model = SentenceTransformer(model_name)

## Question 2: Find text duplicates

Use sentence-transformers to find duplicate or near-duplicate texts in the corpus below, based on their semantic similarity.

In [None]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox leaps over the lazy dog.",
    "The sky is blue, and the grass is green.",
    "The grass is green, and the sky is blue.",
    "It's a sunny day today.",
    "The weather is sunny today.",
    "She was wearing a beautiful red dress.",
    "She had on a gorgeous red dress.",
    "I'm going to the supermarket to buy some groceries.",
    "I'm heading to the supermarket to purchase some groceries.",
    "He didn't like the movie because it was too long.",
    "He disliked the movie as it was too lengthy.",
    "The train was delayed due to technical issues.",
    "Technical issues caused the train to be delayed.",
    "I'll have a cup of coffee with milk and sugar, please.",
    "Can I get a coffee with milk and sugar, please?",
    "The conference was very informative and interesting.",
    "The conference turned out to be interesting and informative.",
    "He enjoys listening to classical music in his free time.",
    "In his leisure time, he likes to listen to classical music.",
    "Please make sure you turn off the lights before leaving.",
    "Before leaving, ensure that you switch off the lights."
]

corpus += [
    "The boy was delighted with the gift he received.",
    "Receiving the present made the young lad ecstatic.",
    "She has a preference for Italian cuisine.",
    "Her favorite type of food is from Italy.",
    "The software engineer resolved the issue by modifying the code.",
    "By altering the programming, the tech expert fixed the problem.",
    "Due to the inclement weather, the baseball game was postponed.",
    "The baseball match was rescheduled because of bad weather conditions.",
    "The house was engulfed in a raging fire.",
    "Flames rapidly consumed the residence.",
    "He is constantly browsing the internet for the latest news.",
    "He frequently scours the web to stay updated on current events.",
    "The puppy was playing with a toy in the garden.",
    "In the yard, the young dog was frolicking with its plaything.",
    "The artist painted a beautiful landscape on the canvas.",
]

In [None]:
# Step 1: Initialize the SentenceTransformer model
model = ...

In [None]:
# Step 2: Obtain corpus embeddings
embeddings = ...

In [None]:
# Step 3: Calculate similarity and find duplicates

# TODO: Define a similarity threshold
similarity_threshold = ...

# TODO: Iterate over each pair of embeddings in the corpus
# Calculate the cosine similarity between the embeddings
# If the similarity is above the threshold, add the sentences to the duplicates list
duplicates = []

for i, emb1 in enumerate(embeddings):
    for j, emb2 in enumerate(embeddings[i + 1:]):
        similarity = cosine_similarity([emb1], [emb2])[0][0]
        if ...:
            duplicates.append((corpus[i], corpus[i + j + 1], similarity))

In [None]:
print("Duplicate sentences:")
for sent1, sent2, sim in duplicates:
    print(f"{sent1} | {sent2} | Similarity: {sim:.2f}")
    print()
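
The pairwise loop above calls `cosine_similarity` once per pair. As an alternative sketch, the whole similarity matrix can be computed in one step, and the upper triangle then yields each pair exactly once. The function name `find_similar_pairs`, the toy vectors, and the threshold of 0.9 are all hypothetical and serve only to illustrate the mechanics. A suitable threshold for real embeddings has to be chosen by inspecting the results.

```python
import numpy as np

def find_similar_pairs(vectors, threshold):
    """Return (i, j, similarity) for every pair i < j whose cosine similarity exceeds threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T                              # full cosine similarity matrix
    rows, cols = np.triu_indices(len(vectors), k=1)  # upper triangle: each pair once, no self-pairs
    keep = sim[rows, cols] > threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows[keep], cols[keep])]

# Hypothetical toy vectors: rows 0 and 1 are near-duplicates, as are rows 2 and 3
toy_vectors = np.array([
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.98],
])

pairs = find_similar_pairs(toy_vectors, threshold=0.9)  # 0.9 is an arbitrary illustrative value
for i, j, s in pairs:
    print(i, j, round(s, 3))
```
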