# **Cyshield Task #3**: Semantic Search in English <br>by **Mohammed Aly**
- [LinkedIn](https://www.linkedin.com/in/mohammed-aly-1854a020a/)  
- [GitHub](https://github.com/MohammedAly22/)


<html>
    <img src="https://pbs.twimg.com/media/FqyH_WzaYAQl3w6.png:large" width="100%">
</html>

**Semantic search** in Natural Language Processing (NLP) is an advanced approach to **information retrieval** that goes beyond the traditional method of matching keywords. It involves a profound understanding of the **meanings behind words** and the contextual nuances in which they are used.

By leveraging techniques from NLP, semantic search aims to comprehend the intricacies of human language. This includes recognizing entities, such as people, places, and organizations, and understanding the relationships between them.

The ultimate goal is to provide **more precise and relevant search results** by considering **not just the words** in a query but also the **underlying semantics** and user intent, enhancing the overall search experience.

# 1.0 Importing Required Packages

In [None]:
import re
from tqdm import tqdm

from nltk.corpus import stopwords

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import tensorflow as tf

from transformers import AutoTokenizer, TFAutoModel

# 2.0 Loading & Exploring the Dataset
I've chosen the [AG-News-Classification-Dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset) because it is large enough to build and compare several semantic search approaches on. It consists of the following fields: [`Title`, `Description`, `Class Index`]. The `Class Index` column is an integer ranging from 1 to 4 with these corresponding classes:
- 1 -> "World"
- 2 ->  "Sports"
- 3 -> "Business"
- 4 -> "Science/Technology"

In total, there are **120,000 training samples** and **7,600 testing samples**, stored in two separate files.


In [None]:
train_df = pd.read_csv("/kaggle/input/ag-news-classification-dataset/train.csv")
test_df = pd.read_csv("/kaggle/input/ag-news-classification-dataset/test.csv")

df = pd.concat([train_df, test_df])

In [None]:
df.head()

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 127600 entries, 0 to 7599
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Class Index  127600 non-null  int64 
 1   Title        127600 non-null  object
 2   Description  127600 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.9+ MB
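
The `Index: 127600 entries, 0 to 7599` line above reflects that `pd.concat` keeps each frame's original row labels, so the labels 0 to 7599 occur twice. The rest of the notebook indexes by position with `.iloc`, so this is harmless here. The sketch below, on two tiny made-up frames (not the AG News files), shows what `ignore_index=True` would change if label-based access were ever needed.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the AG News files)
train_toy = pd.DataFrame({"Title": ["a", "b", "c"]})
test_toy = pd.DataFrame({"Title": ["d", "e"]})

# Default behaviour: original row labels are kept, so 0 and 1 repeat
kept = pd.concat([train_toy, test_toy])
print(kept.index.tolist())        # [0, 1, 2, 0, 1]

# ignore_index=True relabels rows 0..n-1, making labels unique
relabeled = pd.concat([train_toy, test_toy], ignore_index=True)
print(relabeled.index.tolist())   # [0, 1, 2, 3, 4]
```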


# 3.0 Preparing the Dataset

## 3.1 Normalize Column Names

In [None]:
df.columns = [col.lower() for col in df.columns]
df.head()

| | class index | title | description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


## 3.2 Create `text` Column by Combining `title` and `description`

In [None]:
df["text"] = df["title"] + " " + df["description"]

In [None]:
df.head()

| | class index | title | description | text |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Wall St. Bears Claw Back Into the Black (Reute... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Carlyle Looks Toward Commercial Aerospace (Reu... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Oil and Economy Cloud Stocks' Outlook (Reuters... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Iraq Halts Oil Exports from Main Southern Pipe... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Oil prices soar to all-time record, posing new... |


## 3.3 Select the Relevant Features Only

In [None]:
df = df[["text", "class index"]]

In [None]:
df.head()

| | text | class index |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 3 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 3 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 3 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 3 |
| 4 | Oil prices soar to all-time record, posing new... | 3 |


## 3.4 Create `category` Column by Mapping the `class index` to a String

In [None]:
class_mapper = {
    1: "World",
    2: "Sports",
    3: "Business",
    4: "Science/Technology"
}


def convert_id_to_class(row):
    """
    Convert a class index to its corresponding class name from `class_mapper`.

    Parameters
    ----------
    - row : int
        A single class index (1 to 4), as passed by `Series.apply`.
    
    Returns
    -------
    - str
        The class name corresponding to the class index.
    """
    
    return class_mapper[row]


# `assign` returns a new DataFrame, which avoids the pandas SettingWithCopyWarning
# that plain column assignment raises on the column subset taken in section 3.3.
df = df.assign(category=df["class index"].apply(convert_id_to_class))


In [None]:
df.head()

| | text | class index | category |
|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 3 | Business |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 3 | Business |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 3 | Business |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 3 | Business |
| 4 | Oil prices soar to all-time record, posing new... | 3 | Business |
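
Because `class_mapper` is a plain dictionary, the same lookup can also be written with `Series.map`, which accepts a dict directly. A minimal sketch on a toy Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

class_mapper = {1: "World", 2: "Sports", 3: "Business", 4: "Science/Technology"}

# Toy class indices standing in for df["class index"]
class_index = pd.Series([3, 2, 4, 1])

# `map` looks up each value in the dict; values missing from it would become NaN
category = class_index.map(class_mapper)
print(category.tolist())
```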


# 4.0 Preprocessing the Dataset

In [None]:
# Build the stop-word set once: set membership is O(1), whereas the original
# called `stopwords.words("english")` for every single word.
# If the corpus is missing, run `nltk.download("stopwords")` first.
STOP_WORDS = set(stopwords.words("english"))


def _clean_text(text):
    """
    Preprocess and clean `text` by:
    - lowercasing
    - keeping only alphanumeric characters
    - removing stopwords

    Parameters
    ----------
    - text: str
        The string to be processed.
    
    Returns
    -------
    - str
        The cleaned version of the passed `text`.
    """

    text = text.lower()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    words = [word for word in text.split() if word not in STOP_WORDS]

    return " ".join(words)


def clean_df_text(texts):
    """
    Apply the `_clean_text()` function to each text in `texts`.

    Parameters
    ----------
    - texts : pd.Series
        A pandas series of texts.
    
    Returns
    -------
    - all_cleaned_texts : list[str]
        A list containing the cleaned version of each text in `texts`.
    """

    all_cleaned_texts = []

    for text in tqdm(texts):
        cleaned_text = _clean_text(text)
        all_cleaned_texts.append(cleaned_text)

    return all_cleaned_texts

In [None]:
cleaned_texts = clean_df_text(df["text"])
df["clean_text"] = cleaned_texts

100%|██████████| 127600/127600 [10:23<00:00, 204.81it/s]


In [None]:
df.head()

| | text | class index | category | clean_text |
|---|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 3 | Business | wall st bears claw back black reuters reuters ... |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 3 | Business | carlyle looks toward commercial aerospace reut... |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 3 | Business | oil economy cloud stocks outlook reuters reute... |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 3 | Business | iraq halts oil exports main southern pipeline ... |
| 4 | Oil prices soar to all-time record, posing new... | 3 | Business | oil prices soar time record posing new menace ... |
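
To make the effect of the cleaning steps concrete, here is a small self-contained sketch that applies the same three steps to one headline. The stop-word set is a tiny hand-picked stand-in (an assumption for illustration), not NLTK's English list, and `clean_text_sketch` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {"the", "to", "a", "of", "in", "into", "and", "all"}


def clean_text_sketch(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, replace non-alphanumerics with spaces, drop stop words."""
    text = text.lower()
    text = re.sub("[^a-z0-9]", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_text_sketch("Oil prices soar to all-time record, posing new menace"))
# oil prices soar time record posing new menace
```

Note that the hyphen in `all-time` becomes a space, so `all` and `time` are treated as separate words before stop-word removal.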


# 5.0 Semantic Search
Now that our dataset is cleaned up and good to go, we can build semantic search on top of it. I've chosen three techniques for this, ranging from a simple, traditional one (`TF-IDF`), through learned document embeddings with `Doc2Vec`, to a pretrained sentence-embedding model based on the transformer architecture with attention mechanisms, `MiniLM`.
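
All three techniques represent the query and the documents as vectors and rank the documents by cosine similarity to the query. As a reference for what scikit-learn's `cosine_similarity` computes, here is a minimal NumPy sketch; the function name and the toy vectors are illustrative only.

```python
import numpy as np


def cosine_sim(query_vec, doc_matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms  # assumes no all-zero vectors


docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
scores = cosine_sim([2.0, 0.0], docs)
print([round(float(s), 4) for s in scores])  # [1.0, 0.7071, 0.0]
```

The score depends only on the angle between vectors, not their length, which is why the query `[2, 0]` matches the document `[1, 0]` with a score of exactly 1.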

## 5.1 Semantic Search Using Term-Frequency Inverse-Document-Frequency (TF-IDF)

### 5.1.1 Training a `TfidfVectorizer`

In [None]:
tf_idf_vectorizer = TfidfVectorizer(stop_words="english")
tf_idf_matrix = tf_idf_vectorizer.fit_transform(cleaned_texts)
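
To see what the fitted vectorizer and the resulting matrix provide, the following sketch runs the same fit, transform, and cosine-similarity steps on three made-up sentences. The toy corpus and query are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_corpus = [
    "spurs beat mavericks in dallas",
    "oil prices soar to record high",
    "spurs win again as duncan scores",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(toy_corpus)  # sparse, one row per document

# The query must go through the *fitted* vectorizer so it shares the vocabulary
query_vector = vectorizer.transform(["spurs mavericks game"])
scores = cosine_similarity(query_vector, doc_matrix).flatten()

best = scores.argmax()
print(best, toy_corpus[best])
```

The second document shares no terms with the query, so its score is exactly zero; this is the main limitation of purely term-based matching.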

In [None]:
def semantic_search_tf_idf(query, tf_idf_matrix, tf_idf_vectorizer, top_n=5):
    """
    Perform semantic search for the `query` and print the `top_n`
    results most similar to the given `query` based on TF-IDF.

    Parameters
    ----------
    - query : str
        The query for which similar documents should be retrieved.

    - tf_idf_matrix : scipy.sparse matrix
        A sparse matrix containing the TF-IDF vector of each sample in the
        dataset.
    
    - tf_idf_vectorizer : sklearn.feature_extraction.text.TfidfVectorizer
        The fitted TfidfVectorizer used to vectorize the given `query`.
    
    - top_n : int, default=5
        The number of similar documents to display.
    """

    cleaned_query = _clean_text(query)
    query_vector = tf_idf_vectorizer.transform([cleaned_query])

    cosine_similarities = cosine_similarity(query_vector, tf_idf_matrix).flatten()
    # Indices of the `top_n` highest scores, best first
    related_docs_indices = cosine_similarities.argsort()[:-top_n-1:-1]
    # Look the scores up by index instead of sorting the whole array again
    related_doc_scores = cosine_similarities[related_docs_indices]

    # Display the top related documents
    print(f"Top {top_n} Results for Query: '{query}'")
    for i, idx in enumerate(related_docs_indices):
        print(f"{i + 1}. Category: {df.iloc[idx]['category']}\n")
        print(f"   Text: {df.iloc[idx]['clean_text']}\n")
        print(f"   Similarity: {related_doc_scores[i]:.4f}")
        print("="*50)
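
The slice `argsort()[:-top_n-1:-1]` used above takes the last `top_n` positions of the ascending sort in reverse order, that is, the indices of the highest scores, best first. A small NumPy sketch with made-up scores shows that it matches the equivalent `[-top_n:][::-1]` form, and how `np.argpartition` could avoid sorting the whole array (a possible optimization, not something the notebook does):

```python
import numpy as np

scores = np.array([0.10, 0.90, 0.30, 0.75, 0.05])  # made-up similarity scores
top_n = 3

# Form used in the TF-IDF search: walk the ascending order backwards
idx_a = scores.argsort()[:-top_n - 1:-1]
# Equivalent form: take the last top_n, then reverse
idx_b = np.argsort(scores)[-top_n:][::-1]
print(idx_a.tolist(), idx_b.tolist())  # [1, 3, 2] [1, 3, 2]

# argpartition finds the top_n without a full sort; sort only those few
part = np.argpartition(scores, -top_n)[-top_n:]
idx_c = part[np.argsort(scores[part])[::-1]]
print(idx_c.tolist())  # [1, 3, 2]
```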

### 5.1.2 Testing on a Random Query from the Dataset

In [None]:
random_index = np.random.randint(0, len(df))
query = df["clean_text"].iloc[random_index]
category = df["category"].iloc[random_index]

print(f"Query of index: {random_index}: ")
print(query)
print(f"\nCategory of index: {random_index}: ")
print(category)

Query of index: 100955: 
spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night

Category of index: 100955: 
Sports


In [None]:
# Perform semantic search using TF-IDF
semantic_search_tf_idf(query, tf_idf_matrix, tf_idf_vectorizer)

Top 5 Results for Query: 'spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night'
1. Category: Sports

   Text: spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night

   Similarity: 1.0000
2. Category: Sports

   Text: spurs beat magic 94 91 ap ap tim duncan 24 points 14 rebounds lead san antonio spurs 94 91 victory wednesday night orlando magic

   Similarity: 0.4976
3. Category: Sports

   Text: nba game summary san antonio dallas dallas tx sports network tim duncan 20 points 13 rebounds five blocks devin brown scored 14 16 points fourth quarter leading san antonio spurs 107 89 victory dallas mavericks american airlines center

   Similarity: 0.4658
4. Category: Sports

   Text: streaking spurs roll past sixers 88 80 ap ap tim duncan scored season high 34 points grabbed 13 rebounds le

### 5.1.3 Conclusion
As observed, despite its **simplicity**, this technique performs quite well and returns results quickly. For the query `spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night`, the first hit is the query document itself (similarity 1.0000), and the following hits are other Spurs game reports that share terms such as `spurs`, `tim duncan`, and `san antonio` with it.

Additionally, the category of this query is **Sports**, and the results shown above from our `TF-IDF`-based search are all **sports-related** as well.

## 5.2 Semantic Search Using Doc2Vec
**Doc2Vec**, an abbreviation for **Document to Vector**, is a notable natural language processing (NLP) technique that extends the principles of **Word2Vec** to entire documents or sentences.

In contrast to Word2Vec, which represents words as vectors in a continuous vector space, Doc2Vec focuses on encoding the semantic meaning of **entire documents**. The primary implementation of Doc2Vec is known as **the Paragraph Vector model**, where each document in a corpus is associated with a **unique vector**.

This model employs two training approaches:
- `PV-DM (Distributed Memory)`, akin to Word2Vec's Continuous Bag of Words (CBOW) model, considers both context words and the paragraph vector for word predictions.
- `PV-DBOW (Distributed Bag of Words)`, akin to Word2Vec's Skip-gram model, relies solely on the paragraph vector for predicting target words.

The resulting vector representations encapsulate the semantic content of documents, facilitating tasks like document similarity, clustering, and classification.

### 5.2.1 Creating a `Doc2Vec` Model

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tokenize the data into the form expected by the `Doc2Vec` model
tokenized_data = [text.split() for text in cleaned_texts]
# Tag each document with an index
tagged_data = [TaggedDocument(words=words, tags=[str(i)]) for i, words in enumerate(tokenized_data)]
# Initialize the Doc2Vec model
doc2vec_model = Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
# Build the vocabulary
doc2vec_model.build_vocab(tagged_data)

### 5.2.2 Training the `Doc2Vec` Model on the Dataset

In [None]:
# Train the model
doc2vec_model.train(tqdm(tagged_data), total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

100%|██████████| 127600/127600 [00:22<00:00, 5591.81it/s]


### 5.2.3 Using the Trained `Doc2Vec` Model for Getting the Embeddings of each Sample in the Dataset

In [None]:
# Get the embeddings for each document in dataframe
documents_doc2vec_embeddings = np.array([doc2vec_model.infer_vector(words) for words in tqdm(tokenized_data)])
print(f"documents_doc2vec_embeddings shape: {documents_doc2vec_embeddings.shape}")

100%|██████████| 127600/127600 [05:31<00:00, 385.09it/s]


documents_doc2vec_embeddings shape: (127600, 300)


In [None]:
def semantic_search_doc2vec(query, documents_doc2vec_embeddings, model, top_n=5):
    """
    Perform semantic search for the `query` and print the `top_n`
    results most similar to the given `query` based on the Doc2Vec model.

    Parameters
    ----------
    - query : str
        The query for which similar documents should be retrieved.
    
    - documents_doc2vec_embeddings : np.ndarray
        A numpy array containing the Doc2Vec embedding vector of each
        sample in the dataset.
    
    - model : gensim.models.doc2vec.Doc2Vec
        The trained Doc2Vec model used to vectorize the given `query`.
    
    - top_n : int, default=5
        The number of similar documents to display.
    """
    
    # Clean the query
    cleaned_query = _clean_text(query)
    # Put the query in format suitable for the `model`
    query_words = cleaned_query.split()
    # Get the embedding of the query
    query_embedding = model.infer_vector(query_words)
    # Calculate cosine similarities between the query and all vectors in the dataset
    cosine_similarities = cosine_similarity([query_embedding], documents_doc2vec_embeddings)[0]
    # Get the top_n similar documents to the query
    related_docs_indices = np.argsort(cosine_similarities)[-top_n:][::-1]

    # Display the top related documents
    print(f"Top {top_n} Results for Query: '{query}'")
    for i, idx in enumerate(related_docs_indices):
        print(f"{i + 1}. Category: {df.iloc[idx]['category']}\n")
        print(f"   Text: {df.iloc[idx]['clean_text']}\n")
        print(f"   Similarity: {cosine_similarities[idx]:.4f}")
        print("="*50)

### 5.2.4 Testing on a Random Query from the Dataset

In [None]:
# Perform semantic search using Doc2Vec
semantic_search_doc2vec(query, documents_doc2vec_embeddings, doc2vec_model)

Top 5 Results for Query: 'spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night'
1. Category: Sports

   Text: spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night

   Similarity: 0.9542
2. Category: World

   Text: 3 bombings resort towns sinai three explosions shook three egyptian sinai resorts popular vacationing israelis killing least 30 people wounding 100

   Similarity: 0.7559
3. Category: Business

   Text: sec sues 3 former kmart execs washington federal regulators filed civil fraud charges three former kmart executives five current former managers suppliers

   Similarity: 0.7360
4. Category: Science/Technology

   Text: siebel moves toward self repairing software com october 11 2004 3 34 pm pt fourth priority 39 main focus enterprise directories organizations spawn projects

### 5.2.5 Conclusion
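As observed, the results are noticeably worse than those of the `TF-IDF`-based search. Although the query belongs to the **Sports** category, the hits after the query document itself come from unrelated categories such as **World**, **Business**, and **Science/Technology**. Note also that the query's own document scores 0.9542 rather than 1.0: `infer_vector` estimates a fresh embedding on every call, so two inferences of the same text differ slightly.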

## 5.3 Semantic Search Using Sentence Transformers
**Sentence Transformers** is a family of models, with an accompanying library, designed for **transforming sentences or short paragraphs into meaningful vector representations in a continuous vector space**. Unlike word embeddings, which capture the meanings of individual words, these models focus on **encoding the semantic content of entire sentences**, so that sentences with similar meanings end up close together.

The models are based on the **transformer architecture**, a neural network architecture that has shown remarkable success in various NLP tasks. The checkpoint used here, `all-MiniLM-L6-v2`, is a small pretrained transformer that was further fine-tuned on a large collection of sentence pairs with a contrastive objective, and it produces 384-dimensional vectors. One of its key advantages over fixed word vectors is that the underlying token representations are **contextualized**: the representation of a word depends on the other words in the sentence, and the sentence embedding is pooled from these token representations.

### 5.3.1 Loading the Tokenizer and the Model

In [None]:
model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt)

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

tf_model.h5:   0%|          | 0.00/91.0M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### 5.3.2 Getting the `input_ids` and `attention_mask` for each Sample in the Dataset

In [None]:
encoded_documents = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="tf")

### 5.3.3 Get the Embeddings for each Sample in the Dataset
Because of the limited **available GPU memory**, we cannot pass the entire set of `encoded_documents` through the model at once. Instead, I split `encoded_documents` into **4400** batches of **29** examples each (127,600 = 4400 × 29), which fit in memory without any allocation errors. The number **384** corresponds to the `embedding_dim` of the **MiniLM** model. Note that the code below takes the model's `pooler_output` as the sentence embedding, whereas the model card for this checkpoint describes mean pooling over the token embeddings, so the scores here may differ from those produced by the `sentence-transformers` library.

In [None]:
documents_minilm_embeddings = np.zeros((len(df), 384))
input_ids_batches = np.array_split(encoded_documents['input_ids'], 4400)
attention_mask_batches = np.array_split(encoded_documents['attention_mask'], 4400)
batch_size = len(df) // 4400  # 29

for i, (input_ids_batch, attention_mask_batch) in tqdm(enumerate(zip(input_ids_batches, attention_mask_batches))):
    batch_minilm_embeddings = model(input_ids=input_ids_batch, attention_mask=attention_mask_batch).pooler_output.numpy()
    documents_minilm_embeddings[i*batch_size: (i+1)*batch_size] = batch_minilm_embeddings

4400it [07:45,  9.45it/s]
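
`np.array_split` with 4400 sections yields equal batches here only because 127,600 is exactly divisible by 4400; with uneven batches, the fixed `i*batch_size` offsets would no longer line up. The sketch below shows a variant that tracks the write offset from each batch's actual length. `embed_batch` is a hypothetical stand-in for the model call, and the array sizes are made up.

```python
import numpy as np


def embed_batch(batch):
    """Hypothetical stand-in for the model: one 4-dim vector per row."""
    return np.repeat(batch.sum(axis=1, keepdims=True), 4, axis=1).astype(float)


inputs = np.arange(30).reshape(10, 3)   # 10 toy "documents"
batches = np.array_split(inputs, 4)     # 10 is not divisible by 4
print([len(b) for b in batches])        # [3, 3, 2, 2]

embeddings = np.zeros((len(inputs), 4))
offset = 0
for batch in batches:
    out = embed_batch(batch)
    # Use the real batch length instead of assuming a constant batch size
    embeddings[offset:offset + len(batch)] = out
    offset += len(batch)

print(embeddings[:, 0].tolist())
```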


In [None]:
documents_minilm_embeddings

array([[-0.02879272,  0.06663623,  0.04896924, ..., -0.04945499,
        -0.09991661,  0.05025662],
       [ 0.03139488,  0.01146773, -0.10593864, ..., -0.02002352,
        -0.13299237,  0.014741  ],
       [-0.00694024,  0.064034  , -0.02520756, ...,  0.09505007,
        -0.05253974, -0.00017   ],
       ...,
       [-0.01191332, -0.03757676, -0.04574275, ..., -0.05708188,
        -0.07425457, -0.02271388],
       [-0.03884383,  0.02682269,  0.01142227, ..., -0.04864487,
        -0.10057689, -0.1114378 ],
       [-0.07792903,  0.06630514,  0.09481338, ...,  0.06931278,
        -0.04942315, -0.0144629 ]])
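
The embeddings above come from the model's `pooler_output`. The model card for `all-MiniLM-L6-v2` instead describes mean pooling over the token embeddings (weighted by the attention mask) followed by L2 normalization. The NumPy sketch below illustrates that pooling step on small made-up arrays standing in for `last_hidden_state` and `attention_mask`; it illustrates the technique and is not a drop-in replacement for the cell above.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)       # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    pooled = summed / counts
    # L2-normalize so that a dot product equals cosine similarity
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)


# Toy batch: 2 sentences, 3 token positions, 2-dim embeddings
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],   # last position is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

print(mean_pool(tokens, mask).tolist())  # [[1.0, 0.0], [0.0, 1.0]]
```

The padded position `[9.0, 9.0]` has no effect on the first sentence's embedding, which is the purpose of weighting by the attention mask.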

In [None]:
def semantic_search_minilm(query, documents_minilm_embeddings, tokenizer, model, top_n=5):
    """
    Perform semantic search for the `query` and print the `top_n`
    results most similar to the given `query` based on the MiniLM model.

    Parameters
    ----------
    - query : str
        The query for which similar documents should be retrieved.
    
    - documents_minilm_embeddings : np.ndarray
        A numpy array containing the MiniLM embedding vector of each sample
        in the dataset.
    
    - tokenizer : transformers.AutoTokenizer
        The pretrained tokenizer associated with the `model`, used to
        tokenize the `query`.
    
    - model : transformers.TFAutoModel
        The pretrained transformer model used to get the embedding of the
        `query`.
    
    - top_n : int, default=5
        The number of similar documents to display.
    """
    
    # Clean the query
    cleaned_query = _clean_text(query)
    # Tokenize the `cleaned_query` to get `input_ids` and `attention_mask` 
    encoded_input = tokenizer(cleaned_query, truncation=True, return_tensors="tf")
    # Get the query embedding by passing the encoded input to the model
    query_embedding = model(**encoded_input).pooler_output.numpy()
    # Calculate cosine similarities between the query and all other documents in the dataset
    cosine_similarities = cosine_similarity(query_embedding, documents_minilm_embeddings)[0]
    # Get the top_n similar documents to the query
    related_docs_indices = np.argsort(cosine_similarities)[-top_n:][::-1]

    # Display the top related documents
    print(f"Top {top_n} Results for Query: '{query}'")
    for i, idx in enumerate(related_docs_indices):
        print(f"{i + 1}. Category: {df.iloc[idx]['category']}\n")
        print(f"   Text: {df.iloc[idx]['clean_text']}\n")
        print(f"   Similarity: {cosine_similarities[idx]:.4f}")
        print("="*50)

### 5.3.4 Testing on a Random Query from the Dataset

In [None]:
# Perform semantic search using MiniLM
semantic_search_minilm(query, documents_minilm_embeddings, tokenizer, model)

Top 5 Results for Query: 'spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night'
1. Category: Sports

   Text: spurs defeat mavericks 94 80 tim duncan scored 27 points san antonio spurs held dallas 3 17 shooting fourth quarter 94 80 victory mavericks wednesday night

   Similarity: 1.0000
2. Category: Sports

   Text: spurs run mavericks 107 89 ap ap devin brown sparked fourth quarter spurt two three point plays two dunks helping san antonio spurs beat dallas mavericks 107 89 monday night spoil pseudo coaching debut avery johnson

   Similarity: 0.9195
3. Category: Sports

   Text: spurs 107 mavericks 89 devin brown sparked fourth quarter spurt two three point plays two dunks helping san antonio spurs beat dallas mavericks 107 89 monday night spoil pseudo coaching debut avery johnson

   Similarity: 0.9107
4. Category: Sports

   Text: duncan leads spurs past hornets 83 69 ap ap tim du

### 5.3.5 Conclusion
As evident from the results, the **contextualized embeddings** produced through the model's **attention mechanisms** give the closest matches for our query, which describes a **basketball game between the Spurs and the Mavericks**. After the query document itself, the next two hits are both reports of another **Spurs vs. Mavericks** game, with similarity scores above 0.91, while the fourth visible hit is a Spurs game against a different opponent.

## 5.4 Testing all Techniques on an External Query

In [None]:
query = "Real Madrid beat Barcelona 4-1 in the Spanish Super Cup in Saudi Arabia, as El Clasico delivered drama and brilliant goals once again."

In [None]:
# Perform semantic search using TF-IDF
print("TF-IDF Model\n")
semantic_search_tf_idf(query, tf_idf_matrix, tf_idf_vectorizer)

TF-IDF Model

Top 5 Results for Query: 'Real Madrid beat Barcelona 4-1 in the Spanish Super Cup in Saudi Arabia, as El Clasico delivered drama and brilliant goals once again.'
1. Category: Sports

   Text: barcelona beat real madrid barcelona moved seven points clear top spanish league saturday following three nil victory home second placed real madrid

   Similarity: 0.3614
2. Category: Sports

   Text: barcelona shuts rival real madrid madrid spain barcelona moved ahead spanish league beating rival real madrid 3 0 saturday country 39 biggest match

   Similarity: 0.3411
3. Category: Sports

   Text: barcelona real madrid post home wins barcelona spain sports network david beckham scored game winner real madrid 39 galacticos 39 barcelona week two spanish premier division

   Similarity: 0.3307
4. Category: Sports

   Text: barcelona beats real madrid spanish league barcelona moved ahead spanish league beating rival real madrid 3 0 saturday country 39 biggest match samuel eto 39 giovan

In [None]:
# Perform semantic search using Doc2Vec
print("DOC2VEC Model\n")
semantic_search_doc2vec(query, documents_doc2vec_embeddings, doc2vec_model)

DOC2VEC Model

Top 5 Results for Query: 'Real Madrid beat Barcelona 4-1 in the Spanish Super Cup in Saudi Arabia, as El Clasico delivered drama and brilliant goals once again.'
1. Category: Sports

   Text: uefa cup champ takes super cup 2 1 victory porto uefa cup holders valencia beat european champion porto 2 1 win super cup monaco 39 stade louis ii friday midfielder vicente laid valencia goals ruben baraja heading

   Similarity: 0.4815
2. Category: Sports

   Text: fa investigate chelsea west ham violence league cup match football association investigate crowd violence marred chelsea 39 1 0 win west ham league cup mateja kezman scored goal wednesday night stamford

   Similarity: 0.3916
3. Category: Sports

   Text: marseille 39 european cup winning 39 sorcerer 39 dies belgian 1993 european cup french side marseille time side france captured european club football 39 premier trophy 1978 cup winners cup belgian giants anderlecht

   Similarity: 0.3915
4. Category: Sports

   Text: u

In [None]:
# Perform semantic search using MiniLM
print("MINILM Model\n")
semantic_search_minilm(query, documents_minilm_embeddings, tokenizer, model)

MINILM Model

Top 5 Results for Query: 'Real Madrid beat Barcelona 4-1 in the Spanish Super Cup in Saudi Arabia, as El Clasico delivered drama and brilliant goals once again.'
1. Category: Sports

   Text: spain real madrid crush levante ronaldo scored twice real madrid ended two game winless slide 5 0 spanish league victory seventh placed levante santiago bernabeu sunday

   Similarity: 0.8590
2. Category: Sports

   Text: barcelona beats real madrid spanish league barcelona moved ahead spanish league beating rival real madrid 3 0 saturday country 39 biggest match samuel eto 39 giovanni van bronckhorst scored first half ronaldinho

   Similarity: 0.8589
3. Category: Sports

   Text: liga sunday wrap madrid answer critics real madrid ended talk crisis club thumped levante 5 0 bernabeu valencia moved back champions league places 2 0 win mallorca

   Similarity: 0.8574
4. Category: Sports

   Text: barcelona 3 0 real madrid cameroon 39 samuel eto 39 fils helped barcelona trounce real mad

# 6.0 Final Conclusion
In conclusion, based on the two queries tested above, fundamental techniques like `TF-IDF` continue to perform remarkably well even without the use of neural networks. `Doc2Vec` gave mixed results: on the in-dataset query it returned documents from unrelated categories, while on the external query its hits were at least all football-related **Sports** articles. The most effective technique appears to be the `MiniLM` transformer-based model, most likely owing to its **attention mechanisms**, which produce **contextualized embeddings**.
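
The category comparisons made informally in the conclusions above can be turned into a simple number: the fraction of returned results whose category matches the query's. The sketch below computes it from the categories of the first four hits shown earlier for the Spurs query. `category_precision` is a hypothetical helper, and category agreement is only a rough proxy for relevance.

```python
def category_precision(result_categories, query_category):
    """Fraction of returned results sharing the query's category."""
    if not result_categories:
        return 0.0
    hits = sum(cat == query_category for cat in result_categories)
    return hits / len(result_categories)


# Categories of the first four hits shown above for the Spurs query
results = {
    "TF-IDF": ["Sports", "Sports", "Sports", "Sports"],
    "Doc2Vec": ["Sports", "World", "Business", "Science/Technology"],
    "MiniLM": ["Sports", "Sports", "Sports", "Sports"],
}

for name, cats in results.items():
    print(f"{name}: {category_precision(cats, 'Sports'):.2f}")
# TF-IDF: 1.00
# Doc2Vec: 0.25
# MiniLM: 1.00
```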