<a href="https://colab.research.google.com/github/Bazinga97/ML_Projects/blob/main/Vectorization_%26_query_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Building Efficient Semantic Search with Word Embeddings and Transformers**

1. Word embeddings and their applications in NLP
2. Sentence embeddings for encoding text semantics
3. Indexing and similarity search with libraries like FAISS
4. Fine-tuning pre-trained language models for specific tasks
5. Using the Hugging Face Transformers library


## **Word Embeddings in NLP**

Word embeddings are dense vector representations of words that capture their semantic meaning.

Popular word embedding techniques include:

Word2Vec: Predicts context words given a target word (skip-gram) or vice versa (CBOW)

GloVe: Combines global matrix factorization and local context window methods

FastText: Extends Word2Vec by using subword information

Word embeddings capture semantic similarity between words and have many applications:

Text classification and sentiment analysis

Information retrieval and search

Machine translation

Question answering systems
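
To make the idea of semantic similarity concrete, here is a minimal sketch that compares words by the cosine similarity of their vectors. The vectors in `toy_vectors` are invented for illustration and are not taken from any trained model; real Word2Vec, GloVe, or FastText vectors are learned from text and are much higher dimensional.

```python
import numpy as np

# Hypothetical 4-dimensional vectors, invented for illustration only.
# Real embeddings are learned from a corpus and typically have 100-300 dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.2, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values the related pair scores higher than the unrelated pair, which is the property downstream applications rely on.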

## **Sentence Embeddings**

While word embeddings encode individual words, sentence embeddings represent the meaning of entire sentences or documents.

Approaches for generating sentence embeddings include:

Doc2Vec: Extends Word2Vec to learn document-level embeddings

Sentence-BERT: Fine-tunes BERT to produce semantically meaningful sentence embeddings

Sentence embeddings make it possible to compute semantic textual similarity between sentences and passages. This enables applications such as semantic search, clustering, and paraphrase detection.
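
As a rough illustration of how semantic search follows from sentence embeddings, the sketch below ranks a tiny corpus against a query by cosine similarity. The sentences, the 3-dimensional vectors, and the helper `top_k` are all hypothetical stand-ins; in practice each row would be produced by a sentence-embedding model.

```python
import numpy as np

# Hypothetical sentence embeddings, invented for illustration only.
corpus = [
    "Fire breaks out at a fuel depot",
    "National football team announces squad",
    "Firefighters contain blaze at oil storage site",
]
corpus_emb = np.array([
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
], dtype=np.float32)
query_emb = np.array([0.85, 0.1, 0.25], dtype=np.float32)

def top_k(query, matrix, k=2):
    """Return (indices, scores) of the k rows most cosine-similar to query."""
    # Normalising both sides turns the dot product into cosine similarity
    matrix_n = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = matrix_n @ query_n
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

indices, scores = top_k(query_emb, corpus_emb)
for i, s in zip(indices, scores):
    print(f"{s:.3f}  {corpus[i]}")
```

This is an exhaustive comparison against every row, which is fine for a handful of sentences but becomes the bottleneck on a large corpus.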


## **Indexing and Similarity Search**

Efficient similarity search is the key to retrieving relevant documents from a large corpus.

Libraries like FAISS provide optimized algorithms for indexing and searching dense vectors:

1. Flat indexes (brute-force search)
2. Inverted file indexes with vector quantization (IVF)
3. Hierarchical navigable small world graphs (HNSW)

These techniques allow trading off speed vs accuracy to scale semantic search to millions of documents.
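
The speed versus accuracy trade-off can be sketched without FAISS. The example below is a simplified NumPy imitation of the inverted-file idea, not FAISS's actual implementation: vectors are grouped with a few rounds of k-means, and a query is compared only with vectors in its `nprobe` nearest clusters. The synthetic data and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 16)).astype(np.float32)  # synthetic data
query = rng.normal(size=16).astype(np.float32)

def sq_dists(points, centers):
    """Squared L2 distance from every point to every center."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def train_centroids(data, n_clusters, n_iter=10):
    """A few rounds of plain k-means (Lloyd's algorithm)."""
    centroids = data[rng.choice(len(data), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(data, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = data[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def flat_search(data, q, k):
    """Brute force: compare the query with every vector."""
    dists = ((data - q) ** 2).sum(axis=1)
    return np.argsort(dists)[:k]

def ivf_search(data, centroids, assign, q, k, nprobe):
    """Compare the query only with vectors in the nprobe nearest clusters."""
    nearest = np.argsort(((centroids - q) ** 2).sum(axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, nearest))
    dists = ((data[candidates] - q) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

n_clusters = 20
centroids = train_centroids(vectors, n_clusters)
assign = sq_dists(vectors, centroids).argmin(axis=1)

exact = flat_search(vectors, query, k=5)
for nprobe in (1, 5, n_clusters):
    approx = ivf_search(vectors, centroids, assign, query, k=5, nprobe=nprobe)
    recall = len(np.intersect1d(exact, approx)) / len(exact)
    print(f"nprobe={nprobe:2d}  recall@5={recall:.2f}")
```

Probing every cluster reproduces the brute-force result, while probing fewer clusters scans fewer vectors and may miss some true neighbours. Recall can only stay the same or improve as `nprobe` grows, because the candidate sets are nested.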

## **Fine-Tuning Pre-trained Language Models**

Transfer learning with pre-trained language models has revolutionized NLP. Models such as BERT, RoBERTa, and XLNet can be fine-tuned on downstream tasks with far less data and compute than training a model from scratch.


```



In [1]:
!pip install sentence-transformers
!pip install faiss-cpu

In [2]:
import pandas as pd
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import faiss
import time
import numpy as np

In [5]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [6]:
import pandas as pd

# Path to the CSV file in Google Drive
path = "/content/drive/MyDrive/Colab Notebooks/Indonesian_News_Dataset.csv"

# Load the dataset into a pandas DataFrame called df
df = pd.read_csv(path)
# Drop columns that are not needed for title search
df = df.drop(columns=['image', 'embedding'])
# Replace missing titles with empty strings so encode() only receives str values
df['title'] = df['title'].fillna('')

In [7]:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['title'].tolist())

In [8]:
query_titles = [
    'Kebakaran Depo Plumpang',
    'Gagal Ginjal',
    'Anies Baswedan',
    'Komisi Pemberantasan Korupsi',
    'Presiden Jokowi',
    'Timnas Indonesia',
    'Polisi Tahan'
]

In [11]:
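# Encode the queries with the same model used for the titles
query_embeddings = model.encode(query_titles)
d = embeddings.shape[1]  # Embedding dimensionality, required by every FAISS index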

In [12]:
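def perform_search(index, query_embedding, k=5):
    """Search `index` for the k nearest neighbours of one query vector.

    Returns the distances, the matching row ids, and the elapsed time in seconds.
    """
    # FAISS expects a 2-D array of shape (n_queries, d)
    start_time = time.perf_counter()
    D, I = index.search(query_embedding.reshape(1, -1), k)
    execution_time = time.perf_counter() - start_time
    return D[0], I[0], execution_time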

In [13]:
# Flat indexing (exact brute-force search, used as the baseline)
index_flat = faiss.IndexFlatL2(d)
index_flat.add(embeddings)


In [14]:
results_flat = []
for query_embedding in query_embeddings:
    D_flat, I_flat, time_flat = perform_search(index_flat, query_embedding)
    results_flat.append((D_flat, I_flat, time_flat))

In [15]:

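# IVF indexing: vectors are grouped into nlist clusters and only the closest clusters are searched
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(d)  # Coarse quantizer that assigns vectors to clusters
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

# nprobe is left at its default of 1, so each query scans a single cluster;
# a larger index_ivf.nprobe would raise recall at the cost of speed
index_ivf.train(embeddings)  # Learn the cluster centroids
index_ivf.add(embeddings)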

results_ivf = []
for query_embedding in query_embeddings:
    D_ivf, I_ivf, time_ivf = perform_search(index_ivf, query_embedding)
    results_ivf.append((D_ivf, I_ivf, time_ivf))

In [16]:

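# HNSW indexing (graph-based approximate search)
index_hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = number of graph neighbours per node (M)
index_hnsw.hnsw.efConstruction = 200  # Candidate list size while building the graph
index_hnsw.hnsw.efSearch = 128  # Candidate list size at query time
index_hnsw.add(embeddings)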

results_hnsw = []
for query_embedding in query_embeddings:
    D_hnsw, I_hnsw, time_hnsw = perform_search(index_hnsw, query_embedding)
    results_hnsw.append((D_hnsw, I_hnsw, time_hnsw))

In [17]:
# Output comparison
comparison_results = []
for i, query in enumerate(query_titles):
    query_result = {
        'query': query,
        'title_flat': df.iloc[results_flat[i][1][0]]['title'],
        'title_ivf': df.iloc[results_ivf[i][1][0]]['title'],
        'title_hnsw': df.iloc[results_hnsw[i][1][0]]['title'],
        'time_flat': results_flat[i][2],
        'time_ivf': results_ivf[i][2],
        'time_hnsw': results_hnsw[i][2]
    }
    comparison_results.append(query_result)

df_comparison = pd.DataFrame(comparison_results)
print(df_comparison)

                          query  \
0       Kebakaran Depo Plumpang   
1                  Gagal Ginjal   
2                Anies Baswedan   
3  Komisi Pemberantasan Korupsi   
4               Presiden Jokowi   
5              Timnas Indonesia   
6                  Polisi Tahan   

                                          title_flat  \
0  Investigasi Kebakaran Depo BBM Plumpang Tuntas...   
1  Tim Advokasi Gagal Ginjal Akut Minta Jokowi Tu...   
2  Anies Baswedan Bicara soal Menko yang Mau Ubah...   
3  KPK Jerat Bupati Kapuas dan Istri sebagai Ters...   
4  Presiden Jokowi Lepas Keberangkatan Jenazah Is...   
5  Stefano Lilipaly Resmi Dipanggil Timnas Indone...   
6                       Puasa dan Pendidikan Politik   

                                           title_ivf  \
0  Soal Kebakaran Depo Pertamina Plumpang, Polri ...   
1                  5 Perusahaan Milik Nikita Mirzani   
2  Antara Ganjar, Raja-raja Demak, dan Spirit Tol...   
3  KPK Jerat Bupati Kapuas dan Istri sebagai T

In [18]:
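# Evaluate recall@k against the exact (flat) results
def evaluate_recall(results, ground_truth, k=5):
    """Mean fraction of the top-k ground-truth ids that each index also returned."""
    recall_scores = []
    for i in range(len(results)):
        retrieved_ids = results[i][1][:k]
        relevant_ids = ground_truth[i][:k]
        intersection = np.intersect1d(retrieved_ids, relevant_ids)
        recall = len(intersection) / len(relevant_ids)
        recall_scores.append(recall)
    return np.mean(recall_scores)

# The flat index is exact, so its results serve as ground truth
# (its own recall is therefore 1.00 by construction)
ground_truth = [results_flat[i][1] for i in range(len(query_titles))]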

In [19]:
recall_flat = evaluate_recall(results_flat, ground_truth)
recall_ivf = evaluate_recall(results_ivf, ground_truth)
recall_hnsw = evaluate_recall(results_hnsw, ground_truth)

print(f"Recall - Flat indexing: {recall_flat:.2f}")
print(f"Recall - IVF indexing: {recall_ivf:.2f}")
print(f"Recall - HNSW indexing: {recall_hnsw:.2f}")

Recall - Flat indexing: 1.00
Recall - IVF indexing: 0.49
Recall - HNSW indexing: 1.00
```
