<a href="https://colab.research.google.com/github/Aditya100300/LLMs_from_scratch/blob/main/Chapter_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **End-to-End Workflow: Downloading Text Data, Building Sentence Embeddings, and Using Faiss for Nearest Neighbor Search**

This notebook covers:

1. **Data Gathering**: Downloading text data and pre-computed embeddings.
2. **Data Preparation**: Reading and merging multiple STS datasets to create a large corpus of sentences.
3. **Sentence Embeddings**: Using a Transformer model to encode text into high-dimensional vectors.
4. **Faiss**: Building an index for efficient similarity search (both Flat L2 and IVFFlat).
5. **Queries**: Comparing queries to the corpus to find the nearest (most similar) sentences.

We’ve annotated each code block with bullet points to clearly show each step.

---

## **1. Download Data**

```python
# Cell 1: Download the pre-computed embeddings and the sentence file

import os
import requests

data_url = "https://raw.githubusercontent.com/jamescalam/data/main/sentence_embeddings_15K/"

# Create a data directory to store the downloaded files.
os.makedirs("./data", exist_ok=True)

# Download the 57 NumPy binary files that hold the embeddings.
for i in range(57):
    name = f"embeddings_{i:02d}.npy"  # zero-pad the index to two digits
    res = requests.get(data_url + name)
    res.raise_for_status()  # stop early if a download fails
    with open(f"./data/{name}", "wb") as fp:
        fp.write(res.content)
    print(".", end="")  # one dot per file as a progress indicator

# Download the corresponding text file with all sentences.
res = requests.get(f"{data_url}sentences.txt")
res.raise_for_status()
with open("./data/sentences.txt", "wb") as fp:
    fp.write(res.content)

print("\nDone downloading embeddings and text file.")
```


## **2. Read Data**

In [2]:
urls = [
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv'
]

In [3]:
import pandas as pd

import requests
from io import StringIO
res = requests.get('https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt')
# create dataframe
data = pd.read_csv(StringIO(res.text), sep='\t')
data.head()

| | pair_ID | sentence_A | sentence_B | relatedness_score | entailment_judgment |
|---|---|---|---|---|---|
| 0 | 1 | A group of kids is playing in a yard and an ol... | A group of boys in a yard is playing and a man... | 4.5 | NEUTRAL |
| 1 | 2 | A group of children is playing in the house an... | A group of kids is playing in a yard and an ol... | 3.2 | NEUTRAL |
| 2 | 3 | The young boys are playing outdoors and the ma... | The kids are playing outdoors near a man with ... | 4.7 | ENTAILMENT |
| 3 | 5 | The kids are playing outdoors near a man with ... | A group of kids is playing in a yard and an ol... | 3.4 | NEUTRAL |
| 4 | 9 | The young boys are playing outdoors and the ma... | A group of kids is playing in a yard and an ol... | 3.7 | NEUTRAL |


### Explanation:
- **urls**: A list of SemEval STS datasets to eventually parse.
- **SICK 2014**: Another STS-like dataset. We fetch it from GitHub and parse it with `pandas.read_csv`; the file is tab-separated despite its `.txt` extension.
- **StringIO(res.text)**: Treats the raw text content as if it were a file for `read_csv`.
- **data.head()**: Quickly inspects the first 5 rows.
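
To make the `StringIO` step concrete, here is a minimal sketch that parses a tab-separated string as if it were an open file. It deliberately uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies; the two sample rows and the name `tsv_text` are made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical two-row sample in the same tab-separated layout as SICK_train.txt.
tsv_text = (
    "pair_ID\tsentence_A\tsentence_B\trelatedness_score\n"
    "1\tA dog runs\tA dog is running\t4.8\n"
    "2\tA man sings\tA cat sleeps\t1.2\n"
)

# StringIO wraps the string in a file-like object, so any reader that expects
# an open file (csv.DictReader here, pd.read_csv in the notebook) can use it.
rows = list(csv.DictReader(StringIO(tsv_text), delimiter="\t"))

sentences_a = [row["sentence_A"] for row in rows]
print(sentences_a)
```

The same pattern is what lets `pd.read_csv` consume `res.text` without first writing the download to disk.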


In [4]:
# start with all samples from sentence A
sentences = data['sentence_A'].tolist()
sentences[:5]

['A group of kids is playing in a yard and an old man is standing in the background',
 'A group of children is playing in the house and there is no man standing in the background',
 'The young boys are playing outdoors and the man is smiling nearby',
 'The kids are playing outdoors near a man with a smile',
 'The young boys are playing outdoors and the man is smiling nearby']

In [5]:
# we take all samples from both sentence A and B
sentences = data['sentence_A'].tolist()
sentence_b = data['sentence_B'].tolist()
sentences.extend(sentence_b)  # merge them
len(set(sentences))  # together we have ~4.8K unique sentences

4802

In [6]:
sentences[:5]

['A group of kids is playing in a yard and an old man is standing in the background',
 'A group of children is playing in the house and there is no man standing in the background',
 'The young boys are playing outdoors and the man is smiling nearby',
 'The kids are playing outdoors near a man with a smile',
 'The young boys are playing outdoors and the man is smiling nearby']

### Explanation:
- **data['sentence_A']**: The SICK dataset’s first sentence column.
- **sentence_B**: The second column for the same data.
- **sentences.extend()**: Merge both lists.
- **set(sentences)**: Eliminates any exact duplicates.


In [7]:


# each of these datasets has the same structure, so we loop through them and extend our sentences list
for url in urls:
    res = requests.get(url)
    # read into a dataframe; these files have no header row
    smalldata = pd.read_csv(StringIO(res.text), sep='\t', header=None, on_bad_lines='skip')
    # add columns 1 and 2 (the two sentences of each pair) to the sentences list
    sentences.extend(smalldata[1].tolist())
    sentences.extend(smalldata[2].tolist())

In [8]:
# smalldata.head()

In [9]:
# remove duplicates and NaN (non-string) entries
sentences = [s for s in set(sentences) if isinstance(s, str)]

In [10]:
len(sentences)

14504

In [11]:
sentences[:5]

['The blue train is at the station.',
 'The woman with a knife is slicing a pepper',
 'A man in a red uniform is swiftly making a jump in a dirt bike race',
 'Two people are walking by the ocean.',
 'A man is resting on a chair and rubbing his eyes']

### Explanation:
- **on_bad_lines='skip'**: Skips lines that aren’t well-formed TSV.
- **sentences.extend()**: For each STS file, columns 1 and 2 are the sentences.
- **Final unique set**: Consolidate into a single list with no duplicates.
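
One caveat with `set()` is that its iteration order for strings can differ between Python sessions, so the position of a sentence in the list (and therefore its ID in the index built later) is not reproducible from run to run. If stable IDs matter, an order-preserving variant is one option. The sketch below is an alternative to the cell above, not what the notebook ran, and `dedupe_keep_order` is a hypothetical helper name.

```python
def dedupe_keep_order(items):
    """Drop non-string entries (such as NaN) and exact duplicates,
    keeping the first occurrence of each sentence in its original position."""
    strings = (item for item in items if isinstance(item, str))
    # dict preserves insertion order (Python 3.7+), so it acts as an ordered set.
    return list(dict.fromkeys(strings))


# Made-up sample containing a duplicate and a NaN.
raw = ["A man is dancing", float("nan"), "A dog runs", "A man is dancing"]
print(dedupe_keep_order(raw))
```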


In [12]:
import numpy as np
!pip install faiss-gpu  # no faiss-gpu wheel was found in this environment; faiss-cpu is installed in a later cell

ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu

**Pipeline:** raw text > embeddings (using an encoder) > stored in Faiss > raw text query > query embedding > Euclidean distance to retrieve the text closest to the query

In [13]:
!pip install -U sentence-transformers

### Explanation:
- **faiss-gpu** or **faiss-cpu**:
  - GPU version is faster if you have a CUDA environment.
  - CPU version can be used otherwise.
- **sentence-transformers**: We’ll use this library to encode text.


In [14]:
# !git clone https://github.com/jamescalam/data.git data-embeddings

In [15]:
# path = '/content/data-embeddings/sentence_embeddings_15K/'
# sentence_embeddings = []
# for i in range(0,57):
#     # if i < 10:
#     #     i = '0' + str(i)
#     res = path+f"embeddings_{i}.npy"
#     print(res)
#     sm = np.load(res)
#     sentence_embeddings.append(sm)

In [16]:
# arr = np.concatenate(sentence_embeddings)

In [17]:
# arr.shape

In [18]:
# sentences = sentences[:100]

In [19]:
from sentence_transformers import SentenceTransformer
# initialize sentence transformer model
#model = SentenceTransformer('bert-base-nli-mean-tokens') #encoding model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

### Explanation:
- **SentenceTransformer**: The model class from the `sentence-transformers` library; it loads a pretrained encoder by name.
- **model.encode(sentences)**: Returns a NumPy array of shape `[num_sentences, embedding_dim]`.
- **all-mpnet-base-v2**: Produces 768-dimensional embeddings, as the `(14504, 768)` shape below confirms.
- We use `%%time` (a Jupyter cell magic) to see how long the encoding step takes.


In [20]:
len(sentences)

14504

In [21]:
%%time
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

CPU times: user 15.9 s, sys: 300 ms, total: 16.2 s
Wall time: 16.5 s


(14504, 768)

In [22]:
sentences[0]

'The blue train is at the station.'

In [23]:
sentence_embeddings[0]

array([-3.99602614e-02, -5.74061833e-02, -1.79971615e-03,  3.06702103e-03,
       -1.16028097e-02,  9.68051050e-03,  9.70228296e-03, -8.37675296e-03,
       -8.88904557e-02,  1.37418369e-02,  6.01364262e-02,  6.66417368e-03,
       -3.04194055e-02, -3.19189429e-02,  5.01921847e-02, -4.54170667e-02,
        4.07873802e-02,  1.90555083e-03, -4.22590189e-02,  2.18143631e-02,
        2.54997965e-02,  5.28704375e-02, -4.48958203e-02,  1.17269428e-02,
       -5.30501194e-02,  1.26087144e-02, -9.80779249e-03,  1.88884940e-02,
        3.23912650e-02, -1.25711747e-02, -3.68603170e-02,  2.40126811e-02,
       -1.82127859e-02,  5.00943810e-02,  1.80715881e-06, -3.05367168e-03,
       -2.88588703e-02,  2.70409677e-02, -2.49318723e-02, -1.31752372e-01,
        1.75932180e-02,  1.79181609e-03, -2.35143267e-02,  2.31805164e-02,
        1.31025333e-02,  1.44792143e-02,  4.63404395e-02,  1.04292966e-01,
        2.89229024e-02,  2.13709958e-02,  1.57561973e-02, -5.37288189e-02,
       -4.84666452e-02,  

### Explanation:
- **sentences[0]**: A sample string from the dataset.
- **sentence_embeddings[0]**: The 768-dimensional float vector for that sentence.


In [24]:
# with open(f'./sim_sentences/embeddings_X.npy', 'wb') as fp:
#     np.save(fp, sentence_embeddings[0:256])

In [25]:
# # saving data
# split = 256
# file_count = 0
# for i in range(0, sentence_embeddings.shape[0], split):
#     end = i + split
#     if end > sentence_embeddings.shape[0]:
#         end = sentence_embeddings.shape[0]
#     file_count = '0' + str(file_count) if file_count < 10 else str(file_count)
#     with open(f'./sim_sentences/embeddings_{file_count}.npy', 'wb') as fp:
#         np.save(fp, sentence_embeddings[i:end, :])
#     print(f"embeddings_{file_count}.npy | {i} -> {end}")
#     file_count = int(file_count) + 1
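
The commented-out cells above hint at a save-then-reload round trip for the embeddings. A minimal sketch of that idea follows, assuming 256 rows per file and zero-padded file names like the ones downloaded in Cell 1. The helper names `save_in_chunks` and `load_chunks` are hypothetical, and random vectors stand in for real sentence embeddings.

```python
import os
import tempfile

import numpy as np


def save_in_chunks(embeddings, out_dir, chunk_size=256):
    """Write `embeddings` to out_dir as embeddings_00.npy, embeddings_01.npy, ..."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for file_count, start in enumerate(range(0, embeddings.shape[0], chunk_size)):
        path = os.path.join(out_dir, f"embeddings_{file_count:02d}.npy")
        # Slicing past the end is safe in NumPy, so no manual clamping is needed.
        np.save(path, embeddings[start:start + chunk_size])
        paths.append(path)
    return paths


def load_chunks(paths):
    """Load the chunk files in order and stack them back into one array."""
    return np.concatenate([np.load(p) for p in paths])


# Stand-in data: random vectors instead of real sentence embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.random((600, 8), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    chunk_paths = save_in_chunks(fake_embeddings, tmp_dir)
    restored = load_chunks(chunk_paths)

print(len(chunk_paths), restored.shape)
```

Reloading only gives back the right sentence-to-vector pairing if `sentences` is in the same order as when the embeddings were computed.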

In [30]:
!pip install faiss-cpu #cpu



In [31]:
import faiss
# sentence_embeddings = arr

In [32]:
d = sentence_embeddings.shape[1]
d

768

In [33]:
sentences[0]

'The blue train is at the station.'

In [34]:
sentence_embeddings[0].shape

(768,)

In [35]:
faiss.Index

In [36]:
index = faiss.IndexFlatL2(d)  # initialize a flat (exhaustive) index that uses Euclidean (L2) distance

In [37]:
index.is_trained

True

In [38]:
index.add(sentence_embeddings)

In [39]:
index.ntotal

14504

### Explanation:
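- **faiss.IndexFlatL2(d)**: The basic Faiss index. It stores the vectors as-is and compares a query against every one of them using L2 (Euclidean) distance; the distances it returns are squared.
- **index.is_trained**: Always `True` for a Flat index, which has nothing to learn.
- **index.add()**: Appends all embeddings. `ntotal` should match the number of sentences (14504 here).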
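
To show what an exhaustive L2 search computes, here is a small NumPy sketch of the same idea. It is a conceptual stand-in, not how Faiss is implemented; `flat_l2_search` is a hypothetical name and the 2-dimensional vectors are made up.

```python
import numpy as np


def flat_l2_search(xb, xq, k):
    """Exhaustive nearest-neighbor search using squared L2 distance.

    xb: (n, d) database vectors, xq: (m, d) query vectors.
    Returns (D, I), each of shape (m, k), sorted by increasing distance.
    """
    # Squared distance for every query/database pair via broadcasting: shape (m, n).
    diffs = xq[:, None, :] - xb[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    I = np.argsort(dists, axis=1)[:, :k]
    D = np.take_along_axis(dists, I, axis=1)
    return D, I


# Toy 2-dimensional "embeddings" (made up for illustration).
xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
xq = np.array([[0.9, 0.1]], dtype=np.float32)

D, I = flat_l2_search(xb, xq, k=2)
print(I)  # indices of the two closest rows in xb
print(D)  # their squared L2 distances
```

Every query touches every stored vector, which is why the search cost grows linearly with the size of the corpus.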



So far we have converted the raw text into embeddings and stored them in a Faiss index. Next we use Euclidean distance to find the stored sentences closest to a query.

In [40]:
sentences[100:101]

['a man wearing an orange U. Miami tee shirt playing tennis']

In [41]:
k = 10
xq = model.encode(["Someone is performing a dance admidst the rainfall."])

The query is now an embedding in the same 768-dimensional space as the corpus. Next we run a Euclidean search that compares it against every vector in the index.

In [42]:
# index

In [43]:
%%time
D, I = index.search(xq, k)  # search
print(I)

[[ 3031  8047   214  9070  3680 13286  4007 11660  8102 10070]]
CPU times: user 6.37 ms, sys: 0 ns, total: 6.37 ms
Wall time: 5.69 ms


In [44]:
[f'{i}: {sentences[i]}' for i in I[0]]

['3031: A person is dancing in the rain',
 '8047: A person is dancing',
 '214: A man is dancing in the rain',
 '9070: A woman is performing in the rain',
 '3680: A male is dancing',
 '13286: A man is dancing',
 '4007: The man is dancing',
 '11660: A girl dances on a sidewalk.',
 '8102: A hiker is on top of the mountain and is dancing',
 '10070: A woman is dancing']

### Explanation:
- **Query**: An arbitrary test sentence about dancing in the rain.
- **model.encode([...])**: Encodes the query into a single embedding, `xq`, of shape `(1, 768)`.
- **index.search(xq, k)**: Returns two arrays:
  - D (distances) of shape (1, k)
  - I (indices) of shape (1, k)
- We loop through the indices `I[0]` to look up the actual sentences.
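
The distances in `D` are often worth printing next to the sentences. A minimal sketch of such a helper follows; `format_results` is a hypothetical name, and the toy lists stand in for the arrays returned by `index.search`. The `-1` check reflects the fact that Faiss fills the index array with `-1` when it finds fewer than `k` neighbors.

```python
def format_results(D, I, sentences):
    """Pair each returned index with its distance and sentence (first query only)."""
    return [
        f"{idx} (distance {dist:.3f}): {sentences[idx]}"
        for dist, idx in zip(D[0], I[0])
        if idx != -1  # skip padding entries for missing neighbors
    ]


# Toy stand-ins for the corpus and for the (D, I) pair of a single query.
toy_sentences = [
    "A person is dancing in the rain",
    "A chef is preparing a meal",
    "A man is dancing",
]
toy_D = [[0.25, 0.5, float("inf")]]
toy_I = [[0, 2, -1]]

for line in format_results(toy_D, toy_I, toy_sentences):
    print(line)
```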


# Quantization

In [45]:
nlist = 100
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

In [46]:
index.is_trained

False

In [47]:
index.train(sentence_embeddings)
index.is_trained

True

In [48]:
index.add(sentence_embeddings)
index.ntotal

14504

In [49]:
%%time
D, I = index.search(xq, k)  # search
print(I)

[[ 3031  8047   214  3680 13286  4007  8102 10070  8833 11393]]
CPU times: user 606 µs, sys: 0 ns, total: 606 µs
Wall time: 549 µs


### Explanation:
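- **nlist**: The number of clusters (cells) the vectors are partitioned into, with one centroid per cell.
- **quantizer**: The Flat index used to find the nearest centroid for a vector.
- **IndexIVFFlat**: Uses the same L2 metric but groups the embeddings into `nlist` buckets, so a search scans only a few of them (`nprobe`, which defaults to 1).
- **index.train(...)**: Runs k-means on the data to learn the centroids, which is why `is_trained` is `False` until then.
- After training, `index.add(...)` assigns each vector to its nearest centroid.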
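
The train / add / search steps above can be sketched in plain NumPy. This is a conceptual illustration of the inverted-file idea under simplifying assumptions (a few rounds of Lloyd's k-means, random stand-in data), not Faiss's implementation; all function names here are hypothetical.

```python
import numpy as np


def nearest_centroid(x, centroids):
    """Index of the closest centroid (squared L2) for each row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def train_centroids(xb, nlist, n_iter=10, seed=0):
    """A few rounds of Lloyd's k-means; returns (nlist, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = xb[rng.choice(len(xb), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = nearest_centroid(xb, centroids)
        for c in range(nlist):
            members = xb[assign == c]
            if len(members):  # leave an empty cell's centroid unchanged
                centroids[c] = members.mean(axis=0)
    return centroids


def ivf_search(xb, centroids, assign, q, k, nprobe=1):
    """Search only the nprobe cells whose centroids are closest to query q."""
    cell_d2 = ((centroids - q) ** 2).sum(axis=1)
    probed = np.argsort(cell_d2)[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, probed))
    d2 = ((xb[candidates] - q) ** 2).sum(axis=1)
    order = np.argsort(d2)[:k]
    return d2[order], candidates[order]


# Stand-in data: random vectors, not real sentence embeddings.
rng = np.random.default_rng(42)
xb = rng.random((2000, 16)).astype(np.float32)
q = rng.random(16).astype(np.float32)

nlist = 20
centroids = train_centroids(xb, nlist)    # "train": learn the centroids
assign = nearest_centroid(xb, centroids)  # "add": place each vector in a cell

_, ids_probe_1 = ivf_search(xb, centroids, assign, q, k=5, nprobe=1)
_, ids_probe_all = ivf_search(xb, centroids, assign, q, k=5, nprobe=nlist)
print(ids_probe_1)
print(ids_probe_all)
```

Probing a single cell scans only a fraction of the vectors and may miss true neighbors that sit in an adjacent cell; probing all `nlist` cells reduces to the exhaustive search.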


In [50]:
index.nprobe = 10

In [51]:
%%time
D, I = index.search(xq, k)  # search
print(I)

[[ 3031  8047   214  3680 13286  4007 11660  8102 10070  8833]]
CPU times: user 844 µs, sys: 0 ns, total: 844 µs
Wall time: 1.03 ms


### Explanation:
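- **nprobe**: The number of clusters to search for candidate vectors. A higher `nprobe` gives better recall but a slower search.
- **index.search(xq, k)**: Returns the top `k` neighbors found in the probed clusters.
- Compared with the Flat index (5.69 ms wall time): `nprobe = 1` took 549 µs and returned 8 of the 10 Flat results; `nprobe = 10` took 1.03 ms and returned 9 of them.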
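
One simple way to put a number on "compare results" is the overlap with the exhaustive search. The sketch below computes that fraction from the index lists printed in the cells above; `recall_against` is a hypothetical helper name.

```python
def recall_against(reference_ids, candidate_ids):
    """Fraction of the reference (exhaustive) results also present in the candidate list."""
    reference = set(reference_ids)
    return len(reference & set(candidate_ids)) / len(reference)


# Index lists copied from the printed outputs above.
flat_ids = [3031, 8047, 214, 9070, 3680, 13286, 4007, 11660, 8102, 10070]
ivf_nprobe_1 = [3031, 8047, 214, 3680, 13286, 4007, 8102, 10070, 8833, 11393]
ivf_nprobe_10 = [3031, 8047, 214, 3680, 13286, 4007, 11660, 8102, 10070, 8833]

print(recall_against(flat_ids, ivf_nprobe_1))   # 0.8
print(recall_against(flat_ids, ivf_nprobe_10))  # 0.9
```

A single query is only an anecdote; averaging this over many queries would give a more reliable picture.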


In [52]:
[f'{i}: {sentences[i]}' for i in I[0]]

['3031: A person is dancing in the rain',
 '8047: A person is dancing',
 '214: A man is dancing in the rain',
 '3680: A male is dancing',
 '13286: A man is dancing',
 '4007: The man is dancing',
 '11660: A girl dances on a sidewalk.',
 '8102: A hiker is on top of the mountain and is dancing',
 '10070: A woman is dancing',
 '8833: A woman is staging a dance']

In [54]:
index.make_direct_map()

### Explanation:
- **make_direct_map()**: Builds a map from each vector ID to its position inside the IVF lists.
- With the map in place, `index.reconstruct(i)` can return the stored vector for ID `i`; the map takes extra memory.


In [56]:
# Another quick query example
another_query = "A man is cooking a large meal in the kitchen."
xq2 = model.encode([another_query])

k = 5
# `index` is now the IVFFlat index (with nprobe = 10)
D2, I2 = index.search(xq2, k)

print("\nQuery:", another_query)
print(f"Top {k} nearest results (IVF):")
for idx in I2[0]:
    print(f"{idx}: {sentences[idx]}")


Query: A man is cooking a large meal in the kitchen.
Top 5 nearest results (IVF):
13564: A chef is preparing a meal
6822: A person is cooking some food
8255: A chef is preparing some food
6380: A man is preparing some dish
10808: Some food is being prepared by a chef


**We've seen how to:**
1. Download textual datasets from multiple STS sources and unify them.
2. Generate embeddings with a Sentence Transformer model.
3. Create a Faiss index (both FlatL2 and IVF) for efficient retrieval.
4. Query the index with new text and retrieve the most similar sentences.

**Next Steps / Homework:**
- Explore other Faiss indexes (e.g., IVF-PQ, HNSW).
- Analyze search speed vs. memory usage trade-offs.
- Evaluate retrieval quality by comparing the actual text results.

This completes our annotated end-to-end pipeline!


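**Technique > Time > Subjective performance** (from the timed runs above):
- IndexFlatL2 > 5.69 ms wall time > exhaustive search, used as the reference ranking
- IndexIVFFlat, `nprobe = 1` > 549 µs wall time > 8 of the 10 reference results
- IndexIVFFlat, `nprobe = 10` > 1.03 ms wall time > 9 of the 10 reference results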