# Step 2: Sentence Embedding Generation

In this notebook, we generate a vector representation (embedding) for each sentence using a pre-trained multilingual model. We use the `sentence-transformers` library, which wraps model loading and encoding in a simple interface.

## 2.1 Load Data
We reload the dataset to ensure we have the text to embed.

In [1]:
import pandas as pd
import numpy as np
import os

# 1. Load the CSV created in the previous step
input_file = "../data/indic_corp_v2_2000.csv"
if not os.path.exists(input_file):
    # Stop here: later cells need `df`, so continuing would only fail with a NameError
    raise FileNotFoundError(f"{input_file} not found. Please run the Step 1 notebook first.")

df = pd.read_csv(input_file)
print(f"Data loaded! Found {len(df)} sentences.")

Data loaded! Found 16000 sentences.


## 2.2 Initialize Model
We use **LaBSE** (Language-agnostic BERT Sentence Embedding), a multilingual model that covers more than 100 languages, including the Indian languages in our dataset.
It maps sentences with similar meanings to nearby vectors, regardless of the language they are written in.
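
To make the idea of "similar meanings, similar vectors" concrete, the sketch below computes cosine similarity, a common way to compare sentence embeddings. The three short vectors are made-up stand-ins rather than real LaBSE output, and `cosine_similarity` is a helper defined here only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (assumed values; real LaBSE embeddings have 768 dimensions)
sentence_en = [0.9, 0.1, 0.0]   # stands in for an English sentence
sentence_hi = [0.8, 0.2, 0.1]   # stands in for its Hindi translation
unrelated = [0.0, 0.1, 0.9]     # stands in for an unrelated sentence

sim_translation = cosine_similarity(sentence_en, sentence_hi)
sim_unrelated = cosine_similarity(sentence_en, unrelated)
print(f"translation pair: {sim_translation:.3f}")
print(f"unrelated pair:   {sim_unrelated:.3f}")
```

With real embeddings, a translation pair would be expected to score noticeably higher than an unrelated pair, which is the property the later analysis relies on.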

In [2]:
from sentence_transformers import SentenceTransformer
import torch

# 2. Load the LaBSE model
# This might take a minute to download the first time
model_name = "sentence-transformers/LaBSE"
print("Loading LaBSE model...")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SentenceTransformer(model_name, device=device)
print("Model loaded successfully!")

Loading LaBSE model...
Using device: cuda


Model loaded successfully!


## 2.3 Generate Embeddings
We encode the sentences with `model.encode`, which splits the input into batches of `batch_size` sentences and returns one embedding per sentence.

In [4]:
import time

# 3. Convert text to embeddings
sentences = df['text'].tolist()

start = time.time()

print("Converting sentences to embeddings... (This may take some time)")
# batch_size=64 encodes 64 sentences per forward pass, which is faster than encoding them one at a time
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

end = time.time()

print(f"Embedding time: {(end - start)/60:.2f} minutes")
print(f"Done! Created embeddings with shape: {embeddings.shape}")

Converting sentences to embeddings... (This may take some time)



Embedding time: 1.54 minutes
Done! Created embeddings with shape: (16000, 768)
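
The progress bar counted 250 batches because 16000 sentences split into groups of 64 gives exactly 250 groups. The sketch below illustrates that splitting with a hypothetical `iter_batches` helper and placeholder strings. It is a simplified picture of batching, not the library's actual implementation.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder strings standing in for the 16000 sentences loaded above
toy_sentences = [f"sentence {i}" for i in range(16000)]

batches = list(iter_batches(toy_sentences, batch_size=64))
num_batches = len(batches)
print(f"{num_batches} batches, last batch has {len(batches[-1])} sentences")
```

If the sentence count were not a multiple of the batch size, the final batch would simply be shorter than the others.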


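## 2.4 Add Language Family Information
Each language in the dataset belongs to one of two families, Indo-Aryan or Dravidian. We record the family in a new `Family` column for use in later analysis: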

In [5]:
family_map = {
    'Hindi': 'Indo-Aryan', 
    'Bengali': 'Indo-Aryan', 
    'Marathi': 'Indo-Aryan',
    'Gujarati': 'Indo-Aryan',
    'Tamil': 'Dravidian', 
    'Telugu': 'Dravidian', 
    'Kannada': 'Dravidian', 
    'Malayalam': 'Dravidian'
}

df['Family'] = df['language'].map(family_map)
df.head()

Unnamed: 0,language,iso_code,text,Family
0,Hindi,hin_Deva,लोगों को बिलों संबंधी सुविधा देना ही उनका काम,Indo-Aryan
1,Hindi,hin_Deva,इनेलो 1987 में उस वक्त ऐसे ही दोराहे पर खड़ी थ...,Indo-Aryan
2,Hindi,hin_Deva,जहां आई थी तबाही उस घाटी क्षेत्र में खतरा ज्यादा,Indo-Aryan
3,Hindi,hin_Deva,इसके बाद केंद्र की ओर से प्रदेश सरकार को पीएमज...,Indo-Aryan
4,Hindi,hin_Deva,यह पूछने पर कि इस बड़े मैच से पहले उनकी नींद ग...,Indo-Aryan
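
One thing worth checking after a `.map()` call is whether any language was missing from the dictionary, since pandas fills those rows with `NaN` instead of raising an error. The sketch below shows such a check on a toy frame. The names `family_map_demo` and `demo`, and the unmapped "Odia" row, are assumptions made for illustration and are not part of the dataset above.

```python
import pandas as pd

# Deliberately incomplete mapping to demonstrate the failure mode
family_map_demo = {"Hindi": "Indo-Aryan", "Tamil": "Dravidian"}

# Toy frame; "Odia" is absent from the mapping on purpose
demo = pd.DataFrame({"language": ["Hindi", "Tamil", "Odia"]})
demo["Family"] = demo["language"].map(family_map_demo)

# Languages with no family assigned show up as NaN after .map()
unmapped = demo.loc[demo["Family"].isna(), "language"].unique().tolist()
print("Unmapped languages:", unmapped)
```

Running the same `isna()` check on `df['Family']` would confirm that every language in the real dataset received a family label.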


## 2.5 Save Embeddings
We save the embeddings and the corresponding metadata (language, text, and family) so that the next step can load them instead of recomputing them.

In [6]:
# Save the embeddings and the updated CSV
np.save("../data/embeddings.npy", embeddings)
df.to_csv("../data/metadata.csv", index=False)

print("Files saved:")
print("- ../data/embeddings.npy")
print("- ../data/metadata.csv")

Files saved:
- ../data/embeddings.npy
- ../data/metadata.csv
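
Because the embeddings and the metadata live in separate files, row `i` of one must describe row `i` of the other. The sketch below shows a round-trip check of that alignment using small toy data written to a temporary directory. The variable names and the toy values are assumptions for illustration; the next step could apply the same check to the real files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the real embeddings and metadata
toy_embeddings = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)
toy_meta = pd.DataFrame({"language": ["Hindi", "Hindi", "Tamil", "Tamil"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, "embeddings.npy")
    meta_path = os.path.join(tmp_dir, "metadata.csv")

    # Save both files, then read them back as the next step would
    np.save(emb_path, toy_embeddings)
    toy_meta.to_csv(meta_path, index=False)
    loaded_embeddings = np.load(emb_path)
    loaded_meta = pd.read_csv(meta_path)

# The two files are only usable together if their row counts agree
rows_match = loaded_embeddings.shape[0] == len(loaded_meta)
print("Rows aligned:", rows_match)
```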
