 ## STEP 1: Installing Required Libraries

The cell below installs the libraries this notebook depends on:

1. datasets: To load the Tatoeba dataset.

2. transformers: For translation models.

3. sentence-transformers: To generate sentence embeddings.

4. faiss-cpu: For fast nearest neighbor search.

5. chromadb: For vector database storage (not used in this script).

In [None]:
!pip install datasets transformers sentence-transformers faiss-cpu chromadb
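
Once installation has finished, a quick check like the one below can confirm which versions ended up in the environment. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

# Package names exactly as passed to pip in the install cell.
REQUIRED = ["datasets", "transformers", "sentence-transformers", "faiss-cpu", "chromadb"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing would need the install cell to be re-run before continuing.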



## STEP 2: Loading the English-Hindi Dataset

What this does:

Load & Clean Hindi Dataset
- The `datasets` library streams the `eng-hin` test split of the Tatoeba dataset (`Helsinki-NLP/tatoeba_mt`).
- Each Hindi sentence is cleaned by removing every character outside the Devanagari block and collapsing extra whitespace.
- The first 5,000 cleaned sentences are stored in `hindi_colloquial_data.csv`.


In [None]:
from datasets import load_dataset
import pandas as pd
import re

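# Load dataset with streaming so the full corpus is never held in memory.
# Note: this repository ships a custom loading script, so load_dataset asks for
# confirmation unless trust_remote_code=True is passed.
dataset = load_dataset("Helsinki-NLP/tatoeba_mt", "eng-hin", split="test", streaming=True)

# Function to clean Hindi text
def clean_text(text):
    text = re.sub(r'[^\u0900-\u097F\s]', '', text)  # Keep only Devanagari characters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace and trim the ends
    return text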

# Process and store in CSV
cleaned_sentences = []
for row in dataset.take(5000):  # Only process first 5000 sentences
    cleaned_sentences.append(clean_text(row["targetString"]))

# Save as CSV
df = pd.DataFrame({"Hindi Sentences": cleaned_sentences})
df.to_csv("hindi_colloquial_data.csv", index=False)

print("✅ Hindi dataset saved successfully!")


The repository for Helsinki-NLP/tatoeba_mt contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/Helsinki-NLP/tatoeba_mt.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] Y
✅ Hindi dataset saved successfully!
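
To see what the cleaning step keeps and discards, the sketch below applies the same two substitutions to a few made-up strings. The sample inputs are illustrative and not taken from the dataset.

```python
import re

def clean_text(text):
    # Same two substitutions as in the cell above.
    text = re.sub(r'[^\u0900-\u097F\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

samples = [
    "Hello, यह 123 जगह!",  # mixed Latin letters, ASCII digits, and Devanagari
    "क्या हुआ?",            # the ASCII question mark is removed
    "OK",                   # no Devanagari at all, so nothing survives
]
cleaned = [clean_text(s) for s in samples]
non_empty = [s for s in cleaned if s]  # optional extra step: drop empty results

print(cleaned)
print(non_empty)
```

Two consequences are worth noting: ASCII punctuation such as `?` and `!` is stripped along with Latin text, and a sentence with no Devanagari characters becomes an empty string. The notebook writes such rows to the CSV unchanged, so filtering them out first, as in `non_empty`, may be worth considering.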


What this does:

View the Cleaned Dataset
- The Hindi dataset is loaded back from the saved CSV file.
- The first few rows are displayed as a sanity check.

In [None]:
df = pd.read_csv("hindi_colloquial_data.csv")
print(df.head())  # Display first few rows


                      Hindi Sentences
0           पौधे बारिश के बिना मर गए।
1  मेरे रेनकोट से एक बटन निकल आया है।
2        एक बिल्ली चूहे के पीछे भागी।
3             घड़ी के दो हाथ होते हैं
4         देश एक खतरनाक मशीन होती है।
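
Beyond looking at the first rows, it can help to count empty and duplicated sentences before embedding them. The following is a rough sketch using only the standard library; `summarize_sentences` is a hypothetical helper, and the in-memory sample stands in for the real CSV file.

```python
import csv
import io
from collections import Counter

def summarize_sentences(csv_file, column="Hindi Sentences"):
    """Count total, empty, and duplicated sentences in one CSV column."""
    rows = [row[column].strip() for row in csv.DictReader(csv_file)]
    counts = Counter(r for r in rows if r)
    return {
        "total": len(rows),
        "empty": sum(1 for r in rows if not r),
        "duplicates": sum(c - 1 for c in counts.values() if c > 1),
    }

# In-memory stand-in for hindi_colloquial_data.csv (rows chosen for illustration).
sample = io.StringIO(
    'Hindi Sentences\n'
    'पौधे बारिश के बिना मर गए।\n'
    '""\n'
    'एक बिल्ली चूहे के पीछे भागी।\n'
    'पौधे बारिश के बिना मर गए।\n'
)
print(summarize_sentences(sample))
```

For the real file, the same function could be called on `open("hindi_colloquial_data.csv", encoding="utf-8", newline="")`.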


## STEP 3: Generating Sentence Embeddings

What this does:

Generate Sentence Embeddings
- Sentence embeddings map each sentence to a fixed-length vector, so similar sentences can be found by comparing vectors.
- A multilingual Sentence Transformer model (`paraphrase-multilingual-MiniLM-L12-v2`) encodes the Hindi sentences in small batches.
- The embeddings are saved to the NumPy file `hindi_embeddings.npy`.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd

# Load a multilingual Sentence Transformer model (general-purpose, not Hindi-specific)
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Load cleaned dataset
df = pd.read_csv("hindi_colloquial_data.csv")

# Generate sentence embeddings in small batches
batch_size = 50  # Prevents RAM crash
all_embeddings = []

for i in range(0, len(df), batch_size):
    batch = df["Hindi Sentences"].iloc[i : i + batch_size].tolist()
    batch_embeddings = model.encode(batch, normalize_embeddings=True)  # Unit-length vectors, so L2 ranking matches cosine ranking
    all_embeddings.append(batch_embeddings)

# Convert list to numpy array
embeddings = np.vstack(all_embeddings)
np.save("hindi_embeddings.npy", embeddings)  # Save embeddings

print("✅ Hindi sentence embeddings generated!")


✅ Hindi sentence embeddings generated!
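
The batch-and-stack pattern used above can be tried without downloading a model. In this sketch, `fake_encode` is a hypothetical stand-in for `model.encode` that returns random unit-length vectors; only the batching logic mirrors the notebook.

```python
import numpy as np

def encode_in_batches(sentences, encode_fn, batch_size=50):
    """Encode sentences batch by batch and stack the results into one (n, d) array."""
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start : start + batch_size]
        chunks.append(encode_fn(batch))
    return np.vstack(chunks)

def fake_encode(batch, dim=8):
    """Hypothetical stand-in for model.encode: random unit-length float32 vectors."""
    rng = np.random.default_rng(len(batch))
    vectors = rng.normal(size=(len(batch), dim)).astype(np.float32)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

sentences = [f"sentence {i}" for i in range(120)]
embeddings = encode_in_batches(sentences, fake_encode, batch_size=50)
print(embeddings.shape, embeddings.dtype)
```

The stacked array has one row per sentence even when the last batch is shorter than `batch_size`. `SentenceTransformer.encode` also accepts its own `batch_size` argument, so the manual loop is one option rather than a requirement.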


## STEP 4: Creating a FAISS Index for Fast Searches

What this does:

1. Loads the Hindi sentence embeddings.

2. Uses FAISS (Facebook AI Similarity Search), a library for fast nearest-neighbor search over embeddings, to build a flat (exact) L2 search index.

3. Saves the FAISS index to `hindi_faiss.index` for later retrieval.

In [None]:
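import faiss
import numpy as np

# Load embeddings
embeddings = np.load("hindi_embeddings.npy")

# Create an exact (brute-force) L2 index. The embeddings were unit-normalized
# in the previous step, so L2 ranking matches cosine-similarity ranking.
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)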

# Add embeddings to FAISS index in small batches
batch_size = 500
for i in range(0, len(embeddings), batch_size):
    index.add(embeddings[i : i + batch_size])

# Save FAISS index
faiss.write_index(index, "hindi_faiss.index")

print("✅ FAISS index created and saved!")


✅ FAISS index created and saved!
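
For intuition about what a flat L2 index computes, the sketch below performs the same kind of exact search in plain NumPy on random unit vectors and compares the result with a cosine-similarity ranking. It is an illustration of the idea under those assumptions, not FAISS's implementation; `flat_l2_search` and the random data are hypothetical.

```python
import numpy as np

def flat_l2_search(database, queries, k):
    """Exact nearest-neighbour search by squared L2 distance (brute force)."""
    # Pairwise squared distances, shape (num_queries, num_database).
    diffs = queries[:, None, :] - database[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    indices = np.argsort(sq_dists, axis=1)[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 16)).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)  # unit-normalize rows

query = database[:1] + 0.01  # slightly perturbed copy of row 0
query /= np.linalg.norm(query, axis=1, keepdims=True)

distances, indices = flat_l2_search(database, query, k=3)
cosine_ranking = np.argsort(-(query @ database.T), axis=1)[:, :3]
print(indices, cosine_ranking)
```

For unit vectors, the squared L2 distance equals `2 - 2 * cosine`, which is why both rankings agree and why normalizing the embeddings before indexing makes an L2 index behave like a cosine-similarity search.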


## STEP 6: Searching for Similar Sentences

What this does:

Search for Similar Hindi Sentences


- Takes a Hindi query sentence and encodes it with the same model used for the dataset.

- Searches the FAISS index for the `k` nearest Hindi sentences (3 by default).


In [None]:
def search_hindi_sentence(query, model, index, df, k=3):
    query_embedding = model.encode([query], normalize_embeddings=True)  # Normalize query embedding
    _, indices = index.search(query_embedding, k)  # FAISS search
    results = [df["Hindi Sentences"].iloc[i] for i in indices[0]]
    return results

# Load FAISS index
index = faiss.read_index("hindi_faiss.index")

# User query
query = "बारिश में क्या होता है?"  # Example question
results = search_hindi_sentence(query, model, index, df)

print("🔍 Relevant Hindi Sentences:", results)


🔍 Relevant Hindi Sentences: ['जब बारिश होती है तो ज़बरदस्त होती है।', 'मुसलाधार बारिश होती है।', 'मुसलाधार वर्षा होती है।']
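
The search function above discards the distances returned by `index.search`. Since `IndexFlatL2` reports squared L2 distances and the embeddings are unit-normalized, those distances can be turned into cosine similarities and used to drop weak matches. The helper names, the 0.5 threshold, and the sample values below are all assumptions chosen for illustration.

```python
import numpy as np

def l2_to_cosine(squared_l2):
    """Convert squared L2 distances between unit vectors to cosine similarities."""
    return 1.0 - np.asarray(squared_l2) / 2.0

def filter_by_similarity(sentences, squared_l2, min_cosine=0.5):
    """Keep (sentence, score) pairs whose cosine similarity reaches the threshold."""
    scores = l2_to_cosine(squared_l2)
    return [(s, float(c)) for s, c in zip(sentences, scores) if c >= min_cosine]

# Illustrative values only, not output from the notebook's index.
candidates = ["वाक्य 1", "वाक्य 2", "वाक्य 3"]
squared_distances = [0.2, 0.8, 1.6]
print(filter_by_similarity(candidates, squared_distances))
```

In the notebook this would mean keeping the first value returned by `index.search(query_embedding, k)` instead of assigning it to `_`.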


## STEP 7: English ↔ Hindi Translation (Combined)

What this does:

- Loads two separate `transformers` models: one for English-to-Hindi and one for Hindi-to-English translation.
- Defines a function for each translation direction.


In [None]:
from transformers import pipeline, MarianMTModel, MarianTokenizer

# Load English-to-Hindi translation model
eng_to_hi_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

# Load Hindi-to-English translation model
hi_to_eng_model_name = "Helsinki-NLP/opus-mt-hi-en"
hi_to_eng_tokenizer = MarianTokenizer.from_pretrained(hi_to_eng_model_name)
hi_to_eng_model = MarianMTModel.from_pretrained(hi_to_eng_model_name)

# Function to translate English to Hindi
def translate_en_to_hi(text):
    return eng_to_hi_translator(text)[0]['translation_text']

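# Function to translate Hindi to English
# Note: returns a list of strings (one per input), unlike translate_en_to_hi, which returns a single string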
def translate_hi_to_en(text):
    inputs = hi_to_eng_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = hi_to_eng_model.generate(**inputs)
    return hi_to_eng_tokenizer.batch_decode(translated, skip_special_tokens=True)

# Example translations
english_text = "I love reading books."
hindi_translation = translate_en_to_hi(english_text)
print("📝 English to Hindi:", hindi_translation)

hindi_text = "यह जगह बहुत खूबसूरत है।"
english_translation = translate_hi_to_en(hindi_text)
print("📝 Hindi to English:", english_translation)


Device set to use cuda:0


📝 English to Hindi: मैं किताबों को पढ़ने के लिए प्यार करता हूँ.
📝 Hindi to English: ['This place is very beautiful.']
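
If a single entry point is preferred over two direction-specific functions, one possible approach is to choose the direction from the script of the input text. The sketch below uses stub translators so it runs without any model; `translate_auto` and the stubs are hypothetical names, and detecting Hindi by the presence of Devanagari characters is a simplifying assumption.

```python
import re

DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def translate_auto(text, en_to_hi, hi_to_en):
    """Route text to hi->en if it contains Devanagari characters, else en->hi."""
    if DEVANAGARI.search(text):
        return "hi->en", hi_to_en(text)
    return "en->hi", en_to_hi(text)

# Stub translators so the sketch runs without downloading any model.
def stub_en_to_hi(text):
    return f"<hi:{text}>"

def stub_hi_to_en(text):
    return f"<en:{text}>"

print(translate_auto("I love reading books.", stub_en_to_hi, stub_hi_to_en))
print(translate_auto("यह जगह बहुत खूबसूरत है।", stub_en_to_hi, stub_hi_to_en))
```

In the notebook, `translate_en_to_hi` and `translate_hi_to_en` would be passed in place of the stubs.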


In [None]:
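# MarianTokenizer depends on sentencepiece; run this cell first if the translation cell fails to load its tokenizer
!pip install transformers sentencepiece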




In [None]:
from google.colab import files
files.download('hindi_colloquial_data.csv')



In [None]:
from google.colab import files
files.download("hindi_faiss.index")

