## BASHINI LLM FOR LANGUAGE TRANSLATION (ENGLISH <-> HINDI)

## Hindi Text Scraping from Websites

- Objective:
This script scrapes Hindi text from several websites and saves it as a CSV dataset for later use in translation models.

## Steps in the Code:
- Define Target Websites: The script scrapes text from BBC Hindi, Hindi Wikipedia, and TypingBaba.

- Scrape and Extract Hindi Text:

  - Uses `requests` to fetch web pages.

  - Uses `BeautifulSoup` to extract text from `<p>` tags.

  - Filters and cleans the text, keeping only characters in the Devanagari Unicode block (U+0900 to U+097F) and whitespace.

- Store Data in CSV: The collected text is saved to `scraped_hindi_data.csv`.

- Output:
The script creates a dataset of Hindi text, one cleaned paragraph per row, which can be used for training translation models.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# List of target websites (you can add more)
SITES = [
    "https://www.bbc.com/hindi",  # BBC Hindi News
    "https://hi.wikipedia.org/wiki/भारत",  # Wikipedia page on India (replace with other pages)
    "https://www.typingbaba.com/keyboard/online-hindi-keyboard.php",  # Example of Hindi typing practice sentences
]

# Function to clean Hindi text
def clean_text(text):
    text = re.sub(r'[^\u0900-\u097F\s]', '', text)  # Keep only Devanagari-block characters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# List to store sentences
hindi_sentences = []

# Scraping function
def scrape_hindi_text(url):
    try:
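        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        response.raise_for_status()  # Treat HTTP errors (4xx/5xx) as failures instead of parsing error pages
        soup = BeautifulSoup(response.text, "html.parser")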

        # Extract text from <p> tags (common for news/articles)
        paragraphs = soup.find_all("p")
        for para in paragraphs:
            cleaned = clean_text(para.get_text())
            if len(cleaned) > 10:  # Avoid very short texts
                hindi_sentences.append(cleaned)

    except Exception as e:
        print(f"Error scraping {url}: {e}")

# Loop through each site and scrape data
for site in SITES:
    scrape_hindi_text(site)

# Save to CSV
df = pd.DataFrame({"Hindi Sentences": hindi_sentences})
df.to_csv("scraped_hindi_data.csv", index=False)

print("Scraped Hindi dataset saved as `scraped_hindi_data.csv`!")


Scraped Hindi dataset saved as `scraped_hindi_data.csv`!
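
To make the cleaning step concrete, the sketch below applies the same two regular expressions as `clean_text` to a made-up mixed-script string and then splits the result on the danda (`।`). The sample text and the `split_on_danda` helper are illustrative assumptions and are not part of the notebook; the scraper itself stores whole paragraphs, so a split like this would only matter if one sentence per row is wanted.

```python
import re

def clean_text(text):
    # Same filter as the scraper: keep Devanagari-block characters and whitespace
    text = re.sub(r'[^\u0900-\u097F\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def split_on_danda(paragraph):
    # Hypothetical helper: split on the danda (U+0964), which the filter keeps
    return [part.strip() for part in paragraph.split("\u0964") if part.strip()]

# Made-up sample mixing Devanagari and Latin script
sample = "भारत (India) एक देश है। इसकी राजधानी New Delhi है।"
cleaned = clean_text(sample)
print(cleaned)
print(split_on_danda(cleaned))
```

Note that the filter silently drops Latin-script names and ASCII digits, so the second sentence loses "New Delhi" and no longer says what the original did. That may be acceptable for a typing corpus but is worth checking before using the rows as translation training data.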


In [2]:
!pip install datasets



In [3]:
!pip install faiss-cpu sentence-transformers datasets transformers torch



# **FAISS-Based Hindi-English Retrieval System**

### **Overview**
This script prepares the retrieval step of a **Retrieval-Augmented Generation (RAG)** style pipeline: it embeds the Hindi and English sentences and indexes the embeddings with **FAISS** for nearest-neighbor search.

### **Key Steps:**
1. **Load Scraped Hindi Data**  
   - Reads the dataset containing Hindi text.

2. **Generate Synthetic English Translations** *(Placeholder)*  
   - Fills the English column with the same placeholder string for every row so the indexing code can run; replace it with real translations before using the retrieval results.

3. **Convert Text to Vector Embeddings**  
   - Uses **Sentence Transformers** (`paraphrase-multilingual-MiniLM-L12-v2`) to encode Hindi and English sentences into numerical vector representations.

4. **Initialize FAISS Index**  
   - **Hindi → English Index**: Stores Hindi sentence embeddings.  
   - **English → Hindi Index**: Stores English sentence embeddings.

5. **Save FAISS Index and Sentence Mappings**  
   - Saves the FAISS index for future fast lookup.  
   - Stores Hindi-English sentence mappings in a CSV file.

### **Outcome:**
- Sentence pairs can be **looked up by nearest-neighbor search** over the saved FAISS indexes. The retrieved translations are only meaningful once the placeholder English column is replaced with real translations.
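
`IndexFlatL2`, the index type used in the cell below, is FAISS's exhaustive index: a query is compared against every stored vector and ranked by squared L2 distance. The following plain-Python sketch shows that computation on toy 3-dimensional vectors. The vectors and the `flat_l2_search` name are made up for illustration; real embeddings from the sentence transformer have many more dimensions, and FAISS does the same work far more efficiently.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(stored, query, k=1):
    # Compare the query against every stored vector and keep the k closest
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 3-dimensional "embeddings" (made up for illustration)
stored = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
distances, indices = flat_l2_search(stored, [1.0, 0.0, 0.0], k=2)
print(indices)    # row positions of the two nearest stored vectors
print(distances)  # their squared L2 distances, smallest first
```

The returned positions are row numbers, which is why the notebook saves the sentence mappings in the same row order as the vectors added to the index.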


In [4]:
import pandas as pd
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load Scraped Data
df = pd.read_csv("scraped_hindi_data.csv")

# Create Synthetic English Translations (Replace with actual)
df["English Sentences"] = ["This is a sample translation" for _ in range(len(df))]

# Initialize Sentence Transformer Model (Multilingual)
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Convert Sentences to Embeddings
hindi_embeddings = model.encode(df["Hindi Sentences"].tolist(), convert_to_numpy=True)
english_embeddings = model.encode(df["English Sentences"].tolist(), convert_to_numpy=True)

# Store in FAISS Index
dimension = hindi_embeddings.shape[1]
index_hi_en = faiss.IndexFlatL2(dimension)  # Hindi → English
index_hi_en.add(hindi_embeddings)

index_en_hi = faiss.IndexFlatL2(dimension)  # English → Hindi
index_en_hi.add(english_embeddings)

# Save FAISS Index
faiss.write_index(index_hi_en, "hindi_to_english_faiss.index")
faiss.write_index(index_en_hi, "english_to_hindi_faiss.index")

# Save Hindi-English Mappings
df.to_csv("hindi_english_mappings.csv", index=False)

print("FAISS Index & Mappings Saved!")


FAISS Index & Mappings Saved!


# **Loading Pretrained Translation Models**

### **Overview**
This code loads **Helsinki-NLP's Opus-MT** models for **English ⇄ Hindi** translation using the **MarianMTModel** from Hugging Face.

### **Key Steps:**
1. **Load English → Hindi Model**  
   - Uses `Helsinki-NLP/opus-mt-en-hi` to translate English text into Hindi.

2. **Load Hindi → English Model**  
   - Uses `Helsinki-NLP/opus-mt-hi-en` to translate Hindi text into English.

3. **Initialize Tokenizers**  
   - Tokenizers preprocess input text for model inference.

### **Outcome:**
- Successfully loads both models and tokenizers, enabling **bidirectional translation** between English and Hindi.


In [5]:
from transformers import MarianMTModel, MarianTokenizer

# Load English → Hindi Model
model_en_hi = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
tokenizer_en_hi = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

# Load Hindi → English Model
model_hi_en = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-hi-en")
tokenizer_hi_en = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-hi-en")

print("Both Translation Models Loaded!")


Both Translation Models Loaded!


# **Hindi → English Translation with FAISS & Helsinki-NLP**

### **Overview**
This code runs **FAISS-based retrieval** and **Helsinki-NLP's Opus-MT model** side by side on the same Hindi input, so a stored translation can be compared with a freshly generated one.

### **Key Steps:**
1. **Load FAISS Index & Data**
   - Reads the FAISS index of precomputed **Hindi sentence embeddings**.
   - Loads **Hindi-English sentence mappings** from a CSV file.

2. **Retrieve Closest Translation using FAISS**
   - Converts input Hindi text to **vector embeddings** using `SentenceTransformer`.
   - Searches FAISS for the **closest matching sentence** in the dataset.
   - Retrieves the corresponding English translation.

3. **Generate a Translation using Helsinki-NLP**
   - Passes the Hindi input text to the **Helsinki-NLP Opus-MT model**; the retrieved sentence is not fed to the model.
   - Generates an English translation using the transformer model.

4. **Compare Translations**
   - Prints the **retrieved translation** from FAISS.
   - Prints the **generated translation** from Helsinki-NLP.

### **Outcome:**
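- The **retrieved translation** is a lookup of the closest stored sentence pair; with the placeholder English column it is always the sample string.
- The **generated translation** comes from the pretrained Opus-MT model, independently of the retrieval step, and also covers inputs that are not in the dataset.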


In [9]:
import faiss
import torch

# Load FAISS Index & Data
df = pd.read_csv("hindi_english_mappings.csv")
index_hi_en = faiss.read_index("hindi_to_english_faiss.index")

# Function to Retrieve & Generate Hindi → English Translation
def translate_hindi_to_english(hindi_text):
    query_embedding = model.encode([hindi_text], convert_to_numpy=True)

    # Retrieve nearest translation
    _, nearest_indices = index_hi_en.search(query_embedding, k=1)
    retrieved_hindi = df.iloc[nearest_indices[0][0]]["Hindi Sentences"]
    retrieved_english = df.iloc[nearest_indices[0][0]]["English Sentences"]

    # Generate improved translation using Helsinki-NLP
    inputs = tokenizer_hi_en(hindi_text, return_tensors="pt", padding=True, truncation=True)
    output_ids = model_hi_en.generate(**inputs)
    generated_translation = tokenizer_hi_en.batch_decode(output_ids, skip_special_tokens=True)[0]

    return retrieved_english, generated_translation

# Test Translation
hindi_text = " इनपुट उपकरण को ऑनलाइन आज़माएं."
retrieved, generated = translate_hindi_to_english(hindi_text)

print(f"Retrieved Translation: {retrieved}")
print(f"Helsinki-NLP Generated Translation: {generated}")


Retrieved Translation: This is a sample translation
Helsinki-NLP Generated Translation: Test the input device online.
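
The retrieved result above is the placeholder string because the English column was never filled with real translations. One possible way to fill it, sketched here under assumptions, is to translate the Hindi rows in batches with the already loaded Opus-MT model. The `translate_batch` helper and its `batch_size` default are hypothetical; it only uses the tokenizer and `generate` calls that already appear in the notebook.

```python
def translate_batch(texts, tokenizer, mt_model, batch_size=8):
    # Translate a list of sentences in fixed-size batches
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch and truncate over-long inputs
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        output_ids = mt_model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return translations

# In the notebook, assuming the earlier cells have run:
# english = translate_batch(df["Hindi Sentences"].tolist(), tokenizer_hi_en, model_hi_en)
# df["English Sentences"] = english
```

Machine output is not a reference translation, so pairs produced this way are a stand-in rather than ground truth. The English embeddings and the English index would also need to be rebuilt after the column changes.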


# **English → Hindi Translation with FAISS & Helsinki-NLP**

### **Overview**
This code runs **FAISS-based retrieval** and **Helsinki-NLP's Opus-MT model** side by side on the same English input, so a stored Hindi sentence can be compared with a freshly generated translation.

### **Key Steps:**
1. **Load FAISS Index**
   - Reads the **English-to-Hindi FAISS index** containing precomputed sentence embeddings.

2. **Retrieve Closest Translation using FAISS**
   - Converts input English text into **vector embeddings** using `SentenceTransformer`.
   - Searches FAISS for the **nearest sentence match** in the dataset.
   - Retrieves the corresponding **Hindi translation**.

3. **Generate a Translation using Helsinki-NLP**
   - Uses the **Helsinki-NLP Opus-MT model** for neural machine translation of the English input.
   - The retrieved sentence is not fed to the model, so the two results are independent.

4. **Compare Translations**
   - Prints the **retrieved Hindi translation** from FAISS.
   - Prints the **generated Hindi translation** from Helsinki-NLP.

### **Outcome:**
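- The **retrieved translation** is the Hindi side of the closest stored English sentence. Because every stored English sentence is currently the same placeholder, all rows are equally distant from any query and the Hindi sentence returned is unrelated to the input.
- The **generated translation** comes from the pretrained Opus-MT model and does not depend on the stored data.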


In [10]:
# Load FAISS Index
index_en_hi = faiss.read_index("english_to_hindi_faiss.index")

# Function to Retrieve & Generate English → Hindi Translation
def translate_english_to_hindi(english_text):
    query_embedding = model.encode([english_text], convert_to_numpy=True)

    # Retrieve nearest translation
    _, nearest_indices = index_en_hi.search(query_embedding, k=1)
    retrieved_english = df.iloc[nearest_indices[0][0]]["English Sentences"]
    retrieved_hindi = df.iloc[nearest_indices[0][0]]["Hindi Sentences"]

    # Generate improved translation using Helsinki-NLP
    inputs = tokenizer_en_hi(english_text, return_tensors="pt", padding=True, truncation=True)
    output_ids = model_en_hi.generate(**inputs)
    generated_translation = tokenizer_en_hi.batch_decode(output_ids, skip_special_tokens=True)[0]

    return retrieved_hindi, generated_translation

# Test Translation
english_text = "this is the translated text"
retrieved, generated = translate_english_to_hindi(english_text)

print(f"Retrieved Translation: {retrieved}")
print(f"Helsinki-NLP Generated Translation: {generated}")


Retrieved Translation: पिछले महीने डोनाल्ड ट्रंप ने यूक्रेन के राष्ट्रपति ज़ेलेंस्की की तीखी आलोचना के बाद पुतिन की तारीफ़ की थी लेकिन अब ऐसा क्या हुए कि वो पुतिन से बेहद नाराज़ हो गए हैं पुतिन और ट्रंप के बीच के समीकरण कैसे बदल रहे हैं
Helsinki-NLP Generated Translation: यह अनुवादित पाठ है
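
The retrieved sentence above has nothing to do with the query, because a `k=1` search always returns some row, however far away it is. A small guard on the returned distance, sketched below, lets the caller fall back to the generated translation when nothing stored is close. The `pick_match` helper and the `0.5` threshold are assumptions for illustration; a usable threshold has to be tuned on real sentence pairs.

```python
def pick_match(distances, indices, max_distance):
    # distances and indices have shape (1, k), as returned by index.search
    best_distance = float(distances[0][0])
    best_index = int(indices[0][0])
    # FAISS uses -1 as the index when it has no result to return
    if best_index < 0 or best_distance > max_distance:
        return None
    return best_index

# Toy values standing in for the output of index_en_hi.search(query_embedding, k=1)
close = pick_match([[0.12]], [[4]], max_distance=0.5)   # within the threshold
far = pick_match([[31.7]], [[0]], max_distance=0.5)     # too far, rejected
print(close, far)
```

In the notebook this would mean keeping the distances that the search call currently discards into `_`, and using the retrieved row only when `pick_match` returns a row number.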
