<a href="https://colab.research.google.com/github/Ayush-mishra-0-0/ML/blob/main/ayush_12240340_assign2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<span script="color:cyan">Ayush Kumar Mishra</span>  
<span script="color:cyan">12240340</span>  
<span script="color:cyan">Assignment-2</span>


In [1]:
# `!export` only affects a throwaway subshell; `%env` sets the variable for the kernel itself
%env CUDA_LAUNCH_BLOCKING=1

# Cloning the AI4Bharat/IndicTrans2 Git repository
## And installing all the required libraries

In [2]:
%%capture
# Clone the required Git repository for IndicTrans2
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [3]:
%%capture
# Move into the Hugging Face interface directory of the cloned repository
%cd /content/IndicTrans2/huggingface_interface

In [4]:
%%capture
# Install the other dependencies needed to run the transformer model
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!python3 -m pip install --editable ./
%cd ..

# Importing all the necessary libraries

In [5]:
import sys
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from transformers.utils import is_flash_attn_2_available, is_flash_attn_greater_or_equal_2_10
from IndicTransTokenizer import IndicProcessor
from mosestokenizer import MosesSentenceSplitter
from nltk import sent_tokenize
from indicnlp.tokenize.sentence_tokenize import sentence_split, DELIM_PAT_NO_DANDA

## Although the assignment needs only one translation direction, I am setting up checkpoints for all three possible directions

1. ENGLISH TO INDIC
2. INDIC TO ENGLISH
3. INDIC TO INDIC

In [6]:
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # this is checkpoint for english to indic translations
indic_en_ckpt_dir = "ai4bharat/indictrans2-indic-en-1B"  #  this is checkpoint for indic to english translations
indic_indic_ckpt_dir = (
    "ai4bharat/indictrans2-indic-indic-dist-320M"  # this is checkpoint for indic to indic translations
)

## Tokenising the sentences
---
### Function: `split_sentences`

This function, `split_sentences`, splits text into sentences using a splitter chosen by language. Here's a breakdown of how it works:

#### Parameters:
- **`input_text`**: The text that needs to be split into sentences.
- **`lang`**: The FLORES language code of the text. English is represented by `"eng_Latn"`.

#### Workflow:
1. **English Language (`eng_Latn`) Handling:**
   - The function first tokenizes the input text using NLTK's `sent_tokenize` function and stores the sentences in `input_sentences`.
   - It then splits the same text with `MosesSentenceSplitter` (using the short language code looked up in `flores_codes`), storing the result in `sents_moses`.
   - Another NLTK sentence tokenization is performed on the input text, with results stored in `sents_nltk`.
   - A comparison is made between `sents_nltk` and `sents_moses`. If `sents_nltk` has fewer sentences, it is assigned to `input_sentences`. Otherwise, `sents_moses` is used.
   - Any soft hyphen characters (`"\xad"`) in the sentences are removed.

2. **Other Languages:**
   - For languages other than English, the `sentence_split` function is used to split the sentences. The short language code from `flores_codes` and a delimiter pattern (`DELIM_PAT_NO_DANDA`) are provided as arguments.

#### Return:
- The function returns the split sentences as a list, stored in the variable `input_sentences`.
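
The selection rule used in the English branch can be tried in isolation. The sketch below is a simplified stand-in: `split_on_period` and `split_on_punctuation` are hypothetical toy splitters (not the NLTK or Moses ones), used only to show how the result with fewer sentences is kept and how soft hyphens are stripped.

```python
import re


def split_on_period(text):
    # Hypothetical toy splitter: breaks after every period followed by whitespace
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def split_on_punctuation(text):
    # Hypothetical toy splitter: breaks after ".", "!" or "?" followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def pick_fewer_sentences(text):
    first = split_on_period(text)
    second = split_on_punctuation(text)
    # Keep the first result only if it has strictly fewer sentences
    chosen = first if len(first) < len(second) else second
    # Remove soft hyphens from every sentence
    return [s.replace("\xad", "") for s in chosen]


demo = pick_fewer_sentences("Is it work\xading? Yes. Good.")
print(demo)
```

Here the first splitter yields two sentences and the second yields three, so the two-sentence result is kept.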

#### Example:
```python
sentences = split_sentences("This is an example text.", "eng_Latn")
```


In [19]:
flores_codes = {"asm_Beng": "as", "awa_Deva": "hi", "ben_Beng": "bn", "bho_Deva": "hi", "brx_Deva": "hi", "doi_Deva": "hi", "eng_Latn": "en", "gom_Deva": "kK", "guj_Gujr": "gu", "hin_Deva": "hi", "hne_Deva": "hi", "kan_Knda": "kn", "kas_Arab": "ur", "kas_Deva": "hi", "kha_Latn": "en", "lus_Latn": "en", "mag_Deva": "hi", "mai_Deva": "hi", "mal_Mlym": "ml", "mar_Deva": "mr", "mni_Beng": "bn", "mni_Mtei": "hi", "npi_Deva": "ne", "ory_Orya": "or", "pan_Guru": "pa", "san_Deva": "hi", "sat_Olck": "or", "snd_Arab": "ur", "snd_Deva": "hi", "tam_Taml": "ta", "tel_Telu": "te", "urd_Arab": "ur"}


In [7]:

def split_sentences(input_text, lang):
    if lang == "eng_Latn":
        input_sentences = sent_tokenize(input_text)
        with MosesSentenceSplitter(flores_codes[lang]) as splitter:
            sents_moses = splitter([input_text])
        sents_nltk = sent_tokenize(input_text)
        if len(sents_nltk) < len(sents_moses):
            input_sentences = sents_nltk
        else:
            input_sentences = sents_moses
        input_sentences = [sent.replace("\xad", "") for sent in input_sentences]
    else:
        input_sentences = sentence_split(
            input_text, lang=flores_codes[lang], delim_pat=DELIM_PAT_NO_DANDA
        )
    return input_sentences


## Now that the sentences are ready, we set up the `IndicTrans2` model for inference
### Function: `initialize_model_and_tokenizer`

This function, `initialize_model_and_tokenizer`, initializes a model and tokenizer based on specific configurations. Here's a breakdown of how it works:

#### Parameters:
- **`ck`**: The directory path to the model checkpoint.
- **`qz`**: The quantization type, which can be either `"4-bit"` or `"8-bit"`.
- **`ai`**: The attention implementation method. The options include `"fa2"` for Flash Attention 2 or `"eg"` for eager attention.

#### Workflow:
1. **Quantization Configuration:**
   - If `qz` is `"4-bit"`, the `BitsAndBytesConfig` is initialized with 4-bit settings and stored in `qc`.
   - If `qz` is `"8-bit"`, `BitsAndBytesConfig` is initialized with 8-bit settings and stored in `qc`.
   - If neither, `qc` is set to `None`.

2. **Attention Implementation:**
   - If `ai` is `"fa2"` (Flash Attention 2), the function checks if Flash Attention 2 is available and its version is compatible.
   - If not, the function falls back to eager attention.

3. **Model and Tokenizer Initialization:**
   - The tokenizer is loaded using `AutoTokenizer` from the checkpoint specified by `ck`, with `trust_remote_code` enabled. The tokenizer is stored in `tk`.
   - The model is loaded using `AutoModelForSeq2SeqLM`, with the attention implementation `ai`, quantization configuration `qc`, and low CPU memory usage enabled. The model is stored in `m`.
   - If `qc` is `None`, the model is moved to the specified device (`DEVICE`), and half-precision floating point (`.half()`) is applied.

4. **Model Evaluation Mode:**
   - The model is set to evaluation mode using `m.eval()`.

#### Return:
- The function returns the initialized tokenizer (`tk`) and model (`m`).
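
The decisions described in the workflow can be sketched without loading any model. `plan_model_loading` below is a hypothetical helper, not part of the notebook or of `transformers`; it only returns the choices as a plain dictionary, with `flash_available` standing in for the availability check.

```python
def plan_model_loading(qz, ai, flash_available):
    """Hypothetical helper: report the loading decisions instead of loading a model."""
    # Quantization: only "4-bit" and "8-bit" are recognised, anything else disables it
    if qz == "4-bit":
        quant = {"load_in_4bit": True}
    elif qz == "8-bit":
        quant = {"load_in_8bit": True}
    else:
        quant = None

    # Flash Attention 2 is kept only when it is actually usable
    attention = ai
    if ai == "fa2" and not flash_available:
        attention = "eg"

    # Without quantization the model is moved to the device and cast to half precision
    cast_to_half = quant is None
    return {"quantization": quant, "attention": attention, "cast_to_half": cast_to_half}


print(plan_model_loading("4-bit", "fa2", flash_available=False))
print(plan_model_loading(None, "fa2", flash_available=True))
```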

#### Example:
```python
tokenizer, model = initialize_model_and_tokenizer("path/to/checkpoint", "4-bit", "fa2")
```


In [8]:
def initialize_model_and_tokenizer(ck, qz, ai):
    if qz == "4-bit":
        qc = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif qz == "8-bit":
        qc = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_use_double_quant=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
        )
    else:
        qc = None

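    # "fa2" and "eg" are this notebook's shorthand; transformers expects
    # "flash_attention_2" or "eager" as the attn_implementation value
    if ai == "fa2":
        if is_flash_attn_2_available() and is_flash_attn_greater_or_equal_2_10():
            ai = "flash_attention_2"
        else:
            ai = "eager"
    elif ai == "eg":
        ai = "eager"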

    tk = AutoTokenizer.from_pretrained(ck, trust_remote_code=True)
    m = AutoModelForSeq2SeqLM.from_pretrained(
        ck,
        trust_remote_code=True,
        attn_implementation=ai,
        low_cpu_mem_usage=True,
        quantization_config=qc,
    )

    if qc is None:
        m = m.to(DEVICE)
        m.half()

    m.eval()

    return tk, m


### 🔄 **Batch Translation Function**

The `batch_translate` function performs translation of a batch of input sentences from a source language to a target language using a pre-trained model. Here's a step-by-step breakdown:

---

#### **1. Process Batches**
- The function iterates over the input sentences in batches of size `BATCH_SIZE`. Each batch is processed individually to handle large datasets efficiently.
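
As a minimal sketch of the batching step, the loop below slices a list into consecutive chunks. `iter_batches` is a hypothetical helper name; the notebook does the same slicing inline with the global `BATCH_SIZE`.

```python
def iter_batches(items, batch_size):
    # Step through the list in strides of batch_size; the last batch may be shorter
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [f"sentence {n}" for n in range(1, 11)]
batches = list(iter_batches(sentences, 4))
print([len(batch) for batch in batches])  # → [4, 4, 2]
```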

---

#### **2. Preprocess Batch**
- **Preprocessing**: Each batch is preprocessed by the `ip` (IndicProcessor) to prepare the data for tokenization and translation. This step involves tasks like normalizing text and handling entity mappings.

---

#### **3. Tokenize Input**
- **Tokenization**: The preprocessed batch is tokenized using the provided `tokenizer`. This converts the text into input encodings suitable for the model. The tokenization includes padding and truncation to handle variable-length sentences.
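
To illustrate what padding to the longest sentence means, here is a toy sketch on plain lists of token ids. The ids, the `pad_id=0` default and right-side padding are assumptions made for illustration; the real tokenizer chooses its own pad token and returns tensors.

```python
def pad_to_longest(token_id_lists, pad_id=0):
    # Every sequence is extended to the length of the longest one in the batch
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_to_longest([[5, 8, 2], [7, 2], [9, 4, 6, 2]])
print(ids)
print(mask)
```

The attention mask is what lets the model tell padded positions apart from real tokens.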

---

#### **4. Generate Translations**
- **Model Inference**: The model generates translations based on the tokenized inputs. This is done in a no-gradient context to save memory and computational resources. The `generate` method is used with parameters like `num_beams` for beam search, `min_length`, and `max_length` for controlling output length.

---

#### **5. Decode Tokens**
- **Decoding**: The generated tokens are decoded back into human-readable text using the tokenizer. Special tokens are removed, and the text is cleaned up to ensure proper formatting.
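
The effect of skipping special tokens during decoding can be shown with a toy vocabulary. Everything in this sketch (`id_to_token`, the special token names, the `decode` helper) is hypothetical and far simpler than the real tokenizer, which also merges subword pieces.

```python
def decode(token_ids, id_to_token, special_tokens):
    # Map ids back to tokens, then drop markers such as padding and end-of-sentence
    tokens = [id_to_token[token_id] for token_id in token_ids]
    kept = [token for token in tokens if token not in special_tokens]
    return " ".join(kept)


id_to_token = {0: "<pad>", 1: "</s>", 2: "hello", 3: "world"}
text = decode([2, 3, 1, 0, 0], id_to_token, special_tokens={"<pad>", "</s>"})
print(text)  # → hello world
```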

---

#### **6. Postprocess Translations**
- **Postprocessing**: The decoded translations are further processed by the `ip` to handle any necessary transformations, such as entity replacement or formatting adjustments.

---

#### **7. Clean Up**
- **Memory Management**: After processing each batch, the function clears up the memory by deleting intermediate variables and calling `torch.cuda.empty_cache()` to free GPU memory.

---

### **Summary**
The `batch_translate` function efficiently translates batches of sentences from a source language to a target language. It includes steps for preprocessing, tokenization, model inference, decoding, and postprocessing, with careful memory management throughout.

---

#### **Process Overview**

Below is a visual representation of the batch translation process:

![Batch Translation Process](https://drive.google.com/uc?export=view&id=1gpu0Q-M_4Z7y7S0aobp2qFrNQdeNAiMY)



In [9]:

def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        with tokenizer.as_target_tokenizer():
            generated_tokens = tokenizer.batch_decode(
                generated_tokens.detach().cpu().tolist(),
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations



## Initialization of Models and Tokenizers

In this section, we initialize the `IndicProcessor` for inference and set up three different models using the `initialize_model_and_tokenizer` function. Here’s the process:

#### IndicProcessor Initialization:
- **`ip`**: An instance of the `IndicProcessor` class is created with inference mode enabled.

#### Models and Tokenizers Initialization:
- **`en_indic_tokenizer, en_indic_model`**: The first pair represents the tokenizer and model for English to Indic translation. They are initialized using the `initialize_model_and_tokenizer` function with the checkpoint directory `en_indic_ckpt_dir`, the quantization setting, and the attention implementation.

- **`indic_en_tokenizer, indic_en_model`**: The second pair represents the tokenizer and model for Indic to English translation. These are initialized the same way, but with the `indic_en_ckpt_dir` checkpoint.

- **`indic_indic_tokenizer, indic_indic_model`**: The third pair is for Indic to Indic translation. They are initialized using the checkpoint directory `indic_indic_ckpt_dir`.

#### Example:
```python
# Initialize the IndicProcessor
ip = IndicProcessor(inference=True)

# Initialize the models and tokenizers
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization, attn_implementation)
indic_en_tokenizer, indic_en_model = initialize_model_and_tokenizer(indic_en_ckpt_dir, quantization, attn_implementation)
indic_indic_tokenizer, indic_indic_model = initialize_model_and_tokenizer(indic_indic_ckpt_dir, quantization, attn_implementation)
```

#### I am using flash-attention to optimise the computation of attention.

In [10]:
ip = IndicProcessor(inference=True)
quantization = "4-bit"
attn_implementation = "fa2"

en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(
    en_indic_ckpt_dir, quantization, attn_implementation
)

indic_en_tokenizer, indic_en_model = initialize_model_and_tokenizer(
    indic_en_ckpt_dir, quantization, attn_implementation
)

indic_indic_tokenizer, indic_indic_model = initialize_model_and_tokenizer(
    indic_indic_ckpt_dir, quantization, attn_implementation
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [11]:
BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

hi_sents = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "उसके पास बहुत सारी पुरानी किताबें हैं, जिन्हें उसने अपने दादा-परदादा से विरासत में पाया।",
    "मुझे समझ में नहीं आ रहा कि मैं अपनी समस्या का समाधान कैसे ढूंढूं।",
    "वह बहुत मेहनती और समझदार है, इसलिए उसे सभी अच्छे मार्क्स मिले।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "वह अपनी दीदी के साथ बाजार गयी थी ताकि वह नई साड़ी खरीद सके।",
    "राज ने मुझसे कहा कि वह अगले महीने अपनी नानी के घर जा रहा है।",
    "सभी बच्चे पार्टी में मज़ा कर रहे थे और खूब सारी मिठाइयाँ खा रहे थे।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]
src_lang, tgt_lang = "hin_Deva", "mar_Deva"
mr_translations = batch_translate(
    hi_sents, src_lang, tgt_lang, indic_indic_model, indic_indic_tokenizer, ip
)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(hi_sents, mr_translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")





hin_Deva - mar_Deva
hin_Deva: जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।
mar_Deva: जेव्हा मी लहान होतो, तेव्हा मी दररोज उद्यानात जायचे. 
hin_Deva: उसके पास बहुत सारी पुरानी किताबें हैं, जिन्हें उसने अपने दादा-परदादा से विरासत में पाया।
mar_Deva: तिच्याकडे बरेच जुने पुस्तक आहे, जे तिला तिच्या आजी-आजोबांकडून वारसा मिळाले. 
hin_Deva: मुझे समझ में नहीं आ रहा कि मैं अपनी समस्या का समाधान कैसे ढूंढूं।
mar_Deva: माझ्या समस्येचे निराकरण कसे करावे हे मला समजत नाही. 
hin_Deva: वह बहुत मेहनती और समझदार है, इसलिए उसे सभी अच्छे मार्क्स मिले।
mar_Deva: तो खूप मेहनती आणि समजदार आहे, म्हणून त्याला सर्व चांगले गुण मिळाले. 
hin_Deva: हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।
mar_Deva: आम्ही गेल्या आठवड्यात एक नवीन चित्रपट पाहिला जो खूप प्रेरणादायी होता. 
hin_Deva: अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।
mar_Deva: जर तुम्हाला त्या वेळी माझ्याकडे भेट दिली असती, तर आम्ही बाहेर जेवायला गेलो होतो. 
hin_Deva: वह अपनी दीदी के साथ बाजार गयी थी ताकि वह नई साड़ी खरीद सके।
ma

# Evaluation metrics

## To evaluate how this model performs, we need reference translations against which its output can be compared.

## For this purpose I am using the `FLORES-22 Indic dev set`, which contains English-Indic translation pairs.

### First, I mount the Google Drive folder in which the dataset is stored.

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
!pip install sacrebleu rouge-score




## For demonstration, I am using the English-Bengali data from the `FLORES-22 Indic dev set`

### 📝 **Translation Process from English to Bengali**

In this translation process, we convert English sentences to Bengali using a pre-trained model. Below is a breakdown of the steps:

---

#### 📂 **Step 1: Loading Sentences**
- **English Sentences**: The English text is loaded from a file. Each line in the file corresponds to a different English sentence.
- **Bengali Sentences**: The Bengali text is loaded from another file. Each line represents the corresponding Bengali sentence.

---

#### 🌐 **Step 2: Setting Language Codes**
- **Source Language**: The language code for English is specified as `eng_Latn`, which denotes English in Latin script.
- **Target Language**: The language code for Bengali is set to `ben_Beng`, representing Bengali in Bengali script.

---

#### 🔄 **Step 3: Translation Using Pre-trained Model**
- The English sentences are passed to a pre-trained model designed for translation. This model translates English sentences into Bengali.

---

#### 📊 **Step 4: Printing Results**
- The translated Bengali sentences are then displayed alongside the original English sentences for comparison.

---


In [14]:
# Note: asm_Beng is the FLORES code for Assamese in Bengali script, so these paths load the
# English-Assamese pair; Bengali references would need the ben_Beng files instead.
eng_path = "/content/drive/MyDrive/flores-22_dev/flores-22_dev/all/eng_Latn-asm_Beng/dev.eng_Latn"
beng_path = "/content/drive/MyDrive/flores-22_dev/flores-22_dev/all/eng_Latn-asm_Beng/dev.asm_Beng"
def get_first_n_lines(sentences, n):
    """Returns the first n lines as a single string."""
    return ' '.join(sentence.strip() for sentence in sentences[:n])

# Define the number of lines to include in the paragraph
num_lines = 10

# Load the English sentences
with open(eng_path, 'r', encoding='utf-8') as f:
    eng_sentences = f.readlines()

# Load the Bengali sentences
with open(beng_path, 'r', encoding='utf-8') as f:
    beng_sentences = f.readlines()

# Get the first 10 lines for both English and Bengali
eng_paragraph = get_first_n_lines(eng_sentences, num_lines)
beng_paragraph = get_first_n_lines(beng_sentences, num_lines)

# Print the paragraphs
print("First 10 lines of English sentences as a paragraph:")
print(eng_paragraph)

print("\nFirst 10 lines of Bengali sentences as a paragraph:")
print(beng_paragraph)



First 10 lines of English sentences as a paragraph:
On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each. Lead researchers say this may bring early detection of cancer, tuberculosis, HIV and malaria to patients in low-income countries, where the survival rates for illnesses such as breast cancer can be half those of richer countries. The JAS 39C Gripen crashed onto a runway at around 9:30 am local time (0230 UTC) and exploded, closing the airport to commercial flights. The pilot was identified as Squadron Leader Dilokrit Pattavee. Local media reports an airport fire vehicle rolled over while responding. 28-year-old Vidal had joined Barça three seasons ago, from Sevilla. Since moving to the Catalan-capital, Vidal had played 49 games for the club. The protest started around 11:00 

In [15]:
eng_paragraph

"On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each. Lead researchers say this may bring early detection of cancer, tuberculosis, HIV and malaria to patients in low-income countries, where the survival rates for illnesses such as breast cancer can be half those of richer countries. The JAS 39C Gripen crashed onto a runway at around 9:30 am local time (0230 UTC) and exploded, closing the airport to commercial flights. The pilot was identified as Squadron Leader Dilokrit Pattavee. Local media reports an airport fire vehicle rolled over while responding. 28-year-old Vidal had joined Barça three seasons ago, from Sevilla. Since moving to the Catalan-capital, Vidal had played 49 games for the club. The protest started around 11:00 local time (UTC+1) on Whitehall opposite the police

In [17]:
def translate_paragraph(input_text, src_lang, tgt_lang, model, tokenizer, ip):
    input_sentences = split_sentences(input_text, src_lang)
    translated_text = batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip)
    return " ".join(translated_text)

In [20]:
translate_paragraph(
    "On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each. Lead researchers say this may bring early detection of cancer, tuberculosis, HIV and malaria to patients in low-income countries, where the survival rates for illnesses such as breast cancer can be half those of richer countries. The JAS 39C Gripen crashed onto a runway at around 9:30 am local time (0230 UTC) and exploded, closing the airport to commercial flights. The pilot was identified as Squadron Leader Dilokrit Pattavee. Local media reports an airport fire vehicle rolled over while responding. 28-year-old Vidal had joined Barça three seasons ago, from Sevilla. Since moving to the Catalan-capital, Vidal had played 49 games for the club. The protest started around 11:00 local time (UTC+1) on Whitehall opposite the police-"
    , src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip
)

'सोमवार को, स्टैनफोर्ड यूनिवर्सिटी स्कूल ऑफ मेडिसिन के वैज्ञानिकों ने एक नए नैदानिक उपकरण के आविष्कार की घोषणा की जो कोशिकाओं को प्रकार के अनुसार क्रमबद्ध कर सकता हैः एक छोटी छापने योग्य चिप जिसे संभवतः लगभग एक यू. एस. के लिए मानक इंकजेट प्रिंटर का उपयोग करके बनाया जा सकता है । प्रत्येक प्रतिशत । प्रमुख शोधकर्ताओं का कहना है कि इससे कम आय वाले देशों में कैंसर, तपेदिक, एच. आई. वी. और मलेरिया का जल्दी पता चल सकता है, जहां स्तन कैंसर जैसी बीमारियों के लिए जीवित रहने की दर अमीर देशों की तुलना में आधी हो सकती है । जे. ए. एस. 39सी. ग्रिपेन स्थानीय समयानुसार सुबह लगभग 9:30 बजे (0230 यू. टी. सी.) एक रनवे पर दुर्घटनाग्रस्त हो गया और विस्फोट हो गया, जिससे हवाई अड्डे को वाणिज्यिक उड़ानों के लिए बंद कर दिया गया । पायलट की पहचान स्क्वाड्रन लीडर दिलोक्रित पट्टवी के रूप में की गई थी । स्थानीय मीडिया ने बताया कि हवाई अड्डे पर आग बुझाने वाला एक वाहन जवाबी कार्रवाई करते हुए पलट गया । 28 वर्षीय विडाल तीन सत्र पहले सेविला से बार्सिलोना में शामिल हुआ था । कैटलन - राजधानी में जाने के बाद से, विडाल ने क्लब क

In [21]:
BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [22]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'


In [26]:

# FLORES codes are case-sensitive, and the tokenizer must belong to the same checkpoint as the model
src_lang, tgt_lang = "eng_Latn", "ben_Beng"
translate_paragraph(
    '''On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.
Lead researchers say this may bring early detection of cancer, tuberculosis, HIV and malaria to patients in low-income countries, where the survival rates for illnesses such as breast cancer can be half those of richer countries.
The JAS 39C Gripen crashed onto a runway at around 9:30 am local time (0230 UTC) and exploded, closing the airport to commercial flights.
The pilot was identified as Squadron Leader Dilokrit Pattavee.
Local media reports an airport fire vehicle rolled over while responding.
28-year-old Vidal had joined Barça three seasons ago, from Sevilla.
Since moving to the Catalan-capital, Vidal had played 49 games for the club.
The protest started around 11:00 local time (UTC+1) on Whitehall opposite the police-guarded entrance to Downing Street, the Prime Minister's official residence.
''', src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip
)


In [25]:
src_lang, tgt_lang = "eng_Latn", "ben_Beng"
# Translate line by line so that each translation pairs with its source sentence.
# The result is stored in mr_translations because the evaluation cells read that name.
eng_inputs = [sentence.strip() for sentence in eng_sentences[:num_lines]]
mr_translations = batch_translate(
    eng_inputs, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip
)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(eng_inputs, mr_translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")


### 📝 **Evaluating Translation Quality: BLEU and ROUGE Scores**

This evaluation process measures the quality of translations using BLEU and ROUGE metrics. Below is a breakdown of the steps involved:

---

#### 📂 **Step 1: Import Libraries**
- **BLEU Calculation**: Import `corpus_bleu` from the `nltk.translate.bleu_score` library for computing the BLEU score.
- **ROUGE Calculation**: Import `RougeScorer` from the `rouge_score` library for calculating ROUGE scores.

---

#### 📄 **Step 2: Load and Preprocess Data**
- **Reference Sentences**: Load the reference Bengali sentences from a file and split them into tokens (words).
- **Machine-Generated Translations**: Similarly, process the machine-generated Bengali translations into tokens.

---

#### 📊 **Step 3: Calculate BLEU Score**
- **BLEU Score**: Use the `corpus_bleu` function to compute the BLEU score, which measures how closely the machine-generated translations match the reference sentences in terms of n-gram precision.
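
To make n-gram precision concrete, the sketch below computes the clipped precision that BLEU is built on, for a single hypothesis and reference. It is only one ingredient: the full BLEU score combines these precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty. The helper names and the English example sentences are chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous runs of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


hypothesis = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p1 = clipped_precision(hypothesis, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(hypothesis, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```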

#### 📈 **Step 4: Calculate ROUGE Scores**
- **ROUGE Scorer**: Initialize the `RougeScorer` with the ROUGE metrics (`rouge1`, `rouge2`, `rougeL`) to compute ROUGE scores, which evaluate the overlap of n-grams and sequences between the generated and reference translations.
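
As a rough illustration of what the ROUGE-1 F-measure captures, the sketch below counts overlapping unigrams with plain whitespace tokenization. This is a simplification and an assumption of the example: the `rouge_score` library applies its own tokenizer and optional stemming, so its numbers can differ.

```python
from collections import Counter


def rouge1_f(reference, prediction):
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    # Multiset intersection: each word counts as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat is on the mat", "the cat sat on the mat")
print(score)
```

Five of the six words overlap on both sides, so precision, recall and the F-measure all equal 5/6 here.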

In [None]:
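import sacrebleu
references = [s.strip() for s in beng_sentences[:len(mr_translations)]]  # one reference per hypothesis
bleu = sacrebleu.corpus_bleu(mr_translations, [references])  # hypotheses first, then the reference streams
print(f"BLEU score: {bleu.score}")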

In [None]:
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

# Load the reference sentences, one per line
with open(beng_path, 'r', encoding='utf-8') as f:
    reference_sentences = [line.strip() for line in f]

# Keep exactly one reference per translated sentence
reference_sentences = reference_sentences[:len(mr_translations)]

# Whitespace-tokenize both sides for NLTK BLEU
reference_tokens = [sent.split() for sent in reference_sentences]
hypothesis_tokens = [sent.split() for sent in mr_translations]

# corpus_bleu expects, for each hypothesis, a list of tokenized references
bleu_score = corpus_bleu([[ref] for ref in reference_tokens], hypothesis_tokens)
print(f"BLEU score: {bleu_score}")


class WhitespaceTokenizer:
    """The default rouge_score tokenizer keeps only [a-z0-9], which discards Indic text."""

    def tokenize(self, text):
        return text.split()


# Compute ROUGE scores with the whitespace tokenizer
scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL'], tokenizer=WhitespaceTokenizer()
)

rouge_scores = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}
for ref, pred in zip(reference_sentences, mr_translations):
    scores = scorer.score(ref, pred)
    for key in rouge_scores:
        rouge_scores[key] += scores[key].fmeasure

# Average ROUGE scores over the evaluated sentences
num_sentences = len(reference_sentences)
rouge_scores = {key: value / num_sentences for key, value in rouge_scores.items()}
print(f"ROUGE scores: {rouge_scores}")
