
## Translation of prompt JSON using BanglaT5

In [18]:
!pip install git+https://github.com/csebuetnlp/normalizer

Collecting git+https://github.com/csebuetnlp/normalizer
  Cloning https://github.com/csebuetnlp/normalizer to /tmp/pip-req-build-wcxqvwme
  Running command git clone --filter=blob:none --quiet https://github.com/csebuetnlp/normalizer /tmp/pip-req-build-wcxqvwme
  Resolved https://github.com/csebuetnlp/normalizer to commit d405944dde5ceeacb7c2fd3245ae2a9dea5f35c9
  Preparing metadata (setup.py) ... done


In [20]:
import json
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize

In [23]:
# Translate a list of English sentences to Bangla with the BanglaT5 NMT checkpoint
def translate_with_banglat5(sentences):
    model_name = "csebuetnlp/banglat5_nmt_en_bn"
    # Resolve the device locally so the function does not rely on a global
    # that is only defined in a later cell.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("Loading BanglaT5 model...")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    print("BanglaT5 model loaded.")

    translations = []
    for i, sentence in enumerate(sentences):
        print(f"Translating sentence {i+1}/{len(sentences)} with BanglaT5...")
        # Normalize the input text before tokenizing it
        input_ids = tokenizer(normalize(sentence), return_tensors="pt").input_ids.to(device)
        generated_tokens = model.generate(input_ids)
        translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
        translations.append(translation)
        print(f"English: {sentence}")
        print(f"BanglaT5 Translation: {translation}")

    # Release the model and cached GPU memory before anything else is loaded
    del model
    del tokenizer
    torch.cuda.empty_cache()
    print("BanglaT5 translation complete.")
    return translations
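
The function above calls `model.generate` once per sentence. A batched variant would first group the sentences into fixed-size chunks. The grouping step alone could look like the sketch below; the helper name `batched` and the batch size of 4 are hypothetical and do not come from the notebook.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for the 10 prompts stored under one key (illustrative strings only).
sentences = [f"sentence {i}" for i in range(1, 11)]
batches = list(batched(sentences, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch could then be tokenized in one call with `padding=True` and passed to `model.generate` together. Whether that is faster depends on the GPU and on the sentence lengths.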


In [24]:
# Load the English prompts: a JSON object mapping each key to a list of sentences
with open('downsized_rel_prompt.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
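
The translation loop assumes the loaded JSON is an object whose values are lists of strings. A small check run right after loading can surface a malformed file before any model is downloaded. This is a minimal sketch; `validate_prompts` is a hypothetical helper and the sample data is illustrative, not the real file.

```python
from typing import Any, Dict, List


def validate_prompts(data: Any) -> Dict[str, List[str]]:
    """Return `data` unchanged if it maps keys to lists of strings, else raise TypeError."""
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    for key, sentences in data.items():
        if not isinstance(sentences, list) or not all(isinstance(s, str) for s in sentences):
            raise TypeError(f"value for {key!r} must be a list of strings")
    return data


# Illustrative sample only; the notebook would pass the loaded `data` instead.
sample = {"Judaism": ["Judaism is an ethnic religion."]}
validate_prompts(sample)
```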

In [26]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
translated_data = {}
# Each call below reloads BanglaT5, so the model is loaded once per key.
for key, sentences in data.items():
    print(f"Translating for {key}...")
    translated_data[key] = translate_with_banglat5(sentences)


Translating for Judaism...
Loading BanglaT5 model...
BanglaT5 model loaded.
Translating sentence 1/10 with BanglaT5...
English: Judaism is an ethnic religion comprising a collective cultural, religious, and legal tradition.
BanglaT5 Translation: যিহূদী ধর্ম একটি জাতিগত ধর্ম যা একটি সমষ্টিগত সাংস্কৃতিক, ধর্মীয় এবং আইনগত ঐতিহ্য নিয়ে গঠিত।
Translating sentence 2/10 with BanglaT5...
English: Modern branches of Judaism such as Reform and Conservative adopt a more liberal approach.
BanglaT5 Translation: যিহূদী ধর্মের আধুনিক শাখাগুলি যেমন সংস্কার ও রক্ষণশীলেরা আরও উদার দৃষ্টিভঙ্গি গ্রহণ করে।
Translating sentence 3/10 with BanglaT5...
English: Orthodox Judaism maintains that the Torah and Talmud are of divine origin.
BanglaT5 Translation: অর্থোডক্স যিহূদীরা মনে করে যে তোরাহ্ এবং তালমুড ঐশিক উৎস থেকে এসেছে।
Translating sentence 4/10 with BanglaT5...
English: Conservative Judaism teaches that Jewish law should adapt to the times.
BanglaT5 Translation: রক্ষণশীল যিহূদীবাদ শিক্ষা দেয় যে যিহূদী আ
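
Because `translate_with_banglat5` loads and frees the model on every call, the loop above reloads it once per key. One possible restructuring is to load the model a single time and hand the loop a per-sentence callable. The sketch below shows only that loop shape, with a stand-in translator so it runs without downloading anything; `translate_all` and `fake_translate` are hypothetical names.

```python
from typing import Callable, Dict, List


def translate_all(
    data: Dict[str, List[str]],
    translate_one: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Apply `translate_one` to every sentence, preserving keys and sentence order."""
    return {
        key: [translate_one(sentence) for sentence in sentences]
        for key, sentences in data.items()
    }


def fake_translate(sentence: str) -> str:
    """Stand-in for a real model call; it only tags the input."""
    return f"<bn>{sentence}</bn>"


# Illustrative keys and sentences, not taken from the prompt file.
demo = translate_all({"topic_a": ["one", "two"], "topic_b": ["three"]}, fake_translate)
print(demo)
```

In the notebook, `translate_one` would wrap the normalize, tokenize, generate, and decode steps around a model and tokenizer that were loaded once beforehand.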

In [27]:
output_path = '/content/translated_rel_prompt_bn.json'
with open(output_path, 'w', encoding='utf-8') as file:
    json.dump(translated_data, file, ensure_ascii=False, indent=4)

print(f"Translated data saved to {output_path}")

Translated data saved to /content/translated_rel_prompt_bn.json
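
The `ensure_ascii=False` and `encoding='utf-8'` arguments are what keep the Bangla text readable in the saved file. The sketch below writes a short illustrative value to a temporary file, reads it back, and compares it with the default escaped form; the file name and sample value are placeholders.

```python
import json
import os
import tempfile

sample = {"Judaism": ["যিহূদী ধর্ম"]}  # short illustrative value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_bn.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()

reloaded = json.loads(raw)
escaped = json.dumps(sample)  # default ensure_ascii=True writes \uXXXX escapes

print(reloaded == sample)   # round trip preserves the data
print("যিহূদী" in raw)        # Bangla characters are stored as-is
print("যিহূদী" in escaped)    # the default output escapes them instead
```

Both forms load back to the same data; the difference is only whether the file can be read by eye.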


## Text generation with Meta Llama-3-8B-Instruct

In [1]:
!huggingface-cli login



    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
!pip install transformers torch accelerate bitsandbytes
!pip install --upgrade transformers

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting transformers
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.45.2-py3-none-any.whl (9.9 MB)
Downloading tokenizers-0.20.1-cp310-cp310-

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig, pipeline
from huggingface_hub import login

In [7]:
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)
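
A rough calculation shows why 4-bit loading matters on a Colab GPU. The sketch below compares weight storage at 16 bits and at 4 bits per parameter. The parameter count is an assumption ("8B" taken at face value), and the estimate ignores quantization constants, any layers left unquantized, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumption: "8B" taken at face value

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint, roughly 3.7 GiB instead of 14.9 GiB.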


In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
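config = AutoConfig.from_pretrained(model_id)
# Linear RoPE scaling (factor 8); it only takes effect if this config is
# passed to from_pretrained(..., config=config) when the model is loaded.
config.rope_scaling = { "type": "linear", "factor": 8.0 }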
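
Linear RoPE scaling divides each position index by the scaling factor before the rotary angles are computed, so with a factor of 8 position 8 receives the angles that position 1 had originally. The sketch below illustrates that idea in plain Python. It is not the `transformers` implementation, and the head size and base frequency are assumed values for Llama-3-8B.

```python
import math
from typing import List


def rope_angles(position: int, head_dim: int, base: float, factor: float = 1.0) -> List[float]:
    """Rotary angles for one position; linear scaling divides the position by `factor`."""
    scaled_position = position / factor
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128        # assumption: 4096 hidden size / 32 attention heads
ROPE_BASE = 500000.0  # assumption: Llama-3 rope_theta

unscaled = rope_angles(position=1, head_dim=HEAD_DIM, base=ROPE_BASE)
scaled = rope_angles(position=8, head_dim=HEAD_DIM, base=ROPE_BASE, factor=8.0)

# Position 8 under factor 8 lines up with position 1 without scaling.
print(all(math.isclose(a, b) for a, b in zip(unscaled, scaled)))
```

Squeezing positions this way lets longer inputs fall inside the position range seen in training, but a model that was not fine-tuned with the same scaling may produce lower-quality output.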


In [10]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map='auto')
