Demo Translation 

- Bulgarian written in the Cyrillic script

    - Учените от медицинското училище в университета в Станфод обявиха в понеделник изобретяването на нов диагностичен инструмент, който може да сортира клетките по тип: малък печатен чип, който може да бъде произведен с помощта на стандартни мастилено-струйни принтери за вероятно около един американски цент всеки.

- English
    - On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.

In [1]:
LRL = ["Учените от медицинското училище в университета в Станфод обявиха в понеделник изобретяването на нов диагностичен инструмент, който може да сортира клетките по тип: малък печатен чип, който може да бъде произведен с помощта на стандартни мастилено-струйни принтери за вероятно около един американски цент всеки."]

# Kinyarwanda is one of the lowest resrouce langauges in FLORES
LRL = ["Kuwa mbere, abahanga ba siyansi bo mu Ishuri rikuru ry’ubuvuzi rya kaminuza ya Stanford batangaje ko havumbuwe igikoresho gishya cyo gusuzuma gishobora gutandukanya ingirabuzima fatizo hashingiwe ku bwoko: agakoresho gato gacapwa, gashobora gukorwa hifashishijwe icapiro risanzwe rya inkjet mu buryo bushoboka ni hafi igiceri kimwe c'Amerika kuri buri kamwe."]
HRL = ["On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each."]

---

In [15]:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

from openai import OpenAI
client = OpenAI(api_key= OPENAI_API_KEY)

In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

  from .autonotebook import tqdm as notebook_tqdm


### 1. NLLB Forward LRL to HRL (English)

In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

  from .autonotebook import tqdm as notebook_tqdm


In [6]:
def translate(source_text, source_lang, target_lang):

    # Kinyarwanda Tokenizer
    tokenizer.src_lang = source_lang

    # Tokenize the input text
    inputs = tokenizer(source_text, return_tensors="pt")

    # Generate the translation according to target language specified
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[target_lang], max_length=30
    )

    # Decode the translated tokens for translated text
    translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

    return translated_text

Translation Demo

In [35]:
# source_lang = "bul_Cyrl"
source_lang = "kin_Latn"
target_lang = "eng_Latn"
# source_text = "Студентите от Медицинския факултет на Станфордския университет обявиха в понеделник изобретението на нов диагностичен инструмент, който може да сортира клетки по тип на малък печатен чип, който може да бъде произведен с помощта на стандартен принтер за тапети вероятно за около един американски цент на човек."
source_text = "Kuwa mbere, abahanga ba siyansi bo mu Ishuri rikuru ry’ubuvuzi rya kaminuza ya Stanford batangaje ko havumbuwe igikoresho gishya cyo gusuzuma gishobora gutandukanya ingirabuzima fatizo hashingiwe ku bwoko: agakoresho gato gacapwa, gashobora gukorwa hifashishijwe icapiro risanzwe rya inkjet mu buryo bushoboka ni hafi igiceri kimwe c'Amerika kuri buri kamwe."


machine_translation = translate(source_text, source_lang, target_lang)

print(machine_translation)

On Monday, scientists at Stanford University Medical School announced the discovery of a new diagnostic tool that can differentiate stem cells by type: a


### 2. NLLB Back-Translation HRL (English) back to LRL

In [36]:
# source_lang = "eng_Latn"
# target_lang = "kin_Latn"

# target_lang = "bul_Cyrl"
# source_text = hypothesis_one

# back_translation = translate(source_text, source_lang, target_lang)


def back_translation(source_text, source_lang, target_lang):
    return translate(source_text, target_lang, source_lang)

back_translation_text = back_translation(machine_translation, source_lang, target_lang)

print(back_translation_text)

Ku wa mbere, abahanga bo mu Ishuri ry'Ubuvuzi rya Kaminuza ya Stanford batangaje ko bavumbuye igikoresho


### 3. GPT for Translation Error Detection - Using all previous translation references

In [37]:
def evaluate_translation(source_lang, target_lang, source_text, machine_translation, back_translation_text):

  response = client.chat.completions.create(
    # model="gpt-4o-2024-05-13",
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "system",
        "content": [
          {
            "text": f'Identify the errors in translations from {source_lang} to {target_lang}. I have provided 3 references:\n\n1. 1 initial low resource sentence that got translated into \n2. 1 English translation, and then back translated into \n3. 1 reference sentence to compare with the initial input\n\nHere they are:\n\n 1. {source_text} \n\n2. {machine_translation} \n\n3. {back_translation_text} \n\nRemember to most weight on the first one as that is the initial input from the user\n\nList the translation errors only, in the english translation that resulted\n',
            "type": "text"
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "text": "1. \"Учените\" was translated as \"Students\" instead of the correct \"Scientists.\"\n2. \"медицинското училище в университета в Станфод\" was translated as \"Stanford University School of Medicine,\" which misses the importance of the definitive article \"the\" before \"Stanford.\"\n3. \"мастилено-струйни принтери за вероятно около един американски цент всеки\" was partially omitted. It should indicate that the diagnostic tool can be produced for around one American cent each using standard inkjet printers.\n4. \"sort cells by a type of small printed\" is an incomplete clause and does not capture \"малък печатен чип\" as \"a small printed chip.\"\n\nOverall, the English translation missed key details and structure from the original sentence.",
            "type": "text"
          }
        ]
      }
    ],
    temperature=1,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
  )

  return response.choices[0].message.content, machine_translation

In [38]:
res, machine_translation = evaluate_translation(source_lang, target_lang, source_text, machine_translation, back_translation_text)
res

'Translation Error:\n- "Students" was translated instead of "Scientists."\n- "Discovery" was missing in the translation.\n- The translation was missing the details about the diagnostic tool being able to be produced for around one American cent each using standard inkjet printers.\n- Incorrect translation of "small printed chip" as "small printed."'

In [39]:
import json
from termcolor import colored

def highlight_errors(original_text, translation_text, error_json):

    errors = []

    with open(error_json, encoding='utf-8') as file:
        errors = json.load(file)["errors"]
        # print(errors)
    
    indices_og = {}
    indices_translation = {}

    for error in errors:
        # print(error)
        
        indices_og[error["start_index_orig"]] = error["end_index_orig"]
        indices_translation[error["start_index_translation"]] = error["end_index_translation"]


    indices_og = dict(sorted(indices_og.items()))
    indices_translation = dict(sorted(indices_translation.items()))

    # print(indices_og, indices_translation)
    

    highlight_errors_og = ""
    colors = ["on_blue","on_red","on_green"]
    count = 0 

    # print(list(indices_og.items()))

    
    highlight_errors_translation = ""
    count = 0 
    prev = 0

    for key,val in indices_og.items():

        # if key <= prev or val <= prev:
            # prev = min(prev,val,key)

        # print()
        highlight_errors_og += original_text[0][prev:key]
        # print()

        if key < val and prev < val:
            highlight_errors_og += colored(original_text[0][key:val], "white", colors[count])
        
        # print(prev,key,val,)

        count += 1
        prev = val

    highlight_errors_og += original_text[0][prev:]
    
    count = 0 
    prev = 0
    
    
    
    # 
    # 
    # 
    
    
    
    
    for key,val in indices_translation.items():

        # if key <= prev or val <= prev:
            # prev = min(prev,val,key)

        # print()
        highlight_errors_translation += translation_text[0][prev:key]
        # print()

        if key < val and prev < val:
            highlight_errors_translation += colored(translation_text[0][key:val], "white", colors[count])
        
        # print(prev,key,val,)

        count += 1
        prev = val

    highlight_errors_translation += translation_text[0][prev:]

    

    return highlight_errors_og, highlight_errors_translation

og, translation = highlight_errors(source_text, [machine_translation], "error_sample.json")

print(og)
print(translation)

[44m[97mK[0m[41m[97m[0m[42m[97m[0m
[44m[97mOn Mond[0may, scientists at Stanford University Medica[41m[97ml School announced the discovery of a new diagnostic tool[0m that can differentiate stem cells by type: a


In [185]:
print(hypothesis_one)

Students from Stanford University Medical School announced Monday the invention of a new diagnostic tool that can sort cells by type of small printed chip,


### 4. Error Mapping Allignment Algorithm

In [None]:
def error_mapping_alignment():
    # TODO: Add error mapping
    pass


---
---

### 5. Evaluation

In [39]:
import sacrebleu

def compute_scores(hypothesis, reference):
    bleu = sacrebleu.corpus_bleu([hypothesis], [reference])
    chrf = sacrebleu.corpus_chrf([hypothesis], [reference])
    print(f"spBLEU score: {bleu.score}")
    print(f"chrF score: {chrf.score}")


In [43]:
compute_scores(hypothesis_one, HRL)

spBLEU score: 22.466891648632068
chrF score: 47.44336565879224


### Full FLORES Evaluation

In [70]:
dataset_path = "../flores/floresp-v2.0-rc.2/devtest"
english_data_path = "flores/floresp-v2.0-rc.2/devtest/devtest.eng_Latn"

lines = []


'''
Want to test translations from first n language datasets, 
against their English Machine translations through the pipeline 

First k lines from the first n language datasets
'''

n = 3

for filename in os.listdir(dataset_path):
    print("Reading", filename)
    
    with open(f'{dataset_path}/{filename}', encoding='utf-8') as dataset:

        lines = [next(dataset) for _ in range(10)]

        print(lines)
 
    n -= 1
    if n == 0:
        break


# print(lines)

Reading devtest.ace_Arab
['"کامو جينو نا تيكويه عمو ٤ بولن ڽڠ هانا ديابيتيس ڽڠ اوايجيه ساکيت ديابيتيس،" غتامه لى غوبڽن. \n', 'در. ايهود اور، ڤروفيسور کدوکترن بق يونيۏرسيتس دلهاوسي دي هليفاک\u200cس، نوفا سکوتيا ڠون کڤالا ديۏيسي كلينيس ڠون علميه دري اسوسياسي ديابيتيس کانادا ݢڤئيڠت بهوا ڤنليتين ڽو منتوڠ لم ماس ڽڠ کفون.\n', 'لݢى لادوم اورڠ چاروڠ لاءينجيه، غوبڽن راݢو ڤکوه ڤثاکيت ديابيتيس جوت ڤولي، سبب ڽڠ جيتومى ڽو هانا مسڠكوت دڠن اورڠ ڽڠ كا مڤثاکيت ديابيتيس جنيه ١.\n', 'بق اورو سنين، سارا دانيوس، سيکريتاريس تتڤ کوميتى نوبل سسترا بق اكاديمي سويديا، ݢعمومکن ڠون بق مندوم اورڠ لم ماس ڤروݢرم راديو دي راديو سۏاريا دي سويديا کوميتى ڽن، هن جوت مسمبوڠ دڠن بوب ديلان ڠون لانسوڠ تنتڠ کمنڠن هديه نوبل ٢٠١٦ لم سسترا، كا ݢڤتيڠݢاي اوساها کى مسمبوڠ دڠن غوبڽن.\n', 'دانيوس خن، "جينو كامو هان مبوت سڤو. لون کا لون تليڤون ڠون لون کيريم ايمايل کى کولابوراتور ݢوبڽن ڽڠ ڤاليڠ رب ڠون لون تريموڠ باله ڽڠ ڤاليڠ روميه. جينو، ڽن کا سيڤ.\n', 'سيݢوهلوم غوبڽن، سي.اي.او. ريڠ، جامي سيمينوف، ݢخن کى ڤراوسهاءن فون واتى بيل ڤنتو غو

### LangChain Agent Pipeline

In [None]:
from langchain.chains import SimpleChain

def pipeline(text):
    # 1: LRL to HRL Forward Translation
    translated_text = translate(text, source_lang="LRL_code", target_lang="en")
    
    # 2: Back Translation
    back_translated_text = translate(translated_text, source_lang="en", target_lang="LRL_code")
    
    # Step 3: Error Identification & Classification
    errors = classify_errors(back_translated_text)
    
    # Step 4: Error Mapping Alignment
    error_mapping_alignment()
    
    return {
        "original_text": text,
        "translated_text": translated_text,
        "back_translated_text": back_translated_text,
        "errors": errors
    }

result = pipeline("Didn't get here being careful ")
print(result)


ROUGH

---
---

In [2]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

In [4]:
# chinese_text = "生活就像一盒巧克力。"
chinese_text = "忘记了我在打扰的时间"

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

AttributeError: 'NllbTokenizerFast' object has no attribute 'get_lang_id'