# Initial Config


Note from Mekael
- For my team, use a virutal environment to keep deployment operations clean and tidy
- You can do this by running the following command in your terminal
- `python -m venv error-in-translations`
- Activate the environment by selecting it as your kernel for your jupyter notebook. If it doesn't work you will have to figure it out
- pip install the packages in the requirements file in the root directory of this repo
- `pip install -r requirements.txt`


- **If you install new packages that are not included in the environment, please add it to the requirements file manually or generate a new requirements file with the following command in the terminal**
- `pip freeze > requirements.txt`

https://github.com/openlanguagedata/flores

In [28]:
import pandas as pd

import openai
from openai import OpenAI
import os
import spacy
import random
import sacrebleu

# Not working
OPENAI_API_KEY = 'sk-proj-o9TONJi0MW2tSiDMhRkxT3BlbkFJkUr03XQ5IfUaxamV0e3k'

# Mekael's Personal Key, not being shared
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Baseline Benchmark

This benchmark sets a baseline and tests the translation precision & accuracy our POC pipeline, against the bare translation capabilities of OPENAI's CHATGPT 3.5 Turbo via their API.

If our POC performs better than the stock GPT 3.5 Turbo, it means that our proposed method is valubale and worthwhile to implement. 

For baseline testing purposes, our POC makes use of custom GPT 3.5 API prompting as the translation model as well as the quality estimation model. These will be replaced with a more sophisticated custom LLM solution during actual implementation.

We will be using Meta's Flores 200 dataset for this testing, and scores will be in the form of spBLEU and/or chrF++

---

# Implementation

## View the dataset

In [2]:
english_dataset = "../flores/floresp-v2.0-rc.2/dev/dev.eng_Latn"

# Rename the Column
column_names = ["Text Lines"]

# Read in the Flores Dataset English Latn
df = pd.read_csv(english_dataset, delimiter = '\t', header=None, names=column_names)
df

Unnamed: 0,Text Lines
0,"On Monday, scientists from the Stanford Univer..."
1,Lead researchers say this may bring early dete...
2,The JAS 39C Gripen crashed onto a runway at ar...
3,The pilot was identified as Squadron Leader Di...
4,Local media reports an airport fire vehicle ro...
...,...
984,The tourist season for the hill stations gener...
985,"However, they have a different kind of beauty ..."
986,Only a few airlines still offer bereavement fa...
987,"Airlines that offer these include Air Canada, ..."


## Select Source & Target Languages

In [3]:
# Devtest folder will be used for this baseline testing as dev
# may be more likely to appear in training data
flores_dataset = "../flores/floresp-v2.0-rc.2/devtest"

language_datasets = os.listdir(flores_dataset)


# Randomly select source and target languages
src_language_dataset = random.choice(language_datasets)
targ_language_dataset = random.choice(language_datasets)

# Assure that the source and target languages are not the same
while targ_language_dataset == src_language_dataset:
    targ_language_dataset = random.choice(language_datasets)

## Translations using our proposed method (POC) and stock GPT 3.5 Turbo

### Add POC (no glossary)

In [4]:
# prompt with entity translation
def prompt_generator(text, source_language, target_language):
  prompt = f"Translate the following text from {source_language} into {target_language}: {text}\n"
  if terms == {}:
    return prompt
  prompt = translations + prompt

  return prompt

Machine Translation code, can be replaced by other models.

In [5]:
def translate_text(prompt):

    client = OpenAI(api_key= OPENAI_API_KEY,)
    print(prompt)
    response = client.chat.completions.create(
      messages=[{
            "role": "user",
            "content": prompt,
          }],
      model="gpt-3.5-turbo",)
    translation = response.choices[0].message.content.strip().split("\n")[0]
    return translation

Quality Estimation

In [6]:
def quality_estimator(original_text, translated_text):
  client = OpenAI(api_key= OPENAI_API_KEY,)
  prompt = f"Evaluate the quality estimation of the following source and translation sentence pairs by following a step-by-step process: \
    Step 1: Estimate the perplexity of the translated sentence.\
    Step 2: Determine the token-level similarity between the source and translatedsentences.\
    Step 3: Combine the results and classify the translation quality into one of the following categories:'No meaning preserved', 'Some meaning preserved, but not understandable', 'Some meaning preserved and understandable', 'Most meaningpreserved, minor issues',or 'Perfect translation'.\
    Source:{original_text}.Translation:{translated_text}"
  print(prompt)
  response = client.chat.completions.create(
    messages=[{
          "role": "user",
          "content": prompt,
        }],
    model="gpt-3.5-turbo",
    )
  result = response.choices[0].message.content
  return result

In [7]:
def quality_classifier(evaluation):
  start_index = evaluation.find("'")
  end_index = evaluation.find("'", start_index + 1)
  category = evaluation[start_index+1:end_index]
  return category

In [8]:
original_text = "光纤照上去变成黑光纤了"
translated_text = "The fiber optic cable shines and turns into dark fiber."
evaluation = quality_estimator(original_text, translated_text)
print(evaluation)

Evaluate the quality estimation of the following source and translation sentence pairs by following a step-by-step process:     Step 1: Estimate the perplexity of the translated sentence.    Step 2: Determine the token-level similarity between the source and translatedsentences.    Step 3: Combine the results and classify the translation quality into one of the following categories:'No meaning preserved', 'Some meaning preserved, but not understandable', 'Some meaning preserved and understandable', 'Most meaningpreserved, minor issues',or 'Perfect translation'.    Source:光纤照上去变成黑光纤了.Translation:The fiber optic cable shines and turns into dark fiber.
Step 1: Estimate the perplexity of the translated sentence.

Perplexity is a measure of how well a probability model predicts a sample. In the context of machine translation, a lower perplexity score indicates better translation quality. We would need a language model to calculate the perplexity score accurately, but we can estimate it by c

## Baseline Test

In [119]:
# Pull a random line from source and target datasets

src_sentences = []
targ_sentences = []

def read_dataset(path):
    with open(flores_dataset + "/" + path, 'r', encoding="utf-8") as dataset_file:
        lines = dataset_file.readlines()
        total_lines = len(lines)
    return lines, total_lines

        
src_lines, total_lines = read_dataset(src_language_dataset)
targ_lines, total_lines = read_dataset(src_language_dataset)

selected_line_int = random.randint(1, total_lines)

print(selected_line_int)

selected_line_src = src_lines[selected_line_int - 1]
selected_line_targ = targ_lines[selected_line_int - 1]

print(selected_line_src)
# print(src_sentences)
# print(targ_sentences)

849
Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho,



In [102]:
print(src_language_dataset)

devtest.twi_Latn_akua1239


In [None]:
# Translate source to target using our POC


In [120]:
# Translate source to target using stock GPT 3.5 Turbo

prompt = f'Translate "{selected_line_src}" from the {src_language_dataset} language into the target language {targ_language_dataset}'
translation = ""

client = OpenAI(api_key= OPENAI_API_KEY,)
print(prompt)
response = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": prompt,
        }],
    model="gpt-3.5-turbo",)

hypothesized_translation = response.choices[0].message.content.strip().split("\n")[0]

print("\n\nMANUAL CHECK LIST PARAMS\n\n")

print(src_language_dataset, src_sentences)
print(targ_language_dataset, targ_sentences)


print(f'TRANSLATION: {hypothesized_translation}')

Translate "Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho,
" from the devtest.twi_Latn_akua1239 language into the target language devtest.kab_Latn


MANUAL CHECK LIST PARAMS


devtest.twi_Latn_akua1239 []
devtest.kab_Latn []
TRANSLATION: "Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho," translates to "Tzɂa i nchbraben a ɚen anchkaandan tsʋbo sha jirjelkoo atusre inyeme ahalɩbli i tsʋkpa dunzakɔkpir abrabrɩtɔ" in Kab_Latn.


## Compute spBLEU & chrF++ Scores

Below are reference cells using the scores with english language sentence showing how it works. 

```hypothesized_translation``` can be just a string which contains the translation from our model

```target_translation``` should be a list of strings (the score expects list of lists of strings)

In [64]:
target_translation = targ_sentences

bleu_score  = sacrebleu.corpus_bleu([hypothesized_translation], [target_translation], tokenize="intl")
print("spBLEU Score: ", bleu_score.score)

spBLEU Score:  1.2414943415352928


In [None]:
hypothesized_translation = "Lose track of ti"
target_translation = "Lose track of time"

bleu_score  = sacrebleu.corpus_chrf([hypothesized_translation], [[target_translation]])
print("chrF++ Score: ", bleu_score.score)

chrF++ Score:  86.51314090979501


---

Below is a demo trying to find out why the score is so bad. If you want to approach then change the stuff below. NOTE: ```target_translation``` should be a list of lists of strings that are in the TARGET LANGUAGE as REFERENCES for the score to judge against. This means that all sentences being fed to the score, ```hypothesized_translation``` and all strings in ```target_translation``` are all in the same language. Be careful with what each variable contains there are similar named variables in this notebook.

In [121]:
# hypothesized_translation = "Anḍil aṭas uṭṭun d ayen n 'feral' neɣ uṭṭun. Imdukkal yessnen ayen yal aṭṭun wagi ɣef wulawen-is (asnin-iten yakan weḥḍen); yella d aṭṭun yessufeɣ aḥric neɣ yelha d acu kan aɣewṭur"


hypothesized_translation = hypothesized_translation


target_translation = [selected_line_targ
]

bleu_score  = sacrebleu.corpus_bleu([hypothesized_translation], [target_translation], tokenize="intl")
print("spBLEU Score: ", bleu_score.score)

chr_score  = sacrebleu.corpus_chrf([hypothesized_translation], [target_translation])
print("chrF++ Score: ", chr_score.score)

print(hypothesized_translation)
# print(selected_line_src)
print(selected_line_targ)

spBLEU Score:  46.493318002787895
chrF++ Score:  79.9140429774981
"Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho," translates to "Tzɂa i nchbraben a ɚen anchkaandan tsʋbo sha jirjelkoo atusre inyeme ahalɩbli i tsʋkpa dunzakɔkpir abrabrɩtɔ" in Kab_Latn.
Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho,



In [90]:
import sacrebleu

hypothesized_translation = ["Je perds la notion du temps, je deviens fou."]
target_translation = [["Je perds la notion du temps, je deviens fou."]]

bleu_score  = sacrebleu.corpus_bleu(hypothesized_translation, target_translation, tokenize="intl")
print("spBLEU Score: ", bleu_score.score)

chr_score  = sacrebleu.corpus_chrf(hypothesized_translation, target_translation)
print("chrF++ Score: ", chr_score.score)


spBLEU Score:  100.00000000000004
chrF++ Score:  100.0


In [81]:
# hypothesized_translation = "Anḍil aṭas uṭṭun d ayen n 'feral' neɣ uṭṭun. Imdukkal yessnen ayen yal aṭṭun wagi ɣef wulawen-is (asnin-iten yakan weḥḍen); yella d aṭṭun yessufeɣ aḥric neɣ yelha d acu kan aɣewṭur"


hypothesized_translation = "Tsere tsin abu bay ɣud jabeg ma, ɣur nu wira. Tɛɣa tu tɔb tɛna, tu aɣa d bul bay tɛt bayu. Nyenta but gan, nynta n ma jkung ta ndejɣa nuut bay ta."


target_translation = ['ɣer taggara n Tallit Talemmast Tuṛuft n umalu tebda ad tesnulfuy aɣan n yiman-is. yiwen n usnulfu meqqren akk n tallit d ayen i yeǧǧan imdanen n yimidagen ad bdun ad ttarran tiqeffilin i wakken ad ṭṭfen iceṭṭiḍen-nsen.',
'Tafellaḥt n tudert d tafellaḥt i xeddmen i ufares n ddeqs n wučči i wakken kan ad ččen ifellaḥen d twaculin-nsen.',
'Tafellaḥt n tudert d anagraw isehlen, tikkwal agaman, s usexdem n zzeriεa n tmurt-nni yettwajemεen i sdukkulen s tuzzya n yirden neɣ timamkin tiḥerfiyin icudden ɣer waya i wakken ad snernin lɣella.',
'Deg umezruy tuget n yifellaḥen ttekkan deg tfellaḥt n tudert yerna d ayen mazal ɣer tura deg waṭas n yiɣlanen deg ubrid n tneflit.',
'Idelsan imecṭaḥ jemεen-d imdanen yesεan tidmiyin yettemcabin yettḥassan d akken ttwaεzalen seg yilugan imettiyen yerna fkan-asen tagnit i wakken sneflin anamek n tmagit.',
'Nezmer ad d-neεqel idelsan imecṭaḥ s leεmeṛ, agraw atni, taserkemt, adeg, d/neɣ tuzzuft n yimaṣlaḍen.',
'Tulmisin yesbanayen amek ad tεeqleḍ adles amecṭuḥ zemrent ad ilint d tisnilsanin, tifulkanin, tisɣanin, tisertiyin, tuzzufin, tirakalin, neɣ d asdukkel n yimgan.',
'Imaṣlaḍen n udles amecṭuḥ zgan sbanayen-d attekki-nsen s usexdem azamulan n waɣan yettmaεqalen , am lebsa, tikli, d tmeslayt.',
'Yiwet si tarrayin timagnutin akk yettwasxedmen i wakken ad d-nesmedya azal n usmetti d tumla n kra n tejṛutin yettuɣaḍen n warrac imecṭaḥ i, s ustehzi, lexṣaṣ n zher, neɣ tameḥqranit s lebɣi, ur ten-smettin ara imengaḍen mi ttimɣuren.',
'Arrac imecṭaḥ am wigi qqaren-asen d "imulasen" neɣ d iweḥciyen. Kra n warrac imecṭaḥ imulasen qqnen-ten deg uxxam yemdanen (s wudem imezgi d imawlan-nsen); di kra n tejṛutin n usinef n ugrud-agi yella-d imi imawlan ur qbilen ara iɣeblan n tdawsa taɣaṛant d taggagt iweεren n ugrud. ',
'Igerdan iweḥciyen yezmer jerrben tamuḥqranit neɣ d tiyita qessiḥen uqbel ad ten-ǧǧen neɣ ad rewlen.',
'Llan wiyaḍ qqaren-d fell-asen d akken rebban-ten-id iɣersiwen; kra niḍen nnan-d fell-asen d akken dren deg teẓgi i yiman-nsen.',
'Imi tturebban-d s wudem ummid sɣur iɣersiwen war-alsiyen, igerdan iweḥciyen ttammalen-d tikli ( s tilisa tiɣaṛanin) qrib am tin n yiɣersiwen i ten-irebban, am tugdi-nsen neɣ tigurzent ɣer yilsiyen.',
'Imi almad yebnan ɣef usenfar yessishil almad yerna yettara-t yesεa azal, tasuki n leεḍil tεedda akkin i waya.',
'Tasuki n leεḍil ur telli d tarrayt n ulmad maca d lemεawna i yettεawanen imdanen i yesεan tirmit n ulmad tamaynut am usexdem n useɣzan amaynut n uselkam neɣ tazwara n usenfar amaynut.',
'Wid-agi yettεawanen zemren ad ilin d uhlisen neɣ d ilawen, s wawalen niḍen, aselmad d talɣa n umεiwen maca ula d amessak n lkaɣeḍ n Lbiru Mikrusuft.',
'Imεiwnen uhlisen llan deg useɣzan yerna llan i wakken ad steqsin, sḥercen, akked ad segzun tisekkirin iweεṛen i yinelmaden ad tent-fehmen s yiman-nsen.',
'Igerdan ttilin deg Ixxamen n Twaculin niḍen ɣef waṭas n ssebbat tanḍiyin seg ustehzi, tamuḥqranit, ula ɣer tukksa.',
'Ur yelli ugrud ilaqen ad yimɣur deg uxxam anda ulac leḥmala, aseḥbiber, d usinen, maca nutni ddren akka.',
'Nettwali d akken Anagraw n Twaculin Yettrebbin d adeg yesεan taɣellist i yigerdan-agi.',
'Anagraw-nneɣ n twaculin yettrebbin tiṭ-is ad yefk ixxamen yesεan taɣellist, imejjayen yesεan tayri, asinen irekden, d asejji bu taflest.',
'Ixxamen n twaculin yettrebbin ilaq ad sεun akk ayen ixuṣṣen deg yixxamen seg ansi id ten-id-yekksen yakan.',
'Internet yesdukkel iferdisen n teɣwalt n waṭas n lɣaci d tin gar yemdanen.',
'Tulmisin tufṛizin n Internet ṣṣawaḍent ɣer tsektiwin timernanin deg wayen yeεnan tudsa n useqdec akked taḍfi.',
'D amedya, "almad" d "wesmetti" summren-ten am isguffden ixataren i useqdec n Internet (James d wiyaḍ., 1995).',
]

bleu_score  = sacrebleu.corpus_bleu([hypothesized_translation], [target_translation], tokenize="intl")
print("spBLEU Score: ", bleu_score.score)

spBLEU Score:  1.1734190039234365


In [88]:
hypothesized_translation = "Anḍil aṭas uṭṭun d ayen n 'feral' neɣ uṭṭun. Imdukkal yessnen ayen yal aṭṭun wagi ɣef wulawen-is (asnin-iten yakan weḥḍen); yella d aṭṭun yessufeɣ aḥric neɣ yelha d acu kan aɣewṭur"
target_translation = ['ɣer taggara n Tallit Talemmast Tuṛuft n umalu tebda ad tesnulfuy aɣan n yiman-is. yiwen n usnulfu meqqren akk n tallit d ayen i yeǧǧan imdanen n yimidagen ad bdun ad ttarran tiqeffilin i wakken ad ṭṭfen iceṭṭiḍen-nsen.',
'Tafellaḥt n tudert d tafellaḥt i xeddmen i ufares n ddeqs n wučči i wakken kan ad ččen ifellaḥen d twaculin-nsen.',
'Tafellaḥt n tudert d anagraw isehlen, tikkwal agaman, s usexdem n zzeriεa n tmurt-nni yettwajemεen i sdukkulen s tuzzya n yirden neɣ timamkin tiḥerfiyin icudden ɣer waya i wakken ad snernin lɣella.',
'Deg umezruy tuget n yifellaḥen ttekkan deg tfellaḥt n tudert yerna d ayen mazal ɣer tura deg waṭas n yiɣlanen deg ubrid n tneflit.',
'Idelsan imecṭaḥ jemεen-d imdanen yesεan tidmiyin yettemcabin yettḥassan d akken ttwaεzalen seg yilugan imettiyen yerna fkan-asen tagnit i wakken sneflin anamek n tmagit.',
'Nezmer ad d-neεqel idelsan imecṭaḥ s leεmeṛ, agraw atni, taserkemt, adeg, d/neɣ tuzzuft n yimaṣlaḍen.',
'Tulmisin tufṛizin n Internet ṣṣawaḍent ɣer tsektiwin timernanin deg wayen yeεnan tudsa n useqdec akked taḍfi.',
'D amedya, "almad" d "wesmetti" summren-ten am isguffden ixataren i useqdec n Internet (James d wiyaḍ., 1995).',
]

chr_score  = sacrebleu.corpus_chrf([hypothesized_translation], [target_translation])
print("chrF++ Score: ", chr_score.score)

chrF++ Score:  18.932100676864668


---

Dummy cell below

In [36]:
# Hypotheses (system translations)
hypotheses = ["This is a test translation.", "Here is another sentence."]

# References (each sublist contains the reference translations for one hypothesis)
references = [
    ["This is a test translation.", "This is a trial translation."],
    ["Here is another sentence.", "This is another sentence."]
]

# Compute the BLEU score using a specific tokenizer ('intl' or '13a', 'zh' for Chinese, etc.)
bleu_score = sacrebleu.corpus_bleu(hypotheses, references, tokenize='intl')
print("BLEU Score:", bleu_score.score)

BLEU Score: 86.27788640890412
