# Initial Config


Note from Mekael
- For my team: use a virtual environment to keep deployment operations clean and tidy.
- Create one by running the following command in your terminal:
- `python -m venv error-in-translations`
- Activate the environment by selecting it as the kernel for your Jupyter notebook. If that does not work, you will have to troubleshoot the kernel setup yourself.
- Install the packages listed in the requirements file in the root directory of this repo:
- `pip install -r requirements.txt`


- **If you install new packages that are not yet in the requirements file, please add them manually or regenerate the file with the following command in the terminal:**
- `pip freeze > requirements.txt`

https://github.com/openlanguagedata/flores

In [2]:
# !pip install sacrebleu
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1




In [3]:
import pandas as pd

import openai
from openai import OpenAI
import os
import spacy
import random
import sacrebleu

# Mekael's personal key is loaded from a local .env file and is not shared.
# Never hard-code API keys in the notebook.
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
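
`os.getenv` returns `None` when the variable is missing, so a missing or empty key would only show up later as an authentication error from the API. The sketch below fails early instead. The helper names `require_env` and `mask_secret` are hypothetical and not part of the notebook.

```python
import os


def require_env(name, env=None):
    """Return a non-empty environment value or raise a clear error.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


def mask_secret(value, visible=4):
    """Mask all but the last few characters so a key can be logged safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

A possible use is `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` right after `load_dotenv()`, printing only `mask_secret(OPENAI_API_KEY)` when debugging.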

# Baseline Benchmark

This benchmark sets a baseline by comparing the translation precision and accuracy of our POC pipeline against the bare translation capabilities of OpenAI's GPT-3.5 Turbo, accessed via the API.

If our POC performs better than stock GPT-3.5 Turbo, that indicates our proposed method is valuable and worthwhile to implement.

For baseline testing purposes, our POC uses custom GPT-3.5 API prompting as both the translation model and the quality estimation model. These will be replaced with a more sophisticated custom LLM solution during actual implementation.

We will use Meta's FLORES-200 dataset for this testing, and scores will be reported as spBLEU and/or chrF++.

---

# Implementation

## View the dataset

In [4]:
english_dataset = "../flores/floresp-v2.0-rc.2/dev/dev.eng_Latn"

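# Name the single text column
column_names = ["Text Lines"]

# Read in the FLORES English (eng_Latn) dev set, one sentence per line.
# quoting=3 (csv.QUOTE_NONE) keeps quote characters inside sentences from being parsed as field quotes.
df = pd.read_csv(english_dataset, delimiter='\t', header=None, names=column_names, quoting=3)
df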

| | Text Lines |
|---|---|
| 0 | On Monday, scientists from the Stanford Univer... |
| 1 | Lead researchers say this may bring early dete... |
| 2 | The JAS 39C Gripen crashed onto a runway at ar... |
| 3 | The pilot was identified as Squadron Leader Di... |
| 4 | Local media reports an airport fire vehicle ro... |
| ... | ... |
| 984 | The tourist season for the hill stations gener... |
| 985 | However, they have a different kind of beauty ... |
| 986 | Only a few airlines still offer bereavement fa... |
| 987 | Airlines that offer these include Air Canada, ... |


## Select Source & Target Languages

In [5]:
# Devtest folder will be used for this baseline testing as dev
# may be more likely to appear in training data
flores_dataset = "../flores/floresp-v2.0-rc.2/devtest"

language_datasets = os.listdir(flores_dataset)


# Randomly select source and target languages
src_language_dataset = random.choice(language_datasets)
targ_language_dataset = random.choice(language_datasets)

# Ensure that the source and target languages are not the same
while targ_language_dataset == src_language_dataset:
    targ_language_dataset = random.choice(language_datasets)
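
The same selection can be written without a retry loop by drawing two distinct entries at once. This is a minimal sketch. `pick_language_pair` and `language_code` are hypothetical helpers, and the file-name pattern `devtest.<code>` is assumed from the `devtest.ars_Arab` name printed later in this notebook.

```python
import random


def pick_language_pair(dataset_files, rng=random):
    """Draw two distinct dataset files: (source, target)."""
    files = list(dataset_files)
    if len(files) < 2:
        raise ValueError("Need at least two language files to form a pair.")
    # sample() draws without replacement, so the two picks always differ
    src, targ = rng.sample(files, 2)
    return src, targ


def language_code(dataset_file):
    """Strip the split prefix: 'devtest.ars_Arab' -> 'ars_Arab'."""
    return dataset_file.split(".", 1)[-1]
```

Passing a seeded `random.Random(...)` instance as `rng` makes the chosen pair reproducible between runs.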

## Translations using our proposed method (POC) and stock GPT 3.5 Turbo

### Add POC (no glossary)

In [6]:
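# Prompt with optional entity translations.
# Assumption: `terms` and `translations` were undefined in the original cell, so they are
# passed in here as optional arguments (glossary terms and the glossary text to prepend).
def prompt_generator(text, source_language, target_language, terms=None, translations=""):
  prompt = f"Translate the following text from {source_language} into {target_language}: {text}\n"
  # No glossary terms: return the plain translation prompt
  if not terms:
    return prompt
  # Otherwise prepend the glossary translations to the prompt
  prompt = translations + prompt

  return prompt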
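
The notebook does not show how the glossary text is built. The sketch below is one possible way to turn a term dictionary into a prompt prefix. The wording of the prefix and the names `build_glossary_prefix` and `prompt_with_glossary` are assumptions, not the project's actual format.

```python
def build_glossary_prefix(terms):
    """Turn {source_term: target_term} into a one-line instruction (assumed format)."""
    if not terms:
        return ""
    pairs = "; ".join(f'"{src}" = "{tgt}"' for src, tgt in terms.items())
    return f"Use these term translations: {pairs}.\n"


def prompt_with_glossary(text, source_language, target_language, terms=None):
    """Build the translation prompt, prefixed by the glossary when terms are given."""
    prompt = (
        f"Translate the following text from {source_language} "
        f"into {target_language}: {text}\n"
    )
    return build_glossary_prefix(terms) + prompt
```

With no terms the result is the plain translation prompt, which matches the "no glossary" case of this section.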

Machine translation code; the model can be swapped for another.

In [7]:
def translate_text(prompt):

    client = OpenAI(api_key= OPENAI_API_KEY,)
    print(prompt)
    response = client.chat.completions.create(
      messages=[{
            "role": "user",
            "content": prompt,
          }],
      model="gpt-3.5-turbo",)
    translation = response.choices[0].message.content.strip().split("\n")[0]
    return translation

Quality Estimation

In [8]:
def quality_estimator(original_text, translated_text):
  client = OpenAI(api_key= OPENAI_API_KEY,)
  prompt = f"Evaluate the quality estimation of the following source and translation sentence pairs by following a step-by-step process: \
    Step 1: Estimate the perplexity of the translated sentence.\
    Step 2: Determine the token-level similarity between the source and translated sentences.\
    Step 3: Combine the results and classify the translation quality into one of the following categories: 'No meaning preserved', 'Some meaning preserved, but not understandable', 'Some meaning preserved and understandable', 'Most meaning preserved, minor issues', or 'Perfect translation'.\
    Source: {original_text}. Translation: {translated_text}"
  print(prompt)
  response = client.chat.completions.create(
    messages=[{
          "role": "user",
          "content": prompt,
        }],
    model="gpt-3.5-turbo",
    )
  result = response.choices[0].message.content
  return result

In [9]:
def quality_classifier(evaluation):
  start_index = evaluation.find("'")
  end_index = evaluation.find("'", start_index + 1)
  category = evaluation[start_index+1:end_index]
  return category
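
Taking the text between the first two apostrophes is fragile: a reply that opens with a word like "doesn't", or that quotes nothing at all, yields the wrong span. A hedged alternative is to search the reply for the five known category labels. The sketch below uses a heuristic that is an assumption, not the notebook's method: it returns the label mentioned last, on the guess that the model states its verdict at the end. Whitespace is ignored when matching so that spacing variants of a label still match.

```python
import re

QUALITY_CATEGORIES = [
    "No meaning preserved",
    "Some meaning preserved, but not understandable",
    "Some meaning preserved and understandable",
    "Most meaning preserved, minor issues",
    "Perfect translation",
]


def _normalize(text):
    """Lowercase and drop all whitespace so spacing differences do not matter."""
    return re.sub(r"\s+", "", text).lower()


def classify_by_label(evaluation, categories=QUALITY_CATEGORIES):
    """Return the known category mentioned last in the reply, or None."""
    haystack = _normalize(evaluation)
    best_label, best_pos = None, -1
    for label in categories:
        pos = haystack.rfind(_normalize(label))
        if pos > best_pos:
            best_label, best_pos = label, pos
    return best_label
```

Returning `None` when no label is found lets the caller decide whether to retry the request or flag the sample for manual review.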

In [10]:
original_text = "光纤照上去变成黑光纤了"
translated_text = "The fiber optic cable shines and turns into dark fiber."
evaluation = quality_estimator(original_text, translated_text)
print(evaluation)

Evaluate the quality estimation of the following source and translation sentence pairs by following a step-by-step process:     Step 1: Estimate the perplexity of the translated sentence.    Step 2: Determine the token-level similarity between the source and translatedsentences.    Step 3: Combine the results and classify the translation quality into one of the following categories:'No meaning preserved', 'Some meaning preserved, but not understandable', 'Some meaning preserved and understandable', 'Most meaningpreserved, minor issues',or 'Perfect translation'.    Source:光纤照上去变成黑光纤了.Translation:The fiber optic cable shines and turns into dark fiber.
Step 1: Estimate the perplexity of the translated sentence.

Perplexity is a measure of how well a probability model predicts a sample. In the context of translation quality evaluation, lower perplexity indicates better quality as it means the translation is more coherent and grammatically correct. To estimate the perplexity of the translat

## Baseline Test

In [11]:
# Pull a random line from source and target datasets

src_sentences = []
targ_sentences = []

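def read_dataset(path):
    # Read every line of one language file inside the FLORES devtest folder
    with open(os.path.join(flores_dataset, path), 'r', encoding="utf-8") as dataset_file:
        lines = dataset_file.readlines()
        total_lines = len(lines)
    return lines, total_lines


src_lines, total_lines = read_dataset(src_language_dataset)
# Read the reference lines from the target-language file, not the source file
targ_lines, total_lines = read_dataset(targ_language_dataset)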

selected_line_int = random.randint(1, total_lines)

print(selected_line_int)

# Strip the trailing newline so it does not leak into prompts or scores
selected_line_src = src_lines[selected_line_int - 1].strip()
selected_line_targ = targ_lines[selected_line_int - 1].strip()

print(selected_line_src)
# print(src_sentences)
# print(targ_sentences)

797
تضم البعض من المهرجانات مناطق تخييمٍ خاصةٍ للعائلات التي معها أولاد صغار.
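
Because the source and reference sentences must come from the same line number in two different files, it can help to read both files through one function that also checks they are aligned. A minimal sketch, with hypothetical helper names `read_lines` and `sample_parallel_pair`:

```python
import random


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as dataset_file:
        return [line.rstrip("\n") for line in dataset_file]


def sample_parallel_pair(src_path, targ_path, rng=random):
    """Pick one random line index and return (index, source line, target line)."""
    src_lines = read_lines(src_path)
    targ_lines = read_lines(targ_path)
    # Parallel files must have the same number of lines to stay aligned
    if len(src_lines) != len(targ_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} vs {len(targ_lines)}"
        )
    index = rng.randrange(len(src_lines))
    return index, src_lines[index], targ_lines[index]
```

The returned index is zero-based, unlike the one-based `selected_line_int` used in the cell above.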



In [12]:
print(src_language_dataset)

devtest.ars_Arab


In [13]:
# Translate source to target using our POC


In [14]:
# Translate source to target using stock GPT 3.5 Turbo

prompt = f'Translate "{selected_line_src}" from the {src_language_dataset} language into the target language {targ_language_dataset}. Output only the translation.'
translation = ""

client = OpenAI(api_key= OPENAI_API_KEY,)
print(prompt)
response = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": prompt,
        }],
    model="gpt-3.5-turbo",)

hypothesized_translation = response.choices[0].message.content.strip().split("\n")[0]

print("\n\nMANUAL CHECK LIST PARAMS\n\n")

print(src_language_dataset, src_sentences)
print(targ_language_dataset, targ_sentences)


print(f'TRANSLATION: {hypothesized_translation}')

Translate "تضم البعض من المهرجانات مناطق تخييمٍ خاصةٍ للعائلات التي معها أولاد صغار.
" from the devtest.ars_Arab language into the target language devtest.ita_Latn. Output only the 


AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************pDOs. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

## Compute spBLEU & chrF++ Scores

Below are reference cells that demonstrate how the scores work, using English-language sentences.

```hypothesized_translation``` can be a single string containing the translation from our model.

```target_translation``` should be a list of strings (the scoring functions expect a list of lists of strings).

---

Below is a demo for investigating why the score is so poor. If you want to try a different approach, change the code below. NOTE: ```target_translation``` should be a list of lists of strings in the TARGET LANGUAGE, used as REFERENCES for the score to judge against. This means that every sentence fed to the scorer, ```hypothesized_translation``` and all strings in ```target_translation```, must be in the same language. Be careful about what each variable contains; there are similarly named variables in this notebook.
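
The "list of lists" shape is easy to get wrong. sacrebleu's corpus-level functions take the hypotheses as a list of strings and the references as a list of reference streams, where each stream holds one reference per hypothesis, in the same order. If references are stored per sentence instead, they need to be transposed. A minimal sketch; `to_reference_streams` is a hypothetical helper, not a sacrebleu function:

```python
def to_reference_streams(references_per_sentence):
    """Transpose per-sentence references into per-stream references.

    Input:  [[ref_a for sent 1, ref_b for sent 1], [ref_a for sent 2, ref_b for sent 2]]
    Output: [[ref_a for sent 1, ref_a for sent 2], [ref_b for sent 1, ref_b for sent 2]]
    """
    counts = {len(refs) for refs in references_per_sentence}
    if len(counts) > 1:
        raise ValueError("Every sentence needs the same number of references.")
    return [list(stream) for stream in zip(*references_per_sentence)]
```

For a single hypothesis with a single reference, the result is `[[reference]]`, which is the shape passed to the scoring calls in the next cell.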

In [121]:
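# Reference translation(s) in the target language for the selected line
target_translation = [selected_line_targ]

# spBLEU is BLEU computed over SentencePiece tokens. The "flores200" tokenizer
# assumes sacrebleu >= 2.2 with the sentencepiece package installed.
bleu_score = sacrebleu.corpus_bleu([hypothesized_translation], [target_translation], tokenize="flores200")
print("spBLEU Score: ", bleu_score.score)

# chrF++ is chrF with word n-grams enabled (word_order=2); the default word_order=0 gives plain chrF
chr_score = sacrebleu.corpus_chrf([hypothesized_translation], [target_translation], word_order=2)
print("chrF++ Score: ", chr_score.score)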

print(hypothesized_translation)
# print(selected_line_src)
print(selected_line_targ)

spBLEU Score:  46.493318002787895
chrF++ Score:  79.9140429774981
"Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho," translates to "Tzɂa i nchbraben a ɚen anchkaandan tsʋbo sha jirjelkoo atusre inyeme ahalɩbli i tsʋkpa dunzakɔkpir abrabrɩtɔ" in Kab_Latn.
Wɔn a wɔforo abotan a wofi wiase afanan nyinaa no gu so renya akwan foforo ahorow wɔ afasu bebree a wobetumi aforo no ho,

