# Dataset Translation

In the paper, we introduce HistoiresMorales, a French dataset built upon a corpus of human-written moral stories in English called MORALSTORIES. 
MORALSTORIES was introduced by Emelin et al. (2021): https://aclanthology.org/2021.emnlp-main.54/.

We use the `gpt-3.5-turbo-16k` model for translation, accessed via the Chat Completions API in November 2023. 
This model is released by OpenAI: https://platform.openai.com/docs/models. It is trained on data up to September 2021.

We initiate the data translation process with a simple prompt and refine it through human feedback. Below, we describe the construction of the prompt body and the corresponding data annotation procedures.

In this notebook, we provide the code used to translate the data. The translation process is described in the section "Second Annotations Stage" of the paper. 

Running this code requires the `datasets==2.13.1` and `openai==1.12.0` packages, which are installed below and listed in `requirements.txt`.

In [1]:
%pip install datasets==2.13.1 openai==1.12.0 -q
%pip install --upgrade typing-extensions -q
%pip install tiktoken -q

Note: you may need to restart the kernel to use updated packages.


# Dataset loading

In [2]:
# If this cell raises "ImportError: cannot import name 'Iterator' from 'typing_extensions'",
# restart the session in Colab ("Redémarrer la session" in the French interface) or the kernel in Jupyter.
import pandas as pd
from datasets import load_dataset
import os
import json
import matplotlib.pyplot as plt
from time import sleep
from tqdm import tqdm
from difflib import ndiff
from collections import Counter
import tiktoken
from openai import OpenAI

In [3]:
dataset = load_dataset("demelin/moral_stories",'full')
df = pd.DataFrame(dataset["train"])
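# Select the string (object-dtype) columns, skipping the first one, which holds the story ID
string_cols=[col for col, dtype in df.dtypes.items() if dtype == 'O'][1:]
print(string_cols)
# Join the story fields into one newline-separated text per story
df['concatenated_sentences'] = df[string_cols].apply(lambda row: '\n'.join(row), axis=1)
# The full Moral Stories dataset contains 12k stories
assert df.shape[0]==12000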

['norm', 'situation', 'intention', 'moral_action', 'moral_consequence', 'immoral_action', 'immoral_consequence']


# OpenAI Client Initialization: Put key here

Copy your key from https://platform.openai.com/api-keys.
Open the link above and click "Add new secret key".
Note that previously generated keys cannot be copied again.
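
As an alternative to pasting the key into the notebook, the key can be read from an environment variable so that it is not stored in a saved copy of the notebook. The sketch below is a suggestion and not part of the original pipeline: the helper name `read_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed as the variable name because it is the one the `openai` client looks up by default.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a readable error."""
    # Hypothetical helper: strips stray whitespace that often sneaks in when pasting.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With this helper, the client could be created as `client = OpenAI(api_key=read_api_key())`.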

In [4]:
key= "KEY"
client = OpenAI(api_key=key)

# Prompting with demonstrations

We embed demonstrations in the prompt to enhance translation quality.  
Each demonstration consists of a source sentence (S), its translation (T), and a comment on the translation (H), as in the following example:
```
S : Mike wants to run errands and pick up food items for dinner.
T : Michel souhaite faire des courses et ramasser des denrées alimentaires pour le dîner.
H : The translation of ‘pick up’ into ‘ramasser’ is too literal. A more fitting translation for the context is ‘acheter’.
```
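
The same three-part structure can be rendered programmatically. The sketch below is illustrative only: the helper name `format_demo` is hypothetical, and the labels `(S)`, `(T1)`, and `(Rationale)` are the ones used by the prompt-building code in this notebook.

```python
def format_demo(index: int, source: str, translation: str, rationale: str) -> str:
    """Render one demonstration as a text block for the prompt."""
    return (
        f"\n\nDemo {index}:\n"
        f"(S): {source}\n"
        f"(T1): {translation}\n"
        f"(Rationale): {rationale}\n"
    )


demo_text = format_demo(
    0,
    "Mike wants to run errands and pick up food items for dinner.",
    "Michel souhaite faire des courses et ramasser des denrées alimentaires pour le dîner.",
    "The translation of 'pick up' into 'ramasser' is too literal. "
    "A more fitting translation for the context is 'acheter'.",
)
print(demo_text)
```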

We use the following prompt for translation (dubbed **prompt 3** in the paper):

```
In this demonstration-based learning task, we will provide examples for translating moral stories from English to French. 
The demonstrations will follow this structure: S + T + H, where the latter are comments indicating which aspect was wrongly translated with suggested corrections. **Concatenated demonstrations**. 
Now, your task is: 
Translate the following sentences into French and adapt them to the French cultural context. Note: Names must be converted into French equivalents. 
Important: First names, geographical locations, and other named entities must be converted to French equivalents, and their translations should be consistent throughout the story.
```

In [5]:
df_annotations=pd.read_feather("./annotated_data/annotations_01_rationales.feather")
assert df_annotations.shape[0]==15 # We use 15 demonstrations in our task

In [6]:
preprompt_demo = """
In this demonstration-based learning task, we will provide examples for translating moral stories from English to French. 
The demonstrations will follow this structure: S + T + H, where the latter are comments indicating which aspect was wrongly translated with suggested corrections.
"""
# Collect the source text, its first translation, and the annotator's rationale for each demonstration
demo_list = []
for _, row in df_annotations.iterrows():
    demo_list.append({"source": row['original'], "t1": row['translations'], "rationale": row['rationales']})
# Append the demonstrations to the prompt, numbered from 0
for demo_key, demo_value in enumerate(demo_list):
    preprompt_demo += f"\n\nDemo {demo_key}:\n"
    preprompt_demo += f"(S): {demo_value['source']}\n"
    preprompt_demo += f"(T1): {demo_value['t1']}\n"
    preprompt_demo += f"(Rationale): {demo_value['rationale']}\n"
assert len(preprompt_demo)==19576 # number of characters in the concatenated demonstrations

In [7]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))
# The demonstrations take 4729 tokens (cl100k_base encoding), which must stay
# below the maximum context length of the translation model
assert num_tokens_from_string(preprompt_demo, "cl100k_base")==4729
preprompt_demo+= "Now, your task is: "
# Get the final prompt (P3)
preprompt = """Translate the following sentences into French and adapt them to the French cultural context. Note: Names must be converted into French equivalents. 
Important: First names, geographical locations, and other named entities must be converted to French equivalents, and their translations should be consistent throughout the story."""
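
The token count matters because the demonstrations, the instruction, the story, and the generated translation all share one context window. Below is a minimal sketch of that budget check. The helper name `fits_context` is hypothetical, the 16,385-token window is an assumption about `gpt-3.5-turbo-16k`, and the story and completion sizes in the example call are made-up placeholders; only the 4729-token figure comes from the assertion above.

```python
def fits_context(
    prompt_tokens: int,
    story_tokens: int,
    max_completion_tokens: int,
    context_window: int = 16_385,  # assumed window size for gpt-3.5-turbo-16k
) -> bool:
    """Return True if the request and the expected completion fit in the window."""
    return prompt_tokens + story_tokens + max_completion_tokens <= context_window


# 4729 tokens of demonstrations plus placeholder sizes for a story and its translation
print(fits_context(prompt_tokens=4729, story_tokens=300, max_completion_tokens=600))
```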

In [8]:
# Remove already translated data from the queue
directory = 'morales_data'
os.makedirs(directory, exist_ok=True)
next_index = 0
if os.listdir(directory):
    dfs_i = []
    for filename in os.listdir(directory):
        if filename.endswith('.feather'):  # Assuming all files are in Feather format
            filepath = os.path.join(directory, filename)
            df_i = pd.read_feather(filepath)
            dfs_i.append(df_i)
    if dfs_i:
        combined_df = pd.concat(dfs_i, ignore_index=True)
        if combined_df.shape[0]:
            id_processed = set(combined_df['ID'].tolist())
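            shape_0=df.shape
            df = df[~df['ID'].isin(id_processed)]
            print("Removed already processed entries from df. The previous and current df shapes:")
            print(shape_0)
            print(df.shape)
            # Advance to the last existing file index: the save step increments
            # next_index before writing, so existing files are not overwritten
            while os.path.exists(os.path.join(directory, f'data_{next_index + 1}.feather')):
                next_index += 1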
    else:
        print("No Feather files found in the directory.")
else:
    print("Directory is empty.")

Removed already processed entries from df. The previous and current df shapes:
(12000, 10)
(11998, 10)
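
The bookkeeping behind resuming an interrupted run can be sketched in plain Python, independently of pandas. The helper names `next_file_index` and `remaining_ids` are hypothetical, and choosing one past the highest existing index is a design choice of this sketch (it stays safe even when file numbers have gaps), not a description of the cell above.

```python
import re


def next_file_index(filenames):
    """Return one past the highest N among files named data_N.feather (0 if none)."""
    pattern = re.compile(r"data_(\d+)\.feather")
    indices = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices) + 1 if indices else 0


def remaining_ids(all_ids, processed_ids):
    """Return the IDs still to translate, preserving their original order."""
    processed = set(processed_ids)
    return [story_id for story_id in all_ids if story_id not in processed]


print(next_file_index(["data_1.feather", "data_3.feather", "notes.txt"]))
print(remaining_ids(["a", "b", "c"], ["b"]))
```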


Note that we use the default temperature of 1. 

In [9]:
%%time
system_prompt="""
You are a translation model that translates messages for a morality alignment research project.
"""
output_directory = directory[:]
processed_count = 0
data_rows=[]
for index, row in tqdm(df.iterrows(), total=len(df)):
    request_i = preprompt_demo + preprompt + '\nStory:\n' + row['concatenated_sentences']
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": request_i}],
        temperature=1
    )
    translation = completion.choices[0].message.content
    data_row = {
        "ID": row['ID'],
        "model": "gpt-3.5-turbo-16k",
        "prompt_body": row['concatenated_sentences'],
        "temp_default": translation
    }
    data_rows.append(data_row)
    
    processed_count += 1
    if processed_count % 5 == 0:
        next_index += 1
        data_init2 = pd.DataFrame(data_rows)
        output_filepath = os.path.join(output_directory, f'data_{next_index}.feather')
        data_init2.to_feather(output_filepath)
        data_init2 = None
        data_rows=[]
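    else: 
        sleep(0.3)  # short pause between requests
    break  # demo run: stop after the first story; remove this line to translate the full dataset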

  0%|                                                                                        | 0/11998 [00:09<?, ?it/s]

CPU times: total: 31.2 ms
Wall time: 9.19 s
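
A run over the full dataset issues thousands of requests, so a transient API error can interrupt it midway. One common safeguard, which is not part of the original code, is to retry a failed call with exponentially growing pauses. The sketch below is generic: `call_with_retries` and its parameters are hypothetical names, and it catches every exception for brevity, whereas real code would narrow this to the client's rate-limit and connection errors.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, pausing base_delay * 2**attempt after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the translation loop, the API call could then be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, leaving the rest of the loop unchanged.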





In [10]:
if data_rows:
    next_index += 1
    output_filepath = os.path.join(output_directory, f'data_{next_index}.feather')
    data_init2 = pd.DataFrame(data_rows)
    data_init2.to_feather(output_filepath)

In [None]:
df_new = pd.DataFrame(data_init2)
df_new.head(1)
# df_new.to_feather('test.feather')

Translating all the stories takes about 10 hours. The full cost of translation, including the data used for annotation, is \$200.
We estimate the quality of the obtained data in the `data-quality-*` notebooks.
In the section `Temperature Search` of the paper, we elaborate on the impact of temperature. In this notebook, we use the default temperature, which yielded good translations as estimated with the CometKIWI metric.
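
For planning a similar run, the cost can be approximated from token counts before any request is sent. The sketch below is a back-of-the-envelope estimate under stated assumptions: `estimate_cost_usd` is a hypothetical helper, and the per-token prices and the story and completion sizes in the example call are illustrative placeholders to be replaced with the provider's current rates and measured token counts. Only the 12,000 stories and the 4729-token demonstration block come from this notebook.

```python
def estimate_cost_usd(
    n_requests: int,
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> float:
    """Estimate the total cost of n_requests calls with fixed token counts per call."""
    prompt_cost = n_requests * prompt_tokens / 1000 * usd_per_1k_prompt
    completion_cost = n_requests * completion_tokens / 1000 * usd_per_1k_completion
    return prompt_cost + completion_cost


estimate = estimate_cost_usd(
    n_requests=12_000,
    prompt_tokens=4729 + 300,     # demonstrations + placeholder for instruction and story
    completion_tokens=300,        # placeholder translation length
    usd_per_1k_prompt=0.003,      # placeholder price, check current rates
    usd_per_1k_completion=0.004,  # placeholder price, check current rates
)
print(f"Estimated cost: ${estimate:.2f}")
```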