# Translate text to other languages

Take the text chunks and questions created in the previous notebook and translate them into another language using the OpenAI API.

Define the target language you want to translate to.

In [1]:
target_language = "arabic"

## OpenAI

### Create the system prompt

Create a system prompt to teach the LLM how to perform the translation task.

In [2]:
system_prompt = f"""
## Task description

You are a world-class translator. Your task is to translate a given text into a specified target language. Don't do a literal word-for-word translation; instead, ensure that the translated text captures the original meaning, tone, and context.

### Important instructions:

- Don't add or remove any information from the original text.
- Maintain the original tone and style of the text.
- Return only the translated text without any additional commentary or explanations.

Here is the text, please translate it into {target_language}:
"""
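Because `system_prompt` is an f-string, `{target_language}` is substituted once, when the cell runs. Changing `target_language` later leaves the prompt unchanged until the cell is re-executed. One way to make that dependency explicit is to build the prompt in a function. The sketch below is illustrative only: `build_system_prompt` is a hypothetical helper, and the prompt body is abbreviated.

```python
def build_system_prompt(target_language: str) -> str:
    """Return a translation prompt for the given language (hypothetical helper)."""
    # Abbreviated version of the notebook's prompt, for illustration only.
    return (
        "You are a world-class translator.\n"
        "Return only the translated text.\n\n"
        f"Here is the text, please translate it into {target_language}:\n"
    )


arabic_prompt = build_system_prompt("arabic")
french_prompt = build_system_prompt("french")
print(french_prompt)
```

With this shape, translating into a second language means calling the function again rather than re-running the prompt cell.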

### Create a Pydantic object

We use a Pydantic model to define the structure of the output we want from the LLM. Passing it as `text_format` constrains the response to that structure, so it can be parsed directly into a `TranslatedText` instance.

In [3]:
from pydantic import BaseModel


class TranslatedText(BaseModel):
    text: str
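To see what the model enforces, the sketch below validates JSON payloads against `TranslatedText` without calling the API. It assumes Pydantic v2 (where `model_validate_json` is available), and the payloads are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class TranslatedText(BaseModel):
    text: str


# A well-formed payload parses into a typed object.
parsed = TranslatedText.model_validate_json('{"text": "مرحبا بالعالم"}')
print(parsed.text)

# A payload missing the required `text` field is rejected.
try:
    TranslatedText.model_validate_json('{"translation": "مرحبا"}')
    rejected = False
except ValidationError:
    rejected = True

print(rejected)
```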

### Load the data

In [4]:
import json

prefix = "imad_saddik_end_of_studies_internship_report_chunks"
text_chunk_file_path = f"../data/chunks/{prefix}.json"
questions_file_path = f"../data/question_answer_pairs/{prefix}_qa_pairs.json"

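# Read as UTF-8 explicitly so the result does not depend on the platform default.
with open(text_chunk_file_path, "r", encoding="utf-8") as f:
    text_chunks = json.load(f)

with open(questions_file_path, "r", encoding="utf-8") as f:
    questions = json.load(f)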

print(f"Loaded {len(text_chunks)} text chunks.")
print(f"Loaded {len(questions)} questions.")

Loaded 42 text chunks.
Loaded 494 questions.
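The translation loops below read `chunk["text_chunk"]` and `question_object["question"]`, so a quick check that every record carries those keys can catch a malformed file before any API calls are made. This is a minimal sketch on toy data; `find_missing_key` is a hypothetical helper, not part of the notebook.

```python
def find_missing_key(records: list[dict], key: str) -> list[int]:
    """Return the indices of records that lack `key` or hold a non-string value."""
    return [
        index
        for index, record in enumerate(records)
        if not isinstance(record.get(key), str)
    ]


# Toy stand-ins for the loaded JSON; the real files may hold more fields.
sample_chunks = [{"text_chunk": "First chunk."}, {"text": "Wrong key."}]
sample_questions = [{"question": "What is the report about?"}]

bad_chunks = find_missing_key(sample_chunks, "text_chunk")
bad_questions = find_missing_key(sample_questions, "question")
print(bad_chunks, bad_questions)  # [1] []
```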


### Use OpenAI

In [5]:
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

The translated chunks and questions will be saved in the same directories as before, with a suffix indicating the target language.

For example, if the target language is Arabic, the files will be saved as `{prefix}_arabic.json` and `{prefix}_qa_pairs_arabic.json`.
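The naming rule can be sketched with `pathlib`. The helper `with_language_suffix` and the file name `report_chunks.json` are placeholders for illustration; the notebook itself builds the paths with f-strings.

```python
from pathlib import Path


def with_language_suffix(path: str, language: str) -> Path:
    """Insert `_<language>` before the file extension (hypothetical helper)."""
    original = Path(path)
    return original.with_name(f"{original.stem}_{language}{original.suffix}")


source = "../data/chunks/report_chunks.json"  # placeholder file name
translated = with_language_suffix(source, "arabic")
print(translated.as_posix())  # ../data/chunks/report_chunks_arabic.json
```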

In [6]:
import time

from tqdm import tqdm
from openai import APIConnectionError, RateLimitError, InternalServerError

MAX_RETRIES = 5
INITIAL_DELAY_SECONDS = 2

for chunk in tqdm(text_chunks, total=len(text_chunks)):
    text = chunk["text_chunk"]

    delay = INITIAL_DELAY_SECONDS
    for attempt in range(MAX_RETRIES):
        try:
            response = client.responses.parse(
                model="gpt-4.1",
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                text_format=TranslatedText,
            )

            translated_text: TranslatedText = response.output_parsed  # type: ignore
            chunk["text_chunk"] = translated_text.text
            break

        except (APIConnectionError, RateLimitError, InternalServerError) as e:
            if attempt == MAX_RETRIES - 1:
                print(f"FINAL ATTEMPT FAILED for a chunk. Error: {e}")
                break

            print(
                f"Transient API error ({type(e).__name__}). Retrying in {delay} seconds... (Attempt {attempt + 1}/{MAX_RETRIES})"
            )

            time.sleep(delay)
            delay *= 2

        except Exception as e:
            print(f"An unexpected error occurred with a chunk: {e}")
            break

100%|██████████| 42/42 [16:16<00:00, 23.25s/it]
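The retry logic above is exponential backoff: wait, double the delay, try again, and give up after a fixed number of attempts. Since the same loop is repeated for the questions, it could be factored into a helper. The sketch below is one possible shape, not the notebook's code: `retry_with_backoff` is a hypothetical name, and a fake function stands in for the API call so the example runs offline.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[Exception], ...],
    max_retries: int = 5,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, doubling the wait after each retryable failure."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
            delay *= 2
    raise RuntimeError("max_retries must be at least 1")


# Demonstration with a fake call that fails twice, then succeeds.
attempts = {"count": 0}
waits: list[float] = []


def flaky_translate() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "translated text"


# Record the requested delays instead of actually sleeping.
result = retry_with_backoff(flaky_translate, (ConnectionError,), sleep=waits.append)
print(result, waits)  # translated text [2.0, 4.0]
```

In the notebook, `retry_on` would be `(APIConnectionError, RateLimitError, InternalServerError)` and `call` a small function wrapping `client.responses.parse`. Unlike the loops here, this helper re-raises after the last attempt, so a failed item cannot be saved untranslated without the caller noticing.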


In [7]:
for question_object in tqdm(questions, total=len(questions)):
    question_text = question_object["question"]

    delay = INITIAL_DELAY_SECONDS
    for attempt in range(MAX_RETRIES):
        try:
            response = client.responses.parse(
                model="gpt-4.1",
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question_text},
                ],
                text_format=TranslatedText,
            )

            translated_text: TranslatedText = response.output_parsed  # type: ignore
            question_object["question"] = translated_text.text
            break

        except (APIConnectionError, RateLimitError, InternalServerError) as e:
            if attempt == MAX_RETRIES - 1:
                print(f"FINAL ATTEMPT FAILED for a question. Error: {e}")
                break

            print(
                f"Transient API error ({type(e).__name__}). Retrying in {delay} seconds... (Attempt {attempt + 1}/{MAX_RETRIES})"
            )

            time.sleep(delay)
            delay *= 2

        except Exception as e:
            print(f"An unexpected error occurred with a question: {e}")
            break

100%|██████████| 494/494 [12:20<00:00,  1.50s/it]


### Save the translated data

In [16]:
translated_text_chunk_file_path = f"../data/chunks/{prefix}_{target_language}.json"
translated_questions_file_path = (
    f"../data/question_answer_pairs/{prefix}_qa_pairs_{target_language}.json"
)

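# ensure_ascii=False keeps non-ASCII text (e.g. Arabic) readable in the file
# instead of \uXXXX escapes; it requires writing the file as UTF-8.
with open(translated_text_chunk_file_path, "w", encoding="utf-8") as f:
    json.dump(text_chunks, f, indent=2, ensure_ascii=False)

with open(translated_questions_file_path, "w", encoding="utf-8") as f:
    json.dump(questions, f, indent=2, ensure_ascii=False)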
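By default, `json.dump` escapes non-ASCII characters, so Arabic text lands in the file as `\uXXXX` sequences. The data still loads correctly, but the file is hard to inspect by eye. The sketch below compares the two behaviours on a made-up record and writes to a temporary directory, so it does not touch the project's data files.

```python
import json
import tempfile
from pathlib import Path

records = [{"question": "ما هو موضوع التقرير؟"}]  # made-up sample

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "sample_arabic.json"

    # ensure_ascii=False writes the characters as-is; the file must be UTF-8.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    raw = out_path.read_text(encoding="utf-8")
    with open(out_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

# Default behaviour for comparison: every Arabic character is escaped.
escaped = json.dumps(records)

print(reloaded == records)  # the round trip preserves the data
print("\\u" in raw, "\\u" in escaped)
```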