<a href="https://colab.research.google.com/github/PedroGFerreira/AdvancedTopicsMachineLeaning/blob/main/TAAC_nb1_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translating text from Portuguese to English
We will test two approaches to using an LLM for text translation:


1.   Using a pre-trained, open-source model
2.   Using an API to access a commercial LLM

For 1) we will use Hugging Face, which provides the Transformers library and access to many models. Transformers will be explored in more detail in a subsequent notebook.
The model will be [unicamp-dl/translation-pt-en-t5](https://huggingface.co/unicamp-dl/translation-pt-en-t5).

This is an implementation of T5 for PT-EN translation that uses a modest hardware setup.

**Note that models for the Portuguese language are relatively scarce.**

On Hugging Face there is at least one other PT-to-EN translator, from Unbabel. However, it is a much larger model, with 13B parameters, and it is trained to perform other tasks besides translation: [Unbabel/TowerInstruct-13B-v0.1](https://huggingface.co/Unbabel/TowerInstruct-13B-v0.1)


For 2) we will use ChatGPT, specifically the gpt-3.5-turbo model. To access it programmatically via an API you will need an API key. Register at the OpenAI website and generate an API key, then make it available in your scripting environment (the code below reads it from the `OPENAI_API_KEY` environment variable).





In [None]:
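# Install all the necessary packages.
# Note: each "!" command runs in its own subshell, so creating and activating a
# virtual environment from a notebook cell does not change the running kernel.
# When working outside a notebook, run these two commands in a terminal first:
#   python -m venv .env
#   source .env/bin/activate
# Install 🤗 Transformers and the OpenAI client with the following commands:
!pip install transformers
!pip install openai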

In [None]:
# Load the packages used in the remainder of the notebook
import os
import openai
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# 1. Pre-trained and Open-Source Translation Model

Here we use two approaches to accomplish the translation task. The more direct approach uses the *pipeline* functionality of the Transformers package.
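For the pipeline approach, the input is prefixed with a task instruction and the translation is read from the `generated_text` field of the first result. A minimal sketch of applying this to several sentences follows. The helper name `translate_all` and the stand-in `fake_translator` are hypothetical and only illustrate the call pattern; in practice the Transformers pipeline object would be passed as `translator`.

```python
def translate_all(texts, translator, prefix="translate Portuguese to English: "):
    """Translate each text by calling `translator` on the prefixed prompt.

    `translator` is assumed to behave like a text2text-generation pipeline:
    called with one string, it returns a list of dicts with a
    'generated_text' key.
    """
    results = []
    for text in texts:
        output = translator(prefix + text)
        results.append(output[0]["generated_text"])
    return results


def fake_translator(prompt):
    # Stand-in for a real pipeline so the sketch runs without downloading a model.
    return [{"generated_text": f"<{prompt}>"}]


sentences = ["Bom dia", "Obrigado"]
for translated in translate_all(sentences, fake_translator):
    print(translated)
```

Calling the translator once per sentence keeps the output shape identical to the single-sentence case used in this notebook.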

The second approach uses the `apply_chat_template` method, which formats messages for interactive, chat-style use. It is not the best fit here, because the translation model is designed for direct translation rather than chat. Nevertheless, the chat template is used for demonstration. See more in the [chat templating documentation](https://huggingface.co/docs/transformers/main/en/chat_templating).

In [141]:
# Initialize the translation pipeline using a model trained for Portuguese-to-English translation
pten_pipeline = pipeline('text2text-generation', model="unicamp-dl/translation-pt-en-t5")

# Portuguese text to translate
portuguese_text = "Bem vindo ao curso de Tópicos Avançados de Aprendizagem Automática"

# 1 - Using the Hugging Face pipeline function directly
text_to_translate = "translate Portuguese to English: %s" % portuguese_text
translated_text = pten_pipeline(text_to_translate)[0]['generated_text']
print("Translated Text:", translated_text)

# 2 - Chat template
# apply_chat_template is a method of Transformers tokenizers that formats a list of messages into a chat-like prompt.
# It is normally used with models designed for chat-based interactions.
# Using it with a translation model loaded through AutoModelForSeq2SeqLM is not standard, since these models are not usually designed for chat-style inputs.

tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/translation-pt-en-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("unicamp-dl/translation-pt-en-t5")

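# Apply the chat template (conceptual example).
# The T5 tokenizer used here defines no chat template of its own. The output shown below
# came from a Transformers release that fell back to a default ChatML-style template;
# recent releases raise an error instead unless a template is supplied.
formatted_input = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a translator. Translate the following text from Portuguese to English."},
        {"role": "user", "content": portuguese_text}
    ],
    return_tensors="pt"
)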

# Generate the translation
generated_tokens = model.generate(
    formatted_input,
    max_length=200,  # Adjust based on the expected length of the output
)

# Decode the generated tokens to get the translated text
translated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

# Print the translated text
# Note that the text contains the identifiers that are used as control tokens.
print("Translated Text:", translated_text)

Translated Text: <|im_end|> <|im_start|>user Welcome to the Course of Advanced Topics of Automatic Learning<|im_end|> <|im_start|>system You are a translator. Translate the following text from
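The chat-template output above still contains the ChatML control tokens and an echo of the system prompt. A minimal post-processing sketch, assuming the translation sits between `<|im_start|>user` and the next `<|im_end|>` as in the output above, is given below. The helper name `extract_user_segment` is hypothetical, and the pattern is tailored to this one output rather than to chat models in general.

```python
import re


def extract_user_segment(generated: str) -> str:
    """Return the text between '<|im_start|>user' and the next '<|im_end|>'.

    If that pattern is absent, fall back to removing the control tokens.
    """
    match = re.search(
        r"<\|im_start\|>user\s*(.*?)\s*<\|im_end\|>", generated, flags=re.DOTALL
    )
    if match:
        return match.group(1).strip()
    # Fallback: drop control tokens and collapse the leftover whitespace.
    cleaned = re.sub(r"<\|im_(?:start|end)\|>", " ", generated)
    return " ".join(cleaned.split())


raw_output = (
    "<|im_end|> <|im_start|>user Welcome to the Course of Advanced Topics of "
    "Automatic Learning<|im_end|> <|im_start|>system You are a translator. "
    "Translate the following text from"
)
print(extract_user_segment(raw_output))
```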


# 2. Using an API to Access a Commercial Model

Start by setting the API key. Since this notebook was run in Colab, my keys are stored in the "Secrets" section. Alternatively, the key can be hard-coded in the script as:

```
openai.api_key = "........"
```
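Reading the key from the environment avoids leaving it in the notebook text. A minimal sketch of a lookup that fails with a readable message when the key is missing is shown below. The helper name `load_api_key` and its `env` parameter are hypothetical; only the variable name `OPENAI_API_KEY` comes from this notebook.

```python
import os


def load_api_key(name="OPENAI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise a clear error."""
    # `env` defaults to the process environment; passing a dict makes testing easy.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Define it in your environment before running this cell."
        )
    return key


# Example with an explicit mapping instead of the real environment:
print(load_api_key(env={"OPENAI_API_KEY": "placeholder-key"}))
```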



In [37]:
# Set up your OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"]

# Define the Portuguese text to be translated
portuguese_text = "Bem vindo ao curso de Tópicos Avançados de Aprendizagem Automática"

# Use the OpenAI Chat Completions API to translate the text
# (openai>=1.0 replaced openai.ChatCompletion.create with openai.chat.completions.create)
response = openai.chat.completions.create(
    model="gpt-3.5-turbo",  # You can also use "gpt-4" if available
    messages=[
        {"role": "system", "content": "You are a helpful assistant that translates Portuguese to English."},
        {"role": "user", "content": f"Translate the following Portuguese text to English: {portuguese_text}"}
    ],
    temperature=0.0,  # Lower temperature for more deterministic output
)

# Extract the translated text (in openai>=1.0 the response is an object, not a dict)
translated_text = response.choices[0].message.content.strip()

# Print the translated text
print("Translated Text:", translated_text)


Translated Text: Welcome to the Advanced Topics in Machine Learning course.
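The `messages` list used above can be produced by a small function, so the same prompts work for other sentences or language pairs. This is a sketch only: the function name `build_translation_messages` and its parameters are hypothetical, and the prompt wording is copied from the cell above.

```python
def build_translation_messages(text, source="Portuguese", target="English"):
    """Build the system and user messages for a chat-based translation request."""
    return [
        {
            "role": "system",
            "content": f"You are a helpful assistant that translates {source} to {target}.",
        },
        {
            "role": "user",
            "content": f"Translate the following {source} text to {target}: {text}",
        },
    ]


messages = build_translation_messages("Bem vindo ao curso")
for message in messages:
    print(message["role"], "->", message["content"])
```

The returned list can be passed as the `messages` argument of the chat completion call in place of the inline literal.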
