# Generative AI Use Case: Machine Translation

Welcome to the practical side of my generative AI project. In this lab I perform a machine translation task with generative AI. I explore how the input text affects the model's output, and I use prompt engineering to steer the model towards the task I need. By comparing zero-shot, one-shot, and few-shot inference, I take a first step in prompt engineering and see how it can improve the output of Large Language Models.

# Table of Contents

- [ 1 - Set up Required Dependencies](#1)
- [ 2 - Machine translation without Prompt Engineering](#2)
- [ 3 - Machine translation with an Instruction Prompt](#3)
  - [ 3.1 - Zero Shot Inference with an Instruction Prompt](#3.1)
  - [ 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5](#3.2)
- [ 4 - Machine translation with One Shot and Few Shot Inference](#4)
  - [ 4.1 - One Shot Inference](#4.1)
  - [ 4.2 - Few Shot Inference](#4.2)
- [ 5 - Generative Configuration Parameters for Inference](#5)


<a name='1'></a>
## 1 - Set up Required Dependencies

In [1]:
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: Ignored the following yanked versions: 0.3.0a0
ERROR: Could not find a version that satisfies the requirement torchdata==0.5.1 (from versions: 0.3.0a1, 0.3.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.10.0, 0.10.1)
ERROR: No matching distribution found for torchdata==0.5.1
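
The log above shows that `torchdata==0.5.1` could not be installed, so it can help to confirm which package versions are actually present before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is missing from this environment
    return versions


# Packages that the install cell above tried to pin.
for name, version in installed_versions(
    ["datasets", "torch", "torchdata", "transformers"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A package reported as `NOT INSTALLED` here points back to a failed `%pip install` line rather than to a problem in the later cells.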


Import the classes used to load the dataset, the Large Language Model (LLM), the tokenizer, and the generation configuration. Do not worry if you do not yet understand all of these components; they are described and discussed later in the notebook.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig



<a name='2'></a>
## 2 - Machine translation without Prompt Engineering

In this use case, you will translate from Azerbaijani to English with the pre-trained Large Language Model (LLM) MarianMT from Hugging Face. The models available in the Hugging Face `transformers` package are listed in the [documentation](https://huggingface.co/docs/transformers/index).

Let's load some simple translations from the [English-To-Azerbaijani](https://huggingface.co/datasets/Zarifa/English-To-Azerbaijani) Hugging Face dataset. This dataset contains 5,000+ sentences with the corresponding manually labeled translations.

In [3]:
huggingface_dataset_name = "Zarifa/English-To-Azerbaijani"

dataset = load_dataset(huggingface_dataset_name)

Print a couple of sentences with their baseline translations.

In [4]:
example_indices = [40, 200]

dash_line = '-' * 99  # 99-character separator line

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['train'][index]['translation']['en'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['train'][index]['translation']['aze'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
It is time to go to school.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Məktəbə getmə vaxtı.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
The president was elected for four years.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Başçı dörd illiyinə seçildi.
-----------------------------------------------------------------------------

Load the [MarianMT](https://huggingface.co/Helsinki-NLP/opus-mt-az-en) model by creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method.

In [5]:
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-az-en")



To perform encoding and decoding, you need to work with text in a tokenized form. **Tokenization** is the process of splitting text into smaller units (tokens) that the LLM can process.
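
To illustrate the idea, here is a toy sketch of subword tokenization with a tiny, made-up vocabulary and greedy longest-match splitting. It is not the tokenizer that MarianMT actually uses; the vocabulary, the ids, and the `encode`/`decode` helpers are all hypothetical.

```python
# Toy vocabulary: maps each known piece of text to an integer id.
# Every entry here is invented for illustration.
VOCAB = {"<unk>": 0, "time": 1, "to": 2, "go": 3, "school": 4, "ing": 5, " ": 6, ".": 7}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match split of `text` into vocabulary ids."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate piece first, shrinking until one matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            ids.append(VOCAB["<unk>"])  # no piece matched this character
            pos += 1
    return ids


def decode(ids):
    """Map ids back to their text pieces and join them."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode("time to go to school.")
print(ids)
print(decode(ids))
```

A real tokenizer learns its vocabulary from data and handles unknown text more gracefully, but the round trip is the same: text to ids for the model, ids back to text for the reader.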

Download the tokenizer for the MarianMT model using the `AutoTokenizer.from_pretrained()` method. Its `use_fast` parameter controls whether a fast tokenizer is used when one is available; the call below leaves it at its default. At this stage there is no need to go into the details, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [6]:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-az-en")



Test the tokenizer by encoding and decoding a sentence from the dataset:

In [9]:
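# Use the last example sentence (index 200) from the dataset.
sentence = dataset['train'][example_indices[-1]]['translation']['aze']

# Encode the sentence to token ids, returned as a PyTorch tensor.
sentence_encoded = tokenizer(sentence, return_tensors='pt')

# Decode the ids back to text, dropping special tokens such as end-of-sequence.
sentence_decoded = tokenizer.decode(
    sentence_encoded["input_ids"][0],
    skip_special_tokens=True
)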

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([3119, 1294, 2509,  560, 5190,  871, 8239,    5,    0])

DECODED SENTENCE:
▁Başçı▁dörd▁illiyinə▁seçildi.


Now it's time to explore how well the base LLM translates a text without any prompt engineering. **Prompt engineering** is the act of a human changing the **prompt** (input) to improve the response for a given task.
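
To make the definition concrete, here is a minimal sketch of how a raw input could be wrapped into a zero-shot or one-shot prompt. The helper `make_prompt`, its instruction wording, and the template layout are assumptions for illustration. MarianMT is a dedicated translation model and is not expected to follow such instructions, so this only shows what "changing the prompt" means in code.

```python
def make_prompt(source_text, examples=(), instruction="Translate Azerbaijani to English."):
    """Build a prompt: instruction, optional worked examples, then the query.

    `examples` is a sequence of (source, translation) pairs. An empty
    sequence gives a zero-shot prompt, one pair gives a one-shot prompt,
    and several pairs give a few-shot prompt.
    """
    parts = [instruction]
    for src, tgt in examples:
        parts.append(f"Azerbaijani: {src}\nEnglish: {tgt}")
    # The final block leaves the translation empty for the model to complete.
    parts.append(f"Azerbaijani: {source_text}\nEnglish:")
    return "\n\n".join(parts)


zero_shot = make_prompt("Başçı dörd illiyinə seçildi.")
one_shot = make_prompt(
    "Başçı dörd illiyinə seçildi.",
    examples=[("Məktəbə getmə vaxtı.", "It is time to go to school.")],
)
print(zero_shot)
print("-" * 40)
print(one_shot)
```

The baseline cell that follows passes the raw sentence to the model with no such wrapping.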

In [11]:
for i, index in enumerate(example_indices):
    # Azerbaijani source sentence and its English reference translation.
    source_text = dataset['train'][index]['translation']['aze']
    reference = dataset['train'][index]['translation']['en']

    # Tokenize the source, generate a translation, and decode it back to text.
    inputs = tokenizer(source_text, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{source_text}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{reference}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Məktəbə getmə vaxtı.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
It is time to go to school.
---------------------------------------------------------------------------------------------------
MODEL GENERATION - WITHOUT PROMPT ENGINEERING:
Stay up at school.

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Başçı dörd illiyinə seçildi.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
The president was elected for four years.
---------------