# Midterm Task
## Encoder-Decoder-based transformer (language translation):
Please run and study notebook iwslt_ted_talk_midterm.ipynb

**Dataset:** IWSLT Ted-talk English-to-French Translation Dataset
This dataset includes parallel English-French sentences, which are commonly used for training translation models. It is a well-known benchmark dataset for machine translation tasks. We are using a tiny part of this dataset. Try systematically increasing it and observe the effects.

**Task:** Transfer-learn a model that translates English sentences into French using IWSLT. The professor has started pre-processing and formatting the IWSLT data for model training; please explain the existing steps and continue towards the full training and inference loop.

**For both projects:**

- Use tokenizer libraries like Hugging Face's `transformers` for consistency in tokenization.
- Split the datasets into training, validation, and test sets.
- For text generation, sequences need to be transformed into input-output pairs where the model predicts the next token based on prior tokens.
- For translation, align the source (English) and target (French) sentences for input-output mapping.
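
As a rough illustration of the last two points, the sketch below builds next-token input-output pairs from a token sequence and zips two parallel sentence lists into aligned translation records. The function names and the toy data are assumptions made for this example; the record layout mirrors the `translation` dictionaries with `en` and `fr` keys that the notebook reads later.

```python
def next_token_pairs(tokens):
    """Return (context, next_token) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def align_pairs(source_sentences, target_sentences):
    """Zip parallel sentences into {'translation': {'en': ..., 'fr': ...}} records."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must have the same number of sentences")
    return [
        {"translation": {"en": en, "fr": fr}}
        for en, fr in zip(source_sentences, target_sentences)
    ]


pairs = next_token_pairs(["the", "cat", "sat"])
records = align_pairs(["Hello"], ["Bonjour"])
print(pairs)
print(records)
```

A length mismatch between the two sentence lists raises an error instead of silently misaligning the pairs.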

# Encoder-Decoder-based transformer (language translation) Yanlai
## First Implementation
The code below runs the project's first implementation, the directory (script) version. The working environment is at tesla.rowan.edu:/home/yanlai/ADV_DL_MIDTERM, port 8000. **The other code blocks walk through the second implementation step by step, with explanations of the details.**

This implementation fine-tunes T5-small, an encoder-decoder transformer language model developed by Google Research, for **100 (script run) + 10 (notebook run) epochs**. An interactive inference demo script is located at /home/yanlai/ADV_DL_MIDTERM/interactive_inference_demo.py. If the Python environment is initialized properly, the demo runs with `python interactive_inference_demo.py` from the project's working directory, with no further modification. Entering any English string then starts the translation pipeline.

Training and model parameters(e.g. learning rate, resume path, dataset increment, mini-batch size) can be adjusted in configuration scripts in /home/yanlai/ADV_DL_MIDTERM/config/include and /home/yanlai/ADV_DL_MIDTERM/config/model folders. 
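
For orientation, the sketch below shows one possible layout of the `cfg` dictionary. Only the key names are taken from the notebook cells that read them; every value, including the `t5-small` checkpoint name, is an illustrative placeholder and not the setting used in the actual runs.

```python
# Hypothetical layout of `cfg`: key names come from the notebook cells,
# all values are placeholders chosen for illustration.
cfg = {
    "model": {
        "name": "t5-small",                 # checkpoint to fine-tune (assumed name)
        "resume": None,                     # path to a saved checkpoint, or None
        "do_eval": False,                   # True skips training and only evaluates
        "train_loop_type": "huggingface",   # the only loop type the notebook implements
        "CUDA_VISIBLE_DEVICES": "0",
    },
    "data": {
        "source_lang": "en",
        "target_lang": "fr",
        "train_type": "partial",            # "partial" (first 95%) or "full" (100%)
        "test_type": "test_95_100pct_ds",   # or "test_0_50pct_ds"
        "data_increment": 1,                # > 0 adds the 2014 and 2015 data
    },
    "train": {
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 10,
    },
}

print(sorted(cfg))
```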

In [1]:
from tqdm import tqdm
from config.config import cfg
from colorama import Fore, Style, init
from engine import trainer
from dataset.dataset import tokenized_books_eval,tokenized_books_test
from model.model import tokenizer
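if not cfg['model']['do_eval']:
    # fine-tune only when evaluation-only mode is off
    trainer.train()
# compute loss and BLEU on the test split
trainer.evaluate(eval_dataset=tokenized_books_test)
# generate predictions and metrics on the eval split
# (predict() prefixes its metric names with "test_" by default)
results = trainer.predict(tokenized_books_eval)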
'''
PredictionOutput(predictions=array([[    0,  1064,   285, ...,    87,   287, 19882],
    [    0,  3557,   210, ...,     0,     0,     0],
    [    0, 17129,  5545, ...,     3, 26375,   245],
    ...,
    [    0,  1955,   276, ...,     0,     0,     0],
    [    0,  3039,    73, ...,    15,    20,  2143],
    [    0,  9236,     9, ...,     0,     0,     0]]), 
    label_ids=array([[ 7227,   142,  8063, ...,    15,     5,     1],
    [ 3557,   210,     3, ...,  -100,  -100,  -100],
    [  622,     3, 29725, ...,  -100,  -100,  -100],
    ...,
    [ 1955,   276, 12220, ...,  -100,  -100,  -100],
    [ 3039,   197, 29068, ...,  -100,  -100,  -100],
    [  312, 26274,   146, ...,  -100,  -100,  -100]]), 
    metrics={'test_loss': 1.4194819927215576, 
    'test_bleu': 4.7773, 'test_gen_len': 17.4118, 
    'test_runtime': 3.1035, 'test_samples_per_second': 65.733, 
    'test_steps_per_second': 4.189})
'''

'''
BLEU (Bilingual Evaluation Understudy) is a metric ranging from 0 to 100 
for evaluating the quality of text which has been 
machine-translated from one language to another.
'''
#map the predictions to the actual words
decoded_preds = tokenizer.batch_decode(results.predictions, skip_special_tokens=True)
for i in range(5):
    print(Fore.GREEN + "English Input:")
    print(Fore.BLUE + tokenized_books_eval['translation'][i]['en'])
    print(Fore.GREEN + "French Prediction:")
    print(Fore.BLUE + decoded_preds[i])
    print(Fore.RED + "Actual French Translation:")
    print(Fore.RESET + tokenized_books_eval['translation'][i]['fr'])
    print("==========================================================================================================")
print(Fore.YELLOW + "Metrics:")
# predict() was run on the eval split; "test_" is only its default metric-name prefix
print('EVAL SET BLEU:',results.metrics['test_bleu'])

Map: 100%|██████████| 10214/10214 [00:02<00:00, 4982.18 examples/s]
Map: 100%|██████████| 204/204 [00:00<00:00, 6265.01 examples/s]
Map: 100%|██████████| 334/334 [00:00<00:00, 6507.06 examples/s]

Data Example 1:
language: en  Graphic designer Stefan Sagmei
language: fr  Le designer graphique Stefan S
Data Example 2:
language: en  Stefan Sagmeister: Happiness b
language: fr  Stefan Sagmeister parle du bon
Train Dataset Size:
10214
Test Dataset Size:
334
Eval Dataset Size:
204





Step,Training Loss
500,0.8698
1000,0.8651
1500,0.8638




English Input:
Graphic designer Stefan Sagmeister takes the audience on a whimsical journey through moments of his life that made him happy -- and notes how many of these moments have to do with good design.
French Prediction:
Le graphiste Stefan Sagmeister accompagne le public dans un voyage étrange
Actual French Translation:
Le designer graphique Stefan Sagmeister emmène le public dans un voyage fantasque à travers des moments de sa vie qui l'ont rendu heureux - et souligne que nombre de ces moments ont à voir avec le design de qualité.
English Input:
Stefan Sagmeister: Happiness by design
French Prediction:
Stefan Sagmeister : Le bonheur par le design
Actual French Translation:
Stefan Sagmeister parle du bonheur dans le design.
English Input:
Oxford philosopher and transhumanist Nick Bostrom examines the future of humankind and asks whether we might alter the fundamental nature of humanity to solve our most intrinsic problems.
French Prediction:
Nick Bostrom, philosophe et transhuma

## Second Implementation
### IWSLT (International Workshop on Spoken Language Translation) dataset preprocessing---train, test and val sets split
The cell below loads and splits the English-French data. It uses the Hugging Face `datasets` library to load the IWSLT TED talks data for the years 2014 to 2016 and builds the splits according to the configuration (`cfg`). From the 2016 data, training uses either the first 95% (`partial`) or all 100% (`full`), and testing uses either the first 50% (`test_0_50pct_ds`) or the last 5% (`test_95_100pct_ds`). Note that the 0-50% test slice overlaps the training slice, so the test set is only held out when `partial` training is combined with the 95-100% slice. When `data_increment` is greater than 0, the first 95% of the 2014 and 2015 data is concatenated onto the training set. The evaluation set combines the last 5% of the 2014 and 2015 data.

### IWSLT (International Workshop on Spoken Language Translation) dataset preprocessing---dataset size effect


In [2]:
import torch
import datasets
from datasets import load_dataset
from model.model import tokenizer
from colorama import Fore, Style, init
from config.config import cfg
if cfg['data']['train_type'] == 'partial':
    ri = (datasets.ReadInstruction('train', to=95, unit='%'))
    train_dataset = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2016", split=ri)
elif cfg['data']['train_type'] == 'full':
    ri = (datasets.ReadInstruction('train', to=100, unit='%'))
    train_dataset = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2016", split=ri)
if cfg['data']['test_type'] == 'test_0_50pct_ds':
    ri = (datasets.ReadInstruction('train', to=50, unit='%'))
    test_dataset = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2016", split=ri)
elif cfg['data']['test_type'] == 'test_95_100pct_ds':
    ri = (datasets.ReadInstruction('train', from_=95, unit='%'))
    test_dataset = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2016", split=ri)
else:
    raise NotImplementedError
if cfg['data']['data_increment'] > 0:
    ri = (datasets.ReadInstruction('train', to=95, unit='%'))
    increment_dataset1 = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2014",split=ri)
    increment_dataset2 = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2015",split=ri)
    train_dataset = datasets.concatenate_datasets([train_dataset, increment_dataset1, increment_dataset2])
ri = (datasets.ReadInstruction('train', from_=95, unit='%'))
eval_dataset1 = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2014", split=ri)
eval_dataset2 = load_dataset("IWSLT/ted_talks_iwslt", language_pair=("en", "fr"), year="2015", split=ri)
eval_dataset = datasets.concatenate_datasets([eval_dataset1, eval_dataset2])
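
The percentage slices used above can be checked for overlap without downloading anything. The sketch below is a simplified model of percent slicing: the helper names, the rounding rule, and the example count `n` are assumptions of this example, not the `datasets` library's exact behavior.

```python
def percent_slice(n, start_pct=0, end_pct=100):
    """Index range covering [start_pct, end_pct) percent of n examples.

    Boundaries are rounded to the nearest index, an assumption of this sketch.
    """
    return range(round(n * start_pct / 100), round(n * end_pct / 100))


def overlap(a, b):
    """Number of example indices shared by two ranges."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start))


n = 1000  # hypothetical number of examples in one year's `train` split
train_partial = percent_slice(n, 0, 95)   # train_type == 'partial'
test_tail = percent_slice(n, 95, 100)     # test_type == 'test_95_100pct_ds'
test_head = percent_slice(n, 0, 50)       # test_type == 'test_0_50pct_ds'

print(overlap(train_partial, test_tail))  # 0: the slices are disjoint
print(overlap(train_partial, test_head))  # 500: every test example is also a training example
```

Under these assumptions, only the 95-100% test slice is disjoint from the partial training slice, so scores on the 0-50% slice measure fit to training data and not generalization.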

### IWSLT (International Workshop on Spoken Language Translation) dataset preprocessing--- Data Formatting, Tokenization, Metrics and Mapping
The preprocessing function (`preprocess_function`) tokenizes the source and target texts, truncating each to at most 128 tokens. `postprocess_text` strips leading and trailing whitespace from the decoded predictions and labels, and wraps each label in a list because sacreBLEU accepts several references per prediction. `compute_metrics` calculates the BLEU score and the average length of the generated sequences. The data collator pads inputs and labels dynamically to the longest sequence in each batch.
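
To make dynamic padding concrete, here is a minimal pure-Python sketch of what a seq2seq collator does for one batch. It is not the `DataCollatorForSeq2Seq` implementation; the function name and the toy token IDs are made up for illustration. The pad ID 0 matches T5's padding token, and -100 is the label value the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence to the longest one in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # labels are padded with -100 so padding does not count towards the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [13, 7, 1], "labels": [21, 1]},
    {"input_ids": [5, 1], "labels": [8, 9, 4, 1]},
]
batch = pad_batch(features)
print(batch)
```

Because the padded length is chosen per batch, a batch of short sentences costs less compute than it would if every example were padded to the 128-token maximum.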


In [3]:
import numpy as np
import evaluate
from transformers import DataCollatorForSeq2Seq
def preprocess_function(examples):
    #inputs = [prefix + example[source_lang] for example in examples["translation"]]
    inputs = [example[cfg['data']['source_lang']] for example in examples["translation"]]
    targets = [example[cfg['data']['target_lang']] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) #remove special tokens

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) #replace -100 with padding token
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) #remove whitespaces

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)#compute BLEU score
    result = {"bleu": result["score"]}#store the BLEU score

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]#compute the average length of the generated text
    result["gen_len"] = np.mean(prediction_lens)#store the average length of the generated text
    result = {k: round(v, 4) for k, v in result.items()}#round the result
    return result

metric = evaluate.load("sacrebleu")
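# NOTE: `model` is meant to receive the model instance; given a name string, the
# collator skips building decoder_input_ids and the model derives them from labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=cfg['model']['name'])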
#tokenize the datasets
tokenized_books = train_dataset.map(preprocess_function, batched=True)
tokenized_books_test = test_dataset.map(preprocess_function, batched=True)
tokenized_books_eval = eval_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 10214/10214 [00:02<00:00, 4999.83 examples/s]
Map: 100%|██████████| 204/204 [00:00<00:00, 6358.64 examples/s]
Map: 100%|██████████| 334/334 [00:00<00:00, 6576.21 examples/s]
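
For intuition about the BLEU number that `compute_metrics` reports, the sketch below computes a simplified single-sentence BLEU. It is an assumption-laden stand-in and not sacreBLEU: it splits on whitespace, uses one reference, applies no smoothing, and scores one sentence, whereas sacreBLEU applies its own tokenization and aggregates over the whole corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Unsmoothed single-sentence BLEU on whitespace tokens, scaled to 0-100."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # clip each predicted n-gram count by its count in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))
    # the brevity penalty punishes predictions shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


exact = simple_bleu("a b c d e", "a b c d e")
truncated = simple_bleu("a b c d", "a b c d e f g h")
print(exact, truncated)
```

The second call shows the brevity penalty at work: a prediction that stops early is penalized even when every token it contains is right.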


### Model Setup
In this section, the model and tokenizer are set up with the Hugging Face Transformers library, which prepares the model for training and evaluation. The cell below imports and initializes both. When resuming from other weights, the model loaded with `transformers.AutoModelForSeq2SeqLM` in the training cell replaces the one loaded here.

In [4]:
#Model Import
from transformers import T5ForConditionalGeneration, T5Tokenizer
from config.config import cfg
model = T5ForConditionalGeneration.from_pretrained(cfg['model']['name'])
tokenizer = T5Tokenizer.from_pretrained(cfg['model']['name'])

### Train and Evaluation Loop---Hugging Face Version Implementation
In this section, the training loop is implemented with the Hugging Face `Seq2SeqTrainer`. I start by setting up the training parameters, including the learning rate, batch sizes, and number of epochs, in `Seq2SeqTrainingArguments`. The trainer then builds its own data loaders from the tokenized datasets and uses the data collator to batch and pad them, so no `DataLoader` is created by hand in this cell.

In [5]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import AutoModelForSeq2SeqLM
from config.config import cfg
#SET CUDA AVAILABLE DEVICES
import tqdm
import os
os.environ["CUDA_VISIBLE_DEVICES"] = cfg['model']['CUDA_VISIBLE_DEVICES']
# inputs,targets=[],[]
# for _, examples in tqdm(zip(range(200), iter(test_dataset)), total=200):
#     inputs.append(examples["translation"]['en'])
#     targets.append(examples["translation"]['fr'])
# model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
# print('model_inputs:')
# print(model_inputs.keys())
'''
dict_keys(['input_ids', 'attention_mask', 'labels'])
'input_ids': [[101, 278...]] These IDs represent the tokenized form of the input text.
'attention_mask': [[1, 1...]] The attention mask is used to indicate which tokens should be attended to (1) and which should be ignored (0).
'labels': [[22833,3,...]] For supervised learning tasks.
'''
if cfg['model']['train_loop_type'] == "huggingface":
    training_args = Seq2SeqTrainingArguments(
        output_dir="ckpt",
        learning_rate=cfg['train']['learning_rate'],
        per_device_train_batch_size=cfg['train']['per_device_train_batch_size'],
        per_device_eval_batch_size=cfg['train']['per_device_eval_batch_size'],
        weight_decay=0.001,
        save_total_limit=3,
        num_train_epochs=cfg['train']['num_train_epochs'],
        predict_with_generate=True,
        fp16=True, #change to bf16=True for XPU
        push_to_hub=False,
        do_eval=cfg['model']['do_eval'],
    )
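    # resume from a checkpoint if one is configured; otherwise keep the model loaded above
    # (passing model=None would make the trainer fail at construction)
    if cfg['model']['resume'] is not None:
        model = AutoModelForSeq2SeqLM.from_pretrained(cfg['model']['resume'])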
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_books,
        eval_dataset=tokenized_books_eval,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
else:
    #add text of to be implemented on Error
    raise NotImplementedError("Only huggingface train loop is implemented for now. I'll implement the pytorch train loops for the final project. :)")



### Train and Evaluation Loop---Train, Test, and Evaluation on corresponding split set.
In this section, training, testing, and evaluation are run on the corresponding splits. Because training was already done in the first implementation, the cell below forces `do_eval` to `True` and only computes the metrics.

In [18]:
cfg['model']['do_eval'] = True
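if not cfg['model']['do_eval']:
    # fine-tune only when evaluation-only mode is off
    trainer.train()
# compute loss and BLEU on the test split
trainer.evaluate(eval_dataset=tokenized_books_test)
# generate predictions and metrics on the eval split
# (predict() prefixes its metric names with "test_" by default)
results = trainer.predict(tokenized_books_eval)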
print(results)

PredictionOutput(predictions=array([[    0,   312,  8373, ...,   154,    17,  5517],
       [    0, 14189,  1138, ...,     0,     0,     0],
       [    0,  7486,  1491, ...,    31,   667,   226],
       ...,
       [    0,  1955,   276, ...,     0,     0,     0],
       [    0,  3039,    73, ...,    50, 11403,   342],
       [    0,  9236,     9, ...,     0,     0,     0]]), label_ids=array([[  312,  4378,  8373, ...,  -100,  -100,  -100],
       [14189,  1138,   122, ...,  -100,  -100,  -100],
       [  312,     3, 17704, ...,  -100,  -100,  -100],
       ...,
       [ 1955,   276, 12220, ...,  -100,  -100,  -100],
       [ 3039,   197, 29068, ...,  -100,  -100,  -100],
       [  312, 26274,   146, ...,  -100,  -100,  -100]]), metrics={'test_loss': 1.137765884399414, 'test_bleu': 6.9549, 'test_gen_len': 17.6647, 'test_runtime': 8.8755, 'test_samples_per_second': 37.632, 'test_steps_per_second': 2.366})


### Train and Evaluation Loop---Pytorch Implementation
The PyTorch cell in this section sets up and runs a training, validation, and testing loop. Key components are the training configuration, the model, the optimizer (AdamW), and data loaders for the training, validation, and test sets. Within each epoch the model is trained batch by batch; the loss is returned by the model's forward pass and backpropagated to update the weights. Validation and test loss are then computed with gradient computation disabled.

In particular, the loss is the negative log-likelihood of the target sequence, computed as token-level cross-entropy, as is standard for sequence-to-sequence models in the Hugging Face `transformers` library. It quantifies how well the predicted token distributions match the actual target tokens. Each training batch passes through the model, which computes the cross-entropy between its predicted distributions and the actual token sequence. Lower loss indicates that predictions align better with the ground truth, that is, that the model is getting better at translating English into French.
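
To make the loss concrete, the sketch below computes token-level cross-entropy for one toy target sequence in plain Python. The function name, the three-token vocabulary, and the logits are invented for illustration; the only convention borrowed from the libraries is that label positions set to -100 are skipped, which is how padding is kept out of the loss.

```python
import math


def sequence_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over target positions, skipping ignore_index."""
    total, count = 0.0, 0
    for step_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding position, contributes nothing to the loss
        # log of the softmax normalizer, with the max subtracted for numerical stability
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[label]  # -log p(label)
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of 3 tokens, target length 3, last position is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
labels = [0, 2, -100]
loss = sequence_cross_entropy(logits, labels)
print(round(loss, 4))
```

The first position is predicted with some confidence and costs little, the second is a uniform guess and costs `log(3)`, and the padded third position is ignored however confident the logits are.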

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; use PyTorch's
import time
# Setup training parameters
learning_rate = cfg['train']['learning_rate']
batch_size = cfg['train']['per_device_train_batch_size']
num_epochs = cfg['train']['num_train_epochs']
devices = "cpu"
model = model.to(devices)
# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(eval_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
#tqdm
from tqdm import tqdm
train_dataloader = tqdm(train_dataloader)
val_dataloader = tqdm(val_dataloader)
test_dataloader = tqdm(test_dataloader)
# Define optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)
start_time = time.time()
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        # stop after 30 seconds: this cell only demonstrates the PyTorch training loop
        if time.time() - start_time > 30:
            exit()
        inputs = tokenizer(batch['translation']['en'], return_tensors='pt', padding=True, truncation=True)
        labels = tokenizer(batch['translation']['fr'], return_tensors='pt', padding=True, truncation=True).input_ids
        labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        #print loss via tqdm
        train_dataloader.set_description(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")
        total_train_loss += loss.item()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {avg_train_loss}")

    # Evaluation on validation set
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            inputs = tokenizer(batch['translation']['en'], return_tensors='pt', padding=True, truncation=True)
            labels = tokenizer(batch['translation']['fr'], return_tensors='pt', padding=True, truncation=True).input_ids
            labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            val_dataloader.set_description(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {loss.item()}")
            total_val_loss += loss.item()
    
    avg_val_loss = total_val_loss / len(val_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {avg_val_loss}")

# Evaluation on test set
model.eval()
total_test_loss = 0
with torch.no_grad():
    for batch in test_dataloader:
        inputs = tokenizer(batch['translation']['en'], return_tensors='pt', padding=True, truncation=True)
        labels = tokenizer(batch['translation']['fr'], return_tensors='pt', padding=True, truncation=True).input_ids
        labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss  # read this batch's loss before reporting it
        test_dataloader.set_description(f"Test Loss: {loss.item()}")
        total_test_loss += loss.item()

avg_test_loss = total_test_loss / len(test_dataloader)
print(f"Test Loss: {avg_test_loss}")

  0%|          | 0/4 [00:59<?, ?it/s]
  0%|          | 0/160 [00:00<?, ?it/s]


  0%|          | 0/6 [00:00<?, ?it/s]

Epoch 1/10, Loss: 9.733063697814941:  12%|█▏        | 19/160 [02:07<15:46,  6.71s/it] 


KeyboardInterrupt: 


### Inference Pipeline
The inference cell in this section sets up an interactive English-to-French translation system around the sequence-to-sequence model previously fine-tuned from T5 with Hugging Face's `transformers` library. The model is loaded from the checkpoint directory given by `cfg['model']['resume']`, and the tokenizer is imported from `model.model`. `CUDA_VISIBLE_DEVICES` is set to expose GPU 0, although the model is not moved to the GPU in this cell, so generation runs on the CPU as written. In an infinite loop, the program prompts the user for an English sentence, tokenizes it, and passes it through the model to generate a French translation. The output is decoded and displayed in color using the `colorama` library: the input in green and the model's translation in yellow, framed by separator lines for readability. The loop continues until the user types 'exit'.
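
The control flow of that loop can be sketched independently of the model. In the example below, `run_translation_loop` and its `translate`, `read`, and `write` parameters are hypothetical names introduced here, and the dictionary lookup is only a stand-in for the tokenize, generate, and decode steps. Injecting the input and output functions lets the loop be exercised without a terminal.

```python
def run_translation_loop(translate, read=input, write=print):
    """Prompt until the user types 'exit'; return how many sentences were translated."""
    handled = 0
    while True:
        text = read("Enter a sentence in English (or type 'exit' to quit): ")
        if text.strip().lower() == "exit":
            break
        write("English Input:", text)
        write("French Prediction:", translate(text))
        handled += 1
    return handled


# Stand-in for tokenizer + model.generate + decode; purely illustrative.
toy_dictionary = {"Hello": "Bonjour"}
scripted_inputs = iter(["Hello", "exit"])
log = []
count = run_translation_loop(
    translate=lambda s: toy_dictionary.get(s, s),
    read=lambda prompt: next(scripted_inputs),
    write=lambda *parts: log.append(" ".join(parts)),
)
print(log)
```

In interactive use the defaults `input` and `print` apply, and `translate` would wrap the real model call.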

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from colorama import Fore
import os
import torch
from config.config import cfg
from model.model import tokenizer
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Load the fine-tuned model from the checkpoint directory (the tokenizer comes from model.model)
model_checkpoint = cfg['model']['resume']
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
while True:
    inputs = input("Enter a sentence in English (or type 'exit' to quit): ")
    if inputs.lower() == 'exit':
        break
    model_inputs = tokenizer(inputs, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**model_inputs)
    decoded_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print("==========================================================================================================")
    print(Fore.GREEN + "English Input:", inputs)
    print(Fore.YELLOW + "French Prediction:", decoded_preds[0])
    print(Fore.WHITE+"==========================================================================================================")



English Input: Hi
French Prediction: Hi
English Input: Hello
French Prediction: Bonjour
English Input: How are you?
French Prediction: Comment êtes-vous?
English Input: Fine!
French Prediction: Bonne!
English Input: This is a translation demonstration
French Prediction: Il s'agit d'une démonstration de traduction.
