<a href="https://colab.research.google.com/github/RosaMeyer/2023-lectures/blob/main/Week4_QA_teluguContext_to_telugu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QA Generation using Telugu questions and English contexts to generate the Telugu answer

Use the subset answer_inlang of the questions in Telugu to train (or fine-tune) a model to receive the Telugu question and English context as input and generate the Telugu answer.

Use answer_inlang for labels and English answers.
Only use answer_inlang for answerable questions that don't have an English answer or context.

## Imports

In [None]:
from utils import *

# !pip install evaluate
# %pip install sacrebleu

import os
import numpy as np
import torch
import random
from datasets import Dataset
import evaluate
import sacrebleu

from transformers import (
    MT5Tokenizer,
    MT5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)

## Get, filter and prep data

In [None]:
training_data = get_training_data()
validation_data = get_validation_data()

# Filtering for Telugu only
te_train = training_data[training_data['lang'] == 'te']
te_val = validation_data[validation_data['lang'] == 'te']

In [None]:
# TODO: copies not really necessary
te_train = te_train.copy()
te_val = te_val.copy()

# The answer_inlang field is used as the target/label text only when available,
# meaning we only keep examples with non-empty Telugu answers
te_train = te_train[te_train['answer_inlang'].notna() & (te_train['answer_inlang'].str.strip() != '')]
te_val = te_val[te_val['answer_inlang'].notna() & (te_val['answer_inlang'].str.strip() != '')]
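The filter above combines two conditions because a missing value and an empty string behave differently in pandas. A minimal sketch on a toy DataFrame (the rows are made up for illustration, not taken from the dataset) shows why both checks are needed:

```python
import pandas as pd

# Toy rows (hypothetical): one real answer, one whitespace-only, one missing, one empty
toy = pd.DataFrame({
    'question': ['q1', 'q2', 'q3', 'q4'],
    'answer_inlang': ['నైజీరియా', '   ', None, ''],
})

# notna() drops missing values; the strip() comparison drops blank strings.
# A missing value compared with '' does not evaluate to False on its own,
# so the notna() check cannot be omitted.
mask = toy['answer_inlang'].notna() & (toy['answer_inlang'].str.strip() != '')
kept = toy[mask]

print(kept)
```

Only the first row survives, which is the behaviour the notebook relies on when it keeps examples with Telugu answers.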

## Model setup + load

We will be fine-tuning Google's mT5-small model to generate Telugu answers from a Telugu question and an English context as input.

In [None]:
MODEL_NAME = 'google/mt5-small' # Change to mt5-base or mt5-large if you have the resources
OUTPUT_DIR = './mt5_telugu_openqa'
MAX_SOURCE_LENGTH = 512
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4 # 8
NUM_EPOCHS = 50 # 75
LEARNING_RATE = 3e-4
SEED = 42

In [None]:
def set_seed(seed: int = SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed()

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = MT5Tokenizer.from_pretrained(MODEL_NAME) # TODO: Could use AutoTokenizer
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'T5Tokenizer'. 
The class this function is called from is 'MT5Tokenizer'.


## Defining preprocessing pipeline for our data

In [None]:
def make_input(batch):
    q = batch.get('question', '')
    ctx = batch.get('context', '')
    return f'Question (Telugu): {q}\nContext (English): {ctx}'

def add_input_target(batch):
    batch['input_text'] = make_input(batch)
    batch['target_text'] = batch.get('answer_inlang', '')
    return batch

def preprocess_fn(batches):
    inputs = batches['input_text']
    targets = batches['target_text']

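    # Use the text_target parameter for target tokenization.
    # Note: these limits are hard-coded and differ from MAX_SOURCE_LENGTH / MAX_TARGET_LENGTH above.
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=64, truncation=True)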

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

In [None]:
# Convert the pandas DataFrames into Hugging Face Datasets
train_data = Dataset.from_pandas(te_train)
val_data = Dataset.from_pandas(te_val)

train_data = train_data.map(add_input_target)
val_data = val_data.map(add_input_target)

train_proc = train_data.map(
    preprocess_fn,
    batched=True,
    remove_columns=train_data.column_names
)

val_proc = val_data.map(
    preprocess_fn,
    batched=True,
    remove_columns=val_data.column_names
)

# Drop examples whose tokenized labels are empty
train_proc = train_proc.filter(lambda x: len(x['labels']) > 0)
val_proc   = val_proc.filter(lambda x: len(x['labels']) > 0)

# TODO: decoder_input_ids like Aziz?

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100, # default
    pad_to_multiple_of=8,
    padding=True,
    # max_length=MAX_SOURCE_LENGTH
)
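To make the collator settings above more concrete, here is a rough pure-Python sketch of what dynamic padding with `pad_to_multiple_of=8` and `label_pad_token_id=-100` amounts to. It is only an illustration of the idea, not the library's implementation; the helper `pad_to_multiple` and the token IDs are hypothetical, and the pad ID of 0 is an assumption that matches T5-style tokenizers.

```python
def pad_to_multiple(seqs, pad_value, multiple=8):
    """Pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(s) for s in seqs)
    target = -(-longest // multiple) * multiple  # ceiling to the next multiple
    return [s + [pad_value] * (target - len(s)) for s in seqs]

# Hypothetical token IDs for two examples (not real tokenizer output)
input_ids = [[259, 1024, 77, 1], [259, 31, 1]]
labels = [[4500, 1], [812, 93, 77, 1]]

PAD_ID = 0            # assumed pad token ID
LABEL_PAD_ID = -100   # positions with this value are ignored by the loss

padded_inputs = pad_to_multiple(input_ids, PAD_ID)
padded_labels = pad_to_multiple(labels, LABEL_PAD_ID)

print(padded_inputs[1])
print(padded_labels[0])
```

Inputs are padded with the pad token, while labels are padded with -100 so that padding positions do not contribute to the training loss.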


### Sanity checks

In [None]:
# Check input and targets
print('\nSample training examples:')

for i in range(3):
    row = te_train.iloc[i]
    inp = make_input(row)
    target = row['answer_inlang'] # if pd.notna(row['answer_inlang']) else row['answer']

    print(f'\n--- Example {i} ---')
    print(f'INPUT:\n{inp}')
    print(f'\nTARGET:{target}')
    print(f"English answer: {row['answer']}")


Sample training examples:

--- Example 0 ---
INPUT:
Question (Telugu): 1990 నాటికి ఆఫ్రికాలో అతిపెద్ద జనాభా కలిగిన దేశం ఏది?
Context (English): various archipelagos. It contains 54 fully recognised sovereign states (countries), nine territories and two "de facto" independent states with limited or no recognition. The majority of the continent and its countries are in the Northern Hemisphere, with a substantial portion and number of countries in the Southern Hemisphere. Africa's average population is the youngest amongst all the continents; the median age in 2012 was 19.7, when the worldwide median age was 30.4. Algeria is Africa's largest country by area, and Nigeria is its largest by population. Africa, particularly central Eastern Africa, is widely accepted as the place of origin of

TARGET:నైజీరియా
English answer: Nigeria

--- Example 1 ---
INPUT:
Question (Telugu): 2010 నాటికీ వ్యవసాయ రంగంలో చైనా దేశం ఎన్నో స్థానంలో ఉంది?
Context (English): A country with In [[2010]] China was ran

In [None]:
# Check tokenization
def validate_tokens(dataset, name):
    print(f'\nValidating {name}...')
    for i, example in enumerate(dataset):
        input_ids = example['input_ids']
        labels = example['labels']

        max_input = max(input_ids)
        max_label = max([x for x in labels if x != -100])

        if max_input >= tokenizer.vocab_size:
            print(f'Error in example {i}: input_ids max = {max_input} >= vocab_size {tokenizer.vocab_size}')
            # Decode to see what text caused this
            print(f"Problem with input text: {train_data[i]['input_text'][:200]}")
            break

        if max_label >= tokenizer.vocab_size:
            print(f'Error in example {i}: labels max = {max_label} >= vocab_size {tokenizer.vocab_size}')
            print(f"Problem with target text: {train_data[i]['target_text'][:200]}")
            break

    print(f'{name} validation complete!')

validate_tokens(train_proc, 'train')
validate_tokens(val_proc, 'val')


Validating train...
train validation complete!

Validating val...
val validation complete!


## Training setup - defining our evaluation metrics

We will use SacreBLEU and chrF loaded using the evaluate module.
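For intuition about what a character n-gram metric rewards, the sketch below implements a simplified character n-gram F-score in the spirit of chrF. It is a hand-written approximation for illustration only and will not reproduce the exact scores of the `evaluate` / SacreBLEU implementation; the function names `char_ngrams` and `simple_chrf` are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    text = text.replace(' ', '')
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(pred, ref, max_n=6, beta=2.0):
    """Average n-gram precision and recall over n = 1..max_n, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        p, r = char_ngrams(pred, n), char_ngrams(ref, n)
        if not p or not r:
            continue  # string too short for this n-gram order
        overlap = sum((p & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(p.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

exact = simple_chrf('1608', '1608')
partial = simple_chrf('Nigeria', 'Algeria')
disjoint = simple_chrf('abc', 'xyz')
print(exact, partial, disjoint)
```

Because it works on characters rather than words, a metric of this kind gives partial credit to near misses, which is useful for the short answers in this task.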

In [None]:
# Load metrics
sacrebleu = evaluate.load('sacrebleu')
chrf = evaluate.load('chrf')

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    return preds, labels

def compute_metrics(eval_pred):
    # preds, labels = eval_pred

    # # Ensure preds are valid token IDs
    # if isinstance(preds, tuple):
    #     preds = preds[0]

    # Need these as eval prediction objects are not always a tuple
    if isinstance(eval_pred, tuple):
        preds, labels = eval_pred
    else:
        preds, labels = eval_pred.predictions, eval_pred.label_ids

    # Force integers and replace bad values in preds too
    preds = np.array(preds, dtype=np.int64)
    preds = np.where((preds < 0) | (preds >= tokenizer.vocab_size),
                        tokenizer.pad_token_id,
                        preds)

    # Replace -100 with pad_token_id before decoding - important as the decoder can't handle -100
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.replace('<extra_id_0>', '').strip() for pred in decoded_preds]
    decoded_labels = [label.replace('<extra_id_0>', '').strip() for label in decoded_labels]

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # # Print lengths for debugging
    # print(f'Number of predictions: {len(decoded_preds)}')
    # print(f'Number of references: {len(decoded_labels)}')
    #
    # print(f'\nSample prediction: {decoded_preds[0]}')
    # print(f'Sample reference: {decoded_labels[0]}')

    for p, r in zip(decoded_preds[:5], decoded_labels[:5]):
        print(f'Prediction: {p}')
        print(f'Reference: {r}')
        print('-' * 30)

    bleu_score = sacrebleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])['score']
    chrf_score = chrf.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])['score']

    return {'sacrebleu': bleu_score, 'chrf': chrf_score}
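The two clean-up steps inside `compute_metrics` can be illustrated without NumPy. The sketch below uses plain lists with made-up token IDs; the helper names are hypothetical, the pad ID of 0 is an assumption, and the vocabulary size is the value printed by the tokenizer check in this notebook.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # assumed pad token ID

def unmask_labels(label_rows, pad_id=PAD_ID):
    """Replace -100 with the pad ID so the rows can be decoded."""
    return [[pad_id if t == IGNORE_INDEX else t for t in row] for row in label_rows]

def sanitize_preds(pred_rows, vocab_size, pad_id=PAD_ID):
    """Replace IDs outside [0, vocab_size) with the pad ID."""
    return [[t if 0 <= t < vocab_size else pad_id for t in row] for row in pred_rows]

# Hypothetical token IDs, not real model output
labels = [[4500, 1, -100, -100], [812, 93, 77, 1]]
preds = [[4500, 1, -100, 0], [812, 93, 999999, 1]]

clean_labels = unmask_labels(labels)
clean_preds = sanitize_preds(preds, vocab_size=250100)

print(clean_labels)
print(clean_preds)
```

After this step every ID is a valid vocabulary index, and the pad tokens are dropped at decoding time when `skip_special_tokens=True` is used.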

## Training the model

We finally train the model with a learning rate of 0.0003 (3e-4), a weight decay of 0.01 and 50 training epochs.
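As a rough sanity check on the training budget, the number of optimizer steps can be estimated from the dataset size, batch size and epoch count. The sketch below assumes 50 training examples after filtering, a single device, and no gradient accumulation; the example count and the helper `total_optimizer_steps` are assumptions for illustration.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs

# Assumed values: 50 examples, BATCH_SIZE = 4, NUM_EPOCHS = 50
steps = total_optimizer_steps(num_examples=50, batch_size=4, num_epochs=50)
print(steps)  # 13 steps per epoch * 50 epochs = 650
```

With so few steps per epoch, evaluating and saving every epoch dominates the wall-clock time, which is worth keeping in mind when changing `NUM_EPOCHS`.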

In [None]:
# Sanity checks
print('Checking tokenization...')
sample = train_proc[0]
print(f"Input IDs range: {min(sample['input_ids'])} to {max(sample['input_ids'])}")
label_ids = [x for x in sample['labels'] if x != -100]  # ignore masked positions
print(f'Label IDs range: {min(label_ids)} to {max(label_ids)}')
print(f'Tokenizer vocab size: {tokenizer.vocab_size}')

Checking tokenization...
Input IDs range: 1 to 228869
Label IDs range: 1 to 166439
Tokenizer vocab size: 250100


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy='epoch',
    logging_strategy='epoch',
    save_strategy='epoch',
    predict_with_generate=True,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    generation_max_length=MAX_TARGET_LENGTH,
    generation_num_beams=4,
    remove_unused_columns=False,
    logging_nan_inf_filter=False,
    fp16=False, # torch.cuda.is_available()
    bf16=False,
    metric_for_best_model='chrf',
    greater_is_better=True,
    report_to='none',
    save_total_limit=1,
    seed=SEED,
    use_cpu=True,  # Force CPU to avoid the MPS Long tensor error (or change the IDs to floats)
)

from transformers import GenerationConfig

# Create a generation config
generation_config = GenerationConfig(
    max_length=MAX_TARGET_LENGTH,
    num_beams=4,
    early_stopping=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
    forced_eos_token_id=tokenizer.eos_token_id,
    min_length=1,
    no_repeat_ngram_size=0,
)

# Apply the generation config to the model
model.generation_config = generation_config

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_proc,
    eval_dataset=val_proc,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()
print(trainer.evaluate())



| Epoch | Training Loss | Validation Loss | Sacrebleu | Chrf |
|---|---|---|---|---|
| 1 | 3.8088 | 3.232848 | 0.0 | 7.146223 |
| 2 | 3.2706 | 3.001337 | 4.822222 | 7.555865 |
| 3 | 2.8515 | 2.882479 | 1.78611 | 11.726262 |
| 4 | 2.1152 | 3.048683 | 6.015303 | 8.404648 |
| 5 | 1.839 | 2.872888 | 2.289885 | 12.323224 |
| 6 | 1.7382 | 2.783523 | 12.58766 | 13.89473 |
| 7 | 1.5073 | 2.81818 | 0.0 | 10.453651 |
| 8 | 2.1266 | 2.713093 | 0.0 | 8.811102 |
| 9 | 1.6971 | 2.62162 | 17.455562 | 10.238028 |
| 10 | 1.7147 | 2.590783 | 11.859575 | 12.731042 |


Prediction: నైజీరియా
Reference: హన్స్ ఆండర్సాగ్
------------------------------
Prediction: అల్జీరియా
Reference: హరీష్ జైరాజ్
------------------------------
Prediction: నైజీరియా
Reference: 1608
------------------------------
Prediction: నైజీరియా
Reference: మార్చి లేదా ఏప్రిల్
------------------------------
Prediction: నైజీరియా
Reference: కొలంబియా మాదిరిగా అదే పరిమాణం
------------------------------
Prediction: ఆఫ్రికా
Reference: హన్స్ ఆండర్సాగ్
------------------------------
Prediction: అల్జీరియా
Reference: హరీష్ జైరాజ్
------------------------------
Prediction: అల్జీరియా
Reference: 1608
------------------------------
Prediction: ఆఫ్రికా
Reference: మార్చి లేదా ఏప్రిల్
------------------------------
Prediction: ఆఫ్రికా
Reference: కొలంబియా మాదిరిగా అదే పరిమాణం
------------------------------
Prediction: ఆందోల్ కళాశాల
Reference: హన్స్ ఆండర్సాగ్
------------------------------
Prediction: అల్జీరియా
Reference: హరీష్ జైరాజ్
------------------------------
Prediction: ఆఫ్రికాలో అల్జీరియా
Reference

Prediction: పోర్ట్ ల్యాండ్
Reference: హన్స్ ఆండర్సాగ్
------------------------------
Prediction: వెంకటేశ్వర్లు, తల్లి మహాలక్షమ్మ
Reference: హరీష్ జైరాజ్
------------------------------
Prediction: 1608
Reference: 1608
------------------------------
Prediction: ముహమ్మద్ ప్రవక్త
Reference: మార్చి లేదా ఏప్రిల్
------------------------------
Prediction: పోర్ట్ ల్యాండ్
Reference: కొలంబియా మాదిరిగా అదే పరిమాణం
------------------------------
{'eval_loss': 3.2665796279907227, 'eval_sacrebleu': 7.473808976541252, 'eval_chrf': 14.37963940708437, 'eval_runtime': 18.5842, 'eval_samples_per_second': 5.381, 'eval_steps_per_second': 1.345, 'epoch': 50.0}


## Preview some results

Finally, we see our fine-tuned model in action using a simple preview function that compares the gold answers for a few validation examples with the answers generated by the model.

In [None]:
train_data = Dataset.from_pandas(te_train)
val_data = Dataset.from_pandas(te_val)

def preview(n=5):
    # Safety: ensure we use the same tokenizer/model instance used for training
    device = next(model.parameters()).device

    rows = val_data.select(range(min(n, len(val_data))))
    for x in rows:
        # Must match the exact training prefix you used
        # inp = f"Question (Telugu): {x['question']} \nContext (English): {x['context']}" # \nAnswer (Telugu):"
        inp = f"Question (Telugu): {x['question']}\nContext (English): {x['context']}"

        enc = tokenizer(
            [inp],
            return_tensors='pt',
            truncation=True,
            max_length=128,
            padding=False,
        ).to(device)

        with torch.no_grad():
            out = model.generate(
                **enc,
                max_new_tokens=32,
                do_sample=True,
                top_p=0.9,
                top_k=40,
                temperature=0.8,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

        pred = tokenizer.decode(out[0], skip_special_tokens=True).strip()

        print('Q (te):', x['question'])
        print('Gold (te):', x['answer_inlang'])
        print('Pred (te):', pred)
        print('Gold (en)', x['answer'])
        # print("Context (en):", x["context"])
        print('----')

preview(5)

Q (te): మలేరియా వ్యాధి కి మందు కనిపెట్టిన శాస్త్రవేత్త ఎవరు?
Gold (te): హన్స్ ఆండర్సాగ్
Pred (te): పోర్ట్ ల్యాండ్
Gold (en) Hans Andersag
----
Q (te): మున్నా చిత్రానికి సంగీత దర్శకుడు ఎవరు?
Gold (te): హరీష్ జైరాజ్
Pred (te): వెంకటేశ్వర్లు, తల్లి మహాలక్షమ్మ
Gold (en) Harish Jairaj
----
Q (te): ఈస్ట్ ఇండియా కంపెనీ భారతదేశంలోకి ఎప్పుడు వచ్చింది?
Gold (te): 1608
Pred (te): 1608
Gold (en) 1608
----
Q (te): తెలుగు పంచాంగం ప్రకారం నూతన సంవత్సరం ఏ ఇంగ్లీష్ నెలలో ప్రారంభమవుతుంది?
Gold (te): మార్చి లేదా ఏప్రిల్
Pred (te): సి.ఎన్.అన్నాదురై
Gold (en) March or April
----
Q (te): దక్షిణ ఆఫ్రికా దేశ విస్తీర్ణం ఎంత?
Gold (te): కొలంబియా మాదిరిగా అదే పరిమాణం
Pred (te): పోర్ట్ ల్యాండ్
Gold (en) Same size as Colombia
----
