### Transformer Project - Text Summarisation or Translation

## Applying Transformer-Based Models to Advanced NLP Tasks

### Text Summarisation

Text summarisation is the process of condensing a piece of text while retaining its key information. There are two main types:

- **Extractive Summarisation:** Selects key phrases or sentences directly from the original text to create a summary.
- **Abstractive Summarisation:** Generates new sentences that capture the meaning of the original text, often rephrasing or paraphrasing content.
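
The extractive approach can be illustrated with a minimal sketch. It assumes a naive scoring rule (a sentence's score is the average frequency of its words across the document); the function name `extractive_summary` and the scoring rule are illustrative choices, not a library API. Abstractive summarisation has no comparably small form because it needs a generative model.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order.

    Assumption for illustration: a sentence's score is the average
    document-level frequency of its words.
    """
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


document = (
    "Transformers process tokens in parallel. "
    "Transformers use attention to relate tokens. "
    "The weather was pleasant yesterday."
)
summary = extractive_summary(document, num_sentences=1)
print(summary)
```

Every word in the output is copied from the source, which is the defining property of extractive methods.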

**Applications:**
- News article summarisation
- Document and report summarisation
- Meeting transcript summarisation

---

### Text Translation

Text translation involves converting text from one language to another while maintaining meaning, context, and grammatical correctness.

**Examples:**
- English to French translation
- Multilingual translation using models such as mT5 (the original T5 checkpoints cover English to German, French, and Romanian)
- Real-time translation in chat applications

---

### Fine-Tuning and Optimising Models

#### Pre-Trained Models for Summarisation and Translation

- **T5 (Text-to-Text Transfer Transformer):** Treats every NLP problem as a text-to-text task. Can be fine-tuned for both summarisation and translation tasks.
- **BART (Bidirectional and Auto-Regressive Transformers):** Combines a bidirectional, BERT-like encoder with an autoregressive, GPT-like decoder. Pre-trained as a denoising autoencoder (it learns to reconstruct corrupted text) and can be fine-tuned for summarisation and translation.
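
The text-to-text framing used by T5 can be sketched as a small helper that prepends a task prefix to the input. The prefixes below are the ones associated with the original T5 checkpoints; the dictionary `T5_PREFIXES` and the function `build_t5_input` are hypothetical names for illustration, and other fine-tuned models may expect different prefixes.

```python
# Task prefixes associated with the original T5 checkpoints. Treat this
# mapping as an assumption: a model fine-tuned elsewhere may use others.
T5_PREFIXES = {
    "summarisation": "summarize: ",
    "en_to_fr": "translate English to French: ",
    "en_to_de": "translate English to German: ",
}


def build_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix so one model can tell the tasks apart."""
    return T5_PREFIXES[task] + text.strip()


print(build_t5_input("summarisation", "Transformers process tokens in parallel."))
print(build_t5_input("en_to_fr", "The house is small."))
```

Because the task is encoded in the input string, the same model, loss, and decoding procedure can serve both summarisation and translation.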

#### Optimisation Techniques

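- **Learning Rate Scheduling:** Adjusts the learning rate during training, commonly a short warm-up followed by linear or cosine decay, to stabilise early updates and improve convergence.
- **Hyperparameter Tuning:** Adjusts settings such as batch size, optimiser type, and maximum sequence length, and keeps the configuration that performs best on the validation set.
- **Early Stopping:** Stops training when the validation metric has not improved for a set number of evaluations (the patience), which limits overfitting.
- **Data Augmentation:** Increases the diversity of training data, for example through back-translation or paraphrasing, to improve model robustness.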
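
Two of these techniques, learning rate scheduling and early stopping, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the schedule is a linear warm-up followed by linear decay, it is stepped once per epoch here (real training loops usually step per batch), and the validation losses are invented for the demonstration. The names `linear_schedule_with_warmup` and `EarlyStopper` are hypothetical.

```python
def linear_schedule_with_warmup(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


class EarlyStopper:
    """Signal a stop after `patience` evaluations without improvement."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical validation losses, one per epoch (made up for this example).
val_losses = [2.1, 1.8, 1.7, 1.75, 1.72, 1.6]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = linear_schedule_with_warmup(epoch, base_lr=2e-5, warmup_steps=2, total_steps=6)
    print(f"epoch {epoch}: lr={lr:.2e}, val_loss={loss}")
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With a patience of 2, training halts after two consecutive evaluations that fail to beat the best loss, so the final, lower loss in the list is never reached. This is the usual trade-off: a larger patience tolerates noisy validation curves at the cost of extra training time.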

---

### Analysing Model Performance

#### Evaluation Metrics

- **Text Summarisation:**
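    - **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** Measures n-gram overlap (ROUGE-1, ROUGE-2) and longest-common-subsequence overlap (ROUGE-L) between generated and reference summaries, with an emphasis on recall.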
    - **BLEU (Bilingual Evaluation Understudy):** Originally for translation, but sometimes used for summarisation to measure n-gram overlap.
- **Text Translation:**
    - **BLEU Score:** Evaluates translation quality by comparing n-grams of the candidate and reference translations.
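    - **Perplexity:** Measures how well a probability model predicts a sample; lower is better. It reflects the model's confidence in the text rather than translation quality directly.
    - **METEOR, TER:** Additional metrics for translation quality. METEOR matches words using stems and synonyms as well as exact forms; TER (Translation Edit Rate) counts the edits needed to turn the output into the reference.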
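
The core calculations behind these metrics can be sketched in a few lines. This is a simplified illustration, not a replacement for the reference implementations: real ROUGE adds stemming and ROUGE-L, and full BLEU combines several n-gram orders with a brevity penalty. The function names are hypothetical, tokenisation is a plain whitespace split, and the log-probabilities passed to `perplexity` are invented.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    total = sum(ref.values())
    return overlap / total if total else 0.0


def bleu_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that appear in the reference (clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def perplexity(token_log_probs: list[float]) -> float:
    """Exponential of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1 recall:", rouge_n_recall(candidate, reference, n=1))
print("ROUGE-2 recall:", rouge_n_recall(candidate, reference, n=2))
print("BLEU 1-gram precision:", bleu_ngram_precision(candidate, reference, n=1))
# A model that assigns probability 0.5 to every token (hypothetical values).
print("Perplexity:", perplexity([math.log(0.5)] * 4))
```

ROUGE divides the overlap by the reference length (recall), while BLEU divides it by the candidate length (precision), which is why the two can disagree when the output is much shorter or longer than the reference.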

#### Additional Considerations

- **Human Evaluation:** Involves human judges rating the quality and fluency of generated summaries or translations.
- **Error Analysis:** Involves identifying common failure cases, such as omitted key facts or mistranslated terms, to guide further model improvements.


In [None]:
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

In [None]:
# Load the CNN/DailyMail dataset for summarisation
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset["train"][0])

# Use the same checkpoint for the tokenizer and the model
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenise the articles (inputs) and highlights (targets)
def tokenize_function(examples):
    # T5 expects the task prefix "summarize: " (American spelling)
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # text_target tokenises the reference summaries as labels
    labels = tokenizer(text_target=examples["highlights"], max_length=150, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(tokenize_function, batched=True)

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Pads inputs and labels per batch; padded label positions are ignored by the loss
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# predict_with_generate is only accepted by Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,  # called tokenizer in older transformers releases
)

trainer.train()

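model.eval()  # disable dropout before generating
sample_text = "The Transformer model revolutionised NLP by enabling parallel processing of sequences"
inputs = tokenizer("summarize: " + sample_text, return_tensors="pt", max_length=512, truncation=True)
inputs = inputs.to(model.device)  # after training, the model may sit on a GPU
outputs = model.generate(**inputs, max_length=150, num_beams=4, early_stopping=True)

print("Original text:\n", sample_text)
print("Summary:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Score generated summaries with ROUGE (requires: pip install evaluate rouge_score)
import evaluate

metric = evaluate.load("rouge")

# Generate summaries for a small slice of the validation split to keep this quick
eval_samples = dataset["validation"].select(range(16))
batch = tokenizer(
    ["summarize: " + doc for doc in eval_samples["article"]],
    return_tensors="pt", max_length=512, truncation=True, padding=True,
).to(model.device)
generated = model.generate(**batch, max_length=150, num_beams=4, early_stopping=True)
predictions = tokenizer.batch_decode(generated, skip_special_tokens=True)
references = eval_samples["highlights"]

results = metric.compute(predictions=predictions, references=references)
print(results)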