## Abstractive Summarization
### Method 1 - Model Training 
### Native PyTorch implementation (src/model.ipynb)

What was implemented:

- Load pre-trained transformer
    - Facebook’s BART Large (`facebook/bart-large`)
 
- OOP implementation of Dataset 
    - Feature, Target
    - Tokenize
    - Padding, Truncate
    - Convert to Tensor
    - Pass to: DataLoader – with batch size

- Training Loop
    - Train mode
    - AdamW optimizer
    - Forward pass & compute loss
    - Backward pass – compute gradients
    - Update params – optimizer step
    - Update learning rate – scheduler step
    - Zero the gradients
    - Update total loss
    - [ Average Training Loss: 1.3280 ]

- Saved the fine-tuned transformer model.
    - https://drive.google.com/drive/folders/1oLf8SJnRP6JgVoOCx73M1JRoOxDfZBSd?usp=sharing -> saved model

- Evaluation loop
    - Eval mode
    - No gradient calculation
    - Forward pass & compute loss
    - Accumulate batch loss
    - Print batch information
    - Calculate average validation loss
    - Print final evaluation results (loss and time)
    - [ Validation Loss: 2.4502 ]
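
The two averages reported above are mean token-level cross-entropy losses, so one way to compare them is to convert each to a perplexity, `exp(loss)`. The sketch below is illustrative only: `perplexity` is a hypothetical helper that does not appear in the notebook. Because padded label positions also contribute to the loss in this notebook, the values are rough indicators rather than true perplexities.

```python
import math

def perplexity(avg_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(avg_cross_entropy)

train_ppl = perplexity(1.3280)  # average training loss reported above
val_ppl = perplexity(2.4502)    # validation loss reported above
print(f"train perplexity: {train_ppl:.2f} | validation perplexity: {val_ppl:.2f}")
```

A validation value well above the training value points to a generalization gap.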

The trained model from method 1 was not used for deployment; the model trained with method 2 was deployed instead.

Reason:
- Although the training loss was low (1.3280), the model performed inconsistently during validation and testing (validation loss 2.4502).
- A tensor error is suspected in the method 1 training code, which may explain the inconsistent output.
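
One possible source of such an error, offered as an assumption rather than a confirmed diagnosis: the dataset class pads `labels` with the tokenizer's pad token, and the cross-entropy loss used by Hugging Face seq2seq models skips a label position only when its value is -100. A minimal sketch of masking padded label positions follows. `mask_padding_in_labels` is a hypothetical helper, and the pad id of 1 is the id BART tokenizers assign to `<pad>`.

```python
import torch

def mask_padding_in_labels(labels: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Return a copy of `labels` with padding positions set to -100.

    -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so these
    positions no longer contribute to the loss.
    """
    masked = labels.clone()
    masked[masked == pad_token_id] = -100
    return masked

# Toy label row: <s> ... </s> followed by two <pad> tokens (ids are illustrative)
labels = torch.tensor([[0, 713, 16, 2, 1, 1]])
masked = mask_padding_in_labels(labels, pad_token_id=1)
print(masked.tolist())
```

In the dataset class this could be applied to `targets['input_ids']` before returning it as `labels`. Whether it resolves the inconsistency seen here is untested.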


In [1]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from transformers import BartTokenizer

# OOP implementation of Dataset 
class SummarizationDataset(Dataset):
    def __init__(self, file_path, tokenizer, max_length=512):
        self.dataset = pd.read_csv(file_path) # file path
        self.tokenizer = tokenizer # Tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        text = self.dataset.iloc[idx, 0] # Feature
        summary = self.dataset.iloc[idx, 1] # Target 
        
        inputs = self.tokenizer.encode_plus(
            text, # Feature
            max_length=self.max_length,
            padding='max_length', # Padding 
            truncation=True, # Truncate
            return_tensors="pt" # Convert to Tensor
        )
        targets = self.tokenizer.encode_plus(
            summary, # Target
            max_length=self.max_length,
            padding='max_length', # Padding 
            truncation=True, # Truncate
            return_tensors="pt" # Convert to Tensor
        )
        
        return {
            'input_ids': inputs['input_ids'].flatten(), # feature: flatten the (1, max_length) tensor to one dimension
            'attention_mask': inputs['attention_mask'].flatten(), # attention mask: 1 for real tokens, 0 for padding
            'labels': targets['input_ids'].flatten() # target: flattened the same way
        }

# Tokenizer from the foundation model (note: 'facebook/bart-base' here, while the model loaded below is 'facebook/bart-large')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base') 

# Data objects
train_dataset = SummarizationDataset('/home/mohan/infy/data/merged/final/train.csv', tokenizer)
val_dataset = SummarizationDataset('/home/mohan/infy/data/merged/final/validation.csv', tokenizer)
test_dataset = SummarizationDataset('/home/mohan/infy/data/merged/final/test.csv', tokenizer)

# Pass to: DataLoader – with batch size
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)
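
To check what the DataLoader hands to the training loop without needing the CSV files or a tokenizer download, a toy stand-in dataset can be batched the same way. This is a sketch under assumptions: `ToyDataset`, its random token ids, and the short `max_length` of 8 are hypothetical and only mirror the dictionary layout returned by `SummarizationDataset.__getitem__`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in that returns the same keys as SummarizationDataset."""
    def __init__(self, num_rows=10, max_length=8):
        self.num_rows = num_rows
        self.max_length = max_length

    def __len__(self):
        return self.num_rows

    def __getitem__(self, idx):
        # Random ids stand in for tokenizer output; every item has a fixed length
        ids = torch.randint(low=3, high=100, size=(self.max_length,))
        return {
            'input_ids': ids,
            'attention_mask': torch.ones(self.max_length, dtype=torch.long),
            'labels': ids.clone(),
        }

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print({key: tuple(value.shape) for key, value in batch.items()})
print(len(loader))  # number of batches: ceil(10 / 4)
```

Because every item is padded to the same length, the default collate function can stack the items into tensors of shape `(batch_size, max_length)`.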


In [2]:
import torch
from transformers import BartForConditionalGeneration

# Use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained foundation model
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
model = model.to(device) # Move the model to the selected device


In [4]:
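import time
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimizer
from transformers import get_scheduler
from tqdm.auto import tqdm

# AdamW optimizer (Adam with decoupled weight decay)
optimizer = AdamW(model.parameters(), lr=0.001)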

num_epochs = 3 # epochs
num_training_steps = num_epochs * len(train_loader) # steps

# Linear learning-rate decay with no warmup
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Progress bar
progress_bar = tqdm(range(num_training_steps))

# Set the model to train mode
model.train()

# Training loop
start_time = time.time()
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    
    epoch_start_time = time.time()
    total_loss = 0
    
    # Loop over the training batches
    for step, batch in enumerate(train_loader):
        
        batch_start_time = time.time()
        # data feeding    
        inputs = batch['input_ids'].to(device)  # input from batch
        attention_mask = batch['attention_mask'].to(device) # padding - attention_mask
        labels = batch['labels'].to(device) # target from batch
        
        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)  # forward pass
        loss = outputs.loss # compute loss
               
        loss.backward() # Backward pass 
        optimizer.step() # Update model parameters
        lr_scheduler.step() # Update learning rate
        optimizer.zero_grad()  # Clear the gradients for the next iteration
        
        progress_bar.update(1) # Update the progress bar
        
        total_loss += loss.item() # Accumulate the loss
        
        current_lr = lr_scheduler.get_last_lr()[0] # Get the current learning rate 
        batch_time = time.time() - batch_start_time # compute batch time
        print(f"Epoch {epoch + 1} | Step {step + 1}/{len(train_loader)} | "
              f"Batch Loss: {loss.item():.4f} | Learning Rate: {current_lr:.6f} | "
              f"Batch Time: {batch_time:.2f}s")
    
    avg_loss = total_loss / len(train_loader) # compute average loss
    epoch_time = time.time() - epoch_start_time  # compute epoch time
    # Print epoch-level training information
    print(f"Epoch {epoch + 1} completed. Average Loss: {avg_loss:.4f} | "
          f"Epoch Time: {epoch_time:.2f}s")
    
total_training_time = time.time() - start_time # compute train time
print(f"Training completed in {total_training_time:.2f}s") # Print final training information


  0%|          | 0/37626 [00:00<?, ?it/s]

Epoch 1/3
Epoch 1 | Step 1/12542 | Batch Loss: 2.1976 | Learning Rate: 0.001000 | Batch Time: 0.82s
Epoch 1 | Step 2/12542 | Batch Loss: 2.8412 | Learning Rate: 0.001000 | Batch Time: 0.62s
Epoch 1 | Step 3/12542 | Batch Loss: 1.3823 | Learning Rate: 0.001000 | Batch Time: 0.63s
Epoch 1 | Step 4/12542 | Batch Loss: 1.3652 | Learning Rate: 0.001000 | Batch Time: 0.60s
Epoch 1 | Step 5/12542 | Batch Loss: 1.4219 | Learning Rate: 0.001000 | Batch Time: 0.58s
Epoch 1 | Step 6/12542 | Batch Loss: 1.8671 | Learning Rate: 0.001000 | Batch Time: 0.57s
Epoch 1 | Step 7/12542 | Batch Loss: 1.2737 | Learning Rate: 0.001000 | Batch Time: 0.63s
Epoch 1 | Step 8/12542 | Batch Loss: 1.4080 | Learning Rate: 0.001000 | Batch Time: 0.61s
Epoch 1 | Step 9/12542 | Batch Loss: 1.0707 | Learning Rate: 0.001000 | Batch Time: 0.57s
Epoch 1 | Step 10/12542 | Batch Loss: 2.3664 | Learning Rate: 0.001000 | Batch Time: 0.62s
Epoch 1 | Step 11/12542 | Batch Loss: 2.2995 | Learning Rate: 0.001000 | Batch Time: 0.61
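
The totals in this output follow from the scheduler setup: 3 epochs of 12542 batches each give the 37626 steps shown in the progress bar. A linear schedule with no warmup lowers the learning rate by only a tiny amount per step, which is why the first steps still print 0.001000. The function below is a plain-Python sketch of that linear rule. The name `linear_lr` is hypothetical; the notebook itself relies on `get_scheduler("linear", ...)`.

```python
def linear_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int = 0) -> float:
    """Learning rate after `step` scheduler updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

num_epochs, batches_per_epoch = 3, 12542   # values from the log above
total_steps = num_epochs * batches_per_epoch
print(total_steps)
for step in (1, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step, 0.001, total_steps):.6f}")
```

The rate reaches zero only at the final step, so most of training runs at a rate close to the initial 0.001.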

In [5]:
# Save the fine tuned transformer model & its tokenizer
model.save_pretrained("/home/mohan/infy/models/fine_tuned_bart")
tokenizer.save_pretrained("/home/mohan/infy/models/fine_tuned_bart")

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('/home/mohan/infy/models/fine_tuned_bart/tokenizer_config.json',
 '/home/mohan/infy/models/fine_tuned_bart/special_tokens_map.json',
 '/home/mohan/infy/models/fine_tuned_bart/vocab.json',
 '/home/mohan/infy/models/fine_tuned_bart/merges.txt',
 '/home/mohan/infy/models/fine_tuned_bart/added_tokens.json')

In [None]:
'''
Optional extra training across multiple runs.
A long schedule can be split into several shorter runs of a few epochs each,
so a consumer-grade GPU can train a large model for more epochs in total.
'''

# If starting from a fresh session, reload the saved model first:
'''
model_path = '/home/mohan/infy/models/fine_tuned_bart'
model = BartForConditionalGeneration.from_pretrained(model_path)
tokenizer = BartTokenizer.from_pretrained(model_path)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = AdamW(model.parameters(), lr=0.001)
'''

# Same training loop as above

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))

model.train()

start_time = time.time()
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    
    epoch_start_time = time.time()
    total_loss = 0
    
    for step, batch in enumerate(train_loader):
        
        batch_start_time = time.time()    
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
               
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        
        progress_bar.update(1)
        
        total_loss += loss.item()
        
        current_lr = lr_scheduler.get_last_lr()[0]
        batch_time = time.time() - batch_start_time
        print(f"Epoch {epoch + 1} | Step {step + 1}/{len(train_loader)} | "
              f"Batch Loss: {loss.item():.4f} | Learning Rate: {current_lr:.6f} | "
              f"Batch Time: {batch_time:.2f}s")
    
    avg_loss = total_loss / len(train_loader)
    epoch_time = time.time() - epoch_start_time
    print(f"Epoch {epoch + 1} completed. Average Loss: {avg_loss:.4f} | "
          f"Epoch Time: {epoch_time:.2f}s")
    
total_training_time = time.time() - start_time
print(f"Training completed in {total_training_time:.2f}s")
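
When training is split across sessions as described in the cell above, reloading only the model weights starts the optimizer from scratch. A possible refinement, which is an assumption and not something the notebook does, is to checkpoint the optimizer state alongside the model. The sketch uses a tiny `torch.nn.Linear` as a hypothetical stand-in for BART and an in-memory buffer in place of a checkpoint file.

```python
import io
import torch
from torch.optim import AdamW

# Tiny stand-in model (hypothetical); the notebook would use the BART model here
model = torch.nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=0.001)

# One training step so the optimizer has state worth saving
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Save model and optimizer state together
buffer = io.BytesIO()
torch.save({'model': model.state_dict(), 'optimizer': optimizer.state_dict()}, buffer)

# Later session: rebuild the objects, then restore both states
buffer.seek(0)
checkpoint = torch.load(buffer)
resumed_model = torch.nn.Linear(4, 2)
resumed_optimizer = AdamW(resumed_model.parameters(), lr=0.001)
resumed_model.load_state_dict(checkpoint['model'])
resumed_optimizer.load_state_dict(checkpoint['optimizer'])
```

With a real run the buffer would be replaced by a file path, and the learning-rate scheduler state could be saved in the same dictionary.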

In [6]:
'''
Computes only the validation loss.
No other performance metrics are used.
'''
# Set the model to evaluation mode
model.eval() 
total_eval_loss = 0 # Initialize 
eval_start_time = time.time() # starting time

with torch.no_grad(): # no backpropagation needed
    for step, batch in enumerate(val_loader):
        batch_start_time = time.time() # starting time - batch

        inputs = batch['input_ids'].to(device) # input from batch
        attention_mask = batch['attention_mask'].to(device)  # padding - attention_mask
        labels = batch['labels'].to(device) # target from batch

        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels) # forward pass
        loss = outputs.loss # compute loss

        total_eval_loss += loss.item() # Accumulate the loss
        
        batch_time = time.time() - batch_start_time # compute batch processing time

        # Print batch-level evaluation information
        print(f"Validation Step {step + 1}/{len(val_loader)} | "
              f"Batch Loss: {loss.item():.4f} | Batch Time: {batch_time:.2f}s")

avg_eval_loss = total_eval_loss / len(val_loader) # compute average loss

eval_time = time.time() - eval_start_time # compute evaluation processing time

# Print final evaluation information
print(f"Validation Loss: {avg_eval_loss:.4f} | Evaluation Time: {eval_time:.2f}s")


Validation Step 1/1568 | Batch Loss: 1.3169 | Batch Time: 0.41s
Validation Step 2/1568 | Batch Loss: 2.5463 | Batch Time: 0.14s
Validation Step 3/1568 | Batch Loss: 1.7110 | Batch Time: 0.13s
Validation Step 4/1568 | Batch Loss: 2.1756 | Batch Time: 0.13s
Validation Step 5/1568 | Batch Loss: 1.8531 | Batch Time: 0.14s
Validation Step 6/1568 | Batch Loss: 1.8267 | Batch Time: 0.13s
Validation Step 7/1568 | Batch Loss: 1.7501 | Batch Time: 0.14s
Validation Step 8/1568 | Batch Loss: 2.1318 | Batch Time: 0.13s
Validation Step 9/1568 | Batch Loss: 1.4122 | Batch Time: 0.14s
Validation Step 10/1568 | Batch Loss: 5.5929 | Batch Time: 0.13s
Validation Step 11/1568 | Batch Loss: 2.7742 | Batch Time: 0.14s
Validation Step 12/1568 | Batch Loss: 1.5455 | Batch Time: 0.14s
Validation Step 13/1568 | Batch Loss: 1.5875 | Batch Time: 0.14s
Validation Step 14/1568 | Batch Loss: 2.6606 | Batch Time: 0.14s
Validation Step 15/1568 | Batch Loss: 3.6347 | Batch Time: 0.14s
Validation Step 16/1568 | Batch Lo