# Script to generate summaries using the chunking-based BART method

Set the `output_path` variable to the directory where the generated summaries should be written.  


## Basic *Setup*

In [None]:
output_path = "/content/drive/MyDrive/IN-Abs/output/"

Mount the drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install pytorch-lightning

## Documentation: Text Summarization with PyTorch Lightning

## **Overview**
This code implements a framework for fine-tuning transformer models (e.g., BART) for text summarization tasks using PyTorch Lightning. Below are the key components and their functionality:

---

## **Imports**
- **`nltk`, `transformers`**: For text preprocessing and working with pre-trained transformer models.
- **PyTorch (`torch`, `torch.nn.functional`, `torch.utils.data`)**: Provides tools for neural network building, training, and dataset handling.
- **`pytorch_lightning`**: A wrapper for PyTorch to streamline training workflows.
- **`ModelCheckpoint`**: Saves the best model during training.

---

## **LitModel (LightningModule)**
This is the core model training and inference logic encapsulated in a PyTorch Lightning module.

1. **Initialization (`__init__`)**:
   - Takes inputs like the transformer model, tokenizer, and learning rate.
   - Freezes the encoder and the embedding layers to reduce the number of trainable parameters; both flags are hard-coded to `True` in `self.hparams` rather than passed as arguments.

2. **Methods**:
   - `forward`: Defines how data passes through the model.
   - `configure_optimizers`: Configures the optimizer (Adam).
   - `training_step`: Implements the training logic, computing loss using cross-entropy.
   - `validation_step`: Similar to `training_step`, used for validation loss computation.
   - `generate_text`: Generates text summaries using the model's `generate` method.

---

## **SummaryDataModule (LightningDataModule)**
Handles data preparation, splitting, and loading for training.

1. **`prepare_data`**:
   - Splits the data into training, validation, and test sets (60/20/20).

2. **`setup`**:
   - Encodes source and target sentences using the tokenizer.

3. **Data Loaders**:
   - `train_dataloader`: Prepares batched training data.
   - `val_dataloader`: Prepares batched validation data.
   - `test_dataloader`: Prepares batched test data.
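
The 60/20/20 split can be illustrated without pandas or NumPy. The following is a minimal sketch in plain Python; the name `split_60_20_20` and the `seed` parameter are illustrative assumptions and do not appear in the notebook, which shuffles a DataFrame and cuts it with `np.split` at the same two boundaries.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle items and cut them at 60% and 80% (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    first_cut, second_cut = int(0.6 * n), int(0.8 * n)
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]

train, validate, test = split_60_20_20(range(10))
print(len(train), len(validate), len(test))  # 6 2 2
```

Because the cut points are truncated with `int`, any rounding remainder ends up in the test portion.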

---

## **Utility Functions**
1. **`freeze_params`**:
   - Freezes parameters of specific model layers for faster training.

2. **`shift_tokens_right`**:
   - Builds the decoder inputs by shifting the target token ids one position to the right and moving the last non-pad token (usually `<eos>`) to the first position.

3. **`encode_sentences`**:
   - Tokenizes and encodes source/target sentences for training and evaluation.
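
To make the effect of `shift_tokens_right` concrete, here is a small sketch that applies the same idea to a single Python list instead of a batched tensor. The helper name `shift_right` and the example ids are assumptions for illustration, and the sketch assumes the sequence is padded on the right.

```python
def shift_right(ids, pad_id):
    """List-based analogue of shift_tokens_right for one right-padded sequence."""
    n_real = sum(1 for t in ids if t != pad_id)  # number of non-pad tokens
    last_real = ids[n_real - 1]                  # usually the end-of-sequence token
    return [last_real] + ids[:-1]                # wrap it to the front, drop the last slot

# Assumed ids for illustration: 0 = <s>, 2 = </s>, 1 = <pad>
target = [0, 10, 11, 2, 1, 1]
decoder_input = shift_right(target, pad_id=1)
print(decoder_input)  # [2, 0, 10, 11, 2, 1]
```

The unshifted `target` is what the loss is computed against, while `decoder_input` is what the decoder sees at each step.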

---

## **Key Features**
- **Freezing Layers**: Freezes encoder and embeddings for efficiency.
- **Efficient Data Handling**: Uses PyTorch DataLoader for batched data processing.
- **Text Generation**: Implements summarization using beam search.

---

## **How to Use**
1. Load your dataset into a pandas DataFrame.
2. Initialize the tokenizer and model (e.g., from Hugging Face's Transformers library).
3. Create instances of `LitModel` and `SummaryDataModule`.
4. Use PyTorch Lightning's `Trainer` to train the model.


In [None]:
import nltk
import transformers
from torch.utils.data import DataLoader, TensorDataset, RandomSampler
import numpy as np
import torch.nn.functional as F
import pytorch_lightning as pl
import torch
from pytorch_lightning.callbacks import ModelCheckpoint

#Source - https://colab.research.google.com/drive/1Cy27V-7qqYatqMA7fEqG2kgMySZXw9I4?usp=sharing&pli=1
class LitModel(pl.LightningModule):
  # Instantiate the model
  def __init__(self, learning_rate, tokenizer, model):
    super().__init__()
    self.tokenizer = tokenizer
    self.model = model
    self.learning_rate = learning_rate
    # self.freeze_encoder = freeze_encoder
    # self.freeze_embeds_ = freeze_embeds
#     self.hparams = argparse.Namespace()

    self.hparams.freeze_encoder = True
    self.hparams.freeze_embeds = True
    self.hparams.eval_beams = 4
    # self.hparams = hparams

    if self.hparams.freeze_encoder:
      freeze_params(self.model.get_encoder())

    if self.hparams.freeze_embeds:
      self.freeze_embeds()

  def freeze_embeds(self):
    ''' freeze the positional embedding parameters of the model; adapted from finetune.py '''
    freeze_params(self.model.model.shared)
    for d in [self.model.model.encoder, self.model.model.decoder]:
      freeze_params(d.embed_positions)
      freeze_params(d.embed_tokens)

  # Do a forward pass through the model
  def forward(self, input_ids, **kwargs):
    return self.model(input_ids, **kwargs)

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr = self.learning_rate)
    return optimizer

  def training_step(self, batch, batch_idx):
    # Load the data into variables
    src_ids, src_mask = batch[0], batch[1]
    tgt_ids = batch[2]
    # Shift the decoder tokens right (but NOT the tgt_ids)
    decoder_input_ids = shift_tokens_right(tgt_ids, self.tokenizer.pad_token_id)

    # Run the model and get the logits
    outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
    lm_logits = outputs[0]
    # Create the loss function
    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
    # Calculate the loss on the un-shifted tokens
    loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

    return {'loss':loss}

  def validation_step(self, batch, batch_idx):

    src_ids, src_mask = batch[0], batch[1]
    tgt_ids = batch[2]

    decoder_input_ids = shift_tokens_right(tgt_ids, self.tokenizer.pad_token_id)

    # Run the model and get the logits
    outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
    lm_logits = outputs[0]

    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
    val_loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

    return {'loss': val_loss}

  # Method that generates text using the BartForConditionalGeneration's generate() method
  def generate_text(self, text, eval_beams, early_stopping = True, max_len = 1024):
    ''' Function to generate text '''
    generated_ids = self.model.generate(
        text["input_ids"],
        attention_mask=text["attention_mask"],
        use_cache=True,
        decoder_start_token_id = self.tokenizer.pad_token_id,
        num_beams= eval_beams,
        max_length = max_len,
        early_stopping = early_stopping
    )
    return [self.tokenizer.decode(w, skip_special_tokens=True, clean_up_tokenization_spaces=True) for w in generated_ids]

def freeze_params(model):
  ''' Function that takes a model as input (or part of a model) and freezes the layers for faster training
      adapted from finetune.py '''
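  for layer in model.parameters():
    # The attribute is `requires_grad`; setting it to False excludes the parameter from gradient updates
    layer.requires_grad = False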


# Create a dataloading module as per the PyTorch Lightning Docs
class SummaryDataModule(pl.LightningDataModule):
  def __init__(self, tokenizer, df, batch_size):
    super().__init__()
    self.tokenizer = tokenizer
    self.batch_size = batch_size
    self.data = df

  # Loads and splits the data into training, validation and test sets with a 60/20/20 split
  def prepare_data(self):
    self.train, self.validate, self.test = np.split(self.data.sample(frac=1), [int(.6*len(self.data)), int(.8*len(self.data))])

  # encode the sentences using the tokenizer
  def setup(self, stage):
    self.train = encode_sentences(self.tokenizer, self.train['source'], self.train['target'])
    self.validate = encode_sentences(self.tokenizer, self.validate['source'], self.validate['target'])
    self.test = encode_sentences(self.tokenizer, self.test['source'], self.test['target'])

  # Load the training, validation and test sets in Pytorch Dataset objects
  def train_dataloader(self):
    dataset = TensorDataset(self.train['input_ids'], self.train['attention_mask'], self.train['labels'])
    train_data = DataLoader(dataset, sampler = RandomSampler(dataset), batch_size = self.batch_size)
    return train_data

  def val_dataloader(self):
    dataset = TensorDataset(self.validate['input_ids'], self.validate['attention_mask'], self.validate['labels'])
    val_data = DataLoader(dataset, batch_size = self.batch_size)
    return val_data

  def test_dataloader(self):
    dataset = TensorDataset(self.test['input_ids'], self.test['attention_mask'], self.test['labels'])
    test_data = DataLoader(dataset, batch_size = self.batch_size)
    return test_data



def shift_tokens_right(input_ids, pad_token_id):
  """ Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).
      This is taken directly from modeling_bart.py
  """
  prev_output_tokens = input_ids.clone()
  index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
  prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
  prev_output_tokens[:, 1:] = input_ids[:, :-1]
  return prev_output_tokens

def encode_sentences(tokenizer, source_sentences, target_sentences, max_length=1024, min_length = 512, pad_to_max_length=True, return_tensors="pt"):
  ''' Function that tokenizes a sentence
      Args: tokenizer - the BART tokenizer; source and target sentences are the source and target sentences
      Returns: Dictionary with keys: input_ids, attention_mask, labels
  '''

  input_ids = []
  attention_masks = []
  target_ids = []
  tokenized_sentences = {}

  for sentence in source_sentences:
    encoded_dict = tokenizer(
          sentence,
          max_length=max_length,
          padding="max_length" if pad_to_max_length else None,
          truncation=True,
          return_tensors=return_tensors,
          add_prefix_space = True
      )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

  input_ids = torch.cat(input_ids, dim = 0)
  attention_masks = torch.cat(attention_masks, dim = 0)

  for sentence in target_sentences:
    encoded_dict = tokenizer(
          sentence,
          max_length=min_length,
          padding="max_length" if pad_to_max_length else None,
          truncation=True,
          return_tensors=return_tensors,
          add_prefix_space = True
      )
    # Shift the target ids to the right
    # shifted_target_ids = shift_tokens_right(encoded_dict['input_ids'], tokenizer.pad_token_id)
    target_ids.append(encoded_dict['input_ids'])

  target_ids = torch.cat(target_ids, dim = 0)


  batch = {
      "input_ids": input_ids,
      "attention_mask": attention_masks,
      "labels": target_ids,
  }

  return batch

### Function Documentation

1. **get_summary_data(dataset, train)**  
   Retrieves names, documents, and summaries from two directories: one for judgment documents and the other for their corresponding summaries.  
   - **Parameters**:  
     - `dataset`: Name of the dataset. Currently unused, because the directory paths are hard-coded in the function.  
     - `train`: Indicates whether the dataset is used for training. Currently unused.
   - **Returns**:  
     - `names`: A list of file names.  
     - `data_source`: A list containing the content of judgment documents.  
     - `data_summary`: A list containing the summary for each judgment document.  

   - **Usage**:  
     Change the `path` variable to point to the location of your dataset, which is expected to contain text files for judgment documents and summaries.

2. **get_req_len_dict(dataset, istrain)**  
   Reads a file containing statistics (such as document lengths) and creates a dictionary mapping document names to required lengths.  
   - **Parameters**:  
     - `dataset`: The dataset being used. Currently unused, because the stats file path is hard-coded.  
     - `istrain`: Indicates whether the dataset is for training. Currently unused.  
   - **Returns**:  
     - A dictionary mapping document names to required lengths.  
   - **Usage**:  
     The function reads from a file located at `/content/drive/MyDrive/IN-Abs/test-data/stats-IN-test.txt` and processes its content into a dictionary.

3. **nest_sentencesV2(document, chunk_length)**  
   Splits a document into chunks, where each chunk is a list of sentences. Each chunk's length does not exceed the specified `chunk_length`.  
   - **Parameters**:  
     - `document`: The input document to be chunked.  
     - `chunk_length`: The maximum length of each chunk, in terms of the number of words.  
   - **Returns**:  
     - A list of chunks, where each chunk is a list of sentences.  

   - **Usage**:  
     This function helps in breaking down long documents into smaller, manageable pieces that can be processed more efficiently by the model.

4. **nest_sentences(document, chunk_length)**  
   Similar to `nest_sentencesV2`, but instead of returning a list of sentences for each chunk, it joins sentences into a single string per chunk.  
   - **Parameters**:  
     - `document`: The input document to be chunked.  
     - `chunk_length`: The maximum length of each chunk, in terms of the number of words.  
   - **Returns**:  
     - A list of chunks, where each chunk is a string containing a sequence of sentences.  

   - **Usage**:  
     This function is useful when you need each chunk as a single text block, rather than a list of individual sentences, for further processing or summarization tasks.
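
The word-budget chunking described above can be sketched without NLTK. The version below is an illustration under stated assumptions, not the notebook's implementation: the name `chunk_by_words` is hypothetical, a simple regular expression stands in for `nltk.sent_tokenize`, and a sentence is moved to a new chunk whenever adding it would push the running word count past `chunk_length`.

```python
import re

def chunk_by_words(document, chunk_length):
    """Group whole sentences into chunks of at most chunk_length words."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and length + n_words > chunk_length:
            chunks.append(" ".join(current))  # close the chunk that is full
            current, length = [], 0
        current.append(sentence)
        length += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_by_words("One two three. Four five six. Seven eight.", 6)
print(chunks)  # ['One two three. Four five six.', 'Seven eight.']
```

A single sentence longer than `chunk_length` still becomes its own chunk, since sentences are never split.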


In [None]:
import glob
from nltk import tokenize
import nltk
import transformers
from torch.utils.data import DataLoader, TensorDataset, random_split, RandomSampler, Dataset
import pandas as pd
import numpy as np
import torch.nn.functional as F
import torch

def get_summary_data(dataset, train):
    '''
    function to get names, documents, and summaries

    change the path variable to the path of the dataset
    '''
    path = "/content/drive/MyDrive/IN-Abs/test-data/judgement"
    # glob returns files in arbitrary order; sort both listings so that
    # names, documents, and summaries line up by file name
    all_files = sorted(glob.glob(path + "/*.txt"))
    data_source = []
    names = []
    for filename in all_files:
        with open(filename, 'r') as f:
            p = filename.rfind("/")
            names.append(filename[p+1:])
            data_source.append(f.read())
    path = "/content/drive/MyDrive/IN-Abs/test-data/summary"
    all_files = sorted(glob.glob(path + "/*.txt"))
    data_summary = []
    for filename in all_files:
        with open(filename, 'r') as f:
            data_summary.append(f.read())

    return names, data_source, data_summary

def get_req_len_dict(dataset, istrain):
    # Read the whole stats file; the context manager closes it afterwards
    with open("/content/drive/MyDrive/IN-Abs/test-data/stats-IN-test.txt", "r") as f:
        a = f.read()
    print("File Content:", a)  # Debug: Print file content
    a = a.split("\n")
    dict_names = {}
    for i in a:
        b = i.split("\t")
        print("Split Line:", b)  # Debug: Print split line
        try:
            # Column 0 holds the file name, column 2 the required summary length
            tp = int(b[2])
            dict_names[b[0]] = tp
        except Exception as e:
            print("Error processing line:", i, "Error:", e)  # Debug: Print errors
    return dict_names

def nest_sentencesV2(document,chunk_length):
    '''
    function to chunk a document
    input:  document           - Input document
            chunk_length        - chunk length
    output: list of chunks. Each chunk is a list of sentences.
    '''
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence.split(" "))
        if length < chunk_length:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = []
            sent.append(sentence)
            # Start the new chunk's count with the sentence that opened it
            length = len(sentence.split(" "))
    if len(sent)>0:
        nested.append(sent)
    return nested

def nest_sentences(document,chunk_length):
    '''
    function to chunk a document
    input:  document           - Input document
            chunk_length        - chunk length
    output: list of chunks. Each chunk is a string.
    '''
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence.split(" "))
        if length < chunk_length:
            sent.append(sentence)
        else:
            nested.append(" ".join(sent))
            sent = []
            sent.append(sentence)
            # Start the new chunk's count with the sentence that opened it
            length = len(sentence.split(" "))
    if len(sent)>0:
        nested.append(" ".join(sent))
    return nested


### Optional
Automate the formatting of the 'stats' file so that it can be parsed easily (run this only if necessary).
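
To show why consistent separators matter, here is a minimal sketch of a parser that tolerates mixed tabs and spaces and skips malformed lines. The name `parse_required_lengths` is hypothetical; the assumption that the third column holds the required summary length follows `get_req_len_dict`, which reads `b[2]`. The first two sample rows are adapted from the stats file printed in this notebook.

```python
def parse_required_lengths(stats_text):
    """Map file name -> third column of each well-formed stats line."""
    lengths = {}
    for line in stats_text.splitlines():
        parts = line.split()  # splits on any run of tabs or spaces
        if len(parts) < 3:
            continue          # skip blank or truncated lines
        try:
            lengths[parts[0]] = int(parts[2])
        except ValueError:
            continue          # skip lines whose third column is not an integer
    return lengths

sample = "4963.txt\t1326\t596\t0.45\t40\n5994.txt  6170 1048\t0.17\t198\nbroken line\n"
required = parse_required_lengths(sample)
print(required)  # {'4963.txt': 596, '5994.txt': 1048}
```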

In [None]:
import sys
import transformers
import pandas as pd
import numpy as np
import glob
import nltk
import torch
import math
import random
import re
import argparse
import os


In [None]:
import re

def generate_modified_stats_file(input_file_path, output_file_path):
    with open(input_file_path, 'r') as infile, open(output_file_path, 'w') as outfile:
        for line in infile:
            # Split the line by one or more occurrences of tabs or spaces
            parts = re.split(r'[\t ]+', line.strip())
            # Check if the line has at least 3 parts (filename, {some_value}, required_length)
            if len(parts) >= 3:
                try:
                    # Attempt to convert the last part to an integer
                    required_length = int(parts[-1])
                    # Join the parts back together with tabs, ensuring the last part is an integer
                    modified_line = "\t".join(parts[:-1]) + "\t" + str(required_length) + "\n"
                    outfile.write(modified_line)
                except ValueError:
                    print(f"Skipping line: {line.strip()} - Could not convert {parts[-1]} to integer")

# Call the function to generate the modified file
input_file_path = '/content/drive/MyDrive/IN-Abs/test-data/stats-IN-test.txt'
output_file_path = '/content/drive/MyDrive/IN-Abs/test-data/modified_stats-IN-test.txt'  # Choose your desired output path
generate_modified_stats_file(input_file_path, output_file_path)

### Load the Fine-Tuned Model, Tokenizer, and Dataset


In [None]:
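#Reading the test documents
dataset = "IN-Abs"  # assumed placeholder: `dataset` is not defined earlier, and both helpers ignore this argument
names, data_source, data_summary = get_summary_data(dataset, "test")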
print(len(names))
print(len(data_source))
print(len(data_summary))
dict_names = get_req_len_dict(dataset, "test")
print(dict_names)

100
100
100
File Content: 4963.txt	1326	596	0.45	40
5994.txt	6170	1048	0.17	198
7130.txt	6356	1374	0.22	186
6118.txt	2216	569	0.26	95
4860.txt	2440	826	0.34	51
660.txt	14141	507	0.04	444
3436.txt	1748	322	0.18	57
6276.txt	3699	599	0.16	105
2727.txt	1525	316	0.21	52
3356.txt	1957	354	0.18	83
1531.txt	3138	514	0.16	94
2609.txt	2380	353	0.15	88
1522.txt	3210	272	0.08	93
2392.txt	2554	480	0.19	88
1378.txt	2281	205	0.09	78
4071.txt	2508	662	0.26	75
3168.txt	3271	736	0.23	91
232.txt	29144	1872	0.06	770
4568.txt	9353	978	0.1	278
2440.txt	5231	1166	0.22	163
2913.txt	2037	488	0.24	88
1697.txt	12801	1389	0.11	405
3844.txt	3379	1448	0.43	103
5364.txt	15416	3302	0.21	409
3542.txt	2344	266	0.11	91
266.txt	6510	410	0.06	123
3210.txt	3780	603	0.16	132
1974.txt	2768	529	0.19	82
6003.txt	6554	1571	0.24	161
5141.txt	3595	919	0.26	108
2207.txt	2309	717	0.31	74
2593.txt	2836	630	0.22	80
1195.txt	4389	531	0.12	155
1406.txt	2165	410	0.19	68
2627.txt	2894	752	0.26	72
6270.txt	2090	421	0.2	61
3924.txt	2858	76

In [None]:
# Loading Model and tokenizer
from transformers import BartTokenizer, BartForConditionalGeneration


tokenizer = BartTokenizer.from_pretrained('facebook/bart-large', add_prefix_space=True)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")


In [None]:
# Add special tokens since we did the same for fine tuned model (otherwise an error will occur - see finetuning to check if needed)

new_tokens = ['<F>', '<RLC>', '<A>', '<S>', '<P>', '<R>', '<RPC>']

special_tokens_dict = {'additional_special_tokens': new_tokens}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


BartScaledWordEmbedding(50272, 1024, padding_idx=1)

In [None]:
#bart_model = LitModel(learning_rate = 2e-5, tokenizer = tokenizer, model = model)

bart_model = LitModel.load_from_checkpoint("/content/drive/MyDrive/IN-Abs/model/output.ckpt",
                                       learning_rate = 2e-5, tokenizer = tokenizer, model = model)

## Define the generate summary function

### Function Documentation

- **generate_summary_gpu(nested_sentences, p=0.2)**  
   Generates a summary for each chunk of a document with the fine-tuned BART model (`bart_model`). The target length of each chunk summary is the proportion `p` of that chunk's word count.
   - **Parameters**:  
     - `nested_sentences`: A list of chunks, each a single string, as returned by `nest_sentences`.
     - `p`: The number of summary words to produce per input word. The default value is 0.2 (20% of the chunk's length).
   - **Returns**:  
     - A list of chunk summaries. Each summary is generated by the BART model and decoded from token IDs into readable text.  
     
   - **Usage**:  
     The function runs inference on a GPU by default (this can be changed to CPU) and generates a summary for each chunk in the document. The `p` parameter controls the summary's length relative to the length of the input.

   - **Process**:  
     1. For each chunk of the document, calculate the summary length based on the given proportion `p`.  
     2. Tokenize the chunk and move it to the selected device (GPU).  
     3. Generate the summary using the BART model's `generate()` method, adjusting the summary's length.  
     4. Decode the generated token IDs into human-readable text.  
     5. Return a list of all the summarized sentences.
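
Step 1 and the length adjustment in step 3 can be sketched on their own. The helper below is hypothetical (`length_window` and `margin` are names introduced here); it mirrors how the notebook derives `min_length` and `max_length` from the chunk's word count. Note that `generate()` counts tokens while the target is computed in words, so the window is only an approximation of the final word count.

```python
def length_window(chunk, p=0.2, margin=5):
    """Return (min_length, max_length) for one chunk, following the steps above."""
    target = int(p * len(chunk.split(" ")))  # target summary length in words
    return target - margin, target + margin

chunk = " ".join(["word"] * 40)
print(length_window(chunk, p=0.25))  # (5, 15)
```

For very short chunks the lower bound can become negative, so a caller may want to clamp it at zero.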


In [None]:
def generate_summary_gpu(nested_sentences,p=0.2):
  '''
    Function to generate summaries from the list containing chunks of the document
    input:  nested_sentences - chunks
            p - Number of words in summaries per word in the document
    output: document summary
    '''
  device = 'cuda' if torch.cuda.is_available() else 'cpu'  # PyTorch names the GPU device 'cuda'; falls back to the CPU
  summaries = []
  for nested in nested_sentences:
    l = int(p * len(nested.split(" ")))
    input_tokenized = tokenizer.encode(nested, truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.model.to(device).generate(input_tokenized,
                                      length_penalty=0.01,
                                      min_length=l-5,
                                      max_length=l+5)
    output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.append(output)
  summaries = [sentence for sublist in summaries for sentence in sublist]
  return summaries

## Generating the summaries

### Code Documentation

1. **Installing and Importing NLTK**  
   - **`!pip install nltk`**: This command installs the `nltk` library, which is used for natural language processing tasks such as tokenization.
   - **`import nltk`**: This imports the `nltk` library, enabling its functionality for text processing.

2. **Downloading NLTK Resources**  
   - **`nltk.download('punkt_tab')`**: This command downloads the `punkt_tab` resource, which is used by NLTK for tokenization (splitting text into sentences or words).

3. **Main Loop for Summary Generation**  
   The loop processes a dataset to generate summaries for each document. Below is a breakdown of the steps:
   - **Iterating through documents**:  
     The loop processes documents from `data_source`, starting from index 16. For each document:
     - The **name** and **document content** (`doc`) are extracted.
     - The **word count** of the document (`input_len`) is calculated by splitting the text into words.
     - The **required summary length** (`req_len`) for each document is retrieved from `dict_names`.
     - The loop prints information about the current document (name, word count, required length).
   
   - **Chunking the document**:  
     The `nest_sentences` function divides the document into chunks of whole sentences, with a budget of 1024 whitespace-separated words per chunk (words, not tokenizer tokens).

   - **Summary Length Calculation**:  
     - **`l`** is the required length divided by the number of chunks (`req_len / len(nested)`); it is computed but not used afterwards.
     - **`p`** is the ratio of the required summary length to the document's word count (`req_len / input_len`).

   - **Summary Generation**:  
     - The `generate_summary_gpu` function is called with the chunked document and the proportion `p` to generate the summary.
     - The resulting summary (`abs_summ`) is combined into a single string.

   - **Truncating the Summary**:  
     If the length of the generated summary exceeds the required length, it is truncated to `req_len` words to meet the summary length constraints.

   - **Saving the Summary**:  
     The summary is saved to a file. Each document's summary is written to a file named after the document (`name`), and the file is saved to `output_path`.

   - **Appending to `predicted_summ`**:  
     The summary is added to the list `predicted_summ` for further analysis or evaluation.

   - **Print Statements**:  
     The loop prints:
     - The document index, name, word count, and required summary length.
     - The proportion `p` and the final length of the summary.
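
The proportion and the truncation step can be checked in isolation. This is a small sketch rather than the notebook's code: `truncate_to_words` is a hypothetical helper, and the numbers are taken from the run log for `1778.txt` (1463 input words, 468 required words).

```python
def truncate_to_words(summary, req_len):
    """Keep at most req_len space-separated words of summary."""
    words = summary.split(" ")
    return " ".join(words[:req_len]) if len(words) > req_len else summary

# Values from the run log for 1778.txt
input_len, req_len = 1463, 468
p = req_len / input_len
print(round(p, 4))  # 0.3199

print(truncate_to_words("a b c d e", 3))  # a b c
```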

### Summary of Key Variables and Functions

- **`names`**: A list of document names.
- **`data_source`**: A list of document content.
- **`dict_names`**: A dictionary containing the required summary length for each document.
- **`output_path`**: Path where the generated summaries are saved.
- **`generate_summary_gpu(nested_sentences, p)`**: Function that generates summaries using GPU, based on the chunks and the word proportion `p`.
- **`nest_sentences(doc, 1024)`**: Function that splits a document into smaller chunks of sentences, with a budget of 1024 whitespace-separated words per chunk.


In [None]:
!pip install nltk
import nltk

# Download the 'punkt_tab' resource
nltk.download('punkt_tab')


predicted_summ=[]
# main loop to generate and save summaries of each document in the test dataset
for i in range(16,len(data_source)):
    name = names[i]
    doc = data_source[i]
    wc = doc.split(" ")
    input_len = len(wc)
    req_len = dict_names[name]
    print(str(i) + ": " + name +  " - " + str(input_len) + " : " + str(req_len), end = " ,")

    nested = nest_sentences(doc,1024)
    l = int(req_len/len(nested))
    p = float(req_len/input_len)
    print(p)

    abs_summ = generate_summary_gpu(nested,p)
#     print(abs_summ)
#     break
    abs_summ = " ".join(abs_summ)
    if len(abs_summ.split(" ")) > req_len:
        abs_summ = abs_summ.split(" ")
        abs_summ = abs_summ[:req_len]
        abs_summ = " ".join(abs_summ)
#     print(abs_summ)
#     break
    print(len((abs_summ.split(" "))))

    path = output_path + name
    with open(path, 'w') as file:
        file.write(abs_summ)
    predicted_summ.append(abs_summ)







[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


16: 5266.txt - 18433 : 6079 ,0.32978896544241304
5247
17: 1778.txt - 1463 : 468 ,0.3198906356801094
427
18: 6276.txt - 3590 : 599 ,0.16685236768802228
533
19: 1329.txt - 2990 : 343 ,0.11471571906354515
309
20: 4316.txt - 6756 : 1720 ,0.2545885139135583
1482
21: 5141.txt - 3505 : 919 ,0.2621968616262482
782
22: 5597.txt - 3583 : 323 ,0.09014792073681273
283
23: 5994.txt - 5963 : 1048 ,0.17575046117725976
874
24: 4917.txt - 10531 : 3299 ,0.3132655968094198
2866
25: 4641.txt - 6934 : 1457 ,0.21012402653591


## Load the summaries generated by the model to use for generating ROUGE metrics

In [None]:
import glob
path = output_path
# Get all the txt files with summaries
all_files = glob.glob(path + "/*.txt")

predicted_summary = []

# Get the summaries
for filename in all_files:
    with open(filename, 'r') as f:
        a = f.read()
        predicted_summary.append(a)
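
Because `glob.glob` returns files in arbitrary order, a generated summary is only scored against its own reference if the two lists are paired by file name. The sketch below shows one way to do that, assuming the reference summaries use the same file names as the generated ones (the notebook writes each output to `output_path + name`). The helper `pair_by_filename` is hypothetical, and the temporary folders merely stand in for the real directories.

```python
import os
import tempfile

def pair_by_filename(pred_dir, ref_dir):
    """Return (name, prediction, reference) for files present in both folders."""
    shared = sorted(set(os.listdir(pred_dir)) & set(os.listdir(ref_dir)))
    pairs = []
    for name in shared:
        with open(os.path.join(pred_dir, name)) as p, open(os.path.join(ref_dir, name)) as r:
            pairs.append((name, p.read(), r.read()))
    return pairs

# Tiny self-contained demo with temporary folders
with tempfile.TemporaryDirectory() as pred_dir, tempfile.TemporaryDirectory() as ref_dir:
    demo = [(pred_dir, {"2.txt": "pred two", "1.txt": "pred one"}),
            (ref_dir, {"1.txt": "ref one", "2.txt": "ref two", "3.txt": "ref three"})]
    for folder, files in demo:
        for name, text in files.items():
            with open(os.path.join(folder, name), "w") as f:
                f.write(text)
    pairs = pair_by_filename(pred_dir, ref_dir)

print(pairs)  # [('1.txt', 'pred one', 'ref one'), ('2.txt', 'pred two', 'ref two')]
```

References without a generated counterpart, such as `3.txt` in the demo, are left out of the pairing.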


### ROUGE Evaluation Metrics

Here R is the reference summary and C is the candidate (generated) summary.

1) **ROUGE[n]-recall** =  

   $$\frac{\text{number of n-grams that appear in both R and C}}{\text{number of n-grams in R}}$$


2) **ROUGE[n]-precision** =  

   $$\frac{\text{number of n-grams that appear in both R and C}}{\text{number of n-grams in C}}$$


3) The **ROUGE[n]-F1 score** is then defined as:

$$
\text{ROUGE[n]-F1} = \frac{2 \times \text{ROUGE[n]-recall} \times \text{ROUGE[n]-precision}}{\text{ROUGE[n]-recall} + \text{ROUGE[n]-precision}}
$$

4) While **ROUGE-N** is based on the overlap of n-consecutive words in both the reference and the automatically produced text, **ROUGE-L** considers the longest common subsequence of words (LCS) — even if they aren’t consecutive, but still in order.

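5) **ROUGE-Lsum** applies ROUGE-L at the summary level: it first splits the summaries into sentences (the `rouge-score` package treats newlines as sentence boundaries), then computes the LCS overlap sentence by sentence and aggregates the result.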
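
The ROUGE[n] definitions above can be worked through by hand on a short example. The following is a minimal sketch with whitespace tokenization and no stemming, so its numbers will not match the `rouge-score` package exactly; the function names `ngrams` and `rouge_n` are illustrative.

```python
from collections import Counter

def ngrams(text, n):
    """Count the n-grams of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Return (recall, precision, f1) from clipped n-gram overlap."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # each n-gram counts min(ref, cand) times
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

recall, precision, f1 = rouge_n("the cat sat on the mat", "the cat on the mat")
print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.833 1.0 0.909
```

Five of the six reference unigrams appear in the candidate, which gives a recall of 5/6, a precision of 5/5, and an F1 of 10/11.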

**Reference**: [Mastering ROUGE Matrix: Your Guide to Large Language Model Evaluation for Summarization with Examples](https://dev.to/aws-builders/mastering-rouge-matrix-your-guide-to-large-language-model-evaluation-for-summarization-with-examples-jjg)


In [None]:
!pip install rouge-score==0.1.2  # Install the rouge-score library

from rouge_score import rouge_scorer

def calculate_rouge_scores(predicted_summaries, reference_summaries):
  """Calculates all types of ROUGE scores.

  Args:
    predicted_summaries: A list of predicted summaries.
    reference_summaries: A list of reference summaries.

  Returns:
    A dictionary containing the average ROUGE scores for all types.
  """

  scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)
  all_scores = {'rouge1': [], 'rouge2': [], 'rougeL': [], 'rougeLsum': []}

  for predicted, reference in zip(predicted_summaries, reference_summaries):
    scores = scorer.score(reference, predicted)  # Note: reference first, then predicted
    for rouge_type in all_scores:
      all_scores[rouge_type].append(scores[rouge_type].fmeasure)  # Store F1-scores

  # Calculate average scores
  average_scores = {rouge_type: sum(scores) / len(scores) for rouge_type, scores in all_scores.items()}

  return average_scores

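# Example usage (assuming you have your summaries loaded).
# The lists are paired by position, so this assumes predicted_summary and
# data_summary are in the same file order.
predicted_summaries = predicted_summary
reference_summaries = data_summary[:len(predicted_summary)]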

rouge_scores = calculate_rouge_scores(predicted_summaries, reference_summaries)
print(rouge_scores)  # Print the average ROUGE scores

{'rouge1': 0.3652336980728314, 'rouge2': 0.10150773473356164, 'rougeL': 0.18457206231026227, 'rougeLsum': 0.34250539252031587}


### ROUGE Scores

The average ROUGE F1 scores over the evaluated summaries, rounded to four decimal places, are as follows:

- **ROUGE-1**: 0.3652
- **ROUGE-2**: 0.1015
- **ROUGE-L**: 0.1846
- **ROUGE-Lsum**: 0.3425
