# Final Report After T5 Model Training

### Step 1
- **Get the data from the BBC News ZIP file**
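
A minimal sketch of this step, assuming the archive pairs each article with a summary under parallel folders (`News Articles/<category>/<file>.txt` and `Summaries/<category>/<file>.txt`). The folder names, the `latin-1` decoding, and the helper name `load_article_summary_pairs` are assumptions for illustration, not taken from the original notebook.

```python
import zipfile
from pathlib import PurePosixPath


def load_article_summary_pairs(zip_path, articles_dir="News Articles",
                               summaries_dir="Summaries"):
    """Read (article, summary) pairs from the archive into a list of dicts.

    Assumed layout (hypothetical): <root>/<articles_dir>/<category>/<file>.txt
    with a matching <root>/<summaries_dir>/<category>/<file>.txt.
    """
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
        for name in sorted(names):
            parts = PurePosixPath(name).parts
            # Keep only text files that live under the articles folder.
            if not name.endswith(".txt") or articles_dir not in parts:
                continue
            idx = parts.index(articles_dir)
            # Build the path of the matching summary file.
            summary_name = str(
                PurePosixPath(*parts[:idx], summaries_dir, *parts[idx + 1:])
            )
            if summary_name not in names:
                continue  # skip articles that have no paired summary
            rows.append({
                "category": parts[idx + 1],
                "text": archive.read(name).decode("latin-1").strip(),
                "summary": archive.read(summary_name).decode("latin-1").strip(),
            })
    return rows
```

The returned list can be passed straight to `pandas.DataFrame(...)` to obtain a frame with `text` and `summary` columns.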

### Step 2
- **Data Visualization and Processing**
  - Explore the data to get some initial insights

### Step 3
- **Import the Necessary Libraries**
  - I include some libraries here that I do not end up using; they were added while trying to fix errors.

### Step 4

### Custom Dataset Class for News Summarization with Tokenization and Encoding

1. **Initialization**:
   - The `NewsSummaryDataset` class serves to process the data for news summarization.
   - It takes input data in the form of a Pandas DataFrame, a tokenizer (in this case, T5Tokenizer), and sets maximum token lengths for text and summary.
   
2. **Data Processing**:
   - The `__getitem__` method retrieves the text and summary data from the DataFrame.
   - The text and summary data are tokenized and preprocessed using the provided `tokenizer`. The data is converted to token IDs and attention masks.
   
3. **Label Creation**:
   - The summary is tokenized with special tokens and padding. The labels are the summary token IDs with the padded positions masked, so the model ignores them when the training loss is computed.
   
4. **Returned Dictionary**:
   - It constructs and returns a dictionary containing the following key-value pairs:
       - `'text'`: Original text content.
       - `'summary'`: Original summary content.
       - `'text_input_ids'`: Flattened token IDs of the tokenized text.
       - `'text_attention_mask'`: Flattened attention mask for the tokenized text.
       - `'labels'`: Flattened token labels (with padded areas masked).
       - `'labels_attention_mask'`: Flattened attention mask for tokenized summaries.

5. **Length Information**:
   - The `__len__` method returns the length of the dataset, indicating the number of records or samples present in the input DataFrame.

This code prepares the dataset for training a summarization model by converting each row into the tensors the model expects as input.
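
The points above can be sketched roughly as follows. This is a reconstruction from the description, not the original code: the column names `text` and `summary`, the default lengths (512 and 128), and the private helper `_encode` are assumptions. The value `-100` is the index that PyTorch's cross-entropy loss ignores by default, which is why it is commonly used to mask padding in labels.

```python
import pandas as pd
from torch.utils.data import Dataset


class NewsSummaryDataset(Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer,
                 text_max_token_len=512, summary_max_token_len=128):
        self.data = data
        self.tokenizer = tokenizer
        self.text_max_token_len = text_max_token_len        # assumed default
        self.summary_max_token_len = summary_max_token_len  # assumed default

    def __len__(self):
        # One sample per DataFrame row.
        return len(self.data)

    def _encode(self, text, max_length):
        # Hypothetical helper: pad/truncate to a fixed length, return tensors.
        return self.tokenizer(
            text,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text_encoding = self._encode(row["text"], self.text_max_token_len)
        summary_encoding = self._encode(row["summary"], self.summary_max_token_len)

        # Mask padded label positions so the loss skips them.
        labels = summary_encoding["input_ids"].clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            "text": row["text"],
            "summary": row["summary"],
            "text_input_ids": text_encoding["input_ids"].flatten(),
            "text_attention_mask": text_encoding["attention_mask"].flatten(),
            "labels": labels.flatten(),
            "labels_attention_mask": summary_encoding["attention_mask"].flatten(),
        }
```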

### Step 5

### LightningModule Definition for T5-based News Summarization
- This section defines a PyTorch Lightning `LightningModule` that wraps a T5 model for news summarization. Here's a breakdown of the code:

1. **Initialization**:
   - The `NewsSummaryModel` class initializes the LightningModule, setting up the T5 model (`T5ForConditionalGeneration`) from the `'t5-base'` checkpoint and enabling the return of a dictionary from the model's forward pass.

2. **Forward Method**:
   - The `forward` method is designed to receive input arguments for `input_ids`, `attention_mask`, `decoder_attention_mask`, and optional `labels`. It performs a forward pass through the T5 model with these inputs.
   - The method returns the loss and logits calculated from the model output.

3. **Training Step**:
   - The `training_step` method defines a single step of the training loop. It extracts the input IDs, attention masks, and labels from the batch dictionary provided by the DataLoader.
   - It calls `forward` with these inputs to obtain the loss, logs the training loss with `self.log` for visualization, and returns the loss.

4. **Configure Optimizers**:
   - The `configure_optimizers` method specifies the optimizer used for training. In this case, it employs the `AdamW` optimizer for updating the model parameters.

This script provides the necessary elements to configure the T5 model within a PyTorch Lightning setup for training on the news summarization dataset.

### Step 6

### Lightning Data Module Setup for News Summary Dataset
- This code defines a Lightning Data Module called `NewsSummaryDataModule`, used for organizing and handling the datasets for news summarization. Here's a breakdown of this script:

1. **Initialization**:
   - The `__init__` method sets up the parameters for this data module. It takes `train_df` and `test_df` DataFrames, a tokenizer, and other optional parameters such as `batch_size`, `text_max_token_len`, and `summary_max_token_len`.

2. **Setup Method**:
   - The `setup` method prepares the datasets by creating one `NewsSummaryDataset` instance for the training set and one for the testing set.
   - Both instances receive the tokenizer and maximum token lengths given to the data module, so the two splits are preprocessed identically.

3. **Train Dataloader**:
   - The `train_dataloader` method creates and returns a `DataLoader` for the training dataset. It uses the `train_dataset` initialized in the `setup` method.
   - The `DataLoader` is configured with the specified `batch_size`, set to shuffle the data, and allows a specified number of workers for loading the data in parallel.

This script helps organize and load data into the PyTorch Lightning training system by setting up the necessary data modules for the training process.

### Step 7
- Initialization and Training Configuration for the News Summary Model

### Step 8
- Train the model

### Step 9
- The `summarize_text` function generates a summary from the input text using the trained T5 model. Below is a breakdown of this code:

1. **Text Encoding**:
   - The provided `text` input is tokenized and encoded using the specified tokenizer.
   - The text is converted into tokens, with `max_length`, `padding`, and `truncation` passed as arguments to the tokenizer call.
   - The `return_tensors` parameter is set to return PyTorch tensors.

2. **Generating the Summary**:
   - The encoded input text is passed to the T5 model to generate a summary.
   - The `generate` method from `model.model` is utilized to create a summary.
   - It specifies parameters such as `max_length` (the maximum length of the generated text), `num_beams` (the number of beams in beam search), `length_penalty` (an exponent applied to the sequence length when beam scores are normalized, so positive values favor longer outputs), and `early_stopping` (whether beam search stops as soon as `num_beams` complete candidates have been found).
   - The `generate` method returns generated token IDs for the summary.

3. **Decoding and Joining**:
   - The generated token IDs are decoded using the tokenizer, skipping special tokens and cleaning up tokenization spaces.
   - The decoded tokens are joined together to form the final summarized text, which is then returned.

This function combines tokenization, model inference, and decoding to generate a summary of the provided text using the trained T5 model.
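
The three stages can be sketched as below. Passing `model` and `tokenizer` as arguments, and the specific values for `max_input_len`, `max_summary_len`, `num_beams`, and `length_penalty`, are assumptions; the report does not state the settings actually used. `model` is expected to be the Lightning wrapper, so the underlying T5 is reached through `model.model`.

```python
def summarize_text(text, model, tokenizer,
                   max_input_len=512, max_summary_len=150, num_beams=2):
    # 1. Encode the article as fixed-length PyTorch tensors.
    encoding = tokenizer(
        text,
        max_length=max_input_len,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt",
    )

    # 2. Generate summary token IDs with beam search.
    generated_ids = model.model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        max_length=max_summary_len,
        num_beams=num_beams,
        length_penalty=1.0,   # assumed value
        early_stopping=True,
    )

    # 3. Decode each generated sequence and join the pieces.
    pieces = [
        tokenizer.decode(
            ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        for ids in generated_ids
    ]
    return "".join(pieces)
```

With a single input text, `generate` returns one sequence, so the join simply yields that one decoded summary.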

### Step 10
- ROUGE scores for the summarization model

This code computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores for the model-generated summaries compared to the reference summaries present in the test dataset. Here's a breakdown of what this code does:

1. **Initialization**:
   - It creates a `RougeScorer` object to calculate ROUGE scores. The scorer is configured to compute ROUGE-1, ROUGE-2, and ROUGE-L scores, and it uses stemming.

2. **ROUGE Scores Calculation**:
   - It iterates through each row in the `test_df` DataFrame to get the reference summary and the generated summary.
   - For each row, it uses the `summarize_text` function (defined in the previous step) to generate a summary from the article in that row.
   - It then computes ROUGE scores by comparing the generated summary (`model_summary`) with the reference summary (`reference_summary`) for each row.

3. **Aggregation of Scores**:
   - For each ROUGE score (ROUGE-1, ROUGE-2, and ROUGE-L), the code aggregates the scores across all the examples in the test dataset.

4. **Average ROUGE Scores**:
   - It calculates the average ROUGE scores for all the examples in the test set for each ROUGE metric.

5. **Displaying ROUGE Scores**:
   - It prints out the overall ROUGE scores for ROUGE-1, ROUGE-2, and ROUGE-L.

This script evaluates the model's summarization quality by computing ROUGE scores for the generated summaries against the reference summaries in the test dataset. Adjustments may be required depending on the actual structure and content of your `test_df` and the implementation of the `summarize_text` function.

### Step 11

**Performance Metrics Summary:**

- **ROUGE-1:** Achieved a score of 0.6012, indicating strong unigram (single-word) overlap between the model-generated and human-written summaries.

- **ROUGE-2:** Scored 0.5220, indicating good overlap in bigrams (two-word sequences) between the model-generated and reference summaries.

- **ROUGE-L:** Obtained a score of 0.4470, indicating moderate agreement in word order, as measured by the longest common subsequence between the model-generated and human-written summaries.
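
To make the three metrics concrete, here is a simplified, dependency-free illustration of what each one counts. It uses plain whitespace tokenization and no stemming, so it is a teaching sketch and will not reproduce the scores reported above; the function names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    # Multiset of all contiguous n-token windows.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, ref_total, cand_total):
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    """F1 over n-gram overlap (n=1 for ROUGE-1, n=2 for ROUGE-2)."""
    ref = _ngrams(reference.lower().split(), n)
    cand = _ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # matches clipped by reference counts
    return _f1(overlap, sum(ref.values()), sum(cand.values()))


def rouge_l(reference, candidate):
    """F1 over the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Dynamic-programming table for the LCS length.
    table = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            if r == c:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[-1][-1], len(ref), len(cand))
```

ROUGE-1 and ROUGE-2 ignore where the matching n-grams occur, while ROUGE-L rewards words that appear in the same order even when other words sit between them.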
