
# NLP Project 3: Comparative Analysis of Summarization Models

## Introduction

This segment of the research project evaluates and compares how several state-of-the-art language models perform at automated text summarization. The central objective is to understand how the models behave under varying input conditions, particularly input length, and to analyze the differences in the summaries they produce.

## Methodology

### Dataset Generation

To facilitate a comprehensive and unbiased comparison, a specialized program has been developed. It generates new datasets of summaries based on a set truncation length. This gives all models under scrutiny a level playing field and lets us compare the effect that input length has on the quality of summaries.

The program uses an apiEndpoint factory whose endpoints all implement a shared interface, so every API endpoint operates under identical logical conditions. At runtime, the program asks the user for a prompt, a videoId, and a caption type. These are then passed to each API endpoint sequentially, ensuring identical inputs for every model used.
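
A minimal sketch of that factory pattern is shown here. It is not the code from the repository: the class name `SummaryEndpoint`, the functions `createEndpointFactory` and `runAllModels`, and the shape of the `input` object are all assumptions made for illustration.

```javascript
// Hypothetical sketch: none of these names are taken from the repository.

// Shared interface: every endpoint exposes the same summarize(input) method.
class SummaryEndpoint {
  constructor(name, callModel) {
    this.name = name;
    this.callModel = callModel; // async (input) => summary string
  }

  async summarize(input) {
    return this.callModel(input);
  }
}

// Factory: builds an endpoint by model name from a registry of model callers.
function createEndpointFactory(registry) {
  return function createEndpoint(name) {
    const callModel = registry[name];
    if (typeof callModel !== "function") {
      throw new Error(`Unknown model: ${name}`);
    }
    return new SummaryEndpoint(name, callModel);
  };
}

// Runs the endpoints one after another, handing each the same frozen input.
async function runAllModels(endpoints, input) {
  const sharedInput = Object.freeze({ ...input });
  const results = [];
  for (const endpoint of endpoints) {
    const summary = await endpoint.summarize(sharedInput);
    results.push({ model: endpoint.name, summary });
  }
  return results;
}
```

Freezing the input object is one way to guarantee that no endpoint can alter what the next one receives. The real program may enforce this differently.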


Link to GitHub for the dataset generator (specifically, the file I generated the datasets with): https://github.com/TheMaxta/summaryFactory/blob/main/runAllModels.js


### Prompts used

For summary models: none. They are all passed identical transcripts and already know how to handle the text.

For Chatbots: `Summarize the following transcript:\n\n${transcript}`

All chatbots always receive identical prompts.
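
The two input paths described in this section could be expressed as a single helper. This is only a sketch: the function name `buildModelInput` and the `kind` values are hypothetical, and only the prompt string itself comes from the project.

```javascript
// Hypothetical helper: the function name and the `kind` values are assumptions.
function buildModelInput(kind, transcript) {
  if (kind === "chatbot") {
    // Chat models need an instruction in front of the transcript.
    return `Summarize the following transcript:\n\n${transcript}`;
  }
  // Dedicated summarization models receive the raw transcript with no prompt.
  return transcript;
}
```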


### Models for Comparison

The models included in this analysis are:

- **GPT-3.5 16k:** An advanced language model known for its versatility and depth in understanding context.
- **BART:** A transformer-based model designed for sequence-to-sequence tasks, excelling in summarization.
- **Pegasus:** Specifically fine-tuned for abstractive text summarization, known for generating more coherent summaries.
- **Text Summarization Model:** A general model for summarization tasks.
- **LedLargeBookSummarization:** A model tailored for summarizing longer texts, such as books.



### Complications Encountered

- **Data Preparation:** Preparing the datasets for each model was a challenge, as each model had different input requirements and sensitivities. 

- **Interpreting Results:** Deciphering the subtleties in the summaries generated by each model and comparing them objectively proved to be a complex task, given the subjective nature of summarization quality.

- **Consistency in Evaluation:** Establishing a consistent and fair framework for evaluating the performance of each model was challenging. I had to do a lot of guesswork for this project, and I plan to implement a more robust benchmarking technique in future projects, like the ones seen on model pages on Hugging Face.
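
As one possible direction for that future benchmarking: summarization model pages on Hugging Face commonly report ROUGE scores. The sketch below computes a simplified ROUGE-1 (unigram overlap) score. It assumes a human-written reference summary exists for each video, which this project does not currently have, and the tokenizer is a crude stand-in for what a real ROUGE implementation does.

```javascript
// Simplified ROUGE-1 sketch. Assumes reference summaries exist (not part of this project).

// Lowercase the text and keep runs of letters, digits, and apostrophes as tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

function countTokens(tokens) {
  const counts = new Map();
  for (const token of tokens) counts.set(token, (counts.get(token) ?? 0) + 1);
  return counts;
}

function total(counts) {
  let sum = 0;
  for (const n of counts.values()) sum += n;
  return sum;
}

// Unigram overlap between a candidate summary and a reference summary.
function rouge1(candidate, reference) {
  const cand = countTokens(tokenize(candidate));
  const ref = countTokens(tokenize(reference));
  let overlap = 0;
  for (const [token, n] of cand) overlap += Math.min(n, ref.get(token) ?? 0);
  const candTotal = total(cand);
  const refTotal = total(ref);
  const precision = candTotal ? overlap / candTotal : 0;
  const recall = refTotal ? overlap / refTotal : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```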

### Experiment Design

The experiment will be conducted as follows:

1. **Input Preparation:** Identical input content will be fed into each language model. This ensures fairness and consistency in the comparison.
2. **Truncation Variations:** The program will generate datasets for inputs with lengths of 1500, 2000, and 2500 characters. This is to observe how the models perform under different input length constraints.
3. **Quality Assessment:** The quality of the generated summaries will be evaluated. This will involve assessing coherence, relevancy, and conciseness.
4. **Comparative Analysis:** Observations will be made on how the summaries vary across different models and input lengths.
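
The truncation step in the list above can be sketched as follows. The record shape and function names are assumptions, and the sketch cuts at a raw character offset. The source does not say whether the real program respects word boundaries.

```javascript
// Hypothetical sketch of the truncation step; names and record shape are assumptions.
const TRUNCATION_LENGTHS = [1500, 2000, 2500];

// Cut a transcript to at most maxChars characters (may split a word).
function truncateTranscript(transcript, maxChars) {
  return transcript.slice(0, maxChars);
}

// Build one input record per (video, truncation length) pair.
function buildTruncatedInputs(transcripts, lengths = TRUNCATION_LENGTHS) {
  const inputs = [];
  for (const { videoId, transcript } of transcripts) {
    for (const maxChars of lengths) {
      inputs.push({
        videoId,
        maxChars,
        transcript: truncateTranscript(transcript, maxChars),
      });
    }
  }
  return inputs;
}
```

Each record would then be sent unchanged to every model, so that differences between summaries can be attributed to the model or the input length rather than to the input text.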


## Objectives for Summaries:
  - Summary saves user time.
  - Extract the main topics from video.
  - A great summary logically orders all main points in the same order the video addresses them.
  - Summary should stay objective without directly replicating the words in the video.
  - The best summary model understands the underlying representations of the text, extracts the main points, and then extrapolates into its own language.

### Notes on Models post dataset review:

#### GPT-3.5 16k:
- 

#### BART:
- Bart does seem to perform really well on longer input lengths, but it almost always replicates the words used in the video without extrapolating at all.
- Bart cannot handle large inputs, which is a large part of the reason this analysis does not extend past 1500 characters of transcript data.
- Based on the responses given, Bart tends to cut down on filler words rather than understand the underlying meaning of the text. (However, it does extract the main points.)
- Bart seems to be a great tool for narrowing down the main points in a text.



#### Pegasus:
- Pegasus is primarily trained on summarization-specific datasets, which could explain why its output is so different from that of the other models in use (they were trained on a wider range of data).
- Pegasus greatly condenses giant swathes of text down to one or two main points. It seems to do a really good job of getting the MAIN point, but it is not very useful for detailed summarization.
- Certain content is not summarized very well. For example, in the dataset for a video titled "Vectoring Words (Word Embeddings) - Computerphile", the speaker uses an analogy to break down the subject. Pegasus does not extract the main point; instead we get a nonsensical sentence repeating "cat" and "dog" over and over. This is likely because the analogy is about a cat and a dog, and it suggests that the model latches onto repetition a little too hastily.

#### Text Summarization (falcon):
- In some cases, the Falcon Text Summarization model does the best job of retaining the actual names and factual details in videos. Where other models cut out important details, like the mathematical concepts discussed in the video, this model kept that information in the summary.
- Almost all summaries replicate the speaking voice and personality of the speaker in the video, almost to an annoying degree, which possibly reduces the quality of the summary. I have noticed, however, that summaries of longer inputs replicate the speaker's personality less.

#### LedLargeBookSummarization:
- This model broke. It worked in testing, but I could not get it to run as a batch job. 





## GPT-3.5
- **Type:** Neural Network (Transformers)
- **Architecture:** Decoder-only architecture
- **Mechanism:** 
  - Utilizes a modified version of the transformer architecture, fundamentally based on self-attention mechanisms.
  - Capable of understanding and generating human-like text by predicting the next word in a sequence.
- **Datasets:** 
  - Trained on a diverse range of internet text.
  - Specific datasets are not publicly disclosed, but it includes books, websites, and other texts available online up to its training cut-off in 2021.

  
## BART (Bidirectional and Auto-Regressive Transformers)

- **Type:** Neural Network (Transformers)
- **Architecture:** Encoder-Decoder architecture
- **Mechanism:** 
  - Trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text.
  - Combines bidirectional context in the encoder with autoregressive capabilities in the decoder.
- **Datasets:** 
  - Primarily trained on large-scale corpora like BookCorpus and English Wikipedia.
  - Additional training on diverse text sources for a well-rounded language understanding.
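
To make the "corrupt, then reconstruct" idea concrete, here is a toy sketch of how one denoising training pair could be built. It uses simple token masking, which is only one of several noising schemes described for BART, and the function names, the `<mask>` placeholder, and the injectable `random` parameter are assumptions for illustration.

```javascript
// Toy denoising-pair sketch. Token masking only; not BART's full noising pipeline.

// Replace a fraction of whitespace-separated tokens with a placeholder mask.
function maskTokens(text, maskRate = 0.3, random = Math.random) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => (random() < maskRate ? "<mask>" : token))
    .join(" ");
}

// The model would see `source` and be trained to reproduce `target`.
function makeDenoisingExample(text, maskRate, random) {
  return { source: maskTokens(text, maskRate, random), target: text };
}
```

Passing a custom `random` function makes the corruption reproducible, which is handy when inspecting examples by hand.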

## Pegasus (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models)

- **Type:** Neural Network (Transformers)
- **Architecture:** Encoder-Decoder architecture
- **Mechanism:** 
  - Specifically designed for abstractive text summarization.
  - Uses a novel pre-training objective called "gap sentences generation", aligning closely with the summarization task.
- **Datasets:** 
  - Pre-trained on large web and news corpora (C4 and HugeNews), then fine-tuned on summarization datasets such as CNN/DailyMail and XSum.
  - Designed to excel in summarization because the pre-training task resembles it: important sentences are removed from a document and the model learns to generate them from the remaining text.
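
The "gap sentences generation" objective can be illustrated with a toy sketch. It picks the sentence that shares the most words with the rest of the document, swaps it for a placeholder, and uses the removed sentence as the target. The word-overlap score is a crude stand-in for the importance scoring in the actual Pegasus work, and the function names and the `<mask>` placeholder are assumptions.

```javascript
// Toy gap-sentence sketch. The overlap score is a simplification, not Pegasus's scoring.

// Naive splitter: text after the last ., !, or ? is dropped.
function splitSentences(text) {
  return text.match(/[^.!?]+[.!?]+/g)?.map((s) => s.trim()) ?? [];
}

function wordSet(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? []);
}

// Fraction of a sentence's words that also occur elsewhere in the document.
function overlapScore(index, sentences) {
  const own = wordSet(sentences[index]);
  const rest = wordSet(sentences.filter((_, i) => i !== index).join(" "));
  let shared = 0;
  for (const word of own) if (rest.has(word)) shared += 1;
  return own.size ? shared / own.size : 0;
}

// Mask the top-scoring sentences; the model would learn to generate `target` from `source`.
function makeGapSentenceExample(text, gapCount = 1) {
  const sentences = splitSentences(text);
  const ranked = sentences
    .map((_, i) => ({ i, score: overlapScore(i, sentences) }))
    .sort((a, b) => b.score - a.score || a.i - b.i);
  const gaps = new Set(ranked.slice(0, gapCount).map((r) => r.i));
  return {
    source: sentences.map((s, i) => (gaps.has(i) ? "<mask>" : s)).join(" "),
    target: sentences.filter((_, i) => gaps.has(i)).join(" "),
  };
}
```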

## Text Summarization Model

- **Type:** Likely Neural Network (Transformers)
- **Architecture:** Encoder-Decoder or Decoder-only (varies)
- **Mechanism:** 
  - Typically uses transformer-based architectures with attention mechanisms.
  - Can vary between extractive (selecting parts of the original text) or abstractive (rewriting content in a condensed form) summarization.
- **Datasets:** 
  - The exact datasets can vary depending on the specific implementation.
  - Often trained on standard summarization datasets like CNN/DailyMail, XSum, or large-scale internet text corpora.

## LedLargeBookSummarization

- **Type:** Neural Network (Transformers)
- **Architecture:** Likely an Encoder-Decoder architecture
- **Mechanism:** 
  - Designed for long-form text summarization, such as books.
  - Likely based on the Longformer Encoder-Decoder (LED), which handles longer sequences by combining local windowed attention with global attention on selected tokens.
- **Datasets:** 
  - Likely trained on extensive literary works and long-form content sources.
  - May include datasets consisting of books, scientific papers, and comprehensive reports to handle long-text summarization.

---

Each model represents a unique approach to handling the complexities of language, tailored for specific tasks in the realm of natural language processing.

### Project Three Updates:

#### Successful Implementation of Summary Generation in ChatGPT
- **Focus:** Investigating whether models predominantly trim filler words and highlight main points without developing a unique voice.

#### New Hypothesis
- **Hypothesis:** Models don't seem to fully extrapolate into their own voice; they mainly trim filler words and identify the main points.

#### Added Features to Program for Testing Hypothesis
- **Feature 1:** Count of unique output words in each summary, i.e. words that do not appear in the input transcript.
- **Feature 2:** Measurement of summary lengths.
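
A rough sketch of how these two features might be computed is given below. The function names and the tokenization are assumptions. In particular, summary length is measured in words here, and the actual program may count differently (for example, in characters).

```javascript
// Hypothetical metric helpers; tokenization and units are assumptions.

// Lowercase the text and keep runs of letters, digits, and apostrophes as words.
function toWords(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Feature 1: distinct summary words that never appear in the transcript.
function countUniqueOutputWords(summary, transcript) {
  const transcriptWords = new Set(toWords(transcript));
  const novel = new Set(toWords(summary).filter((word) => !transcriptWords.has(word)));
  return novel.size;
}

// Feature 2: summary length, measured here in words.
function summaryLength(summary) {
  return toWords(summary).length;
}
```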

### Testing New Hypothesis

#### Data Analysis
- **Overview:** Means computed over all available truncation lengths, with all datasets combined.


| Model            | Mean Unique Words Used | Mean Generated Summary Length |
|------------------|------------------------|-------------------------------|
| GPT-3.5          | 29.25                  | 106.67                        |
| Bart             | 6.39                   | 53.22                         |
| Pegasus          | 1.42                   | 27.92                         |
| Falcon Summary   | 2.78                   | 70.22                         |


**Mean Unique Words by Truncation Length:**

#### 2500 Characters
| Model           | Mean Unique Words |
|-----------------|-------------------|
| GPT             | 31.25       |
| Bart            | 6.00        |
| Pegasus         | 1.50        |
| Falcon Summary  | 3.25        |

#### 2000 Characters
| Model           | Mean Unique Words |
|-----------------|-------------------|
| GPT             | 29.17       |
| Bart            | 7.67        |
| Pegasus         | 1.33        |
| Falcon Summary  | 2.25        |

#### 1500 Characters
| Model           | Mean Unique Words |
|-----------------|-------------------|
| GPT             | 27.33       |
| Bart            | 5.50        |
| Pegasus         | 1.42        |
| Falcon Summary  | 2.83        |
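
As a consistency check, the combined means in the first table can be recovered by averaging the three per-length values for each model. The sketch below assumes every truncation length contributed the same number of summaries, since a mean of means only equals the overall mean in that case. The variable and function names are illustrative.

```javascript
// Per-truncation-length mean unique words, copied from the three tables above.
const meanUniqueWordsByLength = {
  "GPT": { 2500: 31.25, 2000: 29.17, 1500: 27.33 },
  "Bart": { 2500: 6.0, 2000: 7.67, 1500: 5.5 },
  "Pegasus": { 2500: 1.5, 2000: 1.33, 1500: 1.42 },
  "Falcon Summary": { 2500: 3.25, 2000: 2.25, 1500: 2.83 },
};

function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Average across truncation lengths, rounded to two decimals like the tables.
function combinedMeans(table) {
  return Object.fromEntries(
    Object.entries(table).map(([model, byLength]) => [
      model,
      Number(mean(Object.values(byLength)).toFixed(2)),
    ])
  );
}

console.log(combinedMeans(meanUniqueWordsByLength));
// { GPT: 29.25, Bart: 6.39, Pegasus: 1.42, 'Falcon Summary': 2.78 }
```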


### **Findings:**
These results challenge my original hypothesis that few original words are used, and almost none by the summary-specialized models. The specialized models do use fewer unique words, but there is still a significant number of words that were not found in the input transcript.

**Input Length**:

Input length also seems to have little to no effect on the number of unique words generated in the summary.

**Additional Note**: I think it's really interesting that Bart was the second best at extrapolating to a new voice, given that it was trained to recreate texts.

We also see that, for all input lengths, GPT generates the longest summaries (mean length 106.67 across the combined datasets), with Falcon Summary coming in second (70.22). Pegasus, as I could have predicted, generates the shortest summaries of all (27.92).

**Thoughts:**

This analysis has become more interesting as recent developments have cast a shadow on GPT for regurgitating training data when prompted in specific ways. The ability to extrapolate to new words is an interesting metric. While it doesn't measure regurgitation of training data, it may hint that summaries combine words learned from training data with words from the supplied input.
