
# NLP Project 2: Comparative Analysis of Summarization Models

## Introduction

In this part of the research project, we evaluate and compare how several state-of-the-art language models perform at automated text summarization. The central objective is to understand how the models behave as the input length varies and to characterize the differences in the summaries they produce.

## Methodology

### Dataset Generation

To make the comparison comprehensive and unbiased, I developed a program that generates new datasets of summaries from transcripts truncated to a set length. This gives every model under scrutiny the same input and lets us compare the effect that input length has on the quality of the summaries.

The program uses an apiEndpoint factory whose endpoints all implement a shared interface, which ensures that every API endpoint operates under identical logical conditions. At runtime, the program asks the user for a prompt, a videoId, and a caption type. These are then passed to each API endpoint in sequence, so every model receives identical inputs.
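
The factory idea can be sketched roughly as follows. This is a Python illustration only, not the code in the linked repository (which is JavaScript). The names `SummaryEndpoint`, `FirstSentenceEndpoint`, `LeadWordsEndpoint`, `make_endpoint`, and `run_all_models` are hypothetical, and the two endpoints are stand-ins that do not call any real model.

```python
from abc import ABC, abstractmethod
from typing import Dict


class SummaryEndpoint(ABC):
    """Hypothetical shared interface: every endpoint accepts the same input."""

    name = "base"

    @abstractmethod
    def summarize(self, captions: str) -> str:
        """Return a summary of the given captions."""


class FirstSentenceEndpoint(SummaryEndpoint):
    """Stand-in for a real model: returns the first sentence only."""

    name = "first-sentence"

    def summarize(self, captions: str) -> str:
        return captions.split(".")[0].strip() + "."


class LeadWordsEndpoint(SummaryEndpoint):
    """Second stand-in: returns the first eight words."""

    name = "lead-words"

    def summarize(self, captions: str) -> str:
        return " ".join(captions.split()[:8])


# The factory maps a model name to a class implementing the interface.
ENDPOINTS = {cls.name: cls for cls in (FirstSentenceEndpoint, LeadWordsEndpoint)}


def make_endpoint(name: str) -> SummaryEndpoint:
    return ENDPOINTS[name]()


def run_all_models(captions: str) -> Dict[str, str]:
    """Pass the same captions to each endpoint, one after another."""
    return {name: make_endpoint(name).summarize(captions) for name in ENDPOINTS}


summaries = run_all_models("Word embeddings map words to vectors. They are useful.")
print(summaries)
```

Because every endpoint is created through the same factory and called through the same method, adding a model means adding one class rather than changing the comparison loop.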

I plan to use chatbots in the final project by supplying the prompt "please summarize the text {captions}" and, again, passing identical captions through the API endpoint factory. However, I could not get my OpenAI account working before the deadline. (This video tripped up most of the models, although some did mention the main points at the end of the summary.)
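
Filling in the chatbot prompt could look something like the sketch below. The template string is the one quoted above; the function name `build_chat_prompt` and the `max_chars` parameter are assumptions made for illustration.

```python
PROMPT_TEMPLATE = "please summarize the text {captions}"


def build_chat_prompt(captions: str, max_chars: int) -> str:
    """Truncate the captions, then substitute them into the template."""
    return PROMPT_TEMPLATE.format(captions=captions[:max_chars])


print(build_chat_prompt("Word embeddings map words to vectors.", 15))
```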

Link to the GitHub repository for the dataset generator (specifically, the file I generated the datasets with): https://github.com/TheMaxta/summaryFactory/blob/main/runAllModels.js

### Prompts used

For summary models: None, they are all passed identical transcripts and already know how to handle the text.

For Chatbots: None, I tried multiple and they all crashed at runtime.


### Models for Comparison

The models included in this analysis are:

- **GPT-4:** An advanced language model known for its versatility and depth in understanding context.
- **BART:** A transformer-based model designed for sequence-to-sequence tasks, excelling in summarization.
- **Pegasus:** Pre-trained with an objective designed specifically for abstractive text summarization, known for generating more coherent summaries.
- **Text Summarization Model:** A general model for summarization tasks (referred to below as Falcon).
- **LedLargeBookSummarization:** A model tailored for summarizing longer texts, such as books.



### Complications Encountered

- **Data Preparation:** Preparing the datasets for each model was a challenge, as each model had different input requirements and sensitivities. 

- **Interpreting Results:** Deciphering the subtleties in the summaries generated by each model and comparing them objectively proved to be a complex task, given the subjective nature of summarization quality.

- **Consistency in Evaluation:** Establishing a consistent and fair framework for evaluating the performance of each model was challenging. I had to do a lot of guesswork for this project, and I plan to implement a more robust benchmarking technique in future projects, like the benchmarks reported on model pages on Hugging Face.
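
One candidate for a more robust benchmark is ROUGE, which many summarization model pages on Hugging Face report. The sketch below is a simplified ROUGE-N written from scratch for illustration; it is not the official implementation, and it assumes a human-written reference summary exists for each video, which this project does not yet have. The names `tokenize` and `rouge_n` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: n-gram overlap between candidate and reference."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(tokenize(candidate))
    ref = ngrams(tokenize(reference))
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat", "the cat sat on the mat")
print(scores)
```

A real benchmark would more likely rely on an established ROUGE implementation; the point here is only to show what the score measures.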

### Experiment Design

The experiment will be conducted as follows:

1. **Input Preparation:** Identical input content will be fed into each language model. This ensures fairness and consistency in the comparison.
2. **Truncation Variations:** The program will generate datasets for inputs with lengths of 500, 1000, and 1500 characters. This is to observe how the models perform under different input length constraints.
3. **Quality Assessment:** The quality of the generated summaries will be evaluated. This will involve assessing coherence, relevancy, and conciseness.
4. **Comparative Analysis:** Observations will be made on how the summaries vary across different models and input lengths.
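
The truncation step can be sketched in a few lines. This is a minimal Python illustration rather than the generator's actual code; `truncate_variants` is a hypothetical name, and cutting at a raw character offset (possibly mid-word) is an assumption about how the truncation behaves.

```python
TRUNCATION_LENGTHS = (500, 1000, 1500)


def truncate_variants(transcript, lengths=TRUNCATION_LENGTHS):
    """Return one truncated copy of the transcript per character limit."""
    return {limit: transcript[:limit] for limit in lengths}


variants = truncate_variants("x" * 1200)
for limit, text in variants.items():
    print(limit, len(text))
```

A transcript shorter than a given limit is passed through unchanged, so the 1500-character dataset can contain inputs that are shorter than 1500 characters.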


## Objectives for Summaries:
  - We want to use summaries to save time.
  - Get the general points brought up in the video.
  - A great summary logically orders all main points.
  - A summary should stay objective without directly replicating the words in the video.
  - The best summary model understands the underlying representations of the text really well, extracts the points, and then extrapolates them into its own language.
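
The "without directly replicating the words" objective can be given a rough number by measuring how many of a summary's word n-grams appear verbatim in the transcript. The sketch below is one possible way to do that; the function name `copied_ngram_fraction` and the choice of trigrams are assumptions, not part of the project code.

```python
import re


def copied_ngram_fraction(source, summary, n=3):
    """Fraction of the summary's n-grams that appear verbatim in the source."""

    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    src = tokens(source)
    summ = tokens(summary)
    source_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    summary_ngrams = [tuple(summ[i:i + n]) for i in range(len(summ) - n + 1)]
    if not summary_ngrams:
        return 0.0
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)


source = "the quick brown fox jumps over the lazy dog"
print(copied_ngram_fraction(source, "the quick brown fox"))
print(copied_ngram_fraction(source, "a fast fox leaps"))
```

A value near 1 suggests the summary is mostly copied from the transcript, while a value near 0 suggests it was rephrased.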

### Notes on Models post dataset review:

#### GPT-4:
- Unfortunately, the model broke before I could submit the assignment. I ran out of API usage and OpenAI never reinstated my API key, so I couldn't generate summaries.

#### BART:
- BART does seem to perform really well on longer input lengths, but it almost always replicates the words used in the video without extrapolating at all.
- BART cannot handle large inputs, and this is a large part of the reason this analysis does not extend past 1500 characters of transcript data.
- Based on the responses given, BART tends to cut down on filler words rather than understand the underlying meaning of the text. (However, it does extract main points well.)
- BART seems to be a great tool for narrowing down the main points in a text.
- The model was trained for transcript summarization, so it is no surprise that it performs pretty well on transcripts.


#### Pegasus:
- Pegasus is primarily trained on summarization-specific datasets, which could explain why its output is so different from the other models in use (they were trained on a wider range of data).
- Pegasus greatly condenses giant swathes of text down to one or two main points presented in the text. It seems to do a really good job of getting the MAIN point, but is not very useful for detailed summarization.
- Certain content is not summarized very well. For example, in the dataset for the video titled "Vectoring Words (Word Embeddings) - Computerphile", the speaker uses an analogy to break down the subject. Pegasus does not extract the main point; instead, we get a nonsensical sentence repeating "cat" and "dog" over and over. This is likely because the analogy is about a cat and a dog, and it suggests that the model latches onto repetition a little too hastily.
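
Degenerate outputs like the repeated cat-and-dog sentence could be flagged automatically by checking how often a summary repeats its own word pairs. This is a sketch under that assumption; `repeated_ngram_fraction` is a hypothetical helper, and any threshold for "too repetitive" would need tuning on real outputs.

```python
import re
from collections import Counter


def repeated_ngram_fraction(summary, n=2):
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    tokens = re.findall(r"[a-z0-9']+", summary.lower())
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(grams)


print(repeated_ngram_fraction("cat and dog and cat and dog and cat and dog"))
print(repeated_ngram_fraction("word embeddings map words to vectors"))
```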

#### Text Summarization (falcon):
- In some cases, the Falcon Text Summarization model does the best job of retaining the actual names and factual details in videos. Where other models cut out important details, like the mathematical concepts discussed in the video, this model kept that information in the summary.
- Almost all summaries replicate the speaking voice and personality of the speaker in the video, almost to an annoying degree, which possibly reduces the quality of the summary. However, I noticed that summaries of longer inputs replicate the speaker's personality less.

#### LedLargeBookSummarization:
- This model also broke. It worked in testing, but I could not get it to run as a batch job. 

### In conclusion:
- None of the models extrapolate into their own voice; rather, they cut out filler words really well and pick up some of the main points (leave the extrapolating to GPT-4).
- Almost all of the models clearly pick up on repetition of keywords and phrases and weight those words more heavily in the summary, while some of the better-performing models still manage to capture some of the topics despite the repetition.
- In general, these models still perform quite well for their size, especially given that they are free and easy to access. They do a good job of cutting down on filler words; however, unusual content such as analogies will trip them up and produce strange outputs.
- All of these models use transformers, and most are encoder-decoder models (the exceptions are GPT-4, which is decoder-only, and possibly the Falcon text summarization model). They rely heavily on attention mechanisms to learn the underlying meaning and representations of words. These models use embeddings, positional encodings, and attention mechanisms to process their input and output, which is likely why many of them generate similar results for identical inputs. However, the models are trained on different datasets, and that likely has a big effect on the results.
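
For readers unfamiliar with the attention mechanism mentioned in the last point, the sketch below implements standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. It is a toy version for tiny hand-written matrices, not how any of the models above are implemented; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for small lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(result)
```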

### Changes Observed:

#### Between 500 and 1000 Characters:
- Lots of summaries cut off randomly when only supplied 500 characters, and there is a little more explanation at 1000 characters.
- However, these responses are better than you might expect, because oftentimes the main topics of a video are brought up at the start. This is likely a pattern that many of the models pick up during training.

#### Between 500 and 1500 Characters:
- There is usually much more detail at 1500 characters.
- The first part of the summary is only slightly changed, while the summaries are typically much longer when supplied more input characters. But this does not happen every time.
- In one example (from BART), the 500-character summary is basically just an intro, but at 1500 characters we are introduced to the main talking points rather than the intro. However, this is one case; later on, the same model repeats the introduction in the long-form version and just includes a couple of additional details.
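
These length comparisons could be made less anecdotal by measuring, for each model, how much of the short summary survives in the long one. The sketch below uses the standard library's `difflib`; the function name `compare_summaries` and the example strings are made up for illustration and are not taken from the datasets.

```python
from difflib import SequenceMatcher


def compare_summaries(short_summary, long_summary):
    """Compare the summaries of a short and a long input from one model."""
    short_words = short_summary.split()
    long_words = long_summary.split()

    # Count how many leading words the two summaries share.
    shared_prefix = 0
    for a, b in zip(short_words, long_words):
        if a != b:
            break
        shared_prefix += 1

    similarity = SequenceMatcher(None, short_words, long_words).ratio()
    return {
        "short_words": len(short_words),
        "long_words": len(long_words),
        "shared_prefix_words": shared_prefix,
        "similarity": similarity,
    }


report = compare_summaries(
    "In this video we talk about embeddings",
    "In this video we talk about embeddings and how vectors capture meaning",
)
print(report)
```

A large shared prefix with a much higher word count would match the pattern described above, where the opening barely changes and the extra input mostly adds detail at the end.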

# NLP Summarization Models Overview

An overview of various state-of-the-art models used in natural language processing for text summarization tasks.





## GPT-4
- **Type:** Neural Network (Transformers)
- **Architecture:** Decoder-only architecture
- **Mechanism:** 
  - Utilizes a modified version of the transformer architecture, fundamentally based on self-attention mechanisms.
  - Capable of understanding and generating human-like text by predicting the next word in a sequence.
- **Datasets:** 
  - Trained on a diverse range of internet text.
  - The specific datasets are not publicly disclosed, but they are reported to include books, websites, and other text available online up to the model's training cut-off in 2021.
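
The "predict the next word, append it, repeat" loop behind decoder-only generation can be illustrated with a toy model. The sketch below uses simple bigram counts in place of a neural network, so it says nothing about GPT-4's quality or internals; `train_bigram` and `generate` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count which word follows which in the training text."""
    table = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table


def generate(table, start, max_new_tokens=5):
    """Greedy autoregressive decoding: always pick the most likely next word."""
    output = [start]
    for _ in range(max_new_tokens):
        candidates = table.get(output[-1])
        if not candidates:
            break  # no known continuation for this word
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)


table = train_bigram("the cat sat on the mat")
print(generate(table, "cat", max_new_tokens=3))
```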

  
## BART (Bidirectional and Auto-Regressive Transformers)

- **Type:** Neural Network (Transformers)
- **Architecture:** Encoder-Decoder architecture
- **Mechanism:** 
  - Trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text.
  - Combines bidirectional context in the encoder with autoregressive capabilities in the decoder.
- **Datasets:** 
  - Pre-trained on large-scale English corpora, including books and Wikipedia along with news, stories, and web text.
  - Summarization checkpoints are then fine-tuned on task-specific datasets.
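
The "corrupt, then reconstruct" idea can be shown with the simplest kind of noising, token masking. BART's pre-training also uses other corruptions (such as text infilling and sentence permutation) that this sketch leaves out; the function name `mask_tokens` and the default values are assumptions for illustration.

```python
import random


def mask_tokens(text, mask_prob=0.3, seed=0, mask_token="<mask>"):
    """Replace random words with a mask token.

    Returns (corrupted_input, reconstruction_target).
    """
    rng = random.Random(seed)
    words = text.split()
    corrupted = [mask_token if rng.random() < mask_prob else word for word in words]
    return " ".join(corrupted), text


corrupted, target = mask_tokens("the speaker explains word embeddings with an analogy")
print(corrupted)
print(target)
```

During pre-training, the encoder reads the corrupted text and the decoder learns to produce the original text, which is why the function returns both.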

## Pegasus (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models)

- **Type:** Neural Network (Transformers)
- **Architecture:** Encoder-Decoder architecture
- **Mechanism:** 
  - Specifically designed for abstractive text summarization.
  - Uses a novel pre-training objective called "gap sentences generation", aligning closely with the summarization task.
- **Datasets:** 
  - Pre-trained on large web and news corpora (C4 and HugeNews), then fine-tuned on summarization datasets such as CNN/DailyMail and XSum.
  - Designed to excel at summarization because the pre-training task (generating sentences removed from a document) already resembles writing a summary.
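
A toy version of gap-sentence generation is sketched below. The real objective scores sentence importance with ROUGE against the rest of the document; this sketch substitutes a simpler word-overlap score, and the mask token string and the function name `gap_sentence_example` are placeholders chosen for illustration.

```python
import re


def gap_sentence_example(document, num_gaps=1, mask_token="<mask_1>"):
    """Mask the sentences that overlap most with the rest of the document.

    Returns (masked_document, target_text).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def words(sentence):
        return set(re.findall(r"[a-z0-9']+", sentence.lower()))

    scores = []
    for i, sentence in enumerate(sentences):
        rest = set().union(*(words(s) for j, s in enumerate(sentences) if j != i))
        own = words(sentence)
        # Proxy for importance: share of this sentence's words found elsewhere.
        scores.append(len(own & rest) / len(own) if own else 0.0)

    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_gaps])
    masked = " ".join(mask_token if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return masked, target


masked, target = gap_sentence_example(
    "Cats chase mice. Dogs chase cats and mice. The weather is sunny."
)
print(masked)
print(target)
```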

## Text Summarization Model

- **Type:** Likely Neural Network (Transformers)
- **Architecture:** Encoder-Decoder or Decoder-only (varies)
- **Mechanism:** 
  - Typically uses transformer-based architectures with attention mechanisms.
  - Can vary between extractive (selecting parts of the original text) or abstractive (rewriting content in a condensed form) summarization.
- **Datasets:** 
  - The exact datasets can vary depending on the specific implementation.
  - Often trained on standard summarization datasets like CNN/DailyMail, XSum, or large-scale internet text corpora.

## LedLargeBookSummarization

- **Type:** Neural Network (Transformers)
- **Architecture:** Encoder-Decoder architecture (LED stands for Longformer Encoder-Decoder)
- **Mechanism:** 
  - Designed for long-form text summarization, such as books.
  - Handles longer sequences with Longformer-style attention, which combines local windowed attention with global attention on selected tokens.
- **Datasets:** 
  - Likely trained on extensive literary works and long-form content sources.
  - May include datasets consisting of books, scientific papers, and comprehensive reports to handle long-text summarization.
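
The attention pattern that lets Longformer-style models handle long inputs can be pictured as a mask over token pairs. The sketch below builds such a mask for a tiny sequence; the function name `sliding_window_mask`, the window size, and the choice of global position are illustrative and not the settings of the model used here.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """Build a boolean mask where mask[i][j] means token i may attend to j.

    Each token attends to neighbours within window // 2 positions. Tokens
    at global_positions attend to everything and are attended to by all.
    """
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_local = abs(i - j) <= half
            row.append(is_local or i in global_set or j in global_set)
        mask.append(row)
    return mask


mask = sliding_window_mask(6, window=2, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each row has only a handful of allowed positions, the cost grows roughly linearly with sequence length instead of quadratically, which is what makes book-length inputs feasible.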

---

Each model represents a unique approach to handling the complexities of language, tailored for specific tasks in the realm of natural language processing.


In [None]:
import pandas as pd
from IPython.display import display  # explicit import so this also runs outside a notebook

# Paths to the .xlsx datasets, one per truncation length
xlsx_files = ['./summaries500char.xlsx', './summaries1000char.xlsx', './summaries1500char.xlsx']

# Read and display each dataset
for file_path in xlsx_files:
    # Read the .xlsx file into a DataFrame (pandas needs openpyxl installed for .xlsx files)
    df = pd.read_excel(file_path)

    # Display the DataFrame
    print(f"Displaying data from: {file_path}")
    display(df)
