
# NLP Project 2: Comparative Analysis of Summarization Models

## Introduction

This part of the research project evaluates and compares several state-of-the-art language models on automated text summarization. The central objective is to understand how each model behaves as input conditions vary and to analyze the differences in their summarization capabilities.

## Methodology

### Dataset Generation

To support a consistent and unbiased comparison, we developed a program that generates new datasets of summaries at different truncation lengths. Because every model receives the same truncated inputs, differences in the output can be attributed to the models rather than to the data.

### Models for Comparison

The models included in this analysis are:

- **GPT-4:** An advanced language model known for its versatility and depth in understanding context.
- **BART:** A transformer-based model designed for sequence-to-sequence tasks, excelling in summarization.
- **Pegasus:** A model pre-trained with an objective designed for abstractive text summarization, known for generating coherent summaries.
- **Text Summarization Model:** A general model for summarization tasks.
- **LedLargeBookSummarization:** A model tailored for summarizing longer texts, such as books.

### Complications Encountered

- **Data Preparation:** Preparing the datasets for each model was a challenge, as each model had different input requirements and sensitivities.

- **Interpreting Results:** Deciphering the subtleties in the summaries generated by each model and comparing them objectively proved to be a complex task, given the subjective nature of summarization quality.

- **Consistency in Evaluation:** Establishing a consistent and fair framework for evaluating the performance of each model was challenging, particularly in balancing quantitative metrics with qualitative assessments.
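
One way to keep the quantitative side of the evaluation consistent is to compute the same simple numbers for every summary. The sketch below is illustrative only and is not the evaluation used in this project: `compression_ratio` and `source_overlap` are hypothetical helper names, and the two measures are rough proxies for conciseness and relevance, not established metrics.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compression_ratio(source, summary):
    """Summary length divided by source length, in characters (lower is more concise)."""
    return len(summary) / len(source) if source else 0.0


def source_overlap(source, summary):
    """Fraction of summary tokens that also occur somewhere in the source text."""
    summary_tokens = tokenize(summary)
    if not summary_tokens:
        return 0.0
    source_vocabulary = set(tokenize(source))
    matches = sum(1 for token in summary_tokens if token in source_vocabulary)
    return matches / len(summary_tokens)


# Toy example standing in for a real transcript and model summary
source = "The cat sat on the mat and watched the birds outside."
summary = "A cat watched birds."
print(f"compression ratio: {compression_ratio(source, summary):.2f}")
print(f"source overlap:    {source_overlap(source, summary):.2f}")
```

Numbers like these only complement the qualitative reading of each summary. A summary can reuse many source words and still be incoherent.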

### Experiment Design

The experiment will be conducted as follows:

1. **Input Preparation:** Identical input content will be fed into each language model. This ensures fairness and consistency in the comparison.
2. **Truncation Variations:** The program will generate datasets for inputs with lengths of 500, 1000, and 1500 characters. This is to observe how the models perform under different input length constraints.
3. **Quality Assessment:** The generated summaries will be evaluated for coherence, relevance, and conciseness.
4. **Comparative Analysis:** Observations will be made on how the summaries vary across different models and input lengths.
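
The truncation step above can be sketched as follows. This is a minimal illustration, assuming that truncation is a plain character slice from the start of the input. The names `truncate_text` and `build_truncated_inputs` are hypothetical and do not come from the project's program.

```python
# Character limits taken from the experiment design
TRUNCATION_LENGTHS = (500, 1000, 1500)


def truncate_text(text, max_chars):
    """Return at most max_chars characters from the start of text."""
    return text[:max_chars]


def build_truncated_inputs(text, lengths=TRUNCATION_LENGTHS):
    """Map each truncation length to the correspondingly shortened input."""
    return {length: truncate_text(text, length) for length in lengths}


# Toy input standing in for a real transcript (2000 characters)
sample_transcript = "word " * 400
inputs = build_truncated_inputs(sample_transcript)
for length, truncated in inputs.items():
    print(length, len(truncated))
```

Slicing by characters can cut a word in half at the boundary. If that matters for a given model, the cut could be moved back to the previous whitespace instead.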

## Preliminary Notes Template

For recording observations, the following template will be used:


### Initial Notes on Models:

#### GPT-4:
- 

#### BART:
- 

#### Pegasus:
- 

#### Text Summarization:
- 

#### LedLargeBookSummarization:
- 

### Changes Observed:

#### Between 500 and 1000 Characters:
- 

#### Between 1000 and 1500 Characters:
- 



In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file into a DataFrame
df = pd.read_csv('./TranscriptDataset.csv')

# Display the first few rows of the DataFrame to verify its structure
display(df.head())

# Helper functions for displaying summary data for comparison.
def display_descriptions_and_summaries(df, row_index, num_lines, truncation_length):
    """
    Display the brief description and a user-specified number of lines from each
    summary column for a given row, with the text wrapped at the truncation length.

    Args:
    df (DataFrame): The DataFrame containing the data.
    row_index (int): The index label of the row to display.
    num_lines (int): The number of wrapped lines to display from each summary.
    truncation_length (int): The line length, in characters, at which the summary text is wrapped.
    """
    # Select the specified row
    selected_row = df.loc[row_index]

    # Display the brief description
    brief_description = selected_row.get('BriefDescription', 'No description provided')
    print(f"========\nBrief Description (Row {row_index}):\n{brief_description}\n========\n")

    # Display the specified number of lines from each summary column
    for column in ['GPT-4 Summary', 'Gpt-3.5 Turbo Summary', 'LLama 2 70b Summary', 'Bard Summary']:
        # Skip summary columns that are absent from this DataFrame instead of raising a KeyError
        if column not in df.columns:
            print(f"Column '{column}' not found; skipping.\n---\n")
            continue

        print(f"Top {num_lines} lines of '{column}' (Row {row_index}):")

        # Wrap the summary text to the requested line length
        formatted_text = format_summary_text(str(selected_row[column]), truncation_length)

        # Print the specified number of lines
        print('\n'.join(formatted_text.split('\n')[:num_lines]))
        print("\n---\n")


def format_summary_text(text, truncation_length):
    """
    Wrap the summary text at word boundaries.

    A newline is inserted after the first word that brings the current line to
    truncation_length characters or more, so a line can run slightly past that length.

    Args:
    text (str): The text to be formatted.
    truncation_length (int): The line length, in characters, at which to break.

    Returns:
    str: The formatted text.
    """
    lines = []          # Completed lines
    current_words = []  # Words on the line being built
    current_line_length = 0

    for word in text.split():
        current_words.append(word)
        current_line_length += len(word) + 1  # +1 for the separating space

        # Once the line reaches or exceeds the limit, start a new one
        if current_line_length >= truncation_length:
            lines.append(' '.join(current_words))
            current_words = []
            current_line_length = 0

    # Keep any words left over on the final, shorter line
    if current_words:
        lines.append(' '.join(current_words))

    return '\n'.join(lines)

    
# Example usage
# Assuming the 'BriefDescription' column has been added to the DataFrame
display_descriptions_and_summaries(df, row_index=0, num_lines=10, truncation_length=80)  # Adjust the arguments as needed


print("\n\nHere is the analysis of the next video in the dataset ...\n\n")

display_descriptions_and_summaries(df, row_index=1, num_lines=10, truncation_length=80)  # Adjust the arguments as needed
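
The word-wrapping helper in the cell above could also be written with Python's standard `textwrap` module. The sketch below is an alternative, not the notebook's own implementation, and `wrap_summary_text` is a hypothetical name. Unlike `format_summary_text`, which breaks a line only after it has reached the limit, `textwrap.fill` keeps every line at or below the requested width.

```python
import textwrap


def wrap_summary_text(text, width):
    """Wrap text at word boundaries so that no line is longer than width characters."""
    return textwrap.fill(text, width=width)


sample = "The quick brown fox jumps over the lazy dog near the quiet river bank."
print(wrap_summary_text(sample, width=30))
```

Because the two helpers break lines at different points, the "top N lines" shown for a summary can differ slightly depending on which one is used.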


