## 1. Introduction/Background

Research project: **Comparing the Summary Generating Capabilities of Known LLMs Big and Small**

### Objective:
The goal is to evaluate and compare the summary generating capabilities of various large language models (LLMs), both big and small, in terms of their effectiveness and efficiency.

### Methodology:
- **Prompts Used:** For all LLMs, the prompt will be, "Summarize the given transcript. Try to get to the underlying concepts presented in the video transcript to save the reader time and narrow down to the important takeaways."
- **LLMs to be Evaluated:** The models under scrutiny are GPT-4 preview, GPT-3.5 turbo 16k, Llama 2, and Bard.
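
As a minimal sketch of how the same prompt could be applied to every model, the snippet below pairs the shared instruction with a transcript. The helper name `build_prompt` and the `Transcript:` separator are assumptions made for illustration; the project only fixes the instruction text, not how it is combined with the transcript.

```python
# Shared instruction, copied from the methodology above.
SUMMARY_INSTRUCTION = (
    "Summarize the given transcript. Try to get to the underlying concepts "
    "presented in the video transcript to save the reader time and narrow "
    "down to the important takeaways."
)


def build_prompt(transcript: str) -> str:
    """Hypothetical helper: return the shared instruction followed by the transcript."""
    # The "Transcript:" separator is an assumption, not part of the original prompt.
    return f"{SUMMARY_INSTRUCTION}\n\nTranscript:\n{transcript.strip()}"


# Example with a made-up transcript fragment.
print(build_prompt("  Today we look at how gradient descent works...  "))
```

Keeping the instruction in one constant means every model receives identical wording, so differences between summaries are less likely to come from the prompt itself.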

### Research Purpose:
The research aims to:
- Understand the strengths and weaknesses of these billion-parameter large language models.
- Assess their utility in practical applications that could positively impact the world.

### Significance:
Summarizing texts or video transcripts can save substantial time for students, researchers, and self-learners. This functionality is becoming increasingly common in media consumption and could have significant influence in the coming years. The study will explore:
- The quality of summaries produced by different LLMs.
- Comparative effectiveness of these models.
- Potential utility for the general populace.

### Potential Dangers:
I am also looking into accuracy: if these summaries are going to be widely used in future applications, how reliable are they? In my experience, every LLM produces a different output, and each has its own quirks.

### What I will Learn:
By delving into this, I can get a solid picture of what different LLMs are capable of. Then I can ask further questions like:
- Was it the training data that led to this effect?
- Was it the model architecture that led to some quirks in the summarization process?
- Why are certain models capped at smaller lengths while others can handle more?


## 2. Exploratory Data Analysis

The functions below display the output of each model for a given video in the dataset.

You can pass in row numbers 0-4 to analyze the summary outputs for the 5 videos in the dataset.

The example is set to videos 1 and 2 (row indices 0 and 1). Adjust the arguments as needed:

`display_descriptions_and_summaries(df, row_index=0, num_lines=10, truncation_length=80)`
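
To illustrate how `num_lines` and `truncation_length` interact, here is a small standalone sketch built on the standard-library `textwrap` module. It is not the notebook's own helper: `preview_summary` is a hypothetical name, and `textwrap.wrap` breaks a line before it would exceed the width, whereas the notebook's helper breaks after the line reaches or exceeds it.

```python
import textwrap


def preview_summary(text: str, num_lines: int, truncation_length: int) -> str:
    """Wrap text to at most truncation_length characters per line and
    keep only the first num_lines lines."""
    lines = textwrap.wrap(text, width=truncation_length)
    return "\n".join(lines[:num_lines])


# Made-up sample text: 50 repetitions of the same short word.
sample = "word " * 50
print(preview_summary(sample, num_lines=3, truncation_length=20))
```

A smaller `truncation_length` produces more, shorter lines, so the same `num_lines` covers less of the summary.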


In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file into a DataFrame
df = pd.read_csv('./TranscriptDataset.csv')

# Display the first few rows of the DataFrame to verify its structure
display(df.head())

# Helper functions for outputting summary data for comparison.
def display_descriptions_and_summaries(df, row_index, num_lines, truncation_length):
    """
    Display brief description and user specified number of lines from each summary column 
    for a given row, with text formatted according to the truncation length.

    Args:
    df (DataFrame): The DataFrame containing the data.
    row_index (int): The index of the row to display.
    num_lines (int): The number of lines to display from each summary.
    truncation_length (int): The number of characters after which to insert a newline in the summary text.
    """
    # Select the specified row by position (0 = first video), independent of index labels
    selected_row = df.iloc[row_index]

    # Display the brief description
    brief_description = selected_row.get('BriefDescription', 'No description provided')
    print(f"========\nBrief Description (Row {row_index}):\n{brief_description}\n========\n")

    # Display the specified number of lines from each summary column
    for column in ['GPT-4 Summary', 'Gpt-3.5 Turbo Summary', 'LLama 2 70b Summary', 'Bard Summary']:
        print(f"Top {num_lines} lines of '{column}' (Row {row_index}):")
        
        # Get the formatted text
        formatted_text = format_summary_text(str(selected_row[column]), truncation_length)

        # Print the specified number of lines
        print('\n'.join(formatted_text.split('\n')[:num_lines]))
        print("\n---\n")


def format_summary_text(text, truncation_length):
    """
    Formats the summary text by inserting a newline after the word that brings
    the current line to the specified number of characters or more.

    Args:
    text (str): The text to be formatted.
    truncation_length (int): The line length (in characters) at or beyond which a newline is inserted.

    Returns:
    str: The formatted text.
    """
    # Split the text into words
    words = text.split()

    # Initialize variables to keep track of the current line length and the formatted text
    current_line_length = 0
    formatted_text = ""

    for word in words:
        # Add the word to the current line
        formatted_text += word + " "
        current_line_length += len(word) + 1

        # If the current line reaches or exceeds the truncation length, add a newline
        if current_line_length >= truncation_length:
            formatted_text += "\n"
            current_line_length = 0

    return formatted_text

    
# Example usage
# Assuming the 'BriefDescription' column has been added to the DataFrame
display_descriptions_and_summaries(df, row_index=0, num_lines=10, truncation_length=80)  # Adjust the arguments as needed


print("\n\n Here is the analysis of the next video in dataset ...\n\n")

display_descriptions_and_summaries(df, row_index=1, num_lines=10, truncation_length=80)  # Adjust the arguments as needed


