# Project Part 1

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

This notebook is intended to serve as a template to complete Part 1 of the projects. Feel free to modify this notebook as needed, but be sure to have the two main parts: a) an introductory proposal section describing what you are going to do and where the dataset originates, and b) an exploratory analysis section that has the histograms, charts, tables, etc. that are the output of your exploratory analysis.

__Note you will want to remove the text above, and in the markdown cells below, and replace it with your own text describing the dataset, task, exploratory steps, etc.__

## 1. Introduction/Background

Research project: **Comparing the Summary Generating Capabilities of Known LLMs Big and Small**

### Objective:
The goal is to evaluate and compare the summary generating capabilities of various large language models (LLMs), both big and small, in terms of their effectiveness and efficiency.

### Methodology:
- **Prompts Used:** For all LLMs, the prompt will be, "Summarize the given transcript. Try to get to the underlying concepts presented in the video transcript to save the reader time and narrow down to the important takeaways."
- **LLMs to be Evaluated:** The models under scrutiny will include GPT-4 preview, GPT-3.5 turbo 16k, Llama 2 70B, and Bard.
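
Because every model receives the same instruction, the prompt can be kept in one place and combined with each transcript. The following is a minimal sketch of that idea; the helper name `build_summary_prompt` and the `Transcript:` layout are assumptions for illustration, not the format actually sent to each model.

```python
# The fixed instruction quoted in the methodology above
SUMMARY_PROMPT = (
    "Summarize the given transcript. Try to get to the underlying concepts "
    "presented in the video transcript to save the reader time and narrow "
    "down to the important takeaways."
)


def build_summary_prompt(transcript):
    """Combine the fixed instruction with one transcript (assumed layout)."""
    # Strip surrounding whitespace so every model sees the same framing
    return f"{SUMMARY_PROMPT}\n\nTranscript:\n{transcript.strip()}"
```

Reusing one constant makes it easier to confirm that differences between summaries come from the models rather than from small variations in wording.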

### Research Purpose:
The research aims to:
- Understand the strengths and weaknesses of these billion-parameter large language models.
- Assess their utility in practical applications that could positively impact the world.

### Significance:
Summarizing texts or video transcripts can save substantial time for students, researchers, and self-learners. Automatic summarization is becoming increasingly common in media consumption and is likely to grow more influential in the coming years. The study will explore:
- The quality of summaries produced by different LLMs.
- The comparative effectiveness of these models.
- The potential utility for the general populace.

### Potential Dangers:
I am also investigating accuracy: if these summaries are going to be widely used in future applications, how accurate are they? From what I have seen, every LLM produces different output, and each has its own quirks.

### What I Will Learn:
By delving into this, I can get a solid picture of what different LLMs are capable of. Then I can ask further questions like:
- Was it the training data that led to this effect?
- Was it the model architecture that led to some quirks in the summarization process?
- Why are certain models capped at smaller lengths while others can handle more?
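
One way to start on the length question is to compare word counts across the summary columns. The sketch below assumes the column names used in `TranscriptDataset.csv`; the helper `summary_word_counts` is hypothetical, and the demo rows are placeholder text rather than data from the project.

```python
import pandas as pd

# Summary column names as they appear in the dataset
SUMMARY_COLUMNS = [
    'GPT-4 Summary',
    'Gpt-3.5 Turbo Summary',
    'LLama 2 70b Summary',
    'Bard Summary',
]


def summary_word_counts(df, columns=SUMMARY_COLUMNS):
    """Return whitespace-separated word counts, one column per model."""
    counts = pd.DataFrame(index=df.index)
    for column in columns:
        # Missing summaries are treated as empty text, i.e. zero words
        counts[column] = df[column].fillna('').astype(str).str.split().str.len()
    return counts


# Placeholder rows for illustration only (not real summaries)
demo = pd.DataFrame({column: ['one two three', None] for column in SUMMARY_COLUMNS})
demo_counts = summary_word_counts(demo)
```

Calling `summary_word_counts(df).describe()` on the real DataFrame would give a per-model view of typical and maximum summary lengths, which is a starting point for spotting models that appear to be capped.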


## 2. Exploratory Data Analysis



import pandas as pd
from IPython.display import display

# Read the CSV file into a DataFrame
df = pd.read_csv('./TranscriptDataset.csv')

# Display the first few rows of the DataFrame to verify its structure
display(df.head())

# Helper function for printing summary data for comparison.
# Note: it relies on format_summary_text(), which is not defined in this cell.
def display_descriptions_and_summaries(df, row_index, num_lines, truncation_length):
    """
    Displays the brief description and the specified number of lines from each summary column 
    for a given row, with text formatted according to the truncation length.

    Args:
    df (DataFrame): The DataFrame containing the data.
    row_index (int): The index of the row to display.
    num_lines (int): The number of lines to display from each summary.
    truncation_length (int): The number of characters after which to insert a newline in the summary text.
    """
    # Select the specified row
    selected_row = df.loc[row_index]

    # Display the brief description
    brief_description = selected_row.get('BriefDescription', 'No description provided')
    print(f"========\nBrief Description (Row {row_index}):\n{brief_description}\n========\n")

    # Display the specified number of lines from each summary column
    for column in ['GPT-4 Summary', 'Gpt-3.5 Turbo Summary', 'LLama 2 70b Summary', 'Bard Summary']:
        print(f"Top {num_lines} lines of '{column}' (Row {row_index}):")
        
        # Get the formatted text
        formatted_text = format_summary_text(str(selected_row[column]), truncation_length)

        # Print the specified number of lines
        print('\n'.join(formatted_text.split('\n')[:num_lines]))
        print("\n---\n")

# Example usage
# Assuming the 'BriefDescription' column has been added to the DataFrame
display_descriptions_and_summaries(df, row_index=0, num_lines=10, truncation_length=80)  # Adjust the arguments as needed
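
The cell above calls `format_summary_text`, but its definition does not appear there. In the printed output, long lines break after exactly 80 characters (even mid-word) and existing line breaks are kept. A minimal sketch of a helper with that behavior might look like the following. This implementation is an assumption, not the original, and it would need to run before the call to `display_descriptions_and_summaries`.

```python
def format_summary_text(text, truncation_length):
    """Hard-wrap text so that no line exceeds truncation_length characters.

    Assumed behavior: existing newlines are preserved, and each line is cut
    into fixed-width chunks without regard to word boundaries.
    """
    if truncation_length <= 0:
        raise ValueError("truncation_length must be a positive integer")

    wrapped_lines = []
    for line in text.split('\n'):
        if not line:
            # Keep blank lines so paragraph breaks survive
            wrapped_lines.append('')
            continue
        # Slice the line into consecutive chunks of the requested width
        for start in range(0, len(line), truncation_length):
            wrapped_lines.append(line[start:start + truncation_length])
    return '\n'.join(wrapped_lines)
```

If breaking inside words is not wanted, the standard library's `textwrap.fill` wraps on whitespace instead, at the cost of lines that are no longer uniformly 80 characters wide.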





| Unnamed: 0 | Number | BriefDescription | URL | Captions.txt | Captions.srt | GPT-4 Summary | Gpt-3.5 Turbo Summary | LLama 2 70b Summary | Bard Summary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | MIT Open Course on Independence, Basis, and Di... | https://www.youtube.com/watch?v=eeMJg4uI7o0 | "so as long as I'm introducing the idea\nof a ... | 00:06:33>00:06:40 0 1 0 0\n00:06:36>00:06:43 I... | The transcript is from a teaching session focu... | The video introduces the concept of a vector s... | The speaker is discussing the concept of a bas... | Sure, here is a summary of the video transcrip... |
| 1 | 2 | MIT Open Course on Basis and Dimension\n | https://www.youtube.com/watch?v=MMWqGD4Urso | "hi there welcome back to recitation in\nlectu... | "00:00:07>00:00:12 hi there welcome back to re... | The video is a lesson on linear algebra concep... | The video discusses how to find the dimension ... | he lecture is discussing the concept of findin... | Sure, here is a summary of the given transcrip... |
| 2 | 3 | New York Times video on ‘His Name Was Bélizair... | https://www.youtube.com/watch?v=n60NTrKs-wc | "For over 100 years,\nthis family portrait\nhe... | "00:00:01>00:00:03 For over 100 years,\nthis f... | The Metropolitan Museum of Art acquired a 19th... | The transcript discusses the story of a family... | The Metropolitan Museum of Art has acquired a ... | \nSure, here is a summary of the given transcr... |
| 3 | 4 | Vice News video on Mega Yachts and False Flags... | https://www.youtube.com/watch?v=KjpnObVhKpg | "[ Upbeat music ]\n-In the open ocean you get\... | "00:00:02>00:00:03 [ Upbeat music ]\n00:00:03>... | The video discusses the concept of "flags of c... | The video discusses how the wealthy and powerf... | The video discusses how wealthy individuals an... | The video talks about how wealthy people use f... |
| 4 | 5 | Channel: 'Top LuxuryTop 10' on Biggest Megapro... | https://www.youtube.com/watch?v=ZSD_dAwl8Dg | "From the largest offshore wind farm the world... | "00:00:00>00:00:06 From the largest offshore w... | This video highlights the top 10 megaprojects ... | The transcript outlines the top 10 megaproject... | Sure, here's a summary of the top 10 megaproje... | The top 10 megaprojects in Europe are:\n\nStad... |


Brief Description (Row 0):
MIT Open Course on Independence, Basis, and Dimension

Top 10 lines of 'GPT-4 Summary' (Row 0):
The transcript is from a teaching session focusing on the foundational concepts 
of vector spaces in linear algebra. The speaker begins by discussing the importa
nce of understanding the basis of a vector space and its dimension, emphasizing 
independence and spanning as key factors.

Three key concepts are illustrated:

1. Independence of vectors: Vectors are independent if none of them can be expre
ssed as a linear combination of the others. If vectors are on the same line, the
y are dependent; in contrast, vectors that point in different directions are ind

---

Top 10 lines of 'Gpt-3.5 Turbo Summary' (Row 0):
The video introduces the concept of a vector space and explains its dimension an
d basis. A vector space can be three-dimensional space, and its dimension is 3. 
A basis for a vector space is a set of independent vectors that span the space. 
The video ill