## GistTube

This workshop demonstrates how to create an AI/ML pipeline to convert spoken audio into a text summary. You will learn how to convert audio into text via automatic speech recognition (ASR) using the Whisper model, and how to condense the resulting transcript with a natural language processing (NLP) summarization model.

Upon completion you will have a basic understanding of:

1. FFMPEG for audio extraction
1. Automatic speech recognition 
1. NLP concepts (tokenization, summarization)
1. Whisper and Huggingface APIs
1. Pandas DataFrames


**Note:** This file is intended to be run on [Google Colab](https://colab.research.google.com). If you're viewing this file on GitHub, [click here](https://githubtocolab.com/fbsamples/mit-dl-workshop/blob/main/video-summarizer/exercise.ipynb) to load it into Google Colab.

If you get stuck, feel free to refer to the [solution](https://githubtocolab.com/fbsamples/mit-dl-workshop/blob/main/video-summarizer/solution.ipynb).

### 1. Import necessary libraries

In [None]:
%pip install -U openai-whisper 

In [None]:
import os
import torch
import pandas as pd
import whisper
from transformers import AutoTokenizer, pipeline

### 2. Set options

In [None]:
# The size of the ASR model to use
ASR_MODEL_SIZE = "small.en"

# The maximum length of each bullet point (in tokens)
SUMMARY_LENGTH = 128

# Set device to GPU if available, otherwise use CPU
DEVICE = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

# This is the NLP model for summarization
# Play around with other models at https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
NLP_ARCH = 'facebook/bart-large-cnn'

# We will save any artifact we create in this folder
FOLDER = "refik_interview"
os.makedirs(FOLDER, exist_ok=True)  # don't fail if this cell is re-run

### 3. Download the video and extract audio


In this step you will download the video that you will be summarizing. Later you will use the Whisper ASR model to transcribe the audio. It only accepts audio as input, so you will need to extract the audio from the video file. 

In this step of the exercise you will:

1. Download the video file.
1. Extract the audio. 

To download the file from the Amazon S3 bucket we provided, you will use the `curl` command, a tool for transferring data to or from a server. Then you will use the `ffmpeg` command to extract the audio.

*FFMPEG is a suite of libraries and programs for handling video, audio, other multimedia files, and streams.*

Now that you understand the process, run the code below.

In [None]:
# These are shell commands, but you can run them from a Jupyter notebook (such as Colab) by prefixing each one with an exclamation mark (!)

!curl "https://pytorch-workshops.s3.amazonaws.com/videos/refik_interview.mp4" -o refik_interview/video.mp4
!ffmpeg -i refik_interview/video.mp4 -vn -acodec libmp3lame -ab 128k refik_interview/audio.mp3

🏆🏆 Congrats! With just two lines of code you have downloaded a video off the internet and extracted the audio into a separate file! Pretty neat huh! 😀 🏆🏆

As mentioned above, the Whisper ASR model requires audio as input. Since our source material was a video, we had to extract the audio.

Let's say our source material was an audio podcast.

**Question:** Would we need to use ffmpeg to extract the audio?

**Answer:** No! In that case we wouldn't need to extract the audio because the file is already in the correct format - audio!
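
The same decision can be expressed in code. The sketch below is one hypothetical way to do it and is not part of the workshop notebook: `extraction_command` and the `AUDIO_EXTENSIONS` set are names invented here, and the ffmpeg flags simply mirror the shell command used earlier. It only builds the argument list; actually running it (for example with `subprocess.run`) is left out.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: file extensions treated as "already audio" for this sketch.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def extraction_command(source: str, audio_out: str) -> Optional[List[str]]:
    """Return the ffmpeg arguments needed to extract audio, or None if `source` is already audio."""
    if Path(source).suffix.lower() in AUDIO_EXTENSIONS:
        return None  # e.g. a podcast file can go straight to the ASR model
    # Same flags as the shell command: drop the video stream (-vn), encode MP3 at 128 kbit/s.
    return ["ffmpeg", "-i", source, "-vn", "-acodec", "libmp3lame", "-ab", "128k", audio_out]


print(extraction_command("refik_interview/video.mp4", "refik_interview/audio.mp3"))
print(extraction_command("podcast_episode.mp3", "refik_interview/audio.mp3"))
```

The first call prints the ffmpeg argument list, while the second prints `None` because the input is already an audio file.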

---- 
**Note:** For this workshop we included the audio-only version of the file in our Amazon S3 bucket. To download just the **audio/refik_interview.mp3** file, you would run the code below:

`!curl "https://pytorch-workshops.s3.amazonaws.com/videos/audio/refik_interview.mp3" -o refik_interview/audio.mp3`

----

Now that you have your audio file to transcribe, you need to build the ASR model that will do the transcription. 

### 4. Load the Whisper ASR Model

For this workshop, you will use the Whisper ASR model ([blog](https://openai.com/blog/whisper/)). It is one of the best performing ASR models today. 

Here are the instructions for using Whisper ASR in Python: https://github.com/openai/whisper#python-usage

Run the code below to load the model.

In [None]:
asr_model = whisper.load_model(ASR_MODEL_SIZE).to(DEVICE)

🏆🏆 Now that the model is loaded, you can transcribe audio 🏆🏆

In the next step you will test out the code for transcription on a 3-second audio file we provided for you. 

Go ahead and open the file in your browser to listen to it: [Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav](https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav)

Now that you've given it a listen, it's time to transcribe it! To do that use the `.transcribe` function:

`asr_model.transcribe("<path to file>")`

The code below has already been written to download and save the audio into a file called **tmp_audio.wav**.

Calling transcribe will return a dictionary that includes some information about the file in addition to the transcribed text.

Given that information, find `# write code here` and replace it with code that transcribes the audio file and prints the transcribed text.
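
To make the shape of that dictionary more concrete, here is a hand-written stand-in for a `transcribe` result. The values are invented for illustration, and a real result carries more fields per segment; the sketch relies only on the top-level `text`, `segments`, and `language` keys and on the per-segment `start`, `end`, and `text` fields.

```python
# Hypothetical, hand-written stand-in for the dictionary returned by asr_model.transcribe(...).
# All values are made up; a real result contains additional per-segment fields.
mock_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.4, "text": " Hello there."},
        {"id": 1, "start": 1.4, "end": 2.9, "text": " How are you?"},
    ],
}

# The full transcript lives under the "text" key.
print(mock_result["text"].strip())

# Each segment carries its own timestamps and text.
for segment in mock_result["segments"]:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text'].strip()}")
```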

In [None]:
# EXERCISE: 
# - Test the ASR model on an audio file at https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav
# - What does result['segments'] return?

# download the audio file
import requests
with requests.get("https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav", stream=True) as resp:
    with open('tmp_audio.wav', 'wb') as f:
        f.write(resp.content)


# Run ASR transcription
# ... 
# write code here
# ...

🏆🏆Were you able to transcribe the text? Does it match what the audio file says? If so, congratulations!🏆🏆

Now it's time to learn a little bit more about how the transcribe function handles the audio for larger files.

### 5. Transcribe speech in the audio

The ASR model chunks up the audio into "segments". Since the audio above is short, it has only one segment; longer audio tracks will have more segments. 

Each segment has a start and end timestamp, along with the transcribed text. We'll load all the segments into a pandas dataframe. DataFrames are like spreadsheets in Python; they make data easier to read and manage. In our dataframe, each row corresponds to one segment.

For this exercise you will write a function that takes an audio path (`audio_path`) and an ASR model (`asr_model`), and returns the transcribed dataframe `transcript_df`. To accomplish this task you will need to:

1. Transcribe `audio_path` using the `.transcribe` function you learned above.
1. Create the dataframe and load the segments into the pandas dataframe: `pd.DataFrame(result["segments"])`
1. Keep only `['start', 'end', 'text']` columns.


The code to save the dataframe to a CSV has been provided for you. Make sure to name your pandas dataframe `transcript_df`.

Now that you understand what you need to do, try it yourself. If you get stuck, check out the [solution](https://githubtocolab.com/fbsamples/mit-dl-workshop/blob/main/video-summarizer/solution.ipynb).

In [None]:
# EXERCISE:
# Write a function to 
# - transcribe an audio file using an ASR model 
# - load the start, end, text values of the ASR result into a pandas dataframe
# - save the dataframe as a csv file in {FOLDER}
# - return the dataframe


def transcribe_audio(audio_path: str, asr_model):
    """
    Transcribe an audio file using the provided ASR model.
    
    Parameters:
        audio_path (str): The file path of the audio file.
        asr_model: The ASR model to use for transcription.
    
    Returns:
        pd.DataFrame: The transcript, one row per segment, with columns 'start', 'end', 'text'.
    """
    # ... 
    # write code here
    # ...
    
    transcript_df.to_csv(f"{FOLDER}/transcript.csv", index_label=False)
    return transcript_df

#### 4.1 Test your function

Once you've written the function, try it on the 3-second audio file you just downloaded:

In [None]:
transcribe_audio('tmp_audio.wav', asr_model)

🏆🏆 Do you see a dataframe with a single row and 3 columns? If so, good job! 🏆🏆

If you're stuck, don't worry. We've got you covered. Compare your code with the [solution](https://github.com/fbsamples/mit-dl-workshop/blob/main/video-summarizer/solution.ipynb).

Now that your function works, it's time to transcribe the full audio file.

#### 4.2 Transcribe the full audio file

The time has finally come to transcribe the audio file you downloaded into text! The file is quite large, so it will take a while to transcribe. Feel free to grab a drink, get some coffee, and put on your favorite cat videos while you wait 😸

In [None]:
audio_file = f"{FOLDER}/audio.mp3"

transcript_df = transcribe_audio(audio_file, asr_model)  # this takes a while... 
transcript_df.head()

🏆🏆 Congratulations! You have now fully transcribed the audio file containing multiple segments! 🏆🏆

In the next step you will convert that transcript into a summary.

### 5. Generate summary of transcription

The `transcribe_audio` function you wrote saved the raw transcript in `transcript.csv`. The csv file has 3 columns: start, end, text, populated by the ASR model. It's now time to create the next part of the project that will populate the summary column.

Given that one row represents one segment, you could generate a summary for each row. But consider how each segment is composed. Each row corresponds to around 7 seconds of audio, which includes silence. Some rows have barely any words at all, which makes them poor candidates for summarization!

NLP models have a maximum input length they can accept; in the case of our model it is 1024 tokens. We'll iterate over the `text` column and combine consecutive rows into longer segments containing at most 1019 tokens each.

**Question:** Why split segments into 1019 instead of 1024 tokens?

**Answer:** To prevent inadvertent overflows we chose to use a buffer of 5 tokens - 1024 - 5 = 1019.
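
As a rough illustration of the budget arithmetic and the chunking idea, the sketch below packs short pieces of text into chunks without exceeding a token budget. It is not the workshop solution: it counts whitespace-separated words as a stand-in for real tokenizer output, uses a tiny budget so the effect is visible, and `pack_into_chunks` is a name invented here.

```python
from typing import List

MODEL_MAX_TOKENS = 1024  # input limit quoted in the text
SAFETY_BUFFER = 5        # head-room against inadvertent overflows
TOKEN_BUDGET = MODEL_MAX_TOKENS - SAFETY_BUFFER
print(TOKEN_BUDGET)  # 1019


def pack_into_chunks(pieces: List[str], budget: int) -> List[str]:
    """Greedily merge consecutive pieces so that no chunk exceeds `budget` toy tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in pieces:
        n_tokens = len(piece.split())  # assumption: one word counts as one token
        # Close the current chunk if adding this piece would overflow the budget.
        if current and current_len + n_tokens > budget:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


pieces = ["the quick brown fox", "jumps over", "the lazy dog", "and runs away"]
print(pack_into_chunks(pieces, budget=6))
# ['the quick brown fox jumps over', 'the lazy dog and runs away']
```

Note that a single piece longer than the budget would still end up in a chunk of its own in this sketch; with ASR segments of a few seconds each, that case is unlikely to arise.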


#### 5.1 NLP Tokenization

We defined `NLP_ARCH = 'facebook/bart-large-cnn'` above; this is Meta's BART language model, fine-tuned for summarization on the CNN/Daily Mail dataset. We need to use the tokenizer that matches this architecture.

The `transformers` library has an `AutoTokenizer` class that provides model-specific tokenizers.

In [None]:
NLP_TOKENIZER = AutoTokenizer.from_pretrained(NLP_ARCH)

def get_tokens(tokenizer, input_text):
    return tokenizer(input_text, add_special_tokens=False)['input_ids']

text = "Artificial Intelligence brings forth a new dawn of the industrial age"
tokens = get_tokens(NLP_TOKENIZER, text)  # get_tokens expects the tokenizer as its first argument

print("Number of words: ", len(text.split()))
print("Number of tokens: ", len(tokens))

🏆🏆 Great! Now that you've got the hang of it, we're going to step back and challenge you to continue down this notebook. 🏆🏆

Of course, feel free to ask your instructors if you have any questions.

#### 5.2 Tokenize the transcribed text and create longer segments

In [None]:

NLP_MAXLEN = NLP_TOKENIZER.model_max_length - 5

def generate_timestamped_segments(asr_df: pd.DataFrame):
    """
    Tokenize transcribed text, chunking into segments of length <= NLP_MAXLEN
    while preserving correct timestamps from ASR transcription.
    
    Parameters:
        asr_df (pd.DataFrame): The transcription segments in dataframe format. 
        Contains columns 'text', 'start', 'end'.
    
    Returns:
        pd.DataFrame: Dataframe where each row is a segment. Must contain columns 'text', 'start', 'end', and
        'tokens' corresponding to the NLP tokens of 'text'
    """
    # ...
    # write code here
    # ...

    # HINTS:
    # - Create an empty list called segments.
    # - Create a dictionary called curr_segment with keys 'start', 'end', 'text', and 'tokens'. 
    #   Each value is initialized to None, "" or [].
    # - Iterate through each row in asr_df.
    # - Get the value of 'text' in the current row and assign it to a variable called text.
    # - Tokenize the text using the function get_tokens and store the result in a variable called tokens.
    # - If the total number of tokens in the current segment plus the number of tokens in the tokens list 
    #   exceed the maximum length, add the current segment to segments list and create a new curr_segment.
    # - If the total number of tokens in the current segment plus the number of tokens in the tokens list 
    #   does not exceed the maximum length, update the curr_segment dictionary with the 'start', 'end', 
    #   'text' and 'tokens' values.
    # - Append the current curr_segment to the segments list.
    # - Create a pandas DataFrame from the segments list.
    # - Return the pandas DataFrame.

In [None]:
segments_df = generate_timestamped_segments(transcript_df)

#### 5.3 Explore the generated dataframe 

Compared to `transcript_df`, how much time does each row now cover? How many words does each segment contain on average?

In [None]:
# EXERCISE: what is the time duration that each row captures? 
   # ...
   # write code here
   # ...

# EXERCISE: How many words on average in each segment?
   # ...
   # write code here
   # ...

# EXERCISE: How many tokens on average in each segment?
   # ...
   # write code here
   # ...

Verify that no text segment contains more than `NLP_MAXLEN` tokens.

In [None]:
assert (segments_df['tokens'].str.len() > NLP_MAXLEN).sum() == 0, f"At least one text segment has more than {NLP_MAXLEN} tokens"

#### 5.4 Summarize the new longer segments

The next step is to pass these text segments to an NLP Summarizer model. For each text segment the model will generate summaries that are not more than `SUMMARY_LENGTH = 128` tokens.

The `transformers` library offers a convenient `pipeline` class that wraps up complex code for summarization (and more tasks) into a single API call. [API Reference](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/pipelines#transformers.SummarizationPipeline)

Here's an example of how to use `pipeline` to summarize text. Note that you can simply pass the text to the pipeline; it automatically tokenizes the inputs and generates a summary for each one.

In [None]:
# Define two BART abstracts as strings
bart_abstract_1 = "BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token."
bart_abstract_2 = " BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance."

# Combine the two abstracts into a list
bart_abstracts = [bart_abstract_1, bart_abstract_2]

# Create a summarization pipeline using the specified pre-trained model and device
summarizer = pipeline("summarization", model=NLP_ARCH, device=DEVICE)  # DEVICE falls back to CPU when no GPU is available

# Use the summarization pipeline to generate summaries of the abstracts, with maximum length 64 and minimum length 20
abstracts_summary = summarizer(bart_abstracts, max_length=64, min_length=20)

# Print the summaries of the two abstracts
print(abstracts_summary[0])
print(abstracts_summary[1])

# Print the original length and summarized length of the two abstracts
print("original length: ", len(bart_abstract_1) + len(bart_abstract_2))
print("summarized length: ", len(abstracts_summary[0]['summary_text']) + len(abstracts_summary[1]['summary_text']))


Now let's summarize our video transcriptions.

In [None]:
# EXERCISE: 
# Write a function that 
# - takes in the dataframe generated above, 
# - generates summaries of 128 tokens or less for each segment, and 
# - adds them in a new column of the dataframe. Return this dataframe

def generate_timestamped_summaries(segments_df: pd.DataFrame, summary_lengths: int = 128):
    """
    Generate a summary of each timestamped segment
    
    Parameters:
        segments_df (DataFrame): The dataframe containing timestamps and text segments
        summary_lengths (int): The maximum length of each generated summary
    
    Returns:
        pd.DataFrame: A dataframe with timestamps, transcriptions and summaries
    """
       
    # ...
    # write code here to
    # - Extract the sentences from the timestamped transcript
    # - Initialize the summarization pipeline
    # - Generate summaries for the sentences
    # - Add the summaries to the dataframe
    # - return the dataframe
    # ...


summary_df = generate_timestamped_summaries(segments_df, SUMMARY_LENGTH)

### 6. View the summary

In [None]:
segments_df.head()

### 7. Format and save the dataframe

Looks good! Let's make the timestamps more readable so we can scroll in the video if we need to. Here's a `format_time` helper function.

Then save the dataframe as a CSV file. This will be useful when we need to cross-check something in the video.
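
The time conversion itself is plain integer arithmetic: 3725 seconds is 1 hour (3600 s), 2 minutes (120 s), and 5 seconds. The sketch below works through that with `divmod`; `to_hms` is a stand-in name used only here so that the block runs on its own, and it is meant to behave like the `format_time` helper in this section.

```python
def to_hms(seconds: float) -> str:
    """Convert a time in seconds to HH:MM:SS using divmod (illustrative stand-in)."""
    total = round(float(seconds))           # round to the nearest whole second
    hours, remainder = divmod(total, 3600)  # whole hours, leftover seconds
    minutes, secs = divmod(remainder, 60)   # whole minutes, leftover seconds
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(to_hms(3725))  # 01:02:05
print(to_hms(59.6))  # 00:01:00 (59.6 rounds up to 60 seconds)
```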

In [None]:
def format_time(t):
    """
    Convert a time in seconds to a string in HH:MM:SS format.
    
    Parameters:
        t (float or str): The time in seconds.
    
    Returns:
        str: The time in HH:MM:SS format.
    """
    t = round(float(t))
    hh = t // 3600
    t %= 3600
    mm = t // 60
    ss = t % 60
    return f"{hh:02d}:{mm:02d}:{ss:02d}"

# Format the timestamp columns to be human-readable
segments_df['start'] = segments_df['start'].apply(format_time)
segments_df['end'] = segments_df['end'].apply(format_time)

segments_df.to_csv(f'{FOLDER}/timestamped_summaries.csv', index_label=False)

### 8. Preview `segments_df`

In [None]:
segments_df.head(10)

Reading the summary row by row is tedious. Concatenate all the individual summaries into a single passage.

In [None]:
# EXERCISE: Concatenate the segment-summaries into a single paragraph

# ...
# write code here
# ...

We're busy people; we need a TL;DR.
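
When the concatenated passage is longer than the model's input limit, one option is to cut its token list into consecutive windows and summarize each window separately. The sketch below shows only the windowing step on a toy list of integers; `split_into_windows` is a name invented here, and the decoding and summarization steps are deliberately left out.

```python
from typing import List, Sequence


def split_into_windows(tokens: Sequence[int], max_len: int) -> List[List[int]]:
    """Split `tokens` into consecutive windows of at most `max_len` items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


toy_tokens = list(range(10))              # stand-in for real token ids
print(split_into_windows(toy_tokens, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Fixed-size windows can cut a sentence in half, so treat this as a starting point rather than a finished approach.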

In [None]:
# EXERCISE: Generate a synopsis of the entire video not exceeding 64 tokens 

# ...
# write code here
# ...

# HINTS:
# - Tokenize the given text passage using a specified tokenizer
# - Create a summarization pipeline using a specified pre-trained model and device
# - If the number of tokens in the passage is greater than the maximum allowed length,
#   split the list of tokens into smaller lists of length NLP_MAXLEN and decode each
#   sublist of tokens to create a list of sentences. Generate a summary for each.
# - If the number of tokens is within the allowed limit, generate a summary
#   directly using the entire passage as input to the summarization pipeline
# - Extract the summary text from each generated summary and join them to create
#   a single string containing the summary of the text passage


Write the TL;DR and the paragraph to a file.

In [None]:
# EXERCISE: Write a file called video_summary.txt that contains the above generated synopsis and passage

# ...
# write code here
# ...

