# YouTube Video Summarization using Hugging Face Transformers and Whisper ASR

### Brett Neubeck
<br>

### Table of Contents

- [Summary](#summary)
- [Imports](#imports)
- [Model Selection](#model)
- [YouTube Meta Data Processing](#meta)
- [Audio Stream Trimming](#trim)
- [Transcribing Audio Stream](#trans)
- [Summarize Transcribed Text](#sum)
- [Example Of Shorter Summarized Text](#short)
- [Conclusion](#conclusion)

### Summary
<a id='summary'></a>

Recently, I was tasked with watching a YouTube video and discussing it in a graduate class. Inspired by the idea of automating this process, I embarked on a project to leverage Python and machine learning techniques to summarize YouTube video transcripts. This project showcases how to utilize Hugging Face Transformers and Whisper ASR to achieve this task.

**Project Goals:**
The primary goal of this project was to develop a Python script capable of summarizing YouTube video transcripts automatically. To achieve this, I followed a series of steps, including package installation, data retrieval, audio processing, ASR (Automatic Speech Recognition), and NLP (Natural Language Processing) for summarization.

**Project Steps:**

1. **Package Installation and Import:**
   - The project begins with the installation of necessary Python packages.
   - Key libraries include OpenAI Whisper for speech recognition, Hugging Face Transformers for summarization, and PyTube for YouTube data extraction; ffmpeg handles the audio processing.

2. **Data Retrieval:**
   - The project involves selecting a YouTube video to summarize.
   - The PyTube library is used to extract relevant information about the video.

3. **Audio Processing:**
   - The audio stream from the selected YouTube video is downloaded.
   - If required, the audio is trimmed with ffmpeg (similar to sampling a record) to keep only the relevant portion.

4. **Automatic Speech Recognition (ASR):**
   - The Whisper ASR model, installed from OpenAI's `openai/whisper` GitHub repository, is used to transcribe the audio.
   - Whisper converts the spoken words in the video into text, enabling further analysis.

5. **Text Summarization with NLP:**
   - Using NLP techniques, the transcribed text is summarized.
   - Hugging Face Transformers' NLP capabilities are employed to achieve this summarization.

**Project Outcome:**
The end result of this project is a Python script that can take a YouTube video, transcribe its audio content, and then generate a concise textual summary. This automated summarization process significantly simplifies the task of reviewing and discussing YouTube videos, making it more efficient and accessible.

In [1]:
 !pip install git+https://github.com/openai/whisper.git -q
 !sudo apt update && sudo apt install ffmpeg -q
 !pip install pytube
 !pip install transformers

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 24.4 MB/s eta 0:00:00
  Building wheel for openai-whisper (pyproject.toml) ... done
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:7 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [993 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:9 h

In [2]:
!nvidia-smi

Thu Sep 14 19:13:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Imports
<a id='imports'></a>

In [3]:
import whisper
from pytube import YouTube
import datetime
import pprint
from transformers import pipeline

### Model Selection
<a id='model'></a>

###### Whisper ASR models have the following options <br>
- tiny, base, small, medium, large

In [4]:
# initialize whisper asr model
model = whisper.load_model('medium')

100%|██████████████████████████████████████| 1.42G/1.42G [00:13<00:00, 111MiB/s]
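The choice of model size is mostly a trade-off between accuracy, speed, and GPU memory. As a rough sketch, the helper below picks the largest model that fits a given amount of VRAM. The parameter counts and memory figures are the approximate values listed in the `openai/whisper` README, and `largest_model_that_fits` and its `headroom_gb` margin are hypothetical names introduced here for illustration.

```python
# Approximate figures from the openai/whisper README; treat them as rough guides.
WHISPER_MODELS = [
    # (name, parameters, approx. VRAM needed in GB)
    ("tiny", "39 M", 1),
    ("base", "74 M", 1),
    ("small", "244 M", 2),
    ("medium", "769 M", 5),
    ("large", "1550 M", 10),
]


def largest_model_that_fits(vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the largest Whisper model whose approximate VRAM need fits.

    `headroom_gb` is a hypothetical safety margin left free for audio
    buffers and other processes; it is not part of Whisper itself.
    """
    budget = vram_gb - headroom_gb
    chosen = WHISPER_MODELS[0][0]  # fall back to the smallest model
    for name, _params, needed_gb in WHISPER_MODELS:
        if needed_gb <= budget:
            chosen = name
    return chosen


# The Tesla T4 reported by nvidia-smi earlier has 15360 MiB, i.e. 15 GiB.
print(largest_model_that_fits(15))
```

Fitting in memory does not make a model the best choice: `medium`, used in this notebook, transcribes faster than `large` and may be accurate enough for clear English speech.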


### YouTube Meta Data Processing
<a id='meta'></a>

In [5]:
# select youtube video
youtube_video_url = 'https://www.youtube.com/watch?v=hVimVzgtD6w'
youtube_video = YouTube(youtube_video_url)

In [6]:
# print youtube video title from link
youtube_video.title

"The best stats you've ever seen | Hans Rosling"

In [7]:
# list all attributes and methods of the YouTube object (metadata fields included)
dir(youtube_video)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_age_restricted',
 '_author',
 '_embed_html',
 '_fmt_streams',
 '_initial_data',
 '_js',
 '_js_url',
 '_metadata',
 '_player_config_args',
 '_publish_date',
 '_title',
 '_vid_info',
 '_watch_html',
 'age_restricted',
 'allow_oauth_cache',
 'author',
 'bypass_age_gate',
 'caption_tracks',
 'captions',
 'channel_id',
 'channel_url',
 'check_availability',
 'description',
 'embed_html',
 'embed_url',
 'fmt_streams',
 'from_id',
 'initial_data',
 'js',
 'js_url',
 'keywords',
 'length',
 'metadata',
 'publish_date',
 'rating',
 'register_on_complete_callback',
 'register_on_progress_callback',
 'stream_monostate',
 'streamin
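Most of the `dir()` listing is dunder and private names. A small filter makes the public metadata fields easier to spot. This is a minimal sketch: `public_attributes` is a hypothetical helper, and `FakeVideo` is a stand-in class with arbitrary values so the example runs without a network call.

```python
def public_attributes(obj) -> list[str]:
    """Return attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class FakeVideo:
    """Hypothetical stand-in for a pytube YouTube object, for illustration only."""

    def __init__(self):
        self._title = "example"   # private, filtered out
        self.author = "someone"   # arbitrary placeholder value
        self.length = 60          # arbitrary placeholder value


print(public_attributes(FakeVideo()))
```

In the notebook, the same call would be `public_attributes(youtube_video)`.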

In [8]:
# look for audio streams / focus on audio / audio streams will be smaller to download
youtube_video.streams

[<Stream: itag="17" mime_type="video/3gpp" res="144p" fps="8fps" vcodec="mp4v.20.3" acodec="mp4a.40.2" progressive="True" type="video">, <Stream: itag="18" mime_type="video/mp4" res="360p" fps="30fps" vcodec="avc1.42001E" acodec="mp4a.40.2" progressive="True" type="video">, <Stream: itag="133" mime_type="video/mp4" res="240p" fps="30fps" vcodec="avc1.4d400d" progressive="False" type="video">, <Stream: itag="242" mime_type="video/webm" res="240p" fps="30fps" vcodec="vp9" progressive="False" type="video">, <Stream: itag="160" mime_type="video/mp4" res="144p" fps="30fps" vcodec="avc1.4d400c" progressive="False" type="video">, <Stream: itag="278" mime_type="video/webm" res="144p" fps="30fps" vcodec="vp9" progressive="False" type="video">, <Stream: itag="139" mime_type="audio/mp4" abr="48kbps" acodec="mp4a.40.5" progressive="False" type="audio">, <Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">, <Stream: itag="249" mime_type="audio

In [9]:
# run for loop to find streams
for stream in youtube_video.streams:
  print(stream)

<Stream: itag="17" mime_type="video/3gpp" res="144p" fps="8fps" vcodec="mp4v.20.3" acodec="mp4a.40.2" progressive="True" type="video">
<Stream: itag="18" mime_type="video/mp4" res="360p" fps="30fps" vcodec="avc1.42001E" acodec="mp4a.40.2" progressive="True" type="video">
<Stream: itag="133" mime_type="video/mp4" res="240p" fps="30fps" vcodec="avc1.4d400d" progressive="False" type="video">
<Stream: itag="242" mime_type="video/webm" res="240p" fps="30fps" vcodec="vp9" progressive="False" type="video">
<Stream: itag="160" mime_type="video/mp4" res="144p" fps="30fps" vcodec="avc1.4d400c" progressive="False" type="video">
<Stream: itag="278" mime_type="video/webm" res="144p" fps="30fps" vcodec="vp9" progressive="False" type="video">
<Stream: itag="139" mime_type="audio/mp4" abr="48kbps" acodec="mp4a.40.5" progressive="False" type="audio">
<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">
<Stream: itag="249" mime_type="audio/webm" ab

In [10]:
# find all audio streams
streams = youtube_video.streams.filter(only_audio=True)
streams

[<Stream: itag="139" mime_type="audio/mp4" abr="48kbps" acodec="mp4a.40.5" progressive="False" type="audio">, <Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">, <Stream: itag="249" mime_type="audio/webm" abr="50kbps" acodec="opus" progressive="False" type="audio">, <Stream: itag="250" mime_type="audio/webm" abr="70kbps" acodec="opus" progressive="False" type="audio">, <Stream: itag="251" mime_type="audio/webm" abr="160kbps" acodec="opus" progressive="False" type="audio">]

In [11]:
# .first() returns the first stream in the list (here the 48kbps audio/mp4 stream)
stream = streams.first()
stream

<Stream: itag="139" mime_type="audio/mp4" abr="48kbps" acodec="mp4a.40.5" progressive="False" type="audio">

In [12]:
# download audio stream and save name
# will download to google colab
stream.download(filename='HansRosling.mp4')

'/content/HansRosling.mp4'
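The download above used `streams.first()`, which returned the 48kbps stream. That keeps the file small and is likely fine for speech. If a higher bitrate is preferred, the stream can be chosen by its `abr` value instead. The sketch below assumes only the `itag` and `abr` fields shown in the listing; `AudioStream`, `bitrate_kbps`, and `best_audio` are hypothetical names, and the stand-in objects replace real pytube streams so the example runs offline.

```python
import re
from dataclasses import dataclass


@dataclass
class AudioStream:
    """Hypothetical stand-in carrying the two fields used here."""
    itag: int
    abr: str  # e.g. "128kbps", the format shown in the stream listing


def bitrate_kbps(abr) -> int:
    """Parse '128kbps' into 128; return 0 when no digits are found."""
    match = re.search(r"\d+", abr or "")
    return int(match.group()) if match else 0


def best_audio(streams):
    """Return the stream with the highest advertised bitrate."""
    return max(streams, key=lambda s: bitrate_kbps(s.abr))


# The five audio streams listed by the only_audio filter.
streams_seen = [
    AudioStream(139, "48kbps"),
    AudioStream(140, "128kbps"),
    AudioStream(249, "50kbps"),
    AudioStream(250, "70kbps"),
    AudioStream(251, "160kbps"),
]
print(best_audio(streams_seen).itag)
```

Since pytube stream objects expose `abr` and `itag` as well, `best_audio(streams)` should work on the filtered query from the notebook, though that is untested here.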

### Audio Stream Trimming
<a id='trim'></a>

In [13]:
# process audio using ffmpeg
# ! runs command line commands
# trim audio like slicing a beat
# -ss 23: start 23 seconds in, which is when the audio began
# -t 1193: keep the next 1193 seconds (-t is a duration, not an end time)
!ffmpeg -ss 23 -i HansRosling.mp4 -t 1193 HansRoslingTrimmed.mp4

# make sure to click refresh on file window to show new trimmed file.


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
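When the start and end points are known as absolute times, the duration passed to `-t` has to be computed. The sketch below builds the same ffmpeg command from Python. `build_trim_command` is a hypothetical helper, and the end time of 1216 seconds is an assumption chosen so that the result reproduces the command used above (23 + 1193).

```python
import shlex


def build_trim_command(src: str, dst: str, start_s: float, end_s: float) -> list[str]:
    """Build an ffmpeg argument list that keeps audio from start_s to end_s.

    -ss before -i seeks in the input; -t is a duration, so it is computed
    as end_s - start_s rather than passed as an absolute end time.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    duration = end_s - start_s
    return ["ffmpeg", "-ss", str(start_s), "-i", src, "-t", str(duration), dst]


# 1216 is an assumed end time: 23 s start + 1193 s duration.
cmd = build_trim_command("HansRosling.mp4", "HansRoslingTrimmed.mp4", 23, 1216)
print(shlex.join(cmd))
# To run it from Python instead of a `!` cell:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.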

### Transcribing Audio Stream
<a id='trans'></a>

In [14]:
# save a timestamp before transcription
t1 = datetime.datetime.now()
print(f'started at  {t1}')

# do the transcript using whisper
output = model.transcribe('HansRoslingTrimmed.mp4')

# show time elapsed after transcription is complete
t2 = datetime.datetime.now()
print(f'ended at {t2}')
print(f'time elapsed: {t2 - t1}')

started at  2023-09-14 19:15:12.562712
ended at 2023-09-14 19:18:20.614672
time elapsed: 0:03:08.051960
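The same timing pattern can be wrapped in a reusable context manager. This is a minimal sketch using only the standard library. `timed` is a hypothetical helper, and the `time.sleep` call stands in for `model.transcribe(...)` so the example runs without a model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, using a monotonic clock."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} s")


# Usage sketch; the sleep stands in for model.transcribe(...)
with timed("transcription") as timing:
    time.sleep(0.01)
```

`time.perf_counter` is unaffected by system clock adjustments, which makes it a safer choice than wall-clock timestamps for measuring durations.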


In [15]:
# pretty print the output dictionary to inspect its structure
pprint.pprint(output)

Streaming output truncated to the last 5000 lines.
                          912,
                          1605,
                          2744,
                          510,
                          294,
                          50746]},
              {'avg_logprob': -0.17825694023808347,
               'compression_ratio': 1.5320197044334976,
               'end': 344.32000000000005,
               'id': 99,
               'no_speech_prob': 0.3483344614505768,
               'seek': 32972,
               'start': 337.36,
               'temperature': 0.0,
               'text': ' Vietnam, 19, 2003 as in United States, 1974 by the '
                       'end of the war.',
               'tokens': [50746,
                          11013,
                          11,
                          1294,
                          11,
                          16416,
                          382,
                          294,
                          2824,
             

In [16]:
# access the transcribed text using the appropriate key
transcribed_text = output['text']  # 'text' is the key
print(transcribed_text)  # print the transcribed text

 About ten years ago, I took on the task to teach global development to Swedish undergraduate students. That was after having spent about 20 years together with African institutions studying hunger in Africa. So I was sort of expected to know a little about the world. And I started in our medical university, Karolinska Institute, an undergraduate course called Global Health. But when you get that opportunity, you get a little nervous. I thought, these students coming to us actually have the highest grade you can get in Swedish college system. So I thought maybe they know everything I'm going to teach them about. So I did a pre-test when they came. And one of the questions from which I learned a lot was this one. Which country has the highest child mortality of these five pairs? And I put them together so that in each pair of country, one has twice the child mortality of the other. And this means that it's much bigger the difference than the uncertainty of the data. I won't put you at a
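Besides the full `text`, the output dictionary holds a `segments` list with `start` and `end` times, as seen in the pretty-printed structure. A timestamped transcript can make it easier to find a passage in the video. The sketch below assumes only those fields; `format_timestamp` and `segments_to_lines` are hypothetical helpers, and the sample dictionary copies segment 99 from the printed output.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (minutes are not wrapped at 60)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def segments_to_lines(whisper_output: dict) -> list[str]:
    """Turn each segment into '[start -> end] text'."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in whisper_output["segments"]
    ]


# A one-segment dictionary copied from the fields printed by pprint.
sample_output = {
    "segments": [
        {
            "id": 99,
            "start": 337.36,
            "end": 344.32000000000005,
            "text": " Vietnam, 19, 2003 as in United States, 1974 by the end of the war.",
        }
    ]
}
print("\n".join(segments_to_lines(sample_output)))
```

Because the audio was trimmed before transcription, these times are relative to the trimmed file, not to the original video.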

In [17]:
# Extract and concatenate the transcribed text
result = " ".join([segment['text'] for segment in output['segments']])

# Print the length of the concatenated text
print(f"Length of transcribed text: {len(result)}")

# Print the transcribed text (uncomment the line below to print the text)
# print(result)


Length of transcribed text: 17564
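Each Whisper segment's text begins with a space (for example `' Vietnam, ...'`), so joining with `" "` leaves double spaces between segments, which are visible in the chunk inputs printed later. A sketch of a join that strips each piece first is shown below. `join_segments` is a hypothetical helper and the two demo segments are shortened for illustration.

```python
def join_segments(segments) -> str:
    """Join segment texts with single spaces, dropping empty pieces."""
    pieces = (seg["text"].strip() for seg in segments)
    return " ".join(piece for piece in pieces if piece)


# Hypothetical two-segment example; real input would be output['segments'].
demo_segments = [
    {"text": " About ten years ago, I took on the task"},
    {"text": " to teach global development."},
]
print(join_segments(demo_segments))
```

Using this in place of the join above would give a slightly smaller character count than the 17564 reported, since the extra spaces are removed.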


### Summarize Transcribed Text
<a id='sum'></a>

In [18]:
# initialize hugging face pipeline
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [19]:
# run chunks through the hugging face pipeline, 1000 characters at a time
num_iters = int(len(result)/1000)
summarized_text = []
for i in range(0, num_iters + 1):
  start = i * 1000
  end = (i + 1) * 1000
  print("input text \n" + result[start:end])
  out = summarizer(result[start:end])
  out = out[0]
  out = out['summary_text']
  print("Summarized text\n"+out)
  summarized_text.append(out)

#print(summarized_text)

input text 
 About ten years ago, I took on the task to teach global development to Swedish undergraduate  students.  That was after having spent about 20 years together with African institutions studying  hunger in Africa.  So I was sort of expected to know a little about the world.  And I started in our medical university, Karolinska Institute, an undergraduate course called  Global Health.  But when you get that opportunity, you get a little nervous.  I thought, these students coming to us actually have the highest grade you can get in Swedish  college system.  So I thought maybe they know everything I'm going to teach them about.  So I did a pre-test when they came.  And one of the questions from which I learned a lot was this one.  Which country has the highest child mortality of these five pairs?  And I put them together so that in each pair of country, one has twice the child mortality  of the other.  And this means that it's much bigger the difference than the uncertainty of th

Your max_length is set to 142, but your input_length is only 140. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=70)


Summarized text
 The number of internet users are going up like this. s not so equal any longer.  And it's appearing here overlooking the United States almost like a ghost, isn't it?  It's pretty scary.  But I think it's very important to have all this information.  We need really to see it.  Instead of looking at this, I would like to end up by showing the internet users per  1000 .
input text 
his is the GDP per capita.  And it's a new technology coming in.  But amazingly how well it fits to the economy of the countries.  That's why the $100 computer will be so important.  But it's a nice tendency.  It's as if the world is flattening off, isn't it?  These countries are lifting more than the economy.  And it will be very interesting to follow this over the year as I would like you to  be able to do with all the publicly funded data.  Thank you very much.  What if great ideas weren't cherished?  What if they carried no importance?  Or held no value?
Summarized text
 It's as if the worl
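Fixed 1000-character slices can cut a chunk in the middle of a word or sentence, as the input texts above show. One alternative is to group whole sentences until a size limit is reached. The sketch below is an assumption-laden simplification: `chunk_by_sentence` is a hypothetical helper, and its regular expression treats every `.`, `!`, or `?` followed by whitespace as a sentence end, so abbreviations will be split too.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = "One two three. Four five six! Seven eight nine? Ten."
for chunk in chunk_by_sentence(demo, max_chars=30):
    print(repr(chunk))
```

Each returned chunk could then be passed to `summarizer(...)` in place of `result[start:end]`.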

In [20]:
# print length of summarized text
len(str(summarized_text))

6072

In [21]:
# print summarized text
print(summarized_text)

[' Swedish professor teaches global development to students in his university\'s Global Health course . He says one of the questions from which he learned a lot was which country has the highest child mortality of these five pairs . "In each pair of countries, one has twice the child mortality  of the other," he says .', ' Swedish students know statistically less about the world than chimpanzees, professor says . Turkey, Poland, Russia, Pakistan, and South Africa are the highest countries in the world, he says . Professor: "The problem for me was not ignorance.  It was preconceived ideas.  I did also an unethical study of the professors of the Karolinska Institute"', ' Every bubble here is a country.  This country over here is China. This is India.  The size of the bubble is the population.  And on this axis here, I put fertility rate. r with the chimpanzee there.  So this is where I realized that there was really a need to communicate because the data .', ' We have very good data sinc

### Example Of Shorter Summarized Text
<a id='short'></a>

In [22]:
chunk_size = 1500  # change this to your desired chunk size

In [23]:
# define the maximum length for summaries
max_summary_length = 100  # adjust this value as needed

In [24]:
# changing the chunk size and max length to decrease the summary output

# create a list to store the summarized text
summarized_text = []

for i in range(0, len(result), chunk_size):
    start = i
    end = i + chunk_size
    input_text = result[start:end]

    print("Input text:\n" + input_text)

    # generate a shorter summary by limiting the max_length
    out = summarizer(input_text, max_length=max_summary_length, min_length=10)  # adjust min_length as needed
    out = out[0]
    out = out['summary_text']

    print("Summarized text:\n" + out)

    summarized_text.append(out)

Input text:
 About ten years ago, I took on the task to teach global development to Swedish undergraduate  students.  That was after having spent about 20 years together with African institutions studying  hunger in Africa.  So I was sort of expected to know a little about the world.  And I started in our medical university, Karolinska Institute, an undergraduate course called  Global Health.  But when you get that opportunity, you get a little nervous.  I thought, these students coming to us actually have the highest grade you can get in Swedish  college system.  So I thought maybe they know everything I'm going to teach them about.  So I did a pre-test when they came.  And one of the questions from which I learned a lot was this one.  Which country has the highest child mortality of these five pairs?  And I put them together so that in each pair of country, one has twice the child mortality  of the other.  And this means that it's much bigger the difference than the uncertainty of th

In [25]:
summarized_text_shorter = summarized_text

In [26]:
# print length of summarized text
len(str(summarized_text_shorter))

3437

In [27]:
# print shorter summarized text
print(summarized_text_shorter)

[' Professor of international health at Karolinska Institute teaches global development to Swedish students . He says one of the questions from which he learned a lot was which country has the highest child mortality of these five pairs . Turkey, Poland, Russia, Pakistan, and South Africa have the highest rates of child mortality .', ' I have shown that Swedish top students know statistically less about the world than chimpanzees . The problem for me was not ignorance.  It was preconceived ideas.  I did also an unethical study of the professors of the Karolinska Institute that hands out the Nobel Prize in medicine, and they are on par with the chimpanzee there.', ' China is moving against better health, improving their, all the green Latin American countries are moving towards smaller families . Life expectancy at birth at birth from 30 years in some countries up to about 70 years .', ' The data during the war indicate that even with all the death, there was an improvement  of life exp
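The printed list is hard to read as prose, and the model's output carries a space before some periods and double spaces between sentences. A small post-processing step can join the pieces into one paragraph. This is a minimal sketch: `tidy_summary` is a hypothetical helper and the two demo strings are abbreviated from the summaries above.

```python
import re


def tidy_summary(parts: list[str]) -> str:
    """Join chunk summaries and remove the space the model leaves before periods."""
    text = " ".join(part.strip() for part in parts)
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # "students ." -> "students."
    return re.sub(r"\s{2,}", " ", text)         # collapse repeated spaces


# Abbreviated examples in the style of the summaries printed above.
demo_parts = [
    " Swedish professor teaches global development to students .",
    " The problem for me was not ignorance.  It was preconceived ideas.",
]
print(tidy_summary(demo_parts))
```

In the notebook, `tidy_summary(summarized_text_shorter)` would produce a single readable paragraph.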

### Conclusion
<a id='conclusion'></a>

This machine learning project demonstrates the power of combining various Python libraries, including Hugging Face Transformers and Whisper ASR, to automate the summarization of YouTube video transcripts. By following the outlined steps, one can easily extract key information from videos, making it a valuable tool for educational and research purposes.

The approach is flexible because the summarization step is easy to tune. The chunk size controls how much text is passed to the model at once, and therefore the granularity of the result. Larger chunks mean fewer calls to the summarizer, and since each call returns one summary capped by the max length, the combined output tends to be shorter; smaller chunks produce more summaries and a longer, more detailed result. In this notebook, moving from 1,000-character chunks to 1,500-character chunks with a max length of 100 reduced the printed summary list from 6072 to 3437 characters.
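The effect of chunk size on the number of summaries is simple arithmetic, sketched below for the transcript length reported earlier. `chunk_count` is a hypothetical helper name.

```python
import math


def chunk_count(text_length: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover text_length characters."""
    return math.ceil(text_length / chunk_size)


transcript_length = 17564  # value printed earlier in this notebook
for size in (1000, 1500):
    print(size, chunk_count(transcript_length, size))
```

With one summary per chunk, the 1,000-character setting produces 18 summaries and the 1,500-character setting produces 12.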

Additionally, the max length parameter plays a crucial role in determining the length of the summarization output. Adjusting this parameter can result in either shorter or longer summaries. It's essential to strike a balance between brevity and informativeness to meet specific requirements.
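The pipeline warned earlier that `max_length` (142) exceeded the input length (140) for a short final chunk and suggested roughly half the input length instead. One way to avoid that is to scale the bounds to each chunk. The sketch below is an illustration only: `summary_length_bounds` and its `ratio`, `floor`, and `cap` parameters are hypothetical tuning knobs, not part of the Transformers API.

```python
def summary_length_bounds(input_tokens: int, ratio: float = 0.5,
                          floor: int = 10, cap: int = 142) -> tuple[int, int]:
    """Return (min_length, max_length) scaled to the size of the input.

    ratio, floor, and cap are hypothetical tuning knobs; cap=142 mirrors the
    default max_length reported in the pipeline warning.
    """
    max_length = max(floor + 1, min(cap, int(input_tokens * ratio)))
    min_length = min(floor, max_length - 1)
    return min_length, max_length


print(summary_length_bounds(140))   # a short final chunk
print(summary_length_bounds(1000))  # a long chunk hits the cap
```

The returned pair could be passed as `summarizer(text, min_length=lo, max_length=hi)`. The token count would have to come from the pipeline's tokenizer, for example `len(summarizer.tokenizer(text)["input_ids"])`, which is an assumption about how one might wire this in.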

In summary, this project not only showcases the automation of transcript summarization but also highlights the importance of parameter tuning in tailoring the summarization process to your needs, ultimately enhancing its utility for educational and research endeavors.