<a href="https://colab.research.google.com/github/DhanushKarthikeyaAJ/Youtube_Summariser/blob/main/Youtube_summariser_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**NLP (Natural Language Processing)** is rapidly advancing and finding applications everywhere, from spam filters to medical chatbots. One popular use case is text summarization, which can save time by providing a quick overview of content.

Imagine getting a brief summary of a lengthy YouTube tutorial or video before watching it—this would give you a quick understanding and save time. This project, a *YouTube Summarizer*, aims to do just that by summarizing the content (subtitles) of YouTube videos. Often, only 50-60% of a video's length contains the main content, so our summarizer will capture the key points to make the information concise and easy to understand.

Summarization in NLP involves creating a short version of a large text while retaining essential information.


---



There are two types: **extractive**, which pulls key sentences from the text, and **abstractive**, which generates a summary in its own words. For our YouTube Summarizer, we use extractive summarization techniques, such as **BART** (Bidirectional and Auto-Regressive Transformer) and **TFIDF Vectorizer**(term frequency-inverse document frequency), to efficiently capture important details from the video content. This tool can be especially useful for summarizing lecture videos.

In [2]:
!pip install youtube_transcript_api

Collecting youtube_transcript_api
  Downloading youtube_transcript_api-0.6.2-py3-none-any.whl.metadata (15 kB)
Downloading youtube_transcript_api-0.6.2-py3-none-any.whl (24 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-0.6.2


In [3]:
import youtube_transcript_api
from youtube_transcript_api import YouTubeTranscriptApi
import nltk
import re
from nltk.corpus import stopwords
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
link = "https://www.youtube.com/watch?v=I4EWvMFj37g"
unique_id = link.split("=")[-1]
sub = YouTubeTranscriptApi.get_transcript(unique_id)
subtitle = " ".join([x['text'] for x in sub])

In [11]:
!pip install nltk
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
subtitle = subtitle.replace("n","")
sentences = sent_tokenize(subtitle)

In [25]:
subtitle

"bash a commad laguage iterpreter for iteractig with a computer from the commad lie it's also called a shell because it surrouds the operatig system kerel to hide its itricate details while allowig you the programmer to do importat stuff like access data ad write files by typig simple commads this was a revolutioary cocept whe it was developed i the early 70s back whe programmers were still usig puch cards the shell cocept evolved over the years with the boure shell beig the most popular versio that is util 1989 whe the bor-agai shell or bash came about whe you ope up the termial o a uix machie like mac os ad most liux distros the default shell is usually bash it provides a prompt where you ca type a commad which will the be iterpreted by the shell ad executed o the operatig system to fid out if you're ruig bash type i which dollar sig shell from the commad lie it's like ay other applicatio that lives i the biaries directory but bash is also a programmig laguage that allows us to write

In [13]:
organized_sent = {k:v for v,k in enumerate(sentences)}

In [16]:
tf_idf = TfidfVectorizer(min_df=1,
                                    strip_accents='unicode',
                                    max_features=None,
                                    lowercase = True,
                                    token_pattern=r'w{1,}',
                                    ngram_range=(1, 3),
                                    use_idf=1,
                                    smooth_idf=1,
                                    sublinear_tf=1,
                                    stop_words = 'english')

In [18]:
import numpy as np
sentence_vectors = tf_idf.fit_transform(sentences)
sent_scores = np.array(sentence_vectors.sum(axis=1)).ravel()



In [19]:
N = 3
top_n_sentences = [sentences[index] for index in np.argsort(sent_scores, axis=0)[::-1][:N]]

In [20]:
# mapping the scored sentences with their indexes as in the subtitle
mapped_sentences = [(sentence,organized_sent[sentence]) for sentence in top_n_sentences]
# Ordering the top-n sentences in their original order
mapped_sentences = sorted(mapped_sentences, key = lambda x: x[1])
ordered_sentences = [element[0] for element in mapped_sentences]
# joining the ordered sentence
summary = " ".join(ordered_sentences)

In [26]:
summary

"bash a commad laguage iterpreter for iteractig with a computer from the commad lie it's also called a shell because it surrouds the operatig system kerel to hide its itricate details while allowig you the programmer to do importat stuff like access data ad write files by typig simple commads this was a revolutioary cocept whe it was developed i the early 70s back whe programmers were still usig puch cards the shell cocept evolved over the years with the boure shell beig the most popular versio that is util 1989 whe the bor-agai shell or bash came about whe you ope up the termial o a uix machie like mac os ad most liux distros the default shell is usually bash it provides a prompt where you ca type a commad which will the be iterpreted by the shell ad executed o the operatig system to fid out if you're ruig bash type i which dollar sig shell from the commad lie it's like ay other applicatio that lives i the biaries directory but bash is also a programmig laguage that allows us to write

In [28]:
#output summarization using the Bart Summarization model.
!pip install transformers
import transformers
from transformers import BartTokenizer, BartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
input_tensor = tokenizer.encode( subtitle, return_tensors="pt", max_length=512)
outputs_tensor = model.generate(input_tensor, max_length=160, min_length=120, length_penalty=2.0, num_beams=4, early_stopping=True)
outputs_tensor



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


tensor([[    2,     0, 44551,    10,  7034,   625,   784, 11993,  1580, 41393,
          5234,  1334,    13, 41393,  7257,  1023,    19,    10,  3034,    31,
             5,  7034,   625,  6105,    24,    18,    67,   373,    10, 10785,
           142,    24,  8113, 22552,    29,     5, 19470,   415,  1023,   467,
          7321,  5982,     7,  7433,    63,    24,  4063,   877,  1254,   150,
          1157,  1023,    47,     5, 38988,     7,   109,  6595,   415,  2682,
           101,   899,   414,     4,    20,  6814, 10785,    16,  2333, 12882,
            24,  1639,    10, 14302,   147,    47,  6056,  1907,    10,  7034,
          7842,    61,    40,     5,    28, 41393, 42354,   196,    30,     5,
         10785,  2329,  9390,  1021,     5, 19470,    23,  1023,   467,     4,
            20, 10785,  1029, 16771, 12236,    81,     5,   107,    19,     5,
         20437,   241, 10785,    28,  1023,     5,   144,  1406, 19431,  1020,
            14,    16, 14258, 10206,     4,     2]])

In [30]:
print(tokenizer.decode(outputs_tensor[0])) #summary

</s><s>bash a commad laguage iterpreter for iteractig with a computer from the commad lie it's also called a shell because it surrouds the operatig system kerel to hide its itricate details while allowig you the programmer to do importat stuff like access data. The default shell is usually bash it provides a prompt where you ca type a commads which will the be iterpreted by the shell ad executed o the oper atig system. The shell cocept evolved over the years with the boure shell beig the most popular versio that is util 1989.</s>




---



In [32]:
#summarization using transformers’ summarization pipeline
from transformers import pipeline
summarizer = pipeline('summarization')
summary = summarizer(subtitle, max_length = 180, min_length =  30)
summary

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': " Bash a commad laguage iterpreter for iteractig with a computer from the commad lie it's also called a shell because it surrouds the operatig system kerel to hide its details while allowig you the programmer to do importat stuff like access data ad write files by typig simple commads . The default shell is usually bash it provides a prompt where you ca type a . commad which will the be iterpreted by the shell ad executed o the operating system . The shell cocept evolved over the years with the boure shell being the most popular versio that is util 1989 ."}]

This is a basic implementation of a summarization model using transformers. In addition to BART, other models like** DistilBART, BERT, and GPT-2** can also provide accurate summaries.

The YouTube summarizer aims to deliver a concise summary of video content, which is particularly helpful for *videos with subtitles*. This tool can significantly save time by offering a quick overview of key points before watching the full video. I hope you find this **mini NLP project** interesting and useful!