# YouTube Summarizer (Gen AI Project)

YouTube video summarizers are tools that condense long videos into shorter, more manageable summaries. They extract key points and essential information from the video, providing a concise overview without the need to watch the entire content.

In [None]:
#Installing required packages
!pip install transformers youtube-transcript-api sentencepiece



In [None]:
#Importing Requirements
import pandas as pd
from youtube_transcript_api import YouTubeTranscriptApi as YTapi
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

## Exploring API <a class="anchor" id="ExploringAPI"></a>

In [None]:
youtube_link = "https://www.youtube.com/watch?v=FM7Z-Xq8Drc"

In [None]:
def videoID(link):
    # Take the value after "v=" and drop any trailing parameters such as "&t=10s"
    video_id = link.split("v=")[1].split("&")[0]
    return video_id

In [None]:
id = videoID(youtube_link)
id

'FM7Z-Xq8Drc'
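
`videoID` assumes a standard `watch?v=` link. As a sketch of a more general approach (not part of the original notebook), Python's standard `urllib.parse` can also handle short `youtu.be` links and extra query parameters. The function name `extract_video_id` and the set of URL shapes it covers are assumptions made for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(link):
    """Return the video ID from a few common YouTube URL shapes, or None."""
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host.endswith("youtu.be"):
        return parsed.path.lstrip("/").split("/")[0] or None

    # Standard links: https://www.youtube.com/watch?v=<id>&t=10s
    if parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    # Embedded players and Shorts: /embed/<id>, /shorts/<id>
    for prefix in ("/embed/", "/shorts/"):
        if parsed.path.startswith(prefix):
            return parsed.path[len(prefix):].split("/")[0] or None

    return None


print(extract_video_id("https://www.youtube.com/watch?v=FM7Z-Xq8Drc&t=10s"))
```

Returning `None` for unrecognized links lets the caller decide how to report a bad URL instead of failing with an `IndexError`.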

In [None]:
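def GetTranscript(video_id):
    # Fall back to an empty string so the function always returns a value
    FinalTranscript = ""
    try:
        transcript = YTapi.get_transcript(video_id)
        FinalTranscript = ' '.join([i['text'] for i in transcript])
    except Exception as e:
        # Report the failure (e.g. no transcript available) instead of raising
        print(e)

    return FinalTranscript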

In [None]:
transcript_en = GetTranscript(id)
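
The English transcript printed later in this notebook still contains `\n` line breaks inside individual caption entries. A minimal sketch of a cleanup step follows, assuming each entry is a dict with a `text` key as used in `GetTranscript`. The helper name `join_transcript` is hypothetical, and the `start` and `duration` values in the sample are made up for illustration; only the text comes from the transcript shown in this notebook.

```python
import re


def join_transcript(entries):
    """Join caption entries into one string and collapse all whitespace runs."""
    text = " ".join(entry["text"] for entry in entries)
    # Replace newlines, tabs and repeated spaces with a single space
    return re.sub(r"\s+", " ", text).strip()


# Sample entries shaped like the API output; timing values are illustrative only
sample_entries = [
    {"text": "- Behind me are 100 people.", "start": 0.0, "duration": 2.0},
    {"text": "And I've trapped each of them\nin their very own glass cube.",
     "start": 2.0, "duration": 3.0},
]

print(join_transcript(sample_entries))
```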

YouTube's auto-translate feature can translate subtitles automatically. Using it takes three steps:

1. Retrieve the available transcripts
2. Iterate over all available transcripts
3. Translate the chosen transcript

In [None]:
transcript_list  = YTapi.list_transcripts(id)

for transcript in transcript_list:
    ln = transcript.language
    check = transcript.is_translatable
    print(ln, check)

Arabic True
Chinese True
English True
French True
German True
Hindi True
Italian True
Japanese True
Korean True
Polish True
Portuguese True
Russian True
Spanish True
Turkish True
English (auto-generated) True


The transcript for this YouTube video is available in multiple languages, and I'm choosing Hindi. To check all the languages a transcript can be translated into, run the following code:

In [None]:
# After the loop, `available_ln` holds the translation targets of the last transcript
for transcript in transcript_list:
    available_ln = transcript.translation_languages
print(available_ln)

[{'language': 'Afrikaans', 'language_code': 'af'}, {'language': 'Akan', 'language_code': 'ak'}, {'language': 'Albanian', 'language_code': 'sq'}, {'language': 'Amharic', 'language_code': 'am'}, {'language': 'Arabic', 'language_code': 'ar'}, {'language': 'Armenian', 'language_code': 'hy'}, {'language': 'Assamese', 'language_code': 'as'}, {'language': 'Aymara', 'language_code': 'ay'}, {'language': 'Azerbaijani', 'language_code': 'az'}, {'language': 'Bangla', 'language_code': 'bn'}, {'language': 'Basque', 'language_code': 'eu'}, {'language': 'Belarusian', 'language_code': 'be'}, {'language': 'Bhojpuri', 'language_code': 'bho'}, {'language': 'Bosnian', 'language_code': 'bs'}, {'language': 'Bulgarian', 'language_code': 'bg'}, {'language': 'Burmese', 'language_code': 'my'}, {'language': 'Catalan', 'language_code': 'ca'}, {'language': 'Cebuano', 'language_code': 'ceb'}, {'language': 'Chinese (Simplified)', 'language_code': 'zh-Hans'}, {'language': 'Chinese (Traditional)', 'language_code': 'zh-

In [None]:
# `transcript` is the last entry left over from the loop above; translate it to Hindi ('hi')
transcript_hi = ' '.join([i['text'] for i in transcript.translate('hi').fetch()])
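
Before calling `translate`, it can help to confirm that the target language is actually offered. The sketch below assumes `translation_languages` is a list of dicts with `language` and `language_code` keys, as in the output printed above; other versions of the library may return a different shape. The helper name `find_language_code` is hypothetical, and the sample list reuses the first three entries of that output.

```python
def find_language_code(translation_languages, wanted):
    """Return the language code matching `wanted` (a code or a name), or None."""
    wanted = wanted.strip().lower()
    for entry in translation_languages:
        if wanted in (entry["language_code"].lower(), entry["language"].lower()):
            return entry["language_code"]
    return None


# First three entries from the output above
sample = [
    {'language': 'Afrikaans', 'language_code': 'af'},
    {'language': 'Akan', 'language_code': 'ak'},
    {'language': 'Albanian', 'language_code': 'sq'},
]

print(find_language_code(sample, "Albanian"))  # sq
print(find_language_code(sample, "xx"))        # None
```

With the real list, a call such as `find_language_code(available_ln, "hi")` could guard the `transcript.translate('hi')` step.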

## Printing only the first 300 characters of both the Hindi and English transcripts

In [None]:
#English transcript of the YouTube video (first 300 characters)
transcript_en[:300]

"- Behind me are 100 people. And they range from the age one all the way through age 100. And I've trapped each of them\nin their very own glass cube. The last one to leave\ntheir cube is going to win half a million dollars. The challenge has officially begun. Let's see which age is the best. - So, I'm"

In [None]:
#Hindi transcript of the YouTube video (first 300 characters)
transcript_hi[:300]

'मेरे पीछे 100 लोग हैं और उनमें एक वर्ष से लेकर 100 वर्ष की आयु तक के लोग हैं और मैंने उनमें से प्रत्येक को अपने स्वयं के ग्लास क्यूब में फंसा लिया है, जो अंतिम व्यक्ति अपना क्यूब छोड़ेगा वह आधा मिलियन डॉलर जीतने जा रहा है, यह चुनौती है आधिकारिक तौर पर शुरू हो गया है, आइए देखें कि कौन सी उम्र सबसे अच'

## Applying Models <a class="anchor" id="Models"></a>

Applying `Pegasus` and `Bart` (the DistilBART checkpoint) to the English transcript, and `mT5` and `IndicBART` to the Hindi transcript.

In [None]:
checkpoint1 = "google/pegasus-large"
checkpoint2 = "csebuetnlp/mT5_multilingual_XLSum"
checkpoint3 = "sshleifer/distilbart-cnn-12-6"
checkpoint4 = "ai4bharat/IndicBART"

#### Downloading Models

In [None]:
tokenizer1 = AutoTokenizer.from_pretrained(checkpoint1)
model1 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint1)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokenizer2 = AutoTokenizer.from_pretrained(checkpoint2)
model2 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint2)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
tokenizer3 = AutoTokenizer.from_pretrained(checkpoint3)
model3 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint3)

In [None]:
tokenizer4 = AutoTokenizer.from_pretrained(checkpoint4)
model4 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint4)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def summarize(tokenizer, model, text):

    """
    Summarize the text (transcript) with the given tokenizer and model.
    Input longer than 1024 tokens is truncated; returns a list of decoded summaries.
    """
    inputs = tokenizer(text,
                       max_length=1024,
                       truncation=True,
                       return_tensors="pt")

    summary_ids = model.generate(inputs["input_ids"])
    summary = tokenizer.batch_decode(summary_ids,
                                     skip_special_tokens=True,
                                     clean_up_tokenization_spaces=False)
    return summary
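
Because `summarize` truncates its input to 1024 tokens, only the opening part of a long transcript reaches the model. One common workaround, which this notebook does not use, is to split the transcript into overlapping chunks and summarize each chunk separately. The sketch below splits on words as a rough stand-in for tokens; the helper name `chunk_words` and the default sizes are assumptions chosen for illustration.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into chunks of at most `max_words` words with `overlap` shared words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks


demo = " ".join(str(i) for i in range(10))
print(chunk_words(demo, max_words=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk could then be passed to `summarize` and the partial summaries joined. Word counts only approximate token counts, so the chunk size would need tuning per tokenizer.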

**English Summary**

In [None]:
pegasus = summarize(tokenizer1, model1, transcript_en)

In [None]:
pegasus[0]

"And they range from the age one all the way through age 100. The last one to leave their cube is going to win half a million dollars. I've got the best view and the best motivation. Keep in mind, I have over 300 cameras set up literally everywhere, so we'll be monitoring their every move. Keep in mind, I have over 300 cameras set up literally everywhere, so we'll be monitoring their every move. - My strategy is to make a lot of noise, so people like you don't get any sleep and they really wanna get out. We're a couple hours into the challenge and 10 people have already gotten out. So, we decide to leave them alone for the rest of the day, which caused even more people to get out."

In [None]:
bart = summarize(tokenizer3, model3, transcript_en)

In [None]:
bart[0]

' 100 people have been trapped in a glass cube for 24 hours in a bid to win half a million dollars . The last one to leave their glass cube is going to win $500,000 . Some of the kids had a crazy strategy to make a lot of noise to get out of the cube . Some draw on walls with deodorant, while others build pillow forts .'
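
The Pegasus summary above repeats one sentence verbatim ("Keep in mind, I have over 300 cameras..."). A simple post-processing step, not part of the original notebook, can drop exact repeats while keeping sentence order. The helper name `drop_repeated_sentences` is hypothetical, and the regex-based sentence split is a rough assumption that will not handle every abbreviation or quotation.

```python
import re


def drop_repeated_sentences(summary):
    """Remove sentences that repeat an earlier sentence exactly (case-insensitive)."""
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


print(drop_repeated_sentences("A b. C d. C d. E f."))
# A b. C d. E f.
```

It could be applied as `drop_repeated_sentences(pegasus[0])`.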

**Hindi Summary**

In [None]:
mt5 = summarize(tokenizer2, model2, transcript_hi)

In [None]:
mt5[0]

'मैं एक साल से भी अधिक समय तक अकेला रहना चाहता हूं और यह मेरे लिए एक चुनौती है.'

In [None]:
HiBart = summarize(tokenizer4, model4, transcript_hi)



In [None]:
HiBart[0]

'मुंह मेरे पीछे 100 लोग हैं और उनमें एक वर्ष से लेकर 100 वर्ष की आयु तक के'
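
Both Hindi summaries are short, and the IndicBART one stops mid-sentence. That would be consistent with `generate` running under a small default length limit, although this notebook does not confirm the cause. The sketch below is a variant of `summarize` that passes explicit generation settings. `max_new_tokens`, `num_beams`, and `no_repeat_ngram_size` are standard `generate` arguments in recent `transformers` releases, but the values chosen here are untested assumptions, and the function name `summarize_with_options` is hypothetical.

```python
def summarize_with_options(tokenizer, model, text,
                           max_input_tokens=1024,
                           max_new_tokens=128,
                           num_beams=4):
    """Summarize `text` with explicit length and beam-search settings."""
    inputs = tokenizer(text,
                       max_length=max_input_tokens,
                       truncation=True,
                       return_tensors="pt")

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,  # upper bound on generated summary length
        num_beams=num_beams,            # beam search instead of greedy decoding
        no_repeat_ngram_size=3,         # discourage repeated phrases
    )
    # Return the first (and only) decoded summary as a string
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

A call such as `summarize_with_options(tokenizer2, model2, transcript_hi)` would be the counterpart of the `mt5` cell. Some checkpoints need extra model-specific inputs, so their model cards are worth checking before relying on the output.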

-----