<a href="https://colab.research.google.com/github/NBar05/youtube_summarizer/blob/main/2022_10_17_simple_strategies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Example for simple tests

Yannic Kilcher: Did Google's LaMDA chatbot just become sentient?

In [1]:
unique_id = 'mIZLGBD99iU'
link = f'https://www.youtube.com/watch?v={unique_id}'

## Transcript summary

This baseline is based on:

1. Transcript extraction via YouTubeTranscriptApi
1. Summarization with pretrained BART

### Imports

In [2]:
!pip3 -q install youtube_transcript_api
!pip3 -q install transformers

In [3]:
from youtube_transcript_api import YouTubeTranscriptApi

import transformers
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline

### Subtitles extraction

In [4]:
subtitles_by_intervals = YouTubeTranscriptApi.get_transcript(unique_id, languages=['en', 'ru'])  
subtitles = " ".join([x['text'] for x in subtitles_by_intervals])
subtitles[:1000]

"google engineer put on leave after saying ai chatbot has become sentient this at least according to this guardian article right here blake lemmon who is an engineer at google has been put on leave because of sharing proprietary information that proprietary information is an interview that he and a collaborator have conducted with google's new lambda chatbot system so the story here is that blake who was tasked to test this new lambda system for bias inherent discrimination and things like this because obviously if google wants to release this model or give people access to the model they want to make sure that it doesn't do any kind of bad stuff so blake was tasked to figure out you know in what way the model could express such bad stuff but in the course of this he conducted many interviews with the model or what he calls interviews which is prompt and response sessions and he became convinced that this model was actually sentient that it was essentially a real person and he became a

Possible problems:

1. Some videos have no subtitles.
2. A multilingual model is needed for postprocessing (unless we narrow down the target videos).
3. The subtitles have no punctuation (see below).
4. Transcript length varies (but a good neural summarizer should deal with it).
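
On the length point: summarization models such as BART accept only a bounded number of input tokens, so a long transcript has to be truncated or split. A minimal sketch of the splitting option follows. The names `chunk_words` and `summarize_long` are hypothetical, `summarize_fn` stands for any callable that maps a string to its summary, and the word count is only a rough proxy for the token count.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def summarize_long(text, summarize_fn, max_words=400, overlap=50):
    """Summarize every chunk with summarize_fn (assumed: str -> str) and join the results."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words, overlap))
```

In this notebook, `summarize_fn` could wrap a call to a summarization pipeline. The overlap keeps a little shared context between neighbouring chunks, at the price of some repeated content in the joined summary.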

In [5]:
import string
print(string.punctuation, ''.join([p for p in string.punctuation if p in subtitles]), sep='\t')

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~	'-


### Model download

In [6]:
model_name = 'facebook/bart-large-cnn' # can be any model which satisfies quality and inference constraints

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

### Inference

In [7]:
input_tensor = tokenizer.encode(
    subtitles, 
    return_tensors="pt", max_length=1024, truncation=True
)
outputs_tensor = model.generate(
    input_tensor, 
    min_length=120, max_length=240, 
    num_beams=4, length_penalty=2.0, early_stopping=True, 
)
outputs_tensor.shape

torch.Size([1, 131])

In [8]:
tokenizer.decode(outputs_tensor[0])

"</s><s>Google engineer put on leave after saying ai chatbot has become sentient this at least according to this guardian article right here blake lemmon who is an engineer at google has been put onleave because of sharing proprietary information that proprietary information is an interview that he and a collaborator have conducted with google's new lambda chatbot system. He became convinced that this model was actually sentient that it was essentially a real person and he became an advocate for the model to get what it wants now after bringing up his concerns to google management according to him he was quickly dismissed and therefore decided to go public and here we are he released two medium articles.</s>"

### Short version

In [9]:
summarizer = pipeline("summarization", model=model_name, tokenizer=model_name, framework='pt') # will be used in audio section as well
# tokenizer_kwargs = {'truncation': True, 'max_length': 512}

summarizer(
    subtitles,
    # tokenizer params
    # max_length=1024, truncation=True,
    # model params
    # min_length=120, max_length=240, 
    # num_beams=4, length_penalty=2.0, early_stopping=True,
    truncation=True, return_text=True, return_tensors=False
)[0]['summary_text']

"Google engineer put on leave after saying ai chatbot has become sentient this at least according to this guardian article right here blake lemmon who is an engineer at google has been put on. leave because of sharing proprietary information that proprietary information is an interview that he and a collaborator have conducted with google's new lambda chatbot system."

One limitation: it is not clear how to pass additional parameters to the model and the tokenizer separately through the pipeline call.

(Both can take parameters with the same names, e.g. `max_length`, so almost no extra parameters are passed for now.)

## Audio summary

This baseline is based on:

1. Audio extraction via youtube_dl
1. Transcription with a pretrained audio-to-text model
1. Summarization with pretrained BART

### Imports

In [10]:
!pip3 -q install youtube-dl

In [11]:
import youtube_dl

ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
        'preferredquality': '192',
    }],
    'outtmpl': './video.%(ext)s',
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([link])

[youtube] mIZLGBD99iU: Downloading webpage
[download] Destination: ./video.webm
[download] 100% of 23.24MiB in 04:56
[ffmpeg] Destination: ./video.wav
Deleting original file ./video.webm (pass -k to keep)


The next code cell can be used to play the audio.

In [12]:
# from IPython.display import Audio 
# import librosa 

# sampling_rate = 16_000
# speech, rate = librosa.load("video.wav", sr=sampling_rate)  # resample to 16 kHz on load

# Audio(speech, rate=rate)

### Model download

In [20]:
# speech to text model
model_name = 'facebook/wav2vec2-base-960h'
speech_to_text_model = pipeline("automatic-speech-recognition", model=model_name)

absolute_path = "video.wav" # file name of your downloaded audio
text = speech_to_text_model(absolute_path, chunk_length_s=10) 

# # save text
# with open("original_text.txt", "w") as f:
#     n = f.write(text["text"])
# # read article
# with open("original_text.txt", "r") as f:
#     text_article = f.read()

print(len(text['text'].split()))
text = text['text'].lower()

4189


Possible problems:

- Processing currently takes a very long time
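
To get a feel for the workload before running the model, the sketch below reads the duration of a WAV file with the standard `wave` module and estimates how many fixed-length windows it would be split into. The helper names are hypothetical, the file is assumed to be plain PCM WAV, and the estimate ignores any overlap the pipeline may add between chunks.

```python
import math
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def estimate_chunks(duration_s, chunk_length_s=10):
    """Estimate how many fixed-length chunks are needed to cover the audio."""
    return math.ceil(duration_s / chunk_length_s)
```

For example, `estimate_chunks(wav_duration_seconds("video.wav"), chunk_length_s=10)` gives a rough count of the 10-second windows used in the cell above.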

### Summary of text

In [22]:
summarizer(text, truncation=True)[0]['summary_text']

'Gougell engineer put on leave after saying a i chatbud has become tensioned. Blake lemuan was tasked to test this new lamda system for biaus inherent discrimination and things like this. After bringing up his concerns to gougale management according to him he was quickly dismissed and therefore decided to go public.'
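
The summary inherits recognition errors from the speech-to-text step ('Gougell', 'chatbud'). One rough way to quantify how noisy the recognized text is relative to the subtitles is the word error rate (WER). The sketch below is a plain-Python version based on word-level edit distance. The function name is hypothetical and this check is not part of the original notebook. The subtitles may themselves be auto-generated, so they are only an approximate reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the texts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With the variables from this notebook, the call would be `word_error_rate(subtitles, text)`. The running time grows with the product of the two word counts, so on full transcripts of a few thousand words it can take a while in pure Python.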