[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb)
[GitHub](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb)
# Creating YouTube Captions with Wav2Vec

---

The Wav2Vec 2.0 model was introduced by Facebook AI in [this paper](https://arxiv.org/abs/2006.11477). Thanks to 🤗 Transformers, we can load it in seconds and build cool applications on top of it!

This notebook aims to serve as inspiration for just that. We will build a simple script that creates captions for YouTube videos! The notebook can be run on CPU. If you have any questions, feel free to raise an issue at the GitHub link above.

## Setup

---

In [1]:
!pip -q install transformers
!pip -q install youtube_dl


In [2]:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from IPython.display import Audio

import moviepy.editor as mp
import torch
import librosa
import os


## Get Clip

---

Choose your favorite clip from YouTube and paste in its link. Ideally pick a short clip, as longer ones take a while to download. Then choose the start and end seconds of the segment you'd like to caption. You can also give it a first run with the defaults 😊

In [5]:
# Substitute below YT link
clip = "https://www.youtube.com/watch?v=7Ood-IE7sx4"

# Substitute below for start/end seconds
start = 1
end = 4

In [4]:
# Download the clip as mp4 & rename it for usability
os.system('youtube-dl {} --recode-video mp4'.format(clip))
os.system('mv *.mp4 clip.mp4')

0
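
The two shell calls above interpolate the URL into a command string and rely on `mv *.mp4`, which misbehaves if more than one mp4 file is present. Below is a minimal sketch of an alternative. It assumes `youtube-dl` is on the PATH and that its `-o` output template, combined with `--recode-video mp4`, produces `clip.mp4`. The helper names `build_download_command` and `download_clip` are hypothetical and not part of the original notebook.

```python
import subprocess


def build_download_command(url, output_stem="clip"):
    """Return the youtube-dl argument list that saves `url` as <output_stem>.mp4."""
    return [
        "youtube-dl",
        url,
        "--recode-video", "mp4",         # same flag as in the cell above
        "-o", output_stem + ".%(ext)s",  # assumption: names the file up front, so no `mv` is needed
    ]


def download_clip(url, output_stem="clip"):
    """Run youtube-dl without a shell; raises CalledProcessError if it fails."""
    subprocess.run(build_download_command(url, output_stem), check=True)
    return output_stem + ".mp4"
```

Calling `download_clip(clip)` with the URL defined earlier would then stand in for both `os.system` lines. Passing the arguments as a list also means the URL is never parsed by a shell.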

## Model and tokenizer

---

Load the Wav2Vec model from 🤗 Transformers. See [here](https://huggingface.co/transformers/model_doc/wav2vec2.html) for the model's documentation.

In [6]:
# Load Wav2Vec from huggingface
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Extract Audio

---

First we'll extract the audio from the clip in mp3 format, as the Wav2Vec model expects audio input. We do this in 10-second subclips to save some memory later on.

In [7]:
clip = mp.VideoFileClip("clip.mp4")
end = min(clip.duration, end)

# Save the paths for later
clip_paths = []

# Extract Audio-only from mp4
for i in range(start, int(end), 10):
  sub_end = min(i+10, end)
  sub_clip = clip.subclip(i,sub_end)

  sub_clip.audio.write_audiofile("audio_" + str(i) + ".mp3")
  clip_paths.append("audio_" + str(i) + ".mp3")

[MoviePy] Writing audio in audio_1.mp3
[MoviePy] Done.

In [10]:
# Play Audio 
Audio(clip_paths[0])
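
The 10-second windowing used in the extraction loop above can be looked at in isolation. The following is a small sketch of that idea, not the notebook's own code. The helper name `chunk_bounds` is hypothetical, and unlike the `range(start, int(end), 10)` loop it also keeps a trailing window shorter than one second when `end` is fractional.

```python
def chunk_bounds(start, end, step=10):
    """Split [start, end) into consecutive windows of at most `step` seconds."""
    if step <= 0:
        raise ValueError("step must be positive")
    bounds = []
    t = start
    while t < end:
        # The last window is clipped so it never runs past `end`
        bounds.append((t, min(t + step, end)))
        t += step
    return bounds


print(chunk_bounds(1, 4))   # [(1, 4)]
print(chunk_bounds(0, 25))  # [(0, 10), (10, 20), (20, 25)]
```

With the default `start = 1` and `end = 4`, there is a single window, which matches the single `audio_1.mp3` file written above.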

## Transcribe Audio

---

The last step is turning the audio into text! The Wav2Vec model does most of the work for us here. We process the 10-second clips one by one to save memory.

In [13]:
cc = ""

for path in clip_paths:
    # Load the audio with the librosa library, resampled to 16 kHz
    input_audio, _ = librosa.load(path, sr=16000)

    # Tokenize the audio
    input_values = tokenizer(input_audio, return_tensors="pt", padding="longest").input_values

    # Feed it through Wav2Vec & choose the most probable tokens
    with torch.no_grad():
        logits = model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)

    # Decode & add to our caption string
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    cc += transcription + " "
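
To make the "choose the most probable tokens, then decode" step less of a black box, here is a minimal sketch of greedy CTC decoding in plain Python. It is an illustration of the general technique, not the library's implementation. The toy vocabulary and the name `greedy_ctc_decode` are hypothetical, though the use of `<pad>` as the CTC blank and `|` as the word separator is meant to mirror the convention of the checkpoint used here.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real one ships with the checkpoint
BLANK = "<pad>"
WORD_SEP = "|"
VOCAB = [BLANK, WORD_SEP, "H", "I", "G"]


def greedy_ctc_decode(ids, vocab=VOCAB, blank=BLANK, word_sep=WORD_SEP):
    """Collapse repeated ids, drop blanks, and turn the word separator into a space."""
    collapsed = [key for key, _ in groupby(ids)]  # merge runs of identical frames
    chars = [vocab[i] for i in collapsed if vocab[i] != blank]
    return "".join(chars).replace(word_sep, " ").strip()


# Per-frame argmax ids: H H <pad> I | | H I I G <pad> H
frame_ids = [2, 2, 0, 3, 1, 1, 2, 3, 3, 4, 0, 2]
print(greedy_ctc_decode(frame_ids))  # HI HIGH
```

The blank token is what lets the model emit the same letter twice in a row: `[4, 0, 4]` decodes to `GG`, while `[4, 4]` collapses to a single `G`.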

In [9]:
# Here's your caption!
# Note that there may be mistakes, especially if the audio is noisy or contains uncommon words
# If you picked the default video and changed start to 0, you will see that the model gets confused by the word "Anakin"
print(cc)

I HAVE THE HIGH GROUND 
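
If you want the result as an actual caption file rather than a printed string, one option is the SubRip (`.srt`) format. The sketch below is an assumption about how you might go about it, not something the notebook does. It expects a list of `(start_seconds, end_seconds, text)` tuples, which means the transcription loop would have to keep each chunk's text separately instead of concatenating everything into `cc`. The function names are hypothetical.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n"
        )
    # Entries are separated by a blank line
    return "\n".join(blocks)


print(to_srt([(1, 4, "I HAVE THE HIGH GROUND")]))
```

For the default clip this prints one entry running from `00:00:01,000` to `00:00:04,000`.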
