<a href="https://colab.research.google.com/github/RQledotai/holocron-colab/blob/master/notebooks/whisper_to_synopsis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>
<div align="center">
  <h1>Whisper to Synopsis
  <br/>
  <img src="https://raw.githubusercontent.com/RQledotai/holocron-colab/master/img/whisper-synopsis.png" width="200"/>
  </h1>
</div>
</center>

---

## Table of Contents
- [Description](#description)
- [Initialization](#initialization)
  - [Installing Dependencies](#installing-dependencies)
  - [Installing Vosk Model](#installing-vosk-model)
- [Processing Video](#processing-video)
  - [Downloading the YouTube Video](#downloading-video)
  - [Extracting the Audio Track](#extracting-audio)
  - [Transcribing the Audio](#transcribing-audio)
  - [Summarizing the Transcribed Text](#summarizing-text)
- [Conclusion](#conclusion)

## Description <a name="description"></a>


This Google Colab notebook demonstrates an end-to-end process for converting MP4 videos (downloaded from YouTube) into concise summaries using the [Google AI API](https://ai.google.dev/).

The process involves the following steps:
1. **Audio Extraction**: Isolating the audio track from the MP4 video.
2. **Speech-to-Text Transcription**: Converting the extracted audio into text.
3. **Key Takeaway Summarization**: Using the Google AI API to analyze the transcribed text and generate a succinct summary of the video's key points.

This offers a streamlined way to quickly grasp the essence of video content, saving users valuable time and effort.

## Initialization <a name="initialization"></a>

Before processing the video, we need to initialize the runtime with the necessary Python libraries and artefacts.

### Installing Dependencies <a name="installing-dependencies"></a>

To download a video from [YouTube](https://www.youtube.com/), we need to install the [`pytubefix` library](https://github.com/JuanBindez/pytubefix). We selected this library because it addresses a known issue with the standard `pytube` library (see [bug #1894](https://github.com/pytube/pytube/issues/1894#issue-2180600881)).

In [1]:
%pip install pytubefix

Collecting pytubefix
  Downloading pytubefix-6.10.2-py3-none-any.whl.metadata (5.4 kB)
Downloading pytubefix-6.10.2-py3-none-any.whl (74 kB)
Installing collected packages: pytubefix
Successfully installed pytubefix-6.10.2


The [`moviepy` library](https://zulko.github.io/moviepy/) is a Python module for video editing, which can be used for basic operations (like cuts, concatenations, title insertions), video compositing (a.k.a. non-linear editing), video processing, or to create advanced effects. In this notebook, the library will be used to extract the audio from the video. The install is pinned below version 2 because the code relies on the `moviepy.editor` import path, which was removed in MoviePy 2.0.

In [2]:
%pip install "moviepy<2"



To transcribe the audio into text, we will use the following libraries:
* [`SpeechRecognition`](https://github.com/Uberi/speech_recognition): Interface to leverage different engines and APIs for performing speech recognition.
* [`vosk-api`](https://github.com/alphacep/vosk-api): Toolkit to perform offline speech recognition.

In [3]:
%pip install speechrecognition vosk

Collecting speechrecognition
  Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl.metadata (28 kB)
Collecting vosk
  Downloading vosk-0.3.45-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (1.8 kB)
Collecting srt (from vosk)
  Downloading srt-3.5.3.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting websockets (from vosk)
  Downloading websockets-12.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl (32.8 MB)
Downloading vosk-0.3.45-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (7.2 MB)
Downloading websockets-12.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x8

**Note**: `vosk-api` requires a speech recognition model to be available locally. The instructions to download and install a Vosk model are available in the [*Installing Vosk Model*](#installing-vosk-model) section.

The [Google AI Python SDK](https://github.com/google-gemini/generative-ai-python) is the easiest way for Python developers to build with the Gemini API. Gemini models are built from the ground up to be multimodal, so you can reason seamlessly across text, images, and code.

In [4]:
%pip install google-generativeai



### Installing Vosk Model <a name="installing-vosk-model"></a>

As mentioned earlier, the `vosk-api` relies on speech recognition models to be available locally. There are many speech recognition models that can be leveraged (see [Model list](https://alphacephei.com/vosk/models)). For this notebook, we will leverage the small model for English (i.e. `vosk-model-small-en-us-0.15`).

In [5]:
!wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip && unzip vosk-model-small-en-us-0.15.zip
!mv vosk-model-small-en-us-0.15 model

--2024-08-14 21:06:45--  https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
Resolving alphacephei.com (alphacephei.com)... 188.40.21.16, 2a01:4f8:13a:279f::2
Connecting to alphacephei.com (alphacephei.com)|188.40.21.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41205931 (39M) [application/zip]
Saving to: ‘vosk-model-small-en-us-0.15.zip’


2024-08-14 21:06:49 (12.7 MB/s) - ‘vosk-model-small-en-us-0.15.zip’ saved [41205931/41205931]

Archive:  vosk-model-small-en-us-0.15.zip
   creating: vosk-model-small-en-us-0.15/
   creating: vosk-model-small-en-us-0.15/am/
  inflating: vosk-model-small-en-us-0.15/am/final.mdl  
   creating: vosk-model-small-en-us-0.15/graph/
  inflating: vosk-model-small-en-us-0.15/graph/disambig_tid.int  
  inflating: vosk-model-small-en-us-0.15/graph/HCLr.fst  
  inflating: vosk-model-small-en-us-0.15/graph/Gr.fst  
   creating: vosk-model-small-en-us-0.15/graph/phones/
  inflating: vosk-model-small-en-us-0.15/graph/

After extracting the Vosk model, we can remove the downloaded zip file to free up disk space in our Colab environment.

In [6]:
!rm vosk-model-small-en-us-0.15.zip
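Before moving on, it can be worth confirming that the renamed `model` folder actually contains the pieces seen in the unzip output above. The following is a minimal sketch, not part of the original notebook: the helper name `missing_model_parts` is hypothetical, and checking only for the `am` and `graph` sub-directories is an assumption based on the archive listing, not an exhaustive validation of a Vosk model.

```python
from pathlib import Path


def missing_model_parts(model_dir="model", expected=("am", "graph")):
    """Return the expected sub-directories that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Check the folder created by the `mv` command above
missing = missing_model_parts("model")
if missing:
    print(f"Model folder is incomplete, missing: {missing}")
else:
    print("Model folder looks complete")
```

An empty list means both sub-directories were found; anything else suggests the download or the rename did not complete as expected.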

## Processing Video <a name="processing-video"></a>

### Downloading the YouTube Video <a name="downloading-video"></a>

To begin, we'll fetch a video from YouTube. For this example, we'll use the [*How Google Search Works* video](https://www.youtube.com/watch?v=0eKVizvYSUQ), which describes how Google Search works, including how Google’s software indexes the web, ranks sites, flags spam, and serves up results.

We'll use the `pytubefix` library to handle the download:

In [7]:
from pytubefix import YouTube

yt_object = YouTube('https://www.youtube.com/watch?v=0eKVizvYSUQ')
# print the title of the video
print(f'Title: {yt_object.title}')

Title: How Google Search Works (in 5 minutes)


The following code selects the highest-resolution progressive stream (360p for this video, as the output shows) and downloads it to the Colab environment.

In [9]:
yt_object_name = 'how-google-search-works'
# download the video stream
yt_object_high_res = yt_object.streams.get_highest_resolution()
print(f'Downloading: {yt_object_high_res}')
yt_object_high_res.download(filename=f'{yt_object_name}.mp4')

Downloading: <Stream: itag="18" mime_type="video/mp4" res="360p" fps="30fps" vcodec="avc1.42001E" acodec="mp4a.40.2" progressive="True" type="video">


'/content/how-google-search-works.mp4'
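The file name above is hard-coded. When processing several videos, one option is to derive it from the video title instead. The sketch below is illustrative only: `slugify` is a hypothetical helper, and the 60-character limit and the `video` fallback are arbitrary assumptions.

```python
import re


def slugify(title, max_length=60, fallback="video"):
    """Lower-case the title and collapse anything that is not a-z or 0-9 into '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Trim to the length limit, then drop any dash left dangling by the cut
    return slug[:max_length].rstrip("-") or fallback


print(slugify("How Google Search Works (in 5 minutes)"))
# how-google-search-works-in-5-minutes
```

With this helper, `yt_object_name = slugify(yt_object.title)` would replace the manual assignment, at the cost of a slightly longer file name than the one used in this notebook.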

### Extracting the Audio Track <a name="extracting-audio"></a>

Now that we have downloaded the video, let's extract the audio content using the `moviepy` library. The following code loads the video, extracts the audio track, and saves it as an audio file within the Colab environment.

In [10]:
import moviepy.editor as mpe

video = mpe.VideoFileClip(f'/content/{yt_object_name}.mp4')
video.audio.write_audiofile(f'/content/{yt_object_name}.wav')
video.close()

MoviePy - Writing audio in /content/how-google-search-works.wav




MoviePy - Done.
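A quick way to sanity-check the extracted audio is to read its header with Python's standard `wave` module. This is a sketch under one assumption: the file is uncompressed PCM, which is the only encoding `wave` can read. The helper names are hypothetical, and the silent clip is generated in memory purely so that the example runs without the real file.

```python
import io
import wave


def silent_wav_bytes(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit PCM WAV containing only silence (demo data)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def wav_summary(source):
    """Read basic properties from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


demo = wav_summary(io.BytesIO(silent_wav_bytes(seconds=2)))
print(demo)
```

In the notebook, calling `wav_summary(f'/content/{yt_object_name}.wav')` should report a duration close to the length of the video.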


### Transcribing the Audio <a name="transcribing-audio"></a>

With the audio file ready, we'll transcribe it into text using the `SpeechRecognition` library and the Vosk API. The following code initializes a speech recognizer, loads the audio file, and performs speech-to-text transcription using Vosk.

In [11]:
import speech_recognition as sr

# initialize the recognizer engine
recognizer_engine = sr.Recognizer()

# load the audio file to be processed
with sr.AudioFile(f'/content/{yt_object_name}.wav') as audio_file:
  audio_track = recognizer_engine.record(audio_file)
  audio_output = recognizer_engine.recognize_vosk(audio_track)

**Note**: One limitation of the small model for English (i.e. `vosk-model-small-en-us-0.15`) is that it doesn't distinguish between different speakers.

Once the speech-to-text transcription has been performed, we can print the recognized text:

In [12]:
import json

# print the recognized text
recognized_text = json.loads(audio_output)['text']
print(recognized_text)

everyday billions of people come here with questions about all kinds of things sometimes we even get questions about google search itself like how this whole thing actually works and while this is a subject entire books have been written about there's a good chance you're in the market for something a little more concise so let's say it's getting close to dinner and you want a recipe for lasagna you've probably seen this before the let's go a little deeper since the beginning back when the home page looked like this google has been continuously mapping the web hundreds of billions of pages to create something called an index think of it as the giant library we look through whenever you do a search for lasagna or anything else now the worthless on yeah shows up a lot on the web pages about the history of lasagna articles by scientists whose last name happened to be lasagna stuff other people might be looking for but if you're hungry randomly clicking through millions of links is no fun 
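The `['text']` lookup above assumes the recognizer returned valid JSON with a non-empty transcript. A slightly more defensive version could look like the sketch below. `extract_text` is a hypothetical helper, not part of `SpeechRecognition` or Vosk, and treating an empty transcript as an error is a design assumption.

```python
import json


def extract_text(raw_result):
    """Return the 'text' field of a Vosk-style JSON result, or raise ValueError."""
    try:
        payload = json.loads(raw_result)
    except (TypeError, json.JSONDecodeError) as error:
        raise ValueError(f"Recognizer did not return JSON: {raw_result!r}") from error
    # Only a JSON object can carry a 'text' field
    text = payload.get("text", "").strip() if isinstance(payload, dict) else ""
    if not text:
        raise ValueError("Recognizer returned no text")
    return text


print(extract_text('{"text": "everyday billions of people come here"}'))
```

Replacing the direct lookup with `recognized_text = extract_text(audio_output)` surfaces a readable error if the recognizer returns something other than the expected JSON.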

### Summarizing the Transcribed Text <a name="summarizing-text"></a>

To summarize the transcribed text using the Gemini API, we first need to set up authentication. This requires storing your API key from [Google AI Studio](https://aistudio.google.com/app/apikey) as a Colab secret named `GOOGLE_API_KEY`.

In [13]:
import google.generativeai as genai
from google.colab import userdata

genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))
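Outside Colab, `google.colab.userdata` is not available, so the key has to come from somewhere else. One possible approach, sketched below, is to fall back to an environment variable. The helper `resolve_api_key` and its parameters are hypothetical, and reusing the name `GOOGLE_API_KEY` for the environment variable is an assumption.

```python
import os


def resolve_api_key(secret_lookup=None, env=None, name="GOOGLE_API_KEY"):
    """Try a secrets lookup first, then fall back to an environment variable."""
    env = os.environ if env is None else env
    if secret_lookup is not None:
        try:
            value = secret_lookup(name)
        except Exception:  # the secret store may be unavailable or missing the key
            value = None
        if value:
            return value
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set as a secret or environment variable")
    return value


# Demo with an explicit mapping instead of the real environment
print(resolve_api_key(env={"GOOGLE_API_KEY": "demo-key"}))
```

In Colab, the call would be `resolve_api_key(secret_lookup=userdata.get)`, with the result passed to `genai.configure(api_key=...)`.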

A *system prompt* in generative AI is a set of instructions provided to a Large Language Model (LLM) before any user input, designed to guide the model's behavior and responses.

To summarize the content of the video, we will use the following system prompt:

In [14]:
system_prompt = """
# Objective
You are an AI assistant specialized in marketing campaign. Your challenge is to write engaging content based on the transcript of a podcast.

# Output
The output should include the following:
* Tagline: The tagline should encapsulate the essence of the topics discussed in the transcribed text.
* Summary: Summary of text should be 500-1000 words that uses informative, concise and relevant language.

# Compliance
Write your response in Markdown.
Do *not* include any information that is not found in the user input.
"""

Now, let's use the Gemini API to generate the summary:

In [18]:
model = genai.GenerativeModel(
    model_name='models/gemini-1.5-flash',
    system_instruction=system_prompt
)
response = model.generate_content(recognized_text)

Finally, let's print the summary generated by the Gemini API:

In [17]:
print(response.text)

## Tagline: 
**Unveiling the Magic Behind Google Search: From Lasagna to the Latest News.**

## Summary: 
Every day, billions of people use Google Search to find answers to all sorts of questions, even about search itself.  While there are entire books dedicated to explaining how Google works, this summary offers a concise overview. 

Imagine you're searching for a lasagna recipe. Google has a massive "index," essentially a library of hundreds of billions of web pages. This index is constantly being updated, mapping the entire web to ensure it captures the latest information. 

However, simply throwing a vast amount of information at you isn't helpful.  That's where Google's ranking algorithms come into play.  These algorithms analyze your search query to understand what you're looking for, even if your wording is slightly off or your spelling isn't perfect.  Then, they sift through millions of potential matches within the index and prioritize the most relevant results at the top of th
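Since the system prompt asks for a summary of 500-1000 words, it can be useful to check whether a response respects that range. The sketch below counts whitespace-separated words, which is only an approximation because Markdown markers and headings are counted too. Both helper names are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def within_word_budget(text, low=500, high=1000):
    """Return True when the word count falls inside the inclusive range."""
    return low <= word_count(text) <= high


sample = "Every day, billions of people use Google Search."
print(word_count(sample), within_word_budget(sample))
```

Applied to the notebook, `within_word_budget(response.text)` gives a rough signal of whether the model followed the length instruction.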

## Conclusion <a name="conclusion"></a>

In this notebook, we explored a pipeline for processing MP4 videos, from audio extraction and transcription to text summarization using the Google Generative AI API.

We leveraged the following libraries:
* `pytubefix` to download a YouTube video
* `moviepy` to extract the audio from the video
* `speechrecognition` and `vosk` to transcribe the audio into text
* `google-generativeai` to summarize the content of the video

While this demonstration offers a basic framework, there's ample room for customization and enhancement. Potential next steps include:
* **Improving Transcription Accuracy**: Experiment with different Vosk models or explore cloud-based speech recognition services for potentially better accuracy.
* **Fine-tuning Summarization**: Adjust parameters within the Gemini API call to tailor the summarization to your specific needs (e.g., length, focus).
* **Adding Speaker Identification**: If distinguishing between speakers is important, investigate libraries or services that offer speaker diarization capabilities.
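As an illustration of the summarization tuning mentioned above, generation settings can be collected in a plain dictionary and passed to the API call. This is a sketch under stated assumptions: `build_generation_config` is a hypothetical helper, the default values are arbitrary, and the 0 to 2 temperature range is assumed rather than taken from this notebook.

```python
def build_generation_config(temperature=0.3, max_output_tokens=2048):
    """Validate and assemble generation settings as a plain dictionary."""
    if not 0.0 <= temperature <= 2.0:  # assumed valid range
        raise ValueError("temperature must be between 0.0 and 2.0")
    if max_output_tokens <= 0:
        raise ValueError("max_output_tokens must be positive")
    return {"temperature": temperature, "max_output_tokens": max_output_tokens}


config = build_generation_config(temperature=0.2, max_output_tokens=1024)
print(config)

# In the notebook, the settings would be passed alongside the transcript:
# response = model.generate_content(recognized_text, generation_config=config)
```

A lower temperature tends to make the output more deterministic, while `max_output_tokens` caps the length of the response.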

By building upon this foundation, you can create powerful tools for extracting key insights from video content efficiently.