<a href="https://colab.research.google.com/github/CDAC-lab/isie2023/blob/main/tutorial-notebook-3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exposing an AI-based Question Answering Assistant on top of media content

# Overview

This notebook demonstrates an end-to-end pipeline that uses OpenAI's Whisper, ChromaDB, and Langchain to answer natural-language questions about a YouTube video. Starting from a video URL, we extract the audio and transcribe it with Whisper, OpenAI's automatic speech recognition (ASR) model. The transcript is then split into clips, embedded with OpenAI's embedding model, and stored in ChromaDB, a vector database, which makes the unstructured text searchable by meaning. Finally, Langchain connects the vector store to a language model so that users can ask questions about the video and see which clip segments each answer came from.

## Table of Contents

1. [Introduction and Setting Up](#section1)
    - Introduction to the Notebook
    - Installing Necessary Libraries
    - Importing Libraries and Dependencies
2. [Data Acquisition](#section2)
    - Getting Video Data from YouTube
    - Extracting Audio from YouTube Video
3. [Transcription using Whisper](#section3)
    - Introduction to Whisper
    - Transcribing Audio to Text
4. [Vectorization using ChromaDB](#section4)
    - Introduction to ChromaDB
    - Preprocessing Text for Vectorization
    - Vectorizing Text Data
5. [Querying with Langchain](#section5)
    - Introduction to Langchain
    - Setting Up Langchain for Querying
    - Formulating and Executing Queries
6. [Analysis and Visualization](#section6)
    - Analyzing Query Results
    - Visualizing Query Results
7. [Conclusion and Possible Extensions](#section7)
    - Summary of Achievements
    - Potential Future Work
8. [References and Additional Resources](#section8)




# Introduction and Setting Up

## Introduction to the Notebook
Welcome to our notebook! This project aims to create an end-to-end pipeline to extract, process, vectorize, and query the content of YouTube videos. By using state-of-the-art tools like Whisper, ChromaDB, and Langchain, we aim to transform unstructured video content into a structured and easily searchable form.

## Installing Necessary Libraries
In this section, we'll guide you through the installation process for all the necessary libraries that we'll use throughout this notebook. This includes OpenAI's Whisper for speech recognition, ChromaDB for vectorization, and Langchain for natural language querying.

## Importing Libraries and Dependencies
Here, we will import all the required Python libraries and dependencies that we'll be using in our notebook. This includes standard libraries for data handling and manipulation, as well as libraries specific to our pipeline such as the API wrappers for Whisper, ChromaDB, and Langchain.

## Install libraries

In [None]:
!pip -qqq install git+https://github.com/openai/whisper.git
!pip -qqq install pytube
!pip install langchain
!pip install chromadb
!pip install openai

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.204-py3-none-any.whl (1.1 MB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

## Import libraries

In [None]:
# Libraries for Google Drive authentication
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

import whisper
import torch
import os
from pytube import YouTube
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
import pandas as pd

In [None]:
# Run on the GPU when one is available; transcription on the CPU is much slower
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the "large" Whisper model (a download of about 2.87 GB on first use)
whisper_model = whisper.load_model("large", device=device)

100%|██████████████████████████████████████| 2.87G/2.87G [00:12<00:00, 246MiB/s]


# Data Acquisition

## Getting Video Data from YouTube
In this section, we'll discuss how to input a YouTube video URL and use it to extract the video data. This involves using a YouTube data extraction library to access and download the video.

## Extracting Audio from YouTube Video
After obtaining the video, the next step is to extract the audio which will be transcribed into text. We'll discuss the method used to perform this extraction and the format in which the audio data is saved.
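
The lecture title used as the file name later in this notebook contains a colon and spaces, which some file systems and shell commands handle poorly. The following is a minimal sketch of a sanitizing helper. `safe_filename` is a hypothetical name (it is not part of pytube), and the set of characters it replaces is an assumption about what is worth avoiding.

```python
import re

def safe_filename(title: str, replacement: str = "_") -> str:
    """Return a file-system-friendly version of a video title (hypothetical helper)."""
    # Replace characters that are invalid or awkward on common file systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    # Collapse runs of whitespace and replacement characters into one replacement.
    cleaned = re.sub(r"[\s" + re.escape(replacement) + r"]+", replacement, cleaned)
    return cleaned.strip(replacement)

print(safe_filename("MIT 6.S191: Deep Generative Modeling"))
```

The result could be passed as `final_filename` when saving the audio, provided the same name is used when the file is opened for transcription.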

# Transcription using Whisper

## Introduction to Whisper
Whisper is OpenAI's automatic speech recognition (ASR) system. In this section, we'll provide a brief introduction to Whisper and explain how it is used to transcribe the audio from our YouTube video.

## Transcribing Audio to Text
Here, we'll walk you through the process of transcribing the extracted audio into text using Whisper. The model runs locally in the notebook rather than through a remote API: we pass the audio file to the loaded model and receive the transcript, along with timestamped segments, in return.

## Extract the audio from youtube video

In [None]:
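def extract_and_save_audio(video_URL, destination, final_filename):
  video = YouTube(video_URL)  # look up the video
  audio = video.streams.filter(only_audio=True).first()  # pick the first audio-only stream
  output = audio.download(output_path=destination)  # download the audio for transcription
  # Renaming only changes the extension; the audio is not re-encoded to MP3.
  # Whisper decodes the file with ffmpeg, which typically detects the format from the file contents.
  new_file = os.path.join(destination, final_filename + '.mp3')
  os.rename(output, new_file)
  return new_file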

In [None]:
# Video to audio
video_URL = 'https://www.youtube.com/watch?v=3G5hWM6jqPk'
destination = "."
final_filename = "MIT 6.S191: Deep Generative Modeling"
extract_and_save_audio(video_URL, destination, final_filename)

## Transcribe

In [None]:
# run the whisper model
audio_file = "MIT 6.S191: Deep Generative Modeling.mp3"
result = whisper_model.transcribe(audio_file)

This is a one-hour lecture, and transcription takes around 10 minutes. To save time, a pre-transcribed version is downloaded from a shared Google Drive location and loaded instead.
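
The dictionary returned by Whisper's `transcribe` contains a `"segments"` list in which each entry carries `start`, `end`, and `text`. The sketch below shows one way such a result could be saved as a CSV file. It assumes the pre-transcribed file uses the column names `start`, `end`, and `text` (the columns read by the chunking code in this notebook); `save_segments` is a hypothetical helper, and the sample dictionary is a made-up stand-in for real Whisper output.

```python
import csv
import os
import tempfile

def save_segments(result: dict, csv_path: str) -> int:
    """Write the start, end and text of each segment to a CSV file; return the row count."""
    fields = ["start", "end", "text"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for segment in result["segments"]:
            # Keep only the three fields the chunking step relies on.
            writer.writerow({name: segment[name] for name in fields})
    return len(result["segments"])

# Stand-in for the dictionary returned by whisper_model.transcribe (values are made up).
sample_result = {
    "text": " Hello everyone. Welcome back.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Hello everyone."},
        {"id": 1, "start": 2.5, "end": 4.0, "text": " Welcome back."},
    ],
}
csv_path = os.path.join(tempfile.gettempdir(), "sample_transcription.csv")
rows_written = save_segments(sample_result, csv_path)
print(rows_written)
```

Saving the segments once avoids repeating the slow transcription step every time the notebook is run.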

In [None]:
# Authenticate with your Google Drive credentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# File ID of the pre-transcribed data set; this downloads the file from the shared location
transcription_id = '14Cqyn3ND_X8PUNkIujYr0qUpJ6xdj8d6'
transcription_data = drive.CreateFile({'id':transcription_id})
transcription_data.GetContentFile('transcription.csv')

## Chunk Clips

In [None]:
transcription = pd.read_csv('transcription.csv')


def chunk_clips(transcription, clip_size):
  """Group every `clip_size` transcript rows into one clip and label it with its time range."""
  texts = []
  sources = []
  for i in range(0, len(transcription), clip_size):
    clip_df = transcription.iloc[i:i+clip_size, :]
    text = " ".join(clip_df['text'].to_list())
    # 'start' and 'end' are in seconds; the label reports minutes rounded to two decimals
    source = str(round(clip_df.iloc[0]['start']/60, 2)) + " - " + str(round(clip_df.iloc[-1]['end']/60, 2)) + " min"
    print(text)
    print(source)
    texts.append(text)
    sources.append(source)

  return [texts, sources]



In [None]:
chunks = chunk_clips(transcription, 50)
documents = chunks[0]
sources = chunks[1]

 I'm really, really excited about this lecture because as Alexander introduced  yesterday, right now we're in this tremendous age of generative AI. And  today we're going to learn the foundations of deep generative modeling,  where we're going to talk about building systems that can not only look for  patterns in data, but can actually go a step beyond this to generate brand new  data instances based on those learned patterns. This is an incredibly complex  and powerful idea, and as I mentioned it's a particular subset of deep  learning that has actually really exploded in the past couple of years and  this year in particular. So to start and to demonstrate how powerful these  algorithms are, let me show you these three different faces. I want you to  take a minute, think. Think about which face you think is real. Raise your hand  if you think it's face A. Okay, I see a couple of people. Face B. Many more people.  Face C. About second place. Well the truth is that all of you are wrong.

# Vectorization using ChromaDB

## Introduction to ChromaDB
ChromaDB is an open-source vector database. It stores an embedding (a numeric vector) for each chunk of text and retrieves the chunks whose embeddings are most similar to that of a query. In this section, we'll explain what ChromaDB is and why it's useful in our pipeline.

## Preprocessing Text for Vectorization
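Before we can vectorize our text data, it may need to be preprocessed. In this notebook the preprocessing step is chunking: transcript segments are grouped into clips, each labelled with its time range, so that every stored vector corresponds to a passage short enough to fit in a prompt. Other pipelines may also need steps such as normalization.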
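
To make the chunking step concrete, here is a toy, pure-Python sketch of the same grouping logic used by `chunk_clips`, without pandas. `group_segments` is a hypothetical name, the three segments are made up, and stripping the leading space from each segment is an addition that the notebook's own function does not perform.

```python
def group_segments(segments, clip_size):
    """Group consecutive segments into clips of `clip_size` and label each with its time range."""
    texts, sources = [], []
    for i in range(0, len(segments), clip_size):
        clip = segments[i:i + clip_size]
        texts.append(" ".join(s["text"].strip() for s in clip))
        # Times are stored in seconds; the label reports minutes rounded to two decimals.
        start_min = round(clip[0]["start"] / 60, 2)
        end_min = round(clip[-1]["end"] / 60, 2)
        sources.append(f"{start_min} - {end_min} min")
    return texts, sources

# Made-up segments for illustration only.
segments = [
    {"start": 0.0, "end": 30.0, "text": " First sentence."},
    {"start": 30.0, "end": 90.0, "text": " Second sentence."},
    {"start": 90.0, "end": 150.0, "text": " Third sentence."},
]
texts, sources = group_segments(segments, clip_size=2)
print(sources)
```

A larger `clip_size` gives the language model more context per retrieved clip but makes the time range reported as the source less precise.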

## Vectorizing Text Data
Once our text data is preprocessed, it's time to vectorize it. The embeddings are computed by OpenAI's embedding model, and ChromaDB stores them together with each clip's time range as metadata so that they can be retrieved later.
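
Conceptually, a vector store keeps one embedding per clip and, at query time, returns the clips whose embeddings are closest to the query embedding. The toy sketch below illustrates the idea with hand-made three-dimensional vectors and cosine similarity. It is not how ChromaDB is implemented: real embeddings have many more dimensions, a real vector database typically uses an approximate nearest-neighbour index, and its distance metric may differ. `top_k`, the vectors, and the time labels are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the k (source, score) pairs most similar to query_vec, best first."""
    scored = [(source, cosine_similarity(query_vec, vec)) for source, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings keyed by a made-up clip label.
stored = {
    "0.0 - 5.0 min": [0.9, 0.1, 0.0],
    "5.0 - 10.0 min": [0.1, 0.9, 0.1],
    "10.0 - 15.0 min": [0.7, 0.2, 0.1],
}
best = top_k([1.0, 0.0, 0.0], stored, k=2)
print([source for source, _ in best])
```

The `k` argument plays the same role as the `'k': 2` search setting on the retriever in this notebook: it caps how many clips are handed to the language model.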

## Process text and store in VectorDB

In [None]:
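from getpass import getpass

# Never hard-code an API key in a notebook; read it from the environment or prompt for it
if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
# Vector store with metadata: each clip's time range is stored as its source
vStore = Chroma.from_texts(documents, embeddings, metadatas=[{"source": s} for s in sources])
# Chosen model name (not passed to the chain below, which uses the default OpenAI() model)
model_name = "gpt-3.5-turbo"

# Retrieve the two most similar clips for each query
retriever = vStore.as_retriever()
retriever.search_kwargs = {'k':2}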

In [None]:
model = RetrievalQAWithSourcesChain.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)


# Querying with Langchain

## Introduction to Langchain
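Langchain ties the retriever and the language model together: it fetches the clips most relevant to a question, passes them to the model, and returns an answer along with the sources it drew on. In this section, we'll provide an introduction to Langchain and explain how it fits into our pipeline.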

## Setting Up Langchain for Querying
Before we can start querying, we need to set up Langchain. This section will guide you through the process of setting up Langchain to work with our vectorized data.

## Formulating and Executing Queries
With Langchain set up, we can now formulate and execute queries on our data. We'll walk you through the process of creating a query, sending it to Langchain, and interpreting the results.
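
With the `"stuff"` chain type, the retrieved clips are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The prompt wording is illustrative only and is not Langchain's actual template; `build_stuff_prompt` is a hypothetical helper, and the second clip is made up.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate (text, source) chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for text, source in chunks
    )
    return (
        "Answer the question using only the extracts below and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = [
    ("Today we're going to learn the foundations of deep generative modeling.", "0.0 - 4.95 min"),
    ("A made-up closing remark used only for illustration.", "57.95 - 59.83 min"),
]
prompt = build_stuff_prompt("What is this video about?", chunks)
print(prompt)
```

Because everything goes into one prompt, the number and size of the retrieved clips must stay within the model's context window, which is one reason to keep `k` small.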

## Q&A

In [None]:
query = "What is this video about?"
response = model({"question":query}, return_only_outputs=True)
print('Answer :',response['answer'])
print('Referred clip segments :',response['sources'])

Answer :  This video is about deep generative models, specifically latent variable models, autoencoders, variational autoencoders, generative adversarial networks and diffusion models. It also shows an example of CycleGAN which is used to synthesize Obama's voice from Alexander's voice.

Referred clip segments : 53.18 - 57.95 min


In [None]:
query = "What is a generative ai?"
response = model({"question":query}, return_only_outputs=True)
print('Answer :',response['answer'])
print('Referred clip segments :',response['sources'])

Answer :  Generative AI is a subset of deep learning that is used to generate new data instances based on patterns found in existing data.

Referred clip segments : 0.0 - 4.95 min, 57.95 - 59.83 min
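
The `sources` string lists clip ranges in decimal minutes, so it can be turned into links that open the video at the start of each referenced clip. The sketch below assumes the label format `<start> - <end> min` produced by the chunking step, a video URL that already contains a query string, and YouTube's `t` parameter expressed in seconds. `parse_sources` and `source_links` are hypothetical helpers.

```python
import re

def parse_sources(sources: str):
    """Parse a string such as '0.0 - 4.95 min, 57.95 - 59.83 min' into (start, end) pairs."""
    pattern = r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*min"
    return [(float(start), float(end)) for start, end in re.findall(pattern, sources)]

def source_links(video_url: str, sources: str):
    """Build one link per clip that starts playback at the beginning of the clip."""
    links = []
    for start_min, _ in parse_sources(sources):
        seconds = round(start_min * 60)  # labels are in decimal minutes
        links.append(f"{video_url}&t={seconds}s")
    return links

links = source_links(
    "https://www.youtube.com/watch?v=3G5hWM6jqPk",
    "0.0 - 4.95 min, 57.95 - 59.83 min",
)
print(links)
```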
