📌 YouTube Transcript RAG System Using Whisper + LangChain
Build an end-to-end Retrieval Augmented Generation (RAG) pipeline for YouTube videos using Whisper, FAISS, and LangChain.

📖 Project Overview

This project demonstrates how to convert any public YouTube video into a searchable knowledge system using a Retrieval Augmented Generation (RAG) pipeline.

You provide a YouTube URL, and the system:

Downloads the video's audio

Transcribes it using OpenAI Whisper

Chunks the text using LangChain

Creates embeddings using OpenAI

Stores them in a FAISS Vector DB

Retrieves relevant chunks based on a question

Feeds them into an LLM for final answer generation

This notebook is an end-to-end demonstration of a real RAG chatbot applied to YouTube content.

🧱 Architecture

YouTube URL
      │
      ▼
Download Audio with pytubefix
      │
      ▼
OpenAI Whisper → Transcript Text
      │
      ▼
Chunking (RecursiveCharacterTextSplitter)
      │
      ▼
OpenAI Embeddings → Vector DB (FAISS)
      │
      ▼
Retriever
      │
      ▼
Composable LangChain Pipeline
      │
      ▼
LLM Answer (ChatOpenAI)

🚀 1. Import Required Libraries

In [74]:
import os  # used by the download helper (os.path.exists)

from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from pytubefix import YouTube
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
load_dotenv()  # loads variables such as OPENAI_API_KEY from a local .env file

True

🎧 2. Download YouTube Audio

This function downloads the audio-only stream of a YouTube video and saves it as audio.mp4 by default.

In [75]:

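def download_youtube_audio(url, filename="audio.mp4"):
    # Hard-coded output folder from the original run; change it for your machine.
    save_path = r"c:\Users\HP\Desktop\sagar_handson"
    yt = YouTube(url)
    # Take the first audio-only stream that pytubefix lists for this video.
    stream = yt.streams.filter(only_audio=True).first()
    if stream is None:
        raise ValueError(f"No audio-only stream found for {url}")
    # Saves the audio into save_path (not necessarily the notebook's own folder).
    file_path = stream.download(output_path=save_path, filename=filename)
    print("Saved at:", file_path)
    print("Exists?", os.path.exists(file_path))

    return file_path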

In [76]:
audio_path = download_youtube_audio("https://www.youtube.com/watch?v=HFfXvfFe9F8&t=16s")

Saved at: c:\Users\HP\Desktop\sagar_handson\audio.mp4
Exists? True
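
The URL used above carries a `t=16s` start-time parameter, which should not be needed for downloading the audio. As a minimal sketch, a helper like the hypothetical `canonical_watch_url` below (not part of pytubefix) could strip everything except the `v` video ID before the URL is passed to `download_youtube_audio`. It assumes the usual `watch?v=...` URL form and uses only the standard library.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def canonical_watch_url(url: str) -> str:
    """Return the watch URL with only the `v` query parameter kept.

    Hypothetical helper for illustration; assumes a `watch?v=<id>` style URL.
    """
    parts = urlparse(url)
    video_ids = parse_qs(parts.query).get("v")
    if not video_ids:
        raise ValueError(f"No 'v' parameter found in {url!r}")
    clean_query = urlencode({"v": video_ids[0]})
    # Drop extra parameters (such as t=16s) and any fragment.
    return urlunparse(parts._replace(query=clean_query, fragment=""))


print(canonical_watch_url("https://www.youtube.com/watch?v=HFfXvfFe9F8&t=16s"))
```

Short links and other URL shapes are not handled here; the function raises `ValueError` when no `v` parameter is present.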


🤖 3. Transcribe Audio Using Whisper

In [77]:
client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Reuse audio_path returned by download_youtube_audio() above. A bare "audio.mp4"
# only resolves when the notebook's working directory is the download folder.

with open(audio_path, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # language="en",  # optional: force English if needed
    )

print(transcript.text)  # sanity check: confirm the transcription looks right

Hello all my name is Krushnayak and welcome to my YouTube channel. So guys yet another amazing video here we are going to create an end-to-end project using Google Gemini Pro and project name is related to YouTube videos transcriber. Now this is an amazing project our main aim will be that we will try to just give the video YouTube link YouTube video link and then it should be able to automatically extract all the text all the transcript text from that specific videos. Now before I go ahead and start implementing this I would like to give some important credits to Dipendra Verma so you can see that his post was there and here you can see like what all things he has specifically implemented and by seeing the tutorials right where I've created a lot of Gemini Google Gemini projects he has specifically used this and he has actually created this so I asked for the link so that you know I could have made a video for you all but again this entire project credit goes to Dipendra Verma and but
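
Before uploading, it may be worth checking the file size, since the hosted transcription endpoint rejects files above a size limit. The sketch below assumes a 25 MB limit; treat that number as an assumption and confirm it against the current API documentation. The helper name `fits_upload_limit` is hypothetical.

```python
import os

# Assumed upload limit for the hosted transcription endpoint; verify before relying on it.
ASSUMED_LIMIT_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path: str, limit_bytes: int = ASSUMED_LIMIT_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes
```

If the check fails for a long video, one option is to split the audio into shorter segments and transcribe them one at a time.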

✂️ 4. Chunk the Transcript for RAG

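Using a chunk size and overlap (measured in characters) that are a reasonable starting point for spoken YouTube content; tune them for your own videos:

chunk_size = 1000

chunk_overlap = 250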

In [78]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=250)
chunks = splitter.create_documents([transcript.text]) # needs to be in a list for splitting
len(chunks)

28
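
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as blank lines, newlines, and spaces); the function name `sliding_chunks` is hypothetical and exists only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split `text` into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, overlap=2))
```

In this sketch, a `chunk_size` of 1000 with an overlap of 250 means consecutive windows start 750 characters apart, so each sentence near a boundary appears in two chunks.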

🔎 5. Create Embeddings and Build Vector DB (FAISS)

In [79]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)

In [80]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002138666B110>, search_kwargs={'k': 4})
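
Conceptually, a similarity retriever with `k=4` embeds the question and returns the four stored chunks whose vectors are closest to it. The toy function below (hypothetical name `top_k_nearest`) shows that idea in plain Python with hand-made 2-D vectors and Euclidean distance. The real index works on high-dimensional embedding vectors and may use a different metric or search strategy.

```python
import math


def top_k_nearest(query, vectors, k):
    """Return indices of the k vectors closest to `query` (Euclidean distance)."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


# Toy "embeddings" for four chunks; real embeddings have many more dimensions.
chunk_vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.9, 0.1)]
question_vector = (1.0, 0.0)

nearest = top_k_nearest(question_vector, chunk_vectors, k=2)
print(nearest)
```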

📄 6. Prompt Template for RAG

In [81]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      {context}
      Question: {question}
    """,
    input_variables = ['context', 'question']
)
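
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate` (for a simple `{placeholder}` template like this one the resulting text should match), and `textwrap.dedent` to strip the leading indentation that a triple-quoted string otherwise carries into the prompt. The sample context and question are made up.

```python
import textwrap

# Same wording as the notebook's template, with the indentation removed.
TEMPLATE = textwrap.dedent(
    """
    You are a helpful assistant.
    Answer ONLY from the provided transcript context.
    If the context is insufficient, just say you don't know.

    {context}
    Question: {question}
    """
).strip()

rendered = TEMPLATE.format(
    context="(retrieved transcript chunks would go here)",
    question="What is the video about?",
)
print(rendered)
```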

🔗 7. Build the Parallel RAG Chain

In [82]:
def format_docs(retrieved_docs):
    # Join the retrieved chunks into one context string, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

In [83]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})
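
In plain terms, the parallel step hands the same input (the question string) to both branches and collects their results in a dict, which is the shape the prompt expects. The sketch below mimics that behaviour with ordinary functions; `Doc`, `fake_retriever`, and `run_parallel` are hypothetical stand-ins, not LangChain classes, and the branches here simply run one after the other.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def fake_retriever(question):
    # A real retriever would run a vector search; this returns fixed chunks.
    return [Doc("First chunk."), Doc("Second chunk.")]


def format_docs(retrieved_docs):
    return "\n\n".join(doc.page_content for doc in retrieved_docs)


def run_parallel(branches, value):
    """Call every branch with the same input and collect results by key."""
    return {key: branch(value) for key, branch in branches.items()}


inputs = run_parallel(
    {
        "context": lambda q: format_docs(fake_retriever(q)),
        "question": lambda q: q,  # plays the role of RunnablePassthrough
    },
    "What is covered?",
)
print(inputs)
```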

🧠 8. Build the Final RAG Pipeline

In [86]:
parser = StrOutputParser()
llm = ChatOpenAI(model="gpt-5-mini")
final_chain = parallel_chain | prompt | llm | parser
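
The `|` operator chains steps so that each step's output becomes the next step's input, left to right. A rough pure-Python analogue, using a hypothetical `pipe` helper and toy stages in place of the prompt, model, and parser, might look like this:

```python
from functools import reduce


def pipe(*steps):
    """Compose steps left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# Toy stages standing in for prompt -> llm -> parser.
def build_prompt(inputs):
    return f"{inputs['context']}\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # A real model call would go here; this only echoes the prompt length.
    return {"content": f"(answer based on {len(prompt_text)} chars)"}


def parse(message):
    return message["content"]


toy_chain = pipe(build_prompt, fake_llm, parse)
answer = toy_chain({"context": "Some transcript text.", "question": "Topic?"})
print(answer)
```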


🧪 9. Test RAG: Example Query

In [87]:
result = final_chain.invoke("summarize the video for me in 5 lines")
print(result)

- The speaker demonstrates a YouTube video-summarizer workflow that takes transcript text and generates concise, point-wise summaries or detailed notes (e.g., on data science and statistics) using configurable prompts.  
- They show how to create and modify the prompt (example: instructing the summarizer to produce a 200–250 word summary) and how to remove unnecessary fields like subject.  
- The tool extracts additional assets such as the video thumbnail and appears to run reliably across examples.  
- The presenter mentions exploring related tools and libraries (LlamaIndex and local LLMs on Linux/Mac) and references generative AI topics like machine learning and cloud platforms (e.g., AWS Bedrock).  
- They encourage viewers to try multiple prompts, share the video, and announce more related content coming soon.


🏁 Final Results

✔ End-to-end RAG pipeline
✔ YouTube → Audio → Whisper Transcript → Chunking → Embeddings → FAISS → Retriever → LLM
✔ Answers grounded in the retrieved transcript context
✔ Clean modular LangChain architecture
✔ Ready for Streamlit UI (future improvement)