# 🎥 InsightTube - YouTube Transcript Q&A Bot using RAG

## 📌 Project Overview

**InsightTube** is an intelligent Q&A system built on top of YouTube videos. The project takes a YouTube **Video ID** as input, extracts the video's transcript (if available), and then builds a **Retrieval-Augmented Generation (RAG)** pipeline to enable natural language question answering on the video content.

This system is particularly useful for:
- Summarizing long-form videos
- Extracting insights from educational content
- Creating a conversational Q&A agent for interactive learning

---

## ⚙️ Features

- Automatic transcript extraction from YouTube videos
- Chunking and embedding of transcript text
- Vector store indexing for efficient retrieval
- LLM-powered RAG for natural language answers

---

## 🧱 Components

1. **Transcript Extraction**  
   Extract the transcript using the YouTube API or a library such as `youtube-transcript-api`.

2. **Text Chunking & Embedding**  
   Break the transcript into chunks and generate vector embeddings using open-source embedding models.

3. **Vector Store Indexing**  
   Store the embeddings in a vector store such as FAISS (Facebook AI Similarity Search) for fast retrieval.

4. **RAG Pipeline**  
   Retrieve relevant chunks and pass them along with user questions to a language model for answer generation.

---


In [None]:
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import PromptTemplate
# HuggingFaceEmbeddings is maintained in the separate langchain-huggingface package
from langchain_huggingface import HuggingFaceEmbeddings
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser



In [2]:
# Requires a running local Ollama server with the llama3.2 model pulled
model = OllamaLLM(model='llama3.2')

In [3]:
def get_transcript(video_id):
    """Return the English transcript of a video as a single string.

    video_id is the bare ID (e.g. "Gfr50f6ZBvo"), not the full URL.
    """
    try:
        # Request the English transcript; each entry is a dict with a "text" field
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])

        # Flatten the entries into plain text
        transcript = " ".join(chunk["text"] for chunk in transcript_list)
        return transcript

    except TranscriptsDisabled:
        # Captions are turned off for this video
        print("No captions available for this video.")
        return "No Transcript found"

transcript = get_transcript('E5RjzSK0fvY')
print(transcript)


A linear regression is one
of the easiest algorithm in machine learning. It is a statistical model that attempts to
show the relationship between two variables. with the linear equation. Hello everyone. This is Atul from Edureka
and in today's session will learn about
linear regression algorithm. But before we drill down
to linear regression algorithm in depth, I'll give you a quick overview
of today's agenda. So we'll start a session with a quick overview
of what is regression as linear regression
is one of a type of regression algorithm. Once we learn about regression, it's use case the various
types of it next. We'll learn about the algorithm
from scratch where I'll teach you it's mathematical
implementation first, then we'll drill down
to the coding part and Implement linear
regression using python in today's session will deal with linear regression algorithm
using least Square method check its goodness of fit or how close the data is to the fitted regression line
using the R squar
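
`get_transcript` expects the bare video ID rather than a full URL. The following is a minimal sketch of a helper that pulls the ID out of a few common URL shapes. The name `extract_video_id` is hypothetical (it is not part of `youtube-transcript-api`), and the set of URL patterns handled here is an assumption that may not cover every link format.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id):
    """Return the video ID from a YouTube URL.

    Hypothetical helper: input without a URL scheme is assumed to already
    be a bare ID and is returned unchanged.
    """
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        return url_or_id

    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/")

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v", [])
            if ids:
                return ids[0]
        # Embed and Shorts links: /embed/<id>, /shorts/<id>
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0]

    raise ValueError(f"Could not find a video ID in: {url_or_id!r}")


print(extract_video_id("https://www.youtube.com/watch?v=E5RjzSK0fvY&t=10s"))
```

The result can then be passed straight to `get_transcript`.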

In [4]:
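# Split the transcript into chunks of up to 1500 characters with a 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

# Embed each chunk with a small sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")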

vector_store = FAISS.from_documents(chunks, embeddings)
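
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); a fixed-width sliding window is swapped in here purely for illustration, and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=200):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.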




In [5]:
# Show the mapping from index positions to the document IDs generated for the stored chunks
vector_store.index_to_docstore_id

{0: 'bef3dc93-962e-4d2c-b278-cf77c13e5473',
 1: '1707a7d0-dc41-4098-a698-8376d3ebc1f0',
 2: '969710c4-4292-4e72-875d-fc28d93450f0',
 3: 'c727152b-2b8a-49f8-b52d-05cbfe7af230',
 4: 'c9542e29-79ae-4ff5-a919-96257b0b1f09',
 5: '7f40c211-33ec-4df1-b53a-75e04aac0c15',
 6: '89c3f642-7534-47e7-8b96-dfa75d4fc0e4',
 7: 'e1339c92-d55e-4af2-858d-ea8486b16245',
 8: '53996d31-b40d-40a7-8de7-09b0b30ee35f',
 9: '9276c61b-3178-42a1-b000-1027eb7377c4',
 10: '28c909a7-824e-4736-b6fb-867e9ad2fa55',
 11: '952c5125-5dab-497a-bb4a-eb88f8273108',
 12: '75158d93-f594-4cdf-a594-17068c27d0f5',
 13: '5222f233-ee3d-454d-969e-e6a111d6d2fb',
 14: 'e38a3ade-2ecc-4754-a08d-2649f4bcfde7',
 15: 'd9b39f25-70e3-4cd4-b1f8-ab6d79f0f5b4',
 16: '7ef3a4f1-5428-4f40-bd19-747057b083a8',
 17: 'df211fe0-0a98-4950-999b-97aa23caf81d',
 18: '372d79c2-63ad-4065-9ace-dc36febe93ee',
 19: 'c46d5b98-ff3f-4d6b-8401-b2ca4a9c1ee5',
 20: '56138f23-ad1a-4848-a6fe-74a5e66930cd'}

In [7]:
# Access a specific document in the vector store by its ID
vector_store.get_by_ids(['56138f23-ad1a-4848-a6fe-74a5e66930cd'])

[Document(id='56138f23-ad1a-4848-a6fe-74a5e66930cd', metadata={}, page_content="when you execute it, you will get the value\nof R square as 0.63 which is pretty Good now that you have implemented\nsimple linear regression model using least Square method. Let's move on and see\nhow will you implement the model using machine learning\nlibrary call scikit-learn. All right. So this scikit-learn is a simple\nmachine learning library in Python welding machine\nlearning model are very easy using scikit-learn. So suppose there's\na python code. So using the scikit-learn\nlibraries your code shortens to this length like so let's execute\nthe Run button and see you will get the same are\nto score as Well, this was all for today's\ndiscussion in case you have any doubt. Feel free to add your query\nto the comment section. Thank you. I hope you have enjoyed\nlistening to this video. Please be kind enough to like it and you can comment any\nof your doubts and queries and we will reply them at the e

In [8]:
# Create a retriever that queries the vector store and returns the 2 most similar chunks
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 2})

In [9]:
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001FFD8C97500>, search_kwargs={'k': 2})

In [10]:
# Test the retriever: it returns the top-k chunks for the given query
retriever.invoke('What is Linear Regression')

[Document(id='bef3dc93-962e-4d2c-b278-cf77c13e5473', metadata={}, page_content="A linear regression is one\nof the easiest algorithm in machine learning. It is a statistical model that attempts to\nshow the relationship between two variables. with the linear equation. Hello everyone. This is Atul from Edureka\nand in today's session will learn about\nlinear regression algorithm. But before we drill down\nto linear regression algorithm in depth, I'll give you a quick overview\nof today's agenda. So we'll start a session with a quick overview\nof what is regression as linear regression\nis one of a type of regression algorithm. Once we learn about regression, it's use case the various\ntypes of it next. We'll learn about the algorithm\nfrom scratch where I'll teach you it's mathematical\nimplementation first, then we'll drill down\nto the coding part and Implement linear\nregression using python in today's session will deal with linear regression algorithm\nusing least Square method chec
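
The idea behind top-k similarity retrieval can be sketched without any external libraries. The toy example below uses word-count vectors and cosine similarity as a stand-in for the real embedding model and the FAISS index; the function names and sample chunks are hypothetical, and the distance metric used by the actual vector store may differ.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


sample_chunks = [
    "linear regression fits a line to data",
    "please like and subscribe to the channel",
    "least squares minimizes the squared error of the regression line",
]
print(top_k("what is linear regression", sample_chunks, k=2))
```

With `k=2`, the chunk that shares no words with the query is dropped, which mirrors how the retriever passes only the most relevant parts of the transcript to the model.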

In [11]:
prompt = PromptTemplate(
    template="""
      You are a helpful learning assistant.
      **Answer ONLY from the provided transcript context**.
      If the context is insufficient, just say you don't know.

      {context}
      Question: {question}
    """,
    input_variables = ['context', 'question']
)

In [12]:
question = "is the topic of Random Forest discussed in this video? if yes then what was discussed"
retrieved_docs = retriever.invoke(question)

In [13]:
context_text = '\n\n'.join([doc.page_content for doc in retrieved_docs])
context_text

"when you execute it, you will get the value\nof R square as 0.63 which is pretty Good now that you have implemented\nsimple linear regression model using least Square method. Let's move on and see\nhow will you implement the model using machine learning\nlibrary call scikit-learn. All right. So this scikit-learn is a simple\nmachine learning library in Python welding machine\nlearning model are very easy using scikit-learn. So suppose there's\na python code. So using the scikit-learn\nlibraries your code shortens to this length like so let's execute\nthe Run button and see you will get the same are\nto score as Well, this was all for today's\ndiscussion in case you have any doubt. Feel free to add your query\nto the comment section. Thank you. I hope you have enjoyed\nlistening to this video. Please be kind enough to like it and you can comment any\nof your doubts and queries and we will reply them at the earliest. Do look out\nfor more videos in our playlist and subscribe to Edureka\

In [14]:
final_prompt = prompt.invoke({'context': context_text, 'question': question})

In [15]:
answer = model.invoke(final_prompt)
answer

'No, the topic of Random Forest is not discussed in this video. The video discusses simple linear regression using least squares method and scikit-learn library for machine learning implementation, as well as logistic regression, but does not mention Random Forest.'

### Optimised Implementation Using Runnables

In [16]:
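def concat_docs(docs):
    """Join the retrieved chunks into a single context string."""
    return '\n\n'.join(doc.page_content for doc in docs)

# Run both branches on the same input: retrieve the context and pass the question through
parallel_chain = RunnableParallel(
    {
        'context': retriever | RunnableLambda(concat_docs),
        'question': RunnablePassthrough(),
    }
)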

In [17]:
main_chain = parallel_chain | prompt | model | StrOutputParser()
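
The data flow of this chain can be mimicked with plain Python functions. This is only a conceptual sketch: `parallel`, `pipe`, `fake_retriever`, and `fill_prompt` are hypothetical stand-ins written for illustration, not LangChain APIs, and the model call is left out.

```python
def parallel(branches):
    """Apply every branch to the same input and collect the results in a dict."""
    return lambda x: {name: fn(x) for name, fn in branches.items()}


def pipe(*steps):
    """Feed the output of each step into the next, like the | operator in the chain."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run


def fake_retriever(question):
    # Stand-in for the vector store retriever: returns two canned chunks
    return [f"chunk about {question}", "another chunk"]


def concat(docs):
    return "\n\n".join(docs)


def fill_prompt(inputs):
    return f"{inputs['context']}\nQuestion: {inputs['question']}"


toy_chain = pipe(
    parallel({
        "context": pipe(fake_retriever, concat),  # retrieve, then join the chunks
        "question": lambda q: q,                  # pass the question through unchanged
    }),
    fill_prompt,
)
print(toy_chain("OLS"))
```

The single input string reaches both branches, and the resulting dict supplies the `context` and `question` fields that the prompt step needs.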

In [22]:
main_chain.get_graph().print_ascii()

            +---------------------------------+         
            | Parallel<context,question>Input |         
            +---------------------------------+         
                    **               ***                
                 ***                    **              
               **                         ***           
+----------------------+                     **         
| VectorStoreRetriever |                      *         
+----------------------+                      *         
            *                                 *         
            *                                 *         
            *                                 *         
    +-------------+                    +-------------+  
    | concat_docs |                    | Passthrough |  
    +-------------+*                   +-------------+  
                    **               ***                
                      ***         ***                   
                         **    

In [23]:
main_chain.invoke('What is OLS in Linear Regression?')

"According to the transcript, OLS stands for Ordinary Least Squares. It's a method used in linear regression to find the best fit line that minimizes the sum of the squared errors between the observed data points and the predicted values."