<a href="https://colab.research.google.com/github/Juhapelailee/tubeTalk/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
! pip install -q youtube_transcript_api langchain-community langchain-text-splitters faiss-cpu langchain_google_genai sentence-transformers gradio python-dotenv

In [14]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain_text_splitters import RecursiveCharacterTextSplitter

**1. Document Ingestion**

In [15]:
# https://www.youtube.com/watch?v=y3cw_9ELpQw
video_id = "y3cw_9ELpQw"
combined_text = ""  # stays empty if the transcript cannot be fetched
try:
  yt = YouTubeTranscriptApi()
  transcript = yt.fetch(video_id, languages=['en'])
  # Join the individual transcript snippets into one string
  combined_text = " ".join(chunk.text for chunk in transcript)
  # print(combined_text)
except TranscriptsDisabled:
  print("No transcript is available for this video!")
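
The join step above can be tried offline with stand-in data. This is a minimal sketch, assuming each fetched snippet exposes a `.text` attribute as the cell above uses; the `Snippet` class, the `join_transcript` helper, and the sample strings are hypothetical and not part of `youtube_transcript_api`.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical stand-in for one fetched transcript snippet."""
    text: str
    start: float = 0.0


def join_transcript(snippets):
    """Join snippet texts into one string.

    Internal whitespace (including line breaks) is collapsed to single
    spaces, and snippets that are empty after cleanup are skipped.
    """
    parts = (" ".join(s.text.split()) for s in snippets)
    return " ".join(p for p in parts if p)


sample = [
    Snippet("so a black hole"),
    Snippet("is defined\ntheoretically"),
    Snippet("  "),  # whitespace-only snippet, dropped by the helper
]
combined = join_transcript(sample)
print(combined)
```

Collapsing whitespace before joining is a design choice of this sketch; the notebook cell joins the raw snippet texts directly.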

**2. Text Splitting**

In [16]:
from langchain_core.documents import Document

def split_documents(text, chunk_size=1000, chunk_overlap=200):
  """Split one long string into overlapping chunks of roughly chunk_size characters."""
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )

  # split_documents expects Document objects, so wrap the raw string first
  doc = Document(page_content=text)
  text_chunks = text_splitter.split_documents([doc])
  return text_chunks

# print(len(combined_text))
split_chunks = split_documents(combined_text)
#print(split_chunks[1])

page_content='podcast the supported please check out our sponsors in the description and now dear friends here's Andrew strominger you are part of the Harvard black hole initiative which has theoretical physicists experimentalists and even philosophers so let me ask the big question what is a black hole from a theoretical from an experimental uh maybe even from a philosophical perspective so a black hole is defined theoretically as a region of space-time from which light can never Escape therefore it's black now that's just the starting point many weird things uh follow from that basic definition but that is that is the basic definition what is light they can't escape from a black hole well light is uh you know the stuff that comes out of the Sun that stuff that goes into your eyes light is one of the the stuff that disappears when the lights go off this is stuff that appears when the lights come on um of course that could give you a Beth a medical definition but or physical mathematic
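
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and spaces); `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

With a window of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, which is the same idea that lets neighbouring transcript chunks share context.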

**3. Storing the chunks in a vector store**

In [17]:
from langchain_community.embeddings import HuggingFaceEmbeddings

def download_embeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return embeddings

embeddings = download_embeddings()
print("Embeddings model downloaded successfully.")

Embeddings model downloaded successfully.


In [18]:
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(split_chunks, embeddings)
# Embed each chunk and index the resulting vectors; the generated document IDs differ on every run

# print(vector_store.index_to_docstore_id)
# chunks are respe

In [19]:
vector_store.get_by_ids(['e5be311d-0356-4186-a8ed-574653cc8126'])

[]
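
The empty result above is what one would expect if the ID was copied from an earlier run, since the IDs differ each time the store is built. One possible workaround is to derive IDs from the chunk content so they stay the same across runs. The sketch below uses only the standard library; `stable_chunk_id` is a hypothetical helper, and whether and how the vector store accepts caller-supplied IDs is an assumption to check against its documentation.

```python
import uuid

# Namespace tied to the video URL used in this notebook
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://www.youtube.com/watch?v=y3cw_9ELpQw")


def stable_chunk_id(index, text):
    """Return a deterministic UUID string for a chunk (hypothetical helper)."""
    return str(uuid.uuid5(NAMESPACE, f"{index}:{text}"))


chunks = ["first chunk of transcript", "second chunk of transcript"]
ids = [stable_chunk_id(i, c) for i, c in enumerate(chunks)]
print(ids)
```

Because `uuid5` hashes its inputs, the same chunk at the same position always maps to the same ID, unlike randomly generated IDs.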

**RETRIEVER**

In [30]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Expose the vector store as a retriever that returns the 3 chunks most semantically similar to the query

tags=['FAISS', 'HuggingFaceEmbeddings'] vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x798d07d74cb0> search_kwargs={'k': 3}
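
Conceptually, a similarity retriever with `k=3` ranks the stored vectors against the query vector and keeps the three best matches. The sketch below shows that idea with hand-written toy vectors and cosine similarity. The vectors, texts, and helper names are made up for illustration, and the actual FAISS index may use a different metric (L2 distance is, to my knowledge, its default in LangChain).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs standing in for embedded transcript chunks
toy_index = [
    ("black holes trap light", [1.0, 0.0, 0.0]),
    ("event horizon basics", [0.9, 0.1, 0.0]),
    ("sponsor message", [0.0, 1.0, 0.0]),
    ("string theory aside", [0.5, 0.0, 0.5]),
]
hits = top_k([1.0, 0.0, 0.0], toy_index, k=3)
print(hits)
```

The unrelated "sponsor message" entry is orthogonal to the query and is therefore the one left out of the top three.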


**Setting up LLM**

In [21]:
from langchain_google_genai import GoogleGenerativeAI
from langchain_core.prompts import PromptTemplate

from dotenv import load_dotenv
import os
from google.colab import userdata
# Load environment variables from .env file
load_dotenv()
# Get API key from Colab user data secrets
api_key = userdata.get("GEMINI_API_KEY")
os.environ["GOOGLE_API_KEY"] = api_key


llm = GoogleGenerativeAI(model="gemini-2.5-flash")

# print(result)

prompt_template = PromptTemplate(
    template="""
    You are a helpful assistant.
    Answer ONLY from the provided transcript context of the video.
    If the context is insufficient, just say that you do not know the answer.
    Context: {context}
    Question: {question}
    """,
    input_variables=["context", "question"]
)
# Prompt that restricts the LLM to answering from the retrieved transcript context
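
To see what the model receives once the two placeholders are filled, the substitution can be mimicked with plain `str.format`. This is only an illustration of the idea, not how `PromptTemplate` is implemented; the shortened `TEMPLATE` string and `render_prompt` helper are hypothetical.

```python
# Shortened version of the notebook's template, kept to the two placeholders
TEMPLATE = (
    "Answer ONLY from the provided transcript context of the video.\n"
    "Context: {context}\n"
    "Question: {question}"
)


def render_prompt(context, question):
    """Fill the {context} and {question} placeholders."""
    return TEMPLATE.format(context=context, question=question)


rendered = render_prompt("light can never escape", "What is a black hole?")
print(rendered)
```

Printing the rendered prompt like this is a quick way to check that the retrieved context actually reaches the model before spending an API call.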

**User Interface**

The retrieval, context assembly, prompt construction, and answer generation steps all run inside the `chatbot_interface` function.

In [28]:
import gradio as gr

def chatbot_interface(question):
    # Retrieve the transcript chunks most relevant to the user's question
    retrieved_docs = retriever.invoke(question)
    # Combine the retrieved chunks into one context string
    context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
    # Fill the prompt template with the context and the question
    prompt = prompt_template.invoke({"context": context_text, "question": question})
    # Ask the LLM for an answer
    answer = llm.invoke(prompt)
    # Return the answer together with the context it was based on.
    # The context part can be dropped if it proves too confusing for end users.
    return f"**Answer:** {answer}\n\n---\n\n**Context used:**\n{context_text}"

ui = gr.Interface(
    fn=chatbot_interface,
    inputs=gr.Textbox(label="Ask a question about the video:"),
    outputs="markdown",
    title="🎬 Video QA Assistant",
    description="Ask questions about the video transcript. The model answers only from the retrieved transcript context."
)

ui.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://0c595f03d6e24780dd.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
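
If showing the retrieved context turns out to be too noisy for end users, one option is to make it switchable instead of deleting it. The sketch below is a hypothetical `format_response` helper, not part of the notebook; it reproduces the same markdown layout as the interface function and adds a `show_context` flag.

```python
def format_response(answer, context_text, show_context=True):
    """Build the markdown shown to the user, optionally with the context."""
    body = f"**Answer:** {answer}"
    if show_context and context_text:
        body += f"\n\n---\n\n**Context used:**\n{context_text}"
    return body


full = format_response("Light cannot escape it.", "a region of space-time ...")
brief = format_response("Light cannot escape it.", "a region of space-time ...", show_context=False)
print(brief)
```

The flag could then be wired to a checkbox input in the interface, so the choice is left to the person asking the question.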


