# YouTube Transcript Knowledge Base - Demo

This notebook demonstrates the key functionality of the YouTube Transcript Knowledge Base project. It allows you to process YouTube videos, build a searchable knowledge base from their transcripts, organize videos into lists, and query the knowledge base for specific information.

## Setup

First, ensure you have the required dependencies installed and your OpenAI API key configured.

In [1]:
import os
import json
import dotenv

# Load environment variables from a .env file (containing your OPENAI_API_KEY)
dotenv.load_dotenv()

# Verify OpenAI API key is available
if 'OPENAI_API_KEY' not in os.environ:
    print("⚠️ OPENAI_API_KEY not found in environment variables.")
    print("Please create a .env file with your OpenAI API key or set it manually below:")
    # Uncomment and replace with your key if needed
    # os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
else:
    print("✅ OpenAI API key found in environment variables.")

⚠️ OPENAI_API_KEY not found in environment variables.
Please create a .env file with your OpenAI API key or set it manually below:
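
As a stricter alternative to the printed warning, the check can fail fast and confirm the key without echoing it in full. This is only a sketch: `require_env` and `mask_secret` are hypothetical helper names, not part of the project.

```python
import os

def require_env(name, env=None):
    """Return the stripped value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it first.")
    return value

def mask_secret(value, visible=4):
    """Hide all but the last few characters so the key is safe to print."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

In a notebook cell this would be used as `print(mask_secret(require_env("OPENAI_API_KEY")))`, which stops the run early if the key is missing.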


## Initialize Data Paths

Set up the paths for data storage. They are passed to the MCP tools when those are initialized in the next section.

In [2]:
# Define paths similar to main.py
BASE_PATH = os.path.dirname(os.path.abspath("."))
DATA_FOLDER = os.path.join(BASE_PATH, "data")
DATA_PATH = os.path.join(DATA_FOLDER, "processed_data")
FAISS_INDEX_PATH = os.path.join(DATA_FOLDER, "youtube_faiss_index")
VIDEO_LISTS_PATH = os.path.join(DATA_FOLDER, "video_lists.json")
VIDEO_SUMMARIES_PATH = os.path.join(DATA_FOLDER, "video_summaries.json")
ALL_VIDEOS_METADATA_PATH = os.path.join(DATA_FOLDER, "all_videos_metadata.json")

# Create directories if they don't exist
os.makedirs(DATA_PATH, exist_ok=True)
os.makedirs(os.path.dirname(FAISS_INDEX_PATH), exist_ok=True)

# Create a dictionary of paths to pass to init_mcp_tools
data_paths = {
    'DATA_PATH': DATA_PATH,
    'FAISS_INDEX_PATH': FAISS_INDEX_PATH,
    'VIDEO_LISTS_PATH': VIDEO_LISTS_PATH,
    'VIDEO_SUMMARIES_PATH': VIDEO_SUMMARIES_PATH,
    'ALL_VIDEOS_METADATA_PATH': ALL_VIDEOS_METADATA_PATH
}

print(f"Data will be stored in: {DATA_FOLDER}")

Data will be stored in: /Users/mk/Work/Agent_learning/data


In [3]:
print(DATA_FOLDER)

/Users/mk/Work/Agent_learning/data


## Import the MCP Tools

Here we import the main functionality from the YouTube Transcript Knowledge Base project.

In [4]:
try:
    # Try importing as a package
    from youtube_knowledgebase_mcp.mcp_tools import init_mcp_tools
    from youtube_knowledgebase_mcp.data_management import initialize_data_files
except ImportError:
    # If package import fails, try importing from local files
    print("Importing from local files instead of packages")
    
    # This requires the project files to be in the same directory as this notebook
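    import importlib.util
    import sys

    def _load_local_module(name, path):
        # Load a module from a file path (SourceFileLoader.load_module() is deprecated)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        spec.loader.exec_module(module)
        return module

    # Load necessary modules from files
    data_management = _load_local_module("data_management", "./data_management.py")
    mcp_tools = _load_local_module("mcp_tools", "./mcp_tools.py")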
    
    # Get required functions
    init_mcp_tools = mcp_tools.init_mcp_tools
    initialize_data_files = data_management.initialize_data_files

# Initialize data files
initialize_data_files(data_paths)

# Initialize MCP tools
init_mcp_tools(data_paths)

print("✅ MCP tools initialized successfully")

✅ MCP tools initialized successfully


## Import Tools for Direct Use

Now we'll import the specific tools we need for our demo.

In [5]:
# Import all the tools we'll use in this demo
try:
    from youtube_knowledgebase_mcp.mcp_tools import (
        process_youtube_video,
        youtube_transcript_query_tool,
        check_knowledge_base_status,
        create_video_list,
        add_video_to_list,
        get_video_lists,
        add_video_summary,
        get_video_summary,
        get_all_videos_info,
        get_video_info,
        filter_videos
    )
except ImportError:
    # If package import fails, get functions from the module loaded above
    process_youtube_video = mcp_tools.process_youtube_video
    youtube_transcript_query_tool = mcp_tools.youtube_transcript_query_tool
    check_knowledge_base_status = mcp_tools.check_knowledge_base_status
    create_video_list = mcp_tools.create_video_list
    add_video_to_list = mcp_tools.add_video_to_list
    get_video_lists = mcp_tools.get_video_lists
    add_video_summary = mcp_tools.add_video_summary
    get_video_summary = mcp_tools.get_video_summary
    get_all_videos_info = mcp_tools.get_all_videos_info
    get_video_info = mcp_tools.get_video_info
    filter_videos = mcp_tools.filter_videos

print("✅ All tools imported successfully")

✅ All tools imported successfully


## 1. Check Knowledge Base Status

First, let's check the current status of our knowledge base.

In [6]:
status = check_knowledge_base_status()
print(status)

Knowledge base doesn't exist yet. Add transcripts first.


## 2. Process a YouTube Video

Now, let's process a YouTube video and add it to our knowledge base. Replace the URL with any YouTube video you'd like to process.

In [None]:
# Choose an educational YouTube video to process (replace with any video URL)
video_url = "https://www.youtube.com/watch?v=CDjjaTALI68"  # Example: Understanding MCP From Scratch

print(f"Processing video: {video_url}\n")
result = process_youtube_video(video_url)
print(result)

Processing video: https://www.youtube.com/watch?v=ESfA3PXDKqM

Getting metadata for: https://www.youtube.com/watch?v=ESfA3PXDKqM
Processing video: 'Hearthstone But Your Deck is Random Legendaries'
Extracting transcript from YouTube
Extracted video ID: ESfA3PXDKqM
[youtube] Extracting URL: https://www.youtube.com/watch?v=ESfA3PXDKqM
[youtube] ESfA3PXDKqM: Downloading webpage
[youtube] ESfA3PXDKqM: Downloading tv client config
[youtube] ESfA3PXDKqM: Downloading player 9a279502-main
[youtube] ESfA3PXDKqM: Downloading tv player API JSON
[youtube] ESfA3PXDKqM: Downloading ios player API JSON
[youtube] ESfA3PXDKqM: Downloading m3u8 information
[info] ESfA3PXDKqM: Downloading subtitles: en




[info] ESfA3PXDKqM: Downloading 1 format(s): 18
[info] Writing video subtitles to: /var/folders/c9/d6ptm61x6k7fqt_phjr3b5p80000gn/T/tmpqri7dsr_/ESfA3PXDKqM.en.vtt
[download] Destination: /var/folders/c9/d6ptm61x6k7fqt_phjr3b5p80000gn/T/tmpqri7dsr_/ESfA3PXDKqM.en.vtt
[download] 100% of  151.90KiB in 00:00:00 at 1.69MiB/s
Video info extracted: Hearthstone But Your Deck is Random Legendaries
Found subtitle files: ['ESfA3PXDKqM.en.vtt']
Using manual subtitle: ESfA3PXDKqM.en.vtt
Successfully extracted transcript for video ID: ESfA3PXDKqM
Processing transcript content...
Saved processed transcript data to /Users/mk/Work/Agent_learning/data/processed_data/ESfA3PXDKqM_processed.json
Saved video metadata to centralized storage
Creating semantic chunks and updating knowledge base...
Trying Ollama embeddings...
Cannot connect to Ollama server: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/tags (Caused by NewConnectionError('<urllib3.connection.HTTPConnectio

  return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



Creating new FAISS index with 50 documents


Successfully processed YouTube video 'Hearthstone But Your Deck is Random Legendaries' (ID: ESfA3PXDKqM) with 50 semantic chunks
Successfully processed YouTube video 'Hearthstone But Your Deck is Random Legendaries' (ID: ESfA3PXDKqM) with 50 semantic chunks
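
The log above includes an "Extracted video ID" step. The project's own extraction logic is not shown in this notebook, so the following is only a sketch of how an ID could be pulled from common YouTube URL shapes with the standard library. `extract_video_id` is a hypothetical name, and the set of supported URL forms is an assumption.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        # Embedded players and Shorts: /embed/<id>, /shorts/<id>
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None

    return None
```

Returning `None` instead of raising lets the caller decide whether an unrecognized URL is an error.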


## 3. Query the Knowledge Base

Now that we have a video in our knowledge base, let's query it to find specific information.

In [None]:
from langchain_openai import ChatOpenAI

# If you processed Understanding MCP From Scratch, a good query might be:
query = "Based on the video, what is MCP?"

# You can modify this query for your specific video
print(f"Querying: '{query}'\n")
results = youtube_transcript_query_tool(query)
print(results)  # raw results from FAISS retrieval

# Process the query results with langchain_openai to get a more structured answer

# Initialize the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo")

# Format a prompt with the results to get a concise answer
prompt = f"""
Based on the transcript segments from the video, please provide a clear explanation of what MCP is.
Here are the relevant transcript segments:
{results}

Please summarize what MCP is according to this video in a concise paragraph.
"""

# Get a structured answer
structured_answer = llm.invoke(prompt)
print("\n=== Structured Answer ===")
print(structured_answer.content)
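
To make the retrieval step less opaque, here is a toy illustration of nearest-neighbor search over embeddings. It is not how the project or FAISS is implemented: FAISS indexes typically rank by L2 distance or inner product, whereas this sketch uses cosine similarity in pure Python. The three-dimensional vectors and chunk texts are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up embeddings: the first axis loosely stands for "about MCP"
chunks = [
    ("MCP lets applications connect to tools", [0.9, 0.1, 0.0]),
    ("Today's weather is sunny", [0.0, 0.2, 0.9]),
    ("Servers expose tools over the protocol", [0.8, 0.3, 0.1]),
]
best = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, text in best:
    print(f"{score:.3f}  {text}")
```

A real embedding model produces vectors with hundreds of dimensions, but the ranking idea is the same: the chunks closest to the query vector are handed to the LLM as context.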

In [24]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from youtube_knowledgebase_mcp.vector_store import get_or_create_faiss_index


vectorstore = get_or_create_faiss_index(FAISS_INDEX_PATH)

# Create a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})



# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=False
)

def ask_llm(question):
    print(f"Question: {question}\n")
    result = qa_chain.invoke({"query": question})
    
    # Print the LLM's answer
    print("Answer:")
    print(result["result"])
    
    # Print the source documents used
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"], 1):
        video_id = doc.metadata.get("video_id", "unknown")
        title = doc.metadata.get("title", "Unknown Title")
        start_time = doc.metadata.get("start_time", "00:00:00")
        print(f"Document {i}: Video '{title}' (ID: {video_id}) at {start_time}")
    
    return result

# Ask a question about the content we just added to the knowledge base
llm_query = "Based on the video transcript, can you explain what MCP is?"
llm_result = ask_llm(llm_query)

Loaded existing FAISS index from /Users/mk/Work/Agent_learning/data/youtube_faiss_index
Question: Based on the video transcript, can you explain what MCP is?



Answer:
Based on the video transcript, MCP (Model Control Protocol) is a standard protocol that provides a convenient interface to bind tools directly to various applications like IDEs and CLA desktop apps. It allows for communication between clients, applications, and resources like raw docs, with customizable communication. Essentially, MCP enables the integration of tools and context into different applications, making it simpler to connect and utilize various software tools efficiently.

Source Documents:
Document 1: Video 'Understanding MCP From Scratch' (ID: CDjjaTALI68) at 00:00:00.080
Document 2: Video 'Understanding MCP From Scratch' (ID: CDjjaTALI68) at 00:00:00.080
Document 3: Video 'Understanding MCP From Scratch' (ID: CDjjaTALI68) at 00:00:01.520
Document 4: Video 'Understanding MCP From Scratch' (ID: CDjjaTALI68) at 00:00:01.520
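
The source list above repeats the same chunk positions and shows raw timestamps. As a possible post-processing step, the sketch below deduplicates the sources and turns each one into a link that opens the video at that moment. The function names are hypothetical. The `HH:MM:SS.mmm` format is taken from the output above, and the `t=<seconds>s` URL parameter is YouTube's usual way to express a start offset.

```python
def timestamp_to_seconds(ts):
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def source_link(video_id, start_time):
    """Build a watch URL that starts playback at the given timestamp."""
    seconds = int(timestamp_to_seconds(start_time))
    return f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"

def unique_sources(sources):
    """Drop repeated (video_id, start_time) pairs while preserving order."""
    seen = set()
    unique = []
    for key in sources:
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

sources = [
    ("CDjjaTALI68", "00:00:00.080"),
    ("CDjjaTALI68", "00:00:00.080"),
    ("CDjjaTALI68", "00:00:01.520"),
    ("CDjjaTALI68", "00:00:01.520"),
]
links = [source_link(vid, ts) for vid, ts in unique_sources(sources)]
print(links)
```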


## 4. Get Video Information

Let's examine the metadata for the video we just processed.

In [None]:
# We need to extract the video ID from the result
# This is a simple way to do it from the previous processing result
import re

# Extract video ID from the result or use a known ID
video_id_match = re.search(r'ID: ([\w-]+)', result)
if video_id_match:
    video_id = video_id_match.group(1)
    print(f"Found video ID: {video_id}\n")
else:
    # Fallback in case regex didn't work
    video_id = "CDjjaTALI68"  # ID of the video processed above; replace it if you used a different URL
    print(f"Using default video ID: {video_id}\n")

# Get detailed information about the video
video_info = get_video_info(video_id)
print(video_info)
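
The regex in the cell above can be wrapped in a small helper so the fallback is explicit and the pattern is easy to test. This sketch assumes the result message keeps the `(ID: ...)` form seen in the processing output, and that video IDs are 11 characters long. `parse_video_id` is a hypothetical name.

```python
import re

# Matches "(ID: <11 characters>)" as it appears in the processing result message
ID_PATTERN = re.compile(r"\(ID: ([\w-]{11})\)")

def parse_video_id(message, default=None):
    """Return the video ID embedded in a result message, or the default."""
    match = ID_PATTERN.search(message or "")
    return match.group(1) if match else default

sample = (
    "Successfully processed YouTube video "
    "'Hearthstone But Your Deck is Random Legendaries' "
    "(ID: ESfA3PXDKqM) with 50 semantic chunks"
)
print(parse_video_id(sample))
```

Requiring the surrounding parentheses keeps the pattern from matching a video title that happens to contain the text "ID:".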

## 5. Create a Video List and Add the Video

Let's organize our videos by creating a themed list.

In [None]:
# Create a new list for educational videos
list_name = "educational-videos"
list_description = "Videos about learning, education, and study techniques"

create_result = create_video_list(list_name, list_description)
print(create_result)

# Add our video to the list
add_result = add_video_to_list(video_id, list_name)
print(add_result)

# View all lists
lists = get_video_lists()
print("\nCurrent video lists:")
print(lists)
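
For intuition about what a JSON-backed list store can look like, here is a minimal sketch. It is not the project's implementation: the layout of `video_lists.json` is not shown in this notebook, so the `{"description": ..., "videos": [...]}` structure and the function names below are assumptions made for illustration.

```python
import json
import os
import tempfile

def load_lists(path):
    """Return the stored lists, or an empty dict if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def save_lists(path, lists):
    """Write all lists back to disk as readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(lists, f, indent=2)

def add_to_list(path, list_name, video_id, description=""):
    """Add a video to a list, creating the list if needed and skipping duplicates."""
    lists = load_lists(path)
    entry = lists.setdefault(list_name, {"description": description, "videos": []})
    if video_id not in entry["videos"]:
        entry["videos"].append(video_id)
    save_lists(path, lists)
    return entry

# Demo in a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "video_lists.json")
    add_to_list(demo_path, "educational-videos", "CDjjaTALI68", "Demo list")
    add_to_list(demo_path, "educational-videos", "CDjjaTALI68")  # duplicate is ignored
    demo_lists = load_lists(demo_path)
```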

## 6. Add a Custom Summary

Let's add our own summary to the video to enhance searchability.

In [None]:
# Create a summary for the video
summary = """
This video is a comprehensive introduction to MCP (Model Context Protocol) presented by Lance from LangChain.
The 12-minute tutorial takes a hands-on approach to explaining what MCP is and how to implement it from scratch.
"""

# Add the summary to the video
summary_result = add_video_summary(video_id, summary)
print(summary_result)

# Retrieve the summary to verify
get_summary_result = get_video_summary(video_id)
print("\nRetrieved summary:")
print(get_summary_result)

## 7. Process Another Video (Optional)

To build a more useful knowledge base, let's add another video.

In [None]:
# Uncomment and run this cell to process another video

# video_url2 = "https://www.youtube.com/watch?v=D7_ipDqhtwk"  # Example: How We Build Effective Agents: Barry Zhang, Anthropic
# print(f"Processing second video: {video_url2}\n")
# result2 = process_youtube_video(video_url2)
# print(result2)

# # Extract video ID for the second video
# video_id2_match = re.search(r'ID: ([\w-]+)', result2)
# if video_id2_match:
#     video_id2 = video_id2_match.group(1)
#     print(f"\nAdding video ID: {video_id2} to educational-videos list")
#     add_video_to_list(video_id2, list_name)
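
When adding several videos, a small loop that records failures instead of stopping at the first one can be convenient. The sketch below takes the processing function as a parameter so it can be tried without network access. `process_many` and `fake_process` are hypothetical names, and whether `process_youtube_video` raises an exception or returns an error message on failure is not shown here, so the exception handling is an assumption.

```python
def process_many(urls, process):
    """Run `process` on each URL, collecting results and failures separately."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = process(url)
        except Exception as exc:  # keep going and report the failure at the end
            failures[url] = str(exc)
    return results, failures

def fake_process(url):
    """Stand-in for a real processing function, used only for this demo."""
    if "bad" in url:
        raise ValueError("no transcript available")
    return f"processed {url}"

done, failed = process_many(
    ["https://example.com/good", "https://example.com/bad"],
    fake_process,
)
print(done)
print(failed)
```

In the notebook, `process_youtube_video` would be passed in place of `fake_process`.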


## 8. Get All Videos Information

Finally, let's get comprehensive information about all videos in our knowledge base.

In [None]:
all_videos = get_all_videos_info()
print("All videos in knowledge base:")
print(all_videos)

## Conclusion

This notebook has demonstrated the main functionality of the YouTube Transcript Knowledge Base:

1. Processing YouTube videos to extract transcripts
2. Querying the knowledge base for specific information
3. Getting detailed information about videos
4. Organizing videos into lists
5. Adding custom summaries


You can continue building your knowledge base by:
- Processing more videos on topics you're interested in
- Creating more specific lists to organize your videos
- Adding detailed summaries to improve searchability
- Running increasingly specific queries to find exactly the information you need

This system helps you retain and retrieve valuable information from videos without having to rewatch them completely.