Install packages

In [11]:
%pip install youtube-transcript-api chromadb ollama

Note: you may need to restart the kernel to use updated packages.





Import modules

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter
import chromadb
from chromadb.utils import embedding_functions
import ollama


Set up resources

In [None]:
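# Persistent on-disk vector store; data is kept in ./my_vectordb between runs
chroma_client = chromadb.PersistentClient(path="my_vectordb")

# Embed documents with a local Ollama model
# (requires a running Ollama server with nomic-embed-text pulled)
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    model_name="nomic-embed-text",
)

# Reuse the collection if it already exists, otherwise create it
chroma_collection = chroma_client.get_or_create_collection(
    name='yt_notes',
    embedding_function=ollama_ef
)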


Inputs

In [None]:
yt_video_id = 'bCz4OMemCcA'

prompt = "Extract key notes from video transcript: "


Extract Transcript

In [None]:
ytt_api = YouTubeTranscriptApi()

# Fetch transcript (English priority)
fetched_transcript = ytt_api.fetch(yt_video_id, languages=['en', 'en-US', 'en-GB'])

# Convert to plain text
transcript_text = "\n".join([snippet.text for snippet in fetched_transcript])

# Save to file
with open("temp_transcript.txt", "w", encoding="utf-8") as file:
    file.write(transcript_text)
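
The cell above expects a bare video ID in `yt_video_id`. When starting from a full link instead, a small helper can pull the ID out first. This is a minimal sketch: `extract_video_id` is a hypothetical helper, not part of `youtube-transcript-api`, and it only covers the common `youtu.be`, `watch?v=`, `embed` and `shorts` URL shapes.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames treated as YouTube watch pages (assumption: not exhaustive)
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a YouTube URL, or the input unchanged if it is not a recognized URL."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0]

    if host in YOUTUBE_HOSTS:
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        query = parse_qs(parsed.query)
        if "v" in query:
            return query["v"][0]
        # Embed and shorts links: /embed/<id>, /shorts/<id>
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2 and parts[0] in {"embed", "shorts"}:
            return parts[1]

    # Anything else is assumed to already be a bare ID
    return url_or_id
```

With this in place, `yt_video_id = extract_video_id("https://youtu.be/bCz4OMemCcA")` yields the same ID used in the Inputs cell.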


Generate Notes

In [15]:
# Reuse transcript_text produced by the Extract Transcript cell

# Generate notes using Ollama
response = ollama.chat(model="llama3", messages=[
    {"role": "user", "content": prompt + transcript_text}
])
notes = response["message"]["content"]
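
The call above sends the whole transcript in a single message, which may exceed the model's context window for long videos. One possible workaround is to split the transcript into chunks at line boundaries and summarize each chunk separately. The sketch below shows only the splitting step; `chunk_lines` and the default `max_chars=8000` are assumptions for illustration, not values taken from Ollama or the model.

```python
def chunk_lines(text: str, max_chars: int = 8000) -> list:
    """Split text into chunks of roughly max_chars, breaking only at line boundaries.

    A single line longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        # +1 accounts for the newline that rejoins the lines
        added = len(line) + 1
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each element of `chunk_lines(transcript_text)` could then be passed to the same `ollama.chat` call in a loop, with the partial notes joined afterwards.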

Save Notes

In [None]:
# Save notes (the transcript itself was already saved in the Extract Transcript cell)
with open("temp_notes.txt", "w", encoding="utf-8") as file:
    file.write(notes)

# Read notes back in
with open("temp_notes.txt", "r", encoding="utf-8") as file:
    notes = file.read()

# Insert or update record in Chroma
chroma_collection.upsert(
    documents=[notes],
    ids=[yt_video_id]
)

# Validation: fetch the record back by ID
result = chroma_collection.get(ids=[yt_video_id], include=['documents'])
print(result)

{'ids': ['bCz4OMemCcA'], 'embeddings': None, 'documents': ['Here are the key notes from the video transcript:\n\n**Introduction**\n\n* The Transformer model is an attention-based neural network architecture designed for machine translation.\n* This video will go through each aspect of the Transformer model.\n\n**Encoder-Decoder Structure**\n\n* The Transformer consists of an encoder and a decoder.\n* The encoder takes in a sequence of input tokens and outputs a sequence of hidden states.\n* The decoder takes in the output from the encoder and generates a sequence of output tokens.\n\n**Self-Attention Mechanism**\n\n* Self-attention is used to allow the model to attend to different parts of the input sequence simultaneously.\n* This mechanism helps the model capture long-range dependencies and contextual relationships between tokens.\n\n**Multi-Head Attention**\n\n* Multi-head attention allows the model to jointly attend to information from different representation subspaces at differen

Search Notes

In [19]:
# Define query parameters
query_text = "How does the transformer work and what is the difference between the transformer and the RNN"
n_results = 5

# Perform semantic search
results = chroma_collection.query(
    query_texts=[query_text],
    n_results=n_results,
    include=["documents", "distances", "metadatas"]
)

# Display search results
print("\n🔍 Top Search Results:\n")
for idx, (vid_id, doc) in enumerate(zip(results["ids"][0], results["documents"][0]), start=1):
    print(f"{'*' * 72}")
    print(f"{idx}.  https://youtu.be/{vid_id}")
    print(f"{'*' * 72}")
    print(doc, "\n")



🔍 Top Search Results:

************************************************************************
1.  https://youtu.be/bCz4OMemCcA
************************************************************************
Here are the key notes from the video transcript:

**Introduction**

* The Transformer model is an attention-based neural network architecture designed for machine translation.
* This video will go through each aspect of the Transformer model.

**Encoder-Decoder Structure**

* The Transformer consists of an encoder and a decoder.
* The encoder takes in a sequence of input tokens and outputs a sequence of hidden states.
* The decoder takes in the output from the encoder and generates a sequence of output tokens.

**Self-Attention Mechanism**

* Self-attention is used to allow the model to attend to different parts of the input sequence simultaneously.
* This mechanism helps the model capture long-range dependencies and contextual relationships between tokens.

**Multi-Head Attention**

*
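
The query result is nested: each field holds one inner list per query text, which is why the loop above indexes `results["ids"][0]`. A minimal sketch of flattening that shape into per-hit records, with an optional distance cutoff, follows. `flatten_results` is a hypothetical helper, and the `sample` values (including the second ID and both distances) are made up for illustration rather than real output.

```python
from typing import Optional


def flatten_results(results: dict, max_distance: Optional[float] = None) -> list:
    """Turn the nested result of a single-query search into a flat list of hits."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    distances = results["distances"][0]

    hits = [
        {"id": vid_id, "document": doc, "distance": dist}
        for vid_id, doc, dist in zip(ids, docs, distances)
    ]
    # Smaller distance means a closer match, so keep hits at or below the cutoff
    if max_distance is not None:
        hits = [hit for hit in hits if hit["distance"] <= max_distance]
    return hits


# Illustrative data in the same shape as a one-query result (values are invented)
sample = {
    "ids": [["bCz4OMemCcA", "hypothetical_id"]],
    "documents": [["Transformer notes", "Unrelated notes"]],
    "distances": [[0.42, 1.30]],
}

for hit in flatten_results(sample, max_distance=1.0):
    print(hit["id"], hit["distance"])
```

A suitable cutoff depends on the embedding model and the collection's distance metric, so the `1.0` here is only a placeholder.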

In [20]:
# Retrieve the closest notes for the question
results = chroma_collection.query(
    query_texts=[query_text],
    n_results=5,
    include=['documents', 'distances', 'metadatas']
)

# Use only the top-ranked document as context for the answer
context_doc = results['documents'][0][0]
answer_prompt = f"Answer the QUESTION using DOCUMENT as context.\nQUESTION: {query_text}\nDOCUMENT: {context_doc}"

# Ask the local model to answer from the retrieved context
response = ollama.chat(model="llama3", messages=[
    {"role": "user", "content": answer_prompt}
])

print(response["message"]["content"])


Based on the document, here's how the Transformer works:

The Transformer is an attention-based neural network architecture designed for machine translation. It consists of an encoder and a decoder. The encoder takes in a sequence of input tokens and outputs a sequence of hidden states. The decoder then takes in the output from the encoder and generates a sequence of output tokens.

The key mechanisms that make the Transformer work are:

1. **Self-Attention Mechanism**: This allows the model to attend to different parts of the input sequence simultaneously, capturing long-range dependencies and contextual relationships between tokens.
2. **Multi-Head Attention**: This mechanism enables the model to jointly attend to information from different representation subspaces at different positions, helping it capture more complex relationships between tokens.
3. **Causal Masking**: This ensures that the decoder only attends to information up to the current position in the sequence, preventing 