In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
!pip install -q youtube-transcript-api langchain-community langchain-openai \
               faiss-cpu tiktoken python-dotenv


In [6]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate

## Step 1a - Indexing (Document Ingestion)

In [4]:
video_id = "wjZofJX0v4M" # only the ID, not full URL
try:
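    # Note: recent 1.x releases of youtube-transcript-api deprecate this static call
    # in favor of YouTubeTranscriptApi().fetch(video_id, languages=["en"]); check your installed version.
    transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])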

    # Flatten it to plain text
    transcript = " ".join(chunk["text"] for chunk in transcript_list)
    print(transcript)

except TranscriptsDisabled:
    print("No captions available for this video.")

The initials GPT stand for Generative Pretrained Transformer. So that first word is straightforward enough, these are bots that generate new text. Pretrained refers to how the model went through a process of learning from a massive amount of data, and the prefix insinuates that there's more room to fine-tune it on specific tasks with additional training. But the last word, that's the real key piece. A transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI. What I want to do with this video and the following chapters is go through a visually-driven explanation for what actually happens inside a transformer. We're going to follow the data that flows through it and go step by step. There are many different kinds of models that you can build using transformers. Some models take in audio and produce a transcript. This sentence comes from a model going the other way around, producing synthetic speech just from
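The cell above expects a bare video ID rather than a full URL. As a convenience, here is a minimal sketch of a helper that pulls the ID out of common YouTube link shapes. The function name `extract_video_id` and the set of URL patterns it handles are assumptions made for illustration; they are not part of `youtube-transcript-api`.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a YouTube URL, or the input unchanged if it is not a URL."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0]
    if host.endswith("youtube.com"):
        # Watch links: https://www.youtube.com/watch?v=<id>
        query = parse_qs(parsed.query)
        if "v" in query:
            return query["v"][0]
        # Embed and shorts links: /embed/<id>, /shorts/<id>
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2 and parts[0] in {"embed", "shorts"}:
            return parts[1]
    # Anything else is assumed to already be a bare ID
    return url_or_id


video_id = extract_video_id("https://www.youtube.com/watch?v=wjZofJX0v4M")
```

Passing a bare ID such as `"wjZofJX0v4M"` returns it unchanged, so the helper can sit in front of the transcript call either way.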

In [5]:
transcript_list

[{'text': 'The initials GPT stand for Generative Pretrained Transformer.',
  'start': 0.0,
  'duration': 4.56},
 {'text': 'So that first word is straightforward enough, these are bots that generate new text.',
  'start': 5.22,
  'duration': 3.8},
 {'text': 'Pretrained refers to how the model went through a process of learning',
  'start': 9.8,
  'duration': 3.381},
 {'text': "from a massive amount of data, and the prefix insinuates that there's",
  'start': 13.181,
  'duration': 3.429},
 {'text': 'more room to fine-tune it on specific tasks with additional training.',
  'start': 16.61,
  'duration': 3.43},
 {'text': "But the last word, that's the real key piece.",
  'start': 20.72,
  'duration': 2.18},
 {'text': 'A transformer is a specific kind of neural network, a machine learning model,',
  'start': 23.38,
  'duration': 4.191},
 {'text': "and it's the core invention underlying the current boom in AI.",
  'start': 27.571,
  'duration': 3.429},
 {'text': 'What I want to do with this v

## Step 1b - Indexing (Text Splitting)

In [7]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

In [8]:
len(chunks)

37

In [11]:
chunks[20]

Document(metadata={}, page_content="running here, if I run a search for all the words whose embeddings are closest to that of tower, you'll notice how they all seem to give very similar tower-ish vibes. And if you want to pull up some Python and play along at home, this is the specific model that I'm using to make the animations. It's not a transformer, but it's enough to illustrate the idea that directions in the space can carry semantic meaning. A very classic example of this is how if you take the difference between the vectors for woman and man, something you would visualize as a little vector in the space connecting the tip of one to the tip of the other, it's very similar to the difference between king and queen. So let's say you didn't know the word for a female monarch, you could find it by taking king, adding this woman minus man direction, and searching for the embedding closest to that point. At least, kind of. Despite this being a classic example for the model I'm playing w
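To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-width chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first tries to break on separators such as paragraphs, newlines, and spaces); the function `sliding_window_chunks` is a hypothetical stand-in that only shows how consecutive chunks share their boundary text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width pieces where each piece repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is why the notebook uses 200 characters of overlap on 1000-character chunks.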

## Step 1c & 1d - Indexing (Embedding Generation and Storing in Vector Store)

In [12]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)

In [13]:
vector_store.index_to_docstore_id

{0: 'c4687d16-638a-48f3-a6ee-b0723e95b25d',
 1: '118bcab1-5bcb-479a-9c2f-164433c94d54',
 2: 'f40a57f5-4922-4d20-aa72-ce3e9bc153eb',
 3: 'be47cc5b-2be7-47f6-a7d6-d68f06cc2326',
 4: '9a3e6818-839f-4746-a85e-3445cd88e3a5',
 5: '763bbdcf-f782-4632-899c-b3ff53ff8316',
 6: '1a80b368-602c-442c-8a92-ee715c7cdd4f',
 7: 'a9ecab53-15c7-44f2-8687-401a51c09cfa',
 8: 'eb350fe8-451e-4545-a3ef-81d3fcd9c02e',
 9: '6da37ec9-38bd-4c2b-90ff-070d3a38ca21',
 10: '68e53d2d-82f9-4810-829e-788d496a600b',
 11: '1dee83e0-15a3-4e0d-b9eb-607ceffe4547',
 12: '4392094b-a7dd-450d-b306-f9441008a761',
 13: '347dbeeb-679c-444d-b353-21f26f408e28',
 14: '440a63b9-94cc-4859-988d-6ec960622587',
 15: '12ff9a52-2ecb-430d-a6c1-8e6d00f9d23f',
 16: 'f4d61c35-415c-4fb8-aa2d-f994a0e8bae1',
 17: '47194614-fb93-466e-83d3-4cbd6f5315aa',
 18: '9fddcc08-073d-4755-b377-6d30fbd18021',
 19: '629a87c3-91e3-4c98-a120-d00a35d371f7',
 20: 'a36a40c1-2316-46d2-9a32-deabcc7c17a2',
 21: '571d9a10-1d80-4b3a-8fe5-19ce7381f72d',
 22: '1a52855f-69e2-

In [14]:
vector_store.get_by_ids(['118bcab1-5bcb-479a-9c2f-164433c94d54'])

[Document(id='118bcab1-5bcb-479a-9c2f-164433c94d54', metadata={}, page_content="of models that you can build using transformers. Some models take in audio and produce a transcript. This sentence comes from a model going the other way around, producing synthetic speech just from text. All those tools that took the world by storm in 2022 like DALL-E and Midjourney that take in a text description and produce an image are based on transformers. Even if I can't quite get it to understand what a pi creature is supposed to be, I'm still blown away that this kind of thing is even remotely possible. And the original transformer introduced in 2017 by Google was invented for the specific use case of translating text from one language into another. But the variant that you and I will focus on, which is the type that underlies tools like ChatGPT, will be a model that's trained to take in a piece of text, maybe even with some surrounding images or sound accompanying it, and produce a prediction for 
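The two outputs above reflect a two-level lookup: a row position in the vector index maps to a document ID, and the document ID maps to the stored document. The sketch below mimics that bookkeeping with plain dictionaries. The names `docstore` and `get_by_ids` here are toy stand-ins defined in this cell, not the LangChain objects themselves, and no embeddings are involved.

```python
import uuid

texts = ["chunk zero", "chunk one", "chunk two"]

# Toy stand-ins for the two lookup tables a FAISS-backed store keeps
docstore = {}              # document ID -> document text
index_to_docstore_id = {}  # row position in the vector index -> document ID

for position, text in enumerate(texts):
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = text
    index_to_docstore_id[position] = doc_id


def get_by_ids(ids):
    """Return the stored texts for the given IDs, skipping unknown ones."""
    return [docstore[i] for i in ids if i in docstore]


print(get_by_ids([index_to_docstore_id[1]]))  # ['chunk one']
```

A similarity search returns row positions, so this mapping is what turns a nearest-neighbour hit back into a readable chunk.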

## Step 2 - Retrieval

In [16]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

In [17]:
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002EAE4927380>, search_kwargs={'k': 4})

In [18]:
retriever.invoke('What is a transformer')

[Document(id='c4687d16-638a-48f3-a6ee-b0723e95b25d', metadata={}, page_content="The initials GPT stand for Generative Pretrained Transformer. So that first word is straightforward enough, these are bots that generate new text. Pretrained refers to how the model went through a process of learning from a massive amount of data, and the prefix insinuates that there's more room to fine-tune it on specific tasks with additional training. But the last word, that's the real key piece. A transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI. What I want to do with this video and the following chapters is go through a visually-driven explanation for what actually happens inside a transformer. We're going to follow the data that flows through it and go step by step. There are many different kinds of models that you can build using transformers. Some models take in audio and produce a transcript. This sentence com
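With `search_type="similarity"` and `k=4`, the retriever returns the four chunks whose embeddings are nearest to the query embedding. The sketch below shows that idea as a brute-force search over toy 2-D vectors. It assumes Euclidean (L2) distance as the metric, and the helpers `l2_distance` and `top_k` are illustrative names; a real FAISS index does the same job far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, vectors, k=4):
    """Return the positions of the k vectors closest to the query, nearest first."""
    scored = [(l2_distance(query_vector, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return [i for _, i in scored[:k]]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]]
nearest = top_k([0.4, 0.4], toy_vectors, k=2)
print(nearest)  # [4, 0]
```

Raising `k` gives the LLM more context at the cost of a longer prompt; lowering it keeps the prompt focused but risks missing the relevant chunk.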

## Step 3 - Augmentation

In [19]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

In [20]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      {context}
      Question: {question}
    """,
    input_variables=['context', 'question']
)

In [55]:
question = "what is softmax"
retrieved_docs = retriever.invoke(question)

In [56]:
retrieved_docs

[Document(id='d74856dc-dcea-48be-a67c-8fb0145b8222', metadata={}, page_content="matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion we'll end up with in total. As the very last mini-lesson for this chapter, I want to talk more about this softmax function, since it makes another appearance for us once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between 0 and 1, and you also need all of them to add up to 1. However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs you get by default don't abide by this at all. The values are often negative, or much bigger than 1, and they almost certainly don't add up to 1. So

In [57]:
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
context_text

"matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion we'll end up with in total. As the very last mini-lesson for this chapter, I want to talk more about this softmax function, since it makes another appearance for us once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between 0 and 1, and you also need all of them to add up to 1. However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs you get by default don't abide by this at all. The values are often negative, or much bigger than 1, and they almost certainly don't add up to 1. Softmax is the standard way to turn an arbitrary list of numbers into a valid\n\n

In [58]:
final_prompt = prompt.invoke({"context": context_text, "question": question})

In [59]:
final_prompt

StringPromptValue(text="\n      You are a helpful assistant.\n      Answer ONLY from the provided transcript context.\n      If the context is insufficient, just say you don't know.\n\n      matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion we'll end up with in total. As the very last mini-lesson for this chapter, I want to talk more about this softmax function, since it makes another appearance for us once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between 0 and 1, and you also need all of them to add up to 1. However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs you get by default don't abide by thi
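Augmentation is, at its core, string substitution: the retrieved chunks are joined and dropped into the template next to the question. Here is a minimal plain-Python sketch of the same step using `str.format`. The names `TEMPLATE` and `build_prompt` are hypothetical, and the template text is copied from the `PromptTemplate` above with the leading indentation removed.

```python
TEMPLATE = (
    "You are a helpful assistant.\n"
    "Answer ONLY from the provided transcript context.\n"
    "If the context is insufficient, just say you don't know.\n\n"
    "{context}\n"
    "Question: {question}\n"
)


def build_prompt(chunk_texts, question):
    """Join the retrieved chunk texts and fill both template slots."""
    context = "\n\n".join(chunk_texts)
    return TEMPLATE.format(context=context, question=question)


example = build_prompt(["first chunk", "second chunk"], "what is softmax")
print(example)
```

`PromptTemplate` adds input validation and integration with the rest of LangChain, but the resulting text is built the same way.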

## Step 4 - Generation

In [61]:
answer = llm.invoke(final_prompt)
print(answer.content)

Softmax is a function that turns an arbitrary list of numbers into a valid probability distribution, where each value is between 0 and 1 and all values add up to 1. It works by first raising e to the power of each number to create a list of positive values, then taking the sum of those values and dividing each term by that sum to normalize it. This results in larger input values dominating the output distribution, making it more likely to sample the maximizing input while still allowing for softer probabilities for other values.
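The generated answer describes softmax in words: exponentiate each value, then divide by the sum. A small sketch makes that concrete. Subtracting the maximum before exponentiating is a common numerical-stability step added here; it is not mentioned in the transcript and does not change the result.

```python
import math


def softmax(values):
    """Turn a list of numbers into a probability distribution."""
    shift = max(values)  # stability trick: keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, -1.0])
print(probs)
print(sum(probs))
```

The largest input gets the largest share, every output lies between 0 and 1, and the outputs sum to 1 (up to floating-point rounding).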


## Building a Chain

In [62]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [63]:
def format_docs(retrieved_docs):
    """Join the page content of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

In [64]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

In [69]:
parallel_chain.invoke('who is llm')

{'context': "layer which you consider the output. For example, the final layer in our text processing model is a list of numbers representing the probability distribution for all possible next tokens. In deep learning, these model parameters are almost always referred to as weights, and this is because a key feature of these models is that the only way these parameters interact with the data being processed is through weighted sums. You also sprinkle some non-linear functions throughout, but they won't depend on parameters. Typically, though, instead of seeing the weighted sums all naked and written out explicitly like this, you'll instead find them packaged together as various components in a matrix vector product. It amounts to saying the same thing, if you think back to how matrix vector multiplication works, each component in the output looks like a weighted sum. It's just often conceptually cleaner for you and me to think about matrices that are filled with tunable parameters that

In [66]:
parser = StrOutputParser()

In [67]:
main_chain = parallel_chain | prompt | llm | parser

In [68]:
main_chain.invoke('Can you summarize the video')

'The video is part of a mini-series on deep learning, focusing on the details of transformers. It discusses the initial and final stages of a transformer network, emphasizing the importance of background knowledge for understanding the model. The video will cover attention blocks, multi-layer perceptron blocks, and the training process. It explains how the model predicts the next token in a sequence by sampling from a probability distribution, which can be used to generate text, similar to early demos of GPT-3. The video also touches on the concept of using a system prompt to create a chatbot experience.'
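To see what the chain does without any API calls, the sketch below rebuilds the same data flow with ordinary functions. Everything here is a stand-in: `fake_retriever` and `fake_llm` are hypothetical stubs that return canned values, and `run_chain` simply calls the steps in the order that `parallel_chain | prompt | llm | parser` would.

```python
def fake_retriever(question):
    """Stub retriever: returns fixed chunk texts regardless of the question."""
    return ["chunk about attention", "chunk about softmax"]


def format_chunks(chunks):
    """Join chunk texts into one context string."""
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    """Stand-in for the prompt step: combine context and question into one string."""
    return f"{inputs['context']}\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    """Stub model: returns a message-like dict instead of calling an API."""
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)"}


def parse(message):
    """Stand-in for the output parser: extract the plain string."""
    return message["content"]


def run_chain(question):
    # Same shape as the RunnableParallel step: two keys built from one input
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,
    }
    return parse(fake_llm(fill_prompt(inputs)))


print(run_chain("what is softmax"))
```

The `|` operator in LangChain expresses this same left-to-right composition, with the benefit that each step can also be streamed, batched, or traced.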