# Build a YouTube RAG application

## Load the environment variables

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Add the link to the YouTube video you're going to use.
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=3qHkcs3kG44"
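
If `OPENAI_API_KEY` is missing from the `.env` file, `os.getenv` returns `None` and the problem only surfaces later, when the model is first called. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv`):

```python
import os


def require_env(name, env=os.environ):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# An explicit mapping is used here so the sketch runs without a real key.
fake_env = {"OPENAI_API_KEY": "example-key"}
print(require_env("OPENAI_API_KEY", fake_env))
```

In the notebook itself, the call would be `require_env("OPENAI_API_KEY")` after `load_dotenv()`.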

## Set up the model
Define the LLM that you will use.

In [2]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4o-mini")

Test the model by asking a simple question.

In [3]:
model.invoke("What is the capital of France?")

AIMessage(content='The capital of France is Paris.', response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 14, 'total_tokens': 22, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'model_name': 'gpt-4o-mini', 'system_fingerprint': 'fp_71b02749fa', 'finish_reason': 'stop', 'logprobs': None})

The model returns an `AIMessage` object that contains the response. To get the actual text answer, we can use an [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/) to convert the message into a usable format.

In this case, we'll use the `StrOutputParser` to extract the response as a plain string.

In [10]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What is the capital of France?")

'The capital of France is Paris.'
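
The `|` operator builds a chain in which each step's output becomes the next step's input. The following is a plain-Python sketch of that idea, not LangChain's implementation; `Step`, `FakeMessage`, `fake_model`, and `to_text` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Stand-in for a chat message object with a `content` field."""
    content: str


class Step:
    """Wraps a function so that steps can be composed with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its output to the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# A fake model that returns a message object (no API call is made).
fake_model = Step(lambda question: FakeMessage(content=f"Echo: {question}"))
# A parser that extracts the text from the message.
to_text = Step(lambda message: message.content)

toy_chain = fake_model | to_text
print(toy_chain.invoke("What is the capital of France?"))
# Echo: What is the capital of France?
```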

## Transcribing the YouTube Video

The context we want to send to the model comes from a YouTube video. Let's download the video's audio track and transcribe it using [OpenAI's Whisper](https://openai.com/research/whisper).

In [None]:
import tempfile
import whisper
from pytube import YouTube


# Download and transcribe only if no transcription has been saved yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    # Pick the first audio-only stream; the video track isn't needed.
    audio = youtube.streams.filter(only_audio=True).first()

    # Load the "base" Whisper model.
    whisper_model = whisper.load_model("base")

    with tempfile.TemporaryDirectory() as tmpdir:
        audio_path = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(audio_path, fp16=False)["text"].strip()

    # Save the transcription so later runs can skip the steps above.
    with open("transcription.txt", "w") as file:
        file.write(transcription)
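
The cell that produces the transcription follows a compute-once, reuse-later pattern: the expensive download and transcription run only when `transcription.txt` does not exist yet. A minimal sketch of that pattern in isolation, assuming a hypothetical helper `load_or_create` and a cheap stand-in for the Whisper call:

```python
import os
import tempfile


def load_or_create(path, produce):
    """Return the text cached at `path`, creating it with `produce()` if absent."""
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(produce())
    with open(path, encoding="utf-8") as handle:
        return handle.read()


calls = []


def fake_transcribe():
    # Stand-in for the slow download and Whisper transcription.
    calls.append(1)
    return "Two, one. Boom, all right."


with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "transcription.txt")
    first = load_or_create(cache_path, fake_transcribe)
    second = load_or_create(cache_path, fake_transcribe)

print(first == second, len(calls))  # True 1
```

The second call reads the cached file, so the stand-in transcription function runs only once.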

Verify that the transcription was saved and loads correctly.

In [8]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:241]

"Two, one. Boom, all right, well, thank you very much for doing this, man. I really appreciate it. I've been absorbing your information and listening to you talk for quite a while now, so it's great to actually meet you. Thanks for having me."

## Prompt templates
To help the model generate better responses, we combine context with the user’s question.
[Prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/quick_start) make it easy to define, manage, and reuse these prompts in a consistent way.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

formatted_prompt = prompt.format(
    context="In the episode, Naval talks about the power of leverage through code and media.",
    question="What type of leverage does Naval mention?"
)

print(formatted_prompt)


# Chain the prompt with the model and the output parser
chain = prompt | model | parser

response = chain.invoke({
    "context": "In the episode, Naval talks about the power of leverage through code and media.",
    "question": "What type of leverage does Naval mention?"
})

print(f"Answer: {response}")

Human: 
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: In the episode, Naval talks about the power of leverage through code and media.

Question: What type of leverage does Naval mention?

Answer: Naval mentions leverage through code and media.
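
At its core, filling a prompt template amounts to substituting values for the `{context}` and `{question}` placeholders. A rough plain-Python equivalent using `str.format` is sketched here only to illustrate the substitution; `ChatPromptTemplate` additionally wraps the result in a chat message, which is why the printed prompt above starts with `Human:`. The name `fill_template` is hypothetical.

```python
TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fill_template(context, question):
    """Substitute the two placeholders in TEMPLATE."""
    return TEMPLATE.format(context=context, question=question)


filled = fill_template(
    context="Naval talks about leverage through code and media.",
    question="What type of leverage does Naval mention?",
)
print(filled)
```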


## Split the transcription

The full transcription is too large to pass to the model as context in a single prompt, so a common approach is to split it into smaller chunks.

This way, the model receives only the parts most relevant to a specific question.


In [41]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents[0]

Document(page_content="Two, one. Boom, all right, well, thank you very much for doing this, man. I really appreciate it. I've been absorbing your information and listening to you talk for quite a while now, so it's great to actually meet you. Thanks for having me. My pleasure, my pleasure. You are one of the rare guys that is, you're a big investor, you're deep in the tech world, but yet you seem to have a very balanced perspective in terms of how to live life, as opposed to not just be entirely focused on success and financial success and tech investing, but rather how to live your life in a happy way. That's a, it's a, it's not balance. Yeah, you know, I think the reason why people like hearing me is because like if it's like, if you go to a circus and you see a bear, right, that's kind of interesting, but not that much. If you see a unicycle, that's interesting, but you see a bear on a unicycle, that's really interesting, right? So when you combine things, you're not supposed to com

There are various techniques for splitting a document. In this example, we’ll use a straightforward method that splits the text into chunks of at most 100 characters, with an overlap of up to 20 characters between consecutive chunks.

If you're curious about other strategies, check out the [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) guide.


In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:2]

[Document(page_content='Two, one. Boom, all right, well, thank you very much for doing this, man. I really appreciate it.', metadata={'source': 'transcription.txt'}),
 Document(page_content="appreciate it. I've been absorbing your information and listening to you talk for quite a while", metadata={'source': 'transcription.txt'})]

In [17]:
documents = text_splitter.split_documents(text_documents)
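
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces, which is why the chunks above end on word boundaries); `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size=100, chunk_overlap=20):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so text near a
    boundary appears in both neighboring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# A 250-character sample string made of repeating digits.
sample = "".join(str(i % 10) for i in range(250))
chunks = split_fixed(sample, chunk_size=100, chunk_overlap=20)

print([len(chunk) for chunk in chunks])   # [100, 100, 90, 10]
print(chunks[0][-20:] == chunks[1][:20])  # True
```

A larger overlap reduces the chance that an idea is cut in half at a chunk boundary, at the cost of storing and embedding more duplicated text.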

## Identify most relevant chunks

To answer a specific question, we need to pinpoint the most relevant parts of the transcription to pass to the model. This is where **embeddings** become useful.

You can use [Cohere's Embed Playground](https://dashboard.cohere.com/playground/embed) to visualize embeddings in two dimensions.

We compute the similarity between the question's embedding and each chunk's embedding, then use the top-matching chunks as context.



In [19]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("What is the capital of France?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[0.02451018613922683, -0.011081933963308487, -0.0013095540911784872, -0.025292292210096785, -0.017471233364042407, 0.015541198646709315, -0.015806104970870788, -0.00674250860905136, 0.0016572442819246874, -0.03451357033119359]


To illustrate how embeddings work, let's generate embeddings for two candidate answers, "Paris" and "Athens", and compare each with the embedding of the query. The closer two embeddings are, the more similar their texts are in meaning.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure how close the query is to each candidate:

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

sentence1 = embeddings.embed_query("Paris")
sentence2 = embeddings.embed_query("Athens")


query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]
 
query_sentence1_similarity, query_sentence2_similarity

(0.8350762426685766, 0.7912412151713869)

Even though both are capital cities, Paris is more semantically aligned with the question about France, so it scores slightly higher. 

The fact that both scores are relatively high shows that embeddings capture the semantic relationship (i.e., both are capitals), but the higher score for Paris confirms that it's more relevant to the specific query.
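
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the angle between them and not on their magnitudes. A dependency-free sketch of that calculation for a single pair of vectors, using made-up 3-dimensional vectors instead of real embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration (not real embeddings).
query = [1.0, 2.0, 0.0]
close = [2.0, 4.0, 0.0]  # same direction as the query, twice as long
far = [0.0, 0.0, 3.0]    # orthogonal to the query

print(round(cosine(query, close), 6))  # 1.0
print(cosine(query, far))              # 0.0
```

The vector `close` is twice as long as `query` yet scores 1.0, which shows that only direction matters.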

## Set up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to run similarity searches over them at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 


In [26]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(documents, embeddings)

In [27]:
vectorstore.similarity_search_with_score(query="What is leverage?", k=3)

[(Document(page_content='wants. Apply some leverage, put your name on it. So you take the risks, but you gain the rewards,', metadata={'source': 'transcription.txt'}),
  0.8367189667737388),
 (Document(page_content='because of the accountability that you have with your name because of leverage that you have', metadata={'source': 'transcription.txt'}),
  0.8152599168369862),
 (Document(page_content='at was looking at businesses and figuring out the point of maximum leverage to actually create', metadata={'source': 'transcription.txt'}),
  0.8102432310140415)]
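
Conceptually, a vector store keeps each chunk next to its embedding and, at query time, ranks the chunks by similarity to the query embedding. The brute-force sketch below illustrates that idea only; `TinyVectorStore` and the word-count `toy_embed` function are hypothetical stand-ins, whereas real setups use learned embeddings and, in production vector databases, typically approximate nearest-neighbor indexes.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: a bag-of-words count vector (as a Counter)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    def __init__(self, texts, embed):
        self.embed = embed
        # Embed every chunk once, up front.
        self.items = [(text, embed(text)) for text in texts]

    def similarity_search_with_score(self, query, k=3):
        query_vector = self.embed(query)
        scored = [(text, cosine(query_vector, vector)) for text, vector in self.items]
        # Highest similarity first.
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore(
    [
        "apply some leverage and put your name on it",
        "spend more time in nature",
        "code and media are forms of leverage",
    ],
    toy_embed,
)
results = store.similarity_search_with_score("what is leverage", k=2)
print([text for text, _ in results])
```

Both returned chunks contain the word "leverage"; the chunk about nature shares no words with the query and is left out.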

## Connect the vector store to the chain

To identify the most relevant parts of the transcription for the model, we’ll connect a vector store to our processing chain.

This involves setting up a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/), which performs a similarity search on the vector store and returns the top-matching chunks.


Since our prompt expects two inputs, "context" and "question", we need a way to supply both. To do this, we’ll use the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes from LangChain’s expression language.

These utilities structure the inputs into a dictionary with the keys the prompt expects. A plain dictionary placed at the start of a chain is converted to a `RunnableParallel` automatically, so the class does not need to be instantiated explicitly.
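
The idea behind this input mapping is a fan-out: the same question is passed to every entry of a dictionary, where the retriever turns it into context and a passthrough forwards it unchanged. A plain-Python sketch of that behavior, with hypothetical `fan_out`, `fake_retriever`, and `passthrough` functions standing in for the LangChain classes:

```python
def fan_out(steps, value):
    """Call every function in `steps` with the same input and collect the results."""
    return {key: step(value) for key, step in steps.items()}


def fake_retriever(question):
    # Hypothetical stand-in: a real retriever would run a similarity search.
    return ["Apply some leverage, put your name on it."]


def passthrough(value):
    # Forward the input unchanged.
    return value


prompt_inputs = fan_out(
    {"context": fake_retriever, "question": passthrough},
    "What is leverage?",
)
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys that the prompt template expects.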


In [28]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

chain = (
    {"context": vectorstore.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("What is leverage?")

"Leverage refers to the use of various methods or strategies to amplify one's potential gains or benefits. In the context provided, it suggests applying one's name or accountability to take risks in order to achieve greater rewards, particularly in business and technology. It indicates a strategic advantage that can lead to significant outcomes when properly utilized."

## Set up Pinecone

An in-memory vector store is fine for small-scale examples.

However, in real-world applications, you need a vector store that can scale with large datasets and support fast, efficient similarity searches.

For this purpose, we'll use [Pinecone](https://www.pinecone.io/), a managed vector database designed for high-performance retrieval tasks.

Create a Pinecone account, set up an index, get an API key, and set it as the environment variable `PINECONE_API_KEY`. The index dimension must match the embedding length (1536 for the embeddings used above).


In [None]:
from langchain_pinecone import PineconeVectorStore

index_name = "YOUR_INDEX_NAME"  # Replace with the name of your Pinecone index

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

Let's now run a similarity search on Pinecone to make sure everything works:

In [39]:
pinecone.similarity_search("What are the best tips to be happy?")[:1]

[Document(page_content="linger. So if you interpret the positive and everything very quickly, you let it go, right? You let it go much faster. Simple hacks get more sunlight, right? Learn to smile more, learn to hug more. These things actually release serotonin in reverse. They aren't just outward signals of being happy. They're actually feedback loops to being happy. Spend more time in nature. These are obvious. Watch your mind. Watch your mind all day long. Watch what it does. Not judge it. Not try to control it. But you can meditate 24-7. Meditation is not a sit down, close your eyes activity. Meditation is just basically watching your own thoughts like you would watch anything else in the outside world and say, why am I having that thought? Does that serve me anymore? Is that conditioning from when I was 10 years old? For example, getting ready for this podcast. You got ready? I didn't. Oh, good. I did. But I did. But I did. I couldn't help it. And what happened was the few days le

Let's set up the new chain using Pinecone as the vector store:

In [40]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What are the best tips to be happy?")

'The best tips to be happy mentioned in the context are:\n\n1. Interpret the positive quickly and let go of negativity.\n2. Engage in simple hacks to boost your mood, such as getting more sunlight.\n3. Learn to smile and hug more, as these actions can release serotonin and create feedback loops to happiness.\n4. Spend more time in nature.\n5. Watch your mind and observe your thoughts without judgment or trying to control them.\n6. Practice continuous meditation by being aware of your thoughts and questioning their relevance and origins.'