# TranscriptIQ RAG

**Skills: OpenAI, OpenRouter, LangChain, Pinecone**


**Other Resources:**
- [Get your OpenAI API Key](https://platform.openai.com/settings/profile?tab=api-keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)
- [Get your OpenRouter API Key](https://openrouter.ai/settings/keys)
- [JavaScript Code for RAG](https://js.langchain.com/v0.2/docs/tutorials/rag)
- [RAG with an in-memory database in Next.js](https://sdk.vercel.ai/examples/node/generating-text/rag)


### What is RAG anyway?


Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of text generated by LLMs. It combines two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.
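The two steps above can be sketched without any external service. This is a toy illustration only: the knowledge base is hypothetical, and word overlap stands in for the embedding-based similarity search used later in this notebook.

```python
import re

# Hypothetical mini knowledge base; the notebook stores real chunks in Pinecone instead.
KNOWLEDGE_BASE = [
    "Pinecone is a managed vector database.",
    "LangChain provides document loaders and text splitters.",
    "Cosine similarity compares the direction of two vectors.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Retrieval step: rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, contexts):
    """Input for the generation step: retrieved text is placed before the question."""
    return "Context:\n" + "\n".join(contexts) + "\n\nQuestion: " + question

question = "What is Pinecone?"
contexts = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real pipeline the prompt would then be sent to an LLM, which is the generation step.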

#### Install relevant libraries

In [1]:
! pip install langchain langchain-community openai tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference youtube-transcript-api pytube sentence-transformers

Collecting langchain
  Downloading langchain-0.2.15-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting openai
  Downloading openai-1.43.0-py3-none-any.whl.metadata (22 kB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.1.3-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.15.8-py3-none-any.whl.metadata (29 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.2/4.2 MB 16.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pdfminer.six==20221105
  Downloading pdf

In [1]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os

# Initialize your Pinecone, OpenAI, and OpenRouter API keys through the Secrets (keys) section in Google Colab.
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key


# For OpenAI

# openai_api_key = userdata.get("OPENAI_API_KEY")
# os.environ['OPENAI_API_KEY'] = openai_api_key



# Initialize the OpenAI client

In [None]:
# embeddings = OpenAIEmbeddings()
# embed_model = "text-embedding-3-small"
# openai_client = OpenAI()

# Use HuggingFace & OpenRouter if you don't have an OpenAI account with credits



In [16]:
# HuggingFace Embeddings
# Use this instead of OpenAI embeddings if you don't have an OpenAI account with credits

text = "This is a test document."

hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_result = hf_embeddings.embed_query(text)

In [3]:
query_result

[-0.038338541984558105,
 0.12346471846103668,
 -0.02864297851920128,
 0.05365270376205444,
 0.008845366537570953,
 -0.03983934596180916,
 -0.07300589233636856,
 0.04777132719755173,
 -0.030462471768260002,
 0.05497974902391434,
 0.08505292981863022,
 0.03665666654706001,
 -0.005319973453879356,
 -0.002233141800388694,
 -0.06071099638938904,
 -0.027237920090556145,
 -0.01135166734457016,
 -0.042437683790922165,
 0.00912993960082531,
 0.10081552714109421,
 0.07578728348016739,
 0.06911715865135193,
 0.009857431054115295,
 -0.0018377641681581736,
 0.02624903991818428,
 0.03290243074297905,
 -0.07177437096834183,
 0.028384247794747353,
 0.06170954555273056,
 -0.052529532462358475,
 0.033661652356386185,
 0.07446812838315964,
 0.07536034286022186,
 0.03538404777646065,
 0.06713404506444931,
 0.010798045434057713,
 0.08167017996311188,
 0.016562897711992264,
 0.03283063694834709,
 0.036325663328170776,
 0.0021727988496422768,
 -0.09895738214254379,
 0.0050467848777771,
 0.05089650675654411,


In [4]:
# Free Llama 3.1 API via OpenRouter (Aug-2024)
# Use this instead of OpenAI if you don't have an OpenAI account with credits

openrouter_client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=userdata.get("OPENROUTER_API_KEY")
)

## Initialize our text splitter
This is how we will chunk up the text to be retrieved during the RAG process

In [5]:
tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
)
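To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which splits on the listed separators and then merges pieces); whitespace-separated words stand in for tiktoken tokens, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means the last token of one chunk reappears at the start of the next, so a sentence cut at a chunk boundary still has some surrounding context in both chunks.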

# Understanding Embeddings

In [None]:
# Embedding similarity using OpenAI (run only if the OpenAI client is initialized)
def get_embedding(text, model="text-embedding-3-small"):
    # Call the OpenAI API to get the embedding for the text
    response = openai_client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def cosine_similarity_between_words(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_embedding(sentence1))
    embedding2 = np.array(get_embedding(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like walking to the office"


similarity = cosine_similarity_between_words(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")
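If you do not have an OpenAI client set up, the similarity calculation itself can be tried on hand-written vectors. The three vectors below are made up for illustration and are not real embeddings; `cosine` is a plain-Python version of what `sklearn`'s `cosine_similarity` computes for a single pair.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors standing in for sentence embeddings.
park = [1.0, 2.0, 0.0]
office = [1.0, 2.0, 0.5]
unrelated = [-2.0, 1.0, 0.0]

print(f"park vs office:    {cosine(park, office):.4f}")
print(f"park vs unrelated: {cosine(park, unrelated):.4f}")
```

Vectors pointing in nearly the same direction score close to 1, while perpendicular vectors score 0, regardless of their lengths.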


# Load in a YouTube video and get its transcript

In [6]:
# Load in a YouTube video's transcript
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=e-gwvmhyU7A", add_video_info=True)
data = loader.load()

print(data)



In [7]:
texts = text_splitter.split_documents(data)

In [8]:
texts

[Document(metadata={'source': 'e-gwvmhyU7A', 'title': 'Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434', 'description': 'Unknown', 'view_count': 659232, 'thumbnail_url': 'https://i.ytimg.com/vi/e-gwvmhyU7A/hq720.jpg', 'publish_date': '2024-06-19 00:00:00', 'length': 10936, 'author': 'Lex Fridman Podcast'}, page_content='- Can you have a conversation with an AI where it feels like you\ntalk to Einstein or Feynman where you ask them a hard question, they\'re like, "I don\'t know." And then after a week they\ndid a lot of research- - They disappear and come back. Yeah.\n- And they come back and just blow your mind. If we can achieve that, that amount of inference compute where it leads to a\ndramatically better answer as you apply more inference compute, I think that will be the beginning of, like, real reasoning breakthroughs. (graphic whooshing) - The following is a conversation with Aravind Srinivas, CEO of Perplexity, a company that a

# Initialize Pinecone





### For this to work, the Pinecone index dimension must match the embedding model. Create a 1536-dimension index for OpenAI's text-embedding-3-small, or a 384-dimension index for the Hugging Face all-MiniLM-L6-v2 embeddings used alongside OpenRouter. Here, a 384-dimension index is used.
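A mismatch between the embedding size and the index dimension only shows up as an error at insert or query time, so it can help to check it up front. The sketch below is a hypothetical helper, not part of Pinecone or LangChain; the lookup table covers only the two embedding models this notebook mentions.

```python
# Output sizes of the two embedding models used in this notebook.
EMBEDDING_DIMENSIONS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
}

def check_index_dimension(model_name, index_dimension):
    """Raise ValueError if the index cannot store vectors from the given model."""
    model_dimension = EMBEDDING_DIMENSIONS[model_name]
    if model_dimension != index_dimension:
        raise ValueError(
            f"{model_name} produces {model_dimension}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return model_dimension

# The index in this notebook is assumed to have been created with dimension 384.
print(check_index_dimension("sentence-transformers/all-MiniLM-L6-v2", 384))
```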

In [17]:

vectorstore = PineconeVectorStore(index_name="demo2", embedding=hf_embeddings)

# vectorstore = PineconeVectorStore(index_name="headstarter-demo", embedding=embeddings)

index_name = "demo2"

namespace = "youtube-videos"

# Insert data into Pinecone

Documentation: https://docs.pinecone.io/integrations/langchain#key-concepts

In [18]:
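# Print each chunk so you can inspect what will be stored
for document in texts:
    print("\n\n\n\n----")

    print(document.metadata, document.page_content)

    print('\n\n\n\n----')

# Insert every chunk once, after the loop.
# Calling from_texts inside the loop would re-insert the whole list for each chunk.
vectorstore_from_texts = PineconeVectorStore.from_texts([f"Source: {t.metadata['source']}, Title: {t.metadata['title']} \n\nContent: {t.page_content}" for t in texts], hf_embeddings, index_name=index_name, namespace=namespace)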





----
{'source': 'e-gwvmhyU7A', 'title': 'Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434', 'description': 'Unknown', 'view_count': 659232, 'thumbnail_url': 'https://i.ytimg.com/vi/e-gwvmhyU7A/hq720.jpg', 'publish_date': '2024-06-19 00:00:00', 'length': 10936, 'author': 'Lex Fridman Podcast'} - Can you have a conversation with an AI where it feels like you
talk to Einstein or Feynman where you ask them a hard question, they're like, "I don't know." And then after a week they
did a lot of research- - They disappear and come back. Yeah.
- And they come back and just blow your mind. If we can achieve that, that amount of inference compute where it leads to a
dramatically better answer as you apply more inference compute, I think that will be the beginning of, like, real reasoning breakthroughs. (graphic whooshing) - The following is a conversation with Aravind Srinivas, CEO of Perplexity, a company that aims to revolutionize how we hum

In [None]:
# Vector Insertion Code Snippet -->

# vectorstore_from_texts = PineconeVectorStore.from_texts([f"Source: {t.metadata['source']}, Title: {t.metadata['title']} \n\nContent: {t.page_content}" for t in texts], hf_embeddings, index_name=index_name, namespace=namespace)
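Re-running the insertion cell stores the same chunks again under new IDs (the IDs in the query output later in this notebook look like random UUIDs). One way to make re-runs overwrite rather than duplicate is to derive each ID from the chunk itself. The sketch below only builds such IDs; `chunk_id` is a hypothetical helper, and whether and how the insertion call accepts an explicit list of IDs is an assumption to check against the langchain_pinecone documentation.

```python
import hashlib

def chunk_id(source, chunk_index, content):
    """Build a stable ID from the source, the chunk position, and a hash of the text."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{source}-{chunk_index}-{digest}"

# Placeholder chunks; in the notebook these would be the page_content values of `texts`.
chunks = ["first chunk of the transcript", "second chunk of the transcript"]
ids = [chunk_id("e-gwvmhyU7A", i, c) for i, c in enumerate(chunks)]
print(ids)
```

Because the same input always yields the same ID, inserting a chunk twice under this scheme would target the same record instead of creating a second copy.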

# Perform RAG

In [19]:
from pinecone import Pinecone

In [20]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index("demo2")

In [21]:
query = "What does Aravind mention about pre-training and why it is important?"

In [None]:
# For OpenAI run this cell.
raw_query_embedding = openai_client.embeddings.create(
    input=[query],
    model="text-embedding-3-small"
)

query_embedding = raw_query_embedding.data[0].embedding

In [23]:
# For OpenRouter, use hf_embeddings (Hugging Face embeddings)
text = query

hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_embedding = hf_embeddings.embed_query(text)

In [24]:
query_embedding

[-0.04498419538140297,
 0.03252768889069557,
 -0.006729174870997667,
 0.05888822302222252,
 0.010554458945989609,
 0.10118735581636429,
 0.03971084579825401,
 -0.025677254423499107,
 -0.07760795950889587,
 -0.017693305388092995,
 -0.02531895786523819,
 0.12688948214054108,
 -0.07823604345321655,
 0.03553185611963272,
 -0.011518123559653759,
 -0.014532060362398624,
 0.032681189477443695,
 -0.0062502590008080006,
 -0.04025233909487724,
 -0.06671677529811859,
 -0.016978591680526733,
 -0.032142385840415955,
 0.09440036118030548,
 0.0009648544364608824,
 -0.022813627496361732,
 -0.025571594014763832,
 -0.003024167148396373,
 -0.021131504327058792,
 0.07538026571273804,
 0.04592401534318924,
 0.028599247336387634,
 0.01566319540143013,
 0.12457882612943649,
 0.030227180570364,
 -0.16030460596084595,
 0.12474517524242401,
 0.045387908816337585,
 0.036697398871183395,
 -0.029672522097826004,
 0.08723264932632446,
 0.0037641802337020636,
 -0.0701531171798706,
 -0.07088879495859146,
 -0.01577283

In [25]:
top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace=namespace)

In [26]:
top_matches

{'matches': [{'id': '6e4c2f70-6137-4637-ba1f-f3154f124e47',
              'metadata': {'text': 'Source: e-gwvmhyU7A, Title: Aravind '
                                   'Srinivas: Perplexity CEO on Future of AI, '
                                   'Search & the Internet | Lex Fridman '
                                   'Podcast #434 \n'
                                   '\n'
                                   'Content: maybe in the 10th or 9th, you '
                                   'feed it in the model, it can still know '
                                   'that that was more\n'
                                   'relevant than the first. So that '
                                   'flexibility allows\n'
                                   'you to, like, rethink where to put your '
                                   'resources in, in terms of whether you '
                                   'want\n'
                                   'to keep making the model better or whether 

In [27]:
# Get the list of retrieved texts
contexts = [item['metadata']['text'] for item in top_matches['matches']]
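`top_k=10` always returns ten matches, even when some are only weakly related to the question. Each match also carries a similarity `score`, so a possible refinement is to drop low-scoring matches before building the prompt. The sketch below uses a made-up response in the same shape as the query result shown above; `select_contexts` and the 0.3 threshold are illustrative assumptions, not recommended values.

```python
def select_contexts(matches, min_score=0.3):
    """Keep the text of matches whose similarity score is at least min_score."""
    return [m["metadata"]["text"] for m in matches if m["score"] >= min_score]

# Made-up response; real scores and texts come from pinecone_index.query(...).
sample_response = {
    "matches": [
        {"id": "a", "score": 0.62, "metadata": {"text": "chunk about pre-training"}},
        {"id": "b", "score": 0.41, "metadata": {"text": "chunk about post-training"}},
        {"id": "c", "score": 0.08, "metadata": {"text": "chunk about a sponsor"}},
    ]
}

contexts = select_contexts(sample_response["matches"])
print(contexts)
```

A useful threshold depends on the embedding model and the data, so it is worth printing the scores for a few real queries before choosing one.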

In [28]:
contexts

['Source: e-gwvmhyU7A, Title: Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434 \n\nContent: maybe in the 10th or 9th, you feed it in the model, it can still know that that was more\nrelevant than the first. So that flexibility allows\nyou to, like, rethink where to put your resources in, in terms of whether you want\nto keep making the model better or whether you wanna make\nthe retrieval stage better. It\'s a trade off. And computer science\nis all about trade-offs right at the end. - So one of the things you\nshould say is that the model, this is that pre-trained LLM is something that you can swap out in Perplexity. So it could be GPT-4o, it could be Claude 3, it can be Llama, something based on Llama 3.\n- Yeah. That\'s the model we train ourselves. We took Llama 3 and we post-trained it to be very good at few skills like summarization, referencing\ncitations, keeping context and longer context support. So that\'s called Sonar. - You

In [29]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
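With chunks of up to 2000 tokens, ten retrieved contexts can add up to roughly 20,000 tokens, which may matter for models with smaller context windows. The sketch below builds the same prompt layout but stops adding contexts once a size budget is reached. `build_augmented_query` and `max_chars` are hypothetical names, and characters are used as a rough stand-in for tokens.

```python
SEPARATOR = "\n\n-------\n\n"

def build_augmented_query(contexts, query, max_chars=8000):
    """Join contexts in ranked order until the character budget is used up."""
    kept = []
    used = 0
    for context in contexts:
        if used + len(context) > max_chars:
            break  # contexts are ranked, so stop at the first one that does not fit
        kept.append(context)
        used += len(context)
    return (
        "<CONTEXT>\n" + SEPARATOR.join(kept) +
        "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
    )

example = build_augmented_query(["aaaa", "bbbb", "cccc"], "What is RAG?", max_chars=8)
print(example)
```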

In [30]:
print(augmented_query)

<CONTEXT>
Source: e-gwvmhyU7A, Title: Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434 

Content: maybe in the 10th or 9th, you feed it in the model, it can still know that that was more
relevant than the first. So that flexibility allows
you to, like, rethink where to put your resources in, in terms of whether you want
to keep making the model better or whether you wanna make
the retrieval stage better. It's a trade off. And computer science
is all about trade-offs right at the end. - So one of the things you
should say is that the model, this is that pre-trained LLM is something that you can swap out in Perplexity. So it could be GPT-4o, it could be Claude 3, it can be Llama, something based on Llama 3.
- Yeah. That's the model we train ourselves. We took Llama 3 and we post-trained it to be very good at few skills like summarization, referencing
citations, keeping context and longer context support. So that's called Sonar. - You can 

In [None]:
# Modify the prompt below as needed to improve the response quality

primer = f"""You are a personal assistant. Answer any questions I have about the Youtube Video provided.
"""

res = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

openai_answer = res.choices[0].message.content

In [None]:
print(openai_answer)

Aravind Srinivas highlights the significance of pre-training in the development of large language models (LLMs). Here are the key points he makes about pre-training and why it is important:

1. **Common Sense Acquisition**:
   Pre-training is crucial for acquiring general common sense. Without a solid foundation of common sense knowledge obtained during pre-training, the models would lack the essential knowledge needed for various tasks later on.

2. **Foundation for Post-Training**:
   Aravind refers to the phases of model development as pre-train and post-train. The pre-training phase involves raw scaling on compute, where the model learns from vast amounts of data. This phase gives the model the necessary base understanding of language and common sense required for it to be effective in subsequent post-training phases, such as reinforcement learning from human feedback (RLHF) or supervised fine-tuning.

3. **Brute Force but Necessary**:
   He acknowledges that pre-training can seem 

# Using OpenRouter

In [51]:
# Check out different models here: https://openrouter.ai/docs/models

primer = f"""You are a personal assistant. Answer any questions I have about the Youtube Video provided.
"""

res = openrouter_client.chat.completions.create(
    #model="mistralai/mistral-nemo",
    model="meta-llama/llama-3.1-8b-instruct:free",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

answer = res.choices[0].message.content

In [52]:
print(answer)

According to the conversation, Aravind mentions that pre-training a model is an important step, but it's the post-training that's crucial for fine-tuning the model to specific skills. Specifically, he mentions that they took Llama 3 and post-trained it to be very good at specific skills like summarization, referencing citations, keeping context, and longer context support. This post-training process is what allows their model, Sonar, to be effective in those areas.


# Putting it all together

## OpenAI RAG





In [None]:
# Run this cell only if you are using OpenAI
def perform_rag(query):
    raw_query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )

    query_embedding = raw_query_embedding.data[0].embedding

    top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are an expert personal assistant. Answer any questions I have about the Youtube Video provided. You always answer questions based only on the context that you have been provided.
    """

    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content


## OpenRouter RAG

In [48]:
def perform_rag(query):
    # Initialize the embedding model
    hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Generate the embedding for the query
    query_embedding = hf_embeddings.embed_query(query)

    # Query Pinecone index for top 10 matches
    top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace=namespace)

    # Extract the contexts from the retrieved documents
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    # Construct the augmented query with the retrieved contexts
    augmented_query = (
        "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) +
        "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
    )

    # Define the system prompt for the chat model
    system_prompt = (
        "You are an expert personal assistant. Answer any questions I have about the YouTube Video provided. "
        "You always answer questions based only on the context that you have been provided."
    )

    # Send the augmented query to the model through the OpenRouter client
    res = openrouter_client.chat.completions.create(
        model="mistralai/mistral-nemo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    # Return the content of the model's response
    return res.choices[0].message.content
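Both cells above define a function named `perform_rag`, so whichever cell runs last replaces the other. One possible way to keep a single implementation is to pass the embedding, search, and chat steps in as functions. The sketch below is an assumption about how that could look, with stand-in functions in place of the real OpenAI, Hugging Face, Pinecone, and OpenRouter calls.

```python
def make_rag(embed, search, chat):
    """Return a perform_rag function built from three interchangeable steps."""
    def perform_rag(query):
        contexts = search(embed(query))
        augmented_query = (
            "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts) +
            "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
        )
        return chat(augmented_query)
    return perform_rag

# Stand-ins for illustration; replace them with the real calls from this notebook.
def fake_embed(text):
    return [float(len(text))]

def fake_search(vector):
    return ["chunk one", "chunk two"]

def fake_chat(prompt):
    return "ANSWER BASED ON: " + prompt

rag = make_rag(fake_embed, fake_search, fake_chat)
answer = rag("What is pre-training?")
print(answer)
```

With this shape, an OpenAI variant and an OpenRouter variant differ only in the three functions handed to `make_rag`.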

In [49]:
perform_rag("What does Aravind mention about pre-training and why it is important?")

'Aravind Srinivas, the CEO of Perplexity, mentions in the podcast episode that the model they have is called Sonar Large 32K. He explains, "Advanced model trained by Perplexity." Perplexity\'s model undergoes post-training on large language models like Llama 3, which have been post-trained to get better at several skills: summarization, referencing and citing work, keeping long context and keeping context relevant. Post-training an already-existing pre-trained LLMs improves these skills drastically. Thus, pre-training plays an essential initial role for fine-tuning and skill-building. He also said they are improving their RLHF ( Reinforcement Learning with Human Feedbacks), further enhancing it. They plan to iterate and release models that cater to new improvements. Post-training pre-trained, large LLMs is valuable for a start-up with limited access to large budgets for brand-new architectures, thus building upon models like the LLMs trained by entities such as those at the leading edg

In [50]:
perform_rag("What advantages does Perplexity have over other AI companies?")

"According to Aravind Srinivas, the CEO of Perplexity, his company has several advantages over other AI companies:\n\n1. **Addressing underserved markets**: Perplexity focuses on search over things that people couldn't search before, such as relational databases, making it unique in its approach to AI applications.\n2. **Practical and user-centric approach**: By starting with a practical, user-centric product like searching over relational databases, Perplexity gained users' trust and attention.\n3. **Identifying a wedge opportunity**: Rather than trying to tackle AI applications directly, Perplexity identified a specific opportunity (searching over relational databases) and started there, gaining a foothold in the AI space.\n4. **Recruiting top talent**: Backing from prominent individuals and their willingness to listen to a recruiting pitch helped Perplexity bring on board high-quality team members.\n5. **Initial viral success**: Perplexity's initial product, which allowed users to s

# RAG over a PDF

In [None]:
loader = PyPDFLoader("/content/Harry Potter and the Sorcerers Stone.pdf") # Insert the path to a PDF here
data = loader.load()

print(data)

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
    )

texts = text_splitter.split_documents(data)

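# Insert all the chunks from the PDF into Pinecone
# PDF chunks may not carry a 'title' in their metadata, so fall back to 'Unknown' instead of raising a KeyError
# hf_embeddings matches the 384-dimension index used above; use the OpenAI embeddings only with a 1536-dimension index
vectorstore_from_texts = PineconeVectorStore.from_texts([f"Source: {t.metadata['source']}, Title: {t.metadata.get('title', 'Unknown')} \n\nContent: {t.page_content}" for t in texts], hf_embeddings, index_name=index_name, namespace=namespace)

# After this, all the code is the same as in the Perform RAG section of this notebook
# Since the data from the PDF is now stored in Pinecone, you can perform RAG over it the same way as the YouTube video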