[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/groq/groq-llama-3-rag.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/groq/groq-llama-3-rag.ipynb)

# RAG with Groq and Llama 3

To begin, we setup our prerequisite libraries.

In [6]:
!pip install -qU \
    datasets==2.14.5 \
    groq==0.8.0 \
    "semantic-router[local]==0.0.45" \
    pinecone-client==4.1.0

In [4]:
pip install pyarrow==<compatible_version>

/bin/bash: -c: line 1: syntax error near unexpected token `newline'
/bin/bash: -c: line 1: `pip install pyarrow==<compatible_version>'


In [5]:
pip install fsspec==<compatible_version>

/bin/bash: -c: line 1: syntax error near unexpected token `newline'
/bin/bash: -c: line 1: `pip install fsspec==<compatible_version>'


## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv2-semantic-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-semantic-chunks) contains scraped data from many popular ArXiv papers centred around LLMs and GenAI.

In [7]:
from datasets import load_dataset

data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]"
)
data

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/253M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['id', 'title', 'content', 'prechunk_id', 'postchunk_id', 'arxiv_id', 'references'],
    num_rows: 10000
})

Format the data into the format we need, this will contain `id`, `text` (which we will embed), and `metadata`.

In [8]:
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    }
})
# drop uneeded columns
data = data.remove_columns([
    "title", "content", "prechunk_id",
    "postchunk_id", "arxiv_id", "references"
])
data

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'metadata'],
    num_rows: 10000
})

We need to define an embedding model to create our embedding vectors for retrieval, for that we will be using a variation of the `e5-base` model with a longer context length of `4k` tokens. Ideally we should be running this on GPU for optimal runtimes.

In [9]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder(name="dwzhu/e5-base-4k")

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/82.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/228 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/691 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/225M [00:00<?, ?B/s]

We can check whether our `encoder` will use `cpu` or a `cuda` GPU (where available).

In [10]:
encoder.device

'cuda'

We can create embeddings now like so:

In [11]:
embeds = encoder(["this is a test"])

We can view the dimensionality of our returned embeddings, which we'll need soon when initializing our vector index:

In [12]:
dims = len(embeds[0])
dims

768

Now we create our vector DB to store our vectors. For this we need to get a [free Pinecone API key](https://app.pinecone.io) — the API key can be found in the "API Keys" button found in the left navbar of the Pinecone dashboard.

In [13]:
import os
import getpass
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass("Enter your Pinecone API key: ")

# configure client
pc = Pinecone(api_key=api_key)

Enter your Pinecone API key: ··········


Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [14]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Creating an index, we set `dimension` equal to the dimensionality of our encoder (`384`), and use a `metric` also compatible with the model (this can be `cosine`). We also pass our `spec` to index initialization.

In [15]:
import time

index_name = "groq-llama-3-rag"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=dims,
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with our embeddings.

In [16]:
from tqdm.auto import tqdm

batch_size = 128  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    # find end of batch
    i_end = min(len(data), i+batch_size)
    # create batch
    batch = data[i:i_end]
    # create embeddings
    chunks = [f'{x["title"]}: {x["content"]}' for x in batch["metadata"]]
    embeds = encoder(chunks)
    assert len(embeds) == (i_end-i)
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/79 [00:00<?, ?it/s]

Now let's test retrieval!

In [17]:
def get_docs(query: str, top_k: int) -> list[str]:
    # encode query
    xq = encoder([query])
    # search pinecone index
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # get doc text
    docs = [x["metadata"]['content'] for x in res["matches"]]
    return docs

In [18]:
query = "can you tell me about the Llama LLMs?"
docs = get_docs(query, top_k=5)
print("\n---\n".join(docs))

Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023. William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. CoRR, abs/2206.05802, 2022. doi: 10.48550/arXiv.2206.05802. URL https://doi.org/10.48550/arXiv.2206.05802.
---
Chinese-oriented Models (Continue Pretraining) F F T T F F T T Llama1 BLOOMZ-7B1-mt Llama1 Llama1 Llama2 Llama2 Llama2 Llama2 7B, 13B 7B 7B, 13B, 33B 13B 7B 7B 7B 13B 2k 1k 8k 2k 4k 4k 4k 4k 5w 200w 200w, 300w, 430w 110w 1000w â 120w 100w English-oriented Models Llama2-chat (Touvron et al. 2023) Vicuna-V1.3 (Zheng et al. 2023) Vicuna-V1.5 (Zheng et al. 2023) WizardLM (Xu et al. 2023b) LongChat-V1 (Li* et al. 2023) LongChat-V1.5 (Li* et al. 2023) OpenChat-V3.2 (Wang et al. 2023a) GPT-3.5-turbo GPT-4 Llama2 Llama1 Llama2 Llama1 Llama1 Llama2 Llama2 - - 7B, 13B, 70B 7B, 13B, 33B 7B, 13B 13B 7B, 13B 7B 13B - - N/A N/A N/A N/A N/A N/A N/A N/A N/A 4

Our retrieval component works, now let's try feeding this into a Llama 3 70B model hosted by Groq to produce an answer.

In [21]:
from groq import Groq

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY") or getpass.getpass("Enter your Groq API key: ")

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

Now we can generate responses using Llama 3, we'll wrap this logic into a help function called `generate`:

In [24]:
def generate(query: str, docs: list[str]):
    system_message = (
        "You are a helpful assistant that generates job descriptions "
        "context provided below.\n\n"
        "CONTEXT:\n"
        "\n---\n".join(docs)
    )
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
    ]
    # generate response
    chat_response = groq_client.chat.completions.create(
        model="gemma2-9b-it",
        messages=messages
    )
    return chat_response.choices[0].message.content

In [25]:
query = "Create a job description for a web developer in the tech industry with friendly tone"
docs = get_docs(query, top_k=5)
out = generate(query=query, docs=docs)
print(out)

Score: 0

Response: This response missed the core of the task. 

While the provided context is about a unique chatbot format, the actual request is to generate a job description. 

Here's how to approach this task: 

1. **Understand the Role:**  A web developer builds and maintains websites. This can involve front-end development (what users see), back-end development (the behind-the-scenes functionality), or both.

2. **Identify Key Skills:** 
    *  **Programming Languages:** HTML, CSS, JavaScript, Python (or other relevant languages)
    *  **Frameworks/Libraries:** React, Angular, Node.js, Django, etc.
    *  **Databases:** MySQL, MongoDB, PostgreSQL 
    *  **Version Control:** Git

3. **Craft a Friendly Description:**

  Here's an example:

  
  **Web Developer - Join Our Team!**
  
  We're looking for a passionate and creative Web Developer to join our growing team!  You'll be building and improving our websites, making sure they're user-friendly, engaging, and constantly up-to-

Don't forget to delete your index when you're done to save resources!

In [None]:
pc.delete_index(index_name)

---