[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb)

# RAG with Llama 2 7B Chat

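In this notebook we'll explore how to use an open source **Llama 2 chat** model for retrieval augmented generation (RAG) with both Hugging Face transformers and LangChain. The code below loads the 7B chat checkpoint `daryl149/llama-2-7b-chat-hf`.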

---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0


In [None]:
!pip install torchaudio==2.1.0

In [None]:
!pip install torch==2.0.1

In [4]:
"""
from google.colab import drive
drive.mount('/content/drive')
"""

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [54]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

We can use the embedding model to create document embeddings like so:

In [55]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index. For this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [56]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
# (set PINECONE_API_KEY in your environment rather than hardcoding a real key)
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'YOUR_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

Now we initialize the index.

In [57]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

Now we connect to the index:

In [58]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

With our index and embedding process ready we can move on to the indexing process itself. For that, we'll need a dataset. We will use a set of arXiv papers related to (and including) the Llama 2 research paper.

In [59]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

We will embed and index the documents like so:

In [60]:
data = data.to_pandas()

batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
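The batching arithmetic in the loop above can be checked in isolation. The following is a minimal sketch, assuming only the row count (4838) and batch size (32) shown in this notebook; `batch_ranges` is a hypothetical helper name, not part of any library used here.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) row offsets that cover n_rows in consecutive batches."""
    for start in range(0, n_rows, batch_size):
        # the final batch is clipped so it never runs past the last row
        yield start, min(n_rows, start + batch_size)

ranges = list(batch_ranges(4838, 32))

print(len(ranges))   # number of upsert calls the loop would make
print(ranges[0])     # offsets of the first batch
print(ranges[-1])    # offsets of the last, smaller batch
```

Because the last batch is clipped with `min`, every row is embedded and upserted exactly once, even when the row count is not a multiple of the batch size.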

In [61]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `daryl149/llama-2-7b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. Using Colab this can take 5-10 minutes to download and initialize the model.
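To see why the 4-bit quantization configured in the next cell matters on a 16 GB T4, here is a rough back-of-the-envelope estimate of the memory taken by the model weights alone. It is a sketch under stated assumptions: the parameter counts are round figures (7e9 and 13e9), and it ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# assumed round parameter counts for "7B" and "13B" models
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16_gb = weight_memory_gb(n_params, 16)
    four_bit_gb = weight_memory_gb(n_params, 4)
    print(f"{name}: ~{fp16_gb:.1f} GB at 16-bit, ~{four_bit_gb:.1f} GB at 4-bit")
```

Under these assumptions, 16-bit weights for either model size leave little or no headroom on a T4, while 4-bit weights fit comfortably.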

In [62]:
from torch import cuda, bfloat16
import transformers

model_id = 'daryl149/llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
# (placeholder: replace with your own Hugging Face access token)
hf_auth = '<YOUR_HF_ACCESS_TOKEN>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded on cuda:0


The pipeline requires a tokenizer, which handles the translation of human-readable plaintext to LLM-readable token IDs. We load the tokenizer that matches the model by reusing the same `model_id`, like so:

In [63]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)



Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [64]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
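To give some intuition for `repetition_penalty`, the sketch below applies the commonly described rule to a toy list of token scores: scores of tokens that were already generated are pushed down, so they are less likely to be picked again. This is an illustration in plain Python, not the transformers implementation, and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # dividing a positive score, or multiplying a negative one, lowers it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.2, -1.0, 0.5, 3.3]   # toy scores for a 4-token vocabulary
generated = [0, 1, 0]            # token ids produced so far

penalized = apply_repetition_penalty(logits, generated, penalty=1.1)
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged, and values above 1.0 discourage repeats more strongly. Tokens that have not been generated yet keep their original scores.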

Confirm this is working:

In [65]:
res = generate_text("What is the vector database?")
print(res[0]["generated_text"])

What is the vector database?
 everybody uses it.

Answer: The vector database is a type of database that stores data in a vector format, meaning that each data point is represented as a vector in a high-dimensional space. This allows for efficient and flexible querying and manipulation of the data using vector-based operations, such as similarity measures and clustering algorithms.

The vector database is particularly useful for applications where data is inherently multidimensional or has complex relationships between variables. For example, in text analysis, documents can be represented as vectors of word frequencies, allowing for queries based on semantic similarity between documents. In image recognition, images can be represented as vectors of pixel values, enabling fast matching of images based on their visual content.
Some popular vector databases include:

* Word2Vec: A Google-developed library for converting words into vectors in a high-dimensional space, allowing for efficien

Now to implement this in LangChain:

In [66]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [67]:
llm(prompt="What is the vector database?")

'\n everybody uses it.\n\nAnswer: The vector database is a type of database that stores data in a vector format, meaning that each data point is represented as a vector in a high-dimensional space. This allows for efficient and flexible querying and manipulation of the data using vector-based operations, such as similarity measures and clustering algorithms.\n\nThe vector database is particularly useful for applications where data is inherently multidimensional or has complex relationships between variables. For example, in text analysis, documents can be represented as vectors of word frequencies, allowing for queries based on semantic similarity between documents. In image recognition, images can be represented as vectors of pixel values, enabling fast matching of images based on their visual content.\nSome popular vector databases include:\n\n* Word2Vec: A Google-developed library for converting words into vectors in a high-dimensional space, allowing for efficient text analysis tas

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 7B Chat** to the LangChain library. Using this we can now begin using LangChain's advanced agent tooling, chains, etc., with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index, wrapped in a LangChain vector store object.

Let's begin by initializing the LangChain vector store. For reference, the `embed_query` method that we pass to it is defined as follows:

In [None]:
'''
https://api.python.langchain.com/en/latest/_modules/langchain/embeddings/huggingface.html#HuggingFaceEmbeddings.embed_query
def embed_query(self, text: str) -> List[float]:
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        return self.embed_documents([text])[0]
'''

If you see an error saying that `index` or `embed_model.embed_query` is not defined, re-run the cells above that define them. This error tends to appear when the session has run out of RAM and the memory could not be released in time. Switching to the High-RAM runtime provided by Colab can also help.

In [68]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)
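Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors score highest under the index metric (cosine here). The sketch below shows that ranking step as a brute-force search over toy data. The 2-dimensional vectors and chunk texts are hypothetical stand-ins for real 384-dimensional embeddings, and Pinecone itself uses an optimized index instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for vec, text in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# hypothetical toy "embeddings" paired with their chunk text
records = [
    ([1.0, 0.0], "chunk about databases"),
    ([0.0, 1.0], "chunk about football"),
    ([0.9, 0.1], "chunk about vector search"),
    ([0.5, 0.5], "chunk about both"),
]

results = top_k([1.0, 0.0], records, k=3)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The `text_field` argument plays the role of the text half of each pair: it tells the vector store which metadata field holds the chunk text to return.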

We can confirm this works like so:

In [69]:
query = 'What is the vector database?'

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='or answers to questions about a visual scene (Santoro et al., 2017).\nThe nodes, edges, and global outputs can also be mixed-and-matched depending on the task. For\nexample, Hamrick et al. (2018) used both the output edge and global attributes to compute a policy\nover actions.\n4.1.2 Graph structure\nWhen de\x0cning how the input data will be represented as a graph, there are generally two scenarios:\n\x0crst, the input explicitly speci\x0ces the relational structure; and second, the relational structure must\nbe inferred or assumed. These are not hard distinctions, but extremes along a continuum.\nExamples of data with more explicitly speci\x0ced entities and relations include knowledge graphs,\nsocial networks, parse trees, optimization problems, chemical graphs, road networks, and physical\nsystems with known interactions. Figures 2a-d illustrate how such data can be expressed as graphs.\nExamples of data where the relational structure is not made explicit, 

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [70]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
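The `'stuff'` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea only: the template wording, the separator, and the example chunks are assumptions made for illustration, and LangChain's actual default prompt may differ. `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, chunks, separator="\n\n"):
    """Place all retrieved chunks into one prompt, followed by the question."""
    context = separator.join(chunks)
    # illustrative template; not guaranteed to match LangChain's default wording
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# hypothetical retrieved chunks
chunks = [
    "Llama 2 is a collection of pretrained and fine-tuned LLMs.",
    "The models range in scale from 7 billion to 70 billion parameters.",
]

prompt = build_stuff_prompt("How large are the Llama 2 models?", chunks)
print(prompt)
```

Since every chunk ends up in one prompt, the number and length of retrieved chunks must fit within the model's context window alongside the question and the generated answer.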

Let's begin asking questions! First let's try *without* RAG:

In [71]:
llm('What is the vector database?')

'\n everybody uses it.\n\nAnswer: The vector database is a type of database that stores data in a vector format, meaning that each data point is represented as a vector in a high-dimensional space. This allows for efficient and flexible querying and manipulation of the data using vector-based operations, such as similarity measures and clustering algorithms.\n\nThe vector database is particularly useful for applications where data is inherently multidimensional or has complex relationships between variables. For example, in text analysis, documents can be represented as vectors of word frequencies, allowing for queries based on semantic similarity between documents. In image recognition, images can be represented as vectors of pixel values, enabling fast matching of images based on their visual content.\nSome popular vector databases include:\n\n* Word2Vec: A Google-developed library for converting words into vectors in a high-dimensional space, allowing for efficient text analysis tas

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [76]:
rag_pipeline('Steiner argued that , in the right circumstances , the spiritual world can be explored through direct experience by practicing ethical and cognitive forms of rigorous self-discipline .')

{'query': 'Steiner argued that , in the right circumstances , the spiritual world can be explored through direct experience by practicing ethical and cognitive forms of rigorous self-discipline .',
 'result': " Steiner's ideas about exploring the spiritual world through direct experience involve practices such as meditation, contemplation, and introspection. These practices help individuals cultivate a deeper awareness of their inner lives and gain insight into the nature of reality. By developing ethical and cognitive disciplines, individuals can prepare themselves for a more profound understanding of the spiritual realm."}

However, the pipeline does not handle a paraphrasing task well, as the next example shows:

In [43]:
rag_pipeline('Luciano Williames Dias ( born July 25, 1970 ) is a Brazilian football coach and former player . *')

{'query': 'Luciano Williames Dias ( born July 25, 1970 ) is a Brazilian football coach and former player . *',
 'result': ' Luciano Williames Dias was born on July 25, 1970, in Brazil.'}