# RAG with Llama 2 13B

In this notebook we'll explore how to use the open source **Llama-2-13b-chat** model with both Hugging Face transformers and LangChain to build a retrieval-augmented generation (RAG) pipeline.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

---

🚨 **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

_This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0 \
  PyPDF2 \
  tiktoken

In [2]:
# standard library
import os
import time

# third-party libraries
import pandas as pd
import pinecone
import transformers
from PyPDF2 import PdfReader
from torch import cuda, bfloat16

# LangChain components
from langchain.chains import RetrievalQA
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone



## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [3]:
embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

We can use the embedding model to create document embeddings like so:

In [5]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index. For this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [6]:
# get the API key from app.pinecone.io and the environment from the console;
# read the key from an environment variable rather than hardcoding it
pinecone.init(
    environment="gcp-starter",
    api_key=os.environ.get("PINECONE_API_KEY")
)

Now we initialize the index.

In [7]:
index_name = 'rag-llama-2-paper'

In [9]:
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
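
The `metric='cosine'` setting means the index ranks vectors by the cosine of the angle between them, so only direction matters and not length. Below is a minimal sketch of that score in plain Python. The function name `cosine_similarity` and the toy vectors are ours for illustration and are not part of the Pinecone client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors (illustrative only; the real embeddings here have 384 dimensions)
v = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(v, [2.0, 4.0, 6.0])  # scaled copy of v
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
print(same_direction, orthogonal)
```

A scaled copy of a vector scores (up to rounding) 1, while perpendicular vectors score 0.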

Now we connect to the index:

In [10]:
index = pinecone.Index(index_name)

With our index and embedding process ready, we can move on to the indexing process itself. For that, we'll need a dataset. We will use the text of the Llama 2 research paper (arXiv:2307.09288), loaded from a PDF stored in Google Drive. First, we check that the index is empty:

In [11]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
# path to the PDF file to index
pdfreader = PdfReader('/content/drive/MyDrive/Llama-2/2307.09288.pdf')

In [14]:
# read the text of every page; skip pages where no text could be extracted
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [15]:
raw_text



In [16]:
# split the text on newlines into chunks of at most 800 characters (with a
# 50-character overlap) so that each chunk stays small enough for the LLM prompt
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 50,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
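
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of separator-based chunking in plain Python. It is not LangChain's implementation: the function name `split_on_separator` is hypothetical, and we assume lengths are measured in characters (as `length_function = len` does above). Pieces are packed greedily, and trailing pieces are carried into the next chunk as overlap.

```python
def split_on_separator(text, separator="\n", chunk_size=800, chunk_overlap=50):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying trailing pieces over as overlap.
    A single piece longer than chunk_size is kept as one oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # drop pieces from the front until the carry-over fits the overlap
            # budget and still leaves room for the next piece
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

# toy input: ten 7-character lines, with small limits so the effect is visible
toy_text = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_on_separator(toy_text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

With these toy limits, each chunk holds up to three lines and repeats the last line of the previous chunk, which is the overlap at work.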

In [17]:
data_df = pd.DataFrame(texts, columns=["chunks"])
data_df.head()

| | chunks |
|---|---|
| 0 | Llama 2 : Open Foundation and Fine-Tuned Chat ... |
| 1 | Alan Schelten Ruan Silva Eric Michael Smith Ra... |
| 2 | source models. We provide a detailed descripti... |
| 3 | 3 Fine-tuning 8\n3.1 Supervised Fine-Tuning (S... |
| 4 | 4.4 Safety Evaluation of Llama 2-Chat . . . . ... |


In [64]:
batch_size = 32

for i in range(0, len(data_df), batch_size):
    i_end = min(len(data_df), i + batch_size)
    batch = data_df.iloc[i:i_end]
    # use the DataFrame row index as the vector ID
    ids = [str(idx) for idx in batch.index]
    chunk_texts = batch['chunks'].tolist()
    # embed the chunk texts
    embeds = embed_model.embed_documents(chunk_texts)
    # store the original text as metadata so it is returned at query time
    metadata = [{'text': text} for text in chunk_texts]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

In [65]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00336,
 'namespaces': {'': {'vector_count': 336}},
 'total_vector_count': 336}
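
The upsert loop sends the chunks to Pinecone in slices of 32 rather than one request per chunk. As a rough sketch of that batching pattern on its own, the hypothetical helper `batched` below slices any list the same way, and the arithmetic shows how many requests the 336 vectors reported above would take.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_vectors = 336   # total_vector_count reported by describe_index_stats()
batch_size = 32   # same batch size as the upsert loop
sizes = [len(b) for b in batched(list(range(n_vectors)), batch_size)]
n_batches = math.ceil(n_vectors / batch_size)
print(n_batches, sizes[-1])
```

Ten full batches cover 320 vectors, so an eleventh, smaller batch carries the remaining 16.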

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and load it onto our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [66]:
model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items; these need a Hugging Face access token
# (read from an environment variable here; the name HF_TOKEN is our choice)
hf_auth = os.environ.get('HF_TOKEN')
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0
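
A back-of-the-envelope estimate shows why the 4-bit configuration matters on a T4, which has roughly 16 GB of memory. The sketch below counts the weights only and assumes a nominal 13 billion parameters. It ignores quantization constants, activations, and the KV cache, so real usage is somewhat higher. The helper name `weight_memory_gb` is ours.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # nominal parameter count implied by the "13B" name (assumption)
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights
print(f"16-bit: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would not fit on the GPU, while the 4-bit weights leave room for inference.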


The pipeline requires a tokenizer, which handles the translation of human-readable plaintext to LLM-readable token IDs. The Llama 2 13B model was trained with its own tokenizer, which we load from the same model ID like so:

In [67]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [68]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this the output begins repeating
)

Confirm this is working:

In [69]:
res = generate_text("How can AI make it easier to learn about new topics quickly?")
print(res[0]["generated_text"])

How can AI make it easier to learn about new topics quickly?

AI has the potential to revolutionize education by providing personalized learning experiences, automating administrative tasks, and enhancing teaching methods. Here are some ways AI can help you learn about new topics quickly:

1. Personalized Learning Paths: AI can analyze your learning style, strengths, and weaknesses to create a customized learning path tailored to your needs. This can help you focus on the most important concepts and skip what you already know or find difficult.
2. Adaptive Assessments: AI-powered adaptive assessments can adjust their difficulty and content based on your performance, providing a more accurate picture of your knowledge level. This helps you identify areas where you need improvement and focus your efforts accordingly.
3. Intelligent Tutoring Systems: AI-powered tutoring systems can provide one-on-one support, offering real-time feedback and guidance as you work through problems and exerci
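
Two of the pipeline arguments shape outputs like the one above: a `temperature` near zero makes decoding effectively greedy, and `repetition_penalty` lowers the scores of tokens that have already been generated. The sketch below illustrates a commonly used formulation of the penalty (divide positive logits by the penalty, multiply negative ones by it) on a toy three-token vocabulary. It is a simplified illustration with hypothetical helper names, not the transformers source.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already-generated tokens less likely: divide positive logits by
    `penalty` and multiply negative logits by it."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

def greedy_pick(logits):
    """Index of the largest logit (what sampling approaches as temperature -> 0)."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.9, -1.0]  # toy scores for a 3-token vocabulary
before = greedy_pick(logits)
penalized = apply_repetition_penalty(logits, generated_ids=[0], penalty=1.1)
after = greedy_pick(penalized)
print(before, after)
```

Token 0 wins before the penalty. Once it counts as already generated, its score drops below that of token 1, so greedy decoding picks a different token instead of repeating.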

In [70]:
res = generate_text("Which company introduced the Lamma 2 model?")
print(res[0]["generated_text"])

Which company introduced the Lamma 2 model?

Answer: The Lamma 2 model was introduced by the Turkish automaker, Otokar.


Now let's implement this in LangChain:

In [71]:
llm = HuggingFacePipeline(pipeline=generate_text)

In [72]:
llm(prompt="Which company introduced the Lamma 2 model?")

'\n\nAnswer: The Lamma 2 model was introduced by the Turkish automaker, Otokar.'

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 13B Chat** to the LangChain library. Using this we can now begin using LangChain's advanced agent tooling, chains, etc, with **Llama 2**.
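
Note that the LangChain output no longer starts with the question, even though the pipeline was created with `return_full_text=True`. Our understanding is that the wrapper in the pinned LangChain version drops the prompt by slicing it off the front of the generated text, which is why it needs the full text. A minimal sketch of that idea follows; the helper `strip_prompt` is hypothetical and not a LangChain function.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation, assuming the output starts with the prompt."""
    return generated_text[len(prompt):]

prompt = "Which company introduced the Lamma 2 model?"
# full text as returned by the Hugging Face pipeline: prompt followed by completion
full_text = prompt + "\n\nAnswer: The Lamma 2 model was introduced by the Turkish automaker, Otokar."
completion = strip_prompt(prompt, full_text)
print(repr(completion))
```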

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or a `RetrievalQAWithSourcesChain` object. Both need an `llm` (which we have initialized) and our Pinecone index, wrapped in a LangChain vector store object.

Let's begin by initializing the LangChain vector store like so:

In [73]:
text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

We can confirm this works like so:

In [74]:
query = "Which company introduced the Lamma 2 model?"

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='https://ai.meta.com/llama/responsible-user-guide\nTable 52: Model card for Llama 2 .\n77', metadata={}),
 Document(page_content='more recent, up to July 2023.\nEvaluation Results\nSee evaluations for pretraining (Section 2); fine-tuning (Section 3); and safety (Section 4).\nEthical Considerations and Limitations (Section 5.2)\nLlama 2 is a new technology that carries risks with use. Testing conducted to date has been in\nEnglish, and has notcovered, nor could it coverall scenarios. For these reasons, aswith all LLMs,\nLlama 2’s potential outputs cannot be predicted in advance, and the model may in some instances\nproduceinaccurateorobjectionableresponsestouserprompts. Therefore,beforedeployingany\napplications of Llama 2, developers should perform safety testing and tuning tailored to their\nspecific applications of the model. Please see the Responsible Use Guide available available at', metadata={}),
 Document(page_content='We estimate the total emissions for t

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [75]:
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
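
Conceptually, `chain_type='stuff'` means that all retrieved chunks are "stuffed" into a single prompt together with the question, and the LLM is called once. The sketch below shows that flow with stand-in functions so it runs without a GPU or an index. The prompt wording, the helper names, and the placeholder chunks are all hypothetical; LangChain ships its own template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved chunks into one context block ('stuffing')."""
    context = "\n\n".join(documents)
    # hypothetical template for illustration; LangChain uses its own wording
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def answer_with_rag(question, retrieve, llm):
    """retrieve: question -> list of chunk strings; llm: prompt -> completion."""
    return llm(build_stuff_prompt(question, retrieve(question)))

# stand-ins for the vector store retriever and the LLM (placeholders only)
def fake_retrieve(question):
    return ["chunk A", "chunk B"]

def echo_llm(prompt):
    return f"(received a prompt of {len(prompt)} characters)"

prompt = build_stuff_prompt("Q?", fake_retrieve("Q?"))
print(prompt)
print(answer_with_rag("Q?", fake_retrieve, echo_llm))
```

Because every chunk goes into one prompt, this approach only works while the retrieved text fits in the model's context window.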

Let's begin asking questions! First let's try *without* RAG:

In [76]:
llm("Which company introduced the Lamma 2 model?")

'\n\nAnswer: The Lamma 2 model was introduced by the Turkish automaker, Otokar.'

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [77]:
rag_pipeline("Which company introduced the Lamma 2 model?")

{'query': 'Which company introduced the Lamma 2 model?',
 'result': ' The Lamma 2 model was introduced by Meta.'}

This looks *much* better! Let's try some more.

In [78]:
llm('What is the purpose of the new llama 2 model?')

'\n\nAnswer: The LLAMA 2 model is a deep learning-based model that has been developed to improve the accuracy and efficiency of natural language processing tasks, particularly in low-resource settings. The main purpose of the LLAMA 2 model is to provide a more accurate and efficient alternative to existing NLP models, which can be computationally expensive and require large amounts of training data.\n\nThe LLAMA 2 model is designed to be lightweight and adaptable to a wide range of NLP tasks, including text classification, sentiment analysis, named entity recognition, and machine translation. It uses a combination of pre-trained word embeddings and task-specific layers to learn the relevant features for each task, allowing it to achieve high accuracy with minimal computational resources.\n\nIn addition to its practical applications, the LLAMA 2 model is also an important research tool for studying the properties of deep learning models and their ability to generalize to new tasks and d

Okay, it looks like the LLM with no RAG is less than ideal. Let's stop embarrassing the poor LLM and stick with RAG + LLM. Let's ask the same question to our RAG pipeline.

In [79]:
rag_pipeline('What is the purpose of the new llama 2 model?')

{'query': 'What is the purpose of the new llama 2 model?',
 'result': ' The purpose of the new llama 2 model is for commercial and research use in English, specifically for assistant-like chat and natural language generation tasks. It is intended to be responsible and safe, but users should still take extra steps in tuning and deployment as described in the Responsible Use Guide.'}

A reasonable answer from the RAG pipeline, but it doesn't contain much information. We could dig deeper with follow-up questions, for example by asking what the _"red team"_ procedure that delayed the launch of the 34B model involves. For now, let's compare Llama 2 with other models, first without RAG and then with it.

In [80]:
llm('how does the performance of llama 2 compare to other local LLMs?')

"\n\nWe evaluate the performance of Llama 2 on several benchmark datasets and compare it to other local LLMs. Our results show that Llama 2 achieves competitive performance compared to other state-of-the-art local LLMs, such as DeBERTa and RoBERTa. Specifically, on the GLUE benchmark, Llama 2 achieves an average score of 87.3, which is only 1.5 points behind the best-performing model, DeBERTa. On the SuperGLUE benchmark, Llama 2 achieves an average score of 79.4, which is within 2 points of the best-performing model, RoBERTa.\n\nHowever, we also find that Llama 2 has some limitations compared to other local LLMs. For example, it may not perform as well on tasks that require a high degree of contextual understanding or common sense. Additionally, Llama 2's performance may degrade more quickly over time as the training data becomes less relevant or outdated.\n\nOverall, our results suggest that Llama 2 is a strong contender in the field of local LLMs, but it may not be the best choice fo

In [81]:
rag_pipeline('how does the performance of llama 2 compare to other local LLMs?')

{'query': 'how does the performance of llama 2 compare to other local LLMs?',
 'result': ' according to the evaluation results in section 3, llama 2 outperforms other local LLMs such as mpt and falcon on all categories of benchmarks. Additionally, llama 2 70b model outperforms all open-source models.'}

## Gradio Interface

In [11]:
# requires the `gradio` package (not in the install cell above): pip install gradio
import gradio as gr

# answer a question with the RAG pipeline and return only the answer text
def answer(question):
    return rag_pipeline(question)['result']


# create a Gradio interface; `label` sets the caption shown above the input box
iface = gr.Interface(
    fn=answer,
    inputs=gr.Textbox(label="Ask your query"),
    outputs=gr.Textbox(),
    title="Know Llama-2",
    description="Ask the Llama-2-13b model anything about itself.",
)

# launch the Gradio app
iface.launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.


