
# RAG with LLaMa 13B

In this notebook we'll explore how we can use the open source **Llama-13b-chat** model in both Hugging Face transformers and LangChain.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).

---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [3]:
!pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
aiohttp                          3.9.1
aiosignal                        1.3.1
alabaster                        0.7.13
albumentations                   1.3.1
altair                           4.2.2
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array-record                     0.5.0
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.0
attrs                            23.1.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.14.0
backcall                         0.2.0
beautifulsoup4                   4.11.2
bidict                           0.22.1
b

In [2]:
!pip show pydantic

Name: pydantic
Version: 1.10.13
Summary: Data validation and settings management using python type hints
Home-page: https://github.com/pydantic/pydantic
Author: Samuel Colvin
Author-email: s@muelcolvin.com
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: typing-extensions
Required-by: confection, inflect, lida, llmx, spacy, thinc


In [None]:
import sys
print(sys.version)


3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]


In [None]:
!pip show python3

[0m

In [None]:
# Nos aseguramos de tener una GPU en la instancia gratuita de colab (T4)
!nvidia-smi

Tue Dec 19 08:24:19 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
from google.colab import userdata
userdata.get('secretName')

SecretNotFoundError: ignored

In [None]:
import langchain.vectorstores

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.1/179.1 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m492.2/492.2 kB[0m [31m33.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m30.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m109.1/109.1 MB[0m [31m10.4 M

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [None]:
#! Creamos el pipeline de embedings a guardar en nuestra base de datos vectorial
#! (pinecone)

from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

We can use the embedding model to create document embeddings like so:

In [None]:
#! Vemos el vector de embedings de N dimensionalidad (hiper-parámetro de
#! reproyección)

docs = [
    "this is one document",
    "and another document"
]

# IMPORTANTE LA DIMENSIÓN DE LA BASE DE DATOS VECTORIAL DEBE SER LA MISMA
# QUE LOS EMBEDINGS
# se puede especificar la dimensión (a mayor dimensión más costo y precisión)
#TODO: optimizar este hp
embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin we'll initialize our index, for this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [None]:
import os
import pinecone
from google.colab import userdata

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=userdata.get('pinecone-api-key') or '*******************************',
    environment=userdata.get('pinecone-enviroment') or '**********'
)

In [None]:
pinecone.list_indexes()

['santander-public-web']

Now we initialize the index.

In [None]:
import time

index_name = 'santander-public-web'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

Nos conectamos al index, el cual está vacío

In [None]:
index = pinecone.Index(index_name)
index.delete(delete_all=True)

{}

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

With our index and embedding process ready we can move onto the indexing process itself.

En este caso voy simplemente a meter htmls de la web,
lo ideal sería en este proceso también buscar pdfs y transformarlo a texto plano

# htmls saved in Pinecone

In [None]:
%cd /

In [None]:
# Partimos de una página web raíz (HTML)
# cuidado con sobrecargar el servidor (delito)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def is_valid_url(url, domain):
    """ Verifica si la URL es válida y pertenece al mismo dominio """
    parsed_url = urlparse(url)
    return bool(parsed_url.scheme) and bool(parsed_url.netloc) and parsed_url.netloc == domain

def scrape_recursive(url, domain, max_depth, current_depth=0, scraped_pages=None):
    if scraped_pages is None:
        scraped_pages = []

    # Limitar la profundidad de la recursión
    if current_depth > max_depth:
        return

    try:
        response = requests.get(url)
        if response.status_code != 200:
            return

        # Agregar HTML raw de la página actual a la lista
        scraped_pages.append(response.text)

        soup = BeautifulSoup(response.content, 'html.parser')

        # Encontrar todos los enlaces en la página
        for link in soup.find_all('a', href=True):
            full_url = urljoin(url, link['href'])
            if is_valid_url(full_url, domain) and full_url not in scraped_pages:
                scrape_recursive(full_url, domain, max_depth, current_depth + 1, scraped_pages)
    except requests.RequestException:
        pass

    return scraped_pages

# URL inicial
start_url = 'https://www.bancosantander.es/particulares/hipotecas'
domain = urlparse(start_url).netloc
max_depth = 1  # Profundidad máxima de la búsqueda (árbol)

html_pages = scrape_recursive(start_url, domain, max_depth)

# Imprimir resultados
for html in html_pages:
    print(html[:500], "\n---\n")  # Imprimir los primeros 500 caracteres de cada página para demostración


<!DOCTYPE html>
<html lang="es" itemscope itemtype="http://schema.org/WebPage">
<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
    <script type="text/javascript" src="/particulares/ruxitagentjs_ICA27NVdfgjqrux_10263230921131557.js" data-dtconfig="app=815f357dd7c341d1|cuc=awlzi8m1|mel=100000|featureHash=ICA27NVdfgjqrux|dpvc=1|ssv=4|lastMo 
---

<!DOCTYPE html>
<html lang="es" itemscope itemtype="http://schema.org/WebPage">
<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
    <script type="text/javascript" src="/mypoc/ruxitagentjs_ICA27NVdfgjqrux_10263230921131557.js" data-dtconfig="app=815f357dd7c341d1|cuc=awlzi8m1|mel=100000|featureHash=ICA27NVdfgjqrux|dpvc=1|ssv=4|lastMo

Deberíamos guardar en fichero todos los htmls generados, para cargarlos directamente y no ejecutar este proceso, que es muy costoso y estático, nos dividimos la ejecución y paralalelizamos recursos en colab, y podemos mandarnos estos datos al banco por correo

In [None]:
#TODO: guardar la lista de htmls
html_pages.__len__() # 200 webs con profundidad 1 XD

195

In [None]:
#TODO:

# Buscar pdfs de acceso público sobre hipotecas del banco santander
## Seria top que busques mediante regex ficheros source que acaben en .pdf dentro de los
## htmls, y de manera directa con request te los descarges y uses pdfreader para pasarlo a texto
## string
# usar pdfreader para extraer el texto bruto
# generar el hash criptográfico en base al texto bruto
# indexar el texto embedido en la base de datos de pinecone
# indexar en un diccionario de python (simulando la base de datos blob)
# Probar a hacer queries, sobre hipotecas para ver si el html que devuelve
# es correcto (usar regex para buscar la URI)
## PARTIR EN CHUNCKS LOS HTMLS PARA QUE QUEPAN EN PINECONE
## HACER EL SPLITEO DE MANERA QUE NUNCA DEJE UNA LABEL A LA MITAD DE HTML
### Y QUE NO SE PASE DEL TOPE DE CARACTERES 32K

IMPORTANTE, EN EL METADATADO SOLAMENTE GUARDAMOS EL ID (HASH AL HTML), DADO QUE PINECONE TIENE UN MÁXIMO DE CARÁCTERES A INSERTAR Y GUARDAR, SOLAMENTE NOS INTERESA GUARDAR EL EMBEDING Y DESPUES ACCEDER A UNA BASE DE DATOS (BLOLB EN AZURE) INDEXADA. aquí usaré un diccionario tonto

In [None]:
from hashlib import md5

# Asumiendo que html_pages es la lista de páginas HTML scrapeadas
# y que index es tu instancia de Pinecone configurada

batch_size = 32

# Función para generar un hash MD5 como ID único
def generate_id(html_content):
    return md5(html_content.encode('utf-8')).hexdigest()

for i in range(0, len(html_pages), batch_size):
    i_end = min(len(html_pages), i + batch_size)
    batch = html_pages[i:i_end]

    # Generar IDs y preparar textos y metadatos
    ids = [generate_id(page) for page in batch]
    texts = batch  # En este caso, los textos son el HTML
    embeds = embed_model.embed_documents(texts)  # Vectorización

    # Los metadatos podrían incluir solo el ID en este caso
    metadata = [{'id': id} for id in ids]

    # Almacenar en Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00147,
 'namespaces': {'': {'vector_count': 147}},
 'total_vector_count': 147}

In [None]:
# Indexamos el diccionario (base de datos con todos los htmls blob), separamos
# embedings y de los propios htmls al ser muy pesado
from hashlib import md5

# Función para generar un hash MD5 como ID único
def generate_id(html_content):
    return md5(html_content.encode('utf-8')).hexdigest()

# Crear un diccionario donde la clave es el ID y el valor es el HTML
html_dict = {generate_id(page): page for page in html_pages}

# Ahora html_dict es un diccionario con el ID como clave y el HTML como valor


In [None]:
from langchain.vectorstores import Pinecone

id_field = 'id'  # Campo que contiene el id del html con la mayor similitud

vectorstore = Pinecone(
    index, embed_model.embed_query, id_field
)

Comprobamos que funcione y busque el html correcto

In [None]:
query = 'hipoteca variable'

documents = vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

documents

[Document(page_content='b4534b49f1ed2193bab4120ebf14304e', metadata={}),
 Document(page_content='a8dae369936442111f9a687ce0085607', metadata={}),
 Document(page_content='406e57c8c72c79efa6d60ce418cb0f46', metadata={})]

In [None]:
html_dict[documents[0].page_content]

'<!DOCTYPE html>\r\n<html lang="es" itemscope itemtype="http://schema.org/WebPage">\r\n<head>\r\n    <meta charset="UTF-8" />\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">\r\n    <script type="text/javascript" src="/renting/ruxitagentjs_ICA27NVdfgjqrux_10263230921131557.js" data-dtconfig="app=815f357dd7c341d1|cuc=awlzi8m1|mel=100000|featureHash=ICA27NVdfgjqrux|dpvc=1|ssv=4|lastModification=1702915806524|vcv=2|tp=500,50,0,1|rdnt=1|uxrgce=1|bp=3|agentUri=/renting/ruxitagentjs_ICA27NVdfgjqrux_10263230921131557.js|reportUrl=/renting/rb_a9b631ff-5285-454a-9b3e-4355527a91fd|rid=RID_1737146567|rpid=1397429758|domain=bancosantander.es"></script><link rel="stylesheet" href="/contenthandler/!ut/p/digest!cjbMzMF7cWkBQXTBVhUKKA/sp/mashup:ra:collection?soffset=0&amp;eoffset=16&amp;themeID=ZJ_K85IGKC0N0DE30QE5ULS0M29H2&amp;locale=es&amp;locale=en&amp;mime-type=text%2Fcss&amp;

In [None]:
'hipoteca' in html_dict[documents[0].page_content]

True

In [None]:
'variable' in html_dict[documents[0].page_content]

True

In [None]:
# Guardar diccionario de ficheros
import json

out_file = open("myfile_depth_1_hipotecas.json", "w")

json.dump(html_dict, out_file, indent=4)

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The Pipeline requires three things that we must initialize first, those are:

* A LLM, in this case it will be `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them, let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. Using Colab this can take 5-10 minutes to download and initialize the model.

In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = '*******************************'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

PackageNotFoundError: ignored

The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 13B models were trained using the Llama 2 13B tokenizer, which we initialize like so:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # mex number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

Confirm this is working:

In [None]:
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

Now to implement this in LangChain

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
llm(prompt="Explain to me the difference between nuclear fission and fusion.")

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 13B Chat** to the LangChain library. Using this we can now begin using LangChain's advanced agent tooling, chains, etc, with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store, we do it like so:

In [None]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

We can confirm this works like so:

In [None]:
query = 'what makes llama 2 special?'

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='51f6c9b0a453f9672c0ce5f067d8084a', metadata={}),
 Document(page_content='936164847ab7c1987f60f47e93bca075', metadata={}),
 Document(page_content='48373ccc673c9bf86466db063d274afe', metadata={})]

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)

Let's begin asking questions! First let's try *without* RAG:

In [None]:
llm('what is so special about llama 2?')

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [None]:
rag_pipeline('what is so special about llama 2?')

{'query': 'what is so special about llama 2?',
 'result': ' Llama 2 is a collection of large language models (LLMs) that have been developed and released by Meta AI. These models range in scale from 7 billion to 70 billion parameters and are optimized for dialogue use cases. The authors claim that their ﬁne-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc, outperform open-source chat models on most benchmarks they tested and provide a detailed description of their approach to ﬁne-tuning and safety. They also mention that their models enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public.'}

This looks *much* better! Let's try some more.

In [None]:
llm('what safety measures were used in the development of llama 2?')

"\n nobody knows.\n\nBut I can tell you that the llama 2 was developed by a team of experienced software developers who have a proven track record of creating high-quality, secure software. They used a variety of techniques and tools to ensure that the llama 2 was as safe and secure as possible, including:\n\n* Code reviews: The development team thoroughly reviewed each other's code to identify any potential security vulnerabilities.\n* Testing: The team conducted extensive testing to ensure that the llama 2 functioned correctly and did not contain any security flaws.\n* Security audits: Independent security experts conducted regular security audits to identify any potential weaknesses in the llama 2.\n* Penetration testing: The team simulated attacks on the llama 2 to identify any potential vulnerabilities and fix them before they could be exploited by attackers.\n\nOverall, the development team took a comprehensive approach to ensuring the security and safety of the llama 2, using a 

Okay, it looks like the LLM with no RAG is less than ideal — let's stop embarassing the poor LLM and stick with RAG + LLM. Let's ask the same question to our RAG pipeline.

In [None]:
rag_pipeline('what safety measures were used in the development of llama 2?')

{'query': 'what safety measures were used in the development of llama 2?',
 'result': ' The safety measures used in the development of Llama 2 include:\n\n* Ethical considerations and limitations: We considered the ethical implications of developing a language model and took steps to mitigate any potential risks.\n* Responsible release strategy: We developed a responsible release strategy that includes releasing the model under a license and providing an acceptable use policy for users.\n* Safety tuning: We performed safety tuning to ensure that the model does not produce inaccurate or objectionable responses to user prompts.\n* Design input: We received design input from early reviewers of the paper to improve the quality of the figures in the paper.\n* Red teaming: We delayed the release of the 34B model due to a lack of time to sufficiently red team the model.\n* Publicly available resources: We used publicly available online sources for pretraining the model.\n* Safety testing and 

A reasonable answer from the RAG pipeline, but it doesn't contain much information — maybe we can ask more about this, like what is this _"red team"_ procedure that delayed the launch of the 34B model?

In [None]:
rag_pipeline('what red teaming procedures were followed for llama 2?')

{'query': 'what red teaming procedures were followed for llama 2?',
 'result': ' According to the paper, the authors followed a responsible release strategy and delayed the release of the 34B model due to a lack of time to sufficiently red team. They also mention that they performed multiple rounds of red teaming over several months to measure the robustness of each new model as it was released internally. Additionally, they devised a metric called "robustness" to quantify the model\'s ability to resist violating responses triggered by red teaming exercises executed by a set of experts.'}

Very interesting!

In [None]:
rag_pipeline('how does the performance of llama 2 compare to other local LLMs?')

{'query': 'how does the performance of llama 2 compare to other local LLMs?',
 'result': " The paper provides a comparison of the performance of Llama 2 with other local LLMs in terms of token sampling latency and human evaluation scores. According to the paper, Llama 2 achieves lower token sampling latency than other local LLMs on 16 TPU v4s, while providing similar or better human evaluation scores. Specifically, Llama 2 achieves a mean token sampling latency of 14.1ms on 16 TPU v4s, which is faster than the next best local LLM by 19%. Additionally, Llama 2 performs similarly or better than other local LLMs on human evaluation tasks, such as ROUGE-2 and human evaluation (100 shot).\n\nUnhelpful Answer: I don't know the answer to your question because I don't have access to the specific information you are looking for."}