### Part 1: Creating the Vector Database with ChromaDB and Hugging Face Embeddings
**Introduction:**  
In this part, we will create a vector database using ChromaDB to store embeddings generated by a Hugging Face embedding model. This vector database will serve as the foundation for the retrieval component of our RAG system.

In [1]:
# All packages are in requirements.txt

#!pip install -r requirements.txt

In [2]:
import arxiv
from PyPDF2 import PdfReader

In [3]:
CHUNK_SIZE = 350
CHUNK_OVERLAP = 15

#### 1. Download an example PDF from arXiv
For this RAG example we use the paper "Language Models are Few-Shot Learners" (arXiv ID 2005.14165).

In [4]:
client = arxiv.Client()
search = arxiv.Search(id_list=['2005.14165'])

# Reuse the client created above instead of constructing a second one
paper = next(client.results(search))
path = paper.download_pdf()
print(paper.title)

print(paper.entry_id)


Language Models are Few-Shot Learners
http://arxiv.org/abs/2005.14165v4


#### 2. Convert the PDF to LlamaIndex Documents
For this example we will use LlamaIndex's Document format.  
This lets us store each page's text together with our metadata, which is used for citing sources.

In [5]:
from llama_index import Document

In [6]:
reader = PdfReader(path)
doc = []
for idx, page in enumerate(reader.pages):
    doc.append(Document(
        text=page.extract_text(),
        metadata={'source': paper.title, 'page': str(idx + 1), 'link': paper.entry_id},
        excluded_llm_metadata_keys=['link'],
        excluded_embed_metadata_keys=['source', 'page', 'link'],
    ))

print(f'Number of pages {len(doc)}')

Number of pages 75


#### 3. Convert Documents into LlamaIndex Nodes
We split our documents into 'chunks' to be embedded.  
Each chunk is what LlamaIndex calls a **Node**. 

In [7]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    include_metadata=True,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

nodes = parser.get_nodes_from_documents(doc)

print(f'Parsed the {len(doc)} pages into {len(nodes)} nodes')

Parsed the 75 pages into 249 nodes
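
To make the roles of CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a minimal sketch of sliding-window chunking. It is a simplified illustration, not LlamaIndex's actual algorithm (the node parser also takes sentence boundaries and a real tokenizer into account). The function name `chunk_tokens` and the toy token list are assumptions made for this example.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be non-negative and smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Hypothetical toy input: 1000 placeholder "tokens"
tokens = [f'tok{i}' for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=350, chunk_overlap=15)

print(f'{len(chunks)} chunks with lengths {[len(c) for c in chunks]}')
print(f'Overlap preserved: {chunks[0][-15:] == chunks[1][:15]}')
```

The overlap means the last 15 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is less likely to be cut off from its context.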


In [8]:
from llama_index.schema import MetadataMode

In [9]:
# This prints what the LLM sees
print(nodes[50].get_content(metadata_mode=MetadataMode.LLM))

source: Language Models are Few-Shot Learners
page: 16

Setting Winograd Winogrande (XL)
Fine-tuned SOTA 90.1a84.6b
GPT-3 Zero-Shot 88.3* 70.2
GPT-3 One-Shot 89.7* 73.2
GPT-3 Few-Shot 88.6* 77.7
Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section
4 for details on potential contamination of the Winograd test set.a[SBBC19]b[LYN+20]
Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales.
Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B
is competitive with a ﬁne-tuned RoBERTA-large.
each translation task improves performance by over 7 BLEU and nears competitive performance with prior work.
GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior
unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. 

In [10]:
# This prints what the embedding model sees; source and page are excluded as intended
print(nodes[50].get_content(metadata_mode=MetadataMode.EMBED))

Setting Winograd Winogrande (XL)
Fine-tuned SOTA 90.1a84.6b
GPT-3 Zero-Shot 88.3* 70.2
GPT-3 One-Shot 89.7* 73.2
GPT-3 Few-Shot 88.6* 77.7
Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section
4 for details on potential contamination of the Winograd test set.a[SBBC19]b[LYN+20]
Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales.
Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B
is competitive with a ﬁne-tuned RoBERTA-large.
each translation task improves performance by over 7 BLEU and nears competitive performance with prior work.
GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior
unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the
three input languages studied, GPT-3 signiﬁcantl
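
The difference between the two outputs above comes from the excluded-key lists set on each Document. The sketch below mimics that behaviour in plain Python so the effect is easy to see. It is a simplified model and not LlamaIndex's implementation; `render_content` is a hypothetical helper name.

```python
def render_content(text, metadata, excluded_keys):
    """Prepend 'key: value' lines for every metadata key that is not excluded."""
    visible = [f'{key}: {value}' for key, value in metadata.items()
               if key not in excluded_keys]
    if not visible:
        return text  # nothing to prepend, return the raw text
    return '\n'.join(visible) + '\n\n' + text

metadata = {
    'source': 'Language Models are Few-Shot Learners',
    'page': '16',
    'link': 'http://arxiv.org/abs/2005.14165v4',
}
text = 'Setting Winograd Winogrande (XL)'

# Same exclusion lists as in the Document construction above
llm_view = render_content(text, metadata, excluded_keys=['link'])
embed_view = render_content(text, metadata, excluded_keys=['source', 'page', 'link'])

print(llm_view)
print('---')
print(embed_view)
```

Keeping source and page visible to the LLM lets it cite them in answers, while hiding all metadata from the embedding model keeps the vectors focused on the page content.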

In [11]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

#### Setting Device:  
If you are using a Mac or an Nvidia GPU and PyTorch is installed correctly, the code below selects the matching device ('mps' or 'cuda').  
Otherwise it defaults to the CPU.

For details on how to install PyTorch for CUDA, see the [Get Started page](https://pytorch.org/get-started/locally/).  
If you are not using CUDA with an Nvidia GPU, you can uncomment the lines below:

In [12]:
# Install PyTorch for Mac or Windows PC without Nvidia GPU 
# !pip install torch torchvision torchaudio 
# !pip install transformers

In [13]:
import torch 
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available(): 
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: mps


**Load Embedding Model:**  
A good place to start when choosing an embedding model is the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

At the time of writing, the [BAAI/bge-small-en-v1.5 model](https://huggingface.co/BAAI/bge-small-en) is the best small model according to the leaderboard.

In [14]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model_name = 'BAAI/bge-small-en-v1.5'
# Load the embedding model from Hugging Face
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device=device,
    normalize=True,  # boolean, not the string 'True'
    )
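
As a rough illustration of why the `normalize` option matters: once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity, which makes similarity scores easy to compare. The sketch below uses two made-up 2-dimensional vectors and hypothetical helper names; it does not call the embedding model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up toy vectors standing in for embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(f'dot of normalized vectors: {dot(a_unit, b_unit):.4f}')
print(f'cosine of raw vectors:     {cosine(a, b):.4f}')
```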

#### 4. Create and store the Vector DB
* This will use the bge-small-en-v1.5 embedding model to embed our chunked text into vectors
* Then save those vectors into the "arxiv_PDF_DB" collection of a persistent ChromaDB stored at "./RAG_VectorDB"

**Note**: If a collection with that name already exists at that path, the new vectors are appended to it; otherwise the collection is created.

In [15]:
import chromadb

db = chromadb.PersistentClient(path='./RAG_VectorDB')

collection_metadata = {
    'embedding_used':embed_model_name,
    'Included Papers':paper.title}
chroma_collection = db.get_or_create_collection('arxiv_PDF_DB', metadata=collection_metadata)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

service_context = ServiceContext.from_defaults(embed_model=embed_model,
                                               llm=None,  # We will set the LLM when we open the DB
                                               chunk_size=CHUNK_SIZE,
                                               chunk_overlap=CHUNK_OVERLAP
                                               )

vector_store_index = VectorStoreIndex(nodes=nodes,
                                    storage_context=storage_context, 
                                    service_context=service_context,
                                    show_progress=True)

print('Completed')

LLM is explicitly disabled. Using MockLLM.


Generating embeddings:   0%|          | 0/249 [00:00<?, ?it/s]

Completed
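
To give a feel for what the stored vectors are used for in the retrieval step, here is a toy in-memory sketch of a top-k similarity search. It is only a conceptual illustration: it does not use ChromaDB, the choice of cosine similarity is an assumption of this example, and the node IDs and 3-dimensional vectors are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (node_id, score) pairs most similar to query_vec."""
    scored = [(node_id, cosine_similarity(query_vec, vec))
              for node_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy "embeddings"; real embedding vectors have far more dimensions
store = {
    'node-a': [1.0, 0.0, 0.0],
    'node-b': [0.0, 1.0, 0.0],
    'node-c': [0.7, 0.7, 0.0],
}

results = top_k([0.9, 0.1, 0.0], store, k=2)
for node_id, score in results:
    print(f'{node_id}: {score:.3f}')
```

A real vector database does the same job at scale, typically with an approximate nearest-neighbour index rather than the exhaustive scan used here.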
