### Part 1: Creating the Vector Database with ChromaDB and Hugging Face Embeddings
**Introduction:**  
In this part, we will create a vector database using ChromaDB to store embeddings generated by a Hugging Face embedding model. This vector database will serve as the foundation for the retrieval component of our RAG system.

In [1]:
# !pip install chromadb
# !pip install arxiv
# !pip install PyPDF2
# !pip install llama-index

In [2]:
import arxiv
from PyPDF2 import PdfReader

In [3]:
CHUNK_SIZE = 800
CHUNK_OVERLAP = 15

#### 1. Download an example PDF from arXiv
For this RAG example we are using the paper *Language Models are Few-Shot Learners* (arXiv ID 2005.14165).

In [4]:
client = arxiv.Client()
search = arxiv.Search(id_list=['2005.14165'])

# Reuse the client created above instead of constructing a second one
paper = next(client.results(search))
path = paper.download_pdf()
print(paper.title)


Language Models are Few-Shot Learners


#### 2. Convert the PDF to LlamaIndex Documents
For this example we will be using LlamaIndex's Document format.
This allows us to store each page's text together with our metadata, which is used for citing sources.

In [5]:
from llama_index import Document

In [6]:
reader = PdfReader(path)
doc = []
for idx, page in enumerate(reader.pages):
    doc.append(Document(text=page.extract_text(),
                        metadata={'source': f'{paper.title}', 'page': f'{idx+1}'},
                        excluded_llm_metadata_keys=[],
                        excluded_embed_metadata_keys=['source', 'page']
))

print(f'Number of pages {len(doc)}')

Number of pages 75


#### 3. Convert Documents into LlamaIndex Nodes
We split our documents into 'chunks' to be embedded.  
Each chunk is what LlamaIndex calls a **Node**. 

In [7]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    include_metadata=True,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

nodes = parser.get_nodes_from_documents(doc)

print(f'Parsed the {len(doc)} pages into {len(nodes)} nodes')

Parsed the 75 pages into 124 nodes
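
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` more concrete, here is a minimal sketch of a sliding window with overlap. It is an illustration only: the helper `chunk_words` is hypothetical and splits on whole words, whereas the LlamaIndex parser measures chunks in tokens and chooses its split points more carefully.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of chunk_size that share chunk_overlap words.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input: 20 placeholder "words", with small sizes so the overlap is easy to see
words = [f"w{i}" for i in range(20)]
chunks = chunk_words(words, chunk_size=8, chunk_overlap=2)

for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} words)")
```

Each window repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.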


In [8]:
from llama_index.schema import MetadataMode

In [9]:
# This prints what the LLM sees
print(nodes[50].get_content(metadata_mode=MetadataMode.LLM))

source: Language Models are Few-Shot Learners
page: 31

Figure 4.1: GPT-3 Training Curves We measure model performance during training on a deduplicated validation
split of our training distribution. Though there is some gap between training and validation performance, the gap grows
only minimally with model size and training time, suggesting that most of the gap comes from a difference in difﬁculty
rather than overﬁtting.
although models did perform moderately better on data that overlapped between training and testing, this did not
signiﬁcantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of
magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential
for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT

In [10]:
# This prints what the embedding model sees; note that 'source' and 'page' are excluded
print(nodes[50].get_content(metadata_mode=MetadataMode.EMBED))

Figure 4.1: GPT-3 Training Curves We measure model performance during training on a deduplicated validation
split of our training distribution. Though there is some gap between training and validation performance, the gap grows
only minimally with model size and training time, suggesting that most of the gap comes from a difference in difﬁculty
rather than overﬁtting.
although models did perform moderately better on data that overlapped between training and testing, this did not
signiﬁcantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of
magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential
for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B
does not overﬁt its training set by a signiﬁcant
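
The difference between the two views above comes down to which metadata keys are filtered out before the text is rendered. The sketch below imitates that behaviour in plain Python. The function `render` and its output layout are assumptions made for illustration; they are not LlamaIndex's actual implementation.

```python
def render(text, metadata, excluded_keys):
    """Prepend 'key: value' lines for every metadata key that is not excluded.

    Hypothetical stand-in for what get_content() appears to do.
    """
    visible = [f"{key}: {value}" for key, value in metadata.items()
               if key not in excluded_keys]
    if not visible:
        return text  # nothing to prepend, so only the raw text remains
    return "\n".join(visible) + "\n\n" + text


metadata = {"source": "Language Models are Few-Shot Learners", "page": "31"}
text = "Figure 4.1: GPT-3 Training Curves"

# Mirrors excluded_llm_metadata_keys=[] from the Document above
llm_view = render(text, metadata, excluded_keys=[])
# Mirrors excluded_embed_metadata_keys=['source', 'page']
embed_view = render(text, metadata, excluded_keys=["source", "page"])

print(llm_view)
print("---")
print(embed_view)
```

Keeping the citation metadata out of the embedding input means the vector reflects only the page content, while the LLM still sees the source and page needed to cite it.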

In [11]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

#### Setting Device:  
If you are using a Mac or an Nvidia GPU and have installed PyTorch correctly, the cell below will select the matching device (`mps` or `cuda`).  
Otherwise it will default to using the CPU.

For details on how to install PyTorch for CUDA see the [Get Started page](https://pytorch.org/get-started/locally/)  
If you are not using CUDA with an Nvidia GPU, you can uncomment the lines below to install PyTorch:

In [12]:
# Install PyTorch for Mac or Windows PC without Nvidia GPU 
# !pip install torch torchvision torchaudio 
# !pip install transformers

In [13]:
import torch 
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available(): 
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: mps


**Load Embedding Model:**  
A good place to start when choosing an embedding model is the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)

At time of writing, the [BAAI/bge-small-en-v1.5 model](https://huggingface.co/BAAI/bge-small-en) is the best small model according to the leaderboard

In [14]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model_name = 'BAAI/bge-small-en-v1.5'
# Import embedding model from HuggingFace 
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device=device,
    normalize=True,  # pass a boolean rather than the string 'True'
    )
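
As a rough illustration of why the `normalize` flag is useful, the sketch below assumes it scales each embedding to unit length. Under that assumption, the dot product of two embeddings equals their cosine similarity, so vector length no longer affects the score. The two-dimensional vectors are made up for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print("raw dot product:       ", dot(a, b))
print("cosine similarity:     ", cosine(a, b))
print("dot of unit vectors:   ", dot(a_unit, b_unit))
```

The last two values agree, which is why normalized embeddings can be compared with a plain dot product.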

#### 4. Create and store the Vector DB
* This will use the bge-small-en-v1.5 embedding model to embed our chunked text into vectors
* Then save those vectors into a ChromaDB collection named "arxiv_PDF_DB", persisted on disk under "./RAG_VectorDB"

**Note**: If a collection with that name already exists, the new vectors are appended to it; otherwise the collection is created

In [15]:
import chromadb

db = chromadb.PersistentClient(path='./RAG_VectorDB')

db_metadata = {
    'embedding_used':embed_model_name,
    'Included Papers':paper.title}
chroma_collection = db.get_or_create_collection('arxiv_PDF_DB', metadata=db_metadata)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

service_context = ServiceContext.from_defaults(embed_model=embed_model,
                                                llm = None, # We will set the LLM when we open the DB
                                                chunk_size=CHUNK_SIZE,
                                                chunk_overlap=CHUNK_OVERLAP
                                                )

vector_store_index = VectorStoreIndex(nodes=nodes,
                                    storage_context=storage_context, 
                                    service_context=service_context,
                                    show_progress=True)

print('Completed')

LLM is explicitly disabled. Using MockLLM.


Generating embeddings:   0%|          | 0/124 [00:00<?, ?it/s]

Completed
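
To give a feel for what the stored vectors are for, here is a toy, in-memory sketch of the retrieval step: each chunk is kept with its vector and metadata, and a query returns the most similar chunks. `ToyVectorStore`, the three-dimensional vectors, and the chunk texts are all invented for illustration. This brute-force search is not how ChromaDB is implemented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical brute-force store; real vector databases use indexes."""

    def __init__(self):
        self._items = []  # each entry is (vector, text, metadata)

    def add(self, vector, text, metadata):
        self._items.append((vector, text, metadata))

    def query(self, vector, top_k=2):
        # Score every stored chunk against the query, highest similarity first
        scored = [(cosine(vector, v), text, meta) for v, text, meta in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], "chunk about training curves", {"page": "1"})
store.add([0.1, 0.9, 0.0], "chunk about translation", {"page": "2"})
store.add([0.7, 0.3, 0.1], "chunk about data contamination", {"page": "3"})

# A made-up query vector that points mostly along the first dimension
results = store.query([1.0, 0.0, 0.0], top_k=2)
for score, text, meta in results:
    print(f"{score:.3f}  page {meta['page']}  {text}")
```

The metadata travels with each result, which is what later allows answers to cite the source and page of the retrieved chunk.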
