### Part 1: Creating the Vector Database with ChromaDB and Hugging Face Embeddings
**Introduction:**  
In this part, we will create a vector database with ChromaDB to store embeddings generated by a Hugging Face embedding model. This vector database will serve as the foundation for the retrieval component of our RAG system.

In [1]:
# Below are the necessary libraries, uncomment the ones you need:
# !pip install langchain
# !pip install chromadb
# !pip install arxiv
# !pip install PyPDF2

In [2]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import arxiv
from PyPDF2 import PdfReader

### 1. Download an example PDF from arXiv
For this RAG example, we use the paper "Language Models are Few-Shot Learners" (arXiv ID 2005.14165).

In [3]:
# Create one client and reuse it to look up the paper by its arXiv ID
client = arxiv.Client()
search = arxiv.Search(id_list=['2005.14165'])

paper = next(client.results(search))
print(paper.title)

Language Models are Few-Shot Learners


##### Download the PDF locally

In [4]:
path = paper.download_pdf() 

### 2. Convert the PDF to LangChain Documents
For this example we will use LangChain's `Document` format.
Each `Document` holds the `page_content` along with our metadata, which is used for citing sources.

In [5]:
from langchain.docstore.document import Document

In [6]:
reader = PdfReader(path)
doc = []
# Wrap each page in a Document, recording the paper title and 1-based page number
for idx, page in enumerate(reader.pages, start=1):
    doc.append(Document(
        page_content=page.extract_text(),
        metadata={'source': paper.title, 'page': str(idx)},
    ))

print(f'Number of pages {len(doc)}')

Number of pages 75
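
Because each `Document` carries a `source` and a `page` in its metadata, a citation string can be assembled from that metadata later on. The following is a minimal sketch and not part of the original notebook: `format_citation` is a hypothetical helper, and the example dict only mimics the shape of the metadata built above.

```python
def format_citation(metadata: dict) -> str:
    """Build a short citation string from a chunk's metadata dict."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    # Only mention the page when the metadata actually provides one
    if page is None:
        return source
    return f"{source}, p. {page}"


# Hypothetical metadata, shaped like the dicts attached to each page above
example_metadata = {"source": "Language Models are Few-Shot Learners", "page": "3"}
citation = format_citation(example_metadata)
print(citation)
```

Since the text splitter copies metadata onto every chunk it produces, a helper along these lines could be applied to any retrieved chunk.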


### 3. Prepare the documents by splitting the data 
Now we will split the 75 pages into chunks to be vectorized.

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=15)
texts = text_splitter.split_documents(doc)

print(f'Split into {len(texts)} chunks')

Split into 974 chunks
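
To build some intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch. It is an assumption-laden stand-in, not the splitter's real algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines, and spaces, whereas the hypothetical `sliding_window_chunks` below simply cuts fixed-width windows.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. With the notebook's settings (300 and 15), the overlap plays the same role: it keeps a little shared context across chunk boundaries.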


#### Setting Device:  
If you are using a Mac or an Nvidia GPU and PyTorch is installed correctly, the code below selects the matching device (`mps` or `cuda`).  
Otherwise, it defaults to the CPU.

For details on how to install PyTorch for CUDA, see the [Get Started page](https://pytorch.org/get-started/locally/).  
If you are not using CUDA with an Nvidia GPU, you can uncomment the PyTorch install line below:

In [8]:
# Install PyTorch for Mac or Windows PC without Nvidia GPU 
# !pip install torch torchvision torchaudio 

# !pip install sentence_transformers

In [9]:
import torch 
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: mps


#### Load embedding model:
A good place to start when choosing an embedding model is the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

At the time of writing, the [BAAI/bge-small-en-v1.5 model](https://huggingface.co/BAAI/bge-small-en) is the best small model according to the leaderboard.

In [10]:
from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = 'BAAI/bge-small-en-v1.5'  # Using open source embedding model

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': device},
    encode_kwargs={'normalize_embeddings': True} #normalizes the vectors
)
print(f'Loaded {model_name} from HuggingFace')

Loaded BAAI/bge-small-en-v1.5 from HuggingFace
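
The `normalize_embeddings` option matters for similarity search. As a minimal sketch, assuming normalization here means scaling each vector to unit L2 length, the toy example below shows that the dot product of two normalized vectors equals the cosine similarity of the originals. The helper names and the two-dimensional vectors are illustrative only; real embeddings from this model have many more dimensions.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization, a cheap dot product gives the cosine similarity
print(dot(a_unit, b_unit))
print(cosine_similarity(a, b))
```

Both printed values should agree up to floating-point rounding (about 0.96 for these vectors).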


### 4. Create and store the Vector DB
* This uses the bge-small-en-v1.5 embedding model to embed our chunked text into vectors
* It then saves those vectors into a ChromaDB named "LC_VectorDB"

**Note**: If a DB with that name already exists, the new vectors are appended to it; otherwise a new DB is created.

In [11]:
persist_directory = 'LC_VectorDB' # Name of the DB

vectordb = Chroma.from_documents(
    documents=texts,  # Embed the chunks from step 3, not the full pages in `doc`
    embedding=embedding_function,
    persist_directory=persist_directory # This line saves the db to disk
    )
print("DB write complete!")

DB write complete!
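
Because writing to an existing DB appends to it, re-running the cell above would store the same chunks again. One way to guard against that is sketched below. It rests on a loud assumption: a non-empty directory is treated as "already populated", and the hypothetical `db_already_populated` helper does not inspect Chroma's own files. The demo uses a throwaway temporary directory rather than the real `LC_VectorDB`.

```python
import os
import tempfile


def db_already_populated(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0


with tempfile.TemporaryDirectory() as tmp:
    # A freshly created, empty directory counts as "not populated"
    empty_result = db_already_populated(tmp)

    # Simulate a previous write by dropping a placeholder file into it
    with open(os.path.join(tmp, "placeholder.bin"), "w") as f:
        f.write("x")
    populated_result = db_already_populated(tmp)

    # A path that does not exist also counts as "not populated"
    missing_result = db_already_populated(os.path.join(tmp, "does_not_exist"))

print(empty_result, populated_result, missing_result)  # False True False
```

In the notebook, the write step could then be wrapped in `if not db_already_populated(persist_directory):` so that embeddings are only computed and stored on the first run.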
