<a href="https://colab.research.google.com/github/John-J-Riehl/sample-rag-app/blob/lesson-3-complete/SUNLight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Lesson 3: Vector Database


##Getting Started
Your code and data will run in the `/content` directory. Create a subdirectory there called `context_data` and upload the [context documents for the course]() into that directory. If you're new to Google Colab, download our [Getting Started with Colab]() guide.

You'll also need an API key from Hugging Face. Go to their [signup page](https://huggingface.co/join), enter your email and a password, then complete your profile. Once you have an account and are signed in, go to [Settings | Access Tokens](https://huggingface.co/settings/tokens) and select "New token". You only need a read-type token (write tokens allow you to post to Hugging Face).

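Once you have your token, enter it below and run the code in the cell. A shell command preceded with a bang `!` (like `!export`) runs in a temporary subshell, so a variable set that way is gone by the time the next cell runs. Set the variable from Python instead so it stays available for the rest of the session.

In [39]:
import os
os.environ["HF_API_TOKEN"] = "hf_..."  # paste your own read-type token here; never share or commit a real token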
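
Each `!` command runs in its own short-lived subshell, so it is worth confirming where an environment variable actually ends up. The following is a minimal sketch of that check, assuming a Linux runtime such as Colab; the variable name `DEMO_TOKEN` and its value are placeholders invented for the demonstration, not part of the course.

```python
import os
import subprocess

# Make sure the placeholder variable is not already set.
os.environ.pop("DEMO_TOKEN", None)

# Roughly what `!export DEMO_TOKEN=...` does: the variable is set inside a
# child shell that exits as soon as the command finishes.
subprocess.run("export DEMO_TOKEN=hf_placeholder", shell=True, check=True)
seen_after_export = os.environ.get("DEMO_TOKEN")

# Setting the variable on the Python process itself makes it visible to
# Python code and to any child process started afterwards.
os.environ["DEMO_TOKEN"] = "hf_placeholder"
seen_by_child = subprocess.run(
    "echo $DEMO_TOKEN", shell=True, capture_output=True, text=True
).stdout.strip()

print(f"After export in a subshell: {seen_after_export}")
print(f"After os.environ assignment: {seen_by_child}")
```

If the first value prints as `None`, the exported variable never reached the notebook's Python process, and assigning through `os.environ` is the safer route.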

LangChain touches all aspects of this app, so let's go ahead and install it now.

In [10]:
!pip install langchain langchain-community



##Loading Context Documents
The first step in building the vector database is to load the context documents. Load them into a variable named `context_data`.

In [12]:
!pip install pypdf
from langchain_community.document_loaders import PyPDFDirectoryLoader
context_folder = "./context_data"
loader = PyPDFDirectoryLoader(context_folder)
context_data = loader.load()

Collecting pypdf
  Downloading pypdf-4.1.0-py3-none-any.whl (286 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.1.0


Now let's verify that the documents loaded by printing the content of each page. Scroll to the end of a line to see what metadata the document loader includes.

In [28]:
for page in context_data:
  print(page)

page_content='Théâtre D\'opéra Spatial, an imagegenerated by MidjourneyGenerative artificial intelligenceGenerative artificial intelligence (generative AI, GAI, orGenAI[1]) is artificial intelligence capable of generating text,images, or other media, using generative models.[2][3][4]Generative AI models learn the patterns and structure of theirinput training data and then generate new data that has similarcharacteristics.[5][6]In the early 2020s, advances in transformer-based deepneural networks enabled a number of generative AI systemsnotable for accepting natural language prompts as input.These include large language model (LLM) chatbots such asChatGPT, Copilot, Bard, and LLaMA, and text-to-imageartificial intelligence art systems such as Stable Diffusion,Midjourney, and DALL-E.[7][8][9]Generative AI has uses across a wide range of industries, including art, writing, script writing, softwaredevelopment, product design, healthcare, finance, gaming, marketing, and fashion.[10][11][12] 
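
Each loaded page carries a `page_content` string and a `metadata` dictionary with `source` and `page` keys. As a quick sanity check, you could count how many pages came from each file. The sketch below is one way to do that; the helper name `pages_per_source` and the stand-in sample objects are hypothetical, used here so the example runs without any PDFs.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents):
    """Count how many loaded pages came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


# Stand-in objects exposing the same two attributes as the loader output.
sample_pages = [
    SimpleNamespace(page_content="first page", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="second page", metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(page_content="only page", metadata={"source": "b.pdf", "page": 0}),
]

for source, count in pages_per_source(sample_pages).items():
    print(f"{source}: {count} page(s)")
```

In the notebook, calling `pages_per_source(context_data)` should give one entry per PDF in the `context_data` folder.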

##Chunking
Now it's time to split the documents into chunks that will work with the LLM's context window. Store them in a variable named `chunks`.

In [24]:
!pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False
)
chunks = text_splitter.split_documents(context_data)
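
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs, lines, and spaces); the function name `sliding_chunks` is made up for this sketch, which only shows how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
for number, chunk in enumerate(demo_chunks):
    print(number, chunk)
```

The last two characters of each chunk reappear at the start of the next one, which helps keep a sentence that straddles a boundary retrievable from either side.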



Verify it worked by exploring how the documents were chunked.

In [30]:
print(f"Total Document Chunks: {len(chunks)}\n")
print(chunks[0].metadata)
print(chunks[0].page_content)

print("Length of each chunk:")

for num, chunk in enumerate(chunks):
  print(f"Chunk {num} (from page {chunk.metadata['page'] + 1}): {len(chunk.page_content)} characters")

Total Document Chunks: 36

{'source': 'context_data/Generative_artificial_intelligence.pdf', 'page': 0}
Théâtre D'opéra Spatial, an imagegenerated by MidjourneyGenerative artificial intelligenceGenerative artificial intelligence (generative AI, GAI, orGenAI[1]) is artificial intelligence capable of generating text,images, or other media, using generative models.[2][3][4]Generative AI models learn the patterns and structure of theirinput training data and then generate new data that has similarcharacteristics.[5][6]In the early 2020s, advances in transformer-based deepneural networks enabled a number of generative AI systemsnotable for accepting natural language prompts as input.These include large language model (LLM) chatbots such asChatGPT, Copilot, Bard, and LLaMA, and text-to-imageartificial intelligence art systems such as Stable Diffusion,Midjourney, and DALL-E.[7][8][9]Generative AI has uses across a wide range of industries, including art, writing, script writing, softwaredevel

Now it's time to set up the embedding function. Assign it to a variable named `embedding_function`.

In [33]:
!pip install sentence_transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Collecting sentence_transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Make sure your model works by finding the embedding for a test sentence.

In [36]:
embedding = embedding_function.embed_query("This is a test sentence.")
print(f"Embedding length: {len(embedding)}")
print(f"{embedding[:3]}, ... , {embedding[-3:]}")

Embedding length: 384
[0.08429648727178574, 0.05795368552207947, 0.0044933464378118515], ... , [0.004571131430566311, 0.08188024908304214, -0.09904709458351135]
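
An embedding is only useful in comparison with other embeddings: texts with similar meaning should map to vectors that point in similar directions. One common way to measure that is cosine similarity. The sketch below uses only the standard library and toy vectors; the function name `cosine_similarity` is our own, not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: one pair pointing the same way, one pair at right angles.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"Same direction: {same_direction:.3f}")
print(f"Orthogonal: {orthogonal:.3f}")
```

In the notebook, you could pass two results of `embedding_function.embed_query(...)` to this function and compare the score for related sentences against the score for unrelated ones.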


Now it's time for the vector store. Assign it the name `chromadb`.

In [37]:
!pip install chromadb

from langchain_community.vectorstores import Chroma
chromadb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory='./chromadb'
)
chromadb.persist()  # only needed on older Chroma versions; 0.4+ saves automatically when persist_directory is set

Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.0-py3-none-any.whl (92 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.28.0-py3-none-any.whl (60 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2

Now test it by executing a similarity search.

In [38]:
retrieved_chunks = chromadb.similarity_search("What are different modalities supported for Generative Artificial Intelligence?")
print(f"Query retrieved {len(retrieved_chunks)} chunks.")
for chunk in retrieved_chunks:
  print(f"Chunk content: {chunk.page_content}")
  print(f"Chunk metadata: {chunk.metadata}")

Query retrieved 4 chunks.
Chunk content: 2023.[34]A generative AI system is constructed by applying unsupervised or self-supervised machine learning to adata set. The capabilities of a generative AI system depend on the modality or type of the data set used.Generative AI can be either unimodal or multimodal; unimodal systems take only one type of input,whereas multimodal systems can take more than one type of input.[35] For example, one version ofOpenAI's GPT-4 accepts both text and image inputs.[36]Modalities
Chunk metadata: {'page': 1, 'source': 'context_data/Generative_artificial_intelligence.pdf'}
Chunk content: Théâtre D'opéra Spatial, an imagegenerated by MidjourneyGenerative artificial intelligenceGenerative artificial intelligence (generative AI, GAI, orGenAI[1]) is artificial intelligence capable of generating text,images, or other media, using generative models.[2][3][4]Generative AI models learn the patterns and structure of theirinput training data and then generate new dat
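
Four chunks come back because `similarity_search` returns the `k=4` nearest matches unless you pass a different `k`. Conceptually, the search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below does this by brute force over a tiny in-memory list. It is an illustration only, not how Chroma is implemented (real vector stores rely on an index rather than comparing against every vector), and the name `top_k` and the 2-D toy vectors are invented for the example.

```python
import math


def top_k(query_vector, stored, k=4):
    """Return the k (distance, text) pairs whose vectors are closest to the query."""
    scored = [(math.dist(query_vector, vector), text) for text, vector in stored]
    return sorted(scored)[:k]


# Toy "vector store": each entry pairs a text with a made-up 2-D embedding.
stored = [
    ("cats", [1.0, 0.0]),
    ("dogs", [0.9, 0.1]),
    ("cars", [0.0, 1.0]),
]

results = top_k([1.0, 0.0], stored, k=2)
for distance, text in results:
    print(f"{text}: distance {distance:.3f}")
```

Smaller distances mean closer matches, so the entries are listed from most to least similar.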