# LLM + RAG

Now we're ready to create an LLM + RAG Pipeline! A large portion of this code was adapted from pixegami, specifically the following two videos:

https://www.youtube.com/watch?v=tcqEUSNCn8I

https://www.youtube.com/watch?v=2TJxpyO3ei4

## Importing Packages
We will first start by importing all of the relevant packages!

In [161]:
import argparse
import os
import shutil
import chromadb

from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb.db.base import UniqueConstraintError
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.vectorstores.chroma import Chroma
from langchain.evaluation import load_evaluator
from typing import List, Dict, Any

# from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain.embeddings import HuggingFaceBgeEmbeddings

chroma_path="chroma"

## Loading our Cleaned Data

We'll load our cleaned data from the `data` folder. In our case, that's slang data from Urban Dictionary stored as a CSV. We encourage you to check out the data to get a sense of how it's laid out.

In [2]:
loader = CSVLoader(file_path='./data/cleaned_slang_data.csv')
slang_document = loader.load()

Now that we've loaded in the data, we can take a quick peek at it to see what we're working with!

In [3]:
print(slang_document[0])
print(type(slang_document))
print(type(slang_document[0]))
len(slang_document)


page_content='word: Janky
definition: Undesirable; less-than optimum.' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
<class 'list'>
<class 'langchain_core.documents.base.Document'>


2580653

We can see that we have over 2.5 million slang entries, so we're working with quite a large amount of data!

## Chunking our data

Now let's go ahead and chunk our data. Remember that chunking means cutting the data into manageable pieces so we can fit it into our vector database (the LLM's cheatsheet)!

In [4]:
# Only necessary if we have too much data to add to the context.
def split_documents(documents: list[Document]):
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False,
  )
  return text_splitter.split_documents(documents)

chunks = split_documents(slang_document)

Let's see what the chunks look like. Most of them should match the original rows exactly, but splitting does help for words with very long definitions, which is why we end up with more chunks than rows!

In [6]:
print(chunks[0])
print(len(chunks))

page_content='word: Janky
definition: Undesirable; less-than optimum.' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
2706525
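
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of a plain sliding-window splitter. This is a simplified illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as newlines and spaces, and the function name `sliding_window_chunks` is our own, not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so context at a
    boundary appears in both neighbouring chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Notice that the last two characters of each piece are repeated at the start of the next one. That repetition is the overlap.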


## Creating Our Embedding Function

Let's start by creating our embedding function. Here we use the `BAAI/bge-small-en` model through `HuggingFaceBgeEmbeddings`, running on the CPU with normalized embeddings. The Ollama embeddings for llama3.1 are left commented out as an alternative.

In [172]:
def get_embedding_function():
#   embeddings = OllamaEmbeddings(model="llama3.1")
  model_name = "BAAI/bge-small-en"
  model_kwargs = {"device": "cpu"}
  encode_kwargs = {"normalize_embeddings": True}
  embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
  )
  return embeddings

Let's see what an embedding looks like for reference with our sample chunk!

In [173]:
embedding_function = get_embedding_function()

chunk = chunks[0].page_content

vector = embedding_function.embed_query(chunk)

print(f'Vector for chunk "{chunk}" is: {vector}')

Vector for chunk "word: Janky
definition: Undesirable; less-than optimum." is: [0.02469354122877121, -0.04069596901535988, -0.010370255447924137, 0.03439134731888771, 0.010892847552895546, -0.01080682035535574, 0.04962347447872162, 0.023051155731081963, -0.008888516575098038, -0.012985138222575188, 0.0037654428742825985, -0.0407717302441597, 0.03734917566180229, 0.04705151543021202, -0.002829703502357006, 0.0016476493328809738, 0.009689309634268284, 0.03862926736474037, -0.03908330202102661, 0.023559609428048134, 0.03988025337457657, -0.049310773611068726, 0.0049852472729980946, -0.07738950848579407, 0.05985085666179657, 4.8618494474794716e-05, 0.0031844598706811666, -0.003866560524329543, -0.01783439889550209, -0.16132646799087524, -0.016389980912208557, -0.03130568191409111, 0.0541437529027462, 0.0018702696543186903, -0.01178800594061613, 0.005979408044368029, 0.017880897969007492, -0.01939062587916851, -0.0687340795993805, 0.008134184405207634, -0.008391705341637135, 0.0516829304397

But what does this embedding actually mean? How do we interpret this vector?

Remember that this vector is a point in an arbitrary space. It's most useful to think of it as a concept: if we compare it to other concepts, the distance between the two vectors shows us how similar they are.

Note: the closer the evaluation score is to 0, the closer the two concepts are!

To solidify this concept, let's start by comparing our chunk 0 to various other concepts. Feel free to experiment around as well!

In [174]:
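# Note: "pairwise_string_distance" compares the raw text of the two strings, not their embeddings.
# To compare in embedding space instead, use
# load_evaluator("pairwise_embedding_distance", embeddings=embedding_function).
evaluator = load_evaluator("pairwise_string_distance")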

print(evaluator.evaluate_string_pairs(prediction="Janky", prediction_b=chunk)) # This should be somewhat close to 0.0

print(evaluator.evaluate_string_pairs(prediction=chunk, prediction_b=chunk)) # This should be 0.0 or very close to it

print(evaluator.evaluate_string_pairs(prediction="pristine", prediction_b=chunk)) # This should be further from 0.0

print(evaluator.evaluate_string_pairs(prediction="brother", prediction_b=chunk)) # This should be even further from 0.0

{'score': 0.3030303030303031}
{'score': 0.0}
{'score': 0.42781385281385287}
{'score': 0.5316017316017316}
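
Distance between two embedding vectors is commonly measured with cosine distance, which is 0 when the vectors point the same way and grows as they diverge. Below is a minimal sketch using toy 2-dimensional vectors (not real embeddings); the helper name `cosine_distance` is our own.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])        # identical direction
close = cosine_distance([1.0, 0.0], [1.0, 0.1])       # nearly the same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated directions

print(same, close, orthogonal)
```

With real embeddings you would pass in the vectors returned by `embedding_function.embed_query(...)` instead of the toy lists.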


## Creating the Vector Database

Now we want to start creating our vector database. This is our LLM's cheatsheet of information that it will use in the future to respond to user queries.

We will do this using [Chroma](https://www.trychroma.com/), which is a vector database!

Let's first set up some variables and clear out any existing items in it (you only need to do this if you're doing a fresh run).

In [175]:
# Clear the database for our initial run in case it exists.
if os.path.exists(chroma_path):
  shutil.rmtree(chroma_path)

Next, let's start up our Chroma server! To do this, run the following command in a terminal:

`chroma run --host localhost --port 8000 --path ./chroma`

In [181]:
# Connect to the Chroma server we started in the terminal (localhost, port 8000)
client = chromadb.HttpClient(host='localhost', port=8000)
collection_name = "llm_rag_collection"

collection = client.get_or_create_collection(name=collection_name, embedding_function=get_embedding_function())

TypeError: Client.get_or_create_collection() got an unexpected keyword argument 'database'
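
One thing to watch for: the native `chromadb` client expects its `embedding_function` to be callable on a list of texts, whereas LangChain embedding objects expose an `embed_documents` method. Below is a minimal adapter sketch, assuming that calling convention. The class names `LangChainEmbeddingAdapter` and `FakeEmbeddings` are hypothetical, and newer `chromadb` releases may require additional methods on the embedding function, so check this against your installed version.

```python
from typing import List, Sequence


class LangChainEmbeddingAdapter:
    """Hypothetical adapter: makes a LangChain-style embeddings object callable."""

    def __init__(self, lc_embeddings):
        self._lc_embeddings = lc_embeddings

    def __call__(self, input: Sequence[str]) -> List[List[float]]:
        # Delegate to the LangChain method that embeds a batch of texts.
        return self._lc_embeddings.embed_documents(list(input))


class FakeEmbeddings:
    """Stand-in used only to exercise the adapter without downloading a model."""

    def embed_documents(self, texts):
        return [[float(len(text)), 0.0] for text in texts]


adapter = LangChainEmbeddingAdapter(FakeEmbeddings())
vectors = adapter(["Janky", "word"])
print(vectors)  # [[5.0, 0.0], [4.0, 0.0]]
```

In this notebook, the idea would be to pass `LangChainEmbeddingAdapter(get_embedding_function())` as the `embedding_function` argument when creating the collection.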

Next, let's actually add our chunks to Chroma! We'll start by calculating chunk IDs so we can update our data at any time.

In [126]:
def calculate_chunk_ids(chunks):
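    # Assign each chunk an ID of the form "source:row:chunk_index",
    # e.g. "./data/cleaned_slang_data.csv:0:0" for the first chunk of row 0.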
    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        row = chunk.metadata.get("row")
        current_page_id = f"{source}:{row}"

        # If the page ID is the same as the last one, increment the index.
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        # Calculate the chunk ID.
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        # Add it to the page meta-data.
        chunk.metadata["id"] = chunk_id

    return chunks

def add_to_chroma(chunks: list[Document]):
    # Calculate chunk IDs
    chunks_with_ids = calculate_chunk_ids(chunks)

    # Retrieve existing IDs from the collection.
    # IDs are always returned, so we include nothing else to keep the response small.
    existing_items = collection.get(include=[])
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in collection: {len(existing_ids)}")

    # Prepare data for new documents
    new_chunk_ids = []
    new_documents = []
    new_metadatas = []

    for chunk in chunks_with_ids:
        chunk_id = chunk.metadata["id"]
        if chunk_id not in existing_ids:
            new_chunk_ids.append(chunk_id)
            new_documents.append(chunk.page_content)
            new_metadatas.append(chunk.metadata)

    if new_chunk_ids:
        print(f"👉 Adding new documents: {len(new_chunk_ids)}")
        # Add new documents to the collection
        collection.add(
            ids=new_chunk_ids,
            documents=new_documents,
            metadatas=new_metadatas
            # embeddings will be computed using the embedding function provided when creating the collection
        )
    else:
        print("✅ No new documents to add")

add_to_chroma(chunks)

NameError: name 'collection' is not defined
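
With roughly 2.7 million chunks, sending everything in a single `collection.add` call may exceed the client's maximum batch size, so it can help to add the documents in smaller batches. Below is a minimal sketch of a batching helper. The name `batched` and the batch size are our own choices for illustration, not values from Chroma.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


ids = [f"row:{i}" for i in range(5)]
batches = list(batched(ids, 2))
print(batches)  # [['row:0', 'row:1'], ['row:2', 'row:3'], ['row:4']]
```

Inside `add_to_chroma`, the same slicing would need to be applied to `new_chunk_ids`, `new_documents`, and `new_metadatas` with identical start and end positions, so that each batch passed to `collection.add` stays aligned.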