# LLM + RAG

Now we're ready to create an LLM + RAG pipeline! A large portion of this code was adapted from pixegami, specifically the following two videos:

https://www.youtube.com/watch?v=tcqEUSNCn8I

https://www.youtube.com/watch?v=2TJxpyO3ei4

## Importing Packages
We'll start by importing all of the relevant packages!

In [1]:
import argparse
import os
import shutil
import chromadb

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.evaluation import load_evaluator
from typing import List, Dict, Any

from langchain_chroma import Chroma
import chromadb.utils.embedding_functions as embedding_functions

chroma_path="chroma"

## Loading our Cleaned Data

We'll load our cleaned data from the `data` folder. In our case, we'll be loading slang data from Urban Dictionary as a CSV. We encourage you to check out the data to get a sense of how it's laid out.

In [2]:
loader = CSVLoader(file_path='./data/cleaned_slang_data.csv')
slang_document = loader.load()

Now that we've loaded in the data, we can take a quick peek at it to see what we're working with!

In [3]:
print(slang_document[0])
print(type(slang_document))
print(type(slang_document[0]))
len(slang_document)


page_content='word: Janky
definition: Undesirable; less-than optimum.' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
<class 'list'>
<class 'langchain_core.documents.base.Document'>


2580653

We can see that we have over 2.5 million slang entries, so we're working with quite a large amount of data!
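
The printed document above shows how each CSV row ends up as `column: value` lines, with the file path and row number stored as metadata. Here is a rough, stdlib-only sketch of that mapping. It is not `CSVLoader`'s actual implementation: `rows_to_documents` is a hypothetical helper, the documents are plain dicts, and the header names are assumed from the `word:` and `definition:` prefixes in the output.

```python
import csv
import io

# One sample row copied from the output above. The header names are an
# assumption based on the "word:" and "definition:" prefixes we saw printed.
sample_csv = "word,definition\nJanky,Undesirable; less-than optimum.\n"


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict holding page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # Join every column as "name: value", one column per line.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


docs = rows_to_documents(sample_csv, "./data/cleaned_slang_data.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```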

## Chunking our data

Now let's go ahead and chunk our data. Remember that chunking means cutting the data into manageable pieces so that each one fits into our vector database (the cheatsheet for the LLM)!

Since we're processing many elements, this may take a minute!

In [4]:
# Only necessary if we have too much data to add to the context.
def split_documents(documents: list[Document]):
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False,
  )
  return text_splitter.split_documents(documents)

chunks = split_documents(slang_document)
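
To build some intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` tries to break on natural separators such as paragraphs and newlines rather than at fixed positions, and `sliding_window_chunks` is a hypothetical helper, not part of LangChain. The toy sizes (10 and 2) are chosen so the overlap is easy to see.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window moves forward
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


toy_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2
)
print(toy_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each piece repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.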

Let's see what the chunks look like. Most of them should be identical to the original documents, since the chunk count is only about 5% higher than the document count, but splitting definitely helps for words that have very long definitions!

In [5]:
print(chunks[0])
print(len(chunks))

page_content='word: Janky
definition: Undesirable; less-than optimum.' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
2706525


## Creating Our Embedding Function

Next, let's create our embedding function. We want to use a specialized embedding model so that computing embeddings is fast and efficient. Since this embedding model is different from the model we've been using so far, we need to pull it first with `ollama pull nomic-embed-text`.

In [6]:
def get_embedding_function():
  embeddings = embedding_functions.OllamaEmbeddingFunction(
      url="http://localhost:11434/api/embeddings",
      model_name="nomic-embed-text",
  )
  
  return embeddings


Let's see what an embedding looks like for reference with our sample chunk!

In [7]:
embedding_function = get_embedding_function()

chunk = chunks[0].page_content

print(embedding_function([chunk]))

# print(f'Vector for chunk "{chunk}" is: {vector}')

[array([ 6.40811622e-01,  7.72351563e-01, -3.19940400e+00, -8.55043650e-01,
        3.97146285e-01,  1.19626164e+00, -6.61623418e-01, -2.91547894e-01,
       -6.05375804e-02,  1.68385841e-02, -5.78635812e-01,  1.05480552e+00,
        1.58162522e+00,  1.44126129e+00, -1.82115376e-01, -1.23256886e+00,
        1.08977544e+00, -6.58981979e-01,  1.10475874e+00,  4.84398939e-02,
        7.75346994e-01, -5.62737167e-01,  8.28783929e-01, -3.86232316e-01,
        1.06826174e+00, -1.19985424e-01,  6.36491120e-01, -1.86059564e-01,
        6.89530134e-01,  1.11838269e+00, -3.04865837e-01, -3.87093574e-02,
       -3.71003091e-01, -9.85167250e-02, -1.99460161e+00, -5.66446304e-01,
        7.54378855e-01,  2.36689553e-01,  2.77079225e-01, -9.25087571e-01,
       -4.70041811e-01, -2.03075811e-01,  1.39579701e+00, -1.37997413e+00,
       -2.65024185e-01,  1.86201707e-02,  3.67940933e-01,  8.55515778e-01,
        8.92761469e-01, -6.19878054e-01,  8.07923734e-01, -1.41326535e+00,
        8.92637908e-01, 

But what does this embedding actually mean? How do we interpret this vector?

Remember that this vector is a point in an abstract, high-dimensional space. It's most useful to think of it as representing a concept: if we compare it to the vector for another concept, the distance between the two shows us how similar they are.

Note: the closer to 0 the distance is, the closer the two concepts are!
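
As a rough illustration of "closer to 0 means more similar", here is a small sketch using cosine distance, one common way to compare embedding vectors. The three-dimensional vectors are made up for readability and do not come from `nomic-embed-text`, and `cosine_distance` is a hypothetical helper written for this example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: near 0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up toy vectors (assumptions for illustration only).
janky = [0.9, 0.1, 0.0]
shoddy = [0.8, 0.2, 0.1]   # points in nearly the same direction as janky
brother = [0.0, 0.2, 0.9]  # points in a very different direction

print(cosine_distance(janky, janky))    # effectively 0: identical vectors
print(cosine_distance(janky, shoddy))   # small: similar direction
print(cosine_distance(janky, brother))  # large: different direction
```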

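To solidify this concept, let's compare our chunk 0 to a few other strings. One caveat: the `pairwise_string_distance` evaluator used below compares the raw characters of the two strings (Jaro-Winkler distance by default) rather than their embeddings, so treat these scores as a quick stand-in for the intuition, not as a true measure of semantic similarity. Feel free to experiment as well!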

In [8]:
evaluator = load_evaluator("pairwise_string_distance")

print(evaluator.evaluate_string_pairs(prediction="Janky", prediction_b=chunk)) # This should be somewhat close to 0.0

print(evaluator.evaluate_string_pairs(prediction=chunk, prediction_b=chunk)) # This should be 0.0 or very close to it

print(evaluator.evaluate_string_pairs(prediction="pristine", prediction_b=chunk)) # This should be further from 0.0

print(evaluator.evaluate_string_pairs(prediction="brother", prediction_b=chunk)) # This should be even further from 0.0

{'score': 0.3030303030303031}
{'score': 0.0}
{'score': 0.42781385281385287}
{'score': 0.5316017316017316}


## Creating the Vector Database

Now we want to start creating our vector database. This is our LLM's cheatsheet of information that it will use in the future to respond to user queries.

We will do this using [Chroma](https://www.trychroma.com/), which is a vector database!

Let's first clear out any existing database. You only need to do this if you're doing a fresh run, so the lines below are commented out; uncomment them to start over.

In [14]:
# Clear the database for our initial run in case it exists.
# if os.path.exists(chroma_path):
#   shutil.rmtree(chroma_path)

Next, let's start up our Chroma server! To do this, run the following command in a terminal:

`chroma run --host localhost --port 8000 --path ./chroma`

In [15]:
# Initialize our chromadb client locally with special port number so we don't conflict with other things running
client = chromadb.HttpClient(host='localhost', port=8000)
collection_name = "llm_rag_collection"

collection = client.get_or_create_collection(name=collection_name, embedding_function=get_embedding_function())

Next, let's actually add our chunks to Chroma! We'll start by calculating chunk IDs so we can update our data at any time. It takes ~8 seconds for 500 records to be embedded and placed into our database. Since our dataset contains over 2.7 million chunks, embedding all of it would take ~12 hours! Instead, we'll only add 500 documents, but feel free to adjust this number as you see fit!
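
The ~12 hour figure follows directly from the numbers above. Here is the back-of-the-envelope arithmetic, assuming the ~8 seconds per 500 records stays roughly constant as the collection grows (which may not hold in practice):

```python
total_chunks = 2_706_525   # len(chunks) from the output earlier
records_per_batch = 500    # size of the batch we timed
seconds_per_batch = 8      # rough timing quoted above

estimated_seconds = total_chunks / records_per_batch * seconds_per_batch
estimated_hours = estimated_seconds / 3600
print(f"{estimated_hours:.1f} hours")  # 12.0 hours
```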

In [16]:
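import hashlib

def calculate_chunk_ids(chunks):
    chunks_with_id = []
    for chunk in chunks:
        # Calculate the chunk ID from the content. We use SHA-256 rather than
        # the built-in hash(), whose string hashes change between Python
        # sessions, so the same chunk gets the same ID on every run.
        chunk_id = hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()
        
        # Add it to the page meta-data.
        chunk.metadata["id"] = chunk_id
        chunks_with_id.append(chunk)

    return chunks_with_id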


def add_to_chroma(chunks: list[Document]):
    chunks_with_ids = calculate_chunk_ids(chunks)
    
    # Retrieve existing IDs from the collection
    existing_items = collection.get(include=[])
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in collection: {len(existing_ids)}")

    # Prepare data for new documents
    new_chunk_ids = []
    new_documents = []
    new_metadatas = []

    for chunk in chunks_with_ids:
        chunk_id = chunk.metadata["id"]
        if chunk_id not in existing_ids:
            new_chunk_ids.append(chunk_id)
            new_documents.append(chunk.page_content)
            new_metadatas.append(chunk.metadata)

    if new_chunk_ids:
        print(f"👉 Adding new documents: {len(new_chunk_ids)}")
        collection.add(
            ids=new_chunk_ids,
            documents=new_documents,
            metadatas=new_metadatas
        )
    else:
        print("✅ No new documents to add")

# Adjust this number if you want more data!
how_many_documents_to_add = 500
add_to_chroma(chunks[:how_many_documents_to_add])

Number of existing documents in collection: 0
👉 Adding new documents: 500
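
If you raise `how_many_documents_to_add` well beyond 500, sending everything in a single `collection.add` call may become slow or run into size limits. One option is to slice the data into batches first. The sketch below shows only the slicing, using a hypothetical `batched` helper and made-up IDs in place of the real collection.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up IDs standing in for new_chunk_ids (assumption for illustration).
fake_ids = [f"id-{i}" for i in range(1200)]
batch_lengths = [len(batch) for batch in batched(fake_ids, 500)]
print(batch_lengths)  # [500, 500, 200]
```

In `add_to_chroma`, the same slicing would be applied to the IDs, documents, and metadatas together, with one `collection.add` call per batch.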


Now we're done adding data to our vector database! Hooray! Let's return to the README for further instructions.