In [1]:
# First, let's make sure we're using the right Python environment
import sys
import os

print("Python Information:")
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")
print(f"Current working directory: {os.getcwd()}")

# Check if we're in the right virtual environment
if '.venv' in sys.executable and 'summer_2025' in sys.executable:
    print("✓ Using project virtual environment")
else:
    print("⚠️  Not using project virtual environment!")
    print("   To fix: Select the correct kernel in Jupyter (Kernel → Change Kernel → ME344 RAG (Python))")
    print("   Or restart with: ./run_part1.sh")

Python Information:
Python executable: /Users/sujeethjinesh/Desktop/ME344/summer_2025/.venv/bin/python3.11
Python version: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
Current working directory: /Users/sujeethjinesh/Desktop/ME344/summer_2025
✓ Using project virtual environment


In [2]:
import argparse
import os
import shutil
import chromadb
import pprint
import hashlib

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain.evaluation import load_evaluator
from typing import List, Dict, Any

from langchain_chroma import Chroma
import chromadb.utils.embedding_functions as embedding_functions

chroma_path="chroma"

# RAG System Setup Notebook

## ⚠️ IMPORTANT: Before Running This Notebook

1. **Select the correct kernel**: Kernel → Change Kernel → **ME344 RAG (Python)**
2. **Make sure all services are running**: Run `./run_part1.sh` in the terminal
3. **Run cells in order**: Execute each cell sequentially from top to bottom

If you see any import errors or "module not found" errors, you're likely using the wrong kernel!

In [3]:
# Test imports to ensure we have the right environment
import sys
print(f"Python: {sys.executable}")

try:
    import chromadb
    print(f"✅ ChromaDB version: {chromadb.__version__}")
except ImportError as e:
    print(f"❌ ChromaDB import failed: {e}")
    
try:
    from langchain_community.document_loaders.csv_loader import CSVLoader
    print("✅ LangChain community imports working")
except ImportError as e:
    print(f"❌ LangChain community import failed: {e}")
    
try:
    import requests
    print("✅ Requests library available")
except ImportError as e:
    print(f"❌ Requests import failed: {e}")

Python: /Users/sujeethjinesh/Desktop/ME344/summer_2025/.venv/bin/python3.11
✅ ChromaDB version: 0.4.24
✅ LangChain community imports working
✅ Requests library available


## Loading our Cleaned Data

We'll load our cleaned data from the `data` folder. In our case, that's slang data from Urban Dictionary, stored as a CSV. We encourage you to look through the file to get a sense of how it's laid out.

If you want to use your own custom data, this is where you'd make that change. Just point the file path to your own data, and swap `CSVLoader` for whatever loader works best for your use case. You can see the different types of loaders at [LangChain](https://python.langchain.com/docs/integrations/document_loaders/#common-file-types).

In [4]:
# Check if data file exists
data_file_path = './data/cleaned_slang_data.csv'
if not os.path.exists(data_file_path):
    raise FileNotFoundError(f"Data file not found: {data_file_path}")

loader = CSVLoader(file_path=data_file_path)
try:
    slang_document = loader.load()
    if not slang_document:
        raise ValueError("No documents loaded from CSV file")
    print(f"Successfully loaded {len(slang_document)} documents")
except Exception as e:
    print(f"Error loading data: {e}")
    raise

Successfully loaded 123172 documents


Now that we've loaded in the data, we can take a quick peek at it to see what we're working with!

In [5]:
print(slang_document[0])
print(type(slang_document))
print(type(slang_document[0]))
print("There are", len(slang_document), "documents")


page_content='word: bank
definition: another cash money word' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
<class 'list'>
<class 'langchain_core.documents.base.Document'>
There are 123172 documents
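
The `page_content` printed above has one `column: value` line per CSV column. To make that layout concrete, here is a rough standard-library sketch that formats rows the same way. It is not LangChain's actual implementation, and the two-row CSV is a made-up stand-in for the real data file.

```python
import csv
import io

# Hypothetical two-row CSV standing in for data/cleaned_slang_data.csv.
sample_csv = io.StringIO(
    "word,definition\n"
    "bank,another cash money word\n"
    "Janky,Far from perfect; messed up\n"
)

def rows_to_page_content(csv_file):
    """Format each CSV row as newline-joined 'column: value' lines."""
    reader = csv.DictReader(csv_file)
    return ["\n".join(f"{col}: {val}" for col, val in row.items()) for row in reader]

pages = rows_to_page_content(sample_csv)
print(pages[0])
```

If your own data has different column names, each one would show up as its own `column: value` line in the same way.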


We can see that we have over 120k slang items, so we're working with quite a large amount of data!

## Chunking our data

Now let's go ahead and chunk our data. Remember that this is cutting up the data into manageable chunks so we can fit it into our vector database (cheatsheet for the LLM)!

Since we're processing over 120k elements, this may take a minute!

In [6]:
# Only necessary if we have too much data to add to the context.
def split_documents(documents: list[Document]):
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False,
  )
  return text_splitter.split_documents(documents)

chunks = split_documents(slang_document)

Let's see what the chunks look like. Most will be identical to the original documents, because only entries longer than the 800-character chunk size get split. That's why the chunk count ends up only slightly higher than the document count.

In [7]:
print(chunks[0])
print(len(chunks))

page_content='word: bank
definition: another cash money word' metadata={'source': './data/cleaned_slang_data.csv', 'row': 0}
124060
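
To see what `chunk_size=800` and `chunk_overlap=80` mean in practice, here is a simplified sketch of overlapping fixed-size windows. It is only an approximation: the real `RecursiveCharacterTextSplitter` also tries to split at natural separators such as paragraph breaks and spaces, which this sketch ignores. The helper name `chunk_text` is made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 80) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short entries pass through unchanged
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

short_entry = "word: bank\ndefinition: another cash money word"
long_entry = "x" * 2000  # stand-in for an unusually long definition

print(len(chunk_text(short_entry)))
print([len(c) for c in chunk_text(long_entry)])
```

The overlap means the last 80 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.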


## Creating Our Embedding Function

Let's start by creating our embedding function. In our case, we want to use a specialized embedding model so it's fast and efficient to get embeddings. Since this embedding model is different from our LLM inference model, we need to pull it using `ollama pull nomic-embed-text`, which you should have already done from the README.

In [8]:
# Alternative approach using LangChain's Ollama embeddings
def get_langchain_embedding_function():
    """Get embedding function using LangChain's Ollama integration"""
    try:
        from langchain_community.embeddings import OllamaEmbeddings
        print("Using LangChain's OllamaEmbeddings")
        
        embeddings = OllamaEmbeddings(
            model="nomic-embed-text",
            base_url="http://localhost:11434"
        )
        
        # Wrap it to be compatible with ChromaDB's expected interface
        class ChromaDBCompatibleEmbeddings:
            def __init__(self, langchain_embeddings):
                self.embeddings = langchain_embeddings
            
            def __call__(self, input):
                if isinstance(input, str):
                    return [self.embeddings.embed_query(input)]
                else:
                    return self.embeddings.embed_documents(input)
        
        return ChromaDBCompatibleEmbeddings(embeddings)
        
    except ImportError as e:
        print(f"LangChain Ollama embeddings not available: {e}")
        print("Install with: pip install langchain-community")
        raise

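# To use LangChain embeddings instead, call this helper in the later cell that
# builds `embedding_function` (an alias defined here would be overwritten when
# the next cell redefines get_embedding_function):
# embedding_function = get_langchain_embedding_function()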

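### Alternative: Using LangChain Ollama Embeddings

If you're having issues with ChromaDB's embedding functions, you can use the `get_langchain_embedding_function` helper defined in the cell above. It relies on LangChain's Ollama embeddings, which are more compatible across versions. The next cell defines the default `get_embedding_function`: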

In [9]:
def get_embedding_function():
    """Get Ollama embedding function with fallback for different ChromaDB versions"""
    try:
        # Try ChromaDB 0.5.x+ style first
        import chromadb.utils.embedding_functions as ef
        if hasattr(ef, 'OllamaEmbeddingFunction'):
            print("Using ChromaDB's OllamaEmbeddingFunction")
            embeddings = ef.OllamaEmbeddingFunction(
                url="http://localhost:11434/api/embeddings",
                model_name="nomic-embed-text",
            )
            return embeddings
    except Exception as e:
        print(f"ChromaDB OllamaEmbeddingFunction not available: {e}")
    
    # Fallback: Use custom implementation
    print("Using custom Ollama embedding function")
    import requests
    
    class CustomOllamaEmbeddings:
        def __init__(self, url="http://localhost:11434", model_name="nomic-embed-text"):
            self.url = url
            self.model_name = model_name
        
        def __call__(self, input):
            """
            Generate embeddings for input text(s)
            Args:
                input: single string or list of strings
            Returns:
                list of embeddings (each embedding is a list of floats)
            """
            if isinstance(input, str):
                texts = [input]
            else:
                texts = input
                
            embeddings = []
            for text in texts:
                try:
                    response = requests.post(
                        f"{self.url}/api/embeddings",
                        json={"model": self.model_name, "prompt": text},
                        timeout=30
                    )
                    if response.status_code == 200:
                        embedding = response.json()["embedding"]
                        embeddings.append(embedding)
                    else:
                        raise Exception(f"Ollama API error: {response.status_code} - {response.text}")
                except Exception as e:
                    print(f"Error getting embedding for text: {e}")
                    # Return zero vector as fallback
                    embeddings.append([0.0] * 768)  # nomic-embed-text produces 768-dim embeddings
                    
            return embeddings
    
    return CustomOllamaEmbeddings()

In [10]:
# Use the robust embedding function
embedding_function = get_embedding_function()

chunk = chunks[0].page_content

# Test the embedding function
embeddings = embedding_function([chunk])
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"First 10 values: {embeddings[0][:10]}")

Using custom Ollama embedding function
Embedding dimension: 768
First 10 values: [1.518409252166748, 1.6257004737854004, -3.643442153930664, -0.38027846813201904, 0.4524954557418823, -0.8147395849227905, 0.3174838125705719, -0.46331557631492615, 0.07126598060131073, -0.31656473875045776]


## Testing Embeddings

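Next, let's compare a few words against our sample chunk. Note that the `pairwise_string_distance` evaluator used here measures how different the two strings are character by character, not how far apart their embeddings are. Treat the scores as a rough sanity check rather than a semantic comparison. A score of 0.0 means the two strings are identical.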

In [11]:
evaluator = load_evaluator("pairwise_string_distance")

print(evaluator.evaluate_string_pairs(prediction="Janky", prediction_b=chunk)) # This should be somewhat close to 0.0

print(evaluator.evaluate_string_pairs(prediction=chunk, prediction_b=chunk)) # This should be 0.0 or very close to it

print(evaluator.evaluate_string_pairs(prediction="pristine", prediction_b=chunk)) # This should be further from 0.0

print(evaluator.evaluate_string_pairs(prediction="brother", prediction_b=chunk)) # This should be even further from 0.0

{'score': 0.4449275362318841}
{'score': 0.0}
{'score': 0.4842995169082126}
{'score': 0.525672877846791}
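
The evaluator above compares the strings themselves. To compare embedding vectors instead, one common option is cosine distance. Here is a minimal sketch using tiny made-up vectors rather than real 768-dimensional embeddings; the helper name `cosine_distance` is ours, not part of any library used in this notebook.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Made-up 3-dim vectors; real nomic-embed-text embeddings have 768 dimensions.
v_chunk = [1.0, 2.0, 0.5]
v_close = [0.9, 2.1, 0.4]
v_far = [-2.0, 0.1, 3.0]

print(cosine_distance(v_chunk, v_chunk))
print(cosine_distance(v_chunk, v_close))
print(cosine_distance(v_chunk, v_far))
```

With the notebook's own objects, the same helper could be called as `cosine_distance(embedding_function(["Janky"])[0], embedding_function([chunk])[0])`.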


## Creating the Vector Database

Now we want to start creating our vector database. This is our LLM's cheatsheet of information that it will use in the future to respond to user queries.

We'll do this with [ChromaDB](https://www.trychroma.com/), which is a vector database!

Let's first set up some variables and clear out any existing items in it (you only need to do this if you're doing a fresh run with brand new data, otherwise we can keep this code commented out).

In [12]:
# Clear the database for our initial run in case it exists.
# if os.path.exists(chroma_path):
#   shutil.rmtree(chroma_path)

Next, let's start up our chroma db! For this you should have already run this command in the terminal from the README!

`chroma run --host localhost --port 8000 --path ./chroma`

In [13]:
# Initialize our chromadb client locally with special port number so we don't conflict with other things running
try:
    client = chromadb.HttpClient(host='localhost', port=8000)
    # Test the connection
    client.heartbeat()
    print("✅ Successfully connected to ChromaDB")
except Exception as e:
    print(f"❌ Failed to connect to ChromaDB: {e}")
    print("Make sure ChromaDB is running with: chroma run --host localhost --port 8000 --path ./chroma")
    raise

collection_name = "llm_rag_collection"

# Create collection WITHOUT embedding function for now (we'll handle embeddings manually)
try:
    collection = client.get_or_create_collection(name=collection_name)
    print(f"✅ Collection '{collection_name}' ready")
except Exception as e:
    print(f"❌ Failed to create/get collection: {e}")
    raise

✅ Successfully connected to ChromaDB
✅ Collection 'llm_rag_collection' ready


## Adding Data to ChromaDB

Next, let's actually add our chunks to Chroma! We'll start by calculating chunk IDs so we can update our data at any time. It takes ~8 seconds for 500 records to be embedded and placed into our database. Since our dataset contains over 120k chunks, embedding all of them at that rate would take roughly half an hour. Instead, we'll only add 500 documents, but feel free to adjust this number as you see fit!
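
As a quick back-of-the-envelope check, we can scale the measured rate (~8 seconds per 500 records) to the 124,060 chunks printed by the chunking cell. This sketch assumes the time grows linearly with the number of chunks, which may not hold exactly on your machine. The function name `estimate_minutes` is made up for illustration.

```python
def estimate_minutes(n_chunks: int, seconds_per_batch: float = 8.0, batch_size: int = 500) -> float:
    """Scale the measured ~8 s per 500 records linearly to n_chunks."""
    return n_chunks / batch_size * seconds_per_batch / 60

print(f"{estimate_minutes(500):.1f} min for 500 chunks")
print(f"{estimate_minutes(124_060):.0f} min for all 124,060 chunks")
```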

In [14]:
def calculate_chunk_ids(chunks):
    """Calculate deterministic chunk IDs using SHA256 hash"""
    chunks_with_id = []
    for chunk in chunks:
        # Use SHA256 for deterministic, secure hashing
        chunk_content = chunk.page_content.encode('utf-8')
        chunk_id = hashlib.sha256(chunk_content).hexdigest()
        
        # Add it to the page meta-data.
        chunk.metadata["id"] = chunk_id
        chunks_with_id.append(chunk)

    return chunks_with_id


def add_to_chroma(chunks: list[Document], embedding_func):
    """Add documents to ChromaDB with error handling and batch processing"""
    if not chunks:
        print("⚠️ No chunks provided to add")
        return
        
    chunks_with_ids = calculate_chunk_ids(chunks)
    
    try:
        # Retrieve existing IDs from the collection
        existing_items = collection.get(include=[])
        existing_ids = set(existing_items["ids"])
        print(f"Number of existing documents in collection: {len(existing_ids)}")

        # Prepare data for new documents
        new_chunk_ids = []
        new_documents = []
        new_metadatas = []
        new_embeddings = []

        # Process chunks in batches for embedding
        batch_size = 10
        for i in range(0, len(chunks_with_ids), batch_size):
            batch_chunks = chunks_with_ids[i:i+batch_size]
            
            # Get texts to embed
            texts_to_embed = []
            chunks_to_add = []
            
            for chunk in batch_chunks:
                chunk_id = chunk.metadata["id"]
                if chunk_id not in existing_ids:
                    texts_to_embed.append(chunk.page_content)
                    chunks_to_add.append(chunk)
            
            if texts_to_embed:
                # Get embeddings for this batch
                try:
                    batch_embeddings = embedding_func(texts_to_embed)
                    
                    # Add to our lists
                    for j, chunk in enumerate(chunks_to_add):
                        new_chunk_ids.append(chunk.metadata["id"])
                        new_documents.append(chunk.page_content)
                        new_metadatas.append(chunk.metadata)
                        new_embeddings.append(batch_embeddings[j])
                        
                except Exception as e:
                    print(f"❌ Error getting embeddings for batch {i//batch_size + 1}: {e}")
                    continue

        if new_chunk_ids:
            print(f"👉 Adding new documents: {len(new_chunk_ids)}")
            # Add documents in batches to avoid memory issues
            add_batch_size = 100
            for i in range(0, len(new_chunk_ids), add_batch_size):
                batch_ids = new_chunk_ids[i:i+add_batch_size]
                batch_docs = new_documents[i:i+add_batch_size]
                batch_metas = new_metadatas[i:i+add_batch_size]
                batch_embeds = new_embeddings[i:i+add_batch_size]
                
                collection.add(
                    ids=batch_ids,
                    documents=batch_docs,
                    metadatas=batch_metas,
                    embeddings=batch_embeds
                )
                print(f"✅ Added batch {i//add_batch_size + 1}: {len(batch_ids)} documents")
        else:
            print("✅ No new documents to add")
            
    except Exception as e:
        print(f"❌ Error adding documents to ChromaDB: {e}")
        raise

# Make this configurable - can be adjusted based on requirements
how_many_documents_to_add = int(os.getenv('DOCUMENTS_TO_ADD', '500'))
print(f"Processing {how_many_documents_to_add} documents...")

# Process documents
print("📚 Processing documents...")
try:
    # Make sure collection and embedding function are defined
    if 'collection' not in globals():
        print("❌ Collection not defined! Please run the ChromaDB connection cell first.")
    elif 'embedding_function' not in globals():
        print("❌ Embedding function not defined! Please run the embedding function cell first.")
    else:
        add_to_chroma(chunks[:how_many_documents_to_add], embedding_function)
except Exception as e:
    print(f"Failed to add documents: {e}")
    raise

Processing 500 documents...
📚 Processing documents...
Number of existing documents in collection: 900
✅ No new documents to add
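
The run above reports no new documents because every chunk ID it computed was already in the collection. Here is a small standalone sketch of that idea, with a plain Python set standing in for the IDs stored in ChromaDB; the names `chunk_id` and `incoming` are ours.

```python
import hashlib

def chunk_id(text: str) -> str:
    """Same text always yields the same SHA-256 hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Stand-in for the IDs already stored in the collection.
existing_ids = {chunk_id("word: bank\ndefinition: another cash money word")}

incoming = [
    "word: bank\ndefinition: another cash money word",  # already stored
    "word: Janky\ndefinition: Far from perfect; messed up",  # new
]

# Keep only texts whose ID is not in the collection yet.
new_texts = [t for t in incoming if chunk_id(t) not in existing_ids]
print(len(new_texts))
```

One consequence of hashing only the text: two rows with exactly the same content get the same ID, so only one of them can be stored.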


## Verify Data in Vector Database

Let's check that our data was successfully added to the vector database:

In [15]:
# Verify the collection has data
collection_count = collection.count()
print(f"Total documents in collection: {collection_count}")

# Test a query
if collection_count > 0:
    test_query = "What does 'jank' mean?"
    print(f"\nTesting query: '{test_query}'")
    
    # Get embedding for query
    query_embedding = embedding_function([test_query])[0]
    
    # Search for similar documents
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    
    print("\nTop 3 results:")
    for i, doc in enumerate(results['documents'][0]):
        print(f"\n{i+1}. {doc[:200]}...")
else:
    print("\n⚠️ No documents in collection yet. Please run the data loading cells above.")

Total documents in collection: 900

Testing query: 'What does 'jank' mean?'

Top 3 results:

1. word: Janky
definition: Undesirable; less-than optimum....

2. word: Janky
definition: Far from perfect; messed up...

3. word: Janky\ndefinition: Far from perfect; messed up...
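
Conceptually, the query step finds the stored vectors that lie closest to the query vector. Here is a brute-force sketch of that idea with made-up 2-dimensional vectors. It uses squared Euclidean distance purely for illustration; a real vector database typically relies on an approximate nearest-neighbor index instead of scanning every vector, and the distance metric depends on how the collection is configured.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query: list[float], stored: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k document texts whose vectors lie closest to the query."""
    ranked = sorted(stored, key=lambda doc: squared_l2(query, stored[doc]))
    return ranked[:k]

# Made-up 2-dim vectors keyed by document text (hypothetical values).
stored_vectors = {
    "word: Janky": [0.9, 0.1],
    "word: bank": [0.1, 0.9],
    "word: pristine": [-0.8, 0.2],
}

print(top_k([1.0, 0.0], stored_vectors, k=2))
```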


## 🎉 Congratulations!

You've successfully set up your RAG system! The vector database is now populated with slang definitions.

### Next Steps:
1. You can now close this notebook
2. Open the React frontend at http://localhost:3000
3. Start asking questions about slang terms!

### Tips:
- The more documents you add (by increasing `DOCUMENTS_TO_ADD`), the more slang terms the RAG system can look up when answering
- You can re-run the data loading cell to add more documents
- Chunk IDs are SHA-256 hashes of the chunk text, and chunks whose IDs already exist in the collection are skipped, so duplicate documents won't be added twice
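
Since the data loading cell reads `DOCUMENTS_TO_ADD` from the environment, one way to raise the limit from inside the notebook is to set the variable before re-running that cell. This is a sketch, and the value 2000 is an arbitrary example.

```python
import os

# Set the variable before re-running the cell that reads it.
os.environ["DOCUMENTS_TO_ADD"] = "2000"  # 2000 is an arbitrary example value

# Same lookup the data loading cell performs; falls back to 500 when unset.
how_many_documents_to_add = int(os.getenv("DOCUMENTS_TO_ADD", "500"))
print(f"Processing {how_many_documents_to_add} documents...")
```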