# üáÆüá≥ Marathi Dictionary RAG - Phase 1
## Embeddings and Vector Search

**What we'll do in this notebook:**
1. Load the MahaSBERT model (the "brain" that understands Marathi)
2. See how words become numbers (embeddings)
3. Load your dictionary data
4. Create embeddings for all entries
5. Store them in ChromaDB
6. Search and find words!

Let's go! 🚀

---
## Step 1: Check Everything is Installed

Run this cell first. If you see errors, go back to the terminal and run:
```
pip install -r requirements.txt
```
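
If you'd rather see which packages are missing before importing anything, a minimal sketch along these lines may help. It uses only the standard library. The helper name `find_missing` is made up for this example, and the package list is an assumption based on the imports in the check cell.

```python
from importlib.util import find_spec


def find_missing(packages):
    """Return the names in `packages` that cannot be found in this environment."""
    return [name for name in packages if find_spec(name) is None]


# Assumed list: the top-level modules this notebook imports
required = ["torch", "sentence_transformers", "chromadb", "tqdm"]

missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found")
```

`find_spec` only looks a module up, it does not load it, so this check stays fast even for large packages like PyTorch.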

In [None]:
# Let's check all our packages are installed
import sys
print(f"Python version: {sys.version}")

# These should all work without errors
import torch
print(f"✅ PyTorch version: {torch.__version__}")

import sentence_transformers
print(f"✅ Sentence Transformers version: {sentence_transformers.__version__}")

import chromadb
print(f"✅ ChromaDB version: {chromadb.__version__}")

import json
print(f"✅ JSON module ready")

from tqdm import tqdm
print(f"✅ TQDM (progress bars) ready")

print("\n🎉 Everything is installed! Let's continue.")

---
## Step 2: Load MahaSBERT Model

### What's happening here?

MahaSBERT is like a translator that converts words into numbers. It was trained on millions of Marathi sentences, so it "understands" Marathi.

**First time running this?** It will download the model (~400MB). This only happens once - after that, it's saved on your computer.

☕ This might take 1-2 minutes the first time.

In [None]:
from sentence_transformers import SentenceTransformer

# This is the magic line - loading the Marathi-understanding model
print("Loading MahaSBERT model... (this takes a minute the first time)")

model = SentenceTransformer('l3cube-pune/marathi-sentence-similarity-sbert')

print("✅ Model loaded!")
print(f"   Model creates vectors with {model.get_sentence_embedding_dimension()} dimensions")

---
## Step 3: See How Embeddings Work

Let's turn some Marathi words into numbers and see what happens!

### The Big Idea:
- Similar words → Similar numbers
- Different words → Different numbers
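
To make "similar numbers" concrete, here is a small sketch of cosine similarity in plain Python. The three vectors are made-up toy values, not real MahaSBERT output, and the function `cosine_similarity` is written here just for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented; real ones have hundreds of dimensions)
water = [0.9, 0.1, 0.0]
river = [0.8, 0.2, 0.1]
cat = [0.0, 0.2, 0.9]

print(f"water vs river: {cosine_similarity(water, river):.3f}")
print(f"water vs cat:   {cosine_similarity(water, cat):.3f}")
```

Vectors that point in nearly the same direction score close to 1, while vectors that point in unrelated directions score close to 0.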

In [None]:
# Let's embed a single word
word = "पाणी"

# Turn it into numbers!
embedding = model.encode(word)

print(f"Word: {word}")
print(f"Embedding shape: {embedding.shape}")  # Should be (768,)
print(f"\nFirst 10 numbers: {embedding[:10]}")
print(f"\nThis word is now represented by {len(embedding)} numbers!")

In [None]:
# Now let's compare similar vs different words
# We'll use "cosine similarity" - a score from -1 to 1
# 1 = identical, 0 = unrelated, -1 = opposite

from sentence_transformers import util

# Water-related words (should be similar)
water_words = ["पाणी", "जल", "पाऊस", "नदी"]

# Unrelated word
unrelated = "मांजर"  # cat

# Get embeddings for all
water_embeddings = model.encode(water_words)
cat_embedding = model.encode(unrelated)

print("🌊 Comparing water-related words to 'पाणी' (water):\n")

pani_embedding = water_embeddings[0]  # पाणी

for i, word in enumerate(water_words):
    similarity = util.cos_sim(pani_embedding, water_embeddings[i]).item()
    bar = "█" * int(similarity * 20)
    print(f"  पाणी ↔ {word:8} : {similarity:.3f} {bar}")

print("\n🐱 Comparing to unrelated word 'मांजर' (cat):\n")
similarity = util.cos_sim(pani_embedding, cat_embedding).item()
bar = "█" * int(similarity * 20)
print(f"  पाणी ↔ मांजर   : {similarity:.3f} {bar}")

print("\n👆 See how water words have HIGH similarity (close to 1.0)?")
print("   But 'cat' has LOWER similarity? That's embeddings working!")

---
## Step 4: Load Your Dictionary

Now let's load the Berntsen dictionary you processed.

**Make sure** you've copied `berntsen_dictionary_processed.json` to the `data/` folder!
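
The cells that follow read several keys from every entry, so it can be worth checking for them before embedding anything. This is a minimal sketch: the key names are taken from the code in this notebook, while the helper `missing_keys` and the sample entry's values are invented for illustration.

```python
# Keys that later cells access with entry[...] (taken from this notebook's code)
REQUIRED_KEYS = {
    "entry_id", "headword_devanagari", "entry_type",
    "source_page", "full_entry", "search_text",
}


def missing_keys(entry):
    """Return the required keys that are absent from one entry, sorted by name."""
    return sorted(REQUIRED_KEYS - set(entry))


# Hypothetical entry: these values are made up, not copied from the dictionary
sample = {
    "entry_id": "example_0001",
    "headword_devanagari": "पाणी",
    "entry_type": "headword",
    "source_page": 1,
    "full_entry": "पाणी water",
    "search_text": "पाणी water",
}

print(missing_keys(sample))
print(missing_keys({"entry_id": "x"}))
```

Once the dictionary is loaded, running `missing_keys` over every entry would point to incomplete records before they cause a `KeyError` in a later cell.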

In [None]:
import json
from pathlib import Path

# Load the dictionary
data_path = Path("../data/berntsen_dictionary_processed.json")

# Check if file exists
if not data_path.exists():
    print(f"❌ File not found at: {data_path.absolute()}")
    print("\n📁 Please copy your berntsen_dictionary_processed.json to the data/ folder")
else:
    with open(data_path, 'r', encoding='utf-8') as f:
        dictionary = json.load(f)

    print(f"✅ Loaded dictionary with {len(dictionary):,} entries!")
    print(f"\n📖 First entry looks like this:\n")
    print(json.dumps(dictionary[0], indent=2, ensure_ascii=False))

In [None]:
# Let's see what kinds of entries we have
entry_types = {}
for entry in dictionary:
    t = entry.get('entry_type', 'unknown')
    entry_types[t] = entry_types.get(t, 0) + 1

print("📊 Entry types in your dictionary:\n")
for entry_type, count in entry_types.items():
    print(f"   {entry_type}: {count:,}")

---
## Step 5: Create Embeddings for ALL Entries

Now the real work! We'll:
1. Take each dictionary entry
2. Use the `search_text` field (which has Devanagari + romanized + definitions)
3. Turn it into an embedding

**This will take a few minutes** for 5,000 entries. You'll see a progress bar!
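
Batching itself is simple to picture. The sketch below splits a list into fixed-size chunks using placeholder strings instead of real `search_text` values; the helper name `batches` is made up for this example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data: 150 placeholder strings
texts = [f"entry {i}" for i in range(150)]

sizes = [len(batch) for batch in batches(texts, 64)]
print(sizes)  # [64, 64, 22] - the last batch holds whatever is left over
```

As far as I know, `model.encode` also accepts a `batch_size` argument and can do this chunking internally; the explicit loop is mainly useful for showing a progress bar per batch.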

In [None]:
from tqdm import tqdm

# We'll embed the 'search_text' field - it contains the most useful info
# Let's first check a few examples

print("📝 Examples of 'search_text' we'll embed:\n")
for entry in dictionary[:3]:
    print(f"  • {entry['search_text'][:80]}...\n")

In [None]:
# Now let's create ALL embeddings
# We'll process in batches for efficiency

print("🔄 Creating embeddings for all dictionary entries...")
print("   (This takes 2-5 minutes depending on your computer)\n")

# Extract all search texts
search_texts = [entry['search_text'] for entry in dictionary]

# Create embeddings in batches (faster than one at a time)
batch_size = 64  # Process 64 entries at a time

all_embeddings = []

for i in tqdm(range(0, len(search_texts), batch_size), desc="Embedding batches"):
    batch = search_texts[i:i + batch_size]
    batch_embeddings = model.encode(batch, show_progress_bar=False)
    all_embeddings.extend(batch_embeddings)

print(f"\n✅ Created {len(all_embeddings):,} embeddings!")
print(f"   Each embedding has {len(all_embeddings[0])} dimensions")

---
## Step 6: Store in ChromaDB

Now we'll put everything in ChromaDB - our vector database.

Think of ChromaDB like a super-organized library where:
- Each book (dictionary entry) has a location based on its meaning
- We can instantly find books that are "nearby" (similar meaning)
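
To see what "nearby" means, here is a toy version of the lookup in plain Python. It simply measures the distance from the query to every stored vector and keeps the closest ones. The 2-D vectors and the names `l2_distance` and `nearest` are invented for this sketch. A real vector database like ChromaDB uses an index so it does not have to scan every entry.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, library, n_results=2):
    """Return the n_results (name, distance) pairs closest to `query`."""
    scored = [(name, l2_distance(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Made-up 2-D "locations" for three entries
library = {
    "water": [1.0, 0.0],
    "river": [0.9, 0.2],
    "cat": [0.0, 1.0],
}

for name, dist in nearest([1.0, 0.1], library):
    print(f"{name}: {dist:.3f}")
```

The query sits right next to "water" and "river", so those two come back first and "cat" is left out.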

In [None]:
import chromadb

# Create a ChromaDB client that saves to disk
# This means your database persists even after you close the notebook!

chroma_path = "../chroma_db"

client = chromadb.PersistentClient(path=chroma_path)

print(f"✅ ChromaDB client created!")
print(f"   Data will be saved to: {chroma_path}")

In [None]:
# Create (or get) a collection for our dictionary
# A "collection" is like a folder that holds related items

# Delete existing collection if it exists (so we can start fresh)
try:
    client.delete_collection(name="berntsen_dictionary")
    print("🗑️  Deleted existing collection to start fresh")
except Exception:
    # Nothing to delete on the first run; the exception type may vary by ChromaDB version
    pass

# Create new collection
collection = client.create_collection(
    name="berntsen_dictionary",
    metadata={"description": "Berntsen Marathi-English Dictionary"}
)

print(f"✅ Created collection: 'berntsen_dictionary'")

In [None]:
# Now add all entries to the collection
# We'll include metadata so we can filter and display results nicely

print("📥 Adding entries to ChromaDB...\n")

# Prepare data for ChromaDB
ids = []
embeddings_list = []
documents = []
metadatas = []

for i, entry in enumerate(tqdm(dictionary, desc="Preparing entries")):
    ids.append(str(entry['entry_id']))  # ChromaDB expects string IDs
    embeddings_list.append(all_embeddings[i].tolist())  # Convert numpy to list
    documents.append(entry['search_text'])
    
    # Metadata - extra info we want to store and filter by
    metadata = {
        'headword': entry['headword_devanagari'],
        'romanized': entry.get('headword_romanized', ''),
        'entry_type': entry['entry_type'],
        'source_page': entry['source_page'],
        'full_entry': entry['full_entry'],
        'source': 'berntsen'  # Will be useful when we add more dictionaries!
    }
    
    # Add part of speech if available
    if entry.get('definitions') and len(entry['definitions']) > 0:
        first_def = entry['definitions'][0]
        if first_def.get('pos'):
            metadata['pos'] = first_def['pos']
        if first_def.get('gender'):
            metadata['gender'] = first_def['gender']
    
    metadatas.append(metadata)

print(f"\n✅ Prepared {len(ids):,} entries")

In [None]:
# Add everything to ChromaDB
# We'll do this in batches because ChromaDB has limits

batch_size = 500  # ChromaDB works well with batches of 500

print("📥 Uploading to ChromaDB...\n")

for i in tqdm(range(0, len(ids), batch_size), desc="Uploading batches"):
    end_idx = min(i + batch_size, len(ids))
    
    collection.add(
        ids=ids[i:end_idx],
        embeddings=embeddings_list[i:end_idx],
        documents=documents[i:end_idx],
        metadatas=metadatas[i:end_idx]
    )

print(f"\n✅ Successfully added {collection.count():,} entries to ChromaDB!")

---
## Step 7: Let's Search! 🔍

The exciting part! Let's test our system.

We'll:
1. Take a Marathi word
2. Convert it to an embedding
3. Find similar entries in ChromaDB
4. Display the results!
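
ChromaDB returns a distance for each match, where smaller means closer. The display code in this notebook turns that into a "match score" with `1 / (1 + distance)`. The sketch below shows how that conversion behaves on a few made-up distances; the function name `distance_to_score` is chosen here for illustration.

```python
def distance_to_score(distance):
    """Map a non-negative distance to a score in (0, 1]; smaller distance, higher score."""
    return 1 / (1 + distance)


# Invented distances, just to show the shape of the conversion
for d in [0.0, 0.5, 1.0, 4.0]:
    print(f"distance {d:>4}: score {distance_to_score(d):.2%}")
```

The score is handy for ranking results against each other, but it is not a probability, so treat the percentage as a rough guide only.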

In [None]:
def search_dictionary(query, n_results=5):
    """
    Search the dictionary for entries similar to the query.
    
    Args:
        query: A Marathi word or phrase to search for
        n_results: How many results to return (default 5)
    
    Returns:
        Results from ChromaDB with documents, metadata, and distances
    """
    # Step 1: Convert query to embedding
    query_embedding = model.encode(query).tolist()
    
    # Step 2: Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )
    
    return results


def display_results(query, results):
    """
    Display search results in a nice format.
    """
    print(f"\n🔍 Search: '{query}'")
    print("=" * 60)
    
    if not results['ids'][0]:
        print("No results found.")
        return
    
    for i, (metadata, distance) in enumerate(zip(
        results['metadatas'][0],
        results['distances'][0]
    )):
        # Convert distance to a 0-1 score (lower distance = higher score)
        # ChromaDB's default space is L2, which it reports as squared L2 distance
        similarity = 1 / (1 + distance)  # Simple ranking aid, not a true percentage
        
        print(f"\n{i+1}. {metadata['headword']}")
        if metadata.get('romanized'):
            print(f"   ({metadata['romanized']})")
        print(f"   📖 {metadata['full_entry']}")
        print(f"   📊 Match score: {similarity:.2%}")
        print(f"   📄 Source: {metadata['source']}, page {metadata['source_page']}")

print("✅ Search functions ready!")

In [None]:
# TEST 1: Simple word lookup
query = "पाणी"
results = search_dictionary(query)
display_results(query, results)

In [None]:
# TEST 2: Semantic search - find related words!
query = "water"  # English query - will it find Marathi water words?
results = search_dictionary(query)
display_results(query, results)

In [None]:
# TEST 3: Try a concept
query = "खाणे"  # eating
results = search_dictionary(query)
display_results(query, results)

In [None]:
# TEST 4: Your turn! Try any word
query = "आई"  # mother - change this to anything!
results = search_dictionary(query)
display_results(query, results)

---
## 🎉 Phase 1 Complete!

### What you built:
1. ✅ Loaded MahaSBERT - a model that understands Marathi
2. ✅ Created embeddings for 5,000+ dictionary entries
3. ✅ Stored everything in ChromaDB (saved to disk!)
4. ✅ Built a working search function

### What's saved:
- Your ChromaDB database is saved in the `chroma_db/` folder
- You can close this notebook and the data persists!

### What's next (Phase 2):
- Add an LLM (Claude Haiku) to make responses smarter
- Handle morphology (पाण्याला → पाणी)
- Better formatting of results

---

## Bonus: Interactive Search Cell

Run this cell and type any word to search!

In [None]:
# Interactive search - run this and enter words!
while True:
    query = input("\n🔍 Enter a Marathi word (or 'quit' to exit): ").strip()
    if query.lower() == 'quit':
        print("👋 Goodbye!")
        break
    if not query:
        continue  # Ignore empty input instead of searching for nothing
    results = search_dictionary(query)
    display_results(query, results)