# Email Wizard Assistant: Model Implementation

This notebook implements the retrieval-augmented generation (RAG) model for the Email Wizard Assistant. We'll embed the preprocessed emails, set up the retrieval system, and implement response generation.

In [6]:
# Install missing libraries
%pip install -q chromadb faiss-cpu sentence-transformers onnxruntime


Note: you may need to restart the kernel to use updated packages.





In [10]:
pip install onnxruntime-directml


Collecting onnxruntime-directml
  Downloading onnxruntime_directml-1.21.1-cp312-cp312-win_amd64.whl.metadata (4.9 kB)
Downloading onnxruntime_directml-1.21.1-cp312-cp312-win_amd64.whl (24.0 MB)




In [1]:
# Install required packages if not installed
try:
    import chromadb
    import faiss
    import sentence_transformers
except ImportError:
    %pip install chromadb faiss-cpu sentence-transformers

# Standard imports
import os
import sys
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import time

# Add project root to path to access local modules
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

# Local project imports
try:
    from src.data.dataset import load_dataset
    from src.model.embeddings import EmailEmbedder, ChromaDBStore
    from src.model.retriever import EmailRetriever, ChromaDBRetriever
    from src.model.generator import ResponseGenerator, RAGPipeline
    from src.utils.helpers import time_function, save_json, load_json
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        f"Could not import project modules. Make sure you are running this notebook from within the project structure. Error: {e}"
    )


## 1. Load Preprocessed Emails

First, let's load the preprocessed emails from the previous notebook.

In [3]:
# Load preprocessed emails
processed_emails = load_dataset(
    "../data/processed/processed_emails.json",
    is_processed=True
)

print(f"Loaded {len(processed_emails)} preprocessed emails")

Loaded 59 preprocessed emails


## 2. Embed Emails

Now, let's embed the preprocessed emails using a pre-trained Sentence Transformer model.

In [4]:
# Initialize the embedder
embedder = EmailEmbedder(model_name="all-MiniLM-L6-v2")

# Embed the emails
@time_function
def embed_emails(emails):
    return embedder.embed_emails(emails)

emails_with_embeddings = embed_emails(processed_emails)

# Save the embeddings
os.makedirs("../data/embeddings", exist_ok=True)
embedder.save_embeddings(
    emails_with_embeddings,
    "../data/embeddings/email_embeddings.json"
)

print(f"Embedded and saved {len(emails_with_embeddings)} emails")



Function embed_emails took 9.9967 seconds to execute
Saved embeddings for 59 emails to ../data/embeddings/email_embeddings.json
Embedded and saved 59 emails
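
The timing line in the output above comes from the `time_function` decorator imported from `src.utils.helpers`, whose source is not shown in this notebook. A minimal sketch of how such a decorator could be written, assuming it only wraps the call and prints the elapsed time (the name `time_function_sketch` and the toy workload are hypothetical):

```python
import functools
import time


def time_function_sketch(func):
    """Hypothetical stand-in for src.utils.helpers.time_function."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Function {func.__name__} took {elapsed:.4f} seconds to execute")
        return result  # pass the original return value through unchanged

    return wrapper


@time_function_sketch
def square_all(values):
    # Toy workload standing in for embedder.embed_emails
    return [v * v for v in values]


squares = square_all([1, 2, 3])
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock, which makes it a safer choice than `time.time()` for measuring short intervals.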


Let's examine the embeddings to understand their structure:

In [5]:
# Examine the embeddings
sample_email = emails_with_embeddings[0]

print(f"Email ID: {sample_email['id']}")
print(f"Embedding shape: {np.array(sample_email['embedding']).shape}")

if 'chunk_embeddings' in sample_email:
    print(f"Number of chunks: {len(sample_email['chunks'])}")
    print(f"Chunk embeddings shape: {np.array(sample_email['chunk_embeddings']).shape}")

Email ID: email_2
Embedding shape: (384,)
Number of chunks: 1
Chunk embeddings shape: (1, 384)
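
The output shows one 384-dimensional vector per email plus a `(num_chunks, 384)` array of chunk embeddings. How `EmailEmbedder` combines chunks into the email-level vector is not shown here. One common option, sketched below as an assumption, is to average the chunk vectors and L2-normalize the result. The function name `pool_chunks` and the random data are illustrative only.

```python
import numpy as np


def pool_chunks(chunk_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool chunk vectors into one unit-length email vector (assumed scheme)."""
    chunk_embeddings = np.atleast_2d(np.asarray(chunk_embeddings, dtype=np.float32))
    pooled = chunk_embeddings.mean(axis=0)
    norm = np.linalg.norm(pooled)
    # Guard against division by zero for an all-zero input
    return pooled / norm if norm > 0 else pooled


rng = np.random.default_rng(0)
fake_chunks = rng.normal(size=(3, 384))  # 3 made-up chunks, 384 dims as in the output above
email_vector = pool_chunks(fake_chunks)
print(email_vector.shape)
```

With a single chunk, as in the sample email above, this pooling reduces to normalizing that one chunk vector.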


## 3. Set Up ChromaDB for Vector Storage

Now, let's set up ChromaDB for efficient vector storage and retrieval.

In [7]:
# Initialize ChromaDB store
chroma_store = ChromaDBStore(
    collection_name="email_embeddings",
    persist_directory="../data/embeddings/chroma_db",
    embedding_function=embedder.embedding_function
)

# Add emails to ChromaDB
@time_function
def add_emails_to_chroma(emails):
    chroma_store.add_emails(emails)

# Check if collection is empty before adding
collection_stats = chroma_store.get_collection_stats()
print(f"Collection stats: {collection_stats}")

if collection_stats["count"] == 0:
    add_emails_to_chroma(processed_emails)
    print(f"Added {len(processed_emails)} emails to ChromaDB")
else:
    print(f"ChromaDB already contains {collection_stats['count']} emails")

Using existing collection: email_embeddings
Collection stats: {'collection_name': 'email_embeddings', 'count': 10}
ChromaDB already contains 10 emails
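
Note that the existing collection reports 10 entries while 59 emails were loaded, so the `count == 0` guard leaves the collection untouched. If the intent is to keep the store in sync with the dataset, one option is to add only the emails whose IDs are not stored yet. The sketch below shows that ID comparison in plain Python. `find_missing_emails` and the toy IDs are hypothetical, and fetching the stored IDs from `ChromaDBStore` is left out because its API is not shown in this notebook.

```python
def find_missing_emails(emails, stored_ids):
    """Return the emails whose 'id' is not in the already-stored ID set."""
    stored = set(stored_ids)
    return [email for email in emails if email["id"] not in stored]


# Toy data: five emails loaded, two already present in the store
loaded = [{"id": f"email_{i}"} for i in range(5)]
already_stored = ["email_0", "email_3"]

to_add = find_missing_emails(loaded, already_stored)
print([email["id"] for email in to_add])
```

Passing only `to_add` to the store would make the cell safe to rerun without duplicating entries or silently keeping a stale subset.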


## 4. Implement Similarity Search

Let's implement and test similarity search using two backends: a FAISS index built from the stored embeddings, and the ChromaDB collection.

In [8]:
# Initialize retrievers
vector_retriever = EmailRetriever(
    embedder=embedder,
    use_faiss=True,
    index_path="../data/embeddings/faiss_index.bin"
)

# Build the index
@time_function
def build_index(emails):
    vector_retriever.build_index(emails)

build_index(emails_with_embeddings)

# Initialize ChromaDB retriever
chroma_retriever = ChromaDBRetriever(chroma_store=chroma_store)

Saved FAISS index to ../data/embeddings/faiss_index.bin
Built search index with 59 emails
Function build_index took 0.0112 seconds to execute
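
For intuition about what a similarity index computes, here is a brute-force sketch of top-k cosine similarity search in NumPy. It is not the code inside `EmailRetriever`. It normalizes all vectors to unit length, in which case the inner product equals cosine similarity. The function `top_k_cosine` and the random vectors are illustrative.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`."""
    # Normalize so that the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # highest score first
    return top, scores[top]


rng = np.random.default_rng(42)
corpus = rng.normal(size=(59, 384))  # made-up stand-in for 59 email vectors
query_vec = corpus[7] + 0.01 * rng.normal(size=384)  # slightly perturbed copy of row 7

indices, sims = top_k_cosine(query_vec, corpus, k=3)
print(indices, sims)
```

With a corpus this small, an exhaustive scan is already fast; an approximate index mainly pays off at much larger scales.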


Now, let's test the retrieval with some sample queries:

In [9]:
# Test queries
test_queries = [
    "What's the status of the project?",
    "When is the next team meeting?",
    "Can you provide an update on the budget?",
    "Is there any issue with the system?",
    "What are the plans for the weekend?"
]

# Test vector retrieval
print("Vector Retrieval Results:")
for query in test_queries:
    print(f"\nQuery: {query}")
    
    # Retrieve similar emails
    start_time = time.time()
    results = vector_retriever.retrieve(query, top_k=3)
    end_time = time.time()
    
    print(f"Retrieved {len(results)} emails in {end_time - start_time:.4f} seconds")
    
    # Display results
    for i, result in enumerate(results):
        metadata = result.get('metadata', {})
        similarity = result.get('similarity_score', 0.0)
        print(f"Result {i+1}: {metadata.get('subject', '')} (Similarity: {similarity:.4f})")

Vector Retrieval Results:

Query: What's the status of the project?
Retrieved 3 emails in 0.1001 seconds
Result 1: Project Status Update (Similarity: 0.9916)
Result 2: Budget Approval Request (Similarity: 0.9896)
Result 3: Budget Approval Request (Similarity: 0.9895)

Query: When is the next team meeting?
Retrieved 3 emails in 0.0236 seconds
Result 1: Meeting Minutes: Strategic Planning (Similarity: 0.9899)
Result 2: Meeting Minutes: Strategic Planning (Similarity: 0.9897)
Result 3: Meeting Minutes: Strategic Planning (Similarity: 0.9882)

Query: Can you provide an update on the budget?
Retrieved 3 emails in 0.0308 seconds
Result 1: Budget Approval Request (Similarity: 0.9928)
Result 2: Budget Approval Request (Similarity: 0.9926)
Result 3: Budget Approval Request (Similarity: 0.9925)

Query: Is there any issue with the system?
Retrieved 3 emails in 0.0243 seconds
Result 1: System Outage Notification (Similarity: 0.9862)
Result 2: System Outage Notification (Similarity: 0.9862)
Result 

In [10]:
# Test ChromaDB retrieval
print("ChromaDB Retrieval Results:")
for query in test_queries:
    print(f"\nQuery: {query}")
    
    # Retrieve similar emails
    start_time = time.time()
    results = chroma_retriever.retrieve(query, top_k=3)
    end_time = time.time()
    
    print(f"Retrieved {len(results)} emails in {end_time - start_time:.4f} seconds")
    
    # Display results
    for i, result in enumerate(results):
        metadata = result.get('metadata', {})
        similarity = result.get('similarity_score', 0.0)
        print(f"Result {i+1}: {metadata.get('subject', '')} (Similarity: {similarity:.4f})")

ChromaDB Retrieval Results:

Query: What's the status of the project?
Retrieved 3 emails in 0.2077 seconds
Result 1: Project Status Update (Similarity: 0.6035)
Result 2: Re: Project Status Update (Similarity: 0.4714)
Result 3: Deployment Schedule Update (Similarity: 0.3296)

Query: When is the next team meeting?
Retrieved 3 emails in 0.0415 seconds
Result 1: Re: Project Status Update (Similarity: 0.5651)
Result 2: Deployment Schedule Update (Similarity: 0.3919)
Result 3: Project Status Update (Similarity: 0.3821)

Query: Can you provide an update on the budget?
Retrieved 3 emails in 0.0322 seconds
Result 1: Budget Approval Request (Similarity: 0.5821)
Result 2: Project Status Update (Similarity: 0.4161)
Result 3: Re: Project Status Update (Similarity: 0.2969)

Query: Is there any issue with the system?
Retrieved 3 emails in 0.0366 seconds
Result 1: System Outage Notification (Similarity: 0.2610)
Result 2: Project Status Update (Similarity: 0.1489)
Result 3: Code Review Feedback (Simila
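
The two retrievers report scores on very different scales (around 0.99 for the FAISS-backed retriever, roughly 0.15 to 0.60 for ChromaDB), so the numbers should not be compared across backends. How each retriever derives `similarity_score` is not shown in this notebook. As a hedged illustration: ChromaDB returns distances rather than similarities, and the conversion depends on the collection's distance metric. The helper below is hypothetical and assumes unit-length embeddings for the squared-L2 case.

```python
import numpy as np


def distance_to_similarity(distance: float, space: str) -> float:
    """Convert a distance into a cosine similarity (hypothetical helper).

    Assumes 'cosine' means 1 - cos(a, b) and 'l2' means squared Euclidean
    distance between unit-length vectors.
    """
    if space == "cosine":
        return 1.0 - distance
    if space == "l2":
        # For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
        return 1.0 - distance / 2.0
    raise ValueError(f"Unsupported space: {space}")


# Check both conversions against a directly computed cosine similarity
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])  # unit length, cos(a, b) = 0.6
cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

sim_from_cosine = distance_to_similarity(1.0 - cosine, "cosine")
sim_from_l2 = distance_to_similarity(squared_l2, "l2")
print(sim_from_cosine, sim_from_l2)
```

Both conversions recover the same cosine similarity, which is why it matters to know which metric a collection was created with before interpreting its scores.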

## 5. Implement Response Generation

Now, let's implement response generation using a pre-trained language model, `google/flan-t5-base`.

In [11]:
# Initialize the generator
generator = ResponseGenerator(model_name="google/flan-t5-base")

# Test response generation
print("Response Generation:")
for query in test_queries[:2]:  # Use only the first two queries to save time
    print(f"\nQuery: {query}")
    
    # Retrieve similar emails
    retrieved_emails = chroma_retriever.retrieve(query, top_k=3)
    
    # Generate response
    start_time = time.time()
    response = generator.generate_response(query, retrieved_emails)
    end_time = time.time()
    
    print(f"Generated response in {end_time - start_time:.4f} seconds")
    print(f"Response: {response}")



Loaded response generator model: google/flan-t5-base on cpu
Response Generation:

Query: What's the status of the project?


Token indices sequence length is longer than the specified maximum sequence length for this model (625 > 512). Running this sequence through the model will result in indexing errors


Generated response in 6.4100 seconds
Response: Subject: Deployment Schedule Changes

Query: When is the next team meeting?
Generated response in 3.9468 seconds
Response: Subject: Project Status Update
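
The tokenizer warning above (625 > 512) means the prompt built from the retrieved emails is longer than the maximum sequence length specified for the model, which may be one reason the responses echo a subject line rather than answer the question. One mitigation is to cap the context before generation. The sketch below greedily packs retrieved emails into a fixed token budget. It counts whitespace-separated words as a rough stand-in for the model tokenizer, and `build_context`, the `'text'` key, the prompt layout, and the budget values are assumptions rather than part of `ResponseGenerator`.

```python
def build_context(query, emails, max_tokens=512, reserved_for_query=64):
    """Greedily pack email texts into a token budget (word count as a proxy)."""
    budget = max_tokens - reserved_for_query
    parts, used = [], 0
    for email in emails:  # emails are assumed to be sorted by relevance
        remaining = budget - used
        if remaining <= 0:
            break
        words = email["text"].split()
        # Truncate the last email that only partly fits
        kept = words[:remaining]
        parts.append(" ".join(kept))
        used += len(kept)
    return f"Context: {' | '.join(parts)}\nQuestion: {query}", used


retrieved = [
    {"text": "word " * 300},  # made-up long emails
    {"text": "word " * 300},
    {"text": "word " * 300},
]
prompt, tokens_used = build_context("What's the status of the project?", retrieved)
print(tokens_used)
```

In a real pipeline the word count would be replaced by the generator's own tokenizer, since subword tokenization usually yields more tokens than words.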


## 6. Implement End-to-End RAG Pipeline

Next, let's implement the end-to-end RAG pipeline that combines retrieval and generation.

In [12]:
# Initialize the RAG pipeline
rag_pipeline = RAGPipeline(
    retriever=chroma_retriever,
    generator=generator,
    top_k=3
)

# Test the RAG pipeline
print("RAG Pipeline:")
for query in test_queries:
    print(f"\nQuery: {query}")
    
    # Process the query
    start_time = time.time()
    result = rag_pipeline.process_query(query)
    end_time = time.time()
    
    print(f"Processed query in {end_time - start_time:.4f} seconds")
    print(f"Response: {result['response']}")
    print(f"Retrieved {len(result['retrieved_emails'])} emails")

RAG Pipeline:

Query: What's the status of the project?
Processed query in 5.6071 seconds
Response: Subject: Deployment Schedule Changes
Retrieved 3 emails

Query: When is the next team meeting?
Processed query in 5.9591 seconds
Response: Subject: Project Status Update
Retrieved 3 emails

Query: Can you provide an update on the budget?
Processed query in 4.2549 seconds
Response: Subject: Project Status Update
Retrieved 3 emails

Query: Is there any issue with the system?
Processed query in 6.7683 seconds
Response: Subject: Code Review for the New Authentication Module
Retrieved 3 emails

Query: What are the plans for the weekend?
Processed query in 5.5597 seconds
Response: Subject: Deployment Schedule
Retrieved 3 emails
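
The results above show that the pipeline returns a dictionary with at least `response` and `retrieved_emails` keys. The actual `RAGPipeline` class lives in `src.model.generator` and is not shown here. The sketch below illustrates the retrieve-then-generate composition with stub components; `SimpleRAGPipeline`, `KeywordRetriever`, and `EchoGenerator` are hypothetical names, and the keyword matching is only a placeholder for embedding-based retrieval.

```python
class KeywordRetriever:
    """Stub retriever: ranks emails by words shared with the query."""

    def __init__(self, emails):
        self.emails = emails

    def retrieve(self, query, top_k=3):
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(email["subject"].lower().split())), email)
            for email in self.emails
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [email for _, email in scored[:top_k]]


class EchoGenerator:
    """Stub generator: answers with the subject of the best match."""

    def generate_response(self, query, retrieved_emails):
        if not retrieved_emails:
            return "No relevant emails found."
        return f"Most relevant email: {retrieved_emails[0]['subject']}"


class SimpleRAGPipeline:
    """Retrieve first, then generate from the retrieved context."""

    def __init__(self, retriever, generator, top_k=3):
        self.retriever = retriever
        self.generator = generator
        self.top_k = top_k

    def process_query(self, query):
        retrieved = self.retriever.retrieve(query, top_k=self.top_k)
        response = self.generator.generate_response(query, retrieved)
        return {"query": query, "response": response, "retrieved_emails": retrieved}


emails = [
    {"subject": "Budget Approval Request"},
    {"subject": "Project Status Update"},
    {"subject": "System Outage Notification"},
]
pipeline = SimpleRAGPipeline(KeywordRetriever(emails), EchoGenerator(), top_k=2)
result = pipeline.process_query("project status please")
print(result["response"])
```

Keeping the retriever and generator behind small interfaces like this makes it straightforward to swap either one, for example to compare the FAISS and ChromaDB retrievers with the same generator.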


## 7. Save the Models

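Nothing new needs to be written at this point: the ChromaDB collection and the FAISS index were persisted to disk in the earlier steps, and the transformer models are cached by the Hugging Face library. This cell reports where each artifact lives so the API can load it later.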

In [13]:
# ChromaDB is already saved in the persist_directory
print(f"ChromaDB is saved in: {chroma_store.persist_directory}")

# FAISS index is already saved
print(f"FAISS index is saved in: {vector_retriever.index_path}")

# The transformer models are cached by the Hugging Face library
print(f"Embedding model: {embedder.model_name}")
print(f"Generator model: {generator.model_name}")

ChromaDB is saved in: ../data/embeddings/chroma_db
FAISS index is saved in: ../data/embeddings/faiss_index.bin
Embedding model: all-MiniLM-L6-v2
Generator model: google/flan-t5-base
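
Because this cell only prints locations, the API still needs a way to learn where the artifacts are and which model names to load. One lightweight option, sketched here as an assumption rather than part of the project, is to write a small JSON manifest. The file name `artifacts.json`, the key names, and the helper `write_manifest` are hypothetical; the paths and model names are the ones printed above.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(path, manifest):
    """Write the artifact manifest as pretty-printed JSON and return its path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path


manifest = {
    "chroma_persist_directory": "../data/embeddings/chroma_db",
    "faiss_index_path": "../data/embeddings/faiss_index.bin",
    "embedding_model": "all-MiniLM-L6-v2",
    "generator_model": "google/flan-t5-base",
}

# A temporary directory keeps the example from touching the project tree
with tempfile.TemporaryDirectory() as tmp_dir:
    manifest_path = write_manifest(Path(tmp_dir) / "artifacts.json", manifest)
    reloaded = json.loads(manifest_path.read_text(encoding="utf-8"))

print(reloaded["embedding_model"])
```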


## 8. Summary

In this notebook, we've:

1. Loaded the preprocessed emails from the previous notebook
2. Embedded the emails using a pre-trained Sentence Transformer model
3. Set up ChromaDB for efficient vector storage and retrieval
4. Implemented similarity search using both a FAISS index and ChromaDB
5. Implemented response generation using a pre-trained language model
6. Created an end-to-end RAG pipeline that combines retrieval and generation
7. Confirmed where the persisted vector stores and cached models live for later use in the API

The RAG pipeline is now ready to be integrated into the API in the next steps.