# Email Wizard Assistant Implementation

This notebook demonstrates the implementation of an Email Wizard Assistant using a Retrieval-Augmented Generation (RAG) model. The assistant helps users find answers to their email queries by retrieving relevant past emails and generating intelligent responses.

## 1. Setup and Dependencies

First, let's install the required packages, then import the libraries and project modules we need.

In [None]:
# Install required packages if not already installed
!pip install numpy pandas matplotlib scikit-learn torch transformers sentence-transformers flask faiss-cpu tqdm python-dotenv

In [None]:
import sys
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import time

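# Add the project root (the parent of the notebook's working directory) to
# sys.path so that `src` can be imported. `__file__` is not defined inside a
# notebook, so the path is resolved from the current working directory.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))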

# Import our custom modules
from src.embedding import EmailEmbedder
from src.similarity_search import SimilaritySearch
from src.response_generator import ResponseGenerator

## 2. Load and Explore the Email Dataset

Let's load our sample email dataset and explore its structure.

In [None]:
# Load the sample emails
with open('../data/sample_emails.json', 'r') as f:
    emails = json.load(f)

print(f"Loaded {len(emails)} emails from the dataset.")

# Display the first email as an example
print("\nExample Email:")
example_email = emails[0]
for key, value in example_email.items():
    if key == 'body':
        print(f"{key}: {value[:100]}...")
    else:
        print(f"{key}: {value}")
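
The cells in this notebook read the `id`, `subject`, `sender`, `recipient`, and `body` fields of emails and search results. Below is a minimal sketch of a schema check to run before embedding, assuming each record is a flat dictionary with those keys. The helper `find_invalid_emails` and the toy records are hypothetical and not part of `src`.

```python
# Assumed field names, taken from the keys this notebook reads.
REQUIRED_FIELDS = ("id", "subject", "sender", "recipient", "body")

def find_invalid_emails(records, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) pairs for records lacking required keys."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            problems.append((index, missing))
    return problems

# Toy records for illustration only; the second one has no body.
sample = [
    {"id": "1", "subject": "Status", "sender": "a@x.com",
     "recipient": "b@x.com", "body": "All on track."},
    {"id": "2", "subject": "No body", "sender": "a@x.com",
     "recipient": "b@x.com"},
]
print(find_invalid_emails(sample))  # [(1, ['body'])]
```

Running a check like this on `emails` would surface incomplete records before `emails_df['body'].apply(len)` fails on a missing value.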

Let's analyze some basic statistics about our email dataset.

In [None]:
# Convert to DataFrame for easier analysis
emails_df = pd.DataFrame(emails)

# Display basic statistics
print("Email Dataset Statistics:")
print(f"Number of emails: {len(emails_df)}")
print(f"Unique senders: {emails_df['sender'].nunique()}")
print(f"Unique recipients: {emails_df['recipient'].nunique()}")

# Calculate email body lengths
emails_df['body_length'] = emails_df['body'].apply(len)
print(f"Average email body length: {emails_df['body_length'].mean():.2f} characters")
print(f"Min email body length: {emails_df['body_length'].min()} characters")
print(f"Max email body length: {emails_df['body_length'].max()} characters")

# Plot email body length distribution
plt.figure(figsize=(10, 6))
plt.hist(emails_df['body_length'], bins=10, alpha=0.7)
plt.title('Email Body Length Distribution')
plt.xlabel('Number of Characters')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

## 3. Email Embedding

Now, let's embed our emails using a pre-trained model.

In [None]:
# Initialize the email embedder
embedder = EmailEmbedder(model_name='all-MiniLM-L6-v2')

# Check if embeddings already exist
try:
    embeddings = embedder.load_embeddings('../data/embeddings.pkl')
    print(f"Loaded {len(embeddings)} existing embeddings.")
except Exception as e:
    print(f"No existing embeddings found: {e}")
    embeddings = {}

# If no embeddings exist, create them
if not embeddings:
    print("Creating new embeddings...")
    start_time = time.time()
    embeddings = embedder.embed_emails(emails)
    end_time = time.time()
    print(f"Embedding completed in {end_time - start_time:.2f} seconds.")
    
    # Save the embeddings
    embedder.save_embeddings(embeddings, '../data/embeddings.pkl')
    print(f"Saved {len(embeddings)} embeddings to file.")
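
The cell above reuses any cached embeddings it finds, even if `sample_emails.json` has changed since they were saved. One way to guard against a stale cache is to store a fingerprint of the dataset alongside the embeddings. This is a minimal sketch under that assumption; `dataset_fingerprint` and `cache_is_fresh` are hypothetical helpers, not part of `EmailEmbedder`.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash the dataset so a cache can be tied to the exact emails it was built from."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cache_is_fresh(cached_fingerprint, records):
    """Return True when the cached fingerprint matches the current dataset."""
    return cached_fingerprint == dataset_fingerprint(records)

# Toy data for illustration only.
emails_v1 = [{"id": "1", "body": "Server maintenance on Friday."}]
emails_v2 = [{"id": "1", "body": "Server maintenance moved to Monday."}]

stored = dataset_fingerprint(emails_v1)
print(cache_is_fresh(stored, emails_v1))  # True
print(cache_is_fresh(stored, emails_v2))  # False
```

When the check fails, the embeddings would be recomputed and saved together with the new fingerprint.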

Let's examine the embeddings we've created.

In [None]:
# Get a sample embedding
sample_email_id = list(embeddings.keys())[0]

sample_embedding = embeddings[sample_email_id]

print(f"Sample embedding for email ID {sample_email_id}:")
print(f"Shape: {sample_embedding.shape}")
print(f"First 10 values: {sample_embedding[:10]}")

# Visualize embedding distribution for the sample
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(sample_embedding, bins=30, alpha=0.7)
plt.title('Embedding Value Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.plot(sample_embedding)
plt.title('Embedding Vector')
plt.xlabel('Dimension')
plt.ylabel('Value')
plt.tight_layout()
plt.show()

## 4. Similarity Search Implementation

Next, let's initialize the similarity search and build an index over the email embeddings.

In [None]:
# Initialize the similarity search
similarity_search = SimilaritySearch(model_name='all-MiniLM-L6-v2')

# Build the search index
similarity_search.build_index(embeddings, emails)
print("Search index built successfully.")
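
To show what the index does conceptually, here is a minimal sketch of a brute-force cosine-similarity top-k search in pure Python. It assumes one vector per email ID, as in the `embeddings` dictionary. The real `SimilaritySearch` class may work differently (for example, with FAISS), and `cosine_similarity`, `top_k`, and the toy vectors are illustrative names only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query_vector, vectors_by_id, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (email_id, cosine_similarity(query_vector, vector))
        for email_id, vector in vectors_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Three-dimensional toy vectors; real embeddings have far more dimensions.
toy_index = {
    "email_1": [1.0, 0.0, 0.0],
    "email_2": [0.0, 1.0, 0.0],
    "email_3": [1.0, 1.0, 0.0],
}
ranked = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([email_id for email_id, _ in ranked])  # ['email_1', 'email_3']
```

A brute-force scan like this is linear in the number of emails, which is fine for a small sample dataset.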

Let's test the similarity search with a few example queries.

In [None]:
# Define some test queries
test_queries = [
    "What's the status of our project?",
    "When is the server maintenance scheduled?",
    "Tell me about the new benefits enrollment",
    "What was the feedback from the client presentation?",
    "Is there a bug in the login page?"
]

# Test each query
for query in test_queries:
    print(f"\nQuery: {query}")
    start_time = time.time()
    results = similarity_search.search(query, k=3)
    end_time = time.time()
    
    print(f"Search completed in {(end_time - start_time) * 1000:.2f} ms")
    print(f"Top {len(results)} results:")
    
    for i, result in enumerate(results):
        print(f"Result {i+1}:")
        print(f"  Email ID: {result['id']}")
        print(f"  Subject: {result['subject']}")
        print(f"  Similarity: {result['similarity']:.4f}")
        print(f"  Snippet: {result['snippet']}")
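
Reading the printed results is a manual check. The same check could be scored automatically, as in this minimal sketch. It assumes a hand-labeled expected email ID per query, and that results are dictionaries with an `id` key, as above. The function `hit_rate_at_k`, the queries, and the labels are invented for illustration.

```python
def hit_rate_at_k(results_by_query, expected_by_query, k=3):
    """Fraction of queries whose expected email ID appears in the top-k results."""
    if not expected_by_query:
        return 0.0
    hits = 0
    for query, expected_id in expected_by_query.items():
        top_ids = [result["id"] for result in results_by_query.get(query, [])[:k]]
        if expected_id in top_ids:
            hits += 1
    return hits / len(expected_by_query)

# Hypothetical search output and labels, for illustration only.
results_by_query = {
    "server maintenance": [{"id": "e7"}, {"id": "e2"}, {"id": "e9"}],
    "login bug": [{"id": "e4"}, {"id": "e1"}, {"id": "e3"}],
}
expected_by_query = {"server maintenance": "e2", "login bug": "e5"}

print(hit_rate_at_k(results_by_query, expected_by_query, k=3))  # 0.5
```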

## 5. Response Generation

Next, let's initialize the response generator, which produces an answer from the retrieved emails and the user's query.

In [None]:
# Initialize the response generator
response_generator = ResponseGenerator(model_name="google/flan-t5-base")
print("Response generator initialized.")

Let's test the response generation with our example queries.

In [None]:
# Test response generation for each query
for query in test_queries:
    print(f"\nQuery: {query}")
    
    # Search for relevant emails
    results = similarity_search.search(query, k=3)
    
    # Generate response
    start_time = time.time()
    response = response_generator.generate_response(results, query)
    end_time = time.time()
    
    print(f"Response generated in {(end_time - start_time):.2f} seconds")
    print(f"Response: {response}")
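
In a RAG setup, the retrieved emails are typically folded into the prompt passed to the language model. Here is a minimal sketch of that step, assuming each result has `subject` and `snippet` keys as in the search output above. The prompt wording, the character budget, and `build_prompt` itself are assumptions. `ResponseGenerator.generate_response` may build its input differently.

```python
def build_prompt(query, retrieved, max_context_chars=1000):
    """Concatenate retrieved snippets into a context block, then append the question."""
    context_parts = []
    used = 0
    for rank, email in enumerate(retrieved, start=1):
        entry = f"[{rank}] Subject: {email['subject']}\n{email['snippet']}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the emails below.\n\n"
        f"Emails:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Toy search results for illustration only.
retrieved = [
    {"subject": "Maintenance window", "snippet": "Servers go offline Saturday at 22:00."},
    {"subject": "Team lunch", "snippet": "Lunch is booked for Thursday noon."},
]
prompt = build_prompt("When is the server maintenance scheduled?", retrieved)
print(prompt)
```

Because results arrive sorted by similarity, a context budget drops the least relevant emails first.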

## 6. Performance Evaluation

Let's measure the search and response generation latency of our Email Wizard Assistant over a larger set of queries.

In [None]:
# Define a larger set of test queries for performance evaluation
evaluation_queries = [
    "What's the status of our project?",
    "When is the server maintenance scheduled?",
    "Tell me about the new benefits enrollment",
    "What was the feedback from the client presentation?",
    "Is there a bug in the login page?",
    "What are the next steps for the project?",
    "When is the quarterly budget review?",
    "What's the invoice amount due?",
    "What new features are customers requesting?",
    "When is the team lunch scheduled?"
]

# Measure search performance
search_times = []
for query in tqdm(evaluation_queries, desc="Evaluating search performance"):
    start_time = time.time()
    results = similarity_search.search(query, k=3)
    end_time = time.time()
    search_times.append((end_time - start_time) * 1000)  # Convert to milliseconds

# Measure response generation performance
response_times = []
for query in tqdm(evaluation_queries, desc="Evaluating response generation"):
    results = similarity_search.search(query, k=3)
    start_time = time.time()
    response = response_generator.generate_response(results, query)
    end_time = time.time()
    response_times.append(end_time - start_time)  # In seconds

# Display performance metrics
print("\nPerformance Metrics:")
print(f"Average search time: {np.mean(search_times):.2f} ms")
print(f"Min search time: {np.min(search_times):.2f} ms")
print(f"Max search time: {np.max(search_times):.2f} ms")
print(f"\nAverage response generation time: {np.mean(response_times):.2f} seconds")
print(f"Min response generation time: {np.min(response_times):.2f} seconds")
print(f"Max response generation time: {np.max(response_times):.2f} seconds")

# Visualize performance
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.bar(range(len(search_times)), search_times)
plt.title('Search Time by Query')
plt.xlabel('Query Index')
plt.ylabel('Time (ms)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.bar(range(len(response_times)), response_times)
plt.title('Response Generation Time by Query')
plt.xlabel('Query Index')
plt.ylabel('Time (seconds)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
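
Each query above is timed once, so a single slow call can skew the averages. Here is a minimal sketch of a reusable timing helper that repeats each call and reports the median as well as the mean. It swaps in `time.perf_counter`, a clock intended for measuring short durations, in place of `time.time`. The names `time_call` and `summarize` are hypothetical, and `sorted` stands in for the search call.

```python
import statistics
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Run func several times and return per-call durations in milliseconds."""
    durations_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations_ms.append((time.perf_counter() - start) * 1000)
    return durations_ms

def summarize(durations_ms):
    """Return mean, median, and maximum of a list of durations."""
    return {
        "mean_ms": statistics.mean(durations_ms),
        "median_ms": statistics.median(durations_ms),
        "max_ms": max(durations_ms),
    }

# Stand-in workload: sorting a reversed list.
timings = time_call(sorted, list(range(1000, 0, -1)), repeats=5)
print(summarize(timings))
```

In this notebook, the same helper could wrap `similarity_search.search(query, k=3)`, for example `time_call(similarity_search.search, query, k=3)`.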

## 7. End-to-End Testing

Let's perform an end-to-end test of our Email Wizard Assistant.

In [None]:
def query_email_wizard(query):
    """End-to-end function to query the Email Wizard Assistant."""
    print(f"Query: {query}")
    
    # Record start time
    total_start_time = time.time()
    
    # Step 1: Search for relevant emails
    search_start_time = time.time()
    retrieved_emails = similarity_search.search(query, k=3)
    search_end_time = time.time()
    
    # Step 2: Generate response
    generation_start_time = time.time()
    response = response_generator.generate_response(retrieved_emails, query)
    generation_end_time = time.time()
    
    # Calculate timings
    total_end_time = time.time()
    search_time = search_end_time - search_start_time
    generation_time = generation_end_time - generation_start_time
    total_time = total_end_time - total_start_time
    
    # Print results
    print("\nRetrieved Emails:")
    for i, email in enumerate(retrieved_emails):
        print(f"Email {i+1}: {email['subject']} (Similarity: {email['similarity']:.4f})")
    
    print("\nGenerated Response:")
    print(response)
    
    print("\nPerformance:")
    print(f"Search time: {search_time*1000:.2f} ms")
    print(f"Response generation time: {generation_time:.2f} seconds")
    print(f"Total processing time: {total_time:.2f} seconds")
    
    return {
        "response": response,
        "retrieved_emails": retrieved_emails,
        "performance": {
            "search_time_ms": search_time*1000,
            "generation_time_sec": generation_time,
            "total_time_sec": total_time
        }
    }

In [None]:
# Test with a few user queries
user_queries = [
    "What's the status of our project?",
    "When is the next team meeting scheduled?",
    "What are the details of the server maintenance?"
]

for query in user_queries:
    print("\n" + "="*80)
    result = query_email_wizard(query)
    print("="*80)

## 8. Conclusion

In this notebook, we've implemented an Email Wizard Assistant using a Retrieval-Augmented Generation (RAG) model. The assistant can:

1. Embed emails into vector representations
2. Retrieve relevant emails based on user queries
3. Generate coherent responses based on the retrieved emails

The performance evaluation in Section 6 measures search latency and response generation time. Response quality is checked only by reading the generated answers and is not scored quantitatively. The Flask API implementation in the `/api` directory provides a web interface for interacting with the assistant.

Future improvements could include:
- Using more advanced embedding models
- Implementing approximate nearest neighbor (ANN) search for faster retrieval on larger datasets
- Adding more sophisticated response generation techniques
- Expanding the email dataset for better coverage