# EmailSearchAI

A generative search system for emails that helps organisations find and validate past decisions, strategies, and data in a huge corpus of email threads.

Steps:

1. Ingest emails and their metadata into a vector database.
2. Implement a retrieval-augmented generation (RAG) architecture.
3. Use embeddings to find relevant email threads.
4. Generate responses based on retrieved emails.


## The Embedding Layer

### About the dataset
- The dataset is provided in CSV and Pickle formats for ease of use.
- Each thread contains multiple emails, allowing for analysis of conversation flow and decision-making processes.
- Human-generated summaries enable quick understanding and validation of thread content.
- Suitable for tasks such as search, summarization, and retrieval-augmented generation in enterprise settings.

**Email Thread Summary Dataset**

**Overview:**  
The Email Thread Dataset consists of two files, `email_thread_details` and `email_thread_summaries`, which pair the full content of each email thread with a human-written summary.

**Email Thread Details**  
*Description:*  
The `email_thread_details` file holds the individual emails of each thread: subject, timestamp, sender, recipients, and message body.

*Columns:*
- `thread_id`: A unique identifier for each email thread.
- `subject`: Subject of the email thread.
- `timestamp`: Timestamp indicating when the message was sent.
- `from`: Sender of the email.
- `to`: List of recipients of the email.
- `body`: Content of the email message.

*Additional Information:*  
The "to" column is available in both CSV and Pickle (pkl) formats, facilitating convenient access to recipient information as a column of lists of strings.
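When the CSV version is loaded, the `to` column arrives as plain strings such as `"['Suzanne Adams']"` rather than as lists. The sketch below shows one way to turn such a cell back into a list of strings. It assumes the cells hold Python list literals, as the sample rows in this notebook suggest; `parse_recipients` is a hypothetical helper name, not part of the dataset tooling.

```python
import ast


def parse_recipients(cell):
    """Convert a CSV cell such as "['Suzanne Adams']" into a list of strings."""
    # Values loaded from the Pickle file are already lists.
    if isinstance(cell, list):
        return cell
    # Missing values (NaN) and empty strings become an empty list.
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Assumption: anything that is not a list literal is a single recipient.
        return [cell]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]
```

Once the details file is loaded, this could be applied with `email_thread_details['to'].apply(parse_recipients)`. Storing the result in a separate column keeps the original string available for use as vector-store metadata.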

**Email Thread Summaries**  
*Description:*  
The `email_thread_summaries` file contains concise summaries crafted by human annotators for each email thread, offering a high-level overview of the content.

*Columns:*
- `thread_id`: A unique identifier for each email thread.
- `summary`: A concise summary of the email thread.

**Dataset Structure:**  
The dataset is organized into threads and emails:

- Threads: 4,167 threads  
- Emails: 21,684 emails

**Language:**  
- Languages: English (en)

In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
import chromadb
import openai
import tiktoken
from huggingface_hub import hf_hub_download
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
import re



In [None]:
email_thread_details = pd.read_csv("./dataset/csv/trimmed_email_thread_details.csv")
email_thread_summaries = pd.read_csv("./dataset/csv/trimmed_email_thread_summaries.csv")

In [3]:
email_thread_details.head()

Unnamed: 0,thread_id,subject,timestamp,from,to,body
0,1,FW: Master Termination Log,2002-01-29 11:23:42,"Gossett, Jeffrey C. JGOSSET","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...
1,1,FW: Master Termination Log,2002-01-31 12:50:00,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Gossett', 'Jeff...",\n\n -----Original Message-----\nFrom: =09Panu...
2,1,FW: Master Termination Log,2002-02-05 15:03:35,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Anderson', 'Dia...",Note to Stephanie Panus....\n\nStephanie...ple...
3,1,FW: Master Termination Log,2002-02-05 15:06:25,"Theriot, Kim S. KTHERIO","['Hall', 'D. Todd Thall', 'Sweeney', 'Kevin Ks...",\n\n -----Original Message-----\nFrom: =09Panu...
4,1,FW: Master Termination Log,2002-05-28 07:20:35,"Kelly, Katherine L. KKELLY","['Germany', 'Chris Cgerman']",\n\n -----Original Message-----\nFrom: =09McMi...


In [4]:
email_thread_summaries['text_length'] = email_thread_summaries['summary'].apply(lambda x: len(x.split(' ')))

In [5]:
# Count number of emails per thread
email_thread_counts = email_thread_details.groupby('thread_id').size()
email_thread_summaries['num_emails'] = email_thread_summaries['thread_id'].map(email_thread_counts)

In [56]:
# Use an LLM to generate a summary of the related emails and people involved in the conversations

def generate_response(related_emails):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that summarizes email threads and identifies key participants."},
            {"role": "user", "content": f"Summarize the following emails and identify the key people involved:\n\n{related_emails}"}
        ],
    )

    return response.choices[0].message.content.split('\n')
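`generate_response` interpolates the Python list of bodies directly into the prompt, so the model sees the list's `repr`, with brackets, quotes, and escaped newlines. A minimal sketch of one alternative is to join the bodies into numbered, clearly separated blocks first. The helper name `format_emails_for_prompt` and the separator are assumptions for illustration.

```python
def format_emails_for_prompt(bodies, separator="\n\n---\n\n"):
    """Join email bodies into one prompt string with numbered sections."""
    parts = []
    for position, body in enumerate(bodies, start=1):
        text = str(body).strip()
        # Skip empty bodies, but keep each email's original position number.
        if text:
            parts.append(f"Email {position}:\n{text}")
    return separator.join(parts)
```

The result could then be passed to `generate_response` in place of the raw list, which may also reduce the token count slightly because the escape characters disappear.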

In [51]:
email_thread_details[email_thread_details['thread_id'] == 2]

Unnamed: 0,thread_id,subject,timestamp,from,to,body
5,2,Credit Group Lunch,2000-01-12 05:26:00,Tana Jones,['Suzanne Adams'],I'll be there...
6,2,Credit Group Lunch,2000-02-15 01:08:00,Tana Jones,['Suzanne Adams'],I will attend.
7,2,Credit Group Lunch,2000-04-18 04:54:00,Carol St Clair,['Suzanne Adams'],Suzanne:\nHere is the complete list of credit ...
8,2,Credit Group Lunch,2000-04-18 06:13:00,Carol St Clair,['Suzanne Adams'],Suzanne:\nCould you please check the names of ...
9,2,Credit Group Lunch,2000-04-18 08:25:00,Mark Taylor,['Suzanne Adams'],I will not be able to attend.\n\n\n\n\nSuzanne...
10,2,Credit Group Lunch,2000-04-18 08:29:00,Sara Shackleton,['Kaye Ellis'],"Gosh, I guessed right!!!!\n\n\n\n\nKaye Ellis\..."
11,2,Credit Group Lunch,2000-04-18 09:18:00,Carol St Clair,['Suzanne Adams'],yes. That's okay.\nCarol


In [54]:
email_thread_details[email_thread_details['thread_id'] == 2]['body'].tolist()

["I'll be there...",
 'I will attend.',
 "Suzanne:\nHere is the complete list of credit folks.  Please send an e-mail to each of \nthem concerning the 5th.  Please include the description that I have bolded.  \nIn our group, you don't need to include Marie or Shari. Thanks.\n\nCarol\n---------------------- Forwarded by Carol St Clair/HOU/ECT on 04/18/2000 \n11:52 AM ---------------------------\n   \n\n\nFrom:  John Suttle                                                            \n 04/18/2000 11:47 AM\t\n\t\n\t\n\t                           \n\t\n\nTo: Carol St Clair/HOU/ECT@ECT\ncc:  \nSubject: Re: Credit Group Lunch  \n\nCarol,\n\nThree more have recently joined our group:\nEd Sacks\nBrad Schneider\nWendy LeBrocq\n\nJS\n\n\n\nCarol St Clair\n04/18/2000 11:43 AM\nTo: John Suttle/HOU/ECT@ECT\ncc:  \nSubject: Credit Group Lunch\n\nJohn:\nSara and I would like to hold another lunch with your group on Friday, May \n5th to go through in detail how the ISDA and CSA Masters and Schedules wo

In [57]:
generate_response(email_thread_details[email_thread_details['thread_id'] == 2]['body'].tolist())

['The email thread discusses the planning of a Credit Group Lunch meeting scheduled for Friday, May 5th, 2000. The main purpose of the lunch is to get into details of how the ISDA and CSA Masters and Schedules work. ',
 '',
 'The key participants in this thread are:',
 '1. Carol St Clair: She initiates the plan for the meeting and communicates with Suzanne about the same. She also confirms the list of attendees with John.',
 '2. John Suttle: He contributes to the discussion by providing the names of additional attendees - Ed Sacks, Brad Schneider, and Wendy LeBrocq.',
 "3. Suzanne Adams: She is responsible for sending the invitations to the meeting and confirms attendance. However, she mentions that she won't be able to attend.",
 "4. Kaye Ellis: She informs about Jeff Sorenson's availability.",
 '',
 'It’s also important to note that Carol requests Suzanne to check and ensure the inclusion of Cathy Tudon and Nidia Martinez in the invitees list.']

In [58]:
email_thread_summaries[email_thread_summaries['thread_id'] == 2]['summary']

1    A lunch meeting has been scheduled for May 5th...
Name: summary, dtype: object

In [59]:
# Add a generated summary as a column in email_thread_summaries

email_thread_summaries['generated_summary'] = email_thread_summaries['thread_id'].apply(lambda x: generate_response(email_thread_details[email_thread_details['thread_id'] == x]['body'].tolist()) if x in email_thread_details['thread_id'].values else [])

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens. However, your messages resulted in 8859 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
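The error above means at least one thread is longer than the model's context window. One possible workaround, sketched here under stated assumptions, is to keep whole emails in order until a token budget is reached and drop the rest. `fit_to_token_budget` is a hypothetical helper. By default it uses a rough whitespace word count, which underestimates real token counts, so a tokenizer-based counter (for example one built on `tiktoken`) should be passed as `count_tokens` in practice, and the budget should leave room for the system prompt and the response.

```python
def fit_to_token_budget(bodies, max_tokens, count_tokens=None):
    """Keep whole emails, in order, until the token budget is used up."""
    if count_tokens is None:
        # Assumption: a whitespace word count stands in for a real tokenizer.
        def count_tokens(text):
            return len(text.split())

    kept = []
    used = 0
    for body in bodies:
        cost = count_tokens(body)
        # Stop at the first email that would push the total over the budget.
        if used + cost > max_tokens:
            break
        kept.append(body)
        used += cost
    return kept
```

Truncating this way favours the earliest emails in a thread. Summarizing the thread in chunks and then combining the partial summaries is another option when later emails matter.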

In [6]:
email_thread_summaries.head()


Unnamed: 0,thread_id,summary,text_length,num_emails
0,1,The email thread discusses the Master Terminat...,74,5
1,2,A lunch meeting has been scheduled for May 5th...,74,7
2,3,Ben is updating a friend on his progress with ...,55,6
3,4,The recipient of the email thread initially ex...,63,4
4,5,The email thread discusses the long form confi...,110,8


In [7]:
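# Note: `info` is referenced without parentheses, so this displays the bound method
# (which includes the DataFrame repr) rather than the column summary that `info()` prints.
email_thread_summaries.info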

<bound method DataFrame.info of       thread_id                                            summary  \
0             1  The email thread discusses the Master Terminat...   
1             2  A lunch meeting has been scheduled for May 5th...   
2             3  Ben is updating a friend on his progress with ...   
3             4  The recipient of the email thread initially ex...   
4             5  The email thread discusses the long form confi...   
...         ...                                                ...   
4162       4163  Peter Thompson has sent a memo to Kay Mann and...   
4163       4164  The email thread revolves around the sharing a...   
4164       4165  Susan asks Emily about her plans for the weeke...   
4165       4166  Several employees will be on vacation during d...   
4166       4167  Mat has sent an email to Amanda requesting her...   

      text_length  num_emails  
0              74           5  
1              74           7  
2              55           6  

In [8]:
# Create two collections in ChromaDB.
# First, match the query against email_thread_summaries to find the most relevant thread.
# Then retrieve the top-k emails of that thread from the email_thread_details collection.




In [9]:
chroma_client = chromadb.Client()

In [10]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [11]:
chroma_data_path = './chroma_data'

In [12]:
import chromadb

In [13]:
# Create a persistent client so the collections are stored on disk at chroma_data_path

client = chromadb.PersistentClient(chroma_data_path)


In [None]:
# Set up the embedding function using the OpenAI embedding model

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name=model)


In [15]:
# Initialise a collection in Chroma and pass it the embedding_function so that it uses OpenAI embeddings to embed the documents

email_summaries_collection = client.get_or_create_collection(name='Email_Summaries', embedding_function=embedding_function)

In [16]:
# Batch the data to avoid exceeding the token limit per request
batch_size = 100  # You can adjust this value if needed

summaries = email_thread_summaries['summary'].tolist()
metadatas = email_thread_summaries[['thread_id', 'num_emails', 'text_length']].to_dict(orient='records')
ids = [str(i) for i in range(len(email_thread_summaries))]

for start_idx in range(0, len(summaries), batch_size):
    end_idx = start_idx + batch_size
    email_summaries_collection.add(
        documents=summaries[start_idx:end_idx],
        metadatas=metadatas[start_idx:end_idx],
        ids=ids[start_idx:end_idx],
    )

In [17]:
email_details_collection = client.get_or_create_collection(name='Email_Details', embedding_function=embedding_function)


In [24]:
# Batch the data to avoid exceeding the token limit per request
details_batch_size = 100  # You can adjust this value if needed

details_documents = email_thread_details['body'].tolist()
details_metadatas = email_thread_details[['thread_id', 'subject', 'from', 'to', 'timestamp']].to_dict(orient='records')
details_ids = [str(i) for i in range(len(email_thread_details))]

# Estimate tokens using tiktoken
encoding = tiktoken.encoding_for_model(model)
max_tokens = 8192

def count_tokens(text):
    return len(encoding.encode(text))

current_batch_docs = []
current_batch_metas = []
current_batch_ids = []
current_tokens = 0

for doc, meta, doc_id in zip(details_documents, details_metadatas, details_ids):
    doc_tokens = count_tokens(doc)
    # Skip documents that exceed the model's max token limit
    if doc_tokens > max_tokens:
        continue
    # If adding this document exceeds the token limit, flush the batch
    if current_tokens + doc_tokens > max_tokens and current_batch_docs:
        email_details_collection.add(
            documents=current_batch_docs,
            metadatas=current_batch_metas,
            ids=current_batch_ids,
        )
        current_batch_docs = []
        current_batch_metas = []
        current_batch_ids = []
        current_tokens = 0
    current_batch_docs.append(doc)
    current_batch_metas.append(meta)
    current_batch_ids.append(doc_id)
    current_tokens += doc_tokens

# Add any remaining documents in the last batch
if current_batch_docs:
    email_details_collection.add(
        documents=current_batch_docs,
        metadatas=current_batch_metas,
        ids=current_batch_ids,
    )


In [25]:
email_details_collection.peek()

{'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 'embeddings': array([[-0.00499778, -0.00385044, -0.01661965, ..., -0.01360464,
         -0.01789694, -0.03531861],
        [-0.01091692,  0.00972126, -0.00879295, ..., -0.01810575,
         -0.02908208, -0.03181502],
        [-0.01916373,  0.00233976, -0.02327538, ..., -0.00999878,
         -0.02973038, -0.01786986],
        ...,
        [-0.0215488 ,  0.00217057,  0.01241323, ..., -0.01609535,
         -0.00323058, -0.02147907],
        [-0.01279095,  0.00497761,  0.00409941, ...,  0.00787703,
         -0.02673496, -0.00519549],
        [-0.02217144, -0.0146973 , -0.00014849, ..., -0.01142736,
         -0.00527792, -0.02707984]], shape=(10, 1536)),
 'documents': ['\n\n -----Original Message-----\nFrom: =09Theriot, Kim S. =20\nSent:=09Tuesday, January 29, 2002 1:23 PM\nTo:=09Richardson, Stacey; Anderson, Diane; Gossett, Jeffrey C.; White, Stac=\ney W.; Murphy, Melissa; Hall, D. Todd; Sweeney, Kevin\nCc:=09Aucoin, Evelyn; Baxt

## The Search Layer

In [38]:
# query = "Who proposed the chosen approach for the data migration and when?"
query = "In which department does Paul Radous work?"

In [44]:
related_thread = email_summaries_collection.query(
    query_texts=query,
    n_results=10
)

In [45]:
related_thread

{'ids': [['1832',
   '2022',
   '3085',
   '1345',
   '2824',
   '597',
   '3986',
   '424',
   '3036',
   '4098']],
 'embeddings': None,
 'documents': [['Paul is sending an email to both Harib and Jassem, attaching the latest update for the DPC project. He mentions that a printed copy of project photos will be sent separately. Paul provides his phone number for any questions. The email is repeated twice, with the same content and recipients.',
   "Laura informs Gregg that the damage appears to be limited and she has spoken with Peoples for cooperation. Peoples is dealing with a union walk-out involving over a thousand employees. Gregg forwards Laura an image. Gerald provides Laura with Paul's phone number and mentions looking up an Acura online. LT Bieraugel sends Gerald a picture from a cruise they took in January.",
   'The email thread involves a conversation between Jeff Dasovich and Paul Kaufman. Jeff initially sends an email stating that he quits, to which Paul responds by firin

In [41]:
related_thread['metadatas'][0][0]['thread_id']

1833

In [42]:
related_emails = email_details_collection.query(
    query_texts=query,
    n_results=10,
    where={"thread_id": related_thread['metadatas'][0][0]['thread_id']}
)
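The two queries above can be wrapped into a single helper: first find the best-matching thread through the summaries collection, then rank that thread's emails against the same query. This is a minimal sketch, assuming both arguments behave like the Chroma collections created earlier (a `query` method accepting `query_texts`, `n_results`, and `where`). The function name `two_stage_search` is hypothetical.

```python
def two_stage_search(query, summaries_collection, details_collection, n_emails=10):
    """Find the best-matching thread via summaries, then rank that thread's emails."""
    # Stage 1: the closest summary identifies the thread.
    summary_hits = summaries_collection.query(query_texts=[query], n_results=1)
    metadatas = summary_hits["metadatas"][0]
    if not metadatas:
        return None, []
    thread_id = metadatas[0]["thread_id"]

    # Stage 2: restrict the email search to that thread with a metadata filter.
    email_hits = details_collection.query(
        query_texts=[query],
        n_results=n_emails,
        where={"thread_id": thread_id},
    )
    return thread_id, email_hits["documents"][0]
```

For example, `two_stage_search(query, email_summaries_collection, email_details_collection)` would return the thread id together with that thread's emails ordered by similarity to the query. Only the single best thread is used here; considering several candidate threads is a possible extension.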

In [43]:
related_emails

{'ids': [['9602', '9603', '9600', '9601']],
 'embeddings': None,
 'documents': [['Jassem,\n\nAttached is the latest update for the DPC project. A printed copy of project photos is being forwarded to you seperately. If you have any questions, please feel free to call me at 713 345-5855.\n\nBest Regards,\n\nPaul\n\n\n ',
   'Harib,\n\nAttached is the latest update for the DPC project. A printed copy of project photos is being forwarded to you seperately. If you have any questions, please feel free to call me at 713 345-5855.\n\nBest Regards,\n\nPaul\n\n\n ',
   'Jassem,\n\nAttached is the latest update for the DPC project. A printed copy of project \nphotos is being forwarded to you seperately. If you have any questions, \nplease feel free to call me at 713 345-5855.\n\nBest Regards,\n\nPaul\n\n',
   'Harib,\n\nAttached is the latest update for the DPC project. A printed copy of project \nphotos is being forwarded to you seperately. If you have any questions, \nplease feel free to call m