# Benchmarking vector search on raw emails vs interrogated emails
Most LLM applications rely on a knowledge base to provide external context, and use vector similarity search to retrieve this knowledge. One of the most impactful ways to improve the reliability of vector search is to preprocess the input documents: remove any unnecessary markup, and present the information in a form closer to how users typically query the application.

In this notebook, we compare vector search performance on the [Enron email dataset](https://www.cs.cmu.edu/~enron/) in two scenarios:
1. Indexing emails based on embeddings generated from the raw email
2. Indexing emails based on embeddings generated from a series of question and answer pairs about each email, produced by OpenAI's GPT-4 API.

In [None]:
! pip install langchain
! pip install doctran
! pip install pandas

In [29]:
import pandas as pd
import csv
import json
import itertools
from email import message_from_string
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS
from doctran import Doctran, ExtractProperty

In [4]:
import os

# Read the key from the environment instead of hardcoding a secret in the notebook.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_MODEL = "gpt-4"
OPENAI_TOKEN_LIMIT = 8000

## Load email dataset
The email dataset used in this notebook is not included because it is over 1.4 GB. However, it can be downloaded from https://www.cs.cmu.edu/~./enron/.

In [5]:
filename = "emails.csv"
start_row = 1200
end_row = 1210

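# Read only rows [start_row, end_row) so the full file is never loaded into memory.
with open(filename, "r", newline="") as f:
    reader = csv.reader(f)
    rows = list(itertools.islice(reader, start_row, end_row))
    # Column 1 of each row holds the raw message text; parse it into a Message object.
    emails = [message_from_string(row[1]) for row in rows]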

print(emails[0])

Message-ID: <12536946.1075855667184.JavaMail.evans@thyme>
Date: Tue, 31 Oct 2000 07:00:00 -0800 (PST)
From: phillip.allen@enron.com
To: david.delainey@enron.com
Subject: 
Cc: john.lavorato@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: john.lavorato@enron.com
X-From: Phillip K Allen
X-To: David W Delainey
X-cc: John J Lavorato
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents
X-Origin: Allen-P
X-FileName: pallen.nsf

Dave,

The back office is having a hard time dealing with the $11 million dollars 
that is to be recognized as transport expense by the west desk then recouped 
from the Office of the Chairman.    Is your understanding that the West desk 
will receive origination each month based on the schedule below.

 
 The Office of the Chairman agrees to grant origination to the Denver desk as 
follows:

October 2000  $1,395,000
November 2000 $1,350,000
December 2000 $1,395,000
January 2001  $   669,600
Fe
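
The message printed above is dominated by headers (`Message-ID`, `X-Folder`, `X-FileName`, and so on) that a user query is unlikely to mention. As a point of comparison with interrogation, a simpler form of preprocessing is to keep a few headers and the body. The sketch below uses only the standard library; the sample message, the addresses, and the choice of headers to keep are all hypothetical and not part of the original notebook.

```python
from email import message_from_string

# Hypothetical sample message, shaped like the Enron message printed above.
RAW = """Message-ID: <1.2.JavaMail.evans@thyme>
Date: Tue, 31 Oct 2000 07:00:00 -0800 (PST)
From: alice@example.com
To: bob@example.com
Subject: Schedule
X-Folder: \\Example\\All documents
X-FileName: example.nsf

Bob,

Can you confirm the schedule below?
"""

# Assumption: these are the headers worth keeping for retrieval.
KEEP_HEADERS = ("From", "To", "Subject", "Date")


def strip_markup(raw: str) -> str:
    """Keep a handful of headers plus the plain-text body; drop everything else."""
    msg = message_from_string(raw)
    kept = [f"{name}: {msg[name]}" for name in KEEP_HEADERS if msg[name]]
    if msg.is_multipart():
        # Join the text/plain parts of a multipart message.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return "\n".join(kept) + "\n\n" + body.strip()


print(strip_markup(RAW))
```

This only removes noise. It does not rephrase the content toward how users ask questions, which is what the interrogation step in the next section is for.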

## Run emails through interrogation

In [13]:
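embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
docs = []
for email in emails:
    # Interrogate the full raw message (headers included) to generate Q&A pairs.
    text = email.as_string()
    doc = doctran.parse(content=text)
    # Top-level await works in Jupyter/IPython; in a script, run this loop inside an async function.
    docs.append(await doc.interrogate().execute())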

## Benchmark with raw document as vectors
Vectorize as normal, indexing the emails based on vectors generated from the raw email strings.

In [62]:
# One Document per email, embedded from the raw message text.
regular_docs = [Document(page_content=doc.raw_content) for doc in docs]
email_db = FAISS.from_documents(regular_docs, embeddings)
query = "What are common topics phillip allen talks to his colleagues about?"
results = email_db.similarity_search_with_score(query, k=20)

### ❌ Result
Vectors computed from the raw emails sit relatively far from the query vector, with distances of roughly 0.44 to 0.57.
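
To put these distances in context: if the index uses LangChain's default FAISS configuration (a flat L2 index, which reports squared Euclidean distance) and the embeddings are unit length, as OpenAI's embeddings are documented to be, then distance `d` and cosine similarity are related by `d = 2 - 2 * cos`. Both conditions are assumptions about the setup, not something this notebook verifies. A minimal sketch of the conversion in pure Python, with made-up example vectors:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_from_squared_l2(d):
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d / 2."""
    return 1 - d / 2


# Check the identity on two arbitrary unit vectors.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
assert math.isclose(cosine_from_squared_l2(squared_l2(a, b)), cosine(a, b))

# Under the assumptions above, convert a distance from the notebook's range.
print(round(cosine_from_squared_l2(0.44), 2))
```

Under those assumptions, a lower distance corresponds directly to a higher cosine similarity, so the ranking is the same either way.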

In [63]:
df = pd.DataFrame(results, columns=["email", "distance"])
df

| | email | distance |
|---|---|---|
| 0 | `page_content='Message-ID: <14486500.1075863687...` | 0.439633 |
| 1 | `page_content='Message-ID: <5469026.10758556672...` | 0.468381 |
| 2 | `page_content='Message-ID: <12338129.1075855667...` | 0.491914 |
| 3 | `page_content='Message-ID: <27394834.1075855703...` | 0.492709 |
| 4 | `page_content='Message-ID: <31461370.1075855667...` | 0.502439 |
| 5 | `page_content='Message-ID: <20188486.1075855667...` | 0.511111 |
| 6 | `page_content='Message-ID: <12536946.1075855667...` | 0.513767 |
| 7 | `page_content="Message-ID: <558883.107585566726...` | 0.517601 |
| 8 | `page_content='Message-ID: <20513208.1075855667...` | 0.522606 |
| 9 | `page_content='Message-ID: <18752461.1075855665...` | 0.56545 |


## Benchmark with questions and answers as vectors
Index each email by the question/answer pairs generated from it, rather than by the raw email itself. Each pair becomes its own vector, and the raw email is kept in the document metadata so a hit can be traced back to its source.

In [64]:
qa_docs = []
for doc in docs:
    # One Document per Q&A pair; keep the source email in metadata.
    qa_docs.extend(
        Document(page_content=json.dumps(qa), metadata={"raw_email": doc.raw_content})
        for qa in doc.extracted_properties["questions_and_answers"]
    )
qa_db = FAISS.from_documents(qa_docs, embeddings)
query = "What are common topics phillip allen talks to his colleagues about?"
results = qa_db.similarity_search_with_score(query, k=20)

### 🎯 Result
Vectors computed from question/answer pairs sit noticeably closer to the query vector (0.32 - 0.46).

In [65]:
df = pd.DataFrame(results, columns=["qa_pair", "distance"])
df

| | qa_pair | distance |
|---|---|---|
| 0 | `page_content='{"question": "Who did Phillip Al...` | 0.326656 |
| 1 | `page_content='{"question": "What does Phillip ...` | 0.339122 |
| 2 | `page_content='{"question": "Who sent the email...` | 0.363461 |
| 3 | `page_content='{"question": "Who sent the email...` | 0.363665 |
| 4 | `page_content='{"question": "What does Phillip ...` | 0.374621 |
| 5 | `page_content='{"question": "What does Phillip ...` | 0.392315 |
| 6 | `page_content='{"question": "What is Phillip as...` | 0.398435 |
| 7 | `page_content='{"question": "Who forwarded the ...` | 0.40539 |
| 8 | `page_content='{"question": "What does Phillip ...` | 0.410968 |
| 9 | `page_content='{"question": "What does Phillip ...` | 0.417999 |
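
Because each email now contributes several vectors, a top-k search can return multiple pairs from the same email. One way to map hits back to distinct source emails is to keep only the closest pair per email. The sketch below uses plain tuples with hypothetical ids, texts, and distances; in the notebook, the grouping key would be `doc.metadata["raw_email"]` from each `(Document, distance)` result.

```python
from typing import Dict, List, Tuple

# (source_email_id, qa_text, distance); all values here are made up for illustration.
Hit = Tuple[str, str, float]


def best_hit_per_email(hits: List[Hit]) -> List[Hit]:
    """Keep the lowest-distance hit for each source email, sorted by distance."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        email_id, _, distance = hit
        if email_id not in best or distance < best[email_id][2]:
            best[email_id] = hit
    return sorted(best.values(), key=lambda h: h[2])


sample_hits = [
    ("email-a", '{"question": "Who sent the email?"}', 0.36),
    ("email-b", '{"question": "What does the sender ask for?"}', 0.34),
    ("email-a", '{"question": "Who received the email?"}', 0.33),
]

for email_id, qa_text, distance in best_hit_per_email(sample_hits):
    print(email_id, distance, qa_text)
```

Keeping the minimum distance per email is one possible design choice; averaging the distances of all matching pairs would instead favor emails that match the query on several points.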


## Benchmarking with additional questions

In [67]:
questions = [
    "What are common topics phillip allen talks to his colleagues about?",
    "What are the top concerns expressed in these emails?",
    "Who exhibits the most leadership in these emails?",
    "How would you describe the company culture based on these emails?",
    "Summarize the decisions made",
    "List all the people who were involved in the decision making process",
]
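def average_distance(hits):
    """Mean distance over a list of (Document, distance) results."""
    return sum(score for _, score in hits) / len(hits)

results = []
for question in questions:
    # Search both indexes with the current question. Note: the recorded output
    # below shows a nearly constant email_distance because that run queried
    # email_db with the fixed `query` from the earlier cell.
    email_results = email_db.similarity_search_with_score(question, k=20)
    qa_results = qa_db.similarity_search_with_score(question, k=20)
    results.append((question, average_distance(email_results), average_distance(qa_results)))
df = pd.DataFrame(results, columns=["question", "email_distance", "qa_distance"])
df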

| | question | email_distance | qa_distance |
|---|---|---|---|
| 0 | What are common topics phillip allen talks to ... | 0.502664 | 0.413119 |
| 1 | What are the top concerns expressed in these e... | 0.502738 | 0.402262 |
| 2 | Who exhibits the most leadership in these emails? | 0.502664 | 0.403794 |
| 3 | How would you describe the company culture bas... | 0.502664 | 0.425694 |
| 4 | Summarize the decisions made | 0.502664 | 0.511507 |
| 5 | List all the people who were involved in the d... | 0.502664 | 0.502365 |


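In this run, vectorizing the question/answer pairs gives lower average distances than vectorizing the raw emails for the four queries phrased as questions (0.40 to 0.43 versus about 0.50). For the two queries phrased as instructions, the two approaches are roughly level (0.50 to 0.51). In other words, the improvement holds **as long as the user query is phrased as a question and not as instructions.**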
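
The size of the gap can be read directly off the reported averages. The sketch below copies the values from the table above and computes the absolute and relative difference per question; the helper name `distance_gaps` is hypothetical. The `email_distance` column is nearly identical across rows, so the gap mostly tracks `qa_distance`.

```python
# (question, email_distance, qa_distance), copied from the table above.
ROWS = [
    ("What are common topics phillip allen talks to his colleagues about?", 0.502664, 0.413119),
    ("What are the top concerns expressed in these emails?", 0.502738, 0.402262),
    ("Who exhibits the most leadership in these emails?", 0.502664, 0.403794),
    ("How would you describe the company culture based on these emails?", 0.502664, 0.425694),
    ("Summarize the decisions made", 0.502664, 0.511507),
    ("List all the people who were involved in the decision making process", 0.502664, 0.502365),
]


def distance_gaps(rows):
    """Return (question, absolute gap, relative gap), where gap = email - qa.

    A positive gap means the Q&A index was closer to the query.
    """
    return [(q, e - qa, (e - qa) / e) for q, e, qa in rows]


for question, gap, rel in distance_gaps(ROWS):
    print(f"{gap:+.4f} ({rel:+.1%})  {question}")
```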

## Conclusion
For use cases where users are expected to **ask questions** rather than give instructions to an LLM, converting source documents into question and answer pairs is likely to make vector retrieval more reliable. For example, this preprocessing technique is likely better suited to customer support or knowledge base search, and less well suited to agent-based workflows.