## OASIS Search Engine, The Beginning

# Dataset: Enron Email Dataset

We are providing you with a small collection of emails from the Enron Email Dataset. The Enron Email Dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron. The full corpus contains a total of about 0.5M messages (https://www.cs.cmu.edu/~enron/). For this homework, we will use a small subset of the data. The subset contains 814 emails extracted from the `_sent_mail` of Arnold-J. We have zipped the 814 files (each file contains the information of an email). The zipped file is available on Canvas as `enron_814.zip`. The subset we provide is about 1.1MB. You should treat each email as a unique document to be indexed by your system. You can download the data from Canvas to your local filesystem. We're going to use these emails as the basis of OASIS, our Open Access Searchable Information System!


Below is an example of one email.

```text
Message-ID: <33025919.1075857594206.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 13:09:00 -0800 (PST)
From: john.arnold@enron.com
To: slafontaine@globalp.com
Subject: re:spreads
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: slafontaine@globalp.com @ ENRON
X-cc:
X-bcc:
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

saw a lot of the bulls sell summer against length in front to mitigate
margins/absolute position limits/var.  as these guys are taking off the
front, they are also buying back summer.  el paso large buyer of next winter
today taking off spreads.  certainly a reason why the spreads were so strong
on the way up and such a piece now.   really the only one left with any risk
premium built in is h/j now.   it was trading equivalent of 180 on access,
down 40+ from this morning.  certainly if we are entering a period of bearish
to neutral trade, h/j will get whacked.  certainly understand the arguments
for h/j.  if h settles $20, that spread is probably worth $10.  H 20 call was
trading for 55 on monday.  today it was 10/17.  the market's view of
probability of h going crazy has certainly changed in past 48 hours and that
has to be reflected in h/j.




slafontaine@globalp.com on 12/13/2000 04:15:51 PM
To: slafontaine@globalp.com
cc: John.Arnold@enron.com
Subject: re:spreads



mkt getting a little more bearish the back of winter i think-if we get another
cold blast jan/feb mite move out. with oil moving down and march closer flat
px
wide to jan im not so bearish these sprds now-less bullish march april as
well.
```

# Part 1: Read and Parse the Email Data

## Print the first two documents

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

import os
dirpath = "/content/drive/MyDrive/ML_633/enron_814" #can make changes to the path accordingly

files = []
for file in os.listdir(dirpath):
    file_path = os.path.join(dirpath, file)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            outfile = f.read()
            files.append(outfile)

# print(files[0])
# Converting the files into list of dictionaries of document ID and document
docs = []

for f in files:
  sets = {}
  w = f.split()[1].split('.')
  s = w[0].split("<", 1)[1]
  sets['Document-ID'] = s+'.'+w[1]
  mssg = f.split('Jarnold.nsf',1)[1] #considered Jarnold.nsf word to split since it is common in all files

  # remove leading blank lines from mssg
  for i in mssg.splitlines():
    if i != '':
      break
    else:
      mssg = mssg[1:]

  sets['content'] = mssg
  docs.append(sets)

#print first two docs
# print(docs[0:2])
print(docs[0]['Document-ID'], docs[0]['content'])
print(docs[1]['Document-ID'], docs[1]['content'])

12604682.1075857594601 someone had a foul odor for sure




Eric Thode@ENRON
11/30/2000 08:25 AM
To: John Arnold/HOU/ECT@ECT
cc:  
Subject: Photograph for eCompany Now

Thanks for your help with the photographer.  I just heard from Ann Schmidt 
and Margaret Allen that he was, shall we say, a "fragrant photographer."  


15267340.1075857594923 eat shit




John J Lavorato@ENRON
11/18/2000 01:01 PM
To: John Arnold/HOU/ECT@ECT
cc:  
Subject: 

Football bets 200 each

Minn -9.5
Buff +2.5
Phil -7
Indi -4.5
Cinnci +7
Det +6
clev +16
Den +9.5
Dall +7.5
Jack +3.5





Now that we can read the documents, let's move on to tokenization. We are going to simplify things for you:
1. You should **lowercase** all words.
2. Replace line breaks (e.g., \n, \n\n), punctuation, dashes and slashes (e.g., -, /), and special characters (\u2019, \u2005, etc.) with a single space (" ").
3. Tokenize the documents by splitting on whitespace.
4. Then only keep words that have a-zA-Z in them.

In [None]:
import string

def tokens(doc):
  info = doc['content'].lower().replace("\n", " ")
  # Replace punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  w = ""
  for ch in info:
      if ch in string.punctuation:
        w += " "
      else:
        w += ch
  words = w.split()
  clear_words = list(filter(str.isalpha, words)) #to check for alphabets
  doc['tokens'] = clear_words
  # print(doc['content'])

## Print the first two documents after tokenizing (5 points)

* DocumentID Tokens

In [None]:

for doc in docs:
  tokens(doc)
print(docs[0]["Document-ID"], docs[0]["tokens"])
print(docs[1]["Document-ID"], docs[1]["tokens"])

12604682.1075857594601 ['someone', 'had', 'a', 'foul', 'odor', 'for', 'sure', 'eric', 'thode', 'enron', 'am', 'to', 'john', 'arnold', 'hou', 'ect', 'ect', 'cc', 'subject', 'photograph', 'for', 'ecompany', 'now', 'thanks', 'for', 'your', 'help', 'with', 'the', 'photographer', 'i', 'just', 'heard', 'from', 'ann', 'schmidt', 'and', 'margaret', 'allen', 'that', 'he', 'was', 'shall', 'we', 'say', 'a', 'fragrant', 'photographer']
15267340.1075857594923 ['eat', 'shit', 'john', 'j', 'lavorato', 'enron', 'pm', 'to', 'john', 'arnold', 'hou', 'ect', 'ect', 'cc', 'subject', 'football', 'bets', 'each', 'minn', 'buff', 'phil', 'indi', 'cinnci', 'det', 'clev', 'den', 'dall', 'jack']
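
The `tokens` function above only replaces the ASCII characters in `string.punctuation`, so a word containing a special character such as \u2019 is discarded by the `str.isalpha` filter instead of being split. The following is a minimal sketch of one way to cover step 2 more fully with a regular expression; the function name `tokenize_text` is hypothetical and not part of the assignment code.

```python
import re

def tokenize_text(text):
    """Lowercase, replace punctuation/special characters with spaces, keep a-z-only tokens."""
    lowered = text.lower()
    # \w matches letters, digits and underscore, and \s matches whitespace
    # (including Unicode spaces such as \u2005). Everything else, ASCII or
    # Unicode punctuation alike, becomes a space. Underscore is listed
    # separately because \w would otherwise keep it.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Keep only tokens made up entirely of the letters a-z.
    return [tok for tok in cleaned.split() if re.fullmatch(r"[a-z]+", tok)]

print(tokenize_text("Don\u2019t sell h/j at 10:30, re-check P&L!"))
# ['don', 't', 'sell', 'h', 'j', 'at', 're', 'check', 'p', 'l']
```

Because this variant splits on more characters than the original function, it would yield a slightly different dictionary size than the one reported in this notebook.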


## Dictionary Size

Next, report the size of your dictionary, that is, how many unique tokens appear across all the documents.

In [None]:

wordfreq = []
wordlist = []
for doc in docs:
  for word in doc["tokens"]:
    wordlist.append(word)

for w in wordlist:
  wordfreq.append(wordlist.count(w))

print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(dict(zip(wordlist, wordfreq))))

List

Frequencies
[34, 119, 1879, 1, 1, 1165, 58, 31, 8, 2293, 563, 3768, 1514, 999, 1345, 2600, 2600, 820, 936, 1, 1165, 1, 179, 222, 1165, 577, 50, 576, 4111, 2, 1694, 225, 27, 704, 11, 8, 1795, 74, 96, 846, 199, 336, 20, 712, 46, 1879, 1, 2, 4, 11, 1514, 137, 50, 2293, 662, 3768, 1514, 999, 1345, 2600, 2600, 820, 936, 6, 3, 63, 3, 2, 5, 1, 2, 3, 3, 2, 1, 8, 102, 64, 1202, 289, 17, 107, 1, 85, 4111, 420, 54, 9, 7, 179, 14, 3768, 45, 24, 311, 4, 1202, 55, 32, 2293, 43, 39, 375, 704, 7, 40, 4, 563, 3768, 1514, 999, 1345, 2600, 2600, 820, 936, 420, 24, 4, 1514, 295, 1749, 604, 331, 17, 581, 1, 109, 107, 1202, 577, 3, 9, 222, 7, 176, 561, 7, 40, 4, 1345, 2600, 1590, 563, 2293, 43, 39, 375, 704, 7, 40, 4, 662, 3768, 1514, 999, 1345, 2600, 2600, 820, 936, 420, 24, 4, 1514, 41, 1202, 4111, 24, 4, 576, 7, 1694, 604, 45, 3768, 47, 394, 29, 577, 3, 1795, 18, 613, 1749, 604, 17, 1694, 155, 184, 3768, 81, 19, 1795, 1514, 7, 707, 4, 222, 7, 3, 3, 3, 39, 21, 1165, 331, 113, 33, 1465, 4111, 147, 17

In [6]:
token_pairs = dict(zip(wordlist, wordfreq))
print('Unique tokens: ', len(token_pairs))

Unique tokens:  9336


## Top-20 Words

In [None]:

most_popular_tokens = dict(sorted(token_pairs.items(), key=lambda x: x[1], reverse=True))
x = 0
print("Rank. Token, Count")
for key, value in most_popular_tokens.items():
  if x == 20:
    break
  print(f"{x+1}. {key}, {value}")
  x+=1

Rank. Token, Count
1. the, 4111
2. to, 3768
3. ect, 2600
4. enron, 2293
5. a, 1879
6. and, 1795
7. you, 1749
8. of, 1716
9. i, 1694
10. on, 1590
11. john, 1514
12. in, 1465
13. hou, 1345
14. com, 1317
15. is, 1202
16. for, 1165
17. arnold, 999
18. subject, 936
19. s, 869
20. it, 865
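
The frequency loop above calls `wordlist.count(w)` once per token occurrence, which is quadratic in the corpus size. As a hedged alternative, `collections.Counter` gives the same counts in a single pass, and its `most_common` method returns the top-ranked tokens directly. The helper name `count_tokens` and the toy documents below are illustrative assumptions.

```python
from collections import Counter

def count_tokens(documents):
    """Count how often each token occurs across all documents in one pass."""
    counts = Counter()
    for doc in documents:
        counts.update(doc["tokens"])
    return counts

# Toy documents in the same shape as the `docs` list used in this notebook.
sample_docs = [
    {"tokens": ["the", "gas", "floor", "the"]},
    {"tokens": ["gas", "trade", "the"]},
]

counts = count_tokens(sample_docs)
print("Unique tokens:", len(counts))
print("Rank. Token, Count")
for rank, (token, count) in enumerate(counts.most_common(2), start=1):
    print(f"{rank}. {token}, {count}")
```

On the real data, `count_tokens(docs)` should give the same dictionary size and counts as `token_pairs`, although tokens with equal counts may be ordered differently.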


# Part 2: Boolean Retrieval

To make our life easier, output the DocumentIDs in lexicographic order.

In [8]:
# build the index here
# add cells as needed to organize your code

inverted_index = {} #dict to store tokens with their respective docs

for key, value in token_pairs.items():
  documents = []
  for doc in docs:
    if key in doc["tokens"]:
      documents.append(doc["Document-ID"])

  inverted_index[key] = documents

In [9]:
print(sorted(inverted_index["buyer"]))

['13960264.1075857658698', '17195279.1075857655281', '2268604.1075857652949', '25732708.1075857656969', '28376645.1075857655238', '33025919.1075857594206', '3677051.1075857657147', '5304428.1075857597649', '5307647.1075857657213']
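
The index-building loop above rescans every document's token list once for every vocabulary term. A single pass over the documents is enough to build the same mapping. The sketch below assumes the same document dictionaries (`Document-ID`, `tokens`); the function name `build_inverted_index` and the toy data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        # set() so that a token repeated within one email is recorded once
        for token in set(doc["tokens"]):
            index[token].add(doc["Document-ID"])
    # Sorting the string IDs gives lexicographic order, as the assignment asks
    return {token: sorted(ids) for token, ids in index.items()}

sample_docs = [
    {"Document-ID": "d2", "tokens": ["gas", "floor", "gas"]},
    {"Document-ID": "d1", "tokens": ["gas", "buyer"]},
]
toy_index = build_inverted_index(sample_docs)
print(toy_index["gas"])    # ['d1', 'd2']
print(toy_index["buyer"])  # ['d1']
```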


## Running the five queries

In [10]:
# search for the input using your index and print out ids of matching documents.
def pre_process(query):
  info = query.lower().replace("\n", " ")
  # Replace punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  w = ""
  for ch in info:
      if ch in string.punctuation:
        w += " "
      else:
        w += ch
  words = w.split()
  clear_words = list(filter(str.isalpha, words)) #to check for alphabets
  return clear_words

def boolean(query):
  exp = pre_process(query)
  # None means "no search term seen yet". An empty set is a valid
  # intermediate result and must not be confused with the initial state.
  res = None
  operation = 'AND'

  i = 0
  while i < len(exp):
    token = exp[i]
    if token == "and":
      operation = 'AND'
    elif token == "or":
      operation = 'OR'
    elif token == "not":
      # exclude the documents that contain the term following NOT
      if i + 1 < len(exp) and res is not None:
        exclude_set = set(inverted_index.get(exp[i + 1], []))
        res = res.difference(exclude_set)
      i += 1
    else:
      current_set = set(inverted_index.get(token, []))
      if res is None:
        res = current_set
      elif operation == 'AND':
        res = res.intersection(current_set)
      elif operation == 'OR':
        res = res.union(current_set)
    i += 1

  return sorted(res or [])
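
The `boolean` function above applies operators strictly left to right, with no operator precedence and no parentheses, so `a OR b AND c` is evaluated as `(a OR b) AND c`. The self-contained sketch below illustrates that behaviour on a toy index; `evaluate_left_to_right` and `toy_index` are hypothetical names used only for this illustration.

```python
def evaluate_left_to_right(query_tokens, index):
    """Apply AND/OR/NOT strictly left to right over posting lists."""
    result = None      # None means no search term has been seen yet
    operation = "and"  # operator applied to the next plain term
    negate = False     # set when the previous token was "not"
    for token in query_tokens:
        if token in ("and", "or"):
            operation = token
        elif token == "not":
            negate = True
        else:
            postings = set(index.get(token, []))
            if negate:
                if result is not None:
                    result = result - postings
                negate = False
            elif result is None:
                result = postings
            elif operation == "and":
                result = result & postings
            else:
                result = result | postings
    return sorted(result or [])

toy_index = {
    "winter": ["d1", "d2"],
    "summer": ["d2", "d3"],
    "risk": ["d3"],
    "crazy": ["d2"],
}
print(evaluate_left_to_right("winter or summer and risk".split(), toy_index))
# ['d3']
print(evaluate_left_to_right("winter or summer and not crazy".split(), toy_index))
# ['d1', 'd3']
```

With conventional precedence (AND binding tighter than OR), the first query would instead return all three documents, which is worth keeping in mind when writing multi-operator queries.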

Now show the results for the query: `buyer`

In [None]:

q1 = "buyer"
print("Query: ",q1, "\n")
res = boolean(q1)

for id in res[:5]:
  for doc in docs:
    if doc["Document-ID"] == id:
        print(f"Document-ID: {id}, Content: {doc['content']}")
        print("-----------------------------------------------Next Doc----------------------------------------------------------------------------")
        break

Query:  buyer 

Document-ID: 13960264.1075857658698, Content: very useful...thx.   keep me posted




Caroline Abramo@ENRON
12/22/2000 11:41 AM
To: John Arnold/HOU/ECT@ECT, Mike Maggi/Corp/Enron@Enron, Jennifer 
Fraser/HOU/ECT@ECT
cc: Per Sekse/NY/ECT@ECT 
Subject: fund views

Hi- all the funds are trying to figure out what the play is for next year- 
major divergence of opinions.  Most everyone we talk to takes a macro view.

+ Dwight Anderson from Tudor thinks anything above $6 is a sale from the 
perspective of shut in industrial demand- he believes that between $6-7 no 
industrial (basic industry type) can operate.  He tracks all the plant 
closures similar to what Elena does in Mike Robert's group but it seems on a 
more comprehensive level (Jen- it would be good for fundamentals to track 
this number- do some scenario analysis on it under various economic 
conditions- like recession!!).  I will try to find out what his total number 
for turned back gas is- just ammonia is a littl

Now show the results for the query: `margins AND limits`




In [None]:

q2 = "margins AND limits"
print("Query: ",q2, "\n")
res = boolean(q2)

for id in res[:5]:
  for doc in docs:
    if doc["Document-ID"] == id:
        print(f"Document-ID: {id}, Content: {doc['content']}")
        print("-----------------------------------------------Next Doc----------------------------------------------------------------------------")
        break

Query:  margins AND limits 

Document-ID: 33025919.1075857594206, Content: saw a lot of the bulls sell summer against length in front to mitigate 
margins/absolute position limits/var.  as these guys are taking off the 
front, they are also buying back summer.  el paso large buyer of next winter 
today taking off spreads.  certainly a reason why the spreads were so strong 
on the way up and such a piece now.   really the only one left with any risk 
premium built in is h/j now.   it was trading equivalent of 180 on access, 
down 40+ from this morning.  certainly if we are entering a period of bearish 
to neutral trade, h/j will get whacked.  certainly understand the arguments 
for h/j.  if h settles $20, that spread is probably worth $10.  H 20 call was 
trading for 55 on monday.  today it was 10/17.  the market's view of 
probability of h going crazy has certainly changed in past 48 hours and that 
has to be reflected in h/j.




slafontaine@globalp.com on 12/13/2000 04:15:51 PM
To: s

Now show the results for the query: `winter OR summer`

In [None]:

q3 = "winter OR summer"
print("Query: ",q3, "\n")
res = boolean(q3)

for id in res[:5]:
  for doc in docs:
    if doc["Document-ID"] == id:
        print(f"Document-ID: {id}, Content: {doc['content']}")
        print("-----------------------------------------------Next Doc----------------------------------------------------------------------------")
        break

Query:  winter OR summer 

Document-ID: 10353423.1075857652669, Content: maybe.  hydro situation dire in west.  think water levels are at recent 
historical lows.  problem is from gas standpoint, west is an island right 
now.  every molecle that can go there is.  so will provide limited support to 
prices in east.  hydro in east is actually very healthy.  would assume your 
markets are targeting eastern u.s. so i dont know if hydro problem in west is 
that relevant.




Sarah Mulholland
04/04/2001 08:09 AM
To: John Arnold/HOU/ECT@ECT
cc:  
Subject: Re: us fuel 4/2/01

interesting comment from singapore........

hope things are going well up there.
---------------------- Forwarded by Sarah Mulholland/HOU/ECT on 04/04/2001 
08:08 AM ---------------------------


Hans Wong
04/04/2001 08:05 AM
To: Sarah Mulholland/HOU/ECT@ECT
cc: Niamh Clarke/LON/ECT@ECT, Stewart Peter/LON/ECT@ECT, Caroline 
Cronin/EU/Enron@Enron, Angela Saenz/ENRON@enronXgate 
Subject: Re: us fuel 4/2/01  

i was reading 

Now show the results for the query: `buyers AND risk AND NOT crazy`

In [None]:

q4 = "buyers AND risk AND NOT crazy"
print("Query: ",q4, "\n")
res = boolean(q4)

for id in res[:5]:
  for doc in docs:
    if doc["Document-ID"] == id:
        print(f"Document-ID: {id}, Content: {doc['content']}")
        print("-----------------------------------------------Next Doc----------------------------------------------------------------------------")
        break

Query:  buyers AND risk AND NOT crazy 

Document-ID: 2726985.1075857655016, Content: ---------------------- Forwarded by John Arnold/HOU/ECT on 03/06/2001 07:48 
AM ---------------------------
From: Ann M Schmidt@ENRON on 03/05/2001 08:23 AM
To: Ann M Schmidt/Corp/Enron@ENRON
cc:  (bcc: John Arnold/HOU/ECT)
Subject: Enron Mentions - 03-04-01

Utility Deregulation: Square Peg, Round Hole?
The New York Times, 03/04/01

3 Executives Considered to Head Military
Los Angeles Times, 03/04/01

Bush leaning toward execs for military
The Seattle Times, 03/04/01

Enron's Chief Denies Role as Energy Villain / Critics regard Kenneth Lay as 
deregulation opportunist
The San Francisco Chronicle, 03/04/01

Enron boss says he's not to blame for profits in energy crisis
Associated Press Newswires, 03/04/01

The Stadium Curse? / Some stocks swoon after arena deals
The San Francisco Chronicle, 03/04/01



Money and Business/Financial Desk; Section 3
ECONOMIC VIEW
Utility Deregulation: Square Peg, Round Ho

Now show the results for the query: `never OR know`

In [None]:

q5 = "never OR know"
print("Query: ",q5, "\n")
res = boolean(q5)

for id in res[:5]:
  for doc in docs:
    if doc["Document-ID"] == id:
        print(f"Document-ID: {id}, Content: {doc['content']}")
        print("-----------------------------------------------Next Doc----------------------------------------------------------------------------")
        break

Query:  never OR know 

Document-ID: 10008095.1075857595829, Content: not me




Brian Hoskins@ENRON COMMUNICATIONS
10/23/2000 03:47 PM
To: John Arnold/HOU/ECT@ECT
cc:  
Subject: Re:

Never mind.  Did you do that?


Brian T. Hoskins
Enron Broadband Services
713-853-0380 (office)
713-412-3667 (mobile)
713-646-5745 (fax)
Brian_Hoskins@enron.net


----- Forwarded by Brian Hoskins/Enron Communications on 10/23/00 03:54 PM 
-----

	John Cheng@ENRON
	Sent by: John Cheng@ENRON
	10/23/00 03:47 PM
		
		 To: John Cheng/NA/Enron@Enron
		 cc: Brian Hoskins/Enron Communications@ENRON COMMUNICATIONS, John 
Cheng/NA/Enron@Enron@ENRON COMMUNICATIONS, Fangming 
Zhu/Corp/Enron@ENRON@ENRON COMMUNICATIONS, Pablo 
Torres/Corp/Enron@ENRON@ENRON COMMUNICATIONS, Don Adam/Corp/Enron@ENRON@ENRON 
COMMUNICATIONS, Marlin Gubser/HOU/ECT@ECT, Mark Hall/HOU/ECT
		 Subject: Re:

All,

Sorry for the lengthy thread.  A service request has been opened with 
Microsoft on the IE5 issues and her application.

Regards,

-jk

## Observations 

Although the boolean search engine does find relevant documents for the given queries, the documents returned may not always be the best ones, as there is no ranking system in place. The results are simply returned in lexicographic order, which may not necessarily reflect the most relevant or high-quality documents. Another point to note is that my search engine treats words like "time" and "times" as separate terms, which may affect the search results.

For straightforward and simple queries, customers would likely be very satisfied with the results. However, when queries become more complex, especially with multiple "AND", "OR", and "NOT" operators, the search time could increase significantly. Furthermore, customers may find it challenging to construct such complex queries unless they have a deep understanding of the search logic. Therefore, I would recommend using this engine primarily for simple queries.

As for the impact of pre-processing, I believe it is a crucial step, as it tokenizes both the documents and the query in the same way, allowing us to find better matches. In this regard, pre-processing positively affects search quality by ensuring cleaner data for comparison. However, it can sometimes have a negative impact, especially when it removes meaningful elements such as punctuation marks (e.g., the colon in a time like 2:15), which may lead to confusion or incorrect matches. Therefore, while pre-processing is important for improving search accuracy, careful consideration must be given to how it handles special cases.

# Part 3: Ranking

We will explore two ranking methods, each offering a different approach to scoring and ranking documents:

### Ranking method A: Ranking with vector space model with TF-IDF

In [None]:

#doc -> TF-IDF
import numpy as np
import math

def countdocs(docs, token):
  return sum([1 for doc in docs if token in doc["tokens"]])

def term_freq(doc, token):
    return doc["tokens"].count(token)

def tf_idf(doc, docs):
  N = len(docs)
  tfidf_scores = {}
  tokens = set(doc['tokens'])
  for token in tokens:
    idf = math.log2(N/countdocs(docs, token))
    tf = math.log2(1 + term_freq(doc, token))
    tfidf_scores[token] = tf * idf
  return tfidf_scores

# print(tf_idf(docs[2],docs))

def query(q):
  query_vec = {}
  words = pre_process(q)
  for i in set(words):
    query_vec[i] = words.count(i)
  return query_vec

# print(query("trade too; much trade"))

def cosine_similarity(query_vec, doc_vec):
  all_terms = set(query_vec.keys()).union(set(doc_vec.keys()))
  dot_product = sum(query_vec.get(term, 0) * doc_vec.get(term, 0) for term in all_terms)
  query_mag = math.sqrt(sum(value ** 2 for value in query_vec.values()))
  doc_mag = math.sqrt(sum(value ** 2 for value in doc_vec.values()))
  if query_mag == 0 or doc_mag == 0:
    return 0.0
  return dot_product / (query_mag * doc_mag)

def tfidf_ranking(q, docs):
  similarity_scores = []
  query_vec = query(q)
  for doc in docs:
    doc_id = doc["Document-ID"]
    doc_content = doc["content"]
    doc_vec = tf_idf(doc, docs)
    score = cosine_similarity(query_vec, doc_vec)
    similarity_scores.append((doc_id, score, doc_content))

  scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)
  for rank, (doc_id, score, doc_content) in enumerate(scores[:5], start=1):
    print(f"Rank: {rank}, Score: {score}, Document-ID: {doc_id}, Document: {doc_content}")
  return

# print out the top-5 retrieved emails
tfidf_ranking("trade", docs)


Rank: 1, Score: 0.3053426239418323, Document-ID: 15827855.1075857658654, Document: torrey:
please set me up to trade crude.
John
Rank: 2, Score: 0.2993771874174144, Document-ID: 32835197.1075857597302, Document: Hey:
I just want to confirm the trades I have in your book.
Trade #1.  I sell 4000 X @ 4652

Trade #2. I buy 4000 X @ 4652
  I sell 4000 X @ 4902

Trade #3 I buy 4000 X @ 5000
  I sell 4000 F @ 5000


Net result: I have 4000 F in your book @ 4902.
Thanks, 
John
Rank: 3, Score: 0.2283779396714837, Document-ID: 2752057.1075857658632, Document: Torrey:
Can you also approve Mike Maggi to trade crude as well.  Thanks for your help.
John
Rank: 4, Score: 0.18569093447163737, Document-ID: 5340834.1075857658345, Document: greg:
what is the (correct) formula you devised for profitability on last trade is 
mid?
Rank: 5, Score: 0.1788072673138548, Document-ID: 3383202.1075857656796, Document: you fucker that's my trade.   i was trying to buy nines the last 20 minutes.  
all i got was scrap
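
The `tfidf_ranking` function above recomputes the TF-IDF vector of every document on each call, and `countdocs` scans the whole collection for every token. As a rough sketch, the document frequencies and document vectors can be computed once and reused across queries. The weighting below is the same `log2(1 + tf) * log2(N / df)` used above; the function name `build_tfidf_vectors` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def build_tfidf_vectors(documents):
    """Return {doc_id: {token: weight}} with weight = log2(1 + tf) * log2(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc["tokens"]))
    vectors = {}
    for doc in documents:
        term_counts = Counter(doc["tokens"])
        vectors[doc["Document-ID"]] = {
            token: math.log2(1 + tf) * math.log2(n_docs / doc_freq[token])
            for token, tf in term_counts.items()
        }
    return vectors

sample_docs = [
    {"Document-ID": "d1", "tokens": ["trade", "crude", "trade"]},
    {"Document-ID": "d2", "tokens": ["crude", "floor"]},
]
vectors = build_tfidf_vectors(sample_docs)
print(vectors["d1"])
```

A query would then be scored by calling `cosine_similarity(query_vec, vectors[doc_id])` for each stored vector, rather than rebuilding the vectors each time.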

### Ranking method B: Ranking with BM25
Finally, let's try the BM25 approach for ranking. Refer to https://en.wikipedia.org/wiki/Okapi_BM25 for the specific formula. You could choose k_1 = 1.2 and b = 0.75 but feel free to try other options.
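
For reference, the scoring function on that page can be written as follows, where $f(t, D)$ is the number of times term $t$ occurs in document $D$, $|D|$ is the document length in tokens, $\text{avgdl}$ is the average document length, $N$ is the number of documents, and $n(t)$ is the number of documents containing $t$:

```latex
\text{score}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\text{avgdl}}\right)}

\text{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

With $k_1 = 1.2$ and $b = 0.75$, the constants become $k_1 + 1 = 2.2$ and $1 - b = 0.25$.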

In [None]:

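def idf(docs, token):
  N = len(docs)
  n = countdocs(docs, token)  # number of documents containing the token
  return math.log((N - n + 0.5) / (n + 0.5) + 1)

def score_bm25(docs, q, k1=1.2, b=0.75):
  query_tokens = pre_process(q)
  # IDF depends only on the token, so compute it once per query token
  idf_scores = {token: idf(docs, token) for token in set(query_tokens)}
  score_bm = []
  size = sum(len(doc["tokens"]) for doc in docs)
  avgdl = size/len(docs)  # average document length in tokens
  for doc in docs:
    score = 0
    d = len(doc["tokens"])
    doc_id = doc["Document-ID"]
    doc_content = doc["content"]
    for token in query_tokens:
      tf = term_freq(doc, token)
      score += (idf_scores[token] * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (d / avgdl)))
    score_bm.append((doc_id, score, doc_content))

  # highest score first; return the top 5 as (doc_id, score, content) tuples
  scores_bm = sorted(score_bm, key=lambda x:x[1], reverse=True)
  return scores_bm[:5]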

# print out the top-5 retrieved emails
res = score_bm25(docs, "gas floor")
for rank, (doc_id, score, doc_content) in enumerate(res, start=1):
    print(f"Rank: {rank}, Score: {score}, Document-ID: {doc_id}, Document: {doc_content}")

Rank: 1, Score: 7.651994537660597, Document-ID: 29559946.1075857598198, Document: Thx for the spreadsheet.   2 questions : What time frame does this entail and 
does the correlation between the trader and AGG GAS include that trader's 
contribution to the floor's P&L.  In other words, is my P&L correlated with 
the floor or is it correlated to the rest of the floor absent me?


   
	Enron North America Corp.
	
	From:  Frank Hayden @ ENRON                           09/19/2000 02:41 PM
	

To: John Arnold/HOU/ECT@ECT
cc:  
Subject: 




Rank: 2, Score: 6.841881655391271, Document-ID: 32732331.1075857597410, Document: John:
I have asked Mike and Larry to spend half an hour each talking to you about 
opportunities on the gas floor.  Please advise if the following schedule is 
unacceptable.  I will be leaving today at 2:15.
Larry 4:00-4:30
Mike 4:30-5:00

Thanks,
John
Rank: 3, Score: 6.816065720711677, Document-ID: 23846275.1075857658302, Document: John:
I would like for you to come talk to 

In [18]:
res1 = score_bm25(docs, "trade")
for rank, (doc_id, score, doc_content) in enumerate(res1, start=1):
    print(f"Rank: {rank}, Score: {score}, Document-ID: {doc_id}, Document: {doc_content}")

Rank: 1, Score: 4.5396847527510085, Document-ID: 32835197.1075857597302, Document: Hey:
I just want to confirm the trades I have in your book.
Trade #1.  I sell 4000 X @ 4652

Trade #2. I buy 4000 X @ 4652
  I sell 4000 X @ 4902

Trade #3 I buy 4000 X @ 5000
  I sell 4000 F @ 5000


Net result: I have 4000 F in your book @ 4902.
Thanks, 
John
Rank: 2, Score: 4.061095982472488, Document-ID: 3383202.1075857656796, Document: you fucker that's my trade.   i was trying to buy nines the last 20 minutes.  
all i got was scraps.  50-100.  i think it's a great trade.




slafontaine@globalp.com on 02/07/2001 01:41:44 PM
To: John.Arnold@enron.com
cc:  
Subject: Re: weather pop



that is nuts-good sale-im gonna sell jun or july otm calls at some point




Rank: 3, Score: 3.990827641616982, Document-ID: 30793972.1075857600929, Document: Jim:
The list I gave you is a list of brokers that can clear through you, NOT 
brokers that you clear exclusively.  You do not clear any of my brokers 
exclusivel

* Observations:


1.  The runtimes of the two ranking approaches differ substantially: BM25 ran in about 10 seconds, while TF-IDF took more than a minute. Much of this gap comes from the implementation, since the TF-IDF ranker rebuilds a weight vector over every token of every document for each query, whereas BM25 only scores the query tokens.
2.  For queries like "trade" as seen above, BM25 produces more relevant and diverse results, while TF-IDF returns documents where the term appears excessively, even if those documents are less relevant.
3.  Overall, BM25 gave better rankings than TF-IDF on the queries tried here.

* Differences:


1.  BM25 limits the influence of repeated words using the k1 parameter, so the term-frequency contribution saturates and term spamming cannot dominate the score. The log-scaled TF used in the TF-IDF ranker dampens repeats but keeps growing with frequency, leaving it more exposed to keyword stuffing.
2.  BM25 balances length normalization with the b parameter, which compares each document's length with the average document length. Raw TF-IDF has no length normalization; the cosine normalization used here instead tends to reward very short emails, as the top-ranked TF-IDF results for "trade" show.
3.  BM25 provides more contextually relevant results compared to TF-IDF.
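
To make the first difference concrete, here is a small sketch comparing the term-frequency factor of BM25 with the `log2(1 + tf)` weighting used by the TF-IDF ranker. The function names are hypothetical, and the document is assumed to have exactly average length so that the length term drops out.

```python
import math

def bm25_tf_component(tf, k1=1.2, b=0.75, doc_len=100, avgdl=100):
    """Term-frequency factor of BM25 (without the IDF multiplier)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

def log_tf_component(tf):
    """Log-scaled term frequency, as in the TF-IDF ranker."""
    return math.log2(1 + tf)

print("tf  bm25   log2(1+tf)")
for tf in (1, 2, 5, 20, 100):
    print(tf, round(bm25_tf_component(tf), 3), round(log_tf_component(tf), 3))
```

The BM25 factor never exceeds `k1 + 1` (2.2 with these settings) no matter how often a term repeats, while the log-scaled factor keeps growing.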



# Part 4: Cool LLM RAG Extension

## Step 1: Query the LLM Directly (Without Retrieval)

In [None]:

!pip install -U -q "google-generativeai>=0.7.2" # Install the Python SDK
import google.generativeai as genai

In [20]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [21]:
# named query1 so that the query() helper defined in Part 3 is not overwritten
query1 = "How is the job market for software field?"
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content(query1)
print(response.text)

The job market for software engineers and related roles remains strong, but the landscape is nuanced and evolving.  Here's a breakdown:

**Positive aspects:**

* **High demand:**  Software developers are in consistently high demand across various industries.  Almost every company, regardless of size or sector, relies on software and needs skilled professionals to build, maintain, and improve it.
* **Numerous specializations:**  The field offers a wide range of specializations, from front-end and back-end development to data science, machine learning, cybersecurity, DevOps, and more.  This diversity provides many career paths and opportunities for individuals with varying interests and skills.
* **Remote work opportunities:**  The software industry has embraced remote work more readily than many others, offering significant flexibility and location independence for many roles.
* **Good salaries:**  Software engineers generally command competitive salaries, especially those with in-deman

## Step 2: Query the LLM with Retrieved Emails (RAG)

In [22]:
# print out the LLM response without the email content
query2 = "How to effectively sell the energy"
print("query: ", query2)
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content(query2)
print("The LLM's response without retrieval:")
print(response.text)
print("----------------------------------------------------------------------")
# print out the LLM response with the email content (RAG)
# BM25
content = score_bm25(docs, query2)
context = ""
for i in content:
  context += i[2]
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content([query2, context])
print("The LLM's response with retrieval (RAG):")
print(response.text)

print("------------------------------------------------------------------------")
print("Results of BM25 search: ")
for rank, (doc_id, score, doc_content) in enumerate(content, start=1):
    print(f"Rank: {rank}, Score: {score}, Document-ID: {doc_id}, Document: {doc_content}")

query:  How to effectively sell the energy
The LLM's response without retrieval:
Selling energy effectively depends heavily on *what kind of energy* you're selling.  Are you selling:

* **Electricity (to homes/businesses)?** This requires a very different approach than...
* **Renewable energy credits (RECs)?**  Or...
* **Fossil fuels (oil, gas, coal)?**  Or...
* **Energy efficiency services (audits, installations)?** Or...
* **Energy drinks?**

The strategies for each are vastly different.  However, some general principles apply across the board:

**General Principles for Selling Energy (any type):**

* **Understand your target audience:**  Their needs, concerns, and motivations will dictate your sales approach.  A homeowner concerned about their energy bill will respond differently than a large industrial buyer negotiating a long-term contract.
* **Highlight the value proposition:**  Don't just sell the energy itself; sell the *benefits*.  This might include cost savings, environmenta

In [23]:
query3 = "How to handle financial pressure?"
print("query: ", query3)
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content(query3)
print("The LLM's response without retrieval:")
print(response.text)
print("----------------------------------------------------------------------")
# print out the LLM response with the email content (RAG)
# BM25
content1 = score_bm25(docs, query3)
context1 = ""
for i in content1:
  context1 += i[2]
model = genai.GenerativeModel('gemini-1.5-flash')
resp = model.generate_content([query3, context1])
print("The LLM's response with retrieval (RAG):")
print(resp.text)

print("------------------------------------------------------------------------")
print("Results of BM25 search: ")
for rank, (doc_id, score, doc_content) in enumerate(content1, start=1):
    print(f"Rank: {rank}, Score: {score}, Document-ID: {doc_id}, Document: {doc_content}")

query:  How to handle financial pressure?
The LLM's response without retrieval:
Financial pressure can be incredibly stressful, but it's manageable with a proactive and structured approach. Here's a breakdown of how to handle it:

**1. Assess Your Situation:**

* **Track your spending:** Use budgeting apps, spreadsheets, or even a notebook to monitor where your money is going. Identify areas where you can cut back.  Be honest with yourself – this is crucial.
* **List your income and expenses:**  Clearly see the difference between the two.  Are you spending more than you earn? This is the core issue that needs addressing.
* **Identify your debts:** List all your debts (credit cards, loans, etc.), their interest rates, minimum payments, and total balances.  This gives you a clear picture of your debt burden.
* **Determine the source of the pressure:** Is it unexpected expenses, a job loss, low income, high debt, or a combination of factors?  Understanding the root cause is essential for 

## Discussion

*Effect of retrieval on the LLM's response:*

* Yes, retrieval improved contextual relevance by incorporating domain-specific knowledge, such as Enron trading practices in the financial pressure example.
* In some cases, it also shifted the focus: rather than general financial advice, the RAG response provided insights from real-world scenarios. The relevance of the response largely depended on how well the retrieved context matched the query.
* It also enhanced factual accuracy by grounding the response in retrieved content rather than relying only on the LLM's pre-trained knowledge.

*Where retrieval hurt performance:*
* In the "energy selling" example, the retrieved documents were not entirely relevant. Since BM25 selected documents based on loose keyword matches, the response ended up focusing on market dynamics from 2000-2001 rather than providing general sales strategies.
* Some retrieved documents repeated common knowledge, which led to longer responses without adding meaningful value.
* In addition, outdated information in the context, as in the energy selling example, could mislead the LLM.

*Ideas for improving:*
* Filtering the retrieved context to remove redundant or less relevant information could improve response quality.
* An LLM-based re-ranking model on top of BM25 could also select more relevant context for better responses.
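
As a rough sketch of the filtering idea, the helper below drops low-scoring emails, truncates long ones, and labels each snippet before building a single prompt string. The function name `build_rag_prompt`, the `min_score` and `max_chars` thresholds, and the prompt wording are all assumptions to be tuned, not values tested in this notebook.

```python
def build_rag_prompt(question, ranked_docs, min_score=1.0, max_chars=1500):
    """Build a prompt from (doc_id, score, content) tuples such as those returned by score_bm25."""
    sections = []
    for doc_id, score, content in ranked_docs:
        if score < min_score:
            continue  # skip weak keyword matches
        snippet = content.strip()[:max_chars]  # cap very long emails
        sections.append(f"[Email {doc_id}]\n{snippet}")
    if not sections:
        # Nothing relevant was retrieved: fall back to the plain question
        return question
    context = "\n\n---\n\n".join(sections)
    return (
        "Answer the question using the emails below when they are relevant.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# Hypothetical retrieval results for illustration only.
ranked = [("d1", 5.0, "  sell gas at the floor  "), ("d2", 0.2, "unrelated noise")]
prompt = build_rag_prompt("How to effectively sell the energy", ranked)
print(prompt)
```

The resulting string could then be passed to `model.generate_content(prompt)` in place of the raw concatenated context.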