# Using Raphtory similarity search to uncover the Enron criminal network

The Enron scandal was one of the largest corporate fraud cases in history, leading to the downfall of the company and the conviction of several executives. The graph below shows the steep decline in Enron's stock price between August 2000 and December 2001.

![enron stock price](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/EnronStockPriceAugust2000toJanuary2001.svg/567px-EnronStockPriceAugust2000toJanuary2001.svg.png)

Some important nodes in the Enron network, whether or not they were implicated in the case, are the following:
- kenneth.lay: Kenneth Lay (CEO)
- andrew.fastow: Andrew Fastow (CFO)
- louise.kitchen: Louise Kitchen (Managing Director)
- j.kaminski: Vincent Kaminski (Managing Director for Research, raised strong objections to the financial practices of Andrew Fastow)
- a..shankman: Jeffrey Adam Shankman (Head of the global markets division, charged with white-collar crime)
- michelle.cash: Michelle Cash (Assistant General Counsel)

Now, imagine being the judge in this case, facing a dataset with hundreds of thousands of emails and having the responsibility of finding all the criminals and staff involved. How much time would you need to find all of them and their relationships?

Well, hold tight, because Raphtory now supports similarity search out of the box. This means we can quickly search the whole network of email messages to spot various types of crime, just by submitting a semantic query about the crimes we are interested in uncovering.

## Wait, how does this work? And what is similarity search in the first place?

Similarity search is nothing new. It's a technique that lets you search a set of documents for those that are semantically most similar to a given query.

For instance, the query might be `hiding information`. If we submit that query over a set of documents containing email messages, we are likely to get back the documents that in some way talk about hiding information.
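
To make the idea concrete, here is a minimal sketch of similarity search in plain Python. It is only an illustration: `toy_embed`, `cosine` and `search` are hypothetical helpers, and the bag-of-words "embedding" only captures word overlap, whereas a real embedding model would also match paraphrases.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, documents, limit=2):
    """Return the `limit` documents most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:limit]

# Hypothetical documents, for illustration only
docs = [
    "please keep hiding this information from the auditors",
    "lunch at noon on friday",
    "the quarterly report is attached",
]
top = search("hiding information", docs, limit=1)
print(top)
```

The query and every document go through the same embedding, and the ranking is nothing more than a sort by similarity score.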

And what role does that play in Raphtory land? To cross the bridge, we need to represent the entities in a graph, or some parts of them, as documents. These documents can then be translated into embeddings (vectors) using an embedding function, so that we can search over them. In Raphtory we call this vectorising a graph, and it is as easy as:

```
vg = g.vectorise(embedding_function)
```
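
Judging by the embedding function defined later in this notebook, which wraps `embed_documents`, the function receives a list of texts and returns one vector per text. The sketch below shows that shape with a hypothetical `toy_embedding_function`; the hash-based vectors are deterministic but carry no semantic meaning, so this is only useful for understanding the interface, not for real searches.

```python
import hashlib

DIMENSIONS = 8  # hypothetical and tiny on purpose; real models use hundreds

def toy_embedding_function(texts):
    """Map each text to a fixed-length vector (deterministic, NOT semantic)."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # scale the first bytes of the hash into floats between 0 and 1
        vectors.append([byte / 255.0 for byte in digest[:DIMENSIONS]])
    return vectors

vectors = toy_embedding_function(["first document", "second document"])
print(len(vectors), len(vectors[0]))
```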

Raphtory has a default way to translate graph entities into documents. However, if we have a deep understanding of the semantics living in our graph, we can always create those documents ourselves, insert them as properties, and let Raphtory know which property to use to pick them up:

```
vg = g.vectorise(embedding_function, node_document="document", edge_document="document")
```
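
As a minimal sketch of what "creating the documents ourselves" can look like, the hypothetical helper below composes the text for one email, following the same template this notebook uses later when building the `document` column. The record values are made up for illustration.

```python
def build_document(src, dst, time, message):
    """Compose the text that will be embedded for one email edge."""
    return f"{src} sent a message to {dst} at {time} with the following content:\n{message}"

# Hypothetical record, for illustration only
doc = build_document("alice", "bob", "Mon, 1 Oct 2001 09:00:00 -0700 (PDT)", "See you at 10.")
print(doc)
```

Putting the sender, recipient and time into the text means a query can match on who was talking and when, not only on the message body.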

And that is it. Now running a similarity search query against the graph becomes trivial. We just need to use some of the methods available in the VectorisedGraph to select some documents using a query and then get the documents.

```
vg.append_by_similarity('hiding information', limit=10).get_documents()
```

This was a basic query, but you have many more methods available that allow you to implement complex similarity search algorithms leveraging the graph space between documents. 

Now that we have all the basics, let's get started trying to find some criminals!

### Imports and helper functions

In [1]:
import re
import pandas as pd
import altair as alt
from raphtory import *
from raphtory import algorithms
from raphtory.vectors import *
from langchain.embeddings import HuggingFaceEmbeddings
from email.utils import parsedate_to_datetime, parsedate
from datetime import timezone, datetime
from time import mktime


### Ingesting the data

In [None]:
# helper functions for the parsing of the enron dataset

def extract_sender(text):
    sender_cut = text.split("\nFrom: ")
    if len(sender_cut) > 1:
        email_cut =  sender_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[0]
        else: 
            return
    else:
        return
    
def extract_sender_domain(text):
    sender_cut = text.split("\nFrom: ")
    if len(sender_cut) > 1:
        email_cut =  sender_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[1]
        else: 
            return
    else:
        return
    
def extract_recipient(text):
    recipient_cut = text.split("\nTo: ")
    if len(recipient_cut) > 1:
        email_cut = recipient_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[0]
        else:
            return
    else:
        return
    
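def extract_actual_message(text):
    try:
        # the body starts after the last header (X-FileName) and a blank line
        body = re.split(r"X-FileName: .*\n\n", text)[1]
        # drop quoted replies and keep at most the first 1000 characters
        return re.split(r'-{3,}\s*Original Message\s*-{3,}', body)[0][:1000]
    except (IndexError, TypeError):
        return None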

extract_date = lambda text: text.split("Date: ")[1].split("\n")[0]
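
For comparison, the same fields can be pulled out with Python's standard `email` package instead of manual string splitting. This is an alternative sketch, not the approach used in this notebook: `parse_email` and the raw message are hypothetical, and the code assumes the `From`, `To` and `Date` headers are present.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical raw message in the same header/body layout as the dataset
raw = (
    "Message-ID: <example-1>\n"
    "Date: Tue, 2 Oct 2001 06:22:21 -0700 (PDT)\n"
    "From: alice@enron.com\n"
    "To: bob@enron.com\n"
    "Subject: example\n"
    "X-FileName: example.pst\n"
    "\n"
    "This is the body of the message.\n"
)

def parse_email(text):
    """Return sender, sender domain, first recipient, UTC time and body."""
    msg = message_from_string(text)
    sender, _, domain = parseaddr(msg["From"] or "")[1].partition("@")
    # "To" may list several addresses; like the helpers above, keep only the first
    first_to = (msg["To"] or "").split(",")[0]
    recipient = parseaddr(first_to)[1].partition("@")[0]
    # parse the RFC 2822 date and normalise it to UTC
    sent_at = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    return {"src": sender, "domain": domain, "dst": recipient,
            "time": sent_at, "message": msg.get_payload().strip()}

parsed = parse_email(raw)
print(parsed["src"], parsed["dst"], parsed["time"].isoformat())
```

Parsing the date into a timezone-aware `datetime` also makes it easy to compare or sort emails sent from different time zones.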

In [3]:
# We ingest the email dataset and carry out some cleaning using pandas
enron = pd.DataFrame()
enron['email'] = pd.read_csv('emails.csv', usecols=['message'])['message']
enron['src'] = enron['email'].apply(extract_sender)
enron['dst'] = enron['email'].apply(extract_recipient)
enron['time'] = enron['email'].apply(extract_date)
enron['message'] = enron['email'].apply(extract_actual_message)
enron['message'] = enron['message'].str.strip()

enron = enron.dropna(subset=["src", "dst", "time", "message"])
enron = enron.drop_duplicates(['src', 'dst', 'time', 'message'])
enron = enron[enron['message'].str.len() > 5]
enron = enron[enron['dst'] != 'undisclosed.recipients']
enron = enron[enron['email'].apply(extract_sender_domain) == 'enron.com']

enron['document'] = enron['src'] + " sent a message to " + enron['dst'] + " at " + enron['time'] + " with the following content:\n" + enron['message']

# And then we ingest those emails into a Raphtory graph, where the nodes are people,
# most of them belonging to Enron, and the edges are emails sent between them
raw_graph = Graph()
def ingest_edge(record):
    e = raw_graph.add_edge(record['time'], record['src'], record['dst'], {'document': record['document']})
    raw_graph.add_vertex(record['time'], record['src']).add_constant_properties({'document': ''})
    raw_graph.add_vertex(record['time'], record['dst']).add_constant_properties({'document': ''})
enron.apply(ingest_edge, axis=1)

# Finally we apply a window around the most interesting time period of the Enron bankruptcy,
# the last 4 months of 2001
g = raw_graph.window('2001-09-01 00:00:00', '2002-01-01 00:00:00') # 4 months


And here our vectors module starts playing its role. We first vectorise the graph we just built. For that, as explained before, we need an embedding function that translates documents into vectors. For this task we pick a model that can run locally, `gte-small`, loaded through langchain's `HuggingFaceEmbeddings`. Bear in mind that this is computationally very expensive: from a fresh start, the task takes hours to complete. To ease this, the vectorising process lets you set up a cache file, so that if the embedding for a document is already in it, we just grab it and avoid calling the expensive model. We already did this for you, so there is a file `embedding-cache` in the current directory that should contain all the embeddings we need today.

In [6]:
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
embedding_function = lambda texts: embeddings.embed_documents(texts)

v = g.vectorise(
    embedding_function,
    "./embedding-cache",
    node_document="document",
    edge_document="document",
    verbose=True)


computing embeddings for nodes
computing embeddings for edges
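
The caching idea described above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Raphtory's implementation or its cache file format: `make_cached`, `slow_model` and the JSON file are all hypothetical.

```python
import json
import os
import tempfile

def make_cached(embedding_function, cache_path):
    """Wrap an embedding function so that known texts are read from a JSON cache."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)

    def cached(texts):
        # de-duplicate while keeping order, then keep only unseen texts
        missing = [t for t in dict.fromkeys(texts) if t not in cache]
        if missing:  # only call the expensive model for unseen texts
            for text, vector in zip(missing, embedding_function(missing)):
                cache[text] = vector
            with open(cache_path, "w") as f:
                json.dump(cache, f)
        return [cache[t] for t in texts]

    return cached

calls = []  # records every batch sent to the (pretend) expensive model

def slow_model(texts):
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

path = os.path.join(tempfile.mkdtemp(), "toy-embedding-cache.json")
embed = make_cached(slow_model, path)
embed(["hello", "world"])
embed(["hello", "again"])  # "hello" is served from the cache
print(calls)
```

Because the cache is written to disk, a later run that loads the same file skips the model entirely for documents it has already seen.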


### Running queries

Congratulations! You have successfully loaded a vectorised graph containing four months of the Enron email dataset. Now you can submit any query you want to try to find the criminal network hidden behind this complex spider web. For instance, one of the alleged crimes was hiding information, which later led to the bankruptcy. Let's try to find those interactions!

In [7]:
documents = v.append_edges_by_similarity("hiding information", 10).get_documents()
for doc in documents:
    print(doc.content)
    print('===========================================================================================')
    

eric.thode sent a message to louise.kitchen at Tue, 2 Oct 2001 06:22:21 -0700 (PDT) with the following content:
I guess this means everyone knows!  You just can't hide.
david.port sent a message to john.lavorato at Mon, 19 Nov 2001 07:51:54 -0800 (PST) with the following content:
i knew you would be hiding somewhere
i am afraid i'm going to have to tell on you
andy.rodriquez sent a message to charles.yeung at Tue, 23 Oct 2001 10:29:29 -0700 (PDT) with the following content:
NERC is apparently moving forward aggressively on their initiative to restrict access to market information.  Doug Sewell, who participates on the MAIN Planning Committee, was on a call this morning in which members of MAIN and NERC's Virginia Sulzberger discussed ways to limit access to information.  A new disturbing angle was that apparently, some members indirectly alleged that marketers have "higher turnover" and "loose lips," and were using this as a scare tactic to indicate why the information should be restri
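
Since every edge document starts with the same "sent a message to" template, the people involved in a set of results can be recovered with a small amount of parsing. The sketch below is a hypothetical post-processing step: `count_participants` is not part of Raphtory, and the sample holds only the first lines of the three documents printed above. In the notebook, the input could be `[doc.content for doc in documents]`.

```python
import re
from collections import Counter

# matches the template used when the `document` column was built
HEADER = re.compile(r"^(\S+) sent a message to (\S+) at ")

def count_participants(contents):
    """Tally how often each person appears as sender or recipient."""
    counts = Counter()
    for content in contents:
        match = HEADER.match(content)
        if match:
            counts.update(match.groups())
    return counts

# First lines of the three documents printed above (bodies omitted)
sample = [
    "eric.thode sent a message to louise.kitchen at Tue, 2 Oct 2001 06:22:21 -0700 (PDT) with the following content:",
    "david.port sent a message to john.lavorato at Mon, 19 Nov 2001 07:51:54 -0800 (PST) with the following content:",
    "andy.rodriquez sent a message to charles.yeung at Tue, 23 Oct 2001 10:29:29 -0700 (PDT) with the following content:",
]
counts = count_participants(sample)
print(counts.most_common())
```

With a larger result set, the most frequent names give a quick first list of people worth a closer look.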

In [None]:
# TODO: implement an example of Langchain retriever using raphtory