# Using Raphtory similarity search to uncover the Enron criminal network

The Enron scandal was one of the largest corporate fraud cases in history, leading to the downfall of the company and the conviction of several executives. The graph below illustrates the significant decline in Enron's stock price between August 2000 and December 2001, providing valuable insights into the company's downfall.

![enron stock price](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/EnronStockPriceAugust2000toJanuary2001.svg/567px-EnronStockPriceAugust2000toJanuary2001.svg.png)

Now, put yourself in the judge's seat, confronted with a vast dataset comprising hundreds of thousands of emails. Your responsibility? Identifying every culprit and related elements within. How long would it take to uncover all of them and their connections?

Prepare for a revelation: Raphtory now offers similarity search. This lets you explore the entire network of email messages and quickly pinpoint diverse criminal activities. All it takes is submitting a semantic query pertaining to the specific crimes under investigation.

## Wait, how does this work? And what is similarity search in the first place?

Similarity search isn't a novel concept. It's a powerful technique that sifts through a collection of documents to identify those bearing semantic resemblance to a given query.

Consider a query like `hiding information`. Imagine applying this query across a corpus of email messages; the result would likely yield documents discussing various aspects of concealing information.
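To make the idea concrete, here is a minimal sketch of similarity ranking with hand-made vectors. Cosine similarity is one common choice of measure and is an assumption here, as are the three-dimensional vectors and the sample texts; a real embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", purely for illustration.
documents = {
    "please delete those files before the audit": [0.9, 0.1, 0.0],
    "lunch is at noon on Friday": [0.0, 0.2, 0.9],
    "keep this off the record": [0.8, 0.3, 0.1],
}
# Stands in for the embedding of the query 'hiding information'.
query_vector = [1.0, 0.2, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query_vector, documents[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, documents[text]):.3f}  {text}")
```

The two texts about concealment rank above the unrelated one because their vectors point in nearly the same direction as the query vector.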

What role does that play in Raphtory? To bridge the two worlds, we must represent entities within graphs as documents or sets of documents. These documents are then transformed into embeddings (vectors) using an embedding function so that they can be searched. In Raphtory, this process is referred to as 'vectorising' a graph, and it is as easy as:

```
vg = g.vectorise(embedding_function)
```
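As a rough illustration of what an embedding function might look like, the sketch below maps a list of texts to a list of fixed-length vectors. The shape of the callable (list of strings in, list of float vectors out) is an assumption inferred from how this tutorial later wraps a Langchain model. The hashing scheme, `DIMENSIONS` and `toy_embedding_function` are hypothetical stand-ins and carry no real semantic meaning.

```python
import hashlib

DIMENSIONS = 16  # arbitrary vector size for this toy example

def toy_embedding_function(texts):
    """Map each text to a fixed-length bag-of-words vector (hashing trick)."""
    vectors = []
    for text in texts:
        vector = [0.0] * DIMENSIONS
        for word in text.lower().split():
            # Hash each word into one of the DIMENSIONS buckets and count it.
            digest = hashlib.md5(word.encode("utf-8")).digest()
            vector[digest[0] % DIMENSIONS] += 1.0
        vectors.append(vector)
    return vectors

vectors = toy_embedding_function(
    ["Kenneth Lay is the former CEO of Enron", "hiding information"]
)
print(len(vectors), len(vectors[0]))
```

A real model such as the one used later in this tutorial replaces the word-count buckets with learned dimensions, so that texts with similar meaning end up close together even when they share no words.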

Raphtory has a default way to translate graph entities into documents. However, if we have a deep understanding of our graph's semantics, we can always create those documents ourselves, insert them as properties, and tell Raphtory which property name to use to pick them up:

```
g.add_node(0, 'Kenneth Lay', {'document': 'Kenneth Lay is the former CEO of Enron'})
vg = g.vectorise(embedding_function, node_document="document")
```

Voila! Executing a similarity search query on the graph is now straightforward. Using methods within `VectorisedGraph`, we can select and retrieve documents based on a query:

```
vg.append_by_similarity('hiding information', limit=10).get_documents()
```

This example is a basic query, capturing the top 10 highest-scoring documents. However, Raphtory offers an array of advanced methods, enabling the implementation of complex similarity search algorithms. You can combine different queries into a single selection, or even leverage the graph structure between documents to add more context to a selection using a similarity-based expansion.
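The idea behind a similarity-based expansion can be sketched in plain Python. This is a conceptual toy under stated assumptions, not Raphtory's implementation: the scores, the adjacency map and the helper names `top_k` and `expand` are all hypothetical.

```python
# Hypothetical similarity scores of each document against one query.
scores = {"a": 0.91, "b": 0.40, "c": 0.75, "d": 0.70, "e": 0.62}
# Hypothetical adjacency: which documents sit next to each other in the graph.
neighbours = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "e"],
    "d": ["b"],
    "e": ["c"],
}

def top_k(candidates, k):
    """Return the k highest-scoring candidates, best first."""
    return sorted(candidates, key=scores.get, reverse=True)[:k]

def expand(selection, k):
    """Add the k best-scoring documents adjacent to the current selection."""
    frontier = {n for doc in selection for n in neighbours[doc]} - set(selection)
    return selection + top_k(frontier, k)

flat = top_k(scores, 3)             # plain top-3, ignoring the graph

selection = top_k(scores, 1)        # seed with the best match overall
selection = expand(selection, 1)    # then grow along graph edges
selection = expand(selection, 1)

print(flat)
print(selection)
```

The flat search picks `d` on score alone, while the expansion prefers `e` because it is connected to what has already been selected. That is the sense in which the graph adds context to a selection.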

Now, armed with these fundamentals, let's embark on the quest to unearth some potential criminals!

## Preparing the investigation

In [1]:
import re
import pandas as pd
import altair as alt
from raphtory import *
from raphtory import algorithms
from raphtory import export
from raphtory.vectors import *
from langchain.embeddings import HuggingFaceEmbeddings
from email.utils import parsedate_to_datetime, parsedate
from datetime import timezone, datetime
from time import mktime


First, we define some auxiliary functions for parsing the Enron dataset:

In [2]:
def extract_sender(text):
    sender_cut = text.split("\nFrom: ")
    if len(sender_cut) > 1:
        email_cut =  sender_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[0]
        else: 
            return
    else:
        return
    
def extract_sender_domain(text):
    sender_cut = text.split("\nFrom: ")
    if len(sender_cut) > 1:
        email_cut =  sender_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[1]
        else: 
            return
    else:
        return
    
def extract_recipient(text):
    recipient_cut = text.split("\nTo: ")
    if len(recipient_cut) > 1:
        email_cut = recipient_cut[1].split("\n")[0].split("@")
        if len(email_cut) > 1:
            return email_cut[0]
        else:
            return
    else:
        return
    
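def extract_actual_message(text):
    try:
        # Drop the headers: the body starts after the X-FileName line
        body = re.split(r"X-FileName: .*\n\n", text)[1]
        # Keep only the newest message and cap it at 1000 characters
        return re.split(r'-{3,}\s*Original Message\s*-{3,}', body)[0][:1000]
    except (IndexError, TypeError):
        # No X-FileName header found, or the cell is not a string
        return None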

extract_date = lambda text: text.split("Date: ")[1].split("\n")[0]
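For comparison, the same fields can also be pulled out with Python's standard `email` parser instead of manual string splitting. This is an alternative sketch, not what the notebook does, and the raw message below is a hand-written sample in the style of the dataset rather than an actual record.

```python
from email import message_from_string

# Hand-written sample message, for illustration only.
raw_email = (
    "Date: Sun, 18 Nov 2001 08:05:22 -0800 (PST)\n"
    "From: rod.hayslett@enron.com\n"
    "To: james.saunders@enron.com\n"
    "Subject: (sample)\n"
    "X-FileName: sample.pst\n"
    "\n"
    "That is one of the things we are working on.\n"
)

message = message_from_string(raw_email)

# Split 'user@domain' into its two halves.
sender, _, sender_domain = message["From"].partition("@")
recipient = message["To"].partition("@")[0]
body = message.get_payload().strip()

print(sender, sender_domain, recipient)
print(message["Date"])
print(body)
```

Like the helpers above, this keeps only the text before the first `@` of the `To` header, so a message with several recipients is attributed to the first one.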

Then, we ingest the email dataset and carry out some cleaning using pandas:

In [3]:
enron = pd.DataFrame()
enron['email'] = pd.read_csv('emails.csv', usecols=['message'])['message']
enron['src'] = enron['email'].apply(extract_sender)
enron['dst'] = enron['email'].apply(extract_recipient)
enron['time'] = enron['email'].apply(extract_date)
enron['message'] = enron['email'].apply(extract_actual_message)
enron['message'] = enron['message'].str.strip()

enron = enron.dropna(subset=["src", "dst", "time", "message"])
enron = enron.drop_duplicates(['src', 'dst', 'time', 'message'])
enron = enron[enron['message'].str.len() > 5]
enron = enron[enron['dst'] != 'undisclosed.recipients']
enron = enron[enron['email'].apply(extract_sender_domain) == 'enron.com']

enron['document'] = enron['src'] + " sent a message to " + enron['dst'] + " at " + enron['time'] + " with the following content:\n" + enron['message']

Next, we ingest those emails into a Raphtory graph. Here, individuals serve as nodes, most of them belonging to Enron, and the edges represent email exchanges between them. Our criminal investigation targets the last four months of 2001, coinciding with the Enron bankruptcy, so we will create a window over the graph for that period.

In [4]:
raw_graph = Graph()
def ingest_edge(record):
    e = raw_graph.add_edge(record['time'], record['src'], record['dst'], {'document': record['document']})
    raw_graph.add_node(record['time'], record['src']).add_constant_properties({'document': ''})
    raw_graph.add_node(record['time'], record['dst']).add_constant_properties({'document': ''})
enron.apply(ingest_edge, axis=1)
g = raw_graph.window('2001-09-01 00:00:00', '2002-01-01 00:00:00')
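To see what the window keeps, the sketch below parses the date strings found in the email headers with the standard library and checks them against the same four-month range. It assumes the window bounds are interpreted as UTC and that the range is half-open (start included, end excluded); both are assumptions about the windowing semantics, and `to_utc` and `in_window` are hypothetical helpers.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def to_utc(header_date):
    """Parse an RFC 2822 date header and normalise it to UTC."""
    return parsedate_to_datetime(header_date).astimezone(timezone.utc)

# Same bounds as the window above, assumed to be UTC.
window_start = datetime(2001, 9, 1, tzinfo=timezone.utc)
window_end = datetime(2002, 1, 1, tzinfo=timezone.utc)

def in_window(header_date):
    """True when the timestamp falls in [window_start, window_end)."""
    return window_start <= to_utc(header_date) < window_end

print(to_utc("Sun, 18 Nov 2001 08:05:22 -0800 (PST)"))
print(in_window("Sun, 18 Nov 2001 08:05:22 -0800 (PST)"))
print(in_window("Mon, 14 May 2001 16:39:00 -0700 (PDT)"))
```

The November message falls inside the investigated period, while the May message would be excluded from the windowed graph `g`.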

This is where our `vectors` module comes into play. We are going to vectorise the graph we just built. As previously outlined, this involves an embedding function that translates documents into vectors. For this purpose, we've selected `gte-small`, a model that runs locally and is loaded through Langchain's Hugging Face integration. Note that this operation is computationally very expensive: started from scratch, it can take several hours. To avoid that, the vectorising process accepts a cache file. With this cache, embeddings for previously processed documents are read back directly instead of invoking the resource-intensive model again. Fortunately, we've already taken this step for you: a file named `embedding-cache` in the current directory contains all the embeddings needed for today's task, so the execution will be instant.

In [5]:
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
embedding_function = lambda texts: embeddings.embed_documents(texts)

vg = g.vectorise(
    embedding_function,
    "./embedding-cache",
    node_document="document",
    edge_document="document",
    verbose=True)


computing embeddings for nodes
computing embeddings for edges
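The caching idea can be sketched in plain Python as a wrapper around any embedding function. This only illustrates the principle: the JSON file layout, the hashing of texts into keys and the names `cached_embedding_function` and `slow_model` are assumptions, and say nothing about how Raphtory stores its own cache.

```python
import hashlib
import json
import os
import tempfile

def text_key(text):
    """Stable cache key for a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cached_embedding_function(embedding_function, cache_path):
    """Wrap an embedding function so repeated texts are served from a JSON file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as cache_file:
            cache = json.load(cache_file)

    def embed(texts):
        missing = [text for text in texts if text_key(text) not in cache]
        if missing:
            # Only texts never seen before reach the expensive model.
            for text, vector in zip(missing, embedding_function(missing)):
                cache[text_key(text)] = vector
            with open(cache_path, "w") as cache_file:
                json.dump(cache, cache_file)
        return [cache[text_key(text)] for text in texts]

    return embed

model_calls = []

def slow_model(texts):
    """Stand-in for an expensive model; records how many texts it had to embed."""
    model_calls.append(len(texts))
    return [[float(len(text))] for text in texts]

cache_path = os.path.join(tempfile.mkdtemp(), "toy-embedding-cache.json")
embed = cached_embedding_function(slow_model, cache_path)
first = embed(["hide company debt", "lie to investors"])
second = embed(["hide company debt", "lie to investors"])
print(model_calls)
```

The second call returns the same vectors without touching the model, which is why a pre-built cache file turns hours of computation into a quick lookup.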


## Finding the criminal network

Congratulations! You've successfully loaded a vectorised graph containing four critical months of the Enron email dataset. Using our similarity search engine, we can now submit queries to uncover potential criminal behavior or, at the very least, steer the investigation in the right direction. Our aim is to identify individuals who might hold pertinent information regarding Enron's internal practices. You're encouraged to experiment with your own queries and methodologies. However, for starters, we'll concentrate on three pivotal topics that were crucial in past investigations:
1. Hiding company debt through special purpose entities (SPEs)
2. Manipulation of the energy market
3. Withholding crucial information from investors

We will make use of the following auxiliary functions to help us get the job done. Feel free to explore these functions for insights on handling common tasks.

In [6]:
def print_emails(query, limit):
    for doc in vg.append_edges_by_similarity(query, limit).get_documents():
        print(doc.content)
        print('===========================================================================================')
        
def show_network_for_query(query, limit):
    edges = vg.append_edges_by_similarity(query, limit).edges()
    network = Graph()
    for edge in edges:
        network.add_edge(0, edge.src.name, edge.dst.name)

    network = export.to_pyvis(graph=network, edge_color="#FF0000")
    return network.show('nx.html')


### Hiding company debt

To uncover pertinent communications on this subject, we will use the query:
- `hide company debt`

In [7]:
print_emails("hide company debt", 2)

rod.hayslett sent a message to james.saunders at Sun, 18 Nov 2001 08:05:22 -0800 (PST) with the following content:
That is one of the things we are working on.   At this point in time Dynegy will be responsible for this debt, if theyexercise their rights under the preferred stock agreements, which would leave them with common pledged to the lenders.  The price they paid recognized the debt was there, if it is not there, the price will be higher.   Suffice it to say all of these things will be tken care of before it funds.
--------------------------
Sent from my BlackBerry Wireless Handheld (www.BlackBerry.net)
mariella.mahan sent a message to stanley.horton at Tue, 13 Nov 2001 12:53:54 -0800 (PST) with the following content:
Something for us to talk about during our next staff meeting.

There are three projects which have significant cash flow problems and thus=
 difficulties in meeting debt obligations: these are: SECLP, Panama and Gaz=
a.  In the past, as I suppose we have done in Da

Here, we can find an interesting email in the second position. This message highlights significant cash flow problems in three projects (SECLP, Panama, and Gaza) that face difficulties in meeting debt obligations. It discusses the position of not injecting cash into these companies and being prepared to face default and possible loan acceleration. This might be a good starting point for some investigations.

### Manipulation of the energy market

Among the charges leveled against Enron were allegations of manipulating the energy market by leveraging its influential position. To explore conversations pertaining to this matter and potentially uncover concrete evidence, we'll employ the query:
- `manipulating the energy market`

In [8]:
print_emails("manipulating the energy market", 2)

paulo.issler sent a message to zimin.lu at Mon, 22 Oct 2001 09:49:02 -0700 (PDT) with the following content:
Ed has written the draft for the first assignment. I will be checking it today and making my comments by tomorrow morning. Feel free to make yours. I beleive the book Mananging Energy Price Risk is a great reference for the second assignment.     

Thanks.
Paulo Issler
bill.williams sent a message to alan.comnes at Thu, 25 Oct 2001 12:59:06 -0700 (PDT) with the following content:
Alan,

I have a few questions regarding the emminent implementation of the new target pricing mechanism.
1. Does uninstructed energy still get paid (if not, we cannot hedge financials)
2. The CISO Table 1. lists an unintended consequence as " Target price may be manipulated due to no obligation to deliver"
	Why is there no obligation to deliver?
3. Is there still a load deviation penalty? Or would that be considered seperately?

Thanks for the help.

Bill


Here, in the second returned email, discussions revolve around legal boundaries concerning the new target pricing mechanism. They highlight the potential for manipulation within the new mechanism due to the absence of an obligation to deliver. While this doesn't explicitly confirm illegal conduct, individuals engaged in this conversation might possess valuable insights to elucidate Enron's practices concerning this topic.

### Withholding crucial information from investors

Finally, to address the last point of our investigation, we'll employ the query:

- `lie to investors`

This time we'll just show the subgraph comprising these communications, aiming to uncover the network of individuals that might be involved in this.

In [9]:
show_network_for_query("lie to investors", 8)

nx.html


And at the very center of this network, we find Kenneth Lay, the former Enron CEO. He pleaded not guilty to eleven criminal charges, was convicted of six counts of securities and wire fraud, and faced a maximum of 45 years in prison. However, Lay died on July 5, 2006, before sentencing was to occur.
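One simple way to quantify who sits "at the center" of such a subgraph is to count how many returned emails each person takes part in. The sketch below does this with placeholder names; the edge list is made up for illustration and is not the actual query result. In the notebook, the pairs could be built from `edge.src.name` and `edge.dst.name` as in `show_network_for_query`.

```python
from collections import Counter

# Placeholder (sender, recipient) pairs, not actual query results.
edges = [
    ("person_a", "person_b"),
    ("person_c", "person_a"),
    ("person_a", "person_d"),
    ("person_d", "person_e"),
]

# Degree: number of edges each person appears in, as sender or recipient.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

most_connected, connections = degree.most_common(1)[0]
print(most_connected, connections)
```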

## Bonus: Integrating Raphtory with an LLM

As a bonus for this tutorial, we are going to look at how we can easily integrate Raphtory with a Large Language Model (LLM). There are many ways to accomplish this, but leveraging the Langchain ecosystem seems like an excellent starting point. One option is to define a Langchain retriever using Raphtory. This allows the creation of various chains for diverse purposes, such as question answering setups or agent-driven pipelines.

In this example, we'll build the most basic QA setup, a retrieval-augmented generation (RAG) pipeline. This kind of pipeline combines a document retriever and an LLM. When a question is submitted to the pipeline, it first goes through the document retriever, which extracts relevant documents from a set. These documents are then fed into the LLM alongside the question to provide context for generating the final answer.
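Before wiring this up with Langchain, the flow can be sketched without any framework. Everything here is a toy under stated assumptions: the word-overlap `retrieve` function stands in for a real similarity search, the two documents are invented, and `dummy_llm` simply echoes its prompt.

```python
def retrieve(question, documents, top_k):
    """Toy retriever: rank documents by the number of words shared with the question."""
    question_words = set(question.lower().split())

    def overlap(document):
        return len(question_words & set(document.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_prompt(question, context_documents):
    """Place the retrieved documents ahead of the question."""
    context = "\n".join(context_documents)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\nQuestion: {question}\n"
    )

def dummy_llm(prompt):
    """Placeholder for a real model: echoes the prompt it receives."""
    return f"I'm a dummy LLM model that got the input:\n{prompt}"

# Invented documents, for illustration only.
documents = [
    "email one mentions debt obligations of three projects",
    "email two mentions a draft for the first assignment",
]
question = "who discussed debt obligations"

answer = dummy_llm(build_prompt(question, retrieve(question, documents, top_k=1)))
print(answer)
```

Only the document that shares words with the question reaches the prompt, which is the essential job of the retriever in a RAG pipeline.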

The first step involves creating a Langchain retriever interface for a Raphtory vectorised graph. To do this, we extend the `BaseRetriever` class and implement the `_get_relevant_documents` method, as shown below:

In [10]:
# Document is exported by langchain.schema, not by langchain.schema.retriever
from langchain.schema import BaseRetriever, Document
from langchain.callbacks.manager import CallbackManagerForRetrieverRun
from typing import Optional, Dict, Any, List

def adapt_document(document):
    # Convert a Raphtory edge document into a Langchain Document
    return Document(
        page_content=document.content,
        metadata={
            'src': document.entity.src.name,
            'dst': document.entity.dst.name
        }
    )

class RaphtoryRetriever(BaseRetriever):
    graph: VectorisedGraph
    """Source graph."""
    top_k: Optional[int]
    """Number of items to return."""
    
    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> List[Document]:
        docs = self.graph.append_edges_by_similarity(query, self.top_k).get_documents()
        return [adapt_document(doc) for doc in docs]

Next, we can define the RAG chain using our retriever as the context. In this instance, we'll create a placeholder LLM that answers with the statement `"I'm a dummy LLM model that got the input:"` followed by the input it receives. While this dummy model might not serve as a practical investigative tool, it allows us to observe the output from our retriever in action. If you do want to build a proper pipeline, you can replace the placeholder LLM with a real one using the code snippet below:

```python
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
```

This, however, requires an OpenAI access token. For now, let's proceed with our dummy model:

In [None]:
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

retriever = RaphtoryRetriever(graph=vg, top_k=3)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

llm = lambda input: f"I'm a dummy LLM model that got the input:\n{input.text}"

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Finally, we can invoke the chain with a question:

In [None]:
answer = rag_chain.invoke('which person should I investigate to know more about Enron usage of special purpose entities')
print(answer)