# DISCLAIMER:

* This code resides in the folder `graphml_class/stats/`
* It does not currently run in a notebook or a Docker container: `graph.py` takes forever there, and I haven't figured out why.
* You need to run the code directly on a computer instead.
* The runtime order is: [graphml_class/stats/download.py](graphml_class/stats/download.py) --> [graphml_class/stats/xml_to_parquet.py](graphml_class/stats/xml_to_parquet.py) --> [graphml_class/stats/graph.py](graphml_class/stats/graph.py) --> [graphml_class/stats/motif.py](graphml_class/stats/motif.py)

# Network Motifs with PySpark and GraphFrames

I'll be honest with you... I am a big data, large knowledge graph specialist. All the data we have used so far is very small to me. While it is difficult to cover scalable methods for everything, I wanted to introduce you to a tool for PySpark called [GraphFrames](https://graphframes.github.io/graphframes/docs/_site/index.html). It has some powerful utility methods like [connected components](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#connected-components) that many teams use for tasks like merging nodes during entity resolution (node deduplication). It can also perform property graph [motif finding](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding) on networks with billions of nodes and edges.

GraphFrames are graphs [created from node / edge lists](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#creating-graphframes), each of which is a [pyspark.sql.DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html). This means you get the power of PySpark to write arbitrary graph operations on large datasets, thanks to [Spark](https://spark.apache.org/docs/latest/).

## Imperative vs Declarative

SQL is a declarative language. PySpark is an imperative API that also offers a declarative SQL interface if you prefer it. We'll be using the `pyspark.sql.DataFrame` and `pyspark.sql.functions` APIs to process data step by step. This is different from Pandas, whose APIs are fairly declarative. Keep that in mind below... we get to carefully control how the dataflow works, which can be time-consuming compared to a single Pandas command that does more than what Spark `DataFrames` do, which is MapReduce and iterate :)

## Running Checks

It is easy to mess up an algorithm when building knowledge graphs in PySpark / GraphFrames. You will notice that throughout the script below, I check and print counts as I go. I encourage you to do this as well, or your scripts will seem to run but will build bad knowledge graphs that produce bad answers.
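One way to make these checks a habit is a tiny helper that prints a count and fails loudly when a step produces too few rows. This is only a sketch: `check_count` and `StubFrame` are hypothetical names, not part of PySpark. The helper works with anything that has a `.count()` method, which includes a `pyspark.sql.DataFrame`, and the stub stands in for a real DataFrame so the example runs without a Spark session.

```python
from typing import List, Protocol


class Countable(Protocol):
    """Anything with a .count() method, e.g. a pyspark.sql.DataFrame."""

    def count(self) -> int: ...


def check_count(name: str, df: Countable, minimum: int = 1) -> int:
    """Print the row count for a step and raise if it is below `minimum`."""
    n = df.count()
    print(f"{name}: {n:,} rows")
    if n < minimum:
        raise ValueError(f"{name} has {n:,} rows, expected at least {minimum:,}")
    return n


class StubFrame:
    """Hypothetical stand-in for a DataFrame, used only for this demo."""

    def __init__(self, rows: List[tuple]) -> None:
        self.rows = rows

    def count(self) -> int:
        return len(self.rows)


# A healthy step passes and reports its size
node_count = check_count("nodes", StubFrame([("a",), ("b",), ("c",)]))

# An empty result, such as a join that matched nothing, stops the script
try:
    check_count("edges after join", StubFrame([]))
except ValueError as error:
    print(f"Caught: {error}")
```

With a real pipeline you would call something like `check_count("nodes", nodes_df)` after each join or filter. Keep in mind that `count()` triggers a Spark job, so on very large data you may want to cache the DataFrame first.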

## Running GraphFrames

To run GraphFrames locally from a shell, you can import the package with the `spark-shell` or `pyspark` `--packages` argument:

```bash
pyspark --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12
```

```python
import os
import re
from typing import List

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame, SparkSession
```

```python
# This is actually already set in Docker, just reminding you Java is needed
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
```

```python
# Initialize a SparkSession. You can configure it via: .config("spark.some.config.option", "some-value")
spark: SparkSession = (
    SparkSession.builder.appName("Big Graph Builder")  # Set app name
    .getOrCreate()  # Get or create the SparkSession
)
# sc: SparkContext = spark.sparkContext
```

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/08 05:33:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/08 05:33:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/12/08 05:33:26 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
```


### Real-World Network Motifs and Higher-Order Networks

Remember these slides? This is a real-world use of network motif matching: finding sources of money laundering and terrorism financing risk for banks in financial networks of people and companies.

<center><img src="images/Multiple-Path-Indirect-Ownership-Motif.jpg" width="800px" /></center>
<center>A well known pattern to hide Ultimate Beneficial Ownership (UBO) of a company</center>

<br />

<center><img src="images/PySpark-GraphFrames-Motif-Search-Python-Code.jpg" width="800px" /></center>
<center>The PySpark / GraphFrames pseudo-code that implements [most of] the motif above</center>

<br />

The next motif is more complicated... it uses Sentence Transformers with PySpark ([here's how](https://stackoverflow.com/questions/72398129/creating-a-sentence-transformer-model-in-spark-mllib)) to perform what I think is a new type of network motif: a _semantic network motif_.

<center><img src="images/Corrupt-Incorporation-Services-Motif.jpg" width="800px" /></center>
<center>A more complex <i>semantic network motif</i> using Sentence Transformers to do fuzzy string matching of officer names</center>

<br />

While I don't have time to demonstrate it at present, once you have network motifs, it is possible to group by all members of a motif match and combine their edges into a new, higher-order node with new semantics. If you link or [cluster](https://arxiv.org/abs/1612.08447) these new nodes, they form a [higher-order network](https://arxiv.org/abs/2104.11329). If the motif is a pattern important to your problem domain, nodes in this higher-order network might exist in a _solution space_ rather than your graph's original _problem space_. For example, if you define risk motifs in a financial network, the way they cluster can show you centers of risk. Pattern matching is a different approach than graph machine learning but can yield similarly powerful results.
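The idea can be sketched at toy scale in plain Python, without Spark. This is not the PySpark / GraphFrames implementation. It uses undirected triangles as the motif and links two motif instances whenever they share a member node, which is just one possible linking rule that I am assuming for illustration. The edge list and the function names `find_triangles` and `higher_order_edges` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Triangle = Tuple[str, str, str]

# Hypothetical toy graph: two triangles that share node "c", plus a stray edge
edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "d"), ("d", "e"), ("e", "c"),
    ("x", "y"),
]


def find_triangles(edge_list: List[Tuple[str, str]]) -> List[Triangle]:
    """Find undirected triangles: the motif used in this toy example."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edge_list:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    found: Set[Triangle] = set()
    for node, adjacent in neighbors.items():
        # Two neighbors of `node` that are also connected close a triangle
        for first, second in combinations(sorted(adjacent), 2):
            if second in neighbors[first]:
                found.add(tuple(sorted((node, first, second))))
    return sorted(found)


def higher_order_edges(motifs: List[Triangle]) -> List[Tuple[Triangle, Triangle]]:
    """Link two motif instances when they share at least one member node."""
    return [
        (left, right)
        for left, right in combinations(motifs, 2)
        if set(left) & set(right)
    ]


# Each motif match becomes one node in the higher-order network
motif_nodes = find_triangles(edges)
motif_links = higher_order_edges(motif_nodes)

print("Higher-order nodes:", motif_nodes)
print("Higher-order edges:", motif_links)
```

Here the two triangles become two higher-order nodes joined by one edge, because they overlap at `c`. In GraphFrames the same shape of computation would start from the DataFrame returned by a motif search, followed by a group-by and a self-join on shared members.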

<center><img src="images/Higher-Order-Networks-Using-Edge-Projection-via-Property-Graphlet-Minors.jpg" width="800px" /></center>
<center>Forming a higher-order network using network motif clustering</center>

For more information, check out Stanford SNAP's page on [Higher-order organization of complex networks](https://snap.stanford.edu/higher-order/). The diagrams of overlapping motifs are really cool:

<center><img src="images/Leskovec-Cutting-Motif-Clusters.png" width="600px" /></center>

<br /><br />

<center><img src="images/Leskovec-Motif-Cluster-Efficiency.png" width="600px" /></center>