
# Advanced Graph Analysis & NLP

In this notebook, we combine Natural Language Processing and network analysis to examine a Mafia network. The setting is the following: as members of an investigation unit, we have been observing a network of families that have been associated with nefarious activities. We will use two datasets:

1. PDF files written by an undercover agent
2. A network of interactions between those families

For NLP in Spark we can use the open-source *spark-nlp* library, which allows us to use a variety of Deep Learning models.

Since we are given several PDFs to work with, we need a Python library to parse them inside a ``UDF``, so we first install the external Python library ``pypdf``. The *graphframes* library, which we used in the previous lab, offers the useful ``GraphFrame``, but its choice of graph algorithms is relatively limited. We therefore also install the ``networkx`` library, which offers a range of popular graph algorithms.

In [0]:
!pip3 install pypdf networkx

In [0]:
import io
import numpy as np
import networkx as nx
from pypdf import PdfReader 

Next, we import the ``spark-nlp`` components we need and start a Spark session.

In [0]:
import sparknlp

# Start Spark Session
spark = sparknlp.start()

from sparknlp.base import DocumentAssembler, Pipeline, LightPipeline
from sparknlp.annotator import (
    Tokenizer,
    WordEmbeddingsModel,
    NerDLModel,
    NerConverter
)

import pyspark.sql.functions as F

We load all the reports from our agent as PDF files.

In [0]:
mafia_network_communications = spark.read.format("binaryFile").load("dbfs:/FileStore/crime_letters/*.pdf")

In [0]:
mafia_network_communications.display()

As the next step, we define a Python UDF that takes the binary content of each PDF and converts it to text.

In [0]:
@F.udf(returnType="string")
def pdf_to_text(pdf) -> str:
    """
    We transform a PDF (binary) into a string. The "content" column is already binary, so we read the bytes directly.
    """

    # First we load the binary content
    bytes_stream = io.BytesIO(pdf)

    # We initialize the reader
    reader = PdfReader(bytes_stream)

    # As the final step, we go over each page (note though that in our case our PDFs have only one page each) and extract the text
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"

    return text

In [0]:
mafia_network_communications = mafia_network_communications \
  .withColumn("report_text", pdf_to_text(F.col("content")))

In [0]:
mafia_network_communications.display()

#### Named Entity Recognition
Named Entity Recognition (NER) is a natural language processing (NLP) technique that identifies named entities within text and assigns them to predefined categories such as persons, organizations, locations and dates. Our informant, who has infiltrated the organization, is sending regular letters. You are tasked with building a prototype that automatically processes the reports and extracts the names of the people in each letter, which can then be related to the larger network.

We will manually define a pipeline that first transforms our report text into an embedding representation and then extracts our entities.

In [0]:
# Step 1: Transforms raw texts to "document" annotation
# Code here

# Step 2: Tokenization
# Code here

# Step 3: Get the embeddings using glove_100d
# Code here

# Step 4: Use the ``ner_dl`` model
# Code here

# Step 5: Convert to NER
# Code here

# Define the pipeline
# Code here

``Pipelines`` are a Spark concept that we will revisit during our labs on Machine Learning. You will find the API similar to that of ``scikit-learn`` in that it follows the fit/transform structure. A pipeline lets you specify a sequence of steps, where each step consumes the output of the previous one.
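To make the fit/transform structure concrete, here is a minimal sketch in plain Python. The classes ``Lowercase``, ``VocabularyIndexer`` and ``ToyPipeline`` are hypothetical and are not part of Spark or ``spark-nlp``; they only imitate the contract. One simplification: the toy ``fit`` returns the pipeline itself, whereas in Spark fitting a ``Pipeline`` yields a separate fitted model object.

```python
# Toy illustration only: these classes are hypothetical and are NOT Spark's API.

class Lowercase:
    """A stateless stage: fit() learns nothing and returns the stage itself."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyIndexer:
    """A stateful stage: fit() learns a vocabulary, transform() applies it."""

    def __init__(self):
        self.vocab = {}

    def fit(self, rows):
        words = sorted({word for row in rows for word in row.split()})
        self.vocab = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, rows):
        # Words that were not seen during fit() are skipped.
        return [[self.vocab[w] for w in row.split() if w in self.vocab] for row in rows]


class ToyPipeline:
    """Chains stages: each stage is fitted on the output of the previous ones."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = ToyPipeline([Lowercase(), VocabularyIndexer()])
model = pipeline.fit(["Anna called Marco", "Marco sent money"])
encoded = model.transform(["Marco called Anna"])
print(encoded)
```

The learned vocabulary is alphabetical (``anna``, ``called``, ``marco``, ``money``, ``sent``), so the new sentence is encoded with the indices of those words.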

In [0]:
# We fit our model 
# Code here

# And transform the data
# Code here

In [0]:
# Code here ner_results...

Let us now inspect the results.

In [0]:
# Code here

To facilitate our analysis, we will use the ``path`` column to extract the date of the report and use the date to assign a report id. We will accomplish this by using a Window function. From there, we explode the array of structs (the result of the NER) and retrieve only the entity and the associated word.
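The three steps described above (extract the date, number the reports by date, explode the NER results) can be sketched in plain Python, independently of Spark. The file names, the person names and the ``(tag, word)`` layout of the NER output are assumptions made for illustration; only the directory and the date pattern follow the notebook.

```python
import re
from datetime import date

# Hypothetical file paths; in the notebook they come from the "path" column.
paths = [
    "dbfs:/FileStore/crime_letters/report_2023-03-14.pdf",
    "dbfs:/FileStore/crime_letters/report_2023-01-02.pdf",
]

# Same shape as the pattern used in the notebook: year, month, day.
DATE_PATTERN = re.compile(r"(\d{4})-(\d+)-(\d{2})")


def extract_date(path):
    """Return the date embedded in the path, or None if there is none."""
    match = DATE_PATTERN.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Equivalent of row_number() over a window ordered by date.
dated = sorted((extract_date(path), path) for path in paths)
report_ids = {path: index for index, (_, path) in enumerate(dated, start=1)}

# Hypothetical NER output per report: a list of (tag, word) pairs.
ner = {
    paths[0]: [("B-PER", "Anna"), ("I-PER", "Rossi")],
    paths[1]: [("B-PER", "Marco"), ("I-PER", "Bianchi"), ("O", "called")],
}

# Equivalent of explode: one output row per (report, token).
exploded = [
    (report_ids[path], tag, word)
    for _, path in dated
    for tag, word in ner[path]
]
print(exploded)
```

The earlier report (January) receives ``report_id`` 1 even though it appears second in the input list, which is the point of ordering by date.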

In [0]:
from pyspark.sql.window import Window

In [0]:
# 1. We extract the date from the file path
# 2. Get a report ID based on the date
# 3. Explode the NER array, keeping each token's position so that the order is reproducible
# 4. Extract the results (in a struct)
# 5. Number all tokens, ordered by report and by position within the report

reports_parsed = ner_results \
    .withColumn("date", F.to_date(F.regexp_extract(F.col("path"), r"(\d{4}-\d+-\d{2})", 1))) \
    .withColumn("report_id", F.row_number().over(Window.orderBy("date"))) \
    .select("report_id", F.posexplode("ner").alias("position", "ner_exploded")) \
    .withColumns({
        "result":  F.col("ner_exploded.result"), 
        "metadata": F.col("ner_exploded.metadata.word") 
    }
    ) \
    .withColumn("row_number", F.row_number().over(Window.orderBy("report_id", "position"))) \
    .select(F.col("result"), F.col("metadata"), F.col("report_id"), F.col("row_number"))

In [0]:
reports_parsed.limit(50).display()

We know that our agent always spells out the full name of the persons he follows (how convenient!). This allows us to do a clever join to get each person's full name: we join the DataFrame onto itself using the newly created column ``row_number``, where the left side corresponds to the first name and the right side to the last name. Since the last name directly follows the first name, the join condition pairs each row with its successor, i.e. the left ``row_number`` must equal the right ``row_number - 1``.
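The idea behind this self-join can be sketched in plain Python. The rows below are hypothetical, and the tags ``B-PER`` (beginning of a person name) and ``I-PER`` (continuation) are an assumption about the tagging scheme of the NER model; adapt them to what you see in ``reports_parsed``.

```python
# Hypothetical rows shaped like reports_parsed: (row_number, report_id, tag, word).
rows = [
    (1, 1, "B-PER", "Anna"),
    (2, 1, "I-PER", "Rossi"),
    (3, 1, "O", "met"),
    (4, 1, "B-PER", "Marco"),
    (5, 1, "I-PER", "Bianchi"),
]

# Index the "right side" of the join by its row number.
by_row_number = {row[0]: row for row in rows}

full_names = []
for row_number, report_id, tag, word in rows:
    # Join condition: left.row_number == right.row_number - 1
    successor = by_row_number.get(row_number + 1)
    if successor is None:
        continue
    _, next_report_id, next_tag, next_word = successor
    # Keep only first-name/last-name pairs from the same report.
    if tag == "B-PER" and next_tag == "I-PER" and next_report_id == report_id:
        full_names.append((report_id, word, next_word))

print(full_names)
```

Filtering on the tags of both sides is what prevents pairs such as ("Rossi", "met") from being treated as a name.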

In [0]:
#sub_network = reports_parsed.alias("df1")...

#display(sub_network)

We have now extracted a sub-network of the overall Mafia network. Our tasks are now two-fold:

1. Which is the individual with the highest influence within the sub-network (based on each relation type)?
2. Which is the individual with the highest influence within the overall network (based on each relation type)?

Both of these questions can be answered using network analysis!

In [0]:
from graphframes import GraphFrame

nodes = spark.read.csv("dbfs:/FileStore/mafia_nodes.csv", header=True)
edges = spark.read.csv("dbfs:/FileStore/mafia_edges.csv", header=True)


We will create an index column for the relation types.
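As a small illustration of what a dense rank over the relation type produces, here is a plain-Python sketch. The list of values is made up for the example, but the four relation types are the ones that appear in this dataset.

```python
# Example values for the relation_type column (order and repetition are made up).
relation_types = ["Called", "Threatened", "Called", "Sent Money", "Asked for Meeting"]

# Dense rank: distinct values are sorted and numbered 1, 2, 3, ... without gaps.
index_of = {
    value: rank
    for rank, value in enumerate(sorted(set(relation_types)), start=1)
}
relation_type_index = [index_of[value] for value in relation_types]

print(index_of)
print(relation_type_index)
```

Equal values always receive the same index, which is why both "Called" rows map to the same number.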

In [0]:
import pyspark.sql.types as tp

edges = edges \
    .withColumn("relation_type_index", F.dense_rank().over(Window.orderBy("relation_type"))) \
    .withColumn("weight", F.col("weight").cast(tp.IntegerType()))
edges.display()

Now we instantiate our complete graph.

In [0]:
mafia_graph = GraphFrame(nodes, edges)

In [0]:
# Let us inspect the graph
mafia_graph.vertices.show()
mafia_graph.edges.show()

In [0]:
# Which relation types occur in the graph?
mafia_graph.edges.select(F.col("relation_type")).distinct().show()

We now subset our overall ``nodes`` DataFrame to extract the sub-network only.

In [0]:
sub_network_nodes = mafia_graph.vertices \
    .join(sub_network, on=["First Name", "Last Name"], how="inner") \
    .dropDuplicates(["First Name", "Last Name"])

In [0]:
# mafia_graph_edges_sub_network = mafia_graph.edges...
mafia_graph_edges_sub_network.display()

In [0]:
# Get unique edges from subnetwork
# Code here 

# Filter network by subnetwork nodes
# Code here

In [0]:
mafia_subgraph = GraphFrame(sub_network_nodes, sub_network_df)

While ``graphframes`` offers a method to compute ``degree centrality``, its inventory is relatively limited. Since we are operating on different partitions of the graph (i.e. the subgraph induced by the mentioned people in the reports and the entire graph), we can use Spark's capabilities and parallelize the operations. To this end, we will make use of the ``networkx`` library, which offers a wealth of functions to work with graphs.

As we can see, there are four different types of edges:
- Asked for Meeting
- Threatened
- Sent Money
- Called

These edge types suggest that the graph is directed.

Let us now proceed to our actual network analysis. We will compute two network centrality measures here: (in-/out-)degree centrality and betweenness centrality.

- *Degree centrality* measures the importance of a node in a network based on its connections. In the context of in-degree centrality, this metric quantifies how many incoming connections a node has, reflecting its popularity or influence within the network. Conversely, out-degree centrality assesses the number of outgoing connections from a node, indicating its capacity to disseminate information or influence others. 

- *Betweenness centrality*, on the other hand, evaluates the extent to which a node serves as a bridge or intermediary between other nodes in the network. Nodes with high betweenness centrality often lie on many shortest paths between pairs of nodes, suggesting their critical role in maintaining connectivity and facilitating communication within the network.
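To make both definitions concrete, the following sketch computes them by hand on a tiny directed graph in plain Python. It is an illustration, not the ``networkx`` implementation: the example edges are made up, the degree values are normalized by n - 1 and the betweenness values by (n - 1)(n - 2), which are the usual conventions for directed graphs, and edge weights are ignored. The betweenness part follows the well-known Brandes approach and assumes there are no duplicate edges.

```python
from collections import defaultdict, deque

# Made-up directed edges (src, dst) without duplicates.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
nodes = sorted({node for edge in edges for node in edge})


def degree_centralities(nodes, edges):
    """Return (in-degree, out-degree) centrality, each normalized by n - 1."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    scale = 1.0 / (len(nodes) - 1)
    return (
        {node: in_degree[node] * scale for node in nodes},
        {node: out_degree[node] * scale for node in nodes},
    )


def betweenness_centrality(nodes, edges):
    """Unweighted betweenness for a directed graph, normalized by (n-1)(n-2)."""
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search: count shortest paths (sigma) and predecessors.
        order = []
        predecessors = {node: [] for node in nodes}
        sigma = dict.fromkeys(nodes, 0)
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes and accumulate path shares.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]

    n = len(nodes)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: value * scale for node, value in score.items()}


in_centrality, out_centrality = degree_centralities(nodes, edges)
betweenness = betweenness_centrality(nodes, edges)
print(in_centrality, out_centrality, betweenness)
```

In this example node ``c`` receives two of the three possible incoming edges and is the only node that lies on shortest paths between other nodes (from ``a`` to ``d`` and from ``b`` to ``d``), so it has the highest in-degree and the only non-zero betweenness.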

Both of these, among many others, are implemented in ``networkx``. To make use of this functionality, we use a trick we learned in a previous class: ``pandas`` UDFs. With ``groupby(...).applyInPandas(...)``, Spark hands each group of a DataFrame to a Python function as a ``pandas`` DataFrame, so inside the function we can operate on it with ordinary Python code.
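The grouped-apply pattern itself can be sketched without Spark or ``pandas``: split the rows by a key, call a function once per group, and concatenate the results. The helper ``apply_per_group`` and the example rows are hypothetical; in Spark the function receives a ``pandas`` DataFrame per group rather than a list of tuples.

```python
from collections import defaultdict

# Hypothetical edge rows: (relation_type, src, dst).
edge_rows = [
    ("Called", "1", "2"),
    ("Called", "2", "3"),
    ("Threatened", "3", "1"),
]


def count_edges(group_key, rows):
    """Per-group function: returns a list of output rows for one group."""
    return [(group_key, len(rows))]


def apply_per_group(rows, func):
    """Split rows by their first field, apply func to each group, concatenate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    result = []
    for key in sorted(groups):
        result.extend(func(key, groups[key]))
    return result


summary = apply_per_group(edge_rows, count_edges)
print(summary)
```

Because every group is processed independently, Spark can run the per-group function on different executors in parallel, which is what makes this pattern useful for running ``networkx`` once per relation type.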

In [0]:
import pandas as pd

output_schema_degree = tp.StructType([
    tp.StructField("relation_type", tp.StringType(), False),
    tp.StructField("node", tp.StringType(), False),
    tp.StructField("in_degree_centrality", tp.FloatType(), False),
    tp.StructField("out_degree_centrality", tp.FloatType(), False),
])

output_schema_betweenness = tp.StructType([
    tp.StructField("relation_type", tp.StringType(), False),
    tp.StructField("node", tp.StringType(), False),
    tp.StructField("betweenness_centrality", tp.FloatType(), False),
])

def nx_degree_centrality(pdf: pd.DataFrame) -> pd.DataFrame:
    # All rows of a group share the same relation_type, so we read it from the first row
    key = pdf["relation_type"].iloc[0]

    # Here we instantiate a networkx directed graph (DiGraph). create_using=nx.DiGraph keeps
    # the src -> dst direction; the default is an undirected Graph, and converting that to a
    # DiGraph afterwards would make in- and out-degree identical.
    graph = nx.from_pandas_edgelist(pdf, "src", "dst", edge_attr="weight", create_using=nx.DiGraph)
    in_degree_centralities = nx.in_degree_centrality(graph)
    out_degree_centralities = nx.out_degree_centrality(graph)
    nodes = list(graph.nodes)

    # Finally, we return a pd.DataFrame with one row per node
    return pd.DataFrame(
        {
            "relation_type": [key] * len(nodes),
            "node": nodes,
            "in_degree_centrality": [in_degree_centralities[n] for n in nodes],
            "out_degree_centrality": [out_degree_centralities[n] for n in nodes],
        }
    )

def nx_betweenness_centrality(pdf: pd.DataFrame) -> pd.DataFrame:
    # All rows of a group share the same relation_type, so we read it from the first row
    key = pdf["relation_type"].iloc[0]

    # Here we instantiate a networkx directed graph (DiGraph); create_using=nx.DiGraph keeps
    # the src -> dst direction. The weights are stored on the edges but are not used below,
    # because betweenness_centrality ignores them unless weight= is passed.
    graph = nx.from_pandas_edgelist(pdf, "src", "dst", edge_attr="weight", create_using=nx.DiGraph)
    betweenness_centrality = nx.betweenness_centrality(graph)

    # Finally, we return a pd.DataFrame with one row per node
    return pd.DataFrame(
        {
            "relation_type": [key] * len(betweenness_centrality),
            "node": list(betweenness_centrality.keys()),
            "betweenness_centrality": list(betweenness_centrality.values()),
        }
    )

We compute degree centrality and betweenness centrality for the subgraph.

In [0]:
# Summarize results for degree centrality
mafia_subgraph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()

In [0]:
# Summarize results for betweenness centrality
mafia_subgraph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_betweenness_centrality, output_schema_betweenness) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "betweenness_centrality", ascending=False) \
    .display()

We repeat this exercise for the entire graph.

In [0]:
# Summarize results for degree centrality
mafia_graph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()

In [0]:
# Summarize results for betweenness centrality
mafia_graph \
    .edges \
    .groupby("relation_type") \
    .applyInPandas(nx_betweenness_centrality, output_schema_betweenness) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "betweenness_centrality", ascending=False) \
    .display()

#### Bonus
If you would like to compute the overall influence, without regard to the relation type, you can use a *fictional* relation type by creating a column with a literal value, such as 1 or a string. Note that using ``networkx`` does not leverage the speed of Spark unless we partition our network in some way. Since our network has weights, we can group by source and destination and sum the weights, which automatically leaves exactly one edge per pair of nodes. We still need the fictional group key in order to use ``applyInPandas``.
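The aggregation step can be illustrated in plain Python: edges between the same pair of nodes are collapsed into one edge whose weight is the sum, and every edge receives the same fictional relation type. The example rows are made up.

```python
from collections import defaultdict

# Made-up edge rows: (src, dst, relation_type, weight).
edge_rows = [
    ("1", "2", "Called", 3),
    ("1", "2", "Sent Money", 2),
    ("2", "3", "Called", 1),
]

# Sum the weights per (src, dst) pair, ignoring the original relation type.
total_weight = defaultdict(int)
for src, dst, _relation_type, weight in edge_rows:
    total_weight[(src, dst)] += weight

# Attach the fictional relation type "1" so that a single group is formed.
collapsed = [
    (src, dst, "1", weight)
    for (src, dst), weight in sorted(total_weight.items())
]
print(collapsed)
```

The two edges from node ``1`` to node ``2`` become a single edge of weight 5, so each ordered pair of nodes appears only once in the result.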

In [0]:
mafia_graph \
    .edges \
    .withColumn("relation_type", F.lit("1")) \
    .groupby(["src", "dst", "relation_type"]) \
    .agg(F.sum("weight").alias("weight")) \
    .groupby("relation_type") \
    .applyInPandas(nx_degree_centrality, output_schema_degree) \
    .alias("degree_df") \
    .join(mafia_graph.vertices.alias("node_df"), F.col("degree_df.node") == F.col("node_df.id")) \
    .orderBy("relation_type", "in_degree_centrality", ascending=False) \
    .display()