<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/matrix/MatrixNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Updated to GDS 2.0 version
* Link to original blog post: https://towardsdatascience.com/construct-the-matrix-interaction-network-based-on-the-movie-script-738b4fa9b46d

In [None]:
!sudo apt install tesseract-ocr
!sudo apt-get install poppler-utils 
# Install dependencies
!pip install -U selenium neo4j pytesseract pdf2image spacy
!python -m spacy download en_core_web_sm


In [None]:
# Setup selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on PATH
wd = webdriver.Chrome(options=chrome_options)



# Construct the Matrix interaction network based on the movie script
## Combine web scraping, OCR, and entity recognition to construct and analyze the Matrix interaction network in Neo4j
Christmas is just around the corner, and with it comes the newest Matrix movie. I can't think of a better way to wait for its release than to perform a network analysis of the first Matrix movie.
## Agenda
This blog post shows how to combine web scraping, OCR, and NLP techniques to construct the Matrix interaction network:
* Scrape the Matrix Fandom page with Selenium
* Use PyTesseract to read the Matrix movie script PDF
* Extract the characters in each scene with spaCy's rule-based matcher
* Construct and analyze the character co-occurrence network in Neo4j

I have already performed a similar analysis based on the Harry Potter book, and this time we will be using the Matrix movie script.

## Scraping Matrix Fandom page with Selenium
We will begin by scraping the Matrix Fandom page to get the list of characters that appeared in the movie. As mentioned, we will be using the Selenium library to achieve this. The content of the fandom page is available under the CC BY 4.0 license.
In the first step, we will extract the names and the links of characters that appeared in the first Matrix movie.

In [None]:
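from selenium.webdriver.common.by import By

wikifan_url = "https://matrix.fandom.com/wiki/Category:Characters_in_The_Matrix"

member_list = []
wd.get(wikifan_url)
# find_elements(By...) replaces the find_elements_by_* helpers dropped in newer Selenium 4 releases
members = wd.find_elements(By.CLASS_NAME, "category-page__member-link")
for m in members:
  member_list.append({'url':m.get_attribute('href'), 'name': m.text})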

# Manually append Trinity
member_list.append({'url': 'https://matrix.fandom.com/wiki/Trinity', 'name': 'Trinity'})



Just because we can, we will also extract detailed information from the characters' personal pages.

In [None]:
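from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

for m in member_list:
  wd.get(m['url'])
  # Each infobox row ("pi-data") holds a label (h3) and a value (div)
  elements = wd.find_elements(By.CLASS_NAME, "pi-data")
  for e in elements:
    try:
      label = e.find_element(By.TAG_NAME, "h3")
      value = e.find_element(By.TAG_NAME, "div")
      m[label.text] = value.text
    except NoSuchElementException:
      # Skip rows that lack a label or a value
      pass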

  



Before continuing, we will store the character information into Neo4j. If you are using the Colab notebook, then it would be easiest to create either a free Neo4j Sandbox or free Aura database instance to store the results.
Once you have created the Sandbox or the Aura environment, simply copy the connection details into the notebook.

In [None]:
from neo4j import GraphDatabase
# Change the host and user/password combination to your neo4j
# Will not work with a localhost bolt url
host = 'bolt://44.200.249.124:7687'
user = 'neo4j'
password = 'battle-manpower-sand'
driver = GraphDatabase.driver(host,auth=(user, password))
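
If you would rather not paste credentials into a shared notebook, one option is to read them from environment variables. The following is only a minimal sketch: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_neo4j_config` are illustrative assumptions, not something the driver requires.

```python
import os

def load_neo4j_config(env=os.environ):
    """Return (uri, user, password) from an environment-style mapping.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    uri = env.get("NEO4J_URI")
    password = env.get("NEO4J_PASSWORD")
    if not uri or not password:
        raise ValueError("Set NEO4J_URI and NEO4J_PASSWORD before connecting")
    user = env.get("NEO4J_USER", "neo4j")
    return uri, user, password

# Example with an explicit mapping instead of the real environment
example_config = load_neo4j_config({
    "NEO4J_URI": "bolt://example.invalid:7687",
    "NEO4J_PASSWORD": "change-me",
})
print(example_config)
```

The returned values can then be passed on in the same way as above, with `GraphDatabase.driver(uri, auth=(user, password))`.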

Now that you have defined the connection to your Neo4j instance, you can go ahead and import the characters' information.

In [None]:
entity_query = """
UNWIND $data as row
CREATE (c:Character)
SET c += row
"""
with driver.session() as session:
  session.run(entity_query, {'data': member_list})

At this moment, there are no connections in the database, just lonely and isolated nodes. If you wanted to, you could refactor some of the node properties such as Spouse to a relationship. However, we will skip this part and move on to constructing a co-occurrence network based on the movie script.
## Using PyTesseract to read the Matrix movie script PDF
The movie script is available on the Daily Script web page in PDF format. While no explicit license is stated, the web page says that the scripts are available for educational purposes, so we are good to go.
We will use the PyTesseract library to transform the PDF into a text format.

In [None]:
import requests
import pdf2image
import pytesseract

pdf_link = "https://www.dailyscript.com/scripts/the_matrix.pdf"

pdf = requests.get(pdf_link)
doc = pdf2image.convert_from_bytes(pdf.content)

# OCR every page except the title page and collect the text
article = []
for page_number, page_data in enumerate(doc):
    # First page is the title
    if page_number == 0:
        continue
    # image_to_string already returns a str, so no encode/decode round trip is needed
    article.append(pytesseract.image_to_string(page_data, lang='eng'))
article_txt = " ".join(article)

This process takes around 15 minutes, so you can use this time to take a break and perhaps stretch your legs.
Before moving on to the character extraction step, we will perform a simple text cleanup and split the script by scenes.

In [None]:
# A bit of cleaning: drop the running headers and the "CONTINUED" markers
article_clean_txt = "\n".join(
    line for line in article_txt.split("\n")
    if "THE MATRIX" not in line and "CONTINUED" not in line
)

In [None]:
# Optionally store to file
with open('/matrix_script.txt', 'w') as writefile:
    writefile.write(article_clean_txt)

from google.colab import files
files.download('/matrix_script.txt')


In [None]:
#Split by scenes
scenes = []
single_scene = []
for line in article_clean_txt.split("\n"):
  # If empty line
  if not line or line.startswith("OMITTED"):
    continue
  if line.startswith("INT.") or line.startswith("EXT.") or line.startswith("THE END"):
    scenes.append(("\n").join(single_scene))
    single_scene = []
  single_scene.append(line)

scene_names = [el.split("\n")[0] for el in scenes]
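
To make the splitting logic concrete, here is a small self-contained variant that runs on a made-up script fragment. It follows the same heading rules (`INT.`, `EXT.`, `THE END`), but unlike the cell above it skips an empty leading chunk and keeps a trailing scene even when `THE END` is missing. The function name `split_scenes` and the sample text are assumptions for illustration only.

```python
def split_scenes(text):
    """Split script text into scenes that each start at an INT./EXT. heading."""
    scenes, current = [], []
    for line in text.split("\n"):
        if not line or line.startswith("OMITTED"):
            continue
        if line.startswith(("INT.", "EXT.", "THE END")):
            if current:  # avoid storing an empty scene
                scenes.append("\n".join(current))
            current = []
            if line.startswith("THE END"):
                continue
        current.append(line)
    if current:  # keep the last scene if the script never reaches THE END
        scenes.append("\n".join(current))
    return scenes

# Made-up fragment, not taken from the real script
sample_script = """INT. OFFICE - DAY
NEO
Hello.

EXT. ROOFTOP - NIGHT
Trinity runs.
THE END"""

toy_scenes = split_scenes(sample_script)
toy_scene_names = [s.split("\n")[0] for s in toy_scenes]
print(toy_scene_names)
```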

Now that we have preprocessed the text, we will go ahead and identify all the characters that appear in a particular scene. Since we already know which characters to expect from our fandom scraping process, we'll use spaCy's rule-based matcher to identify characters.
A character name can appear in two forms. First, if the character is talking, their name is uppercased in the text. The second form is the title-cased version, used when a character is mentioned by another character or in the scene description. spaCy makes it really easy to describe these two patterns.
We will also omit the word "The" from the pattern definition. For example, the fandom page contains the character The Oracle, so we skip the word "The" and search only for the Oracle pattern.
The following code constructs the spaCy matcher object used in the next step to identify characters.

In [None]:
def get_matcher_patterns(name):
  """Build spaCy rule-based patterns (title-cased and uppercased) from a name"""
  matcher_pattern = []
  # Strip only a leading "The " so names that merely contain "The" stay intact
  clean_name = name[len("The "):] if name.startswith("The ") else name
  parts_of_name = clean_name.strip().split(" ")
  # Append the title-cased version
  matcher_pattern.append([{"LOWER": n.lower(), "IS_TITLE": True} for n in parts_of_name])
  # Append the uppercased version
  matcher_pattern.append([{"LOWER": n.lower(), "IS_UPPER": True} for n in parts_of_name])
  return matcher_pattern
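
For intuition about what these two patterns accept, the following pure-Python sketch mimics the checks on whitespace-separated tokens. It is only an approximation: spaCy applies its own tokenizer, and the assumption here is that its `IS_TITLE` and `IS_UPPER` flags behave like Python's `str.istitle()` and `str.isupper()`. The helper `name_in_text` is hypothetical.

```python
def name_in_text(name, text):
    """Rough stand-in for the matcher: find `name` in title case or upper case."""
    parts = [p.lower() for p in name.split(" ")]
    tokens = text.split()
    for i in range(len(tokens) - len(parts) + 1):
        window = tokens[i:i + len(parts)]
        same_words = [t.lower() for t in window] == parts
        all_title = all(t.istitle() for t in window)
        all_upper = all(t.isupper() for t in window)
        # Every token must be title-cased, or every token must be uppercased
        if same_words and (all_title or all_upper):
            return True
    return False

print(name_in_text("Agent Smith", "AGENT SMITH enters the room"))  # uppercased form
print(name_in_text("Agent Smith", "He looks at Agent Smith"))      # title-cased form
print(name_in_text("Agent Smith", "an agent smith of sorts"))      # lowercase form
```

Under the same assumption, a mixed-case name such as FedEx Man would fail both checks, because `"FedEx".istitle()` is `False` in Python. That may be one reason some scraped names never match.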

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

for m in member_list:
  matcher.add(m['name'], get_matcher_patterns(m['name']))

We have the text and the entity matcher ready. We will iterate over the scenes, identify all the characters that appear in each one, and store the results directly in Neo4j in one step.

In [None]:
# The Cypher statement is the same for every scene, so define it once
entity_query = """
    MERGE (s:Scene {id: $scene_id})
    SET s.title = $scene_title
    WITH s
    UNWIND $characters as char
    MATCH (c:Character)
    WHERE toLower(c.name) = CASE WHEN NOT char IN ["oracle", "priestess"] THEN char ELSE "the " + char END
    MERGE (c)-[:IN_SCENE]->(s)
    """

for scene_id, (s, sn) in enumerate(zip(scenes, scene_names)):
  characters = set()
  doc = nlp(s.replace("\n", " "))

  # Collect the lower-cased text of every matched character name
  for match_id, start, end in matcher(doc):
    characters.add(str(doc[start:end]).lower())

  with driver.session() as session:
    session.run(entity_query, {'scene_id': scene_id, 'scene_title': sn, 'characters': list(characters)})




That's how easy it is to extract information from the movie script and store the output into Neo4j.
Now we will move on to the network analysis part of this blog. First, we will list the characters that didn't appear in any scene.

In [None]:
import pandas as pd

def read_query(query, params=None):
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

In [None]:
read_query("""
MATCH (n:Character) WHERE NOT (n)--()
RETURN n.name
""")

| | n.name |
|---|---|
| 0 | Austin |
| 1 | Crawford |
| 2 | FedEx Man |
| 3 | Garcia |
| 4 | Green |
| 5 | Kim |
| 6 | Law Enforcement |
| 7 | Nguyen |
| 8 | S.W.A.T. |
| 9 | Security Guards |


Twelve characters weren't identified in any scene. I don't remember any of them except the Woman in Red. She is probably referred to as the red woman or something similar in the script, so our pattern matcher didn't identify her. Of course, we could fine-tune the patterns to cover these kinds of exceptions, but overall, it seems that the main characters were identified.
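
One way to handle such exceptions is to keep a small alias table next to the scraped names and generate extra patterns from it. The sketch below only builds the pattern lists as plain Python data, in the same token-pattern format used earlier. The alias `woman in the red dress` is a guess at how the script might phrase it, not something verified against the text, and `alias_patterns` is a hypothetical helper.

```python
def alias_patterns(aliases):
    """Map each canonical name to case-insensitive token patterns for its aliases."""
    patterns = {}
    for name, phrases in aliases.items():
        patterns[name] = [
            [{"LOWER": word.lower()} for word in phrase.split()]
            for phrase in phrases
        ]
    return patterns

# Hypothetical alias; check the script for the wording actually used
extra = alias_patterns({"Woman in Red": ["woman in the red dress"]})
print(extra["Woman in Red"])
```

Each entry could then be registered with `matcher.add(name, patterns)`. Because the import loop above stores the matched text rather than the pattern key, an alias match would also need to be mapped back to the canonical name, for example via `nlp.vocab.strings[match_id]`.
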
Next, we will examine the characters that appeared in most scenes.

In [None]:
read_query("""
MATCH (n:Character)
RETURN n.name AS character,
       size([(n)-[:IN_SCENE]->() | 1]) AS scenes
ORDER BY scenes DESC
LIMIT 5
""")

| | character | scenes |
|---|---|---|
| 0 | Neo | 115 |
| 1 | Morpheus | 87 |
| 2 | Trinity | 77 |
| 3 | Tank | 58 |
| 4 | Agent Smith | 36 |


Nothing shocking here. Neo appeared in more than half of the scenes, followed by Morpheus and Trinity. Only Emil Eifrem is surprisingly missing from this list.
We can define a co-occurrence event as a pair of characters appearing in the same scene. In this context, co-occurrence can also be understood as an interaction: the more scenes a pair of characters shares, the more they interacted in the movie.
We can evaluate which characters interacted the most by executing the following Cypher statement.

In [None]:
read_query("""
MATCH (n1:Character)-[:IN_SCENE]->()<-[:IN_SCENE]-(n2:Character)
WHERE id(n1) < id(n2)
RETURN n1.name AS character1,
       n2.name AS character2,
       count(*) AS count
ORDER BY count DESC
LIMIT 5
""")

| | character1 | character2 | count |
|---|---|---|---|
| 0 | Morpheus | Neo | 59 |
| 1 | Neo | Trinity | 55 |
| 2 | Morpheus | Trinity | 34 |
| 3 | Neo | Tank | 26 |
| 4 | Morpheus | Tank | 26 |
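
The same counting can be sketched in plain Python, which may help when checking the Cypher result on a small sample. This is an illustration on toy data, not part of the original pipeline; `count_cooccurrences` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(scene_characters):
    """Count, for every unordered pair, the number of scenes both appear in."""
    pair_counts = Counter()
    for characters in scene_characters:
        # sorted() fixes the order within a pair, much like id(n1) < id(n2) in Cypher
        for pair in combinations(sorted(set(characters)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Toy scenes, not the real script
toy = [
    {"Neo", "Morpheus", "Trinity"},
    {"Neo", "Trinity"},
    {"Neo"},
]
counts = count_cooccurrences(toy)
print(counts.most_common(1))
```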


Most interactions occurred between Neo, Morpheus, Trinity, and Tank. If you watched the movie, this all makes sense.
Lastly, we can infer the co-occurrence network between characters and perform a network analysis of it. We will simply count the number of interactions between a pair of characters and store the information as a relationship.

In [None]:
read_query("""
MATCH (n1:Character)-[:IN_SCENE]->()<-[:IN_SCENE]-(n2:Character)
WHERE id(n1) < id(n2)
WITH n1, n2, count(*) AS count
MERGE (n1)-[r:INTERACTS]-(n2)
SET r.weight = count
""")

## Graph Data Science library
Neo4j features a Graph Data Science library with more than 50 graph algorithms spanning categories such as centrality, community detection, and node embedding.
We will use PageRank to evaluate node importance and Louvain to determine the community structure of the inferred co-occurrence network. Instead of inspecting each algorithm's result separately, we will store the results and construct a network visualization that shows both node importance and community structure.
First, we have to project an in-memory graph in order to be able to execute graph algorithms on it. Notice that we project the co-occurrence relationship as undirected. For example, if Neo interacted with Trinity, this directly implies that Trinity also interacted with Neo.

In [None]:
read_query("""
CALL gds.graph.project("matrix", "Character", {INTERACTS: {orientation:"UNDIRECTED", properties:"weight"}})
""")

| | nodeProjection | relationshipProjection | graphName | nodeCount | relationshipCount | createMillis |
|---|---|---|---|---|---|---|
| 0 | {'Character': {'properties': {}, 'label': 'Cha... | {'INTERACTS': {'orientation': 'UNDIRECTED', 'a... | matrix | 33 | 176 | 55 |


Now we can go ahead and execute the weighted PageRank algorithm and store the results back to Neo4j.

In [None]:
read_query("""
CALL gds.pageRank.write("matrix", {relationshipWeightProperty:"weight", writeProperty:"pagerank"})
""")

| | writeMillis | nodePropertiesWritten | ranIterations | didConverge | centralityDistribution | postProcessingMillis | createMillis | computeMillis | configuration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 33 | 20 | False | {'p99': 3.6682729721069336, 'min': 0.149999618... | 29 | 0 | 278 | {'maxIterations': 20, 'writeConcurrency': 4, '... |
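
To get a feel for what a weighted PageRank computes, here is a simplified power-iteration sketch on a toy undirected graph. It is not the GDS implementation. It assumes a damping factor of 0.85 and the unnormalized form `score = (1 - d) + d * incoming`, under which a node without any relationships ends up at about 0.15, which is consistent with the `min` value in the output above. The function name and toy data are made up for illustration.

```python
def weighted_pagerank(edges, nodes, damping=0.85, iterations=20):
    """Power-iteration PageRank on an undirected weighted graph.

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    """
    # Build a symmetric adjacency map so every edge counts in both directions
    neighbors = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    # Total weight of the relationships attached to each node
    strength = {n: sum(nbrs.values()) for n, nbrs in neighbors.items()}

    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbor passes on its score in proportion to the edge weight
            incoming = sum(
                scores[m] * w / strength[m] for m, w in neighbors[n].items()
            )
            new_scores[n] = (1.0 - damping) + damping * incoming
        scores = new_scores
    return scores

# Toy weights, not the real co-occurrence counts
toy_edges = {("Neo", "Morpheus"): 3, ("Neo", "Trinity"): 3, ("Morpheus", "Trinity"): 1}
toy_scores = weighted_pagerank(toy_edges, ["Neo", "Morpheus", "Trinity", "Extra"])
print(max(toy_scores, key=toy_scores.get))
```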


Lastly, we execute the weighted Louvain algorithm to deduce the community structure and store the results in the database.

In [None]:
read_query("""
CALL gds.louvain.write("matrix", {relationshipWeightProperty:"weight", writeProperty:"louvain"})
""")

| | writeMillis | nodePropertiesWritten | modularity | modularities | ranLevels | communityCount | communityDistribution | postProcessingMillis | createMillis | computeMillis | configuration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 33 | 0.08213 | [0.07350871374383017, 0.08213048458616334] | 2 | 16 | {'p99': 11, 'min': 1, 'max': 11, 'mean': 2.062... | 2 | 0 | 893 | {'maxIterations': 10, 'writeConcurrency': 4, '... |
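
The `modularity` column is the quantity Louvain tries to increase when it groups nodes into communities. As a rough illustration of how that score is defined for a weighted, undirected graph, the sketch below evaluates two partitions of a toy graph. It uses the standard modularity formula and is not taken from the GDS source; the function name and data are assumptions.

```python
def modularity(edges, community):
    """Weighted modularity of a partition of an undirected graph.

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    total_weight = sum(edges.values())
    internal = {}   # edge weight that stays inside each community
    degree = {}     # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )

# Two triangles joined by one edge, all weights 1 (toy data)
toy_graph = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {n: 0 for n in split}
print(round(modularity(toy_graph, split), 4), round(modularity(toy_graph, merged), 4))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one community, which is the kind of improvement Louvain searches for.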


Open Neo4j Bloom to visualize the results.