# Introduction

This notebook builds on the previous two.  In this one, we will populate our database from Wikidata.  To do so, we will start with the same kind of starter text from Wikipedia.  We will use `spacy` to get the named entities and then use those to query Wikidata using a bot called `Pywikibot`.  (See the `README.md` for information on how to create the token you will need for this bot.)

The steps we will follow below are:

1. Get Wikipedia entry for the target
2. Use `spacy` to identify the named entities
3. Use `spacy` to clean the text of the named entities
4. Get the Q-codes for all entities (subjects)
5. For a given list of claims (verbs) associated with the subjects, get all targets (objects)
6. Connect to Neo4j
7. Get all P31 claims (_"instance of"_) for all nodes to create node labels
8. Add the nodes and properties to the graph
9. Add the edges to the graph

Once these steps are completed, we will then move to the next notebook where we will do some basic data science / machine learning.

## Note on the use of Google Colab with `pywikibot`

As of the September 2021 running of this course, there is no good way to configure the `user-config.py` file for `pywikibot` on Google Colab.  If you are going to run this notebook, it is recommended that you do so in your own installation of Jupyter instead.
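For a local installation, the following is a minimal sketch of how the configuration file could be created from Python.  `family`, `mylang`, and `usernames` are standard `pywikibot` settings and `PYWIKIBOT_DIR` is the environment variable it checks for the config directory; the account name `ExampleBot` and the temporary directory are placeholders, so treat this as an illustration rather than the course's official setup.

```python
import os
import tempfile
from pathlib import Path

# Placeholder values: replace the user name with your own bot account.
CONFIG_TEXT = """\
family = 'wikidata'
mylang = 'wikidata'
usernames['wikidata']['wikidata'] = 'ExampleBot'
"""


def write_user_config(directory):
    """Write a minimal user-config.py into `directory` and return its path."""
    path = Path(directory) / "user-config.py"
    path.write_text(CONFIG_TEXT, encoding="utf-8")
    return path


config_dir = tempfile.mkdtemp(prefix="pywikibot_")
config_path = write_user_config(config_dir)

# pywikibot reads PYWIKIBOT_DIR when it is imported, so set it
# before `import pywikibot` runs.
os.environ["PYWIKIBOT_DIR"] = config_dir
print(config_path)
```

In practice you would point `config_dir` at a permanent folder instead of a temporary one.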

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install neo4j
!pip install Wikipedia
!pip install pywikibot



In [None]:
import spacy
!python -m spacy download en_core_web_md
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
VERBS = ['ROOT', 'advcl']
OBJECTS = ["dobj", "dative", "attr", "oprd", 'pobj']
ENTITY_LABELS = ['PERSON', 'NORP', 'GPE', 'ORG', 'FAC', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']

#api_key = open('.api_key').read()

non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     42.8/42.8 MB 24.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [None]:
%matplotlib inline

import json
import pprint  # module import: prettyPrint() below uses pprint.PrettyPrinter
import re
import time
import urllib

from tqdm import tqdm

from neo4j import GraphDatabase

import numpy as np
import pandas as pd
import wikipedia

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

from pywikibot.data import api
import pywikibot

print(spacy.__version__)

3.5.2


In [None]:
non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'merge_noun_chunks']


# We start by getting the first 10,000 characters of the Wikipedia page for our target search term, Missouri

In [None]:
# text= wikipedia.summary('Missouri')
page = wikipedia.page("Missouri")
text = page.content[0:10000]
doc = nlp(text)
text

'Missouri is a state in the Midwestern region of the United States. Ranking 21st in land area, it is bordered by eight states (tied for the most with Tennessee): Iowa to the north, Illinois, Kentucky and Tennessee to the east, Arkansas to the south and Oklahoma, Kansas, and Nebraska to the west. In the south are the Ozarks, a forested highland, providing timber, minerals, and recreation. The Missouri River, after which the state is named, flows through the center into the Mississippi River, which makes up the eastern border. With more than six million residents, it is the 19th-most populous state of the country. The largest urban areas are St. Louis, Kansas City, Springfield, and Columbia; the capital is Jefferson City.\nHumans have inhabited what is now Missouri for at least 12,000 years. The Mississippian culture, which emerged at least in the ninth century, built cities and mounds before declining in the 14th century. When European explorers arrived in the 17th century, they encount

# We can use `displacy` to visualize the named entities in the text, which will be the starting nodes for our graph.

Note that you will see some obvious errors below, but the named entity recognition (NER) algorithm in `spacy` is still well suited for this task.

In [None]:
spacy.displacy.render(doc, style='ent', jupyter = True)

# Let's review some of the detected entities

In [None]:
ent_ignore_ls = ['DATE']
ner_list = []

for el in doc.ents:
    if el.label_ not in ent_ignore_ls:
        #print(el, el.label_)
        if el.text not in ner_list:
            ner_list.append(el.text)

ner_list[0:5]

['Missouri', 'the United States', 'Tennessee', 'Iowa', 'Illinois']
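The loop above keeps the first occurrence of each entity text and skips `DATE` entities.  The same filter can be sketched on plain tuples, without loading a spaCy model.  The `(text, label)` pairs below are hand-written stand-ins for `(el.text, el.label_)`, and `unique_entity_texts` is a hypothetical helper name.

```python
def unique_entity_texts(entities, ignore_labels=("DATE",)):
    """Return entity texts in first-seen order, skipping ignored labels."""
    kept = (text for text, label in entities if label not in ignore_labels)
    # dict.fromkeys keeps insertion order and drops repeats in one pass.
    return list(dict.fromkeys(kept))


sample_entities = [
    ("Missouri", "GPE"),
    ("the United States", "GPE"),
    ("12,000 years", "DATE"),
    ("Missouri", "GPE"),
    ("Tennessee", "GPE"),
]
print(unique_entity_texts(sample_entities))
# ['Missouri', 'the United States', 'Tennessee']
```

Using a dict for the membership check avoids rescanning the list for every entity, which starts to matter on longer pages.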

# Text cleaning

Some of the entities will still contain dirty text, so we remove special characters and stop words from them.  By the end of the next two cells, you can see what our remaining list of named entities is.  This will be our starter list for querying Wikidata.

In [None]:
def remove_special_characters(text):
    
    regex = re.compile(r'[\n\r\t]')
    clean_text = regex.sub(" ", text)
    
    return clean_text


def remove_stop_words_and_punct(text, print_text=False):
    
    result_ls = []
    rsw_doc = non_nc(text)
    
    for token in rsw_doc:
        if print_text:
            print(token, token.is_stop)
            print('--------------')
        if not token.is_stop and not token.is_punct and not token.is_space:
            result_ls.append(str(token))
    
    result_str = ' '.join(result_ls)

    return result_str
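`remove_special_characters` swaps each newline, carriage return, or tab for a single space, so consecutive ones leave a run of spaces; the stop-word pass then drops them because it skips whitespace tokens.  If you ever need the cleaning step on its own, a variant that collapses whitespace directly could look like the sketch below (`collapse_whitespace` is a hypothetical helper, not part of the notebook).

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace any run of whitespace (newlines, tabs, spaces) with one space."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(collapse_whitespace("Ste.\n\tGenevieve  County\r\n"))
# Ste. Genevieve County
```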

In [None]:
node_text_ls = []

for el in ner_list:
    clean_text = remove_special_characters(el)
    no_sw = remove_stop_words_and_punct(clean_text)
    if no_sw not in node_text_ls and no_sw != '':
        node_text_ls.append(no_sw)

node_text_ls

['Missouri',
 'United States',
 'Tennessee',
 'Iowa',
 'Illinois',
 'Kentucky',
 'Arkansas',
 'Oklahoma',
 'Kansas',
 'Nebraska',
 'Ozarks',
 'Missouri River',
 'Mississippi River',
 'St. Louis',
 'Kansas City',
 'Springfield',
 'Columbia',
 'Jefferson City',
 'French',
 'Louisiana',
 'Ste',
 'Genevieve',
 'Louisiana Purchase',
 'Americans',
 'Upland South',
 'new Missouri Territory',
 'Virginia',
 'Mid Missouri',
 'Missouri Rhineland',
 'Gateway Arch',
 'Pony Express',
 'Oregon Trail',
 'Santa Fe Trail',
 'California Trail',
 'American Civil War',
 'Greater St. Louis',
 'Midwestern Southern United States',
 'U.S.',
 'Anheuser Busch',
 'Lake Ozarks',
 'Table Rock Lake',
 'Branson',
 'known Missourians',
 'Chuck Berry',
 'Sheryl Crow',
 'Walt Disney',
 'Edwin Hubble',
 'Nelly',
 'Brad Pitt',
 'Harry S. Truman',
 'Mark Twain',
 'Cerner Express Scripts',
 'Monsanto',
 'Emerson Electric',
 'Edward Jones',
 'H&R Block',
 'Wells Fargo Advisors',
 'Centene Corporation',
 "O'Reilly Auto Parts"

## These are some helper functions for interfacing with Wikidata

We are also establishing the bot connection to the site here.

In [None]:
def getItems(site, itemtitle):
    params = { 'action' :'wbsearchentities' , 'format' : 'json' , 'language' : 'en', 'type' : 'item', 'search': itemtitle}
    request = api.Request(site=site,**params)
    return request.submit()

def getItem(site, wdItem, token):
    request = api.Request(site=site,
                          action='wbgetentities',
                          format='json',
                          ids=wdItem)    
    return request.submit()

def prettyPrint(variable):
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(variable)

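# Connect to Wikidata.  `wikidata` and `site` refer to the same site:
# `wikidata` is used with ItemPage below, `site` with the search helper.
# Note: `token` is not used by the read-only requests in this notebook.
token = '90f1b25025597ff727ed414a8a3e2ac7644f3f47+\''
wikidata = pywikibot.Site('wikidata', 'wikidata')
site = wikidata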
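`getItems` returns the decoded JSON from the `wbsearchentities` action, in which the hits sit in a list under the `search` key and each hit carries an `id`.  Here is a small sketch of pulling out the top hit without relying on exception handling.  The sample responses are hand-written and trimmed to the fields used, not real API output, and `first_q_code` is a hypothetical helper.

```python
def first_q_code(response):
    """Return the Q-code of the top search hit, or None if there are no hits."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


# Hand-written stand-ins for the dict returned by getItems(); only the
# fields used here are included.
found = {"search": [{"id": "Q1581", "label": "Missouri"}], "success": 1}
empty = {"search": [], "success": 1}

print(first_q_code(found), first_q_code(empty))
# Q1581 None
```

Taking the first hit is a heuristic: an ambiguous name such as "Springfield" may resolve to an entity other than the one the text meant.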

# Confirmation that we are able to connect the bot to Wikidata

In [None]:
itempage = pywikibot.ItemPage(wikidata, "Q1581") 
itempage

ItemPage('Q1581')

# Now we are going to start scraping Wikidata with our bot

First, we are going to take all of our named entities and identify them in Wikidata.  This is done by matching each entity to a Wikidata Q-code, which is what Wikidata uses to index all entities.  As you will see, not all of the entities are found in Wikidata, likely because modifiers precede the actual entity in the text (ex: _present day_ St. Louis).  But we will still be OK. :)

In [None]:
item_ls = []
i = 0

for el in node_text_ls:
    #itempage = pywikibot.ItemPage(wikidata, el)
    #print(el, itempage)
    wikidataEntries = getItems(site, el)
    try:
        tup = (wikidataEntries['search'][0]['id'], el)
        item_ls.append(tup)
    except (KeyError, IndexError):  # no 'search' key or an empty hit list
        i += 1
        print('Missing ', i,'th entry for ', el)
    #item_ls.append(tup)
    
dedup_item_ls = []

for item in item_ls:
    if item not in dedup_item_ls:
        dedup_item_ls.append(item)
        
dedup_item_ls



Missing  1 th entry for  new Missouri Territory
Missing  2 th entry for  Midwestern Southern United States
Missing  3 th entry for  Lake Ozarks
Missing  4 th entry for  known Missourians
Missing  5 th entry for  Cerner Express Scripts
Missing  6 th entry for  Cave State
Missing  7 th entry for  indigenous Missouria
Missing  8 th entry for  mih ZUR ee
Missing  9 th entry for  mih ZUR ə
Missing  10 th entry for  Congressman Willard Vandiver
Missing  11 th entry for  Lead State
Missing  12 th entry for  Bullion State
Missing  13 th entry for  present day St. Louis
Missing  14 th entry for  present day Collinsville
Missing  15 th entry for  Gulf Mexico
Missing  16 th entry for  1400 CE
Missing  17 th entry for  Missouri Indians
Missing  18 th entry for  lower Missouri Valley
Missing  19 th entry for  des Illinois
Missing  20 th entry for  Middle Mississippi Valley
Missing  21 th entry for  La Haute Louisiane
Missing  22 th entry for  High Louisiana
Missing  23 th entry for  ethnic French C

[('Q1581', 'Missouri'),
 ('Q30', 'United States'),
 ('Q1509', 'Tennessee'),
 ('Q1546', 'Iowa'),
 ('Q1204', 'Illinois'),
 ('Q1603', 'Kentucky'),
 ('Q1612', 'Arkansas'),
 ('Q1649', 'Oklahoma'),
 ('Q1558', 'Kansas'),
 ('Q1553', 'Nebraska'),
 ('Q1321468', 'Ozarks'),
 ('Q5419', 'Missouri River'),
 ('Q1497', 'Mississippi River'),
 ('Q38022', 'St. Louis'),
 ('Q41819', 'Kansas City'),
 ('Q28515', 'Springfield'),
 ('Q49088', 'Columbia'),
 ('Q28180', 'Jefferson City'),
 ('Q150', 'French'),
 ('Q1588', 'Louisiana'),
 ('Q303288', 'Ste'),
 ('Q235863', 'Genevieve'),
 ('Q193155', 'Louisiana Purchase'),
 ('Q846570', 'Americans'),
 ('Q1399638', 'Upland South'),
 ('Q1370', 'Virginia'),
 ('Q6840760', 'Mid Missouri'),
 ('Q6879615', 'Missouri Rhineland'),
 ('Q2027162', 'Gateway Arch'),
 ('Q859130', 'Pony Express'),
 ('Q862312', 'Oregon Trail'),
 ('Q1856887', 'Santa Fe Trail'),
 ('Q1026956', 'California Trail'),
 ('Q8676', 'American Civil War'),
 ('Q944269', 'Greater St. Louis'),
 ('Q30', 'U.S.'),
 ('Q125074

# How do we get the verbs?

In Wikidata, these are called "claims" or "statements" and are indexed through the P-value.  There are literally thousands of different P values.  I have gone through and identified a series that I thought might be particularly interesting for this dataset.  This list should absolutely be customized to the application/graph.
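To make the subject-verb-object idea concrete, here is a sketch with plain dicts standing in for the Wikidata objects.  The Q- and P-codes are taken from the Missouri rows of the DataFrame this notebook produces; `claims_to_triples` and the dict layout are illustrative assumptions, not the `pywikibot` data model.

```python
# Readable names for the P-codes we care about (a subset, for illustration).
wanted = {"P17": "country", "P36": "capital"}

# Stand-in for one item's claims: P-code -> list of target Q-codes.
claims = {
    "P17": ["Q30"],
    "P30": ["Q49"],
    "P31": ["Q35657"],
    "P36": ["Q28180"],
}


def claims_to_triples(subject_q, claims, wanted):
    """Yield (subject, P-code, relation name, object) for wanted claims only."""
    for p_code, rel_name in wanted.items():
        for target_q in claims.get(p_code, []):
            yield (subject_q, p_code, rel_name, target_q)


triples = list(claims_to_triples("Q1581", claims, wanted))
print(triples)
# [('Q1581', 'P17', 'country', 'Q30'), ('Q1581', 'P36', 'capital', 'Q28180')]
```

Claims whose P-code is not in the lookup dict (here `P30` and `P31`) are simply skipped, which is how the choice of P-codes shapes the resulting graph.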

### Note

This process can take several minutes, depending on the size of your starter list and the amount of traffic hitting Wikidata at any given time.  You might even hit timeout errors.  They will eventually resolve themselves.  Grab a cup of coffee.  For the Missouri entity list used here, the run recorded below took about 7 minutes.
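If you would rather control retries explicitly than wait for timeouts to clear, one common pattern is exponential backoff.  The sketch below is generic Python, not a `pywikibot` feature: `with_retries` is a hypothetical helper, it catches any `Exception` because the exact error types you will see are not specified here, and the flaky function only simulates a timeout.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake call that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
# ok [1.0, 2.0]
```

In the notebook you could wrap a single fetch, for example `with_retries(lambda: itempage.get())`, so one slow item does not abort the whole loop.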

In [None]:
%%time
p_dc = {'P17': 'country',
        'P19': 'place_of_birth',
        'P27': 'country_of_citizenship',
        'P30': 'continent',
        'P31': 'instance_of',
        'P35': 'head_of_state',
        'P36': 'capital',
        'P37': 'official_language',
        'P39': 'position_held',
        'P69': 'educated_at',
        'P101': 'field_of_work',
        'P102': 'member_of_political_party',
        'P150': 'contains_administrative_territorial_entity',
        'P159': 'headquarters_location',
        'P166': 'award_received',
        'P172': 'ethnic_group',
        'P361': 'part_of',
        'P463': 'member_of',
        'P551': 'residence',
        'P607': 'conflict',
        'P793': 'significant_event',
        'P1344': 'participated_in',
        'P1813': 'short_name',
        'P2670': 'has_parts_of_the_class'
       }

full_node_tup_ls = []

for el in tqdm(item_ls):
    itempage = pywikibot.ItemPage(wikidata, el[0])
    itemdata = itempage.get()
    source_node = itemdata['labels']['en']
    #print(el, source_node)

    for key in p_dc.keys():
        #print(source_node, key, p_dc[key])
        #print(itemdata['claims'])
        try:
            for i in itemdata['claims'][key]:
                target = i.getTarget()
                #print(target.id)
                tup = (source_node, el[0], key, p_dc[key], target.labels['en'], target.id)
                if tup not in full_node_tup_ls:
                    full_node_tup_ls.append(tup)
        except Exception:  # e.g. claim absent, non-item target, or no English label
            continue

#full_node_tup_ls

100%|██████████| 91/91 [06:58<00:00,  4.60s/it]

CPU times: user 57.9 s, sys: 972 ms, total: 58.8 s
Wall time: 6min 58s





In [None]:
df = pd.DataFrame(full_node_tup_ls, columns=['source_name', 'source_q', 'rel_p', 'rel_name', 'target_name', 'target_q'])
df.head()

|   | source_name | source_q | rel_p | rel_name | target_name | target_q |
|---|---|---|---|---|---|---|
| 0 | Missouri | Q1581 | P17 | country | United States of America | Q30 |
| 1 | Missouri | Q1581 | P30 | continent | North America | Q49 |
| 2 | Missouri | Q1581 | P31 | instance_of | U.S. state | Q35657 |
| 3 | Missouri | Q1581 | P36 | capital | Jefferson City | Q28180 |
| 4 | Missouri | Q1581 | P150 | contains_administrative_territorial_entity | Jackson County | Q127238 |


In [None]:
df.shape

(1900, 6)

# Connecting to Neo4j

As before, we will connect to Neo4j with the usual class.  We will also set up a uniqueness constraint on the node `id` property (the Wikidata Q-value), since this has many potential benefits, particularly as the graph gets larger.

In [None]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [None]:
# If you are using a Sandbox instance, you will want to use the following (commented) line.  
# If you are using a Docker container for your DB, use the uncommented line.
# conn = Neo4jConnection(uri="bolt://some_ip_address:7687", user="neo4j", pwd="some_password")

conn = Neo4jConnection(uri="bolt://44.204.202.139:7687", user="neo4j", pwd="system-energizers-example")

In [None]:
conn.query('CREATE CONSTRAINT q_value IF NOT EXISTS FOR (n:Node) REQUIRE n.id IS UNIQUE')

[]

In [None]:
source_df = df[['source_name', 'source_q']].drop_duplicates()
source_df.columns = ['name', 'id']
target_df = df[['target_name', 'target_q']].drop_duplicates()
target_df.columns = ['name', 'id']
all_nodes_df = pd.concat([source_df, target_df]).drop_duplicates()
all_nodes_df.shape

(1755, 2)

# Some helper functions

The functions below are responsible for populating the graph.  To make the graph richer, we want to give each node a descriptive label.  We will use the Wikidata claim _"instance of"_ (P31) for this.  So, for example, Missouri is an instance of a "U.S. state" whereas the United States is an instance of a "sovereign state."

In [None]:
def get_p31(row):
    # P31 corresponds to "instance of"
    
    itempage = pywikibot.ItemPage(wikidata, row)
    itemdata = itempage.get()
    try:
        target = itemdata['claims']['P31'][0].getTarget()
        target.get()
        return target.labels['en']
    except Exception:  # no P31 claim, or the target has no English label
        return 'Unknown'
    

def add_nodes(rows, batch_size=10000):
    # Adds entity nodes to the Neo4j graph as a batch job.

    query = '''UNWIND $rows AS row
               MERGE (:Node {name: row.name, id: row.id, type: row.node_label})
               RETURN count(*) as total
    '''
    return insert_data(query, rows, batch_size)


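def add_edges(rows, batch_size=50000):
    # Adds relationships between existing nodes as a batch job.
    # The relationship type is formatted into the query text, so `rows` is
    # expected to hold a single rel_name value (see the loop that calls this).
    edge = rows['rel_name'].iloc[0]

    # There is no RETURN clause, so insert_data() reports a total of 0 for edges.
    query = """UNWIND $rows AS row
               MATCH (src:Node {id: row.source_q}), (tar:Node {id: row.target_q})
               CREATE (src)-[:%s]->(tar)
    """ % edge

    return insert_data(query, rows, batch_size)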


def insert_data(query, rows, batch_size = 10000):
    # Handles updating the Neo4j database in batch mode.

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):

        res = conn.query(query, parameters={'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        try:
            total += res[0]['total']
        except Exception:  # query failed (res is None) or returned no 'total'
            total += 0
        batch += 1
        result = {"total":total, "batches":batch, "time":time.time()-start}
        print(result)

    return result
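The batching in `insert_data` comes down to slicing the rows into consecutive windows of `batch_size`.  The sketch below reproduces only that arithmetic, with no database involved (`batch_slices` is a hypothetical helper), to show which index ranges each query receives.

```python
def batch_slices(n_rows, batch_size):
    """Return the (start, stop) index pairs that insert_data would slice."""
    slices = []
    batch = 0
    while batch * batch_size < n_rows:
        start = batch * batch_size
        slices.append((start, min(start + batch_size, n_rows)))
        batch += 1
    return slices


print(batch_slices(1755, 10000))
# [(0, 1755)]
print(batch_slices(25, 10))
# [(0, 10), (10, 20), (20, 25)]
```

With 1,755 nodes and the default batch size of 10,000, everything fits in a single batch, which is why the node insert reports `'batches': 1`.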

In [None]:
%%time
all_nodes_df['node_label'] = all_nodes_df['id'].map(get_p31)
all_nodes_df.head()

CPU times: user 38.7 s, sys: 925 ms, total: 39.7 s
Wall time: 7min 37s


Unnamed: 0,name,id,node_label
0,Missouri,Q1581,U.S. state
120,United States of America,Q30,sovereign state
319,Tennessee,Q1509,U.S. state
419,Iowa,Q1546,U.S. state
525,Illinois,Q1204,U.S. state


In [None]:
add_nodes(all_nodes_df)

{'total': 1755, 'batches': 1, 'time': 0.6407902240753174}


In [None]:
edge_ls = df['rel_name'].unique().tolist()
#edge_ls
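`add_edges` formats the relationship type into the query text instead of passing it as a parameter.  The names in `p_dc` are plain identifiers, so that is safe here, but if you extend the list with your own names it may be worth checking them first.  A minimal sketch of such a check follows; `checked_rel_type` is a hypothetical helper and the allowed pattern is an assumption.

```python
import re

_SAFE_REL_TYPE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked_rel_type(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _SAFE_REL_TYPE.fullmatch(name):
        raise ValueError(f"unsafe relationship type: {name!r}")
    return name


print(checked_rel_type("place_of_birth"))
# place_of_birth
```

A name containing spaces or brackets would raise `ValueError` before any query text is built.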

In [None]:
for edge in edge_ls:
    print(edge)
    y = df[df['rel_name'] == edge]
    #print(y.shape)
    add_edges(y)

country
{'total': 0, 'batches': 1, 'time': 0.13343501091003418}
continent
{'total': 0, 'batches': 1, 'time': 0.09743428230285645}
instance_of
{'total': 0, 'batches': 1, 'time': 0.12682390213012695}
capital
{'total': 0, 'batches': 1, 'time': 0.08441805839538574}
contains_administrative_territorial_entity
{'total': 0, 'batches': 1, 'time': 0.6976070404052734}
part_of
{'total': 0, 'batches': 1, 'time': 0.0970618724822998}
head_of_state
{'total': 0, 'batches': 1, 'time': 0.09447908401489258}
official_language
{'total': 0, 'batches': 1, 'time': 0.09770059585571289}
ethnic_group
{'total': 0, 'batches': 1, 'time': 0.0352015495300293}
member_of
{'total': 0, 'batches': 1, 'time': 0.07689619064331055}
significant_event
{'total': 0, 'batches': 1, 'time': 0.09228730201721191}
participated_in
{'total': 0, 'batches': 1, 'time': 0.10011029243469238}
headquarters_location
{'total': 0, 'batches': 1, 'time': 0.09417200088500977}
place_of_birth
{'total': 0, 'batches': 1, 'time': 0.09899067878723145}
coun

In [None]:
y = all_nodes_df['node_label'].value_counts()
print(y[0:5])

county of Kentucky    120
county of Missouri    114
county of Kansas      105
county of Illinois    102
county of Iowa         99
Name: node_label, dtype: int64


# Conclusion

At this point we have populated our database.  You should get 1312 nodes and 1622 relationships (once deduping in Cypher is completed and all nodes are attributed to the proper labels determined by P31).  We will do some things in Cypher (see `cypher_queries/queries.cql` and follow along with the "Method 2" section).  Once those are done, we can proceed to the final notebook where we will show how to do some basic ML on the graph.