In [None]:
# -----------------------------------------------------------
# Walk through this directory, showing how each file contributes
# to the overall pipeline.
# -----------------------------------------------------------

In [3]:
# For others to use this script, change this variable to the path to
# the root of your dssg-cfa folder.
ROUTETOROOTDIR = '/home/dssg-cfa/notebooks/dssg-cfa-public/'
IMPORTSCRIPTSDIR = ROUTETOROOTDIR + "util/py_files"
UTILDIR = ROUTETOROOTDIR + 'util'
JSONSDIR = ROUTETOROOTDIR + 'A_pdf_to_text/jsons_ke_gazettes/'
CSVTRAINDIR = ROUTETOROOTDIR + 'B_text_preprocessing/csv_outputs_train/'
CSVTESTDIR = ROUTETOROOTDIR + 'B_text_preprocessing/csv_outputs_test/'
import os
import json 
import matplotlib.pyplot as plt
import random
import numpy as np
from sklearn.cluster import KMeans

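# Return to the scripts directory before each import, in case a
# previously imported module changed the working directory.
os.chdir(IMPORTSCRIPTSDIR)
import setup
os.chdir(IMPORTSCRIPTSDIR)
import C_exportNERAPI
os.chdir(IMPORTSCRIPTSDIR)
import networkClasses
os.chdir(IMPORTSCRIPTSDIR)
import networkInfrastructure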

In the demonstration below, we show how to use NER output from a gazette to build a network. The PDF we are processing can be found [here](https://data.connectedafrica.net/entities/241300.cc2c2a9f7521d1ce81135cffde04cb83de9111e6#page=3); these links change over time, so if that one no longer works, try [here](https://data.connectedafrica.net/search?filter%3Acollection_id=18&limit=30&q=%2205-July-2019%22).

Let's pick up where we left off, by loading NER output for a single gazette segment.

In [4]:
C_exportNERAPI.getNEROutput(0)[0]

[('PERSON', 'Abdalla Mohamed Abdalla'),
 ('OWNER ADDRESS', 'P.O. Box 90145, Mombasa in the Republic of Kenya'),
 ('LAND SIZE', '0.0163 hectare'),
 ('ORG', 'Plot'),
 ('ORG', 'Mombasa/Block XVI/598'),
 ('LOC', ', in'),
 ('DISTRICT', 'Mombasa'),
 ('DEED STATUS', 'lost'),
 ('DATE', '60) days')]

This pulls out a lot more than our regexes did, but it also produces some false positives: for example, 'Plot' is tagged as an ORG and the fragment ', in' as a LOC.
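One way to deal with fragments like these is a small heuristic filter applied before building the network. The sketch below is only an illustration: `drop_probable_false_positives` and the `GENERIC_TERMS` stop-list are hypothetical names that do not exist in this repository, and the rules target only the two fragment cases visible in the output above.

```python
import re

# Hypothetical stop-list of generic words that get mis-tagged as entities.
# This is an assumption for illustration, not part of the project's pipeline.
GENERIC_TERMS = {"plot"}


def drop_probable_false_positives(entities):
    """Drop ORG/LOC entities that look like fragments rather than names."""
    kept = []
    for label, text in entities:
        cleaned = text.strip(" ,.;:")
        if label in {"ORG", "LOC"}:
            # A generic word on its own is unlikely to be a real organization.
            if cleaned.lower() in GENERIC_TERMS:
                continue
            # Real names usually contain a capital letter or a digit.
            if not re.search(r"[A-Z0-9]", cleaned):
                continue
        kept.append((label, text))
    return kept


# The NER output shown above, copied in so the example is self-contained.
sample = [
    ('PERSON', 'Abdalla Mohamed Abdalla'),
    ('OWNER ADDRESS', 'P.O. Box 90145, Mombasa in the Republic of Kenya'),
    ('LAND SIZE', '0.0163 hectare'),
    ('ORG', 'Plot'),
    ('ORG', 'Mombasa/Block XVI/598'),
    ('LOC', ', in'),
    ('DISTRICT', 'Mombasa'),
    ('DEED STATUS', 'lost'),
    ('DATE', '60) days'),
]

filtered = drop_probable_false_positives(sample)
for entity in filtered:
    print(entity)
```

A filter like this trades recall for precision, so it is worth checking what it removes on a larger sample before relying on it.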

Next, let's group these entities into objects that actually make sense. First we need to collect all the data in one place...

In [5]:
data = networkClasses.getAllDataOneGazette(0)[0]
data

([('PERSON', 'Abdalla Mohamed Abdalla'),
  ('OWNER ADDRESS', 'P.O. Box 90145, Mombasa in the Republic of Kenya'),
  ('LAND SIZE', '0.0163 hectare'),
  ('ORG', 'Plot'),
  ('ORG', 'Mombasa/Block XVI/598'),
  ('LOC', ', in'),
  ('DISTRICT', 'Mombasa'),
  ('DEED STATUS', 'lost'),
  ('DATE', '60) days')],
 name                                           Abdalla Mohamed Abdalla
 address               P.O. Box 90145, Mombasa in the Republic of Kenya
 land size                                               0.0163 hectare
 district                                                           NaN
 title number                                                       NaN
 plot number                                                        NaN
 LR number                                                          NaN
 grant number                                                       NaN
 signator                                                 J. G. WANJOHI
 signator role                                    

Then we can group all these varied data points into node, edge, and signator objects.

In [6]:
objects = networkClasses.processNERSegment(data[0],data[1])
networkClasses.printResults(objects)

Person node: 
Name: Abdalla Mohamed Abdalla.
Address: P.O. Box 90145, Mombasa in the Republic of Kenya.
District: ['Mombasa'].

Org node: 
Name: Mombasa/Block XVI/598.
Address: P.O. Box 90145, Mombasa in the Republic of Kenya.
District: ['Mombasa'].

Org node: 
Name: Plot.
Address: P.O. Box 90145, Mombasa in the Republic of Kenya.
District: ['Mombasa'].

Land node: 
Location: , in.
Size: 0.0163 hectare.
District: Mombasa.

Edge between person or organization and land.
Deed Status: ['lost'].
Date of Announcement: 5th July, 2019.
MR Number: MR/6508092.

Signator object. 
Name: J. G. WANJOHI
Location: Mombasa
Role:  Registrar of Titles



Those extra ORG classifications from the spaCy output are certainly frustrating, but hopefully the general way in which we create a few objects out of many entities makes sense. This is all shown in greater detail in networkClasses.ipynb.
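To make the idea of collapsing many tagged entities into a few objects concrete, here is a minimal sketch. It is not the implementation in networkClasses: the `PersonNode` and `LandNode` dataclasses and the `group_entities` function are hypothetical names, and the rule that every person in a segment shares the segment's address and districts is an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonNode:
    # Hypothetical container; the real classes live in networkClasses.
    name: str
    address: Optional[str] = None
    districts: List[str] = field(default_factory=list)


@dataclass
class LandNode:
    # Hypothetical container for a parcel of land.
    size: Optional[str] = None
    district: Optional[str] = None


def first_or_none(values):
    """Return the first item of a list, or None if the list is empty."""
    return values[0] if values else None


def group_entities(entities):
    """Bucket (label, text) pairs by label, then build a few objects."""
    by_label = defaultdict(list)
    for label, text in entities:
        by_label[label].append(text)

    address = first_or_none(by_label["OWNER ADDRESS"])
    districts = by_label["DISTRICT"]
    # Assumption: each person in the segment shares its address and districts.
    people = [PersonNode(name, address, list(districts))
              for name in by_label["PERSON"]]
    land = LandNode(size=first_or_none(by_label["LAND SIZE"]),
                    district=first_or_none(districts))
    return people, land


segment = [
    ('PERSON', 'Abdalla Mohamed Abdalla'),
    ('OWNER ADDRESS', 'P.O. Box 90145, Mombasa in the Republic of Kenya'),
    ('LAND SIZE', '0.0163 hectare'),
    ('DISTRICT', 'Mombasa'),
    ('DEED STATUS', 'lost'),
]

people, land = group_entities(segment)
print(people[0])
print(land)
```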

Next, we will generate a whole set of entities and connections from our single gazette and put them into a CSV file for you to explore with your favorite network visualization tool. We can easily expand to more gazettes just by running them through the rest of the pipeline.

In [7]:
%%capture
# %%capture (above) suppresses this cell's verbose output.
networkInfrastructure.saveGraph(sizeSample = 1, districtEdges = True, addressEdges = True)
# districtEdges: draw a very weak connection between entities within the same district
# addressEdges: draw a medium-strength connection between entities with the same address
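
The district and address options can be pictured as adding weighted edges between every pair of entities that share an attribute value. The sketch below is an assumption about the general technique, not the code inside networkInfrastructure: the function name, the numeric weights, and the placeholder nodes are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed weights: the notebook only says "very weak" and "medium strength".
DISTRICT_WEIGHT = 0.1
ADDRESS_WEIGHT = 0.5


def shared_attribute_edges(nodes, attribute, weight):
    """Connect every pair of nodes that share the same attribute value."""
    groups = defaultdict(list)
    for node in nodes:
        value = node.get(attribute)
        if value:  # skip nodes where the attribute is missing or empty
            groups[value].append(node["name"])

    edges = []
    for names in groups.values():
        for a, b in combinations(sorted(names), 2):
            edges.append((a, b, weight, attribute))
    return edges


# Placeholder nodes for illustration only; these are not real gazette data.
nodes = [
    {"name": "Entity A", "address": "P.O. Box 1", "district": "Mombasa"},
    {"name": "Entity B", "address": "P.O. Box 1", "district": "Mombasa"},
    {"name": "Entity C", "address": "P.O. Box 2", "district": "Mombasa"},
]

edges = (shared_attribute_edges(nodes, "address", ADDRESS_WEIGHT)
         + shared_attribute_edges(nodes, "district", DISTRICT_WEIGHT))
for edge in edges:
    print(edge)
```

Because the number of pairs grows quadratically with the size of each group, district edges in particular can dominate the graph when many entities fall in the same district.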