In [None]:
# -----------------------------------------------------------
# Walk through this directory, showing how each file contributes
# to the overall pipeline.
# -----------------------------------------------------------

In [2]:
# For others to use this script, change this variable to
# the route to the root of your dssg-cfa folder.
ROUTETOROOTDIR = '/home/dssg-cfa/notebooks/dssg-cfa-public/'
IMPORTSCRIPTSDIR = ROUTETOROOTDIR + "util/py_files"
UTILDIR = ROUTETOROOTDIR + 'util'
JSONSDIR = ROUTETOROOTDIR + 'A_pdf_to_text/jsons_ke_gazettes/'
CSVTRAINDIR = ROUTETOROOTDIR + 'B_text_preprocessing/csv_outputs_train/'
CSVTESTDIR = ROUTETOROOTDIR + 'B_text_preprocessing/csv_outputs_test/'
import os
import json 
import matplotlib.pyplot as plt
import random
import numpy as np
from sklearn.cluster import KMeans

os.chdir(IMPORTSCRIPTSDIR)
import setup
os.chdir(IMPORTSCRIPTSDIR)
import C_exportNERAPI
os.chdir(IMPORTSCRIPTSDIR)
import networkClasses
os.chdir(IMPORTSCRIPTSDIR)
import networkInfrastructure

In the demonstration below, we show how to use NER output from a gazette to build a network. The PDF we are processing can be found [here](https://data.connectedafrica.net/entities/241300.cc2c2a9f7521d1ce81135cffde04cb83de9111e6#page=3). These links change, so if that one fails, try [this search](https://data.connectedafrica.net/search?filter%3Acollection_id=18&limit=30&q=%2205-July-2019%22).

Let's pick up where we left off, by loading NER output for a single gazette segment.

In [3]:
C_exportNERAPI.getNEROutput(0)[0]

[('CARDINAL', '1'),
 ('PERSON', 'Hussein Mahfudh Jeizan'),
 ('CARDINAL', '2'),
 ('PERSON', 'Mohammed Mahfudh Jeizan'),
 ('PERSON', 'Mahfudh Ahmed Jeizan'),
 ('OWNER ADDRESS', 'P.O. Box 4321–00506, Nairobi in the Republic of Kenya'),
 ('LAND REGISTRATION', 'L.R. number 209/11092/77'),
 ('LOC', 'in the city of Nairobi'),
 ('LOC', 'the Nairobi Area'),
 ('CARDINAL', '77900/1'),
 ('DEED STATUS', 'lost'),
 ('DATE', '60) days')]

This pulls out a lot more than our regexes did, but it also includes some false positives.
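To make the false positives easier to spot, it can help to bucket the tuples by label. The following is a minimal sketch and not part of the pipeline: `ner_output` is copied by hand from the output above rather than loaded through `C_exportNERAPI`, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Copied by hand from the NER output shown above.
ner_output = [
    ('CARDINAL', '1'),
    ('PERSON', 'Hussein Mahfudh Jeizan'),
    ('CARDINAL', '2'),
    ('PERSON', 'Mohammed Mahfudh Jeizan'),
    ('PERSON', 'Mahfudh Ahmed Jeizan'),
    ('OWNER ADDRESS', 'P.O. Box 4321–00506, Nairobi in the Republic of Kenya'),
    ('LAND REGISTRATION', 'L.R. number 209/11092/77'),
    ('LOC', 'in the city of Nairobi'),
    ('LOC', 'the Nairobi Area'),
    ('CARDINAL', '77900/1'),
    ('DEED STATUS', 'lost'),
    ('DATE', '60) days'),
]


def group_by_label(entities):
    """Hypothetical helper: map each NER label to the texts tagged with it."""
    grouped = defaultdict(list)
    for label, text in entities:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(ner_output)
for label, texts in grouped.items():
    print(label, texts)
```

Grouped this way, the `DATE` bucket holds only `'60) days'`, which looks like a fragment of a notice period rather than a calendar date, and the `CARDINAL` bucket mixes what appear to be list numbers (`'1'`, `'2'`) with `'77900/1'`.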

Next, let's group these entities into objects that actually make sense. First we need to collect all the data in one place...

In [4]:
data = networkClasses.getAllDataOneGazette(0)[0]
data

([('CARDINAL', '1'),
  ('PERSON', 'Hussein Mahfudh Jeizan'),
  ('CARDINAL', '2'),
  ('PERSON', 'Mohammed Mahfudh Jeizan'),
  ('PERSON', 'Mahfudh Ahmed Jeizan'),
  ('OWNER ADDRESS', 'P.O. Box 4321–00506, Nairobi in the Republic of Kenya'),
  ('LAND REGISTRATION', 'L.R. number 209/11092/77'),
  ('LOC', 'in the city of Nairobi'),
  ('LOC', 'the Nairobi Area'),
  ('CARDINAL', '77900/1'),
  ('DEED STATUS', 'lost'),
  ('DATE', '60) days')],
 name                 (1) Hussein Mahfudh Jeizan and (2) Mohammed Ma...
 address              P.O. Box 4321–00506, Nairobi in the Republic o...
 land size                                                          NaN
 district                       the city of Nairobi in the Nairobi Area
 title number                                                       NaN
 plot number                                                        NaN
 LR number                                        L.R. No. 209/11092/77
 grant number                                            

Then we can group all these varied data points into entities.

In [5]:
objects = networkClasses.processNERSegment(data[0],data[1])
networkClasses.printResults(objects)

Person node: 
Name: Mahfudh Ahmed Jeizan.
Address: P.O. Box 4321–00506, Nairobi in the Republic of Kenya.
District: ['Nairobi'].

Person node: 
Name: Mohammed Mahfudh Jeizan.
Address: P.O. Box 4321–00506, Nairobi in the Republic of Kenya.
District: ['Nairobi'].

Person node: 
Name: Hussein Mahfudh Jeizan.
Address: P.O. Box 4321–00506, Nairobi in the Republic of Kenya.
District: ['Nairobi'].

Land node: 
Land Registration: L.R. number 209/11092/77.
Location: the Nairobi Area.

Edge between person or organization and land.
Deed Status: ['lost'].
Date of Announcement: 6th October, 2017.
MR Number: MR/3567528.

Signator object. 
Name: G. M. MUYANGA
Location: Nairobi
Role:  Registrar of Titles



Those extra org classifications from the spaCy output certainly are frustrating, but hopefully the general way in which we create a few objects out of many entities makes sense. This is all shown in great detail in networkClasses.ipynb.
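The idea of collapsing many entities into a few objects can be sketched as follows. This is an illustration only, not the code in networkClasses: `PersonNode`, `LandNode`, `texts_with_label`, and `build_nodes` are hypothetical names, and the grouping rule (every PERSON shares the segment's OWNER ADDRESS, and the land takes the last LOC) is an assumption inferred from the printed results above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PersonNode:
    name: str
    address: Optional[str] = None


@dataclass
class LandNode:
    registration: str
    location: Optional[str] = None


def texts_with_label(entities: List[Tuple[str, str]], label: str) -> List[str]:
    """Return the text of every entity tagged with `label`, in order."""
    return [text for tag, text in entities if tag == label]


def build_nodes(entities: List[Tuple[str, str]]):
    """Assumed rule: people share the first address; land gets the last LOC."""
    addresses = texts_with_label(entities, 'OWNER ADDRESS')
    locations = texts_with_label(entities, 'LOC')
    address = addresses[0] if addresses else None
    location = locations[-1] if locations else None
    people = [PersonNode(name, address)
              for name in texts_with_label(entities, 'PERSON')]
    land = [LandNode(registration, location)
            for registration in texts_with_label(entities, 'LAND REGISTRATION')]
    return people, land


# Entities copied by hand from the NER output for this segment.
segment = [
    ('CARDINAL', '1'),
    ('PERSON', 'Hussein Mahfudh Jeizan'),
    ('CARDINAL', '2'),
    ('PERSON', 'Mohammed Mahfudh Jeizan'),
    ('PERSON', 'Mahfudh Ahmed Jeizan'),
    ('OWNER ADDRESS', 'P.O. Box 4321–00506, Nairobi in the Republic of Kenya'),
    ('LAND REGISTRATION', 'L.R. number 209/11092/77'),
    ('LOC', 'in the city of Nairobi'),
    ('LOC', 'the Nairobi Area'),
    ('CARDINAL', '77900/1'),
    ('DEED STATUS', 'lost'),
    ('DATE', '60) days'),
]

people, land = build_nodes(segment)
for node in people + land:
    print(node)
```

In this sketch the `CARDINAL` and `DATE` tuples are simply ignored, which is one way twelve tuples shrink to four nodes.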

Next, we will generate a whole set of entities and connections from our single gazette and put them into a CSV file for you to explore with your favorite network visualization tool. We can easily expand to more gazettes just by running them through the rest of the pipeline.

In [6]:
%%capture
# %%capture above suppresses this cell's output.
networkInfrastructure.saveGraph(gazetteSelection = [0], districtEdges = True, addressEdges = True)
# district edges: draw a very weak connection between entities within the same district
# address edges: draw a medium strength connection between entities with the same address
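As a rough illustration of what the two edge options mean, the sketch below links every pair of people who share an address or a district and writes the result as CSV. It does not reproduce `saveGraph`: the helper `shared_attribute_edges`, the numeric weights (0.5 for address, 0.1 for district), and the CSV column names are all assumptions chosen for the example.

```python
import csv
import io
from itertools import combinations

# People taken from the printed results above: (name, address, district).
people = [
    ('Mahfudh Ahmed Jeizan',
     'P.O. Box 4321–00506, Nairobi in the Republic of Kenya', 'Nairobi'),
    ('Mohammed Mahfudh Jeizan',
     'P.O. Box 4321–00506, Nairobi in the Republic of Kenya', 'Nairobi'),
    ('Hussein Mahfudh Jeizan',
     'P.O. Box 4321–00506, Nairobi in the Republic of Kenya', 'Nairobi'),
]

ADDRESS_WEIGHT = 0.5   # assumed value for a "medium strength" connection
DISTRICT_WEIGHT = 0.1  # assumed value for a "very weak" connection


def shared_attribute_edges(people, index, kind, weight):
    """Hypothetical helper: one edge per pair whose field `index` matches."""
    edges = []
    for a, b in combinations(people, 2):
        if a[index] and a[index] == b[index]:
            edges.append((a[0], b[0], kind, weight))
    return edges


edges = (shared_attribute_edges(people, 1, 'address', ADDRESS_WEIGHT)
         + shared_attribute_edges(people, 2, 'district', DISTRICT_WEIGHT))

# Written to an in-memory buffer here; use open('edges.csv', 'w', newline='')
# instead to save a file for a visualization tool.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['source', 'target', 'kind', 'weight'])
writer.writerows(edges)
print(buffer.getvalue())
```

Because this approach connects every pair within a district, the number of district edges grows quadratically with the number of entities in that district, so a low weight keeps them from dominating the graph.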

We can also generate a graph from all 10 of the gazettes available to us.

In [7]:
%%capture
networkInfrastructure.saveGraph(gazetteSelection = list(range(0,10)), districtEdges = True, addressEdges = True)