In [1]:
# -----------------------------------------------------------
# Walk through this directory, showing how each file contributes
# to the overall pipeline.
# -----------------------------------------------------------

In [2]:
# For others to use this script, change this variable to
# the path to the root of your dssg-cfa folder.
ROUTETOROOTDIR = '/home/dssg-cfa/notebooks/dssg-cfa-public/'
IMPORTSCRIPTSDIR = ROUTETOROOTDIR + "util/py_files"
EXPORTDATADIR1 = ROUTETOROOTDIR + 'B_text_preproessing/csv_outputs/'
UTILDIR = ROUTETOROOTDIR + 'util'
JSONSDIR = ROUTETOROOTDIR + 'A_pdf_to_text/jsons_ke_gazettes/'
import os
import json 
import matplotlib.pyplot as plt
import random
import numpy as np
from sklearn.cluster import KMeans

os.chdir(IMPORTSCRIPTSDIR)
import setup
os.chdir(IMPORTSCRIPTSDIR)
import orderingText
import readingJsonsBulk
import retoolingSegmentation
import trainingDataForSpaCy

In the demonstration below, we show how to preprocess text from one page of a Kenya Gazette. The PDF we are processing can be found [here](https://data.connectedafrica.net/entities/241300.cc2c2a9f7521d1ce81135cffde04cb83de9111e6#page=3). These links change over time, so if that one no longer works, try searching [here](https://data.connectedafrica.net/search?filter%3Acollection_id=18&limit=30&q=%2205-July-2019%22).

Let's pick up where we left off, by loading text from a gazette json:

In [3]:
gazetteNum, pageNum = 0, 2
jsonDict = orderingText.readJsonIntoDict(JSONSDIR, "gazette-ke-vol-cxxi-no-85-dated-05-july-2019")
text = orderingText.readPage(jsonDict, pageNum)
print(text[0:1000])

GAZETTE NOTICE NO. 5866 
THE LAND REGISTRATION ACT 
(No. 3 of 2012) 
ISSUE OF A PROVISIONAL CERTIFICATE 
WHEREAS Abdalla Mohamed Abdalla, of P.O. Box 90145, Mombasa in the Republic of Kenya, is registered as proprietor in fee simple of all that piece of land containing 0.0163 hectare or thereabouts, known as Plot No. Mombasa/Block XVI/598, situate, in Mombasa District, and whereas sufficient evidence has been adduced to show that the said certificate of title has been lost, notice is given that after the expiration of sixty (60) days from the date hereof, I shall ssue a provisional certificate of title provided that no objection has been received within that period. 
Dated the 5th July, 2019. 
J. G. WANJOHI, 
MR/6508092 Registrar of Titles, Mombasa. GAZETTE NOTICE NO. 5867 
THE LAND REGISTRATION ACT 
(No. 3 of 2012) 
ISSUE OF A PROVISIONAL CERTIFICATE 
WHEREAS Fatma Eric Edward Barallon, of P.O. Box 1851- 80200, Malindi in the Republic of Kenya, is registered as proprietor lessee from 

Next, we want to split this text into its separate segments. This is done in retoolingSegmentation.

In [4]:
segments = retoolingSegmentation.getSegments(text)
segments

[<retoolingSegmentation.Segment at 0x7fa638f008d0>,
 <retoolingSegmentation.Segment at 0x7fa669ed42d0>,
 <retoolingSegmentation.Segment at 0x7fa669ed4690>,
 <retoolingSegmentation.Segment at 0x7fa669ed4390>,
 <retoolingSegmentation.Segment at 0x7fa669ed4090>,
 <retoolingSegmentation.Segment at 0x7fa638f00c50>,
 <retoolingSegmentation.Segment at 0x7fa63b252f10>]

This output isn't very helpful: it only shows the default representation of each object. Note also that we capture only seven segments here, when there appear to be eight on the page. Let's see how to get more information about these segments, and spot the bug along the way.

A segment is an object with many attributes (see orderingText.ipynb for more detail). One of them is text.

In [5]:
print(segments[0].text)

GAZETTE NOTICE NO. 5866 
THE LAND REGISTRATION ACT 
(No. 3 of 2012) 
ISSUE OF A PROVISIONAL CERTIFICATE 
WHEREAS Abdalla Mohamed Abdalla, of P.O. Box 90145, Mombasa in the Republic of Kenya, is registered as proprietor in fee simple of all that piece of land containing 0.0163 hectare or thereabouts, known as Plot No. Mombasa/Block XVI/598, situate, in Mombasa District, and whereas sufficient evidence has been adduced to show that the said certificate of title has been lost, notice is given that after the expiration of sixty (60) days from the date hereof, I shall ssue a provisional certificate of title provided that no objection has been received within that period. 
Dated the 5th July, 2019. 
J. G. WANJOHI, 
MR/6508092 Registrar of Titles, Mombasa. 
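
Printing segments one at a time gets tedious. As a rough sketch, a loop like the one below could print a one-line preview of every segment, along with how many notice headers each one contains. `FakeSegment` and `preview_segments` are hypothetical names introduced for illustration; the only thing assumed about the real `Segment` class is its `text` attribute, and the demo strings are made-up toy data.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Hypothetical stand-in for retoolingSegmentation.Segment (only .text is assumed)."""
    text: str


def preview_segments(segments, width=60):
    """Return one line per segment: index, number of notice headers, first line of text."""
    lines = []
    for i, seg in enumerate(segments):
        stripped = seg.text.strip()
        first_line = stripped.splitlines()[0].strip() if stripped else ""
        # A well-formed segment should contain exactly one notice header.
        n_headers = seg.text.count("GAZETTE NOTICE NO.")
        lines.append(f"{i}: headers={n_headers} | {first_line[:width]}")
    return lines


# Toy data: the second item mimics a segment that swallowed a second notice.
demo = [
    FakeSegment("GAZETTE NOTICE NO. 5866 \nTHE LAND REGISTRATION ACT"),
    FakeSegment("GAZETTE NOTICE NO. 5871 \nbody one\nGAZETTE NOTICE NO. 587,2 \nbody two"),
]
for line in preview_segments(demo):
    print(line)
```

With the real data, the same function could be called as `preview_segments(segments)`; any segment reporting more than one header is a candidate for a missed split.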


Inspecting the other segments in the same way reveals the bug: the sixth segment (`segments[5]`) contains two notices when it should contain only one. The cause is the superfluous comma in the middle of the second notice's number, 'GAZETTE NOTICE NO. 587,2'.
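
To see why a stray comma can hide a segment boundary, here is a minimal sketch. It assumes that segmentation splits the page on a header pattern like `GAZETTE NOTICE NO. <digits>`; the actual regular expression used in retoolingSegmentation is not shown in this notebook, so both patterns below, the helper `split_on_headers`, and the sample text (apart from the '587,2' header) are illustrative assumptions.

```python
import re

# Assumed strict header pattern: digits only, followed by whitespace.
STRICT_HEADER = re.compile(r"GAZETTE NOTICE NO\. \d+\s")
# Tolerant variant: also accepts stray commas or periods inside the number.
TOLERANT_HEADER = re.compile(r"GAZETTE NOTICE NO\. \d[\d,.]*\s")


def split_on_headers(text, header):
    """Split text into pieces, each starting where the header pattern matches."""
    starts = [m.start() for m in header.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


sample = (
    "GAZETTE NOTICE NO. 5871 \nfirst notice body\n"
    "GAZETTE NOTICE NO. 587,2 \nsecond notice body\n"
)

print(len(split_on_headers(sample, STRICT_HEADER)))    # strict pattern misses '587,2'
print(len(split_on_headers(sample, TOLERANT_HEADER)))  # tolerant pattern finds both
```

The tolerant pattern is only one possible fix; cleaning the OCR text before segmentation would be another.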

Segments have lots of other attributes. They are all extracted using regular expressions, and are thus far from perfectly accurate. They will still be quite useful, however, as training data for our spaCy model later on.

Let's take a look at some attributes that we pulled.

In [10]:
print(segments[0].name)
print(segments[0].address)
print(segments[0].signator)
print(segments[0].date)

Abdalla Mohamed Abdalla
P.O. Box 90145, Mombasa in the Republic of Kenya
J. G. WANJOHI
5th July, 2019
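
To give a feel for how regular expressions can pull out attributes like these, here is a minimal sketch. The patterns and the helper `extract_fields` are hypothetical and written only for this example; they are not the expressions used in retoolingSegmentation, and the sample text is an abridged copy of the first segment.

```python
import re

# Hypothetical patterns, tuned to the phrasing of land registration notices.
PATTERNS = {
    "name": re.compile(r"WHEREAS (.+?), of "),
    "address": re.compile(r", of (P\.O\. Box .+?), is registered"),
    "date": re.compile(r"Dated the (.+?\d{4})\."),
    # The signator is assumed to sit on the line right after the date.
    "signator": re.compile(r"Dated the .+?\d{4}\.\s*\n(.+?),\s*\n"),
}


def extract_fields(text):
    """Return a dict of field name -> first regex match, or None when nothing matches."""
    fields = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[field] = match.group(1) if match else None
    return fields


sample_text = (
    "WHEREAS Abdalla Mohamed Abdalla, of P.O. Box 90145, Mombasa in the "
    "Republic of Kenya, is registered as proprietor in fee simple of all "
    "that piece of land. \n"
    "Dated the 5th July, 2019. \n"
    "J. G. WANJOHI, \n"
    "MR/6508092 Registrar of Titles, Mombasa. \n"
)

for field, value in extract_fields(sample_text).items():
    print(f"{field}: {value}")
```

Patterns like these break as soon as the wording or the OCR quality changes, which is why the extracted attributes are treated as noisy training data rather than ground truth.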


We can conveniently save all of the information in a segment object to a csv in this way:

In [7]:
%%capture
# when writing a csv there is a lot of annoying output, the above line gets rid of it


readingJsonsBulk.writeEntireGazetteToCsv(0)     
# this script uses gazette numbers instead of names so that they can be converted in bulk quickly

Summary information for many of these segment objects can be found in csv_outputs_train. Go and explore to see what's in there!
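
As a rough illustration of what one row of such a csv might look like, the sketch below writes the attributes shown earlier with Python's standard `csv` module. The column names and the helper `segments_to_csv` are assumptions made for this example; the real layout produced by `readingJsonsBulk.writeEntireGazetteToCsv` may differ.

```python
import csv
import io

# Hypothetical column names; the real csv layout is not shown in this notebook.
FIELDS = ["name", "address", "signator", "date"]


def segments_to_csv(rows):
    """Serialize a list of attribute dicts to CSV text, one row per segment."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "name": "Abdalla Mohamed Abdalla",
        "address": "P.O. Box 90145, Mombasa in the Republic of Kenya",
        "signator": "J. G. WANJOHI",
        "date": "5th July, 2019",
    }
]

csv_text = segments_to_csv(rows)
print(csv_text)
```

Fields that contain commas, such as the address and the date, are quoted automatically by the `csv` module, so they survive a round trip through `csv.DictReader`.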

Finally, we have written a script that uses all of this csv data to create training data for our spaCy model. It is essentially a quick and dirty, highly imperfect method for doing entity extraction.

In [12]:
firstExample = trainingDataForSpaCy.exportTrainData(0)[1]
trainingDataForSpaCy.pullFound(firstExample)

WHEREAS Fatma Eric Edward Barallon, of P.O. Box 1851- 80200, Malindi in the Republic of Kenya, is registered as proprietor lessee from the precise properties limited for a term of 99 years, from Ist December, 1993, subject to annual rent of KSh. 152 per annum, of all that piece of land known as Apartment number 4, Block "B" Ground Floor, erected on plot number 11954 Malindi, situate in the district of Malindi, registered as C.R. 53765, and whereas sufficient evidence has been adduced to show that the said certificate of lease has been lost, notice is given that after the expiration of sixty (60) days from the date hereof, I shall issue a provisional certificate of title provided that no objection has been received within that period.

LAND REGISTRATION: plot number 11954 Malindi
DEED STATUS: lost
OWNERSHIP STATUS: proprietor lessee
PERSON: Fatma Eric Edward Barallon
OWNER ADDRESS: P.O. Box 1851- 80200, Malindi in the Republic of Kenya
DISTRICT: Malindi


Not perfect, but also not bad!
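
For reference, spaCy's classic (v2-style) training format pairs each text with character-offset entity spans: `(text, {"entities": [(start, end, label)]})`. The sketch below builds one such pair from label-to-substring matches like those printed above. The helper `make_example` is hypothetical, and whether trainingDataForSpaCy builds its examples this way, or uses exactly these label names, is an assumption.

```python
def make_example(text, found):
    """Build a (text, {"entities": [...]}) pair from a dict of label -> substring."""
    entities = []
    for label, value in found.items():
        start = text.find(value)
        if start == -1:
            continue  # value does not appear verbatim; skip it rather than guess
        entities.append((start, start + len(value), label))
    entities.sort()  # order spans by their start offset
    return text, {"entities": entities}


notice = (
    "WHEREAS Fatma Eric Edward Barallon, of P.O. Box 1851- 80200, Malindi in "
    "the Republic of Kenya, is registered as proprietor lessee"
)
found = {
    "PERSON": "Fatma Eric Edward Barallon",
    "OWNER ADDRESS": "P.O. Box 1851- 80200, Malindi in the Republic of Kenya",
}

text, annotations = make_example(notice, found)
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
```

One caveat with this approach: spaCy's named entity recognizer does not accept overlapping spans, so a label such as DISTRICT ("Malindi") would have to point at an occurrence outside the address span, or be dropped for that example.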