In [34]:
# -------------------------------------------
#
# This notebook explains the named entity recognition tools
# that the DSSG-CfA team considered but didn't use.
#
# Sections are: 
# Introduction
# Tool 1: Stanza
# Tool 2: NLTK
# Tool 3: Snorkel
# 
# -------------------------------------------

In [1]:
# Notebook metadata (plain assignments, so the values are actually stored)
__version__ = '0.0.1'
__author__ = 'T Tesfaye'
__date__ = 'Aug 19, 2020'

## Introduction

To conduct named entity recognition (NER) on the Kenyan Gazettes, our team explored various tools before deciding to use [`spaCy`](https://spacy.io/usage/spacy-101), which is one of the most widely used free, open-source libraries for advanced natural language processing in Python.

The other documents in this directory explain the spaCy model in depth. For beginners, the notebook titled `Y_general_spaCy_beginner_tutorial.ipynb` provides a step-by-step walkthrough of using spaCy.


This notebook explains the other tools our team considered but, for various reasons, did not end up pursuing. We hope this will help those continuing this project.


## Tool 1: Stanza


**Overview of Stanza**

Stanza, previously released as StanfordNLP, is a collection of tools developed by the Stanford NLP Group that can be used to process raw text data from its initial stages through entity recognition. It supports 66 human languages. Its neural pipeline is written natively in Python, and it also provides a Python interface to the Java-based Stanford CoreNLP toolkit. Detailed information about this tool and tutorials can be found [here](https://stanfordnlp.github.io/stanza/pipeline.html).

Stanza reads a given text as a document. The document has several properties, including *.sentences*, which returns a list of the sentences detected in the text. Each sentence has a *.tokens* property that returns a list of the tokens in that sentence, and each token has a *.words* property that returns a list of the words in that token. In English, a token is usually a single word. In French, however, the single token *du* is a contraction of the two words *de le*, so one token can contain more than one word. Finally, named entity recognition is run on each token.


To use Stanza, one starts by installing the `stanza` package from their preferred command-line interface and following the instructions on the [official Stanza website](https://stanfordnlp.github.io/stanza/pipeline.html).
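
As a rough illustration of the document structure described above, the sketch below walks a Stanza document and collects each token with its entity tag. It assumes the `stanza` package is installed and that the English models can be downloaded. `token_tags`, `run_stanza`, and `fake_doc` are hypothetical names introduced for this example and are not part of Stanza.

```python
from types import SimpleNamespace


def token_tags(doc):
    """Return (token text, entity tag) pairs for every token in a Stanza-style document."""
    return [
        (token.text, token.ner)
        for sentence in doc.sentences   # document -> sentences
        for token in sentence.tokens    # sentence -> tokens
    ]


def run_stanza(text):
    """Build an English tokenize + NER pipeline and annotate `text` (downloads models on first use)."""
    import stanza  # assumed to be installed, e.g. with `pip install stanza`

    stanza.download("en")
    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
    return nlp(text)


# Offline stand-in with the same attribute names, so the helper can be tried
# without downloading any models. The tags mirror one of the outputs reported
# later in this notebook.
fake_doc = SimpleNamespace(sentences=[SimpleNamespace(tokens=[
    SimpleNamespace(text="J.", ner="B-PERSON"),
    SimpleNamespace(text="K", ner="I-PERSON"),
    SimpleNamespace(text=".", ner="I-PERSON"),
    SimpleNamespace(text="Njoroge", ner="E-PERSON"),
])])

for text, tag in token_tags(fake_doc):
    print(f"{text}: {tag}")
```

With Stanza installed, `token_tags(run_stanza("some gazette notice text"))` should produce the same kind of list for real input.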

**Conclusion of Applying Stanza on Gazettes**

To test the compatibility of Stanza with the gazette entries, we applied Stanza to randomly selected gazette notices and concluded that:

Unique Strengths:

1. Stanza does a good job of identifying names that are not American, townships that are specific to Kenya, and even P.O. Boxes and land sizes in hectares.

Unique Weaknesses:

1. Despite its good performance, the accuracy of Stanza’s entity detection depends on the capitalization of the specific word as well as on the text surrounding it. The following output demonstrates that the same name `J. K. Njoroge` can be identified as a date, a person, or an organization depending on the words surrounding it and its capitalization. Due to this shortcoming, we could not rely on Stanza as our final NER tool.

J.: I-DATE
K.: I-DATE
NJOROGE: E-DATE

J.: B-PERSON
K: I-PERSON
.: I-PERSON
Njoroge: E-PERSON

J.: B-PERSON
K: I-PERSON
.: E-PERSON
NJOROGE: S-ORG

2. We also found that the Stanza user interface and online support system were less than desirable.
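
One way to check this kind of inconsistency systematically is to reduce each tagged output to its set of entity types and compare the sets. The sketch below does this for the three outputs listed under the first weakness. The tags are copied from those outputs, while `OUTPUTS` and `entity_types` are hypothetical names introduced for illustration.

```python
# (token, tag) pairs copied from the three outputs for "J. K. Njoroge".
OUTPUTS = [
    [("J.", "I-DATE"), ("K.", "I-DATE"), ("NJOROGE", "E-DATE")],
    [("J.", "B-PERSON"), ("K", "I-PERSON"), (".", "I-PERSON"), ("Njoroge", "E-PERSON")],
    [("J.", "B-PERSON"), ("K", "I-PERSON"), (".", "E-PERSON"), ("NJOROGE", "S-ORG")],
]


def entity_types(tagged_tokens):
    """Drop the position prefix (B-, I-, E-, S-) and return the distinct entity types."""
    return {tag.split("-", 1)[1] for _, tag in tagged_tokens if tag != "O"}


types_per_output = [entity_types(output) for output in OUTPUTS]

# If the tagger were consistent, every output would yield the same single type.
is_consistent = all(types == types_per_output[0] for types in types_per_output)
print(types_per_output, "consistent:", is_consistent)
```

Running the same comparison over several capitalization variants of each name in a sample of notices would give a rough measure of how often the problem occurs.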


## Tool 2: NLTK

**Overview of NLTK**

[Natural Language Toolkit (NLTK)](https://www.nltk.org/) is a platform for building Python programs to work with human language data. It is one of the most widely used named entity recognition tools available. One can follow [this tutorial](https://www.digitalocean.com/community/tutorials/how-to-work-with-language-data-in-python-3-using-the-natural-language-toolkit-nltk) to install and start working with NLTK.

**Conclusion of Applying NLTK on Gazettes**

After experimenting with NLTK on randomly selected gazette sections, we concluded the following:

Unique Strengths:

* NLTK has a strong multi-word tokenization tool. For instance, a user can easily create a custom token such as "title No." and train the NLTK model to recognize this phrase as one token, which is particularly helpful in recognizing Land Registration numbers in the gazettes. Multi-word tokenization is slightly more complicated in spaCy.
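
A minimal sketch of this idea, assuming that NLTK's `MWETokenizer` (from `nltk.tokenize`) is the tool in question; the notebook does not say which NLTK component the team used. The phrase list and the `merge_phrases` helper are hypothetical. The small fallback class exists only so that the example still runs where NLTK is not installed.

```python
try:
    # NLTK's multi-word expression tokenizer; it needs no corpus download.
    from nltk.tokenize import MWETokenizer
except ImportError:
    class MWETokenizer:
        """Tiny stand-in used only when NLTK is missing (longest match first)."""

        def __init__(self, mwes, separator="_"):
            self.mwes = sorted((tuple(m) for m in mwes), key=len, reverse=True)
            self.separator = separator

        def tokenize(self, tokens):
            merged, i = [], 0
            while i < len(tokens):
                for mwe in self.mwes:
                    if tuple(tokens[i:i + len(mwe)]) == mwe:
                        merged.append(self.separator.join(mwe))
                        i += len(mwe)
                        break
                else:
                    # No registered phrase starts here, so keep the token as-is.
                    merged.append(tokens[i])
                    i += 1
            return merged


# Hypothetical phrase list: treat "title No." as a single token.
tokenizer = MWETokenizer([("title", "No.")], separator=" ")


def merge_phrases(text):
    """Split on whitespace, then merge the registered multi-word phrases."""
    return tokenizer.tokenize(text.split())


print(merge_phrases("title No. 12345 is registered"))
```

Note that matching is case-sensitive, so variants such as "Title No." would need their own entries in the phrase list.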


Weaknesses:

* The approach required to create a modified NER model in NLTK is more cumbersome than that of spaCy. This is the main reason our team chose spaCy over NLTK.

## Tool 3: Snorkel

**Overview of Snorkel**


Snorkel is a system for programmatically building and managing training datasets without manual labeling. See [their website](https://www.snorkel.org/get-started/) for more detail and [their GitHub page](https://github.com/snorkel-team) for tutorials.
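
To give a feel for what "programmatic labeling" means, the sketch below writes a few labeling functions in plain Python and combines their votes by simple majority. This is not Snorkel's API: Snorkel combines the votes with a learned label model rather than a majority vote. The label names, keywords, and example notices are all hypothetical.

```python
from collections import Counter

# Hypothetical label scheme for gazette notices.
ABSTAIN, OTHER, LAND = -1, 0, 1


def lf_mentions_hectares(text):
    """Vote LAND when the notice mentions a size in hectares."""
    return LAND if "hectare" in text.lower() else ABSTAIN


def lf_mentions_title_no(text):
    """Vote LAND when the notice mentions a title number."""
    return LAND if "title no." in text.lower() else ABSTAIN


def lf_mentions_deceased(text):
    """Vote OTHER when the notice looks like a succession notice."""
    return OTHER if "deceased" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_hectares, lf_mentions_title_no, lf_mentions_deceased]


def weak_label(text):
    """Combine the labeling functions by majority vote; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # the functions disagree evenly
    return top


for notice in [
    "Title No. 123, measuring 0.5 hectares",
    "The estate of the deceased",
    "Appointment of a chairman",
]:
    print(notice, "->", weak_label(notice))
```

Each function encodes one cheap heuristic, so a handful of them can label many notices at once; the resulting labels are noisy and are meant as training data, not as final answers.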

**Conclusion of Applying Snorkel on Gazettes**

After experimenting with Snorkel on randomly selected gazette sections, we concluded the following:

Unique Strengths:

* Snorkel is an advanced tool that allows us to generate a large amount of labeled training data from a small amount of hand-labeled data.

Weaknesses:

* Due to a lack of time, our team could not explore the full capabilities of Snorkel.



## Conclusion

After considering these three tools (Stanza, NLTK, and Snorkel), our team decided to use `spaCy` for the following reasons:

* `spaCy` provides a strong default NER system that identifies entities such as Person, Organization, and Date.
* `spaCy` is easily customizable. A modified model can easily be trained using custom training sets. We used Land Registration Act notices to train a model that recognizes entities such as "Land Size" and "Land Registration Number."
* `spaCy` has ample support online.
* `spaCy` is free and open source, so it does not place constraints on our partners' financial resources.

In [2]:
## THE END