In [34]:
# -------------------------------------------
#
# This notebook walks through the use and implementation of spaCy
# as used by the DSSG-CfA team.
#
# Sections are: 
# Part I: Installation
# Part II: 6 Steps to use a spaCy Model
# Part III: Default Model
# Part IV: Modified Model Training/Testing
# 
# -------------------------------------------

In [36]:
__version__ = '0.0.1'
__author__ = 'T Tesfaye'
__date__ = 'Aug 11, 2020'

## <center> Intro </center>

<span style='color:blue'> `spaCy` is one of the **most widely used** named entity recognition (NER) tools in the market. </span> Team DSSG-CfA chose to use it because of two specific reasons:

1. It has a strong default NER system, and
2. It can be easily customized to detect labels of interest to the user beyond the defaults.

**[Here is](https://spacy.io/api/annotation#named-entities) a list of all the default entities.** A user can build on these entities.

## Part I: Installation

* Install `spaCy`: `pip` users run `pip install -U spacy` and `conda` users run `conda install -c conda-forge spacy` in the terminal.
* To download the English model, run `python -m spacy download en_core_web_sm` in the terminal. Note that this model is trained on web text.
* To download the Portuguese model, run `python -m spacy download pt_core_news_sm` in the terminal. Note that this model is trained on news text.

If both downloads finish with the message `Download and installation successful`, you have the spaCy package along with the English and Portuguese models. If you get error messages about missing dependencies, go ahead and install them.
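
As a quick sanity check after installation, something like the following can confirm that the packages are importable. This is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical and not part of spaCy. It relies on the fact that downloaded spaCy models are installed as regular Python packages.

```python
import importlib.util


def check_installed(package_names):
    """Return a dict mapping each package name to True if it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


# The model names below are the two models installed in this section.
status = check_installed(["spacy", "en_core_web_sm", "pt_core_news_sm"])
for name, is_installed in status.items():
    print(f"{name}: {'installed' if is_installed else 'MISSING'}")
```

Any entry reported as `MISSING` points to the install or download command that needs to be rerun.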

## Part II: 6 Steps to use a spaCy Model

These are the general steps for using a spaCy model. They apply whether you are running the default model or a custom model.

1. **Step 1: Decide which model to use** 

Each default model has four parts:

    * Language: en = English
    * Type: Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities).
    * Genre: Type of text the model is trained on, e.g. web or news.
    * Size: Model size indicator, sm, md or lg. The default is `sm` and it does pretty well. 
  
  
  
   Use this information to decide which language and which model to use. **Note:** We recommend `en_core_web_sm` for English and `pt_core_news_sm` for Portuguese (this seems to be the only option available for Portuguese).
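
The naming scheme above can be illustrated by splitting a model name on underscores. This is only a sketch of the convention described in this step; `ModelName` and `parse_model_name` are hypothetical helpers, not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    language: str    # e.g. "en" or "pt"
    model_type: str  # e.g. "core"
    genre: str       # e.g. "web" or "news"
    size: str        # "sm", "md" or "lg"


def parse_model_name(name: str) -> ModelName:
    """Split a default model name such as 'en_core_web_sm' into its four parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected 4 underscore-separated parts in {name!r}")
    return ModelName(*parts)


print(parse_model_name("en_core_web_sm"))
print(parse_model_name("pt_core_news_sm"))
```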
   
2. **Step 2: Load the model**

Once you have decided which model to use, `import spacy` and load the model with `spacy.load('en_core_web_sm')` for English and `spacy.load('pt_core_news_sm')` for Portuguese. It is recommended to assign each loaded model to a variable for ease of access. That is,

`nlp_en = spacy.load('en_core_web_sm')` and `nlp_pt = spacy.load('pt_core_news_sm')`.

In [8]:
import spacy

nlp_en = spacy.load('en_core_web_sm')
nlp_pt = spacy.load('pt_core_news_sm')

3. **Step 3: Apply The Model and Convert The Text to Doc**

The spaCy model needs the text to be split into tokens (loosely speaking, each word in a sentence for English) before it can extract entities. Applying the model with `nlp_en(text_of_interest)` tokenizes the text and returns a sequence of tokens called a `Doc`. Here's an example:

In [33]:
# English Text of Interest
en_test_text = 'Jane Doe, who was born on December 12, 1990, lives in Lisbon and works at Microsoft.'

# Portuguese Text of Interest (as per Google Translate)
pt_test_text = 'Jane Doe, nascida a 12 de dezembro de 1990, vive em Lisboa e trabalha na Microsoft.'

# Apply the model

en_doc = nlp_en(en_test_text)
pt_doc = nlp_pt(pt_test_text)
type(en_doc) # to show that the returned item is a doc
type(pt_doc) # to show that the returned item is a doc

spacy.tokens.doc.Doc

At this point, the model has been applied to the text of interest: the text has been tokenized and its entities have been extracted.

4. **Step 4: Extract Entities**

In Step 3, the model identified the entities. To extract them, access the `.ents` attribute of the `Doc`, which returns a tuple containing all the entities identified. Check [this link](https://spacy.io/api/doc) for the other attributes of a `Doc`; for now, we'll focus only on `.ents`.

Here's an example.

In [14]:
print(en_doc.ents)
print(pt_doc.ents)

(Jane Doe, December 12, 1990, Lisbon, Microsoft)


tuple

As you can see, the name, the date, the city, and the company are identified correctly in the English version but the date is missed in the Portuguese version (the reason is unclear).

5. **Step 5: Operate on the Entities**

We would like more information about the entities than the tuple alone provides. For example, we want to see how they were labeled (e.g. PERSON or GPE), their character positions, etc.

I couldn't find a comprehensive list of entity attributes, but these seem to be the most important: `ent.text`, `ent.start_char`, `ent.end_char`, `ent.label_`. **Note:** `.label_` returns the label as a string such as 'PERSON', while `.label` returns the integer ID for that label, such as 380. `.label_` is more readable to humans.

Check [this link](https://spacy.io/api/span) to dig deeper into potential attributes.

Let's look at these four attributes in the English and Portuguese entities.


In [20]:
print("**English**")

for ent in en_doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

print("\n~~~~~~~//~~~~~~~~~\n")
print('**Portuguese**')
for ent in pt_doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)



**English**
Jane Doe PERSON 0 8
December 12, 1990 DATE 26 43
Lisbon GPE 54 60
Microsoft ORG 74 83

~~~~~~~//~~~~~~~~~

**Portuguese**
Jane Doe PER 0 8
Lisboa LOC 52 58
Microsoft ORG 73 82


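You will notice that the English and Portuguese models return slightly different results: the Portuguese model labels the person `PER` and the city `LOC` where the English model uses `PERSON` and `GPE`, and it misses the date. This could be due to differences in the underlying models and is worth looking into further.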
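
To make the meaning of `start_char` and `end_char` concrete: they are plain character offsets into the original string, so slicing the text with them recovers the entity text. The sketch below uses only the English offsets printed above and does not require spaCy; `spans_to_text` is a hypothetical helper name.

```python
en_test_text = 'Jane Doe, who was born on December 12, 1990, lives in Lisbon and works at Microsoft.'

# (start_char, end_char, label) triples copied from the English output above
en_entities = [(0, 8, 'PERSON'), (26, 43, 'DATE'), (54, 60, 'GPE'), (74, 83, 'ORG')]


def spans_to_text(text, entities):
    """Recover each entity's text by slicing the original string with its offsets."""
    return [(text[start:end], label) for start, end, label in entities]


for entity_text, label in spans_to_text(en_test_text, en_entities):
    print(entity_text, label)
```

This is also why the offsets are worth storing: together with the original text, they are enough to rebuild each entity later.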

6. **Step 6: Store your results and Visualize**

You can store your results in the desired format and visualize the entities using the `displacy` module. Pass the `doc` (i.e. the tokenized text) to displacy and set `style='ent'` to highlight and label the entities, as shown below. Check out [this link](https://spacy.io/usage/visualizers) for the other possible styles and modes of visualization.


**Important Note** `displacy.serve` starts a local web server, so your kernel may keep running after the visualization has been rendered on your screen. If it does, feel free to click the stop button to abort. This doesn't disrupt any of your desired outputs.

Here's an example

In [28]:
from spacy import displacy

print("English Tokenization Viz")

displacy.serve(en_doc, style='ent')

English Tokenization Viz



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [29]:
print("Portuguese Tokenization Viz")
displacy.serve(pt_doc, style='ent')

Portuguese Tokenization Viz



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


<span style='color:blue'> **That's all!** Using these short lines, you have extracted the default entities from your text. </span>

## Part III: Default Model

The process described in Part II invokes the default spaCy model. Here is an example of how to follow the above steps to apply that default model to a large dataset.

In [None]:
# Store your texts in a list or in your desired format and store your output in your desired format
input_texts = []
output_entities = []


# Step 1: Decide on a Model. We'll use english here to show the process

# Step 2: Load the model
nlp_model = spacy.load("en_core_web_sm")

# Step 3: Loop through your texts and apply the model
for example in input_texts: 
    doc = nlp_model(example)
    
    # Step 4: Extract the entities and Step 5: Operate on the entities
    entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
    
    # Step 6: Store the results. If you plan to use these to train the modified model,
    # it is recommended to store them in the following nested tuple format
    output_entities.append((example, {'entities': entities}))
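
If the results need to outlive the notebook session, one option is to write them to a JSON file. The following is a minimal sketch, not the team's actual storage method; `save_annotations`, `load_annotations` and the sample data are hypothetical. Note that JSON has no tuple type, so the tuples have to be restored on load.

```python
import json
import os
import tempfile


def save_annotations(annotations, path):
    """Write a list of (text, {'entities': [...]}) pairs to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(annotations, f, ensure_ascii=False, indent=2)


def load_annotations(path):
    """Read the file back and restore the nested tuple format."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # Tuples come back from JSON as lists, so convert them again.
    return [
        (text, {"entities": [tuple(ent) for ent in annotation["entities"]]})
        for text, annotation in raw
    ]


# Made-up sample in the same format as `output_entities`
sample = [("Jane Doe lives in Lisbon.", {"entities": [(0, 8, "PERSON"), (18, 24, "GPE")]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "annotations.json")
    save_annotations(sample, path)
    print(load_annotations(path) == sample)
```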

## Part IV: Modified Model Training/Testing

Although the default spaCy model is strong, it can also be modified.

The following function is copied from [this blog](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718?gi=e6a352b438ed), which provides a step-by-step walkthrough.

Here's a list of modifications to make to customize this function:

* `training_data`: A list of tuples where each tuple is a single training example in the format `(text, {'entities': ents})`. Here `text` is the full text and `ents` is the list of all entities, each given by its start character, end character and label as shown in the Part III example. Note that `'entities'` is the key of the dictionary containing the ents.

    + **Note** Modifying a spaCy model suffers from a catastrophic forgetting problem as described in [this link]. Hence, make sure the training data contains examples of both the modified labels and the default labels. 
* `all_labels`: A list of all the default and modified labels you want the model to detect. Including extraneous labels does not impact the performance of the model, while missing labels crash the model. Hence, it is recommended to include all the potential labels.

    + Example: `all_labels = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'LAND', 'PLOT NUMBER']`

* `output_dir`: Provide an output directory for where you would like to save the trained model. It is recommended to save the model in its own folder for clear organization.
* `new_model_name`: Any string of your choosing to be used as the name for the trained model
* `n_iter`: The number of iterations you would like the model to run through. The more iterations a model performs, the better its performance (up to a certain point) but the longer it takes.



Note, this function is provided for English. To adapt it for Portuguese, change the line `nlp = spacy.blank("en")` to `nlp = spacy.blank("pt")`.
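
Because missing labels crash the model, it can help to sanity-check `training_data` against `all_labels` before starting a long training run. The following is a minimal sketch under that assumption; `validate_training_data` and the sample sentence are hypothetical and not part of spaCy or the original blog post.

```python
def validate_training_data(training_data, all_labels):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    known_labels = set(all_labels)
    for i, (text, annotations) in enumerate(training_data):
        for start, end, label in annotations.get("entities", []):
            # Offsets must describe a non-empty span inside the text
            if not (0 <= start < end <= len(text)):
                problems.append(f"example {i}: span ({start}, {end}) is outside the text")
            # Every label used in the data must be listed in all_labels
            if label not in known_labels:
                problems.append(f"example {i}: label {label!r} is missing from all_labels")
    return problems


# Made-up training example with one default and one modified label
sample_data = [
    ("Plot number 12 is in Lisbon.", {"entities": [(0, 14, "PLOT NUMBER"), (21, 27, "GPE")]}),
]

print(validate_training_data(sample_data, ["GPE", "PLOT NUMBER"]))
print(validate_training_data(sample_data, ["GPE"]))
```

The first call finds no problems, while the second reports that `'PLOT NUMBER'` is missing from `all_labels`.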
    


In [31]:
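import random
import warnings
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding

all_my_training_data = []
all_my_labels = []
my_local_output_dir = ''


# Note: this function uses the spaCy v2 training API (nlp.create_pipe, nlp.update(texts, annotations)).
def trainModifiedNERModel(training_data, all_labels, model=None, new_model_name="modified_ner_model_gazettes", output_dir=my_local_output_dir, n_iter=100):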
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
        
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    # add new entity label to entity recognizer
    # Adding  labels shouldn't mess anything up
    
    
    # Add our labels to the ner
    for i in all_labels:
        ner.add_label(i)
    
    #ner.add_label("LAND")
    
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(training_data)
            batches = minibatch(training_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
 
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        
    
    return "Model Trained and Saved."
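

Inside the training loop above, `compounding(1.0, 4.0, 1.001)` produces slowly growing batch sizes and `minibatch` groups the training examples accordingly. The plain-Python sketch below approximates that behavior for illustration only; `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins, not spaCy's implementation, and the sketch only covers the increasing case.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Group items into batches whose sizes are drawn from the `sizes` generator."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            break
        yield batch


# A growth rate of 1.5 is used here (instead of 1.001) so the growing
# batch sizes are visible on only ten items.
batches = list(minibatch_sketch(range(10), compounding_sketch(1.0, 4.0, 1.5)))
print(batches)
```

With a rate of 1.001, the batch size starts at 1 and grows very slowly toward the cap of 4 as more batches are drawn.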


Here's an example of how to run this function

In [None]:
trainModifiedNERModel(training_data=all_my_training_data, all_labels=all_my_labels, output_dir=my_local_output_dir, n_iter=100)

Depending on the size of your training data, running this function could take a few minutes or over an hour. Once training is complete, you will see these four items in your output directory: 

1. A folder named `ner`
2. A folder named `vocab`
3. A file named `meta.json`
4. A file named `tokenizer`

If you see these four items and your kernel is done running, then you can conclude that your model has finished training.
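
This check can also be done programmatically. The following is a minimal sketch using only the standard library; `missing_model_items` is a hypothetical helper, and the item names are the four listed above.

```python
from pathlib import Path

# The four items listed above
EXPECTED_ITEMS = ["ner", "vocab", "meta.json", "tokenizer"]


def missing_model_items(output_dir, expected=EXPECTED_ITEMS):
    """Return the names of expected items that are not present in output_dir."""
    output_dir = Path(output_dir)
    return [name for name in expected if not (output_dir / name).exists()]


# Hypothetical directory name; replace with your own output directory
missing = missing_model_items("my_model_output_dir")
if missing:
    print("Still missing:", missing)
else:
    print("All four items are present.")
```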

### Loading Trained Model

Loading the trained model is as simple as following the general instructions outlined in Part II with one exception. In Part II, Step 2, we loaded the model by running the command `spacy.load('en_core_web_sm')` for English and `spacy.load('pt_core_news_sm')` for Portuguese. However, since we are now interested in loading our own model, we will supply spacy with the path to the directory containing the modified model. In other words, we run `spacy.load(my_local_output_dir)`.

This is the only modification we need to make; we can then follow Steps 3 - 6 as outlined in Part II above to explore our trained model.

### Testing The Trained Model

Team DSSG-CfA decided to conduct model testing manually. We assumed that the reference performance for a named entity recognition algorithm is 100%, since humans can accurately identify entities. Given the time constraint, we weren't able to come up with an automated way of producing test sets whose default _and_ modified entities are labeled with 100% accuracy. Hence, it was faster for our team to visualize the entities using a single line of `displacy` code (as shown above) and quickly skim through the output in order to assess the performance of the model. We recommend this approach to teams that are satisfied with running only a handful of tests, and we recommend finding an automated alternative to teams that want to run hundreds of tests.
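
For teams that do have hand-labeled test examples, one possible automated alternative is to compare predicted entities against the reference entities and compute exact-match precision, recall and F1. This is only a sketch of that idea, not something the team used; `score_entities` is a hypothetical helper, and the predicted list below is made up for illustration.

```python
def score_entities(gold, predicted):
    """Exact-match precision, recall and F1 over (start_char, end_char, label) tuples."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Reference entities labeled by hand, and an illustrative prediction that misses the date
gold = [(0, 8, 'PERSON'), (26, 43, 'DATE'), (54, 60, 'GPE'), (74, 83, 'ORG')]
predicted = [(0, 8, 'PERSON'), (54, 60, 'GPE'), (74, 83, 'ORG')]

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: an entity with the right label but slightly different offsets counts as both a miss and a false alarm, so the scores are a lower bound on what a human reviewer might accept.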