# Dealing with Text Data Workshop

Through examples we will:

1.  Retrieve data
2.  Ethics checklist
3.  Tokenize
4.  Normalize
5.  Demo: Label data with `doccano`
6.  Convert to `spacy` format
7.  Extra:  Train a model
8.  References

## Setup

In [None]:
# Ensure we are using the right pip for the Python kernel
# If not using conda, try without the {sys.prefix}/bin part
import sys
! {sys.prefix}/bin/pip install -r requirements.txt

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import os
import unicodedata
import string
import importlib
import spacy
from nltk.stem.snowball import SnowballStemmer

Download the language features and model for English for use with spaCy.

In [None]:
import sys
# This downloads en_core_web_sm; 'en' is its shortcut name
! {sys.executable} -m spacy download en

## Get data

The data comes from NY Times Cooking recipes, which are free to read.  The NY Times holds the copyright, so please keep that in mind when using or sharing the content.

Use the `requests` library to get a recipe as raw HTML and `BeautifulSoup` to parse through the HTML page to get to content of interest.

In [None]:
page = requests.get('https://cooking.nytimes.com/recipes/1018442-chicken-soup-from-scratch')
soup = BeautifulSoup(page.content, 'html.parser')
steps = soup.find_all("ol", {"class": "recipe-steps"})

print(steps)

Clean HTML tags to get raw text.

In [None]:
def cleanhtml(raw_html):
    """Remove HTML tags and flatten newlines into spaces."""
    cleanr = re.compile('<.*?>')
    # Remove html tags
    cleantext = re.sub(cleanr, '', raw_html)
    # Replace newlines with spaces and trim surrounding whitespace
    cleantext = cleantext.replace('\n', ' ').strip()
    return cleantext

cleansteps = cleanhtml(str(steps[0]))
print(cleansteps)
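
The regex in `cleanhtml` removes tags but leaves HTML entities such as `&amp;` untouched, and it can glue together text from adjacent tags. As a minimal sketch of one way to handle both, using only the standard library (the sample HTML and the name `clean_html_sketch` are made up for illustration):

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def clean_html_sketch(raw_html):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    # Replace each tag with a space so neighbouring list items stay separated
    text = TAG_RE.sub(' ', raw_html)
    # Turn entities such as &amp; back into plain characters
    text = html.unescape(text)
    # Collapse newlines and repeated spaces into single spaces
    return ' '.join(text.split())

# Hypothetical sample, not the real page content
sample = ('<ol class="recipe-steps"><li>Salt &amp; pepper the chicken.</li>\n'
          '<li>Simmer for 2 hours.</li></ol>')
print(clean_html_sketch(sample))
```

For anything beyond a quick cleanup, BeautifulSoup's own `get_text()` method is usually a safer choice than a regex.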

Save data to disk

In [None]:
# Create the data folder if it does not exist yet
os.makedirs('data', exist_ok=True)
with open(os.path.join('data', 'sample_data.txt'), 'w') as fptr:
    fptr.write(cleansteps)

## Ethics checklist

`deon` is a command line tool for creating Data Science ethics checklists.  https://github.com/drivendataorg/deon

Run `deon` as follows to create the standard checklist and see the output in the repo folder.

In [None]:
! deon -o ETHICS.md

The portion dealing mainly with data will look like:

A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?



## Word tokenize text with spaCy

Tokenizing means breaking a corpus or document into units that make sense for the NLP task at hand, such as words, n-grams or sentences (the last is called sentence tokenization).

Many libraries perform tokenization, such as NLTK, [Gensim](https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize) and [spaCy](https://spacy.io/usage/linguistic-features#tokenization).  ML practitioners also often implement their own tokenizer function.  spaCy's tokenizer applies language-specific rules, so a token like 'U.K.' is _not_ broken apart into ['U', '.', 'K', '.'] but kept intact, as it should be in most cases.  Here we'll stick with spaCy for consistency and for its other features, which use ML behind the scenes.

In spaCy, the tokenizer, going from left to right, performs the following steps:
* Splits on whitespace
* Checks:
  - Does substring match an exception rule?
  - Can a prefix, infix or suffix be split off?
  
Here's an example of how spaCy does tokenization:

![spaCy tokenization](https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg)
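
The left-to-right procedure described above can be sketched in plain Python. This is a toy illustration only: the exception table and the prefix/suffix patterns are invented for the example, infix handling is left out, and spaCy's real rules are far richer.

```python
import re

# Hypothetical toy rules, not spaCy's actual data
EXCEPTIONS = {"U.K.": ["U.K."], "don't": ["do", "n't"]}
PREFIX_RE = re.compile(r"""^[(\["']""")
SUFFIX_RE = re.compile(r"""[)\]"'.,;!?]$""")

def toy_tokenize(text):
    tokens = []
    # Step 1: split on whitespace
    for chunk in text.split():
        suffixes = []
        while chunk:
            # Step 2a: does the substring match an exception rule?
            if chunk in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[chunk])
                break
            # Step 2b: can a prefix be split off?
            match = PREFIX_RE.search(chunk)
            if match:
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            # Step 2c: can a suffix be split off?
            match = SUFFIX_RE.search(chunk)
            if match:
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            # Nothing left to split: keep the remainder as one token
            tokens.append(chunk)
            break
        tokens.extend(suffixes)
    return tokens

print(toy_tokenize("Visit the U.K. (soon), don't wait."))
```

Note how 'U.K.' survives because the exception check runs before any suffix is stripped.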

We will re-tokenize later after some more preprocessing.

In [None]:
## If getting OSError with spacy.load('en'), try uncommenting and running the following

# importlib.reload(spacy)
# ! {sys.prefix}/bin/python -m spacy download en

In [None]:
# Read our data back in
with open(os.path.join('data', 'sample_data.txt'), 'r') as fptr:
    article = fptr.read()
print(article)

Load 'en_core_web_sm' with its link 'en' (they are the same thing; 'en' is the link/shortcut name).

In [None]:
spacy_nlp = spacy.load('en')

In [None]:
doc = spacy_nlp(article)
tokens = [token.text for token in doc]
print(tokens)

## Normalize text

There is no single way to normalize text, and at times it requires domain knowledge.  Normalizing text can include the following:

* Convert Unicode characters to ASCII
* Make lowercase
* Remove punctuation
* Remove stop words
* Stemming
* Lemmatization


Let's do each separately.
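
Lowercasing and punctuation removal from the list above need only the standard library. A minimal sketch (the function name is made up for illustration, and `string.punctuation` covers ASCII punctuation only):

```python
import string

def lowercase_and_strip_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)

print(lowercase_and_strip_punctuation("Bring to a BOIL, then reduce the heat!"))
```

Whether to drop punctuation depends on the task; for sentence tokenization later in this workshop, periods need to be kept.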

### Unicode to ASCII

Convert Unicode to ASCII as a form of text normalization.

In [None]:
all_letters_numbers = string.ascii_letters + " .,;'" + "0123456789"
n_letters = len(all_letters_numbers)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters_numbers
    )

In [None]:
ascii_article = unicode_to_ascii(article)
print(ascii_article)
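
To see why `unicode_to_ascii` works, it helps to look at what NFD normalization does on its own: it splits an accented character into a base letter plus a combining mark (category `Mn`), which can then be filtered out. A small sketch with a made-up example word:

```python
import unicodedata

# "crème brûlée", written with escapes so the accents are precomposed
word = "cr\u00e8me br\u00fbl\u00e9e"

# NFD splits each accented letter into base letter + combining mark
decomposed = unicodedata.normalize('NFD', word)
print(len(word), len(decomposed))

# Dropping the combining marks (category 'Mn') leaves plain ASCII letters
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

Keep in mind that the whitelist in `all_letters_numbers` also silently drops any character not listed there, such as `!`, `-`, `/` or parentheses.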

### Lemmatization with spaCy

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, dictionary form or base word. [[1]](#references)  For instance, "are, is, being" becomes "be".

In spaCy we operate on the Document `doc` from above, which holds much more than lemmas.  Note that spaCy uses part-of-speech tags to lemmatize words.

A nice diagram of stemming (covered below) and lemmatization (our current topic) is shown here:

![stemming vs lemmatization](assets/stem_lemma.png)

<div align="right"><a href="https://www.quora.com/What-is-difference-between-stemming-and-lemmatization">Source</a></div>

In [None]:
# Lemmatize, but keep the original text where spaCy v2 returns the
# placeholder lemma '-PRON-' for pronouns such as 'it'
lemmatized_tokens = [token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc]
print(lemmatized_tokens)

# print([token.lemma_ for token in doc]) # to see this pronoun replacement

### Remove stop words with spaCy

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. [[1]](#references)

To do this we check the attributes of spaCy Document tokens at:  https://spacy.io/api/token#attributes (look for `is_stop`).

This removes high-frequency words such as "of", "the" and "and".

In [None]:
no_stop_words = [token.text for token in doc if not token.is_stop]
print(no_stop_words)
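
Under the hood, stop word removal is a set-membership filter. A minimal sketch without spaCy, using a tiny hypothetical stop list (spaCy's English list is much longer):

```python
# Hypothetical, deliberately tiny stop list for illustration
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Cover", "the", "chicken", "with", "cold", "water"]))
```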

### Stemming with NLTK

Stemming reduces a word to its stem by stripping affixes; unlike a lemma, the stem need not be a dictionary word.  spaCy does not include stemmers, so we turn to the NLTK package for this.  See the how-to at:  http://www.nltk.org/howto/stem.html

In [None]:
stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(token.text) for token in doc]
print(stemmed_tokens)
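
To make the difference between stemming and lemmatization concrete, here is a toy comparison in plain Python. Neither function is the Snowball algorithm or spaCy's lemmatizer: the suffix list and the lookup table are invented for this example.

```python
# Hypothetical suffix list and lemma table, for illustration only
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_LOOKUP = {"are": "be", "is": "be", "being": "be",
                "leaves": "leaf", "simmering": "simmer"}

def toy_stem(word):
    # Chop off the first matching suffix if at least 3 letters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    # Look the word up; fall back to the word itself if it is unknown
    return LEMMA_LOOKUP.get(word, word)

for word in ["are", "being", "leaves", "simmering"]:
    print(word, '-> stem:', toy_stem(word), '| lemma:', toy_lemma(word))
```

The stemmer turns "leaves" into the non-word "leav", while the lookup returns "leaf"; that is the trade-off between a cheap rule and a dictionary-backed answer.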

### Put it together

**Exercise 1**:  Write one function that converts text to ASCII and lemmatizes it, returning a list of tokens.

In [None]:
def normalize_text_to_tokens(text):
    pass

In [None]:
norm_tokens = normalize_text_to_tokens(article)
print(norm_tokens)

### Save text

In [None]:
with open(os.path.join('data', 'normalized_sample_data.txt'), 'w') as fptr:
    fptr.write(' '.join(norm_tokens)
               .replace(' .', '.')
               .replace(' ,', ',')
               .replace(' ;', ';'))
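
The chained `.replace` calls re-attach punctuation to the preceding word after joining tokens with spaces. A compact alternative sketch using one regular expression (the helper name `detokenize` is made up here, and it assumes only the listed punctuation marks need re-attaching):

```python
import re

def detokenize(tokens):
    text = ' '.join(tokens)
    # Remove the whitespace that the join placed before punctuation marks
    return re.sub(r'\s+([.,;:!?])', r'\1', text)

print(detokenize(['bring', 'to', 'a', 'boil', ',', 'then', 'reduce', 'the', 'heat', '.']))
```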

## Sentence tokenization

Our goal is named entity recognition (NER), so we read the normalized data back in and split it into sentences to prepare it for labeling.

In [None]:
with open(os.path.join('data', 'normalized_sample_data.txt'), 'r') as fptr:
    article = fptr.read()

In [None]:
doc = spacy_nlp(article)

for i, sent in enumerate(doc.sents):
    print('-->Sentence %d: %s' % (i, sent.text))
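
For contrast, here is a naive regex-based splitter. It is not how spaCy finds sentence boundaries; it is a stand-in to show why a simple rule falls short on recipe text with abbreviations (the example sentence is made up):

```python
import re

def naive_sentence_split(text):
    # Split wherever ., ! or ? is followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

text = "bring to a boil. add 1 tsp. salt and stir."
for i, sent in enumerate(naive_sentence_split(text)):
    print('-->Sentence %d: %s' % (i, sent))
```

The rule wrongly breaks after "tsp.", so it is worth checking the sentence boundaries before sending data off for labeling.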

### Save sentences

In [None]:
with open(os.path.join('data', 'normalized_sentences.txt'), 'w') as fptr:
    fptr.write('\n'.join(sent.text for sent in doc.sents))

## Demo:  Label data with `doccano`

`doccano` is an open source text labeling tool.  If you wish to set it up on your own, see the instructions at https://github.com/chakki-works/doccano (the Docker setup is recommended).

Once the app is running in the browser, labeling is as simple as importing data (`data/normalized_sentences.txt`), making the label and highlighting the text.

![doccano example](assets/doccano.png)

The labeled data is then exported as JSON, one object per line.

![doccano export](assets/doccano_export.png)

The output has been saved to this repo for your convenience as `data/doccano_annots.json`.

## Convert custom data to spaCy format

The doccano export is very similar to what spaCy expects.

Our doccano format looks like:

```json
{"id": 54, "text": "place the chicken, celery, carrot, onion, parsnip if use, parsley, peppercorn, bay leave and salt in a large soup pot and cover with cold water by 1 inch.", "meta": {}, "annotation_approver": null, "labels": [[109, 117, "EQUIPMENT"]]}
{"id": 55, "text": "bring to a boil over high heat, then immediately reduce the heat to very low.", "meta": {}, "annotation_approver": null, "labels": []}
```

The spaCy format looks like:

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
```

**Exercise 2**:  Can you write a script to convert the doccano export to the spaCy format?

In [None]:
import json

TRAIN_DATA = []

# open file and convert here
with open('data/doccano_annots.json', 'r') as fptr:
    pass


print(TRAIN_DATA)
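
One possible approach, sketched under the assumption that the export is JSON Lines (one object per line) with `text` and `labels` keys as in the sample above. The two records are copied from that sample so the block runs without the file; the names `doccano_to_spacy` and `TRAIN_DATA_SKETCH` are made up for illustration.

```python
import json

# Two records copied from the doccano sample shown earlier
DOCCANO_EXPORT = '''\
{"id": 54, "text": "place the chicken, celery, carrot, onion, parsnip if use, parsley, peppercorn, bay leave and salt in a large soup pot and cover with cold water by 1 inch.", "meta": {}, "annotation_approver": null, "labels": [[109, 117, "EQUIPMENT"]]}
{"id": 55, "text": "bring to a boil over high heat, then immediately reduce the heat to very low.", "meta": {}, "annotation_approver": null, "labels": []}
'''

def doccano_to_spacy(lines):
    """Convert an iterable of doccano JSON lines to spaCy-style tuples."""
    train_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # doccano stores [start, end, label] lists; spaCy expects tuples
        entities = [(start, end, label) for start, end, label in record["labels"]]
        train_data.append((record["text"], {"entities": entities}))
    return train_data

TRAIN_DATA_SKETCH = doccano_to_spacy(DOCCANO_EXPORT.splitlines())

# Sanity check: print the text span behind every label
for text, annotations in TRAIN_DATA_SKETCH:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Because an open file iterates line by line, the same function can be called as `doccano_to_spacy(fptr)` inside the `with open(...)` block.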

## Extra:  Example of training a spaCy NER model

NER, or Named Entity Recognition, can be achieved with models such as RNNs, LSTMs or Bi-LSTMs.  An _entity_ is a named real-world object such as a person, place or thing (shown here with the labels ORG, GPE and MONEY).  Entities are often discussed alongside intents, where an _intent_ is the action, want or desire (e.g. "is looking at buying").

![ner example](assets/ner.png)
<div align="right">Image Source:  https://spacy.io/usage/linguistic-features#named-entities-101</div>

Load the model

In [None]:
# Setting up the pipeline and entity recognizer.
model = None
if model is not None:
    spacy_nlp = spacy.load(model)  # load existing spacy model
    print("Loaded model '%s'" % model)
else:
    spacy_nlp = spacy.blank('en')  # create blank Language class
    print("Created blank 'en' model")

if 'ner' not in spacy_nlp.pipe_names:
    ner = spacy_nlp.create_pipe('ner')
    spacy_nlp.add_pipe(ner)
else:
    ner = spacy_nlp.get_pipe('ner')

Add a new entity label

In [None]:
# Add new entity labels to entity recognizer
LABEL = ['EQUIPMENT']
for i in LABEL:
    ner.add_label(i)

# Initializing optimizer for training
if model is None:
    optimizer = spacy_nlp.begin_training()
else:
    optimizer = spacy_nlp.entity.create_optimizer()

Train model by looping and updating weights

In [None]:
import random
from spacy.util import minibatch

# Epochs
n_iter = 10

# Get names of other pipes to disable them during training to train only NER
other_pipes = [pipe for pipe in spacy_nlp.pipe_names if pipe != 'ner']
with spacy_nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = minibatch(TRAIN_DATA, size=2)
        for batch in batches:
            texts, annotations = zip(*batch)
            spacy_nlp.update(texts, annotations, sgd=optimizer, drop=0.40,
                             losses=losses)
        print('Losses', losses)
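
The batching step in the loop can be hard to picture, so here is a plain-Python sketch of the idea. `simple_minibatch` is a hypothetical stand-in for `spacy.util.minibatch` with a fixed size, and the third training example is made up.

```python
def simple_minibatch(items, size):
    # Yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

TRAIN_DATA_SKETCH = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    ("cover with cold water in a large soup pot.", {"entities": [(33, 41, "EQUIPMENT")]}),
]

for batch in simple_minibatch(TRAIN_DATA_SKETCH, size=2):
    # zip(*batch) turns a list of (text, annotation) pairs into
    # one tuple of texts and one tuple of annotations
    texts, annotations = zip(*batch)
    print(len(batch), texts)
```

With three examples and `size=2`, the loop sees one batch of two and one batch of one.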

Save model

In [None]:
from pathlib import Path


# Save model
output_dir = 'weights'
new_model_name = 'en_equipment'
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    spacy_nlp.meta['name'] = new_model_name  # rename model
    spacy_nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

Test model

In [None]:
# Test the saved model

# Example text to test on
test_text = 'Now, fry the vegetables in the sauce pan or skillet, stirring constantly using a spoon, bake in the oven, steam in a large pot or cook in the microwave.'
# Normalize with our function from above
tokens_preprocess = normalize_text_to_tokens(test_text)
# Return normalized text to sentence form
test_text_processed = (' '.join(tokens_preprocess)
                       .replace(' .', '.')
                       .replace(' ,', ',')
                       .replace(' ;', ';'))
print(test_text_processed)

# Load model and predict on test text
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text_processed)
for ent in doc2.ents:
    print(ent.label_, ent.text)

## References

1.  [NLP Pipeline series by Edward Ma](https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
2.  [NLP From Scratch: Classifying Names with a Character-Level RNN - on PyTorch Docs](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)
3.  [How does Data Noising Help to Improve your NLP Model? by Edward Ma](https://medium.com/towards-artificial-intelligence/how-does-data-noising-help-to-improve-your-nlp-model-480619f9fb10)
4.  [Custom Named Entity Recognition Using spaCy by Kaustumbh Jaiswal](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718)
5.  [spaCy pipelines for pre-trained BERT, XLNet and GPT-2 (Use PyTorch-based transformers from within spaCy)](https://github.com/explosion/spacy-transformers)