# Dealing with Text Data Workshop

Through examples we will:

1.  Retrieve data
2.  Ethics checklist
3.  Tokenize
4.  Normalize
5.  Label data with `doccano`
6.  Convert to `spacy` format
7.  Extra:  train a model
8.  Extra:  augment data
9.  References

## Setup

In [None]:
# Ensure we are using the right pip for the Python kernel
# If not using conda, try without the {sys.prefix}/bin part
import sys
! {sys.prefix}/bin/pip install -r requirements.txt

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import string
import importlib
import spacy
import nltk

Download the language features and model for English for use with spaCy.

In [None]:
import sys
# This actually downloads en_core_web_sm; 'en' is its shortcut name
! {sys.executable} -m spacy download en

## Get data

We use a free NY Times recipe as sample data.  The NY Times holds the copyright on this content, so keep that in mind before reusing or redistributing it.

Use the `requests` library to fetch a recipe as raw HTML and `BeautifulSoup` to parse the page and reach the content of interest.

In [None]:
page = requests.get('https://cooking.nytimes.com/recipes/1018442-chicken-soup-from-scratch')
soup = BeautifulSoup(page.content, 'html.parser')
# find_all is the current name; findAll is a legacy alias
steps = soup.find_all("ol", {"class": "recipe-steps"})

print(steps)

Clean HTML tags to get raw text.

In [None]:
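def cleanhtml(raw_html):
    """Strip HTML tags from a string and return the remaining text."""
    # Non-greedy match of anything that looks like a tag, e.g. <li> or </ol>
    cleanr = re.compile('<.*?>')
    # Remove HTML tags
    cleantext = re.sub(cleanr, '', raw_html)
    # Replace newlines with spaces and trim surrounding whitespace
    cleantext = cleantext.replace('\n', ' ').strip()
    return cleantext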

cleansteps = cleanhtml(str(steps[0]))
print(cleansteps)
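
Stripping tags with a regular expression leaves HTML entities such as `&amp;` in the text. The following is a minimal sketch of one way to handle them with the standard library; the function name `clean_html_with_entities` and the sample string are made up for illustration and are not part of the workshop code. It replaces tags with a space (so adjacent list items are not glued together), decodes entities, and collapses whitespace.

```python
import html
import re


def clean_html_with_entities(raw_html):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    # Replace each tag with a space so "<li>a</li><li>b</li>" does not become "ab"
    text = re.sub(r'<.*?>', ' ', raw_html)
    # Decode entities after removing tags, so "&lt;b&gt;" is not mistaken for a tag
    text = html.unescape(text)
    # Collapse newlines and repeated spaces into single spaces
    return ' '.join(text.split())


sample = "<ol><li>Salt &amp; pepper.</li>\n<li>Don&#39;t boil.</li></ol>"
print(clean_html_with_entities(sample))
```

For anything beyond simple pages, an HTML parser is usually more robust than a regular expression.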

Save the data to disk.

In [None]:
with open('sample_data.txt', 'w') as fptr:
    fptr.write(cleansteps)

## Ethics checklist

`deon`

## Word tokenize text with spaCy

Tokenizing is breaking apart a corpus or document into units like words, n-grams or sentences (called sentence tokenization) that make sense for the NLP task at hand.
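
To make the "n-grams" unit concrete, here is a small sketch that builds word n-grams from a list of word tokens. The helper name `word_ngrams` is hypothetical and uses only the standard library; the input here is split on whitespace purely for illustration.

```python
def word_ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    # zip over n staggered views of the token list; stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


words = "simmer the soup gently".split()
print(word_ngrams(words, 2))
# [('simmer', 'the'), ('the', 'soup'), ('soup', 'gently')]
```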

Many libraries perform tokenization, including NLTK, [Gensim](https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize), and [spaCy](https://spacy.io/usage/linguistic-features#tokenization).   ML practitioners also often implement their own tokenizer function.  spaCy tokenizes intelligently: for example, the word 'U.K.' is _not_ broken apart into ['U', '.', 'K', '.'], but kept intact, as it should be in most cases.  Here we'll stick with spaCy for consistency and for these language-aware features (the tokenizer itself is rule-based, while later pipeline components use statistical models).  

In spaCy, the tokenizer, going from left to right, performs the following steps:
* Splits on whitespace
* Checks:
  - Does substring match an exception rule?
  - Can a prefix, infix or suffix be split off?
  
Here's an example of how spaCy does tokenization:

![spaCy tokenization](https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg)

We will re-tokenize later after some more preprocessing.
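
The left-to-right procedure described above can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the names `EXCEPTIONS`, `PREFIX_CHARS`, `SUFFIX_CHARS`, and `toy_tokenize` are hypothetical, the rule sets are tiny, and infix splitting is left out.

```python
# Hypothetical, deliberately tiny rule sets; spaCy's real rules are far larger.
EXCEPTIONS = {"U.K.": ["U.K."], "don't": ["do", "n't"]}
PREFIX_CHARS = "([\"'"
SUFFIX_CHARS = ")]\"'.,!?;:"


def toy_tokenize(text):
    """Split on whitespace, then peel off exceptions, prefixes, and suffixes."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk:
            if chunk in EXCEPTIONS:
                # An exception rule wins over prefix/suffix splitting
                leading.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIX_CHARS:
                leading.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS:
                trailing.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                # Nothing left to split off: keep the remainder as one token
                leading.append(chunk)
                chunk = ""
        tokens.extend(leading + trailing)
    return tokens


print(toy_tokenize("We don't live in the U.K.!"))
# ['We', 'do', "n't", 'live', 'in', 'the', 'U.K.', '!']
```

Note how `U.K.!` first loses its trailing `!`, after which the remaining `U.K.` matches an exception and is kept whole. Without that exception entry, the final period would be split off as a suffix.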

In [None]:
## If getting OSError with spacy.load('en'), try uncommenting and running the following

# importlib.reload(spacy)
# ! {sys.prefix}/bin/python -m spacy download en

In [None]:
# Read our data back in
with open('sample_data.txt', 'r') as fptr:
    article = fptr.read()

In [None]:
# Load 'en_core_web_sm' via its link 'en' (they are the same thing, i.e. 'en' is the link/shortcut name)
spacy_nlp = spacy.load('en')

In [None]:
doc = spacy_nlp(article)
tokens = [token.text for token in doc]
print(tokens)

## Normalize text

There is no single way to normalize text, and the right choices can depend on domain knowledge. Normalizing text can include the following steps.

* Convert Unicode characters to ASCII
* Make lowercase
* Remove punctuation
* Remove stop words
* Stemming
* Lemmatization
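
Lowercasing and punctuation removal need nothing beyond the standard library. The sketch below is one simple way to do both; the function name `basic_normalize` is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
import string


def basic_normalize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # str.maketrans with a third argument maps each listed character to None
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    return no_punct.split()


print(basic_normalize("Bring the Stock to a Boil, then SIMMER!"))
# ['bring', 'the', 'stock', 'to', 'a', 'boil', 'then', 'simmer']
```

Removing all punctuation also removes apostrophes and hyphens (so "don't" becomes "dont"), which may or may not be what a given task needs.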

### Unicode to ASCII

Convert Unicode to ASCII as a form of text normalization.

In [None]:
# Characters to keep: ASCII letters, a little punctuation, and digits
all_letters_numbers = string.ascii_letters + " .,;'" + "0123456789"
n_letters = len(all_letters_numbers)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters_numbers
    )

In [None]:
ascii_article = unicode_to_ascii(article)
print(ascii_article)
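
To see what this conversion does, here is a self-contained sketch that repeats the function with the same allowed character set and runs it on two made-up strings. NFD normalization splits an accented letter into a base letter plus a combining mark (category `Mn`), which is then dropped. Any character outside the allowed set is dropped as well.

```python
import string
import unicodedata

# Same allowed characters as in the cell above
ALLOWED = string.ascii_letters + " .,;'" + "0123456789"


def unicode_to_ascii(s):
    """Drop combining marks and any character outside the allowed set."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in ALLOWED
    )


print(unicode_to_ascii("Crème brûlée, 350°F"))   # Creme brulee, 350F
print(unicode_to_ascii("Add 1/2 cup (120 ml)"))  # Add 12 cup 120 ml
```

The second example shows a side effect worth keeping in mind for recipes: because `/` and parentheses are not in the allowed set, "1/2" turns into "12". Extending the allowed characters is one way to avoid that.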

### Lemmatization with spaCy

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, dictionary form or base word. [[1]](#references)  For instance, "are", "is", and "being" all become "be".

In spaCy we operate on the Document `doc` created above, which carries much more than lemmas (for example, part-of-speech tags and stop-word flags).

In [None]:
# Lemmatize unless it's a special case, e.g. '-PRON-' replacing 'it'
lemmatized_tokens = [token.lemma_ if '-' not in token.lemma_ else token.text for token in doc]
print(lemmatized_tokens)

# print([token.lemma_ for token in doc]) # to see this replacement

### Remove stop words with spaCy

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. [[1]](#references)

To do this we use the `is_stop` attribute of spaCy tokens, documented at:  https://spacy.io/api/token#attributes

In [None]:
no_stop_words = [token.text for token in doc if not token.is_stop]
print(no_stop_words)

### Stemming with NLTK

Stemming reduces a word to its stem by stripping affixes with heuristic rules; unlike a lemma, the stem need not be a dictionary word.  spaCy does not include stemmers, so we will turn to the NLTK package for this.  See the how-to at:  http://www.nltk.org/howto/stem.html
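
As a minimal sketch, NLTK's `PorterStemmer` can be applied to a few words. The word list is made up for illustration, and this stemmer needs no extra NLTK data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few cooking-related words, chosen only for illustration
words = ["running", "chopped", "simmering", "vegetables"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stemmers apply suffix-stripping rules without consulting a dictionary, so some stems will look truncated compared with the lemmas spaCy produces.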

### Put it all together

**Exercise**:  Write one function to convert to ASCII and lemmatize.

In [None]:
def normalize_text_to_tokens(text):
    pass

In [None]:
tokens = normalize_text_to_tokens(article)
print(tokens)

### Save text

In [None]:
with open('normalized_sample_data.txt', 'w') as fptr:
    fptr.write(' '.join(tokens))

## Label data with `doccano`

`doccano` is an open-source text labeling tool.  If you wish to set it up on your own, see the instructions at:  

## Convert custom data to spaCy format
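
The following is a rough sketch of such a conversion, under two assumptions that should be checked against your own setup: that the labeled data is a JSONL export in which each line has a `text` field and a `labels` (or `label`) field holding `[start, end, label]` triples, and that the target is the `(text, {"entities": [...]})` tuple format used in spaCy v2 training examples. The function name `doccano_to_spacy` and the sample record are hypothetical.

```python
import json


def doccano_to_spacy(jsonl_lines):
    """Convert JSONL records with character-offset labels to spaCy-style tuples."""
    train_data = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: spans live under "labels" or "label", depending on the export
        spans = record.get("labels", record.get("label", []))
        entities = [(start, end, label) for start, end, label in spans]
        train_data.append((record["text"], {"entities": entities}))
    return train_data


# Made-up sample line standing in for an exported file
sample_lines = [
    '{"text": "Add 2 carrots and salt.", '
    '"labels": [[6, 13, "INGREDIENT"], [18, 22, "INGREDIENT"]]}'
]
train_data = doccano_to_spacy(sample_lines)
print(train_data)
```

With a real export, the lines would come from iterating over the opened file instead of a hard-coded list.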

## Extra:  Example of training a spaCy NER model

## Extra:  Augment data

`nlpaug`

## References

1.  [NLP Pipeline series by Edward Ma](https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
2.  [NLP From Scratch: Classifying Names with a Character-Level RNN - on PyTorch Docs](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)
3.  [How does Data Noising Help to Improve your NLP Model? by Edward Ma](https://medium.com/towards-artificial-intelligence/how-does-data-noising-help-to-improve-your-nlp-model-480619f9fb10)
4.  [Custom Named Entity Recognition Using spaCy by Kaustumbh Jaiswal](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718)
5.  [spaCy pipelines for pre-trained BERT, XLNet and GPT-2 (Use PyTorch-based transformers from within SpaCy)](https://github.com/explosion/spacy-transformers)