# Dealing with Text Data Workshop

Through examples we will:

1.  Retrieve data
2.  Review an ethics checklist
3.  Tokenize
4.  Normalize
5.  Augment
    * Data noising
6.  Convert to a tensor for ML

## Get data

We use a freely accessible recipe from NY Times Cooking. The content is copyrighted by the NY Times, so please keep that in mind before reusing or redistributing it.

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import string

Use the `requests` library to fetch a recipe as raw HTML, and `BeautifulSoup` to parse the page and extract the content of interest.

In [None]:
page = requests.get('https://cooking.nytimes.com/recipes/1018442-chicken-soup-from-scratch')
soup = BeautifulSoup(page.content, 'html.parser')
steps = soup.find_all("ol", {"class": "recipe-steps"})

print(steps)

Strip the HTML tags to get plain text.

In [None]:
def cleanhtml(raw_html):
    """Strip HTML tags from a string and collapse it onto one line."""
    # Non-greedy match of anything between angle brackets
    tag_pattern = re.compile(r'<.*?>')
    cleantext = tag_pattern.sub('', raw_html)
    # Replace newlines with spaces and trim surrounding whitespace
    cleantext = cleantext.replace('\n', ' ').strip()
    # Optional: remove curly quotes and lowercase
    # cleantext = cleantext.replace('“', '').replace('”', '')
    # cleantext = cleantext.lower()
    return cleantext

cleansteps = cleanhtml(str(steps[0]))
print(cleansteps)

## Ethics checklist

`deon`

## Tokenize text

`spacy`
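
To make the idea of tokenization concrete, here is a minimal sketch that uses only the standard library as a stand-in for spaCy's tokenizer. The regular expression and the `simple_tokenize` name are assumptions made for illustration; spaCy's tokenizer handles many more cases (URLs, abbreviations, language-specific rules).

```python
import re

# Match a word (optionally with a simple contraction such as "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Don't boil the soup; simmer it gently.")
print(tokens)
# ["Don't", 'boil', 'the', 'soup', ';', 'simmer', 'it', 'gently', '.']
```

Keeping punctuation as separate tokens, rather than discarding it here, leaves the decision about removing it to the normalization step.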

## Normalize text

There is no one way to normalize text and sometimes it can involve domain knowledge. Normalizing text can include the following.

* Convert Unicode characters to ASCII
* Make lowercase
* Remove punctuation
* Remove stop words
* Stemming
* Lemmatization
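
Two of the simpler steps in the list, lowercasing and punctuation removal, can be sketched with the standard library alone. The function name is hypothetical, and the choice to delete everything in `string.punctuation` is an assumption: it also removes apostrophes and decimal points, which may not be what you want for recipe text.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def lowercase_and_strip_punctuation(text):
    """Lowercase the text, delete ASCII punctuation, and tidy whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    # Collapse the whitespace runs left behind by removed punctuation.
    return " ".join(no_punct.split())

print(lowercase_and_strip_punctuation("Add the Chicken, then simmer -- gently!"))
# add the chicken then simmer gently
```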

### Unicode to ASCII

Convert Unicode letters to ASCII letters as a form of text normalization.

In [None]:
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

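# Turn a Unicode string into plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    """Strip accents, then drop any character not in all_letters (digits included)."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        # 'Mn' marks the combining accents that NFD splits off from base letters
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )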

In [None]:
unicode_to_ascii(cleansteps)
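
One consequence of the whitelist approach is that every character outside the allowed set is dropped, and that includes digits. For recipes, where quantities and temperatures matter, a variant that keeps digits may be preferable. The sketch below is one possible adjustment; the `ALLOWED` set and the function name are assumptions, not part of the original notebook.

```python
import string
import unicodedata

# Assumption: digits and a couple of extra marks are worth keeping for recipes.
ALLOWED = set(string.ascii_letters + string.digits + " .,;'-/")

def unicode_to_ascii_keep_digits(s):
    """Strip accents and drop characters outside ALLOWED, keeping digits."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        c for c in decomposed
        if unicodedata.category(c) != "Mn" and c in ALLOWED
    )

print(unicode_to_ascii_keep_digits("Sauté 2 jalapeños at 350°F"))
# Saute 2 jalapenos at 350F
```

The degree sign has no ASCII equivalent, so it is removed rather than converted.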

### Remove stop words with spaCy

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. [[1]](#references)
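
As a rough illustration of the filtering step, the sketch below uses a deliberately tiny, hand-picked stop list and plain Python. Both the list and the function name are hypothetical; spaCy ships its own per-language list and exposes it on each token as `token.is_stop`.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "it", "for", "with"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["Add", "the", "chicken", "to", "a", "large", "pot", "of", "water"]
print(remove_stop_words(tokens))
# ['Add', 'chicken', 'large', 'pot', 'water']
```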

### Lemmatization with spaCy

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. [[1]](#references)
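
The grouping idea can be illustrated with a toy lookup table that maps inflected forms to a lemma. The table and function below are hypothetical and cover only a handful of words; in spaCy the lemma is available as `token.lemma_`, typically after loading a language pipeline that includes a lemmatizer.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "simmered": "simmer",
    "simmering": "simmer",
    "simmers": "simmer",
    "carrots": "carrot",
    "was": "be",
    "were": "be",
}

def lemmatize(tokens, table=LEMMA_TABLE):
    """Replace each token with its lemma; unknown tokens are only lowercased."""
    return [table.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Carrots", "were", "simmering"]))
# ['carrot', 'be', 'simmer']
```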

### Use SacreMoses for reversing normalization later

## Augment data

`nlpaug`
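
One simple form of data noising is word dropout: each token is removed with some probability, producing slightly different versions of the same sentence. The sketch below is a standard-library illustration of that idea, not nlpaug's API; the function name, the default probability, and the fallback for an empty result are all assumptions.

```python
import random

def word_dropout(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p."""
    if rng is None:
        rng = random.Random()
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sequence; fall back to one randomly chosen token.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept

tokens = "bring the broth to a gentle simmer".split()
rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(3):
    print(" ".join(word_dropout(tokens, p=0.3, rng=rng)))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs.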

## Convert to tensor for training a model
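
A common character-level encoding is one-hot: each character becomes a vector with a single 1 at the index of that character in a fixed alphabet. The sketch below builds that structure as a nested list using the same alphabet as the Unicode-to-ASCII step. The function names are assumptions, as is the choice of framework: the nested list could be passed to a tensor constructor such as `torch.tensor` if PyTorch is used for training.

```python
import string

ALL_LETTERS = string.ascii_letters + " .,;'"
N_LETTERS = len(ALL_LETTERS)

def letter_to_index(letter):
    """Return the position of a letter in ALL_LETTERS, or -1 if absent."""
    return ALL_LETTERS.find(letter)

def line_to_one_hot(line):
    """Return a len(line) x N_LETTERS nested list of 0.0/1.0 values."""
    rows = []
    for letter in line:
        row = [0.0] * N_LETTERS
        idx = letter_to_index(letter)
        if idx >= 0:  # characters outside the alphabet stay all-zero
            row[idx] = 1.0
        rows.append(row)
    return rows

encoded = line_to_one_hot("Soup")
print(len(encoded), len(encoded[0]))
# 4 57
```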

## References

1.  [NLP Pipeline series by Edward Ma](https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
2.  [NLP From Scratch: Classifying Names with a Character-Level RNN - on PyTorch Docs](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)
3.  [How does Data Noising Help to Improve your NLP Model? by Edward Ma](https://medium.com/towards-artificial-intelligence/how-does-data-noising-help-to-improve-your-nlp-model-480619f9fb10)