# Text Cleaning

This is a code walkthrough for self-starters on the most common text cleaning tasks.
By the end of this notebook, you will be able to: 
1. Understand tokenization and do it both manually and with spaCy
2. Understand why stop word removal and case standardization work, with spaCy examples
3. Differentiate between stemming and lemmatization, with spaCy lemmatization examples

I have always liked **The Adventures of Sherlock Holmes** by _Arthur Conan Doyle_. Let's download the book and save it locally:

In [1]:
url = 'http://www.gutenberg.org/ebooks/1661.txt.utf-8'
file_name = 'sherlock.txt'

In [2]:
import urllib.request
# Download the file from `url` and save it locally under `file_name`:

with urllib.request.urlopen(url) as response:
    with open(file_name, 'wb') as out_file:
        data = response.read() # a `bytes` object
        out_file.write(data)

In [3]:
!ls {*.txt}

sherlock.txt


In [4]:
!head -2 sherlock.txt

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle



The file contains header and footer information from Project Gutenberg. We are not interested in these, so we will discard the copyright and other legal notices. 
Todo: 
- Open the file and delete the header and footer information and save the file as ```sherlock_clean.txt```

I opened the text file and saw that I need to remove the first 33 lines. Let's do that with a shell command, which works inside a Jupyter notebook wherever `sed` is available (on Windows this requires a Unix-like toolset to be installed): 

In [5]:
!sed -i 1,33d sherlock.txt

I use the ```sed``` syntax. 

The ```-i``` flag tells ```sed``` to make the changes in place.  
```1,33d``` instructs it to delete lines 1 to 33.
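
For readers without `sed`, the same deletion can be sketched in plain Python. This is a minimal sketch and not the method used in this notebook: the helper name `drop_leading_lines` and the demo file are made up for illustration, and the demo runs on a temporary file so that `sherlock.txt` is left untouched.

```python
import os
import tempfile


def drop_leading_lines(path, n_lines, encoding="utf-8"):
    """Rewrite the file at `path` without its first `n_lines` lines."""
    with open(path, "r", encoding=encoding) as handle:
        lines = handle.readlines()
    with open(path, "w", encoding=encoding) as handle:
        handle.writelines(lines[n_lines:])


# Demonstrate on a throwaway file (hypothetical content, two header lines).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("header 1\nheader 2\nTHE ADVENTURES\nby\n")
    drop_leading_lines(demo_path, 2)
    with open(demo_path, encoding="utf-8") as handle:
        remaining = handle.read()

print(remaining)
```

Calling `drop_leading_lines('sherlock.txt', 33)` would mirror the `sed` command above, again changing the file in place.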

In [6]:
!head -5 sherlock.txt

THE ADVENTURES OF SHERLOCK HOLMES

by

SIR ARTHUR CONAN DOYLE


## What do I see? 

Before I move on to text cleaning for any Natural Language Processing task, I like to spend a few seconds taking a quick look at the data itself. Below are some of the things I spotted. Of course, a trained eye can see a lot more than I did: 

1. Dates are written in a mixed format, such as `twentieth of March, 1888`, and so are times, such as `three o'clock`
1. Text is wrapped at around 70 columns, so no line is longer than about 70 characters
1. There are a lot of proper nouns. These include names such as `Atkinson` and `Trepoff`, in addition to locations such as `Trincomalee` and `Baker Street`.
1. The index uses Roman numerals such as `I` and `IV`, not `1` or `4`
1. There is a lot of dialogue, such as "You have carte blanche.", with no narrative around it. The storytelling style switches freely between narrative and dialogue-driven passages.
1. The grammar and vocabulary are slightly unusual because of the period in which Doyle wrote.  
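
Observations like the line-wrapping one can be checked with a few lines of code instead of by eye. The sketch below uses a short hand-typed excerpt as a stand-in for the book; on the real data, the same expressions could be applied to the full text once it is loaded.

```python
# A short hand-typed excerpt standing in for the full book text.
sample = (
    "To Sherlock Holmes she is always THE woman. I have seldom heard\n"
    "him mention her under any other name. In his eyes she eclipses\n"
    "and predominates the whole of her sex.\n"
)

# Length of every line, without the trailing newline characters.
line_lengths = [len(line) for line in sample.splitlines()]
longest = max(line_lengths)

print(f"{len(line_lengths)} lines, longest is {longest} characters")
```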

## Load Data

In [7]:
# Load the data into memory
with open(file_name, 'r', encoding='utf-8') as f:  # encoding='utf-8' preserves the non-ASCII characters
    text = f.read()
print(text[:5])

THE A


In [8]:
print(f'The file is loaded as datatype: {type(text)} and has {len(text)} characters in it')

The file is loaded as datatype: <class 'str'> and has 581204 characters in it


### Exploring Loaded Data

In [9]:
# How many unique characters do we see? 
# For reference, ASCII has 128 characters, so a pure-ASCII text would have at most 128 unique characters
unique_chars = list(set(text))
unique_chars.sort()
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and Unicode character')

['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
There are 85 unique characters, including both ASCII and Unicode character
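
To see which characters fall outside ASCII and how often they occur, a `Counter` is handy. This is a small sketch on a made-up sample sentence, not on the book itself; the same two lines would work on `text`.

```python
from collections import Counter

# Made-up sample sentence with a few accented characters.
sample = "He had carte blanche at the café; his protégé said voilà."

char_counts = Counter(sample)

# Code points above 127 are outside the ASCII range.
non_ascii = {ch: n for ch, n in char_counts.items() if ord(ch) > 127}

print(sorted(non_ascii.items()))
```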


Machine learning models often need the text as individual tokens, such as single words. The process of producing them is called:

## Tokenization 

We convert the raw text into a list of words. This preserves the original ordering of the text. 

### Split by Whitespace

In [10]:
words = text.split()
print(len(words))

107431


In [11]:
print(words[90:200])  # start with the first chapter, ignoring the index for now

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler.', 'All', 'emotions,', 'and', 'that', 'one', 'particularly,', 'were', 'abhorrent', 'to', 'his', 'cold,', 'precise', 'but', 'admirably', 'balanced', 'mind.', 'He', 'was,', 'I', 'take', 'it,', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen,', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions,', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer.', 'They', 'were', 'admirable', 'things', 'for']


In [12]:
# Let's look at another example: 
'red-headed woman on the street'.split()

['red-headed', 'woman', 'on', 'the', 'street']

Notice how `red-headed` was not split. Depending on the task, we may or may not want to keep such hyphenated words together.  

*Problem:* Punctuation often stays attached to the word itself, as in `Adler.` and `emotions,`.

*Solution:* Simply extract the words and discard everything else. This means we will discard punctuation and any other non-word characters.

### Split by Word Extraction
**Introducing Regex**

In [13]:
import re
re.split(r'\W+', 'Words, words, words.')  # raw string avoids an invalid-escape warning

['Words', 'words', 'words', '']

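Regular expressions can be daunting at first, but they are very powerful. The regular expression `\W+` means *a non-word character (anything other than letters, digits and the underscore) repeated one or more times*. Splitting on it keeps only the runs of word characters in between.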
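
The difference between `\W` (non-word) and `\w` (word) is easy to mix up, so here is a short side-by-side sketch. It also shows that matching words with `re.findall` avoids the empty string that `re.split` leaves behind when the text ends in punctuation.

```python
import re

sample = "Words, words, words."

# Splitting on runs of non-word characters leaves an empty string
# wherever the text starts or ends with punctuation.
split_tokens = re.split(r"\W+", sample)

# Matching runs of word characters directly avoids the empty string.
found_tokens = re.findall(r"\w+", sample)

print(split_tokens)  # ['Words', 'words', 'words', '']
print(found_tokens)  # ['Words', 'words', 'words']

# For str patterns in Python 3, \w is Unicode-aware, so accented
# letters stay inside their word.
accented_tokens = re.findall(r"\w+", "carte blanche, café")
print(accented_tokens)  # ['carte', 'blanche', 'café']
```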

In [14]:
words_alphanumeric = re.split(r'\W+', text)

In [15]:
len(words_alphanumeric), len(words)

(109111, 107431)

In [16]:
print(words_alphanumeric[90:200])

['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', 'All', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'He', 'was', 'I', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'They', 'were', 'admirable']


We notice how `Adler` no longer has the punctuation with her. This is what we wanted. Mission Accomplished.  

**What was the tradeoff we made here?** To understand that, let's look at another example: 

In [17]:
words_break = re.split(r'\W+', "Isn't he coming home for dinner with the red-headed girl?")
print(words_break)

['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl', '']


We have split `Isn't` into `Isn` and `t`. This is not good if you are working with, say, email or Twitter data, because you would have many more such contractions. As a minor annoyance, we also get an extra empty token at the end. 

Similarly, because we discarded punctuation, `red-headed` is broken into two words: `red` and `headed`.

We can write custom rules in our tokenization strategy to cover all these cases, or use something that has already been written for us. 
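
As an illustration of what one such custom rule could look like, the sketch below keeps contractions and hyphenated words together. The pattern and the function name `tokenize_keep_joined` are my own assumptions for this example; the rule only covers the straight apostrophe `'` and the plain hyphen `-`, and it still drops all other punctuation.

```python
import re

# One or more word characters, optionally followed by groups that start
# with a single hyphen or apostrophe and continue with word characters.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize_keep_joined(text):
    """Return word tokens, keeping contractions and hyphenated words whole."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_keep_joined(
    "Isn't he coming home for dinner with the red-headed girl?"
)
print(tokens)
# ["Isn't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl']
```

Every new edge case (curly apostrophes, abbreviations, URLs) needs another rule, which is the main argument for reaching for an existing tokenizer.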

### spaCy for Tokenization

In [18]:
%%time
import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; newer releases need a full model name such as 'en_core_web_sm'

Wall time: 2.46 s


In [19]:
doc = nlp(text)

The above syntax creates a spaCy object `doc`. The object pre-computes a lot of linguistic features, including tokens. 

We can retrieve them by iterating over the object. Below, we iterate over it and `list` the result. 

In [20]:
print(list(doc)[150:200])

[whole, of, her, sex, ., It, was, not, that, he, felt, 
, any, emotion, akin, to, love, for, Irene, Adler, ., All, emotions, ,, and, that, 
, one, particularly, ,, were, abhorrent, to, his, cold, ,, precise, but, 
, admirably, balanced, mind, ., He, was, ,, I, take, it, ,]


Conveniently, spaCy tokenizes *punctuation as well as words* and returns both as individual tokens. Let's try the example which we didn't like earlier:

In [21]:
words = nlp("Isn't he coming home for dinner with the red-headed girl?")
print([token for token in words])

[Is, n't, he, coming, home, for, dinner, with, the, red, -, headed, girl, ?]


*Observations*:
- spaCy got the `Isn't` split as we wanted 
- `red-headed` was broken into 3 tokens: `red`, `-`, `headed`. Since the punctuation information isn't lost, we can restore the original `red-headed` token if we want to
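
Restoring `red-headed` could be sketched as follows. The function works on plain `(text, trailing_whitespace)` pairs; with spaCy these could be built from `token.text` and `token.whitespace_`, but the pairs below are typed by hand so that the example runs without a model. The name `merge_hyphenated` is made up for this sketch.

```python
def merge_hyphenated(token_pairs):
    """Re-join tokens around a hyphen that has no whitespace on either side."""
    merged = []
    prev_space = True   # was there whitespace before the current token?
    glue_next = False   # should the current token attach to the previous one?
    for text, trailing_space in token_pairs:
        is_inner_hyphen = (
            text == "-" and bool(merged) and not prev_space and not trailing_space
        )
        if is_inner_hyphen or glue_next:
            merged[-1] += text
        else:
            merged.append(text)
        glue_next = is_inner_hyphen
        prev_space = bool(trailing_space)
    return merged


# Hand-written stand-ins for (token.text, token.whitespace_).
pairs = [("the", " "), ("red", ""), ("-", ""), ("headed", " "), ("girl", ""), ("?", "")]
print(merge_hyphenated(pairs))  # ['the', 'red-headed', 'girl', '?']
```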

**How does the spaCy tokenizer work ?**

> First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
> 
> - **Does the substring match a tokenizer exception rule?** For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
> - **Can a prefix, suffix or infix be split off?** For example punctuation like commas, periods, hyphens or quotes.
>
> ![caption](https://spacy.io/assets/img/tokenization.svg)
> from [spaCy-101](https://spacy.io/usage/spacy-101) docs
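
The two checks quoted above can be imitated with a toy tokenizer. This is a heavily simplified sketch and not spaCy's implementation: the exception table and the prefix and suffix sets are tiny hand-written assumptions, and infixes are ignored entirely.

```python
# Hand-written toy rules, assumed only for this illustration.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ("(", '"')
SUFFIXES = (")", '"', ",", ".", "?", "!")


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():          # step 1: split on whitespace
        leading, trailing = [], []
        while chunk:
            if chunk in EXCEPTIONS:     # check 1: exception rule
                leading.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIXES:  # check 2a: split off a prefix
                leading.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:  # check 2b: split off a suffix
                trailing.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                       # nothing left to split off
                leading.append(chunk)
                chunk = ""
        tokens.extend(leading + trailing)
    return tokens


print(toy_tokenize("(don't visit the U.K.)"))
# ['(', 'do', "n't", 'visit', 'the', 'U.K.', ')']
```

Note how `U.K.` survives as one token only because the exception check runs before the suffix check on each pass.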

**Sentence Tokenization**: We can also use spaCy to extract one sentence at a time, instead of one-word-at-a-time. 

In [22]:
sentences = list(doc.sents)
print(sentences[13:18])

[I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes, she is always THE woman., I have seldom heard
him mention her under any other name., In his eyes she eclipses
and predominates the whole of her sex., It was not that he felt
any emotion akin to love for Irene Adler.]
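
To appreciate what a trained sentence segmenter does, it helps to contrast it with a naive rule. The sketch below splits after every `.`, `!` or `?` followed by whitespace; the sample sentence is made up for illustration, and the rule breaks on the abbreviation `Mr.`.

```python
import re


def naive_sentences(text):
    """Split at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "I have seldom heard him mention her. Mr. Holmes was a perfect reasoning machine."
pieces = naive_sentences(sample)
print(pieces)
# ['I have seldom heard him mention her.', 'Mr.', 'Holmes was a perfect reasoning machine.']
```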


## Stop Word Removal & Case Change

These simple ideas are widespread and fairly effective for a lot of tasks. They are particularly useful in reducing the number of unique tokens in a document for your processing.  
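
To make the reduction in unique tokens concrete, here is a small sketch on a made-up sentence. The stop list is a tiny hand-picked assumption and far smaller than the one spaCy ships.

```python
import re

# Made-up sample with the same word in several casings.
sample = "The woman. THE woman! To Holmes she is always the woman, and the name stays."
tokens = re.findall(r"\w+", sample)

# A tiny hand-picked stop list, assumed for this illustration only.
toy_stop_words = {"the", "to", "she", "is", "and", "always"}

vocab_raw = set(tokens)                      # unique tokens as written
vocab_lower = {t.lower() for t in tokens}    # after case standardization
vocab_clean = vocab_lower - toy_stop_words   # after stop word removal

print(len(vocab_raw), len(vocab_lower), len(vocab_clean))  # 12 10 4
```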

spaCy has already marked each token as a stop word or not and stored the result in the `is_stop` attribute of each token. This makes it very handy for text cleaning. Let's take a quick look: 

In [49]:
sentence_example = "the AI/AGI uprising cannot happen without the progress of NLP"

In [54]:
[(token, token.is_stop, token.is_punct) for token in nlp(sentence_example)]

[(the, True, False),
 (AI, False, False),
 (/, False, True),
 (AGI, False, False),
 (uprising, False, False),
 (can, True, False),
 (not, True, False),
 (happen, False, False),
 (without, True, False),
 (the, True, False),
 (progress, False, False),
 (of, True, False),
 (NLP, False, False)]

In [57]:
for token in doc[:5]:
    print(token, token.is_stop, token.is_punct)

THE False False
ADVENTURES False False
OF False False
SHERLOCK False False
HOLMES False False


Interestingly, while `the` and `of` were marked as stop words, `THE` and `OF` were not. This is not a bug; it is by design. In the spaCy version used here, the stop word check is case-sensitive, so words whose CAPS or Title Case may signal importance are not flagged automatically. 

We can instead force this behaviour by converting our original text to lower case before we pass it to spaCy. 

In [30]:
text_lower = text.lower()  # native python function
doc_lower = nlp(text_lower)

In [32]:
for token in doc_lower[:5]:
    print(token, token.is_stop)

the True
adventures False
of True
sherlock False
holmes False
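
Lower-casing the whole text throws away case information that later steps may want. An alternative, sketched below in plain Python, is to lower-case only for the comparison and keep the original tokens. The stop list and the helper name `is_stop_ignoring_case` are assumptions for this example; with spaCy tokens, the `lower_` attribute could play the role of `token.lower()`.

```python
# A tiny hand-picked stop list, assumed for this illustration only.
toy_stop_words = {"the", "of", "a", "and"}


def is_stop_ignoring_case(token, stop_words=toy_stop_words):
    """Case-insensitive membership test against a lower-cased stop list."""
    return token.lower() in stop_words


tokens = ["THE", "ADVENTURES", "OF", "SHERLOCK", "HOLMES"]
kept = [t for t in tokens if not is_stop_ignoring_case(t)]
print(kept)  # ['ADVENTURES', 'SHERLOCK', 'HOLMES']
```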


In [28]:
from spacy.lang.en.stop_words import STOP_WORDS
f'spaCy has a dictionary of {len(STOP_WORDS)} stop words'

'spaCy has a dictionary of 305 stop words'

You can also extend the stop word set on the fly for your domain. For instance, if you were using this code to process the text of an NLP book, you might want to add words like `NLP`, `Processing`, `AGI`, `Data` etc. to the stop word list. 

In [58]:
domain_stop_words = ["NLP", "Processing", "AGI"]
for word in domain_stop_words:
    STOP_WORDS.add(word)

In [59]:
[(token, token.is_stop, token.is_punct) for token in nlp(sentence_example)]

[(the, True, False),
 (AI, False, False),
 (/, False, True),
 (AGI, True, False),
 (uprising, False, False),
 (can, True, False),
 (not, True, False),
 (happen, False, False),
 (without, True, False),
 (the, True, False),
 (progress, False, False),
 (of, True, False),
 (NLP, True, False)]

Exactly as expected, `NLP` and `AGI` are now marked as stop words too. 

Let's pull the string tokens that are not stop words into a Python list or a similar data structure. 
Most NLP tasks that follow text pre-processing expect string tokens rather than spaCy token objects. 

We will remove both stop words and punctuation here for demonstration: 

In [61]:
[str(token) for token in nlp(sentence_example) if not token.is_stop and not token.is_punct]

['AI', 'uprising', 'happen', 'progress']

In [62]:
[str(token) for token in nlp(sentence_example) if not token.is_stop]

['AI', '/', 'uprising', 'happen', 'progress']

## Stemming and Lemmatization

> **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

> **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. 

> If confronted with the token `saw`, stemming might return just `s`, whereas lemmatization would attempt to return either `see` or `saw` depending on whether the use of the token was as a verb or a noun.

> - Christopher Manning et al, 2008, [IR-Book](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) 
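
The contrast in the quote can be made tangible with two toy functions. Both are deliberately crude sketches written for this notebook: `crude_stem` is not the Porter stemmer or any other published algorithm, and the lemma table only knows the handful of entries typed in below.

```python
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Chop a known suffix off the end, with no vocabulary and no context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A toy lemma table keyed by (word, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("saw", "VERB"), ("saw", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(word, pos, "stem:", crude_stem(word), "lemma:", toy_lemmatize(word, pos))
```

The stemmer gives the same answer regardless of the part of speech, while the lemma lookup changes with it.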

### spaCy for Lemmatization

**spaCy only supports lemmatization**. As discussed by a spaCy core contributor in [issue #327](https://github.com/explosion/spaCy/issues/327) on GitHub, stemmers are rarely a good idea. 

We want to treat `meeting/NOUN` as different from `meet/VERB`, a distinction that a stemmer would erase. Unlike NLTK, which was created to introduce as many NLP ideas as possible, spaCy takes an opinionated stand against stemming. 

An underscore at the end of an attribute name, as in `lemma_`, tells spaCy that we want the human-readable string. Without the underscore, `token.lemma` returns the internal hash or identifier that spaCy stores. 
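
The idea of keeping an integer ID internally and resolving it to a string on demand can be sketched with a toy string store. This is an illustration of the concept only: the class `ToyStringStore` is made up, and the hash function used here (`blake2b` truncated to 64 bits) is an assumption and not the one spaCy uses, so the IDs will not match the ones printed in this notebook.

```python
import hashlib


class ToyStringStore:
    """Map strings to 64-bit integer IDs and back again."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the string.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        return self._by_id[key]


store = ToyStringStore()
meet_id = store.add("meet")      # plays the role of token.lemma
print(meet_id, store[meet_id])   # store[...] plays the role of token.lemma_
```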

In [72]:
lemma_sentence_example = "Their Apples & Banana fruit salads are amazing. Would you like meeting me at the cafe?"
[(token, token.lemma_, token.lemma, token.pos_ ) for token in nlp(lemma_sentence_example)]

[(Their, '-PRON-', 561228191312463089, 'ADJ'),
 (Apples, 'apples', 14374618037326464786, 'PROPN'),
 (&, '&', 15473034735919704609, 'CCONJ'),
 (Banana, 'banana', 2525716904149915114, 'PROPN'),
 (fruit, 'fruit', 17674554054627885835, 'NOUN'),
 (salads, 'salad', 16382906660984395826, 'NOUN'),
 (are, 'be', 10382539506755952630, 'VERB'),
 (amazing, 'amazing', 12968186374132960503, 'ADJ'),
 (., '.', 12646065887601541794, 'PUNCT'),
 (Would, 'would', 6992604926141104606, 'VERB'),
 (you, '-PRON-', 561228191312463089, 'PRON'),
 (like, 'like', 18194338103975822726, 'VERB'),
 (meeting, 'meet', 6880656908171229526, 'VERB'),
 (me, '-PRON-', 561228191312463089, 'PRON'),
 (at, 'at', 11667289587015813222, 'ADP'),
 (the, 'the', 7425985699627899538, 'DET'),
 (cafe, 'cafe', 10569699879655997926, 'NOUN'),
 (?, '?', 8205403955989537350, 'PUNCT')]

There are quite a few things going on here. Let's discuss them: 

**-PRON-**

spaCy has a slightly annoying lemma (recall: a lemma is the output of lemmatization): `-PRON-`. It is used as the lemma for all pronouns, such as `Their`, `you`, `me` and `I`. Some other NLP tools lemmatize pronouns to a concrete form such as `I` rather than to a placeholder like `-PRON-`.
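
If the placeholder gets in the way, one possible workaround is to fall back to the lower-cased token text whenever the lemma is `-PRON-`. The sketch below works on hand-typed `(text, lemma)` pairs copied from the output above; with spaCy, they could be built from `token.text` and `token.lemma_`. The helper name `lemma_or_text` is made up for this example.

```python
# Hand-typed (token text, lemma) pairs taken from the output above.
pairs = [
    ("Would", "would"),
    ("you", "-PRON-"),
    ("like", "like"),
    ("meeting", "meet"),
    ("me", "-PRON-"),
]


def lemma_or_text(text, lemma, placeholder="-PRON-"):
    """Use the lower-cased token itself when the lemma is only a placeholder."""
    return text.lower() if lemma == placeholder else lemma


cleaned = [lemma_or_text(text, lemma) for text, lemma in pairs]
print(cleaned)  # ['would', 'you', 'like', 'meet', 'me']
```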

**(automatic) Lower casing**

While checking for stop words, spaCy did not automatically lower-case our input for the comparison. During lemmatization, on the other hand, it did: it converted `Apples` to `apples` and `Banana` to `banana`. 

**meeting to meet**

Lemmatization is aware of the linguistic role a word plays in context. `meeting` is converted to `meet` because it is used as a verb here. spaCy also exposes part-of-speech tagging and other linguistic features for us to use. We will learn how to query those soon. 

### spaCy comparison with Stanford CoreNLP and NLTK

| Feature | spaCy | NLTK | CoreNLP |
|---|---|---|---|
| Python API | Y | Y | N |
| Multi Language support | Y | Y | Y |
| Tokenization | Y | Y | Y |
| Lemmatization | Y | Y | Y |
| Stemming | N | Y | Y |
| Part-of-speech tagging | Y | Y | Y |
| Sentence segmentation | Y | Y | Y |
| Dependency parsing | Y | N | Y |
| Entity Recognition | Y | Y | Y |
| *Integrated word vectors* | Y | N | N |
| Sentiment analysis | Y | Y | Y |
| *Coreference resolution* | N | N | Y |
| *Built-in Text Classification* | Y | N | N |

- Corrected from the partially wrong and outdated version on [AnalyticsVidhya](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)