# Text Cleaning

This is a code walkthrough for self-starters covering the most common text cleaning tasks.  

I have always liked **The Adventures of Sherlock Holmes** by _Arthur Conan Doyle_. Let's download the book and save it locally:

In [1]:
url = 'http://www.gutenberg.org/ebooks/1661.txt.utf-8'
file_name = 'sherlock.txt'

In [2]:
import urllib.request
# Download the file from `url` and save it locally under `file_name`:

with urllib.request.urlopen(url) as response:
    with open(file_name, 'wb') as out_file:
        data = response.read() # a `bytes` object
        out_file.write(data)

In [3]:
!ls {*.txt}

sherlock.txt


In [4]:
!head -5 sherlock.txt

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included


The file contains header and footer information from Project Gutenberg. We are not interested in these, so we will discard the copyright and other legal notices. 

Todo: 
- Open the file and delete the header and footer information and save the file as ```sherlock_clean.txt```

I opened the text file and saw that I need to remove the first 33 lines. Let's do that using shell commands, which also work on Windows inside a Jupyter notebook, provided `sed` is installed: 

In [5]:
!sed -i 1,33d sherlock.txt

I use ```sed``` here. 

The ```-i``` flag tells ```sed``` to make the changes in place.  
```1,33d``` instructs it to delete lines 1 to 33.
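
If `sed` is not available, the same deletion can be done in plain Python. The sketch below is one possible equivalent of `sed -i 1,33d`, not something the original notebook ran. The helper name `drop_leading_lines` is made up for this walkthrough, and it reads the whole file into memory, which is fine for a single book.

```python
def drop_leading_lines(path, n_lines, encoding="utf-8"):
    """Delete the first `n_lines` lines of the file at `path`, in place."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(path, "r", encoding=encoding, newline="") as f:
        lines = f.readlines()
    with open(path, "w", encoding=encoding, newline="") as f:
        f.writelines(lines[n_lines:])

# Hypothetical usage; run this INSTEAD of the sed cell, not in addition to it:
# drop_leading_lines("sherlock.txt", 33)
```

Like `sed -i`, this overwrites the file, so running it twice removes 66 lines.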

In [6]:
!head -5 sherlock.txt

THE ADVENTURES OF SHERLOCK HOLMES

by

SIR ARTHUR CONAN DOYLE


## Data Exploration Notes


Before I move on to text cleaning for any Natural Language Processing task, I like to spend a few seconds glancing at the data itself. I noted some of the things I spotted below. Of course, a trained eye can see a lot more than I did: 

1. Dates are written in a mixed format, such as `twentieth of March, 1888`, and so are times, such as `three o'clock`
1. Text is hard-wrapped at around 70 columns, so no line is longer than roughly 70 characters
1. There are a lot of proper nouns. These include names such as `Atkinson` and `Trepoff`, in addition to locations such as `Trincomalee` and `Baker Street`
1. The index uses Roman numerals such as `I` and `IV`, not `1` or `4`
1. There is a lot of dialogue, such as "You have carte blanche.", with no narrative around it. The storytelling style switches freely between narrative and dialogue-driven passages.
1. The grammar and vocabulary are slightly unusual because of the period in which Doyle wrote.  
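
Observations like the one about line wrapping are easy to check once the text is in memory. A minimal sketch, assuming the book has been read into a string: `line_lengths` is a hypothetical helper, and the sample string is a stand-in for the real text.

```python
def line_lengths(text):
    """Return the length of every line in `text`, in order."""
    return [len(line) for line in text.splitlines()]

# Stand-in sample; with the real book, pass the full text instead.
sample = (
    "To Sherlock Holmes she is always THE woman. I have seldom heard\n"
    "him mention her under any other name."
)
lengths = line_lengths(sample)
print(f"longest line: {max(lengths)} characters")
```

If the longest line in the full text comes out well above 70, the wrapping assumption needs a second look.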

## Load Data

In [7]:
# let's load the data into RAM
with open(file_name, 'r', encoding='utf-8') as f:  # encoding='utf-8' preserves the non-ASCII characters
    text = f.read()
print(text[:5])

THE A


In [8]:
print(f'The file is loaded as datatype: {type(text)} and has {len(text)} characters in it')

The file is loaded as datatype: <class 'str'> and has 581204 characters in it


### Explore Loaded Data

In [9]:
# how many unique characters do we see? 
# For reference, ASCII has 128 characters - so a pure-ASCII text would have at most 128 unique characters
unique_chars = list(set(text))
unique_chars.sort()
print(unique_chars)
print(f'There are {len(unique_chars)} unique characters, including both ASCII and non-ASCII characters')

['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'è', 'é']
There are 85 unique characters, including both ASCII and non-ASCII characters
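
The list above shows which characters occur, but not how often. A small sketch for counting only the characters outside ASCII, using the standard library's `collections.Counter`: the helper name `non_ascii_counts` and the sample string are illustrative, not part of the original notebook.

```python
from collections import Counter

def non_ascii_counts(text):
    """Count each character in `text` whose code point is outside ASCII (> 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)

print(non_ascii_counts("a fiancé, vis-à-vis, café"))
# Counter({'é': 2, 'à': 1})
```

Called on the full book text, this would show whether the accented characters are rare enough to ignore or common enough to handle explicitly.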


Machine learning models often need the text broken into individual tokens, typically single words. This process is called:

## Tokenization 

We convert the raw text into a list of words. This preserves the original ordering of the text. 

### Split by Whitespace

In [10]:
words = text.split()
print(len(words))

107431


In [24]:
print(words[90:200])  # start with the first chapter, ignoring the index for now

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler.', 'All', 'emotions,', 'and', 'that', 'one', 'particularly,', 'were', 'abhorrent', 'to', 'his', 'cold,', 'precise', 'but', 'admirably', 'balanced', 'mind.', 'He', 'was,', 'I', 'take', 'it,', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen,', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions,', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer.', 'They', 'were', 'admirable', 'things', 'for']


In [25]:
# Let's look at another example: 
'red-headed woman on the street'.split()

['red-headed', 'woman', 'on', 'the', 'street']

Notice that `red-headed` was not split. Depending on the task, we may or may not want to keep hyphenated words together.  

*Problem:* Punctuation often stays attached to the word itself, as in `Adler.` and `emotions,`.

*Solution:* Simply extract the words and discard everything else. This means we will discard punctuation and any other non-word characters.

### Split by Word Extraction
**Introducing Regex**

In [12]:
import re
re.split(r'\W+', 'Words, words, words.')  # raw string avoids an invalid-escape warning

['Words', 'words', 'words', '']

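Regular expressions can be daunting at first, but are very powerful. The regular expression `\W+` means *a non-word character (anything other than a letter, digit or underscore) repeated one or more times*. Splitting on it therefore cuts the text at every run of punctuation or whitespace and keeps the words in between.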
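
To see what `\W` and its lowercase counterpart `\w` match, here is a small sketch using only the standard `re` module. The sample strings are illustrative.

```python
import re

sample = "Words, words, words."

# Splitting on runs of NON-word characters leaves an empty string
# after the final period.
print(re.split(r"\W+", sample))    # ['Words', 'words', 'words', '']

# Matching runs of word characters directly avoids the empty string.
print(re.findall(r"\w+", sample))  # ['Words', 'words', 'words']

# For Python 3 strings, \w is Unicode-aware, so accented letters
# count as word characters...
print(re.findall(r"\w+", "fiancé"))                  # ['fiancé']
# ...unless the re.ASCII flag restricts \w to [a-zA-Z0-9_].
print(re.findall(r"\w+", "fiancé", flags=re.ASCII))  # ['fianc']
```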

In [15]:
words_alphanumeric = re.split(r'\W+', text)

In [16]:
len(words_alphanumeric), len(words)

(109111, 107431)

In [17]:
print(words_alphanumeric[90:200])

['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', 'All', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'He', 'was', 'I', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'They', 'were', 'admirable']


Notice that `Adler` no longer has punctuation attached. This is what we wanted. Mission accomplished.  

**What was the tradeoff we made here?** To understand that, let's look at another example: 

In [26]:
words_break = re.split(r'\W+', "Isn't he coming home for dinner with the red-headed girl?")
print(words_break)

['Isn', 't', 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red', 'headed', 'girl', '']


We have split `Isn't` into `Isn` and `t`. This is a problem if you are working with, say, email or Twitter data, because such text contains many more contractions. As a minor annoyance, we also get an extra empty token at the end. 

Similarly, because we discarded punctuation, `red-headed` is broken into two words: `red` and `headed`.

We can write custom rules in our tokenization strategy to cover all these cases, or use something that has already been written for us. 
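
As an illustration of what such a custom rule could look like, the sketch below keeps apostrophes and hyphens when they sit between word characters. The pattern and the name `tokenize_words` are assumptions made for this example. It handles the two cases above, but not every case: leading apostrophes as in `'tis` are dropped.

```python
import re

# One or more word characters, optionally followed by groups of
# (apostrophe or hyphen + more word characters).
WORD_PATTERN = re.compile(r"\w+(?:['-]\w+)*")

def tokenize_words(text):
    """Return words, keeping internal apostrophes and hyphens."""
    return WORD_PATTERN.findall(text)

print(tokenize_words("Isn't he coming home for dinner with the red-headed girl?"))
# ["Isn't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl']
```

Because it uses `findall` rather than `split`, there is no empty token at the end.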

### spaCy for Tokenization

In [27]:
import spacy

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3; the model must be installed first

In [29]:
doc = nlp(text)

The call above creates a spaCy `Doc` object named `doc`. The object pre-computes a lot of linguistic features, including tokens. 

We can retrieve the tokens by iterating over the object. Below, we wrap the iterator in `list`. 

In [36]:
print(list(doc)[150:200])

[whole, of, her, sex, ., It, was, not, that, he, felt, 
, any, emotion, akin, to, love, for, Irene, Adler, ., All, emotions, ,, and, that, 
, one, particularly, ,, were, abhorrent, to, his, cold, ,, precise, but, 
, admirably, balanced, mind, ., He, was, ,, I, take, it, ,]


Conveniently, spaCy tokenizes *punctuation as well as words* and returns each as an individual token. Let's try the example that we didn't like earlier:

In [37]:
words = nlp("Isn't he coming home for dinner with the red-headed girl?")
print([token for token in words])

[Is, n't, he, coming, home, for, dinner, with, the, red, -, headed, girl, ?]


*Observations*:
- spaCy got the `Isn't` split as we wanted 
- `red-headed` was broken into 3 tokens: `red`, `-`, `headed`. Since the punctuation information isn't lost, we can restore the original `red-headed` token if we want to
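
Restoring hyphenated words can be sketched in plain Python on a list of token strings, for example `[t.text for t in doc]`. The helper `merge_hyphenated` is hypothetical and deliberately naive: it glues together anything around a `-` token, including a dash that had spaces around it in the original text. With spaCy tokens one could additionally check `token.whitespace_` to tell the two cases apart.

```python
def merge_hyphenated(tokens):
    """Rejoin 'left', '-', 'right' triples into a single 'left-right' token."""
    merged = []
    i = 0
    while i < len(tokens):
        # A hyphen with a token on both sides: attach it and its right neighbour
        # to the previously emitted token.
        if tokens[i] == "-" and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["Is", "n't", "he", "coming", "home", "for", "dinner",
          "with", "the", "red", "-", "headed", "girl", "?"]
print(merge_hyphenated(tokens))
# ['Is', "n't", 'he', 'coming', 'home', 'for', 'dinner', 'with', 'the', 'red-headed', 'girl', '?']
```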

**How does the spaCy tokenizer work?**

> First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
> 
> - **Does the substring match a tokenizer exception rule?** For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
> - **Can a prefix, suffix or infix be split off?** For example punctuation like commas, periods, hyphens or quotes.
>
> ![caption](https://spacy.io/assets/img/tokenization.svg)
> from [spaCy-101](https://spacy.io/usage/spacy-101) docs
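
The two checks described in the quote can be mimicked with a toy tokenizer. This is a simplified illustration of the idea and not spaCy's actual implementation. The exception table, the prefix and suffix sets, and the name `toy_tokenize` are all invented for this sketch, and infixes such as the hyphen in `red-headed` are not handled.

```python
# Hypothetical rule tables, far smaller than anything a real tokenizer uses.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = {'"', "(", "'"}
SUFFIXES = {'"', ")", ",", ".", "?", "!", "'"}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():          # step 1: split on whitespace
        head, tail = [], []
        while chunk:
            if chunk in EXCEPTIONS:     # check 1: exception rule
                head.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIXES:  # check 2a: split off a prefix
                head.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES: # check 2b: split off a suffix
                tail.append(chunk[-1])
                chunk = chunk[:-1]
            else:                       # nothing left to split off
                head.append(chunk)
                chunk = ""
        # Suffixes were collected from the outside in, so reverse them.
        tokens.extend(head + tail[::-1])
    return tokens

print(toy_tokenize("(don't go to the U.K., please!)"))
# ['(', 'do', "n't", 'go', 'to', 'the', 'U.K.', ',', 'please', '!', ')']
```

Note how `U.K.,` loses only its trailing comma: once the comma is split off, the remainder matches an exception and is kept whole.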

We can also use spaCy to extract one sentence at a time, instead of one word at a time. 

In [45]:
sentences = list(doc.sents)
print(sentences[13:18])

[I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes, she is always THE woman., I have seldom heard
him mention her under any other name., In his eyes she eclipses
and predominates the whole of her sex., It was not that he felt
any emotion akin to love for Irene Adler.]
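
The sentences above still contain the line breaks from the book's hard wrapping. One simple way to clean them is to collapse every run of whitespace into a single space. The helper name `unwrap` is an assumption for this sketch, and for a spaCy sentence one would pass `sent.text`.

```python
def unwrap(sentence):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(sentence.split())

print(unwrap("I have seldom heard\nhim mention her under any other name."))
# I have seldom heard him mention her under any other name.
```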
