# Cleaning Messy Text: RegEx and NLTK

## Text Analysis with NLTK

### [Centre for Data, Culture & Society](http://cdcs.ed.ac.uk)

Course Instructor: Lucy Havens

Course Dates: March-April 2022

****

### Regular Expressions (RegEx)

* **WHAT? Pattern matching strings in Python**
* **WHY? To find specific words or phrases, or variations of a particular word or phrase**
    * Once found, matches can be replaced, which makes RegEx useful for cleaning text with digitization errors. Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) technologies are imperfect, so you will find errors in digitized text corpora (unless, of course, they have been manually reviewed and corrected).
* **HOW? Combinations of special characters with a RegEx compiler**
    * In programming, a *compiler* translates code from one language into another. In a sense, RegEx is a small language that sits on top of Python: it works with Python data types and syntax, but it also has its own special characters and methods that plain Python doesn't use. In Python, `re.compile()` turns a pattern string into a pattern object that you can search with.
    
My favorite resource for practicing and testing Regular Expressions is [Pythex.org](https://pythex.org), along with the cheat sheet it provides!
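
To make the WHAT/WHY/HOW above concrete, here is a minimal sketch of compiling a pattern, finding every variation of a word, and replacing the matches. The sample sentence and the pattern are invented for illustration and are not taken from the course corpus.

```python
import re

# Hypothetical sample sentence, not taken from the course corpus
text = "The colour of the river, the color of the sand, and the colours of dusk."

# u? makes the "u" optional; s? makes a trailing "s" optional
pattern = re.compile(r"colou?rs?")

# findall() returns every non-overlapping match, in order of appearance
matches = pattern.findall(text)

# sub() returns a new string with each match replaced
standardised = pattern.sub("colour", text)

print(matches)
print(standardised)
```

The same pattern string can be pasted into Pythex to see which parts of a test string it matches.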

In [1]:
# To use Regular Expressions (RegEx)
import re

# To perform text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


On Wednesday we noticed digitization errors with the odd placement of the character `â` in the text.  Let's load that corpus and see if we can clean up some of those errors!

In [2]:
data_directory = "nls-text-gibbon"
wordlists = PlaintextCorpusReader(data_directory, r"\d.*", encoding="latin1")  # raw string so \d is passed to RegEx unchanged
corpus_tokens = wordlists.words()  # .words() is a method for tokenization
print(corpus_tokens[:20])

['R', ',', '17U', '\\(', 'o', 'First', 'journey', 'National', 'Library', 'of', 'Scotland', 'â', '\x80\x98', 'B000054136', '*', 'TIMBUCTOO', 'NIGER', 'BY', 'THE', 'SAME']


Notice two quite obvious digitization errors: `â` and `\x80\x98`.  Let's look at where these occur throughout our corpus:
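
One possible explanation worth checking, offered as an assumption rather than something confirmed for this corpus: the sequence `â` followed by `\x80\x98` is exactly what a left single quotation mark (`‘`) looks like when its UTF-8 bytes are decoded as Latin-1. The sketch below reverses that step for the two tokens shown above.

```python
# The two tokens from the output above, joined back together
garbled = "â\x80\x98"

# Re-encode as Latin-1 to recover the original bytes, then decode them as UTF-8
raw_bytes = garbled.encode("latin1")
repaired = raw_bytes.decode("utf-8")

print(raw_bytes)
print(repr(repaired))
```

If the files do turn out to be UTF-8, loading them with `encoding="utf-8"` instead of `encoding="latin1"` may be worth trying before any cleaning. The rest of this notebook keeps the Latin-1 reading so that the outputs shown still apply.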

In [17]:
lgg = nltk.Text(corpus_tokens)
print(lgg[:100])

['R', ',', '17U', '\\(', 'o', 'First', 'journey', 'National', 'Library', 'of', 'Scotland', 'â', '\x80\x98', 'B000054136', '*', 'TIMBUCTOO', 'NIGER', 'BY', 'THE', 'SAME', 'AUTHOR', 'Novels', 'forming', 'the', 'trilogy', ',', 'A', 'Scots', 'Quair', 'Part', 'I', '.', 'Sunset', 'Song', '(', '1932', ')', 'Part', 'II', '.', 'Cloud', 'Howe', '(', '1933', ')', 'Part', 'III', '.', 'Grey', 'Granite', '(', '1934', ')', 'MUNCiO', 'PARK', 'NIGER', 'THE', 'LIFE', 'OF', 'MUNGO', 'PARK', 'BY', 'LEWIS', 'GRASSIC', 'GIBBON', 'EDINBURGH', 'THE', 'PORPOISE', 'PRESS', 'FIRST', 'PUBLISHED', 'IN', '1034', 'BY', 'THE', 'PORPOISE', 'PRESS', 'I33A', 'GEORGE', 'STREET', ',', 'EDINBURGH', 'LONDON', '*.', 'FABER', 'AND', 'FABER', 'LIMITED', '24', 'RUSSELL', 'SQUARE', ',', 'W', '.', 'G', '.', 'I', 'PRINTED', 'IN', 'SCOTLAND']


In [13]:
lgg.concordance("\x80\x98")

Displaying 25 of 4920 matches:
ourney National Library of Scotland â  B000054136 * TIMBUCTOO NIGER BY THE S
h biography . A much better book is â  H . B . â  s â  Life of Mungo Par
 the room , and Mungo himÂ ¬ self . â  Youâ  re destroying the book , â 
ight of all his bio - 14 graphers , â  Ay , you , or somebody else , will on
racÂ ¬ teristic , it was national : â  You poor useless thing , do you think
surrounding hills , he went often , â  to read poetry â . This was mostly 
upon the record the affirÂ ¬ mation â  Some sparks of latent spirit which oc
 ¬ ously qualified by ( yet again ) â  the gravity and steady decorum of his
o study medicine . ! 9 3His gravity â  not altogether thrown awayâ , he wa
thematics , in both of which he was â  very apt . For the rest he no doubt p
e a letter to Anderson in Selkirk : â  I have now got upon the first step of
 just before the Worcester sailed : â  I wish you may be able to look upon t
e Great River , croco

In [15]:
lgg.concordance("â")

Displaying 25 of 20332 matches:
t journey National Library of Scotland â  B000054136 * TIMBUCTOO NIGER BY THE
ngth biography . A much better book is â  H . B . â  s â  Life of Mungo P
y . A much better book is â  H . B . â  s â  Life of Mungo Park ( 1835 ),
uch better book is â  H . B . â  s â  Life of Mungo Park ( 1835 ), sincer
 a i woman of great prudence and sense â , â  else indeed she would not hav
oman of great prudence and sense â , â  else indeed she would not have surv
 in the room , and Mungo himÂ ¬ self . â  Youâ  re destroying the book , â 
 â  Youâ  re destroying the book , â  Mungo proÂ ¬ tested . The servant t
 4 Theyâ  re only old Flavelâ  s . â  To which the dark - faced boy retor
delight of all his bio - 14 graphers , â  Ay , you , or somebody else , will 
characÂ ¬ teristic , it was national : â  You poor useless thing , do you thi
think that you will ever write books ? â  Mungoâ  s retort is not recorded 
he surro

In [18]:
lgg.concordance("Â")

Displaying 25 of 20332 matches:
t journey National Library of Scotland â  B000054136 * TIMBUCTOO NIGER BY THE
ngth biography . A much better book is â  H . B . â  s â  Life of Mungo P
y . A much better book is â  H . B . â  s â  Life of Mungo Park ( 1835 ),
uch better book is â  H . B . â  s â  Life of Mungo Park ( 1835 ), sincer
 a i woman of great prudence and sense â , â  else indeed she would not hav
oman of great prudence and sense â , â  else indeed she would not have surv
 in the room , and Mungo himÂ ¬ self . â  Youâ  re destroying the book , â 
 â  Youâ  re destroying the book , â  Mungo proÂ ¬ tested . The servant t
 4 Theyâ  re only old Flavelâ  s . â  To which the dark - faced boy retor
delight of all his bio - 14 graphers , â  Ay , you , or somebody else , will 
characÂ ¬ teristic , it was national : â  You poor useless thing , do you thi
think that you will ever write books ? â  Mungoâ  s retort is not recorded 
he surro

Searching for `Â` returns the same matches as searching for `â` because `concordance()` compares tokens case-insensitively, and `â` is the lowercase form of `Â`.

In [19]:
lgg.concordance("¬")

Displaying 25 of 4700 matches:
n in mind and disregarding it as unimÂ ¬ portant . I followed the latter course
f Africa in 1805 , by Mungo Park ; toÂ ¬ gether with other documents , official
riodâ  even if economic fact had deÂ ¬ signed a more enduring costume . The p
on for educationâ  that passion comÂ ¬ pounded of a belief that education mea
an argument . It had , and has , someÂ ¬ thing of the same quality , this Scots
â  s biographers make of her a handÂ ¬ some personage , as they do of her con
which produced their family . They reÂ ¬ garded each other , no doubt , with th
 was quiet , restrained , grave of deÂ ¬ meanour , proper and shyâ  a fit sub
alysts as yet inapparent upon that unÂ ¬ happy era . No child should be any of 
abbath , of the hearty eating of tireÂ ¬ some food , of conventionality of expr
 of outward demeanour to hide the burÂ ¬ geon of his soul beneath . He was a ha
handsome boy , as later a man , brownÂ ¬ haired , tall for his age , with finel
 of child

To remove a substring (a selection of characters in a string), we can pass an empty string (either `""` or `''`) as the second input to the `replace()` method. `replace()` works on strings, not on Text objects, and it returns a new string rather than changing the original, so remember to assign the result to a variable; otherwise your changes won't be saved!

In [26]:
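# .replace() must be used on a string object, not a Text object
lgg_str = wordlists.raw()
# The concordance view joins tokens with spaces, so the raw text may contain
# "Â¬" with no space in between; remove both forms to be safe
lgg_str = lgg_str.replace("Â ¬", "").replace("Â¬", "")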
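
As a small illustration of why the result has to be assigned to a variable, the sketch below calls `replace()` twice on a made-up snippet (hypothetical text that imitates the marker seen in the concordance). Python strings are immutable, so only the assigned call has a lasting effect.

```python
# Hypothetical snippet imitating the hyphenation marker seen in the concordance
snippet = "He was a handÂ¬some boy"

# The result of this call is discarded, so snippet itself is unchanged
snippet.replace("Â¬", "")
unchanged = snippet

# Here the new string is kept in a variable
cleaned = snippet.replace("Â¬", "")

print(unchanged)
print(cleaned)
```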

In [28]:
# .concordance() must be used on a Text object, not a string object
corpus_tokens = word_tokenize(lgg_str)
lgg = nltk.Text(corpus_tokens)
lgg.concordance("¬")

no matches


It worked!
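
A word of caution: "no matches" only tells us that no *token* is exactly `¬`. This cell tokenizes with `word_tokenize()`, which may split text differently from the corpus reader used earlier, so it is worth confirming on the string itself, for example with `"¬" in lgg_str`. The sketch below uses a made-up string and a plain whitespace split (both assumptions, standing in for the real corpus and tokenizer) to show how the two checks can disagree.

```python
# Hypothetical raw text in which the marker was NOT removed
raw_text = "grave of deÂ¬meanour, proper and shy"

# No whitespace-separated token equals "¬", because it is glued to its neighbours
tokens = raw_text.split()
token_hits = tokens.count("¬")

# Checking the raw string directly still finds the character
char_hits = raw_text.count("¬")

print(token_hits, char_hits)
```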

Let's try using RegEx to clean the text now:

In [19]:
# One or more ASCII letters immediately followed by the stray character â
sequence = "[a-zA-Z]+â"
re_pattern = re.compile(sequence)

In [20]:
digit_errors = re_pattern.findall(lgg_str)

In [21]:
len(digit_errors)

22700

In [24]:
unique_errors = list(set(digit_errors))
len(unique_errors)

4578

In [25]:
print(unique_errors[:100])

['bonesâ', 'capeâ', 'dearsâ', 'starvationâ', 'displeasedâ', 'calculationâ', 'Venetianâ', 'middleâ', 'sonâ', 'Cleghornâ', 'Nahâ', 'feetâ', 'palmsâ', 'mosquitoesâ', 'trayâ', 'Janissaryâ', 'singerâ', 'Behaimâ', 'bushâ', 'chairâ', 'townsâ', 'Betzâ', 'hillSâ', 'risingâ', 'anarchistâ', 'reposeâ', 'fightingâ', 'suicideâ', 'Natashaâ', 'filthâ', 'Whoâ', 'trackâ', 'preachedâ', 'unlessâ', 'lovelinessâ', 'Gambiaâ', 'Ledyardâ', 'Schroterâ', 'roomâ', 'moonlightâ', 'differentâ', 'Caâ', 'substancesâ', 'Bootyâ', 'sunlightâ', 'stupidityâ', 'futureâ', 'fightâ', 'cropâ', 'fellâ', 'Chilternsâ', 'fertileâ', 'Consulateâ', 'milesâ', 'shepherdâ', 'writerâ', 'expeditionsâ', 'liberalityâ', 'senseâ', 'Sempleâ', 'bandâ', 'amâ', 'Salomeâ', 'Realâ', 'enemyâ', 'attitudeâ', 'regretâ', 'Homelyâ', 'Hereâ', 'tourâ', 'belowâ', 'kindâ', 'Craigneuksâ', 'Bokharaâ', 'Seasâ', 'manyâ', 'friezeâ', 'riversâ', 'fingersâ', 'crewsâ', 'makerâ', 'hallâ', 'Yearâ', 'sixtyâ', 'landscapeâ', 'Registrarâ', 'Christâ', 'buildingâ', 'truthfull
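
The cells above only *find* the affected words. As a minimal sketch of the cleaning step, a capture group can keep the letters and drop the trailing `â`. The sample sentence is invented for illustration; on the real data the input would be `lgg_str`.

```python
import re

# Hypothetical sample; on the real data the input would be lgg_str
sample = "the bonesâ lay near the capeâ , said Mungoâ"

# The parentheses capture the letters so they can be reused in the replacement
pattern = re.compile(r"([a-zA-Z]+)â")

# With one capture group, findall() returns only the captured text
found = pattern.findall(sample)

# \1 refers to the captured letters, so only the trailing â is removed
cleaned = pattern.sub(r"\1", sample)

print(found)
print(cleaned)
```

Unlike a plain `replace("â", "")`, this pattern leaves any `â` that does not directly follow a letter untouched.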

Since the `â` character keeps appearing at the end of tokens, let's use `strip()` to remove it, for practice with that method! Note that `strip()` removes the given character from both the start and the end of a string.

In [26]:
tokens_list = list(corpus_tokens)
print(tokens_list[:10])

['R', ',', '17U', '\\', '(', 'o', 'First', 'journey', 'National', 'Library']


In [29]:
subcorpus = tokens_list[:1000]
clean_subcorpus = []
for t in subcorpus:
    new_t = t.strip('â')
    clean_subcorpus.append(new_t)
print(clean_subcorpus[:100])

['R', ',', '17U', '\\', '(', 'o', 'First', 'journey', 'National', 'Library', 'of', 'Scotland', '\x80\x98B000054136', '*', 'TIMBUCTOO', 'NIGER', 'BY', 'THE', 'SAME', 'AUTHOR', 'Novels', 'forming', 'the', 'trilogy', ',', 'A', 'Scots', 'Quair', 'Part', 'I.', 'Sunset', 'Song', '(', '1932', ')', 'Part', 'II', '.', 'Cloud', 'Howe', '(', '1933', ')', 'Part', 'III', '.', 'Grey', 'Granite', '(', '1934', ')', 'MUNCiO', 'PARK', 'NIGER', 'THE', 'LIFE', 'OF', 'MUNGO', 'PARK', 'BY', 'LEWIS', 'GRASSIC', 'GIBBON', 'EDINBURGH', 'THE', 'PORPOISE', 'PRESS', 'FIRST', 'PUBLISHED', 'IN', '1034', 'BY', 'THE', 'PORPOISE', 'PRESS', 'I33A', 'GEORGE', 'STREET', ',', 'EDINBURGH', 'LONDON', '*', '.', 'FABER', 'AND', 'FABER', 'LIMITED', '24', 'RUSSELL', 'SQUARE', ',', 'W.G', '.', 'I', 'PRINTED', 'IN', 'SCOTLAND', 'BY', 'ROBERT', 'MACLEHOSE']


Ta da!
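
For comparison, here is a short sketch of how `strip()` differs from `rstrip()`, using a few made-up tokens (hypothetical, not drawn from the corpus). It also shows one way to drop tokens that become empty after stripping, which is an optional extra step rather than part of the loop above.

```python
# Hypothetical tokens for illustration
tokens = ["Youâ", "âbook", "â", "Scotland"]

# strip() removes the character from both ends of each token
both_ends = [t.strip("â") for t in tokens]

# rstrip() removes it from the right-hand end only
right_only = [t.rstrip("â") for t in tokens]

# A token made up only of â becomes an empty string, which can be filtered out
non_empty = [t for t in both_ends if t]

print(both_ends)
print(right_only)
print(non_empty)
```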