# Using Regular Expressions to Clean a Text

In this notebook we will use regular expressions to clean up a text document. The text is Alice's Adventures in Wonderland by Lewis Carroll, retrieved from Project Gutenberg (https://www.gutenberg.org/).

Regular Expressions (or Regex) are a pattern-matching syntax that is supported, with small variations, in most programming languages. A Regex combines metacharacters (such as . ^ $ ?) with literal strings to describe the text it should match. For a full list of Regex metacharacters and their associated functions, please see the Regex cheatsheet: http://www.rexegg.com/regex-quickstart.html
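
To make the difference between literal strings and metacharacters concrete, here is a minimal sketch. The sample sentences are made up for illustration and are not taken from the novel.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "Alice saw a cat. The cat saw Alice."

# A literal string matches only itself
literal_hits = re.findall(r"cat", sample)

# The dot is a metacharacter: it matches any single character
wildcard_hits = re.findall(r"c.t", "cat cot cut")

# The caret is a metacharacter: it anchors the match to the start of the string
anchored_hits = re.findall(r"^Alice", sample)

print(literal_hits)   # ['cat', 'cat']
print(wildcard_hits)  # ['cat', 'cot', 'cut']
print(anchored_hits)  # ['Alice']
```

Note that the anchored pattern finds only one "Alice", because the second occurrence is not at the start of the string.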

## Libraries and Resources used

-  Python 3
-  re

## Note:

The document has already undergone some cleaning. This involved removing the additional notes made by Gutenberg (trademarks, notes about the book, branding) from the start and end of the novel.

Written February 14, 2018

In [226]:
# Import the required library
import re

## Load and Read In the Text File

We first need to load the Alice In Wonderland text file to begin the Regex cleaning process. 

In [227]:
# Read in the text file
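# A with block closes the file automatically once it has been read
# (encoding="utf-8" is an assumption; adjust it if your copy of the file differs)
with open("Alice_in_Wonderland.txt", encoding="utf-8") as f:
    Alice = f.read()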

#Print out the first 500 characters to confirm the text has been imported
print (Alice[:500])

CHAPTER I


[Sidenote: _Down the Rabbit-Hole_]

ALICE was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a book," thought Alice,
"without pictures or conversations?"

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid) whether the pleasure of
making 


## Eliminate Whitespace

The first 500 characters show some unnecessary line breaks and spaces. This 'whitespace' can be eliminated with a couple of lines of Regex.

In [228]:
# Eliminate new line characters with re.sub
# This function works by substituting the new line character with a space
Alice = re.sub(r'\n', " ", Alice)

# Check the altered text
print (Alice[:500])

CHAPTER I   [Sidenote: _Down the Rabbit-Hole_]  ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"  So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) whether the pleasure of making 


In [229]:
# Replace every run of 2 or more whitespace characters with a single space
# A quantifier in curly brackets sets how many repetitions to match: {2,} means two or more
Alice = re.sub(r'\s{2,}', " ", Alice)

# Check the altered text
print (Alice[:500])

CHAPTER I [Sidenote: _Down the Rabbit-Hole_] ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?" So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) whether the pleasure of making a da
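
The two substitutions above can also be folded into a single pass. The sketch below is an alternative, not the method used in this notebook: it assumes a short made-up string in place of the full file, and the trailing .strip() is an extra step added here to drop whitespace at the ends.

```python
import re

# Hypothetical stand-in for the raw file contents
raw = "CHAPTER I\n\n\n[Sidenote: _Down the Rabbit-Hole_]\n\nALICE was beginning\nto get very tired  "

# \s+ matches one or more whitespace characters (spaces, tabs, newlines),
# so every run of whitespace collapses to a single space in one pass
collapsed = re.sub(r"\s+", " ", raw).strip()

print(collapsed)
# CHAPTER I [Sidenote: _Down the Rabbit-Hole_] ALICE was beginning to get very tired
```

One small difference from the two-step version: a lone tab would also become a space here, since \s+ matches single whitespace characters as well as runs.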


## Isolating Dialogue

On its own, Regex is capable of some interesting text analysis techniques. By providing the correct combination of metacharacters and literal strings, we can pull certain parts out of a text. We do this with the re.findall function. 

In [230]:
# This may look strange: the pattern matches an opening quotation mark, then as few
# characters as possible (.*? is the non-greedy form of .*), then the closing quotation mark
# This returns the dialogue of the book as a list
AliceDialogue = re.findall(r'".*?"', Alice)

# Check the first 5 results
print (AliceDialogue[:5])

['"and what is the use of a book,"', '"without pictures or conversations?"', '"Oh dear! Oh dear! I shall be too late!"', '"ORANGE MARMALADE,"', '"Well!"']
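
The question mark in .*? matters. A short sketch on a made-up line (not a verbatim quote from the book) shows what happens with and without it:

```python
import re

# Hypothetical line with two separate pieces of dialogue
line = '"Oh dear!" said the Rabbit. "I shall be too late!"'

# Greedy: .* grabs as much as it can, running from the first quote to the last
greedy = re.findall(r'".*"', line)

# Non-greedy: .*? stops at the nearest closing quote
lazy = re.findall(r'".*?"', line)

print(greedy)  # ['"Oh dear!" said the Rabbit. "I shall be too late!"']
print(lazy)    # ['"Oh dear!"', '"I shall be too late!"']
```

The greedy version returns one long match that swallows the narration between the two quotations, which is why the non-greedy form is the better fit for isolating dialogue.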


## Removing Punctuation 

A number of problematic characters remain in the text. Additionally, we want to create a version of the text that is free of punctuation so that we may analyze it later. 

Many punctuation marks found in ordinary text are also Regex metacharacters. To match one of them literally, escape it with a backslash \ so that, for example, \. matches a period rather than any character. The backslash has a second role: placed before certain letters, it creates a shorthand Character Class such as \w (word characters), \s (whitespace) or \d (digits). A Character Class matches any one of a whole set of characters, as will be shown below. 
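
A brief sketch of both uses of the backslash follows. The sample string is invented for illustration; re.escape is a standard helper in the re library that adds the backslashes for you.

```python
import re

# Hypothetical sample containing several metacharacters
text = "Price: $4.50 (approx.) for 2 cakes"

# Escaped dot: matches only literal periods
periods = re.findall(r"\.", text)

# re.escape builds a pattern that matches the given string literally
pattern = re.escape("(approx.)")
found = re.search(pattern, text) is not None

# Shorthand character classes: \d matches digits, \w matches word characters
numbers = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

print(periods)  # ['.', '.']
print(found)    # True
print(numbers)  # ['4', '50', '2']
print(words)    # ['Price', '4', '50', 'approx', 'for', '2', 'cakes']
```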

In [231]:
# Note: We use the Alice variable that is not split into dialogue
# Remove all non-word characters
AliceClean = re.sub(r'\W+', " ", Alice)

# Check the altered text
print (AliceClean[:500])

CHAPTER I Sidenote _Down the Rabbit Hole_ ALICE was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversations So she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid whether the pleasure of making a daisy chain would b


### NOTE:
Above, you may notice that the underscore _ was not removed by the code. This is because the underscore character is considered to be a word character in Regex. To account for this, we need to add an additional line of code specifying the substitution of the underscore.

In [232]:
# Replace each underscore with a space
AliceClean = re.sub(r'_', " ", AliceClean)

# Check the altered text
print (AliceClean[:500])

CHAPTER I Sidenote  Down the Rabbit Hole  ALICE was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversations So she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid whether the pleasure of making a daisy chain would b
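
As an alternative to two separate substitutions, the underscore can be added to the pattern itself. This is a sketch on a short excerpt, not the approach taken in the cells above. Because the whole run of unwanted characters is replaced at once, it also avoids the double spaces visible in the output above.

```python
import re

# Short excerpt standing in for the full text
snippet = "CHAPTER I [Sidenote: _Down the Rabbit-Hole_] ALICE was beginning"

# [\W_]+ matches one or more characters that are either non-word characters or underscores
cleaned = re.sub(r"[\W_]+", " ", snippet).strip()

print(cleaned)
# CHAPTER I Sidenote Down the Rabbit Hole ALICE was beginning
```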


## Split on Sentences

For some analysis methods you may want a list of the text's sentences. This can be done easily enough with Regex.

In [233]:
# Note: We use the Alice variable that still contains the non-word characters
# Use re.split to split the text on common sentence ending characters
# Placing the characters in square brackets [] creates a "class"
AliceSentence = re.split(r'[!?.]', Alice)

# Check the results by printing the first 5 sentences
print(AliceSentence[:5])

['CHAPTER I [Sidenote: _Down the Rabbit-Hole_] ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations', '" So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her', ' There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, "Oh dear', ' Oh dear', ' I shall be too late']
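
The output shows that splitting on [!?.] discards the punctuation and leaves a leading space on each piece. One possible refinement, sketched here on a made-up passage, is to split on the whitespace that follows a sentence-ending character by using a lookbehind. It remains a heuristic: abbreviations and closing quotation marks after the punctuation will still trip it up.

```python
import re

# Hypothetical passage, used only for illustration
passage = "Alice was tired. Oh dear! Was it late? It was."

# (?<=[.!?]) is a lookbehind: it requires a sentence-ending character just before
# the split point but does not consume it, so the punctuation stays in the sentence
sentences = re.split(r"(?<=[.!?])\s+", passage)

print(sentences)
# ['Alice was tired.', 'Oh dear!', 'Was it late?', 'It was.']
```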


## Conclusion

As you can see, Regex is a versatile way to clean and slice text. Its strength lies in its succinct code and in a syntax that is similar across most programming languages. Regex is a good technique for cleaning up a document before a more thorough textual analysis. Don't forget to familiarize yourself with the metacharacters on the cheat sheet!