<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/08_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The importance of preprocessing



It's time to return to something we've already covered: tokenizing a text and defining what counts as a word. So far we've been doing this with the `.split()` method, which has worked relatively well for us. But there is one issue: splitting on whitespace means that punctuation is sometimes included with our words.

For example, running `.split()` on the example below will retain commas and exclamation marks as part of the words:






In [3]:
turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

turtles.split()

['teenage',
 'mutant',
 'ninja',
 'turtles,',
 'teenage',
 'mutant',
 'ninja',
 'turtles,',
 'teenage',
 'mutant',
 'ninja',
 'turtles,',
 'heroes',
 'in',
 'a',
 'halfshell,',
 'turtle',
 'power!']

Therefore, we might want to perform some operations on this text *before* we start processing it for linguistic information. These operations normalize and standardize the text so that noise is removed. This is called preprocessing. Preprocessing can take many forms: you could remove just punctuation, or convert everything to lowercase, or remove very frequent words, or remove words that are not in the dictionary, or remove words that only occur one time, and so on. Different algorithms and approaches to NLP include their own preprocessing methods and steps, which are tied to the goals of the analysis.

For now, let's focus on the issue of punctuation in the turtles text.

### Frequency and preprocessing

What will happen if we run `.split()` and create a `FreqDist` from the turtles text without any preprocessing?


Let's import the NLTK resources first...


In [8]:
# import the main nltk module
import nltk

# download the nltk.book resources
nltk.download('book')

# import the resources
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\mingb\AppData\R

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [4]:
# make a frequency distro of our turtles
tfdist = nltk.FreqDist(turtles.split())

In [5]:
# we know that the word "turtles" occurs in the song, so why don't we see it?
tfdist['turtles']

0

In [6]:
# because the comma has been saved as part of the word! ugh!
tfdist['turtles,']

3

Using `.split()` clearly needs some help and might benefit from some preprocessing.

Why is this important? Well, if we want to calculate the frequency of a word in a corpus / text *properly*, we have to make sure all words are on an even playing ground. Before we even get into punctuation, consider the following:

In [None]:
nltk.FreqDist('Victoria University of WELLINGTON is in Wellington'.split())

Although the word "Wellington" occurred twice in the string above, one version was in all capitals and one was not. The `FreqDist` function treated these as two separate words. Why? The answer reminds us about the way these strings are being compared by Python:

In [None]:
# These are two different values!
'WELLINGTON' == 'Wellington'

While we know that these are basically the same word, Python doesn't care because they are *not* the same word in terms of being 100% identical values. So, we want to consider performing some initial processing (i.e., *preprocessing*) on a text before counting the words as a means to normalize or control for these properties of words we might not care about. For example, we could solve the problem above by converting all of our words to lower case.

In [None]:
# Hey we're the same now!
'WELLINGTON'.lower() == 'Wellington'.lower()
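
Here is a minimal sketch of how lowercasing changes the counts. It uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` (which is itself a subclass of `Counter`), and the variable names are only illustrative.

```python
from collections import Counter

sentence = 'Victoria University of WELLINGTON is in Wellington'

# count the tokens exactly as written: the two spellings stay separate
raw_counts = Counter(sentence.split())

# lowercase every token first, so both spellings collapse into one entry
lowered_counts = Counter(word.lower() for word in sentence.split())

print(raw_counts['Wellington'], raw_counts['WELLINGTON'])  # 1 1
print(lowered_counts['wellington'])  # 2
```

The same generator expression can be passed to `nltk.FreqDist` in place of `Counter`.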

### Lexical diversity and preprocessing

As another example, let's consider how pre-processing influences the effects of a measure we've already explored: lexical diversity. Compare what capitalization will do to measures of lexical diversity on these two texts:

In [7]:
# create two texts that only differ based on capitalization
version1 = ['Soda', 'soda', 'Onion', 'onion']
version2 = ['soda', 'soda', 'onion', 'onion']

Create a lexical diversity function.

In [8]:
# remember how to measure ttr?
def lexical_diversity(text):
  ld = len(set(text))/len(text)
  return ld

Whether or not we preprocess leads to very different TTR values for the "same" texts.

In [9]:
lexical_diversity(version1)

1.0

In [10]:
lexical_diversity(version2)

0.5

We clearly would not want to think that `version1` is more lexically diverse than `version2`, unless we have strong reason to believe the capitalization results in a fundamentally different word.

Hence, normalization is needed to address these issues.

You might question this approach and wonder whether normalizing serves to remove important information about a text - perhaps capitalization matters? What if Soda is a proper name and soda is just the noun?

These are important things to take into consideration when doing any sort of NLP - the scope of your research questions and the nature of the linguistic features you are interested in (and how you measure them) should drive these decisions.
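
One way to keep that decision explicit is to make normalization an option of the function. The sketch below is an assumption about how this could look, not part of the original lesson; `lexical_diversity_normalized` and its `lowercase` parameter are hypothetical names.

```python
def lexical_diversity_normalized(tokens, lowercase=True):
    """Type-token ratio, optionally lowercasing the tokens first."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return len(set(tokens)) / len(tokens)

version1 = ['Soda', 'soda', 'Onion', 'onion']

# capitalization preserved: four distinct types out of four tokens
print(lexical_diversity_normalized(version1, lowercase=False))  # 1.0

# capitalization removed: two distinct types out of four tokens
print(lexical_diversity_normalized(version1))  # 0.5
```

Like the original function, this raises a `ZeroDivisionError` on an empty list.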




### Cleaning punctuation

But our problem above with `turtles` was also caused by the use of punctuation and `.split()`. What could we do? Well, we *could* remove all of the punctuation before splitting the text, and this would provide a satisfactory solution (for now).

Based on what we know now about Python, how could we remove all of the punctuation from a text? We can actually do this quite simply and quickly using a list comprehension.

We would want to set a condition that inspects each character in a string, and as long as that character is *not* a punctuation mark, keep it.

Here is some pseudocode that expresses our goal:


```
[character for character in string if character not punctuation]
```

To execute this code, we'd need to tell Python what we mean by "punctuation". One way is to define a string containing all the punctuation marks we don't want.

At the same time, we can make sure to lower case everything in the same expression.


In [11]:
# define a string containing punctuation we don't like, in this case just commas and exclamation marks
punctuation = ',!'

If you run the cell below, you will see that the punctuation has been removed, but unfortunately the output is a list of characters, not words!

In [13]:
# write a list comprehension that only keeps characters that aren't in punctuation
# read on to the next section to see how to fix this output!
[character.lower() for character in turtles if character not in punctuation]

"['t', 'e', 'e', 'n', 'a', 'g', 'e', ' ', 'm', 'u', 't', 'a', 'n', 't', ' ', 'n', 'i', 'n', 'j', 'a', ' ', 't', 'u', 'r', 't', 'l', 'e', 's', '\\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 't', 'e', 'e', 'n', 'a', 'g', 'e', ' ', 'm', 'u', 't', 'a', 'n', 't', ' ', 'n', 'i', 'n', 'j', 'a', ' ', 't', 'u', 'r', 't', 'l', 'e', 's', '\\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 't', 'e', 'e', 'n', 'a', 'g', 'e', ' ', 'm', 'u', 't', 'a', 'n', 't', ' ', 'n', 'i', 'n', 'j', 'a', ' ', 't', 'u', 'r', 't', 'l', 'e', 's', '\\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'h', 'e', 'r', 'o', 'e', 's', ' ', 'i', 'n', ' ', 'a', ' ', 'h', 'a', 'l', 'f', 's', 'h', 'e', 'l', 'l', ' ', 't', 'u', 'r', 't', 'l', 'e', ' ', 'p', 'o', 'w', 'e', 'r']"

#### `.join()`

The list comprehension has returned a list of *characters*, but we wanted to retain the whitespace and other properties of the text as a series of words. No worries, we can use the handy `.join()` string method to join a list of characters back into one string!

`.join()` is sort of the bizarre cousin of `.split()`. `.join()` is a string method, meaning you need to attach a string to the front of `.join()`. The string that you attach represents the nature of the join: the character (or characters) placed between each pair of items. Much like `.split()`, you can choose whatever you like to join stuff with.

But, if we simply wanted to glue back together a list of characters *without* making any other changes, we would then attach an empty string to `.join()`, indicated with two string delimiters: `''`, in which case we would type `''.join()`.

Then, the thing that you want to join goes inside the `()` part of `''.join()`.

```
''.join([list of characters])
```
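
A short sketch of how the joining string changes the result. The `letters` and `words` lists are made up for illustration.

```python
letters = ['t', 'u', 'r', 't', 'l', 'e']

# an empty string glues the items together with nothing in between
print(''.join(letters))   # turtle

# any other string is placed between each pair of items
print('-'.join(letters))  # t-u-r-t-l-e

# joining words with a space rebuilds a phrase that was split apart
words = 'turtle power'.split()
print(' '.join(words))    # turtle power
```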


In [15]:
# we just wrap the whole list comprehension in ''.join
remove_punctuation = ''.join([character.lower() for character in turtles if character not in punctuation])

In [19]:
# it looks different now...but it's been reformed back into what we first had without punctuation
remove_punctuation

'teenage mutant ninja turtles\n            teenage mutant ninja turtles\n            teenage mutant ninja turtles\n            heroes in a halfshell turtle power'
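
As an aside, Python strings also have a built-in way to delete a set of characters: `str.translate()` combined with `str.maketrans()`. The sketch below is an alternative to the list comprehension, not the approach this lesson uses. It assumes you want to drop *every* ASCII punctuation mark in `string.punctuation` (not just commas and exclamation marks), and it uses a shortened copy of the turtles text.

```python
import string

turtles = """teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

# build a table that maps every ASCII punctuation character to None (delete it)
delete_punctuation = str.maketrans('', '', string.punctuation)

# lowercase first, then apply the table to the whole string in one call
cleaned = turtles.lower().translate(delete_punctuation)
print(cleaned.split())
```

Because `string.punctuation` includes the apostrophe, this would also turn "can't" into "cant", so it is worth checking whether that suits your text.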

How else could we do this without using join?

One way would be to write a loop which analyses each word in a text, removing punctuation from that word, and then puts that word into a list. This is made slightly difficult because strings are `immutable`, meaning that we cannot remove or replace individual elements of a string.


In [20]:
# this returns an error because we cannot modify strings in place
'string'[0] = 'b'

TypeError: 'str' object does not support item assignment

One way to do this is to scan through each character and reconstruct the string as we go, only including characters that pass the test.

String concatenation can be used for this, which is just a fancy way of saying that you can add two strings together to make a larger string.



In [None]:
# create an output container
output = ''

# loop through each character in the whole string
for character in turtles:
  # if the character is NOT one of the unwanted punctuation marks:
  if character not in [',', '!']:
    # add the lowercased version of the character to the output string
    output = output + character.lower()

# results are identical to the ''.join() method above
output
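
The word-by-word loop described earlier can also be sketched with `.strip()`, which builds a new string rather than modifying the old one. This is one possible version, using a shortened copy of the turtles text; `cleaned_words` is an illustrative name.

```python
turtles = """teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

cleaned_words = []
for word in turtles.split():
    # .strip(',!') removes commas and exclamation marks from both ends only
    cleaned_words.append(word.strip(',!').lower())

print(cleaned_words)
```

Because `.strip()` only touches the ends of each word, punctuation inside a word (such as the apostrophe in "can't") is left alone.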

#### **using a regular expression**

Another way, and probably the more computationally efficient way to do this, is to use a regular expression to clean the string. Regular expressions are covered in a later lesson, but it is worth looking at this preview for now.

First, we need to import Python's regular expression library, `re`:

In [22]:
import re

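We can now use the `re.sub` function, which replaces every match of a pattern in a string with a replacement string. The syntax for `re.sub` is:

`re.sub(pattern, repl, string)`

So, when passing arguments by position, you first give the pattern that you want to search for, then what you would like each match to be replaced with, and then the string you want to search in. The examples in this notebook pass all three as keyword arguments (`pattern`, `repl`, `string`), which can be written in any order.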

If you say that the replacement should be an empty string, then the replacement will be nothing, meaning that you are effectively removing the pattern from the string. For example:

In [25]:
# remove all the 'a' from the string 'banana'
re.sub(pattern = 'a', string = 'banana', repl = '')

'bnn'
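
The argument order matters when `re.sub` is called without keywords: by position it is pattern, then replacement, then the string to search. A small sketch of the difference:

```python
import re

# positional order: pattern, replacement, string to search
print(re.sub('a', '', 'banana'))  # bnn

# keyword arguments can be given in any order
print(re.sub(string='banana', pattern='a', repl=''))  # bnn

# swapping the last two positional arguments searches the empty string instead,
# so nothing matches and an empty string comes back
print(repr(re.sub('a', 'banana', '')))  # ''
```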

Using this same logic, we can remove all of the punctuation from a string. Crucially, be sure to save the results as a variable, otherwise the replacements will not be saved.


In [26]:
# original string
exclamation = 'too! many! exclamation! points!'
exclamation

'too! many! exclamation! points!'

In [27]:
# substitute out the exclamation marks and make a new string
exclamation = re.sub(pattern = '!', string = exclamation, repl = '')

In [28]:
# a cleaned string
exclamation

'too many exclamation points'

Now, if we want to remove more than one punctuation mark, we can define a pattern which says "any one of these characters." To do so, write a string with square brackets and put every character you want removed inside those brackets, like this:

```
punctuation = '[,!]'
```

Then use that pattern in your `re.sub` call to replace those punctuation marks.

In [29]:
# original version of turtles
turtles

'teenage mutant ninja turtles,\n            teenage mutant ninja turtles,\n            teenage mutant ninja turtles,\n            heroes in a halfshell, turtle power!'

In [30]:
# cleaned version of turtles (not saved to a variable)
punctuation = '[,!]'
re.sub(pattern = punctuation, string = turtles, repl = '')

'teenage mutant ninja turtles\n            teenage mutant ninja turtles\n            teenage mutant ninja turtles\n            heroes in a halfshell turtle power'
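
Listing every punctuation mark by hand gets tedious. One common alternative, sketched here as a preview rather than as part of this lesson, is a *negated* character class that removes anything that is not a word character or whitespace. The example string is made up.

```python
import re

text = 'heroes in a halfshell, turtle power! (cowabunga?)'

# [^\w\s] matches any single character that is neither a word character
# (letter, digit, underscore) nor whitespace
no_punctuation = re.sub(r'[^\w\s]', '', text)
print(no_punctuation)  # heroes in a halfshell turtle power cowabunga
```

Note that this pattern keeps underscores and removes apostrophes, so "can't" would become "cant".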

### **a cleaned FreqDist**
Regardless of the method used to preprocess the text and remove punctuation, the resulting `FreqDist` will now look a bit different.

In [31]:
# create a new frequency distribution
cleaned_fdist = nltk.FreqDist(remove_punctuation.split())

In [32]:
# all the punctuation is gone, and all words are lowercased
cleaned_fdist

FreqDist({'teenage': 3, 'mutant': 3, 'ninja': 3, 'turtles': 3, 'heroes': 1, 'in': 1, 'a': 1, 'halfshell': 1, 'turtle': 1, 'power': 1})

In [33]:
# now we get proper results for turtles
cleaned_fdist['turtles']

3
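
The steps so far (lowercase, remove punctuation, split, count) can be bundled into one small function. This is a sketch under a few assumptions: `count_words` and its `punctuation` parameter are hypothetical names, and `collections.Counter` stands in for `nltk.FreqDist` so the example runs without NLTK.

```python
from collections import Counter

def count_words(text, punctuation=',!'):
    """Lowercase text, drop the given punctuation characters, and count the words."""
    cleaned = ''.join(
        character.lower() for character in text if character not in punctuation
    )
    return Counter(cleaned.split())

turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

counts = count_words(turtles)
print(counts['turtles'])   # 3
print(counts['turtles,'])  # 0
```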

## `nltk.word_tokenize()`

Now we have a better way to use `.split()`, or at least knowledge that preprocessing is a necessary step for a function like `.split()`.

However - what if we wanted to retain punctuation? Do you think it would be important to know the difference between words that come before / after punctuation? Could punctuation tell us something about the syntax of a sentence or the tone of voice of writing? These are questions without clear answers, but are worthy of consideration. Another more practical aspect of retaining punctuation is that punctuation markers could help with segmentation of strings into words and/or sentences. For this reason, we will actually stop using `.split()` as a means to create word tokens.

Instead, we can use the NLTK segmentation functions, which are improvements upon `.split()`. These functions are `nltk.word_tokenize()` and `nltk.sent_tokenize()`. They convert raw strings into word tokens or sentences, respectively. Let's just focus on word tokenization for now.

In the cells below, compare the difference between using `.split()` and `nltk.word_tokenize()`:

In [2]:
import nltk
# What is the difference between using `.split()` and `nltk.word_tokenize()`?
pretzels = 'These pretzels are making me thirsty!'

split_tokens = pretzels.split()
nltk_tokens = nltk.word_tokenize(pretzels)

print(f"Using .split(): \n{split_tokens}\n\nUsing nltk: \n{nltk_tokens}")

Using .split(): 
['These', 'pretzels', 'are', 'making', 'me', 'thirsty!']

Using nltk: 
['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']


The NLTK tokenizer has treated the punctuation as a separate word - so it is smart enough to recognise that words should be separated from punctuation. It does this using a set of additional rules as well as some splitting. This makes perfect sense for punctuation which occurs after words, such as commas, full stops, exclamation marks, and so on.

What's going on in the cell below?

In [3]:
# What is different about these tokens?
nltk.word_tokenize('I can\'t even.')

['I', 'ca', "n't", 'even', '.']

The word "can't" was split into two tokens! Why is that? Well, if we think about it, "can't" actually stands for *two* words - "can" and "not." The tokenizer has an additional set of rules to search for these contractions and split them accordingly. Using `.split()`, on the other hand, would result in "can't" being stored as a single word. Moreover, removing the punctuation *before* tokenization would turn "can't" into "cant", and then `nltk.word_tokenize()` would treat "cant" as a single word. Is this an issue? Well, considering that "cant" is its own word, separate in meaning from "can't", it certainly could be.


The point is that the order of pre-processing and normalisation steps is important, as are the different things you might want to do to a text. Many modern NLP libraries perform pre-processing automatically, and it is fundamental to understand how your data is being normalised in order to use these functions properly.

As a general rule, using `nltk.word_tokenize()` is preferred to `.split()`, because with `word_tokenize()` you retain the punctuation as separate tokens, which you can then choose to use or not use in your analysis.
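
Once punctuation marks are separate tokens, leaving them out is a simple filter. The sketch below hard-codes the two token lists shown in the outputs above so that it runs without NLTK; `is_punctuation` is a hypothetical helper, not an NLTK function.

```python
import string

def is_punctuation(token):
    """True when every character in the token is ASCII punctuation."""
    return all(character in string.punctuation for character in token)

# token lists copied from the word_tokenize() outputs shown earlier
pretzel_tokens = ['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']
cant_tokens = ['I', 'ca', "n't", 'even', '.']

# keep only the tokens that are not pure punctuation
print([token for token in pretzel_tokens if not is_punctuation(token)])
print([token for token in cant_tokens if not is_punctuation(token)])
```

The contraction piece `"n't"` survives the filter because it contains letters, which is one benefit of tokenizing first and filtering afterwards.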

## **Stopwords**

Another form of preprocessing is to remove so-called stopwords. In English, stopwords are frequently occurring function words, such as determiners, prepositions, pronouns, and so on. Contrast these words with content words, such as nouns, verbs, and adjectives, and you should begin to see the difference.

Sometimes, text analytic and NLP approaches remove stopwords. For example, stopwords are highly frequent and occur in most texts, so removing them can be helpful for frequency analyses. Other times, stopwords are removed to help with applications such as sentiment analysis. However, as NLP has advanced, the need to remove stopwords has lessened, and in fact removing stopwords can now sometimes be detrimental to text analysis.

Nonetheless, it is worthwhile to understand how to remove stopwords. The NLTK module has a built-in list of stopwords. Run the cells below to see it.



In [4]:
# Load in and inspect the stopwords resource
import nltk
nltk.download(['stopwords'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mingb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
# import the entire stopwords resource
from nltk.corpus import stopwords

# loop through all the English stopwords
[word for word in stopwords.words('english')]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Have a look through the list above - you can see that there are a lot of words and pieces of words identified as stop words. You can use this list as a check to remove stopwords via a list comprehension.

In [6]:
full_of_stopwords = """Far Out in the uncharted backwaters of the unfashionable end
of the Western Spiral arm of the galaxy lies a small unregarded yellow sun"""

# can you understand everything in the list comprehension?
[word for word in nltk.word_tokenize(full_of_stopwords) if word.lower() not in stopwords.words('english')]

['Far',
 'uncharted',
 'backwaters',
 'unfashionable',
 'end',
 'Western',
 'Spiral',
 'arm',
 'galaxy',
 'lies',
 'small',
 'unregarded',
 'yellow',
 'sun']
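
The list comprehension above calls `stopwords.words('english')` again for every token, which gets slow on longer texts. A common refinement is to build the collection once and store it as a `set`. The sketch below uses a tiny hand-written stopword set (a hypothetical subset of the NLTK list) and `.split()` so that it runs without NLTK; in the notebook you could build the set with `set(stopwords.words('english'))` instead.

```python
# hypothetical mini stopword set; the real NLTK list is much longer
stop_set = {'a', 'in', 'of', 'out', 'the'}

full_of_stopwords = """Far Out in the uncharted backwaters of the unfashionable end
of the Western Spiral arm of the galaxy lies a small unregarded yellow sun"""

# the set is built once, and each membership test is a fast lookup
kept = [word for word in full_of_stopwords.split() if word.lower() not in stop_set]
print(kept)
```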

## **Your Turn**

Spend some time becoming familiar with the differences between `.split()` and `nltk.word_tokenize()`.

As part of your comparisons, create frequency distributions based on the results of `.split()` and `nltk.word_tokenize()` for the same strings.


In [43]:
import nltk
from nltk.book import text5  # the Chat Corpus, loaded earlier in this notebook

# rebuild the text as one string, then tokenize it
myText = ' '.join(text5)
split = nltk.word_tokenize(myText)

# count each token by rescanning the whole list; simple, but slow on long texts
freq = {word: split.count(word) for word in split}

In [54]:
# count each token in a single pass through the list
freq2 = {}
for word in split:
    if word not in freq2:
        # first time we see this token
        freq2[word] = 1
    else:
        freq2[word] = freq2[word] + 1
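
The counting loop above is the same bookkeeping that `collections.Counter` (and `nltk.FreqDist`, which builds on it) does for you. A small sketch with a made-up token list rather than the Chat Corpus:

```python
from collections import Counter

tokens = ['hi', 'hey', 'hi', 'JOIN', 'hi']  # illustrative tokens only

# manual counting in a single pass, using .get() to supply a default of 0
manual = {}
for token in tokens:
    manual[token] = manual.get(token, 0) + 1

# Counter does the same bookkeeping in one call
counted = Counter(tokens)

print(manual == dict(counted))  # True
print(counted.most_common(1))   # [('hi', 3)]
```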

In [57]:
freq2

{'now': 79,
 'im': 128,
 'left': 17,
 'with': 152,
 'this': 86,
 'gay': 30,
 'name': 27,
 ':': 343,
 'P': 16,
 'PART': 1016,
 'hey': 264,
 'everyone': 63,
 'ah': 8,
 'well': 81,
 'NICK': 24,
 'U7': 119,
 'is': 372,
 'a': 568,
 '.': 1435,
 'ACTION': 346,
 'gives': 6,
 'U121': 36,
 'golf': 2,
 'clap': 3,
 ')': 938,
 'JOIN': 1021,
 'hi': 546,
 'U59': 12,
 '26': 10,
 '/': 136,
 'm': 81,
 'ky': 2,
 'women': 7,
 'that': 274,
 'are': 178,
 'nice': 52,
 'please': 21,
 'pm': 109,
 'me': 415,
 'there': 120,
 'ya': 101,
 'go': 73,
 'do': 168,
 "n't": 141,
 'fuck': 15,
 'you': 635,
 '@': 85,
 'whats': 41,
 'up': 160,
 'to': 658,
 '?': 1103,
 'i': 648,
 "'ll": 38,
 'thunder': 1,
 'your': 137,
 'ass': 15,
 'and': 335,
 'dont': 75,
 'even': 35,
 'know': 103,
 'what': 183,
 'means': 5,
 'sounds': 9,
 'painful': 2,
 'any': 123,
 'ladis': 1,
 'wan': 107,
 'na': 144,
 'chat': 142,
 '29': 5,
 'my': 242,
 'cousin': 2,
 'drew': 2,
 'messed': 2,
 'pic': 26,
 'on': 186,
 'cast': 2,
 '24': 6,
 'boo': 5,
 'sexy