# ISSS609 : Text Analytics and Applications

## Lab 2: Natural Language Toolkit (NLTK)

### Objectives 

-   To be able to load your own text collections using NLTK.

-   To be able to perform basic text preprocessing operations (tokenization, stemming and stop word removal)
    using Python and NLTK.

### Background
<p>Many modern text analytics tools automate the pre-processing of text documents. In order to understand and customise modern tools, it is important to recognise common pre-processing steps.</p>
<p>The tools afforded by NLTK are our entry point.</p>
<p>NLTK is a library already included with the default Anaconda installation. If for some reason you do not have NLTK installed, you can refer to <a href='https://www.nltk.org/install.html'>https://www.nltk.org/install.html</a>.</p>

## Loading Your Own Text

Although NLTK provides a number of interesting document collections, we always need to analyze our own text collections. The first part of this lab is to show you how to load your own text collections into NLTK.

We typically refer to a digitized collection of documents as a *corpus*. A corpus contains a set of documents. While many real-world documents are in Microsoft Word or PDF format, they are usually converted into plain text files before we process and analyze them. Here we assume that the corpus we plan to analyze contains only plain text files.

First, create one `.txt` file for each document in your collection. For example, if you have 100 documents, you can name them `1.txt`, `2.txt`, $\ldots$, `100.txt`. You are free to choose any document name as long as each document has a unique name. Place all the `.txt` files in a directory of your choice. Next, you would like to load these files using the NLTK package such that you can then directly work with the words inside these files. In the NLTK package, there are existing tools that can be used to read in textual files. Let us start with a simple one called <b> `PlaintextCorpusReader` </b> to deal with plain text files that do not have annotations such as HTML tags.

The following code shows how you can load plain text files using NLTK. In the example below, it is assumed that there are three files named `gates.txt`, `haze.txt` and `mrt.txt` inside the directory `data/SampleText`, where `data` should be placed in the current directory, i.e., where this Jupyter notebook is placed. (Note that the `data` directory with the text files inside should have been downloaded together with this Jupyter notebook.) If you have placed the data in a different directory, you can modify the code below to point to the directory where your files are.
We also encourage you to create different folders for different labs to avoid confusion.

For those of you new to Python, the lines starting with `#` are *comments*, which explain what the code does but are not executed.

In [1]:
# The following statement imports the NLTK package.
import nltk
# The following statement imports a class called PlaintextCorpusReader.
from nltk.corpus import PlaintextCorpusReader
 
# This variable specifies the directory where the text files are.
file_directory = 'data/SampleText'

# This variable is a regular expression that specifies the pattern of the filenames we consider.
# Here this regular expression matches only those filenames ending with '.txt'.
# The r prefix makes it a raw string, so the backslash is passed to the regex engine unchanged.
filename_pattern = r'.+\.txt'

# We can now create a PlaintextCorpusReader object with the file directory and filename pattern defined above.
my_corpus = PlaintextCorpusReader(file_directory, filename_pattern)

### Note:

- For those of you who are familiar with Python programming and would like to learn more about the code above, you can refer to the following URL for the documentation of the `PlaintextCorpusReader` class: http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader
- You can also find the source code of this class on the following page: http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html#PlaintextCorpusReader
- For Python regular expressions, you can refer to the link below: https://docs.python.org/3/library/re.html
- A recommended online resource for Python documentation is https://kite.com/python/docs
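
To see which file names a pattern such as `.+\.txt` selects, it can be tried out on its own with the `re` module. The sketch below is only an illustration: the list `candidate_names` is hypothetical, and it assumes the pattern has to match the whole file name, which is how the corpus reader is expected to apply it.

```python
import re

# Hypothetical directory listing; only the names matter for this illustration.
candidate_names = ['haze.txt', 'mrt.txt', 'notes.md', 'haze.txt.bak', '.txt']

# Same pattern as in the lab, written as a raw string.
filename_pattern = r'.+\.txt'

# Assumption: the pattern must match the entire file name.
matched = [name for name in candidate_names
           if re.fullmatch(filename_pattern, name)]

print(matched)  # ['haze.txt', 'mrt.txt']
```

Note that `.+` requires at least one character before the extension, so a file named just `.txt` is not selected.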

Now the three documents `gates.txt`, `haze.txt` and `mrt.txt` have been read into `my_corpus`. We can use the code below to verify this. Here `my_corpus.fileids()` shows the IDs of all the documents stored inside `my_corpus`. You can see that the IDs of the documents are simply the names of the files in the directory that has been loaded.

In [2]:
my_corpus.fileids()

['gates.txt', 'haze.txt', 'mrt.txt']

### How to call the functions in this class, PlaintextCorpusReader?

`my_corpus.words('haze.txt')` returns all the words inside the document `haze.txt`, which are stored into the variable `haze`. Using `haze[0:30]`, we show the first 30 words in `haze`. You can open the file `haze.txt` directly to verify that these are indeed the first 30 words in the original file.

In [3]:
haze = my_corpus.words('haze.txt')
print(haze[0:30])

['Singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', '-', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', '-', 'monsoon', 'conditions', '.', 'The', 'inter', '-', 'monsoon']


You can also use a function called `len` to find out the length, i.e., the total number of words, inside this document:

In [4]:
print(len(haze))

117


To count how often each word occurs, we can use `FreqDist()` provided by NLTK.
`FreqDist()` can be applied to any list in Python, so we can use it
on our own text as shown below.

In [5]:
fdist = nltk.FreqDist(haze)
print(fdist.most_common(10))

[('the', 11), ('and', 7), ('in', 5), ('.', 4), ('Singapore', 3), ('haze', 3), ('-', 3), ('monsoon', 3), ('season', 3), ('of', 3)]


You can see that the code above displays the 10 most frequent words
inside the document `haze.txt`.

What if you would like to get the words from *all* the files in
`my_corpus`? You can simply use `my_corpus.words()` without specifying any document ID. Give it a try. Can you find out how many words there are in total across all the files in `my_corpus`? Can you find out the 10 most frequent words?

In [7]:
# enter your code here
wholeCorpus=my_corpus.words()
print(len(wholeCorpus))
print(nltk.FreqDist(wholeCorpus).most_common(10))

613
[('.', 31), ('the', 30), (',', 28), ('and', 26), ('of', 18), ('to', 17), ('in', 11), ('Gates', 9), ('-', 9), ('Mr', 8)]


## Your Turn

The data set in the `SGNews_Apr2012` folder contains a set of Singapore news articles from April 2012. Load this document collection using NLTK. Can you find out the following information about this collection?

-   Number of documents in the collection.
-   Total number of words in the collection.
-   The 20 most frequent words in the file `14011.txt`.

### Tips:

-   To use `FreqDist`, you can either use `nltk.FreqDist()` as shown above or type `from nltk.probability import FreqDist` first and then directly use
    `FreqDist()`. This is because the `FreqDist` class is defined by the `probability` module under NLTK.

-   You may wonder how to find out the number of unique words. You can first define a `FreqDist` object from all the words and then find out the length of the
    `FreqDist` object.
    
-   You may also search online for other ways to count unique words.
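
The tip about unique words can be sketched with the standard library alone. NLTK's `FreqDist` behaves like Python's `collections.Counter`, so the example below uses `Counter` on a small hand-written token list (the list `tokens` is made up for illustration and does not come from the lab data).

```python
from collections import Counter

# Made-up tokens, standing in for the output of a corpus reader's words().
tokens = ['the', 'haze', 'and', 'the', 'rain', 'and', 'the', 'monsoon']

counts = Counter(tokens)

total_tokens = len(tokens)    # every occurrence counts
unique_tokens = len(counts)   # one entry per distinct token

print(total_tokens, unique_tokens)  # 8 5
print(counts.most_common(2))        # [('the', 3), ('and', 2)]

# Converting the list to a set gives the same number of distinct tokens.
print(len(set(tokens)) == unique_tokens)  # True
```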
    

In [8]:
# Enter your code here to answer the questions above. 

# The following statement imports a class called PlaintextCorpusReader (not needed if already imported)


# Define the directory variable 
file_directory = "data/SGNews_Apr2012/"

# Define the file patterns
filename_pattern = r".+\.txt"

# Define the corpus variable using PlaintextCorpusReader with the directory and file patterns
newsCorpus = PlaintextCorpusReader(file_directory,filename_pattern)

# Print the number of files (use fileids)
print(len(newsCorpus.fileids()))

# Print total number of words in corpus
print(len(newsCorpus.words()))

# Define a variable (file14011) with words from specific file 
file14011 = newsCorpus.words("14011.txt")

# Define the freq dist on that file words. Call it file14011Freq
file14011Freq = nltk.FreqDist(file14011)

# Print top 20 words in file14011Freq
print(file14011Freq.most_common(20))

267
111485
[(',', 27), ('.', 24), ('he', 12), (':', 10), ('to', 10), ("'", 9), ('at', 9), ('George', 9), ('"', 9), ('a', 8), ('-', 8), ('in', 8), ('and', 8), ('of', 7), ('t', 7), ('1', 6), ('IQ', 6), ('is', 6), ('the', 6), ('his', 6)]


### Tokenization

You must have noticed that by using the `PlaintextCorpusReader`, tokenization is done while loading the files, that is, the original text is split into individual words (more formally, *tokens*) and stored as a list of tokens in Python.
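
The way tokens such as `'-'`, `"'"` and `'://'` show up in the outputs above suggests that the reader splits text into runs of word characters and runs of punctuation. The sketch below imitates that behaviour with a plain regular expression. It assumes the reader's default word tokenizer follows the pattern `\w+|[^\w\s]+` (the pattern used by NLTK's `WordPunctTokenizer`); the helper `word_punct_tokenize` is a name invented here for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation and symbols).
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

def word_punct_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return re.findall(WORD_PUNCT_PATTERN, text)

print(word_punct_tokenize("the south-west monsoon"))
# ['the', 'south', '-', 'west', 'monsoon']

print(word_punct_tokenize("He's a Mensa member at 7"))
# ['He', "'", 's', 'a', 'Mensa', 'member', 'at', '7']

print(word_punct_tokenize("http://www.asiaone.com"))
# ['http', '://', 'www', '.', 'asiaone', '.', 'com']
```

This explains why hyphenated words and contractions are broken into several tokens, and why URLs end up as many small pieces.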

### Sentence Splitting 

What if you would like to split the text into sentences? For some text analysis such as extractive text summarization, it is useful to look at individual sentences. Actually this has also been done by the `PlaintextCorpusReader`. Take a look at the code below. Here `my_corpus.sents('haze.txt')` returns not the list of words inside `haze.txt` but the list of sentences inside `haze.txt`, where each sentence is itself a list of words. For example, `haze_sents[0]` is the first sentence in `haze.txt`, and this sentence is represented as a list of words, as shown below.

In [9]:
# From my_corpus, get the sentences in the document haze.txt
haze_sents = my_corpus.sents('haze.txt')

# Inspect the first sentence
try:
    print(haze_sents[0])
except LookupError:
    # if the punkt sentence tokenizer data is not downloaded, this segment will do so and then try again
    nltk.download('punkt')
    print(haze_sents[0])

# Display the number of sentences
print(len(haze_sents))

['Singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', '-', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', '-', 'monsoon', 'conditions', '.']
4


Can you find out how many sentences there are in `haze.txt` and in `mrt.txt`, respectively?
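
Because the sentences come back as a list of token lists, ordinary list operations are enough to work with them. The sketch below uses a hand-written stand-in (`sample_sents`, not taken from the lab files) with the same shape as the value returned by `sents()`.

```python
# Hand-written stand-in with the same nested shape as my_corpus.sents(...).
sample_sents = [
    ['More', 'rain', 'is', 'expected', '.'],
    ['Haze', 'should', 'ease', '.'],
]

# Number of sentences: length of the outer list.
num_sentences = len(sample_sents)

# Number of tokens in each sentence: length of each inner list.
tokens_per_sentence = [len(sent) for sent in sample_sents]

# Flattening the nested list gives back one list of tokens.
all_tokens = [tok for sent in sample_sents for tok in sent]

print(num_sentences)        # 2
print(tokens_per_sentence)  # [5, 4]
print(len(all_tokens))      # 9
```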

After loading your own collection of text, you may want to perform some simple text pre-processing steps to further clean or normalize the text so that it is easier to use the text in later analysis.

### Changing Everything to Lowercase 

In English, sentences start with capitalized words. However, for many text analysis tasks, we should not differentiate between a capitalized word and its original form. For example, the words “Yesterday” and “yesterday” in the following two sentences carry the same meaning and should be represented in the same form.

> *Yesterday I went to a party.*
>
> *My friend Sally left for New York yesterday.*

A simple solution is to change all letters into lowercase when tokenizing text. Python strings provide a method called `lower()` that does the job.

We can see below that the words “Singapore” and “The” in the original list `haze` have been changed to “singapore” and
“the” in the transformed list `haze_lower`. Here `[w.lower() for w in haze]` defines a new list by taking every word in
the list `haze` and changing it to lowercase.

In [10]:
# The words of the document haze.txt are now in the variable haze
haze = my_corpus.words('haze.txt')

# Inspect them
print(haze[0:30])

['Singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', '-', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', '-', 'monsoon', 'conditions', '.', 'The', 'inter', '-', 'monsoon']


In [11]:
# Implied for loop using list comprehension to convert words to lowercase
haze_lower = [w.lower() for w in haze]

# We should see the same words as above, but in lowercase
print(haze_lower[0:30])

['singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', '-', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', '-', 'monsoon', 'conditions', '.', 'the', 'inter', '-', 'monsoon']
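
The practical effect of lowercasing is easiest to see in the counts. The sketch below uses `collections.Counter` on a made-up token list (`tokens` is illustrative only) to show how capitalized and lowercase variants are counted separately before the conversion and together afterwards.

```python
from collections import Counter

# Made-up tokens containing both capitalized and lowercase variants.
tokens = ['The', 'haze', 'eased', '.', 'Yesterday', 'the',
          'rain', 'came', 'yesterday', '.']

before = Counter(tokens)
after = Counter(t.lower() for t in tokens)

# Before lowercasing, 'The' and 'the' are two different tokens.
print(before['The'], before['the'])      # 1 1

# After lowercasing, the variants are merged into one entry.
print(after['the'], after['yesterday'])  # 2 2
```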


### Removing Punctuation Marks 

You will also notice that after tokenization, all punctuation marks are kept inside the list of words. For many text analysis tasks such as text classification and document retrieval, punctuation marks are not useful. (For some other tasks such as information extraction, punctuation marks can be very useful.)

If we would like to remove punctuation marks and other non-word tokens such as numbers, what should we do? One solution is to first list all the special tokens, such as the comma and the period, that we want to remove, and then remove all the tokens in our document collection that are in this list. However, we may not be able to enumerate all possible tokens that need to be removed. Another solution is to simply keep tokens that contain only alphabetic characters (letters) or alphanumerical characters (letters and numbers), depending on the needs of the analysis task.

To achieve this goal, people often use something called *regular expressions*. Using regular expressions, you can define a wide range of patterns to match strings. Many programming languages have support for regular expressions. In Python, you need to import the `re` library before you can use regular expressions.

In [12]:
import re

In [13]:
haze_words_only = [w for w in haze_lower if re.search('^[a-z]+$', w)]
print(haze_words_only[0:30])   
print(len(haze_words_only))

['singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', 'monsoon', 'conditions', 'the', 'inter', 'monsoon', 'season', 'typically', 'lasts', 'from']
109


In [14]:
# see if you can achieve the same effect using built-in Python string function
haze_words_only = [w for w in haze_lower if w.isalpha()]
print(haze_words_only[0:30])
print(len(haze_words_only))

['singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', 'monsoon', 'conditions', 'the', 'inter', 'monsoon', 'season', 'typically', 'lasts', 'from']
109
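
The two filters give the same result on `haze.txt`, but they are not equivalent in general: `[a-z]` covers only the 26 unaccented lowercase letters, while `isalpha()` accepts any Unicode letter. The sketch below shows the difference on a made-up token list; the accented words are written with escape sequences so that the example does not depend on the file encoding.

```python
import re

# Made-up tokens: 'caf\u00e9' is "cafe" with an acute accent on the e,
# 'na\u00efve' is "naive" with a diaeresis on the i.
tokens = ['monsoon', 'caf\u00e9', '130', 'a1story', '-', 'na\u00efve']

# Regular expression: unaccented lowercase letters a to z only.
ascii_only = [t for t in tokens if re.search('^[a-z]+$', t)]

# Built-in string method: any Unicode letters.
any_letters = [t for t in tokens if t.isalpha()]

print(ascii_only)        # ['monsoon']
print(len(any_letters))  # 3 (the accented words are kept as well)
```

Which behaviour is preferable depends on the corpus: for text with accented names or non-English words, the regular expression silently drops tokens that `isalpha()` would keep.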


## Your Turn

Using the `SGNews_Apr2012` corpus that you have loaded and the file `14011.txt`:

1. First change every token to lowercase.
2. Use regular expressions to keep only alphabetic tokens.
3. Find out the most frequent tokens after these steps.

Compare them with the most frequent tokens you have found earlier.

In [20]:
# Enter your code here to answer the questions above.
# Use file14011 from the previous code, which was the list of words from 14011.txt

# 1. First change every token to lower case.
fileLower = [w.lower() for w in newsCorpus.words("14011.txt")]
print("Words from 14011.txt in lowercase:\n",fileLower, "\n")

# 2. Use a regular expression to keep only alphabetic tokens.
fileLowerWordsOnly = [w for w in fileLower if re.search('^[a-z]+$', w)]
print("Words from 14011.txt (no punctuation):\n", fileLowerWordsOnly, "\n")

# 3. Find out the most frequent tokens after these steps (fdist and FreqDist)
fdist1 = nltk.FreqDist(fileLowerWordsOnly)
print(fdist1.most_common(20))


Words from 14011.txt in lowercase:
 ['htmlid', ':', '14011', 'title', ':', 'he', "'", 's', 'a', 'mensa', 'member', 'at', '7', 'url', ':', 'http', '://', 'www', '.', 'asiaone', '.', 'com', '/', 'news', '/', 'latest', '%', '2bnews', '/', 'singapore', '/', 'story', '/', 'a1story20120430', '-', '342911', '.', 'html', 'numofpages', ':', '1', 'content', ':', 'this', 'primary', '1', 'boy', 'does', 'mathematics', 'problems', 'meant', 'for', 'secondary', '1', 'students', 'without', 'batting', 'an', 'eyelid', '.', 'at', 'just', 'seven', 'years', 'old', ',', 'george', 'yeo', ',', 'with', 'an', 'iq', 'of', '130', ',', 'is', 'one', 'of', 'the', 'youngest', 'members', 'in', 'mensa', 'singapore', '.', 'the', 'high', '-', 'iq', 'society', 'members', 'are', 'mostly', 'between', 'the', 'ages', 'of', '18', 'and', '35', '.', 'the', 'average', 'iq', '-', 'intelligence', 'quotient', '-', 'is', '100', '.', 'when', 'asked', 'why', 'he', 'joined', 'mensa', ',', 'george', 'just', 'blinks', 'owlishly', 'from', '

OPTIONAL: What if you want to keep all alphanumerical tokens, that is, you do not mind keeping tokens that contain digits? Try replacing `[a-z]` with `[a-z0-9]` and repeat the steps above.

But do this after you have completed the lab exercise.
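
As a small illustration of the optional variant, the sketch below applies both patterns to a few lowercase tokens of the kind seen in `14011.txt`. The token list is typed in by hand for this example.

```python
import re

# A few lowercase tokens of the kind that appear in 14011.txt.
tokens = ['iq', '130', 'a1story20120430', '2bnews', '://', '%', 'george']

# Letters only.
alpha_only = [t for t in tokens if re.search('^[a-z]+$', t)]

# Letters and digits.
alnum = [t for t in tokens if re.search('^[a-z0-9]+$', t)]

print(alpha_only)  # ['iq', 'george']
print(alnum)       # ['iq', '130', 'a1story20120430', '2bnews', 'george']
```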

### Stop Word Removal

NLTK also has a built-in stop word list for English that can come in
handy when we need to remove stop words from a text collection. The
following code shows how we remove all the stop words from the list
`haze_words_only`.

In [22]:
from nltk.corpus import stopwords

# Stop words are language-specific
try:
    stop_list = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    stop_list = stopwords.words('english')

# Remove the word w in haze_words_only if it is in the stop_list
haze_stopremoved = [w for w in haze_words_only if w not in stop_list]

# Inspect our handiwork
print(haze_stopremoved[0:30])

['singapore', 'expect', 'rain', 'less', 'haze', 'coming', 'weeks', 'south', 'west', 'monsoon', 'season', 'transitioning', 'inter', 'monsoon', 'conditions', 'inter', 'monsoon', 'season', 'typically', 'lasts', 'october', 'november', 'weather', 'period', 'characterised', 'rainfall', 'light', 'variable', 'winds', 'meteorological']


<p>You can see that stop words such as 'can' and 'in' have been removed.</p>
<p>Instead of using the built-in stop word list, you can also use your own stop word list if necessary.</p>
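
A custom stop word list can be an ordinary Python list or set. The sketch below is one possible way to extend a base list with domain-specific words; `base_stop_words` is a short hypothetical stand-in for `stopwords.words('english')`, and `domain_stop_words` is invented for illustration.

```python
# Hypothetical stand-in for the built-in list (the real list is much longer).
base_stop_words = ['the', 'in', 'can', 'and', 'with']

# Hypothetical extra words that carry little meaning in a given corpus.
domain_stop_words = ['singapore', 'said']

# A set makes the membership test fast, which matters for large corpora.
stop_set = set(base_stop_words) | set(domain_stop_words)

tokens = ['singapore', 'can', 'expect', 'more', 'rain', 'and',
          'less', 'haze', 'in', 'the', 'coming', 'weeks']

filtered = [t for t in tokens if t not in stop_set]

print(filtered)
# ['expect', 'more', 'rain', 'less', 'haze', 'coming', 'weeks']
```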

In [23]:
# haze_stopremoved is a list of words (tokens).
# to get the frequency of each token, we can use the built-in collections library
from collections import Counter

word_freq2 = Counter(haze_stopremoved)

print(word_freq2.most_common(10))

[('singapore', 3), ('haze', 3), ('monsoon', 3), ('season', 3), ('inter', 2), ('rainfall', 2), ('expect', 1), ('rain', 1), ('less', 1), ('coming', 1)]


## Your Turn

1. Remove the stop words from the file `14011.txt` in `SGNews_Apr2012` using NLTK’s built-in stop word list.

2. Afterwards, find the 20 most frequent words in this file and see whether they give a good summary of the file.

In [25]:
# Enter your code here to answer the questions above.

# Import the stopwords
from nltk.corpus import stopwords
# Define variable for storing stopwords
try:
    stop_list = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    stop_list = stopwords.words('english')
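# Use file14011 from the previous code and remove the stopwords.
# You can store the result in file14011_StopRemove
# Note: the comparison is case-sensitive and the stop list is lowercase, so capitalized
# tokens such as 'The' stay in the result unless the tokens are lowercased first.
file14011_StopRemove = [w for w in file14011 if w not in stop_list]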
# Print the new variable with stopwords removed.
print(file14011_StopRemove)
# Check the top 20 words
# Before removing the stop words
print("20 most common words from 14011.txt (all words):\n", nltk.FreqDist(file14011).most_common(20))

# After removing them
fdist2 = nltk.FreqDist(file14011_StopRemove)

print("20 most common words from 14011.txt (with no stop words):\n", fdist2.most_common(20))

['htmlID', ':', '14011', 'Title', ':', 'He', "'", 'Mensa', 'member', '7', 'URL', ':', 'http', '://', 'www', '.', 'asiaone', '.', 'com', '/', 'News', '/', 'Latest', '%', '2BNews', '/', 'Singapore', '/', 'Story', '/', 'A1Story20120430', '-', '342911', '.', 'html', 'NumOfPages', ':', '1', 'Content', ':', 'This', 'Primary', '1', 'boy', 'mathematics', 'problems', 'meant', 'Secondary', '1', 'students', 'without', 'batting', 'eyelid', '.', 'At', 'seven', 'years', 'old', ',', 'George', 'Yeo', ',', 'IQ', '130', ',', 'one', 'youngest', 'members', 'Mensa', 'Singapore', '.', 'The', 'high', '-', 'IQ', 'society', 'members', 'mostly', 'ages', '18', '35', '.', 'The', 'average', 'IQ', '-', 'intelligence', 'quotient', '-', '100', '.', 'When', 'asked', 'joined', 'Mensa', ',', 'George', 'blinks', 'owlishly', 'behind', 'glasses', 'says', 'matter', '-', '-', 'fact', 'tone', ':', '"', 'Because', 'I', 'high', 'IQ', '."', 'His', 'IQ', 'exceeds', '98', 'per', 'cent', 'children', 'age', '.', 'George', 'joined', 

OPTIONAL: You can apply the same stop word removal to the entire collection and see whether the most frequent words summarize the collection.

### Stemming

NLTK has a built-in Porter stemmer we can use.

In [26]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer() # creates an instance of the stemmer and assigns it to a variable

In [27]:
haze_stemmed = [stemmer.stem(w) for w in haze_stopremoved]
print(haze_stemmed[0:30])

['singapor', 'expect', 'rain', 'less', 'haze', 'come', 'week', 'south', 'west', 'monsoon', 'season', 'transit', 'inter', 'monsoon', 'condit', 'inter', 'monsoon', 'season', 'typic', 'last', 'octob', 'novemb', 'weather', 'period', 'characteris', 'rainfal', 'light', 'variabl', 'wind', 'meteorolog']


<p>We can see from the code above that using the Porter stemmer, “coming” is changed to “come,” “weeks” is changed to “week,” and “transitioning” is changed to “transit.” We can also see that after stemming, some words are no longer correct. For example, “singapore” is changed to “singapor,” “conditions” is changed to “condit,” and so on.</p><p>Although for humans, these words no longer make sense, for computers, this is usually not a problem. As long as all occurrences of “singapore” are changed to “singapor” and all occurrences of “conditions” or “condition” are changed to “condit,” we can still perform many analysis tasks. For example, to search for relevant documents about “singapore,” after stemming, we just need to search for documents containing the word “singapor.”</p>
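
The search scenario described above can be sketched as follows. This is a minimal illustration, not a full retrieval system: the dictionary `documents` and the helper `search` are invented for this example, and the key point is that the query is stemmed with the same stemmer as the documents.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Invented mini-collection: document ID -> list of lowercase tokens.
documents = {
    'doc1': ['haze', 'conditions', 'in', 'singapore'],
    'doc2': ['weather', 'condition', 'improved'],
    'doc3': ['rain', 'expected'],
}

# Stem every document once and keep the set of stems per document.
stemmed_docs = {
    doc_id: {stemmer.stem(w) for w in words}
    for doc_id, words in documents.items()
}

def search(query):
    """Return the IDs of documents containing the stem of the query word."""
    query_stem = stemmer.stem(query.lower())
    return sorted(doc_id for doc_id, stems in stemmed_docs.items()
                  if query_stem in stems)

print(search('conditions'))  # ['doc1', 'doc2']
print(search('Singapore'))   # ['doc1']
```

Both “conditions” and “condition” are reduced to the same stem, so a query for either form finds both documents.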

## Your Turn

1. Perform stemming on the file `14011.txt` in `SGNews_Apr2012` using NLTK’s Porter stemmer.
2. Find the 20 most frequent (stemmed) words in this file again.
3. Are they very different from your earlier results?
4. Apply the same steps to the whole collection.

In [29]:
# Enter your code here to answer the questions above.

# Import the porter stemmer
from nltk.stem.porter import PorterStemmer
# Define variable to store the stemmer
stemmer = PorterStemmer()
# Use file14011 from the previous code and stem the words.
# You can store the result in file14011_stem

file14011_stem = [stemmer.stem(w) for w in file14011]

# Check the top 20 words
# Before stemming
print("20 most common words from 14011.txt (with no stemming):\n", nltk.FreqDist(file14011).most_common(20))

# After stemming
fdist3 = nltk.FreqDist(file14011_stem)
print("20 most common words from 14011.txt (with stemming):\n", fdist3.most_common(20))

20 most common words from 14011.txt (with no stemming):
 [(',', 27), ('.', 24), ('he', 12), (':', 10), ('to', 10), ("'", 9), ('at', 9), ('George', 9), ('"', 9), ('a', 8), ('-', 8), ('in', 8), ('and', 8), ('of', 7), ('t', 7), ('1', 6), ('IQ', 6), ('is', 6), ('the', 6), ('his', 6)]
20 most common words from 14011.txt (with stemming):
 [(',', 27), ('.', 24), ('he', 14), (':', 10), ('at', 10), ('hi', 10), ('to', 10), ("'", 9), ('a', 9), ('georg', 9), ('the', 9), ('in', 9), ('"', 9), ('-', 8), ('and', 8), ('of', 7), ('t', 7), ('1', 6), ('iq', 6), ('is', 6)]


OPTIONAL: You can apply the same stemming step to the entire collection.

### Reflective Practice:
#### No need to submit.
1. Create a Python file in Spyder or VS Code (or any other editor/IDE you like) to perform the same series of steps on a text file, e.g., `mrt.txt`.
2. Make the code reusable by keeping the file name as a variable.
3. Make it even more reusable by refactoring the code into a function.
4. Loop over the files in `SGNews_Apr2012` to perform these tasks.
5. Advanced: create appropriate visualizations from your work in the previous step; e.g., word-frequency bar charts.
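
One possible starting point for steps 2 and 3 is sketched below. The function `top_words` and its parameters are names chosen for this illustration. It takes any list of tokens and any collection of stop words, so it does not depend on where the tokens come from, and it leaves out stemming to stay short.

```python
from collections import Counter

def top_words(tokens, stop_words, n=20):
    """Lowercase tokens, keep alphabetic non-stop words, return the n most common."""
    stop_set = set(stop_words)
    cleaned = [t.lower() for t in tokens]
    cleaned = [t for t in cleaned if t.isalpha() and t not in stop_set]
    return Counter(cleaned).most_common(n)

# Made-up tokens and stop words, just to show the call.
sample_tokens = ['The', 'haze', 'and', 'the', 'rain', '.', 'Haze', 'in', '2012']
print(top_words(sample_tokens, ['the', 'and', 'in'], n=2))
# [('haze', 2), ('rain', 1)]
```

With the objects defined earlier in this lab, a call could look like `top_words(my_corpus.words('mrt.txt'), stop_list)`, and looping over `newsCorpus.fileids()` would cover step 4.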

## Gensim

<p>Gensim is another popular Python text analytics library (already installed with the default Anaconda installation) that provides built-in functions for converting documents to vectors and computing cosine similarities. Although you can always write your own code to do this, it is much easier for beginners to make use of existing libraries. It is also very common for programmers to reuse libraries developed by other programmers.</p>



In [18]:
import gensim

<p>Gensim automates common text preprocessing via the <code>simple_preprocess</code> function.</p>

Let's try it on the original `haze.txt`.

In [None]:
# my_corpus.raw('haze.txt') = the text read from haze.txt before any processing
print( gensim.utils.simple_preprocess( my_corpus.raw('haze.txt') )[:30] )

<p>Gensim also has a built-in stop word list for English that can come in handy when we need to remove stop words from a text collection.</p>
<p>Practice: compare Gensim's list of stop words with NLTK's.</p>

In [54]:
# stop_list = NLTK's stop words (from above)
stop_list_gs = gensim.parsing.preprocessing.STOPWORDS
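
For the comparison practice, set operations are a convenient tool: NLTK returns its stop words as a list and Gensim's `STOPWORDS` is a frozenset, so converting the list to a set allows intersections and differences. The sketch below uses two tiny stand-in collections (`list_a` and `set_b`, with placeholder contents) rather than the real lists.

```python
# Tiny stand-ins; the placeholder words do not reflect the real stop lists.
list_a = ['the', 'and', 'in', 'can', 'word_only_in_a']            # like stop_list
set_b = frozenset(['the', 'and', 'in', 'can', 'word_only_in_b'])  # like stop_list_gs

set_a = set(list_a)

shared = set_a & set_b   # words in both collections
only_a = set_a - set_b   # words only in the first
only_b = set_b - set_a   # words only in the second

print(len(set_a), len(set_b))  # 5 5
print(sorted(shared))          # ['and', 'can', 'in', 'the']
print(sorted(only_a))          # ['word_only_in_a']
print(sorted(only_b))          # ['word_only_in_b']
```

Replacing `list_a` with `stop_list` and `set_b` with `stop_list_gs` applies the same comparison to the two real stop word lists.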

Gensim also has a built-in Porter stemmer we can use.

In [62]:
from gensim.parsing.porter import PorterStemmer

stemmer = PorterStemmer()

# usage is the same as NLTK
# haze_stemmed = [stemmer.stem(w) for w in haze_stopremoved]
# print(haze_stemmed[0:30])