# ISSS609 : Text Analytics and Applications

## Lab: Week 3

### Agenda 

- Download NLTK Data

- Install gensim

- Load text data
   - To be able to load your own text collections.
   - Explore the files in your corpus
   - Freq distribution of words

- NLP Concepts - Basics
    - To be able to perform basic text processing operations (tokenization, lemmatization, stop words removal and punctuation removal) using Python and Gensim.

- NLP Concepts - Advanced 
    - POS Tagging
    - Syntax parse tree (Visualization) 
     
- Optional Labs
    - Syntax Parse Tree Generation (Advanced)
    
- Exercise 1

## Install NLTK
### Some Basic Usage of NLTK

NLTK is a natural language toolkit for building programs in Python that work with natural language text. We will use NLTK for text pre-processing and some other tasks this term.
When you installed Anaconda, NLTK should have already been installed.
However, for the exercises below, you will need to download some data.

Before we download the data, please create a directory called __nltk_data__ and leave this directory empty.
- Windows: In 'C:\'. 
![windows](images/nltk-folder-win.png)

- Mac: In your home folder ~
![mac](images/nltk-folder.png)
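
The location of this folder can also be worked out in code. The sketch below is only a convenience and assumes the two locations named above; the helper name `nltk_data_dir` is made up for this lab and is not part of NLTK.

```python
import platform
from pathlib import Path


def nltk_data_dir(system=None):
    """Return the nltk_data folder suggested in this lab for the given OS.

    `system` defaults to the current OS name as reported by platform.system()
    ("Windows", "Darwin" for macOS, "Linux", ...).
    """
    system = system or platform.system()
    if system == "Windows":
        return Path("C:/nltk_data")
    # macOS (and Linux): a folder inside the home directory
    return Path.home() / "nltk_data"


target = nltk_data_dir()
print("Suggested folder:", target)
print("Already exists:", target.exists())

# Uncomment to create the (empty) folder if it is missing:
# target.mkdir(exist_ok=True)
```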


Now run the code below. Note that the code may take some time to run, and you'll need to wait until you no longer see __In [\*]__ on the left of the cell. The code does not generate any output.

In [1]:
import nltk

Now let us download some data using the following code. When you run `nltk.download()` below, a pop-up window appears as shown below. Select "book" and start the download, which will take some time. After the download has finished, <b>close</b> the "NLTK Downloader" window.
![mac](images/nltk-download.png)

In [2]:
# Please pay attention: a window should open after you run the following command. If you can't see it,
# it might be behind the browser window. Choose "book", check the directories and then click
# Download. Also, be patient. This is slow.

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Loading Your Own Text - SampleData

The first part of this lab is to show you how to load your own text collections by opening the file in Python and loading it into the NLTK model.

We typically refer to a digitalized collection of documents as a
*corpus*. A corpus contains a set of documents. While many real world
documents are in the format of Microsoft Word or PDF, when we process
and analyze documents, they are usually converted into plain text files.
Here we assume that the corpus we plan to analyze contains only plain
text files.

First, create one `.txt` file for each document in your collection. For
example, if you have 100 documents, you can name them `1.txt`, `2.txt`,
$\ldots$, `100.txt`. You are free to choose any document name as long as
each document has a unique name. Place all the `.txt` files in a
directory of your choice. 
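
If your documents start out as strings in Python, the step above can be sketched as follows. This is a minimal illustration using only the standard library; the helper `write_corpus`, the two sample sentences, and the temporary folder are all made up for the example.

```python
import tempfile
from pathlib import Path


def write_corpus(documents, directory):
    """Write each document to its own numbered .txt file (1.txt, 2.txt, ...)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(documents, start=1):
        path = directory / f"{number}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Two made-up documents, written to a throwaway temporary folder
documents = ["Haze levels are expected to fall.", "Train services resumed this morning."]
corpus_dir = tempfile.mkdtemp()
written = write_corpus(documents, corpus_dir)
print([p.name for p in written])
```

Numbering the files with `enumerate` is one simple way to guarantee that every document gets a unique name.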

Next, load these files into NLTK by reading them. NLTK has built-in tools that help with preprocessing the documents. Let us start with a simple corpus reader for plain text files that do not have annotations such as HTML tags.

The following code shows how you can load plain text files using NLTK. In
the example below, it is assumed that there are two files named `haze.txt` and `mrt.txt` inside the directory `data/SampleText`, where `data` should be placed in the current directory, i.e., where this Jupyter notebook is placed.
(Note that the `data` directory with the two text files inside should have been downloaded together with this Jupyter notebook.)
If you have placed the data in a different directory, you can modify the code below to correspond to the correct directory where your files are.
We also encourage you to create different folders for different labs to avoid confusion.

For those of you new to Python, the lines starting with `#` are *comments*, which explain what the code does but cannot be executed.

For documentation (e.g. on `PlaintextCorpusReader`), go to https://www.kite.com/python/docs/ and search.

## Installation of Gensim

For this lab, you will use Gensim, a Python library that provides some
built-in functions for easily converting documents to vectors and
computing cosine similarities. Although you can always write your own
code to do this, it is much easier for beginners to make use of existing
libraries. It is also very common for programmers to re-use libraries
developed by other programmers.

To install Gensim under Anaconda, open your Anaconda Prompt window and type the following command:

`conda install -c anaconda gensim`

When you're asked whether you want to proceed with the installation as shown below, please answer `y`.

`Proceed ([y]/n)? y`


The installation process may take some time so please be patient.

After Gensim is installed, try the following code to see if it can be imported. You may get a warning message that says `'warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")'`. You can ignore the warning message.

This lab is based on the tutorial - https://www.machinelearningplus.com/nlp/gensim-tutorial/

Check whether the installation is successful by importing gensim. Any warning can be ignored.
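
One way to run this check is sketched below. The helper `installed_version` is a made-up name for this lab; it simply tries the import and reports the version if the package exposes one.

```python
import importlib


def installed_version(package_name):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except Exception as err:  # broad on purpose: a broken install can fail in several ways
        print(f"Could not import {package_name}: {err}")
        return None
    return getattr(module, "__version__", "unknown")


print("gensim:", installed_version("gensim"))
```

If the output is `None`, repeat the installation step before continuing.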

Let's start loading in the files.

In [3]:
# import nltk # Already done in the first cell
from nltk.corpus import PlaintextCorpusReader

# Define the file directory
file_directory = 'data/SampleText/'

# Regular expression for the file names to load; a raw string keeps the backslash in \. as is
filename_pattern = r'.+\.txt'

my_corpus = PlaintextCorpusReader(file_directory, filename_pattern)
# print(my_corpus.fileids())

# Number of documents in the collection.
print("Number of files:", len(my_corpus.fileids()))

# Total number of words in the collection.
print("Number of words:", len(my_corpus.words()))

# Print the filenames in the corpus
print("Names of files:", my_corpus.fileids())

Number of files: 2
Number of words: 258
Names of files: ['haze.txt', 'mrt.txt']
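
To see which file names a pattern such as `.+\.txt` selects, you can try it out with Python's `re` module. This is only a sketch: it assumes that the pattern has to match the whole file name, and the candidate names in the list are made up.

```python
import re

filename_pattern = r'.+\.txt'   # one or more characters followed by a literal ".txt"

candidate_names = ['haze.txt', 'mrt.txt', 'notes.docx', '.txt', 'readme.txt.bak']
selected = [name for name in candidate_names if re.fullmatch(filename_pattern, name)]
print(selected)
```

Only names that end in `.txt` and have at least one character before the extension are selected.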


First, we can take a look at the file named `haze.txt`.

In [4]:
# Code to read a single file - Read the file that you want to look at eg haze.txt
with open(file_directory + 'haze.txt', 'r') as file_to_read:
    haze2 = file_to_read.read()
    
print("Text from Python method:\n", haze2)

# Print some words in specific file
haze = my_corpus.words('haze.txt')
print("Text from NLTK Corpus Reader:\n", haze[0:30])

Text from Python method:
 Singapore can expect more rain and less haze in the coming weeks with the south-west monsoon season transitioning into inter-monsoon conditions.

The inter-monsoon season typically lasts from October to November and the weather during the period is characterised by more rainfall and light and variable winds.

The Meteorological Service Singapore said on Monday in an advisory that this transition signals the end of traditional dry season in the region, and the likelihood of transboundary haze affecting Singapore for the rest of the year will be low.

This is because the increased rainfall will help alleviate the hotspot and haze situation in Sumatra and Kalimantan in Indonesia.

Text from NLTK Corpus Reader:
 ['Singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', '-', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', '-', 'monsoon', 'conditions', '.', 'The', 'inter', '-', 'monso

### Freq Distribution of words

NLTK provides the function `FreqDist()`, which can be applied to any list in Python. We can now use `FreqDist()` on our own text as shown below.


In [5]:
# Word counts distribution for haze.txt
fdist = nltk.FreqDist(haze)
print("Freq. Dist. from NLTK:\n", fdist.most_common(10))

Freq. Dist. from NLTK:
 [('the', 11), ('and', 7), ('in', 5), ('.', 4), ('Singapore', 3), ('haze', 3), ('-', 3), ('monsoon', 3), ('season', 3), ('of', 3)]


You can see that the code above displays the 10 most frequent words
inside the document `haze.txt`.
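
Because `FreqDist()` works on any Python list, the same calls can be tried on a small hand-written list where the counts are easy to check by eye. The token list below is made up for illustration.

```python
import nltk

tokens = ['the', 'haze', 'the', 'rain', 'the', 'haze']   # hand-written toy list
toy_fdist = nltk.FreqDist(tokens)

print(toy_fdist['the'])            # count of a single item
print(toy_fdist.most_common(2))    # the two most frequent items with their counts
print(toy_fdist.N())               # total number of items counted
```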

What if you would like to get the words from *all* the files in
`my_corpus`? You can simply use `my_corpus.words()` without specifying any document ID. Give it a try. 

#### The list above contains punctuation and stop words. Let's apply NLP concepts to process the data. We shall use the Gensim API for text processing.

## NLP Concepts - Basics

You must have noticed that by using the `PlaintextCorpusReader`,
tokenization is done while loading the files, that is, the original text
is split into individual words and stored as a list of words in Python.

We are now switching to Gensim, which requires explicit tokenization.

### Tokenise

In [6]:
import gensim
from gensim.parsing.preprocessing import remove_stopwords

# Cast this into a list as the result would be a generator object otherwise
haze2 = list(gensim.utils.tokenize(haze2))

In [7]:
print("Tokens from Gensim:\n", haze2)

Tokens from Gensim:
 ['Singapore', 'can', 'expect', 'more', 'rain', 'and', 'less', 'haze', 'in', 'the', 'coming', 'weeks', 'with', 'the', 'south', 'west', 'monsoon', 'season', 'transitioning', 'into', 'inter', 'monsoon', 'conditions', 'The', 'inter', 'monsoon', 'season', 'typically', 'lasts', 'from', 'October', 'to', 'November', 'and', 'the', 'weather', 'during', 'the', 'period', 'is', 'characterised', 'by', 'more', 'rainfall', 'and', 'light', 'and', 'variable', 'winds', 'The', 'Meteorological', 'Service', 'Singapore', 'said', 'on', 'Monday', 'in', 'an', 'advisory', 'that', 'this', 'transition', 'signals', 'the', 'end', 'of', 'traditional', 'dry', 'season', 'in', 'the', 'region', 'and', 'the', 'likelihood', 'of', 'transboundary', 'haze', 'affecting', 'Singapore', 'for', 'the', 'rest', 'of', 'the', 'year', 'will', 'be', 'low', 'This', 'is', 'because', 'the', 'increased', 'rainfall', 'will', 'help', 'alleviate', 'the', 'hotspot', 'and', 'haze', 'situation', 'in', 'Sumatra', 'and', 'Kalim

As seen above, the tokenize function in Gensim already removes punctuation for you.
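
To get a feel for why the punctuation disappears, here is a rough approximation written with Python's `re` module. It is not Gensim's implementation, only a sketch that keeps runs of letters and treats everything else as a separator; the function name and the sample sentence are made up.

```python
import re


def simple_tokenize(text):
    """Return runs of letters; digits, punctuation and whitespace act as separators."""
    return re.findall(r'[^\W\d_]+', text)


sample = "The south-west monsoon ends in 2 weeks, said the agency."
print(simple_tokenize(sample))
```

Note that the hyphen in "south-west" splits the word in two and the number is dropped, which mirrors what you can see in the Gensim output above.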

### Changing Everything to Lowercase

In [None]:
haze2_lower = [w.lower() for w in haze2]
print("Lowercase tokens from Gensim:\n", haze2_lower)

### Stop Word Removal

Gensim also has a built-in stop word list for English that can come in
handy when we need to remove stop words from a text collection. The
following code shows how we remove all the stop words from the list
`haze2_lower`.

In [None]:
# Load in stop word list (note that this is a set rather than a list)
stop_list = gensim.parsing.preprocessing.STOPWORDS

haze2_stopremoved = [w for w in haze2_lower if w not in stop_list]

In [None]:
print("Lowercase tokens with stop words removed (from Gensim):\n", haze2_stopremoved)

### Stemming

Gensim also has a built-in Porter stemmer we can use.


In [None]:
from gensim.parsing.porter import PorterStemmer

stemmer = PorterStemmer()
haze2_stemmed = [stemmer.stem(w) for w in haze2_stopremoved]
print("Lowercase tokens with stop words removed and stemmed (from Gensim):\n", haze2_stemmed)

We can see from the code above that using the Porter stemmer, “coming”
is changed to “come,” “weeks” is changed to “week,” and “transitioning”
is changed to “transit.” We can also see that after stemming, some words
are no longer correct. For example, “singapore” is changed to
“singapor,” “conditions” is changed to “condit,” and so on. Although for
humans, these words no longer make sense, for computers, this is usually
not a problem. As long as all occurrences of “singapore” are changed to
“singapor” and all occurrences of “conditions” or “condition” are
changed to “condit,” we can still perform many analysis tasks. For
example, to search for relevant documents about “singapore,” after
stemming, we just need to search for documents containing the word
“singapor.”
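
The search idea can be sketched in a few lines. The example below uses NLTK's `PorterStemmer` rather than Gensim's (the two may differ on some words), and the document tokens and helper names are made up for illustration.

```python
from nltk.stem import PorterStemmer   # NLTK's Porter stemmer, used here instead of Gensim's

stemmer = PorterStemmer()


def stem_all(tokens):
    """Lowercase and stem every token."""
    return [stemmer.stem(token.lower()) for token in tokens]


def mentions(query_word, document_tokens):
    """True if the stemmed query word occurs among the stemmed document tokens."""
    return stemmer.stem(query_word.lower()) in set(stem_all(document_tokens))


document = ['Weather', 'conditions', 'in', 'Singapore', 'improved']   # made-up tokens
print(stem_all(document))
print(mentions('condition', document))   # singular query, plural in the document
print(mentions('rainfall', document))    # word that does not occur in the document
```

The key point is that the query is stemmed in exactly the same way as the documents, so both sides are compared in their stemmed form.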

### Freq Distribution of Words
We use the `Counter` class from Python's `collections` module for this task.

In [None]:
from collections import Counter

words2 = [token for token in haze2_stemmed]
word_freq2 = Counter(words2)
common_words2 = word_freq2.most_common(10)

print("Freq. Dist (counter) from Gensim:\n", common_words2)

# NLP Concepts - Advanced

## Part 1 - POS Tagging

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this lab is on exploiting tags, and tagging text automatically.
Refer to Penn Tree bank for tags - https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
import nltk
from nltk.tokenize import word_tokenize

text = word_tokenize("John ate the cake with spoon")

print(text)

tagsText = nltk.pos_tag(text)
print(tagsText)
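
Once a sentence is tagged, the tags can be exploited with ordinary list operations. The sketch below uses hand-written `(word, tag)` pairs in the same layout that `nltk.pos_tag()` returns; the tags themselves are an assumption about what a tagger would produce for this sentence.

```python
# Hand-written (word, tag) pairs in the format returned by nltk.pos_tag();
# the tags are an assumption about what a tagger would produce for this sentence.
tagged = [('John', 'NNP'), ('ate', 'VBD'), ('the', 'DT'),
          ('cake', 'NN'), ('with', 'IN'), ('spoon', 'NN')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ)
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print("Nouns:", nouns)
print("Verbs:", verbs)
```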

## Your Turn
 Write the code to generate tags for the text "After she ate the cake, Emma visited Tony in his room"

In [None]:
# Enter your code here

text2 = 
tagsText2 =
print(tagsText2);

## Part 2 - Syntactically Parsed Tree

### NLTK parser - Parsing with context free grammar

A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. 


### A Simple Grammar
1. We first need to define the rules of a context-free grammar. We use a pattern string to define the rules.
2. We then use the chunker, which applies the grammar pattern to the POS-tagged sentence, to generate the syntax parse tree.

`chunk.RegexpParser` is a grammar-based chunk parser. It uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are all implemented using regular expression matching and substitution.

A grammar contains one or more clauses in the following form.
Examples:

1.    NP:
       {<DT|JJ>}          # chunk determiners and adjectives
2.    VP:
        {<VBD><NP>?}      # chunk a past-tense verb, optionally followed by a noun phrase
  
https://kite.com/python/docs/nltk.chunk.RegexpParser
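
To illustrate how clauses build on each other, here is a small sketch with two clauses. The tags are written by hand (an assumption about the tagger output) so that the example does not depend on a tagger model, and the grammar is just one possible choice.

```python
import nltk

# Hand-written tags (an assumption about the tagger output) so that no
# tagger model has to be downloaded for this example.
tagged = [('John', 'NNP'), ('ate', 'VBD'), ('the', 'DT'),
          ('cake', 'NN'), ('with', 'IN'), ('spoon', 'NN')]

# Two clauses, applied in order: NP chunks are built first,
# then the VP clause can refer to them as <NP>.
grammar = r"""
NP: {<DT>?<JJ>*<NN>}   # optional determiner, any adjectives, a noun
VP: {<VBD><NP>}        # past-tense verb followed by a noun phrase
"""

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)

# Collect the words inside every VP chunk
vp_words = [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees() if subtree.label() == 'VP']
print(vp_words)
```

Because the clauses are applied in order, a later clause can use the label of an earlier one (here `<NP>`) as if it were a tag.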

3. For a better visual, we can use the `Tree` class. The code is below.

After you execute the code, what do you observe in the output? Does the output show you the correct syntax tree? What should we do to improve it?

In [None]:
import nltk
from nltk import Tree

pattern = """NP: {<DT>?<JJ>*<NN>}
 VBD: {<VBD>}
 IN: {<IN>}"""

# Code to call the Chunker to interpret the grammar and generate the syntax structure on the tagged text. 
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(tagsText)
result.pprint()

# To print the visual, as art
parse = Tree.fromstring(str(result))
parse.pretty_print()


## Your Turn
Write the code to draw the parse tree for the second sentence (`text2`) "After she ate the cake, Emma visited Tony in his room"  
Use its tagged output `tagsText2`

In [None]:
# Enter your code here

result2 = 
result2.pprint()

# To print the visual, as art
parse2 = 
parse2.pretty_print()

# Optional Labs

## Part 3 - Parse tree (Improve the tree)
We notice that the syntax structure is incomplete. Add more clauses to generate a better syntax parse tree.

 Write the code to display the parse tree for the text 
 1. John ate the cake with spoon
 2. After she ate the cake, Emma visited Tony in his room
 
 We will have various answers according to the grammar clauses you have defined.

In [None]:
pattern = """NP: {<DT>?<JJ>*<NN>}
 IN: {<IN>}
 PP: {<IN>*<IN>?<NP>}
 VB: {<VBP>}
 VBD: {<VB>*<VB>?<NP>*<VB>?<NP>?<PP>}
 """
NPChunker = nltk.RegexpParser(pattern) 

result = NPChunker.parse(tagsText)
result.pprint()

parse = Tree.fromstring(str(result))
parse.pretty_print()

## Part 4  - Stanford Parser (Totally Optional!)
The tree may still be incomplete, because the grammar clauses are incomplete. Creating a complete list of grammar clauses is the first solution to this problem. The second is to download a better parser and use it. Let's try Stanford CoreNLP.

Here are the steps:
- Go to https://stanfordnlp.github.io/CoreNLP/ and download the latest version of CoreNLP
- Unzip it and move it where Jupyter Notebook can access it. Example: `~/TAA/Labs/CoreNLP` or `C:\TAA\Labs\CoreNLP`
- Start a Core NLP Server using the downloaded files (easier than it sounds!)
- Use the `CoreNLPParser` from NLTK to communicate with the server
- Parse anything now:
    - For constituency-parsing, use `CoreNLPParser`
    - For dependency parsing, use `CoreNLPDependencyParser`

Credit: Parts 4 and 5 of the lab are based on https://bbengfort.github.io/snippets/2018/06/22/corenlp-nltk-parses.html

#### Notes
- The folder where you store `CoreNLP` has to be visible to Jupyter. In other words, you should be able to browse to it from the Jupyter NB file browser window.
- You need Java, which is a pain to install on M1 MacBooks.
- On Windows, you need GhostScript. Download it from https://www.ghostscript.com/download/gsdnld.html and install it using default setup.
- You may have to specify the paths (on Windows again - maybe you should get a Mac!). The code is specified in the code cells below.
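
Before starting the server, it can help to confirm that Java is reachable. The check below is a small sketch using only the standard library; it looks for a `java` executable on the `PATH` and does not verify the Java version.

```python
import shutil

java_path = shutil.which("java")   # None if no "java" executable is on the PATH
if java_path is None:
    print("Java was not found on the PATH; install it before starting the server.")
else:
    print("Java found at:", java_path)
```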

In [None]:
from nltk.parse.corenlp import CoreNLPServer
import os
import platform

myOS = platform.system()

if (myOS == "Windows"):
    os.environ["PATH"] += os.pathsep + r'C:\Program Files\gs\gs9.52\bin'

# Use the relative path name for the model and other files 
# (CoreNLP is two levels up from the current working directory)
STANFORD = os.path.join("../..", "stanford-corenlp-4.5.1/")

# Create the server
# The server needs to know the location of the following files:
#   - stanford-corenlp-X.X.X.jar
#   - stanford-corenlp-X.X.X-models.jar

server = CoreNLPServer(
   os.path.join(STANFORD, "stanford-corenlp-4.5.1.jar"),
   os.path.join(STANFORD, "stanford-corenlp-4.5.1-models.jar"),    
)

# Start the server in the background
server.start()

In [None]:
# Start the server in the background (if dead)
# server.start()

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser()
parse = next(parser.raw_parse("John ate the cake with spoon"))
print(parse)

# Stop the server when done
# server.stop()

parse

###  Analysis
We see only one tree. It follows various grammar rules that you have learned during the class activities and is almost complete. The resulting tree works for "John ate the cake with spoon".

Try the same code with "John ate the cake with cherry".
The tree generated will be incorrect.

Hence we need semantics to help us disambiguate the words ate, cake, cherry, spoon. This aspect is left to you as homework (see the slides).

## Your Turn
Write the code to draw the parse tree for the second sentence: "After she ate the cake, Emma visited Tony in his room"  
It is way too easy!

In [None]:
# Start the server in the background again
# server.start()

## Enter your code below
parse =

# Stop the server when done
# server.stop()

parse

## Part 5 - Discourse analysis
We will now uncover the discourse structure of a text.

I love data science because I find it very useful for my company.

To achieve this:
1. Generate the parse tree for the given text with the Stanford parser.
2. Call the `CoreNLPDependencyParser` to generate the dependencies. A list of dependencies is generated; look out for "mark".


In [None]:
# Start the server in the background again
# server.start()

sentences = next(parser.raw_parse("I love data science because I find it very useful for my company."))
print(sentences)
# sentences.draw()

# Stop the server when done
# server.stop()

sentences

### Dependency Parser

The Stanford typed dependencies representation was designed to provide a simple description of the
grammatical relationships in a sentence that can easily be understood and effectively used by people
without linguistic expertise who want to extract textual relations.

The details of the representation are available in the manual. 
https://nlp.stanford.edu/software/dependencies_manual.pdf

More information about the universal dependencies can be found in https://nlp.stanford.edu/pubs/schuster2016enhanced.pdf

Let us continue to work with our example and generate the dependencies using our CoreNLP server.

In [None]:
# Start the server in the background again
# server.start()

from nltk.parse.corenlp import CoreNLPDependencyParser
dependency_parser = CoreNLPDependencyParser()
result = next(dependency_parser.raw_parse('I love data science because I find it very useful for my company.'))

# Stop the server when done
# server.stop()

#### Note
You will need to install `graphviz` to visualize the `result` variable, which is a dependency graph.
- Open an Anaconda prompt on Windows. Or a terminal on Mac.
- Type in `conda install graphviz`
- You may need `pip install graphviz` as well. (Do this `pip` step only if the next cell fails.)

In [None]:
# On Windows, you need to add GraphViz to the path as well
if (myOS == "Windows"):
    username = 'lenovo'
    # Raw f-string: backslashes are kept literally (otherwise "\b" would become a backspace character)
    os.environ["PATH"] += os.pathsep + rf'C:\Users\{username}\Anaconda3\Library\bin\graphviz'
    
# Visualize the result
result

### Stanford Discourse

<b>mark: marker </b>

A marker is the word introducing a finite clause subordinate to another clause. For a complement clause,
this will typically be “that” or “whether”. For an adverbial clause, the marker is typically a preposition
like “while” or “although”. The mark is a dependent of the subordinate clause head.

https://nlp.stanford.edu/pubs/schuster2016enhanced.pdf
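
Looking for `mark` relations can be sketched as a simple filter over dependency triples. The triples below are written by hand in a `((head, tag), relation, (dependent, tag))` layout to illustrate the idea; they are not actual output of the parser, and the helper name `find_markers` is made up.

```python
# Hand-written triples in the ((head, tag), relation, (dependent, tag)) layout;
# they illustrate the idea and are NOT actual output of the parser.
example_triples = [
    (('love', 'VBP'), 'nsubj', ('I', 'PRP')),
    (('love', 'VBP'), 'obj', ('science', 'NN')),
    (('love', 'VBP'), 'advcl', ('find', 'VBP')),
    (('find', 'VBP'), 'mark', ('because', 'IN')),
    (('find', 'VBP'), 'nsubj', ('I', 'PRP')),
]


def find_markers(triples):
    """Return (marker word, head of the subordinate clause) for every 'mark' relation."""
    return [(dependent[0], head[0])
            for head, relation, dependent in triples
            if relation == 'mark']


print(find_markers(example_triples))
```

With a real parse, the same function could be applied to `result.triples()`, assuming it yields triples in this layout.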

## Your Turn
Write the code to generate the dependencies for the text "After she ate the cake, Emma visited Tony in his room"

## Exercise
### Loading our own dataset - SGNews

The `data.zip` file downloaded from eLearn contains two folders: `SampleText` and `SGNews_Apr2012`. 
This `SGNews_Apr2012` data
set contains a set of Singapore news articles in April 2012. Load this
document collection using NLTK. Can you find out the following
information of this collection?
-   Number of documents in the collection.
-   Total number of words in the collection.
-   The top-20 most frequent words in the file `14011.txt`.

### Tips:

-   You can use NLTK or gensim
-   To use `FreqDist`, you can either use `nltk.FreqDist()` as shown
    above or type `from nltk.probability import FreqDist` first and then directly use
    `FreqDist()`. This is because the `FreqDist` class is defined by the
    `probability` module under NLTK.
-   You may use `Counter` from Python's `collections` module as well.

In [None]:
# Enter your code here to answer the questions above. 


