# IS450 - Text Mining and Language Processing

## Lab 6

### Objectives 

-   To learn to define functions in Python.

-   To learn to write Python scripts and import them, by exploring two Python programs:

    `1. preprocess.py`

    `2. k_means.py`

-   To perform K-means document clustering using a provided
    Python module, k_means.py.

### More on Python Programming

In this section, we will learn a few more basic Python programming
concepts so that we can re-use Python code. First, let us visit the
concept of a *function*. If you have a programming background, then
functions are not new to you. Depending on the programming language(s)
you have used before, functions may also be referred to as *methods*
(e.g., in Java) or *subroutines* (e.g., in Perl). Even if you have not
programmed before, if you are familiar with Microsoft Excel, you have
probably used pre-defined functions such as `SUM` and `AVERAGE` to help
with data processing. Generally speaking, a function takes in a list of
parameters and performs some computation based on them. A function may
or may not return a value.

We have been calling many functions in Python so far. For example, we can
use `len(lst)` to obtain the length of a list `lst`. Here `len` is a
function that takes in a list object and returns its length. We have
also used the function `print`, which displays an object but does not
return a meaningful value (it returns `None`).
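
As a quick check of the two built-in functions mentioned above, the short sketch below calls `len` and `print` and inspects what each one returns. The list `lst` and its contents are made up for illustration.

```python
# A made-up list used only for illustration.
lst = ["text", "mining", "lab"]

# len takes the list and returns its length.
n = len(lst)
print(n)

# print displays its argument, but the value it returns is None.
result = print("hello")
print(result is None)
```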

Let us now see how we can define our own function. To define a
function, we use the keyword `def`, followed by the function name and
a pair of round brackets `()`. Inside the round brackets, we
specify the parameters of the function. After the closing round
bracket, we need a colon `:`.

See the code below for the definition of a function called `square()`:

In [12]:
def square(x):
    return x * x

In the line below the `def` statement, we give the definition (the
body) of the function.
Pay attention to the extra space in front of the `return` statement in
the code above. Because Python relies on code indentation to
interpret the code, it is important to indent your code properly.
Here we need the extra space before `return` because the
`return` statement is part of the `square` function. Once a function is
defined, it can be called as shown in the code below.

In [13]:
y = square(4)
print(y)

16
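
A function body can span several indented lines, and a function can take more than one parameter. The sketch below uses a hypothetical function, `average_word_length`, which is not part of the lab files; every line indented under `def` belongs to the function.

```python
def average_word_length(words, default=0.0):
    # All lines indented under `def` are part of the function body.
    if len(words) == 0:
        return default
    total = 0
    for w in words:
        total = total + len(w)
    return total / len(words)

# This line is not indented, so it is outside the function.
print(average_word_length(["text", "mining"]))
```

If the `return` lines were not indented, Python would no longer treat them as part of `average_word_length` and would report an error.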


We can also define a function in a separate Python file and `import` the file, or the function itself, so that we can re-use it.
The Python file containing the function definition is simply a text file with the file extension `.py`. 


For example, check the file `lab6_demo.py` that you should have downloaded together with this Jupyter notebook.
You can open the file using Notepad++.

After examining `lab6_demo.py`, you can import this file as follows:

In [14]:
import lab6_demo

Now you can call the function `my_square()` defined in `lab6_demo.py` as follows:

In [15]:
y = lab6_demo.my_square(3.5)
print(y)

12.25


Generally speaking, you can place any Python code inside a `.py` file.
Such a file is called a Python module.
You can then import this Python module to use the functions defined inside.
A Python module can also contain variables that can be re-used.
See the code below:

In [16]:
print(lab6_demo.pi)

3.14
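
For reference, a module that behaves like the calls above could look like the sketch below. This is an assumption based only on the outputs shown in this notebook; the actual `lab6_demo.py` may contain different or additional code.

```python
# Hypothetical contents of a module such as lab6_demo.py.
# Inferred from the outputs above; the real file may differ.

# A module-level variable that importers can read as <module>.pi
pi = 3.14

def my_square(x):
    # Return x multiplied by itself.
    return x * x
```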


After you import a module, you need to use the name of the module
together with a function name or variable name to refer to that function
or variable. If you need to frequently refer to a
function or a variable from a module, you can directly import that
function or variable from the module, as shown in the code below.

In [None]:
from lab6_demo import my_square
print(my_square(3))
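
The same two import styles work for any module. The sketch below uses Python's standard `math` module (not one of the lab files) to compare them side by side.

```python
# Style 1: import the module, then use the module name as a prefix.
import math
print(math.sqrt(16))

# Style 2: import one name directly, then use it without the prefix.
from math import sqrt
print(sqrt(16))
```

The first style makes it clear where a function comes from, while the second is shorter when the same function is used many times.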

### Text Preprocessing using a Module

Examine another script `preprocess.py` that comes together with Lab 5. Study the functions defined inside. Think about how you can use these functions to pre-process a corpus.

You can try to use the functions inside `preprocess.py` to process the text files inside `SGNewsForClustering`, which you can download from eLearn.

In the next section, we will illustrate how we can use `preprocess.py` to process these files.

### Text Clustering using K-Means


In the lecture, we explained the basic idea behind the K-means clustering
algorithm. In this section, we will try out this algorithm. A Python
module called `k_means` has been provided for you together with Lab 6. 
If you are interested, you can take a look at how K-means is
implemented in this script, but you are not required to understand
everything in the script.

You can compare the code with the algorithm described in http://www.onmyphd.com/?p=k-means.clustering.

Now let us see how we can use the function `k_means` defined in this
module to perform K-means text clustering. To begin with, we need to load a
corpus that contains documents. A corpus called `SGNewsForClustering` has been prepared for you. Download the zipped
corpus from eLearn. Inside the directory `SGNewsForClustering`, you will
see 40 documents. Currently the documents are named after the category
they belong to. However, when we load the corpus, the category
information is not used. We will see how, based only on the content of
these documents, we can cluster them into two groups that more or less
correspond to the two categories.

First, we use the module `preprocess` to load
this corpus and transform the documents into TF-IDF-based document
vectors. Notice that here we use `from preprocess import *` to import
all the variables and functions from `preprocess.py`. Then we can
directly use the names of those variables and functions.

In [18]:
from preprocess import *

corpus = load_corpus('data/SGNewsForClustering')

docs = corpus2docs(corpus)

dictionary = gensim.corpora.Dictionary(docs)

vecs = docs2vecs(docs, dictionary)
print(len(docs))

40
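
To make the idea of a TF-IDF-based document vector concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It is not the code in `preprocess.py`: the function name `tfidf_vectors`, the toy documents, and the exact weighting (raw term count multiplied by log(N / document frequency)) are assumptions chosen for illustration, and libraries such as gensim may weight and normalize differently.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: weight} dictionary per tokenized document.

    Weight = (count of word in doc) * log(N / number of docs containing word).
    This is one common TF-IDF variant, used here only as an illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors

# Toy corpus (made up): each document is a list of tokens.
toy_docs = [
    ["police", "arrested", "suspect"],
    ["police", "found", "suspect"],
    ["scientists", "found", "planet"],
]
toy_vecs = tfidf_vectors(toy_docs)
print(toy_vecs[0])
```

In this toy corpus, `arrested` appears in only one document, so it receives a larger weight than `police`, which appears in two.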


Next, we import the `k_means` module and call the function `k_means` to
cluster the documents represented by `vecs`. To do so, we need to pass
three parameters to `k_means`. The first is `vecs` itself. The second is
the dimension of these vectors, which is the same as the number of
unique words contained in `dictionary`. The last parameter is the number
of clusters we want to generate.


The code below shows how we call the `k_means` function to do
clustering. We set the number of clusters
to be 2. What is returned is a list where each element represents a
cluster of documents. A cluster of documents is represented by the
indices of the documents in the original document vector list passed to
`k_means`. 

In [19]:
import k_means

num_tokens = len(dictionary.token2id)
clusters = k_means.k_means(vecs, num_tokens, 2)
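
If you would like to see the overall shape of the algorithm before reading the provided module, the sketch below is a minimal K-means on small dense vectors. It is not the code in `k_means.py`: the name `simple_k_means`, its parameters, the use of plain Python lists instead of the vectors produced by `docs2vecs`, and the toy data are all assumptions made for illustration. Like the function described above, it returns a list of clusters, each given as a list of indices into the input.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    # Component-wise average of a non-empty list of vectors.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def simple_k_means(vectors, k, max_iters=100, seed=None):
    """Cluster dense vectors into k groups; return a list of index lists."""
    rng = random.Random(seed)
    # Randomly pick k distinct vectors as the initial cluster centers.
    centers = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each vector goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # No vector changed cluster, so we have converged.
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # Keep the old center if a cluster is empty.
                centers[c] = mean_vector(members)
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]

# Toy data (made up): two well-separated groups of 2-D points.
points = [[0, 0], [1, 0], [2, 0], [100, 0], [101, 0], [102, 0]]
toy_clusters = simple_k_means(points, 2, seed=0)
print(toy_clusters)
```

On this toy data the two groups are always recovered, but which group is listed first depends on the random initial centers.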



The code below shows how we can retrieve the first and second
clusters and display the original file IDs of the documents belonging to each cluster.

In [21]:
fids = corpus.fileids()

# Print the file IDs in each cluster

cluster1 = clusters[0]
print("Cluster 1:", [fids[d] for d in cluster1])

cluster2 = clusters[1]
print("Cluster 2:", [fids[d] for d in cluster2])



Cluster 1: ['SGNewsForClustering/Crime_9843.txt', 'SGNewsForClustering/Crime_9844.txt', 'SGNewsForClustering/Crime_9845.txt', 'SGNewsForClustering/Crime_9846.txt', 'SGNewsForClustering/Crime_9847.txt', 'SGNewsForClustering/Crime_9863.txt', 'SGNewsForClustering/Crime_9864.txt', 'SGNewsForClustering/Crime_9865.txt', 'SGNewsForClustering/Crime_9866.txt', 'SGNewsForClustering/Crime_9867.txt', 'SGNewsForClustering/Crime_9868.txt', 'SGNewsForClustering/Crime_9873.txt', 'SGNewsForClustering/Crime_9929.txt', 'SGNewsForClustering/Crime_9930.txt', 'SGNewsForClustering/Crime_9931.txt', 'SGNewsForClustering/Crime_9939.txt', 'SGNewsForClustering/Crime_9940.txt', 'SGNewsForClustering/Crime_9946.txt', 'SGNewsForClustering/Science_9201.txt', 'SGNewsForClustering/Science_9265.txt', 'SGNewsForClustering/Science_9323.txt']
Cluster 2: ['SGNewsForClustering/Crime_9842.txt', 'SGNewsForClustering/Crime_9890.txt', 'SGNewsForClustering/Science_9060.txt', 'SGNewsForClustering/Science_9165.txt', 'SGNewsForCluste

You can check if most of the files in the same cluster indeed come from the same category (either Crime or Science).

Note that because our implementation of the K-means clustering
algorithm randomly initializes the cluster centers, the resulting
clusters are not always the same. You may see a different list of
documents in your first cluster. If you call `k_means` again to cluster
the same set of documents, you may get very different results from the
first run. This is normal and nothing to be concerned about.

What you should check is the quality of the clustering results.
Although there is no universal truth about what result is considered
correct, we can roughly use the documents’ category labels to judge the
clustering quality. You can repeat the call to `k_means` several times and
check the names of the documents assigned to each cluster. Do you
usually see Crime documents grouped together in one cluster and Science
documents grouped together in another cluster?
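
Rather than reading long lists of file names by eye, you can count the categories in each cluster. The sketch below is one possible helper; the name `category_counts` is hypothetical, and it assumes file IDs of the form shown in the output above, where the category is the part of the file name before the first underscore.

```python
import os
from collections import Counter

def category_counts(file_ids):
    """Count how many file IDs belong to each category.

    Assumes names like 'SGNewsForClustering/Crime_9843.txt', where the
    category is the part of the file name before the first underscore.
    """
    categories = [os.path.basename(fid).split("_")[0] for fid in file_ids]
    return Counter(categories)

# A few file IDs copied from the output above, used as sample input.
sample_fids = [
    "SGNewsForClustering/Crime_9843.txt",
    "SGNewsForClustering/Crime_9844.txt",
    "SGNewsForClustering/Science_9201.txt",
]
print(category_counts(sample_fids))
```

In the notebook, you could call `category_counts([fids[d] for d in cluster1])` after each run of `k_means` and compare the counts across runs.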

### Interpreting the Clusters 

Generally, we cluster documents for which we do not have any category
information. In this case, how do we check the quality of the generated
clusters? A related question is how we can interpret or summarize the
generated document clusters. A simple way is to find the most frequent
words of a document cluster as an overview of that cluster. Using the
two document clusters you have obtained in the previous section, can you
use NLTK’s `FreqDist` class to check the most frequent words of each
document cluster?

In [22]:
# Write your code here.
import nltk  # may already be available through `from preprocess import *`

# Take all the file IDs in cluster1
cluster1_fids = [fids[d] for d in cluster1]

# Create an empty list to hold all the words in cluster 1
cluster1_all_words = []

# Add the words from all files to this list, using corpus.words and the extend method
for fid in cluster1_fids:
    cluster1_all_words.extend(corpus.words(fid))

# Remove the stopwords (and words of 3 characters or fewer) from this list
stop_list = nltk.corpus.stopwords.words('english')

all_words1 = [w.lower() for w in cluster1_all_words]
all_words2 = [w for w in all_words1 if w not in stop_list and len(w) > 3]

# Call the frequency distribution method and display the top 10 (or 20) words
fdist = nltk.FreqDist(all_words2)
print(fdist.most_common(10))

[('police', 37), ('said', 33), ('arrested', 20), ('found', 20), ('suspect', 20), ('year', 18), ('also', 17), ('june', 16), ('investigations', 14), ('years', 14)]
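
Since the same steps apply to every cluster, they can be wrapped in a function, in the spirit of the first part of this lab. The sketch below uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` so that it runs without the corpus; the function name `top_words`, its parameters, and the toy inputs are assumptions for illustration.

```python
from collections import Counter

def top_words(words, stop_words, n=10, min_length=4):
    """Return the n most frequent words, ignoring stopwords and short words."""
    stop_set = set(stop_words)  # set membership tests are fast
    kept = [
        w.lower()
        for w in words
        if w.lower() not in stop_set and len(w) >= min_length
    ]
    return Counter(kept).most_common(n)

# Toy input (made up); in the notebook you would pass corpus words instead.
toy_words = ["Police", "said", "the", "police", "found", "the", "suspect",
             "police", "said"]
toy_stop_words = ["the", "and", "of"]
print(top_words(toy_words, toy_stop_words, n=2))
```

With a function like this, summarizing the second cluster only requires collecting its words and calling the function again, instead of copying the whole cell.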
