## Topic Modeling Using MALLET <a id='table'></a>
For this exercise, we will be using data contained in the "homework" database on the Big Data for Social Science Class Server. This notebook will walk you through topic modeling NIH abstracts using [MALLET](http://mallet.cs.umass.edu/topics.php).

## Table of Contents

- [Initialization](#Initialization)
- [Getting Data](#Getting-Data)

    - [Exercise 1](#Exercise-1)

- [Generating Topics](#Generating-Topics)

    - [Exercise 2](#Exercise-2)
    - [Exercise 3](#Exercise-3)

- [Inferring Topics](#Inferring-Topics)

    - [Exercise 4](#Exercise-4)

- [Resources](#resources)

## Initialization

Before we begin, we'll need to run the following code cell, which imports the Python libraries we'll be using and defines a function, `terminal()`, that we'll use to run commands on the server.  Please run it before proceeding:

In [None]:
# Importing the modules we will use in this workbook
from subprocess import Popen, PIPE
import os
import MySQLdb
import string
import nltk
import re
from nltk.corpus import stopwords

# download the punkt tokenizer models and the stop word lists
nltk.download( "punkt" )
nltk.download( "stopwords" )

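# and, defining our terminal() function
def terminal(args):
    # run the command in args, capturing standard output and standard error
    pipe = Popen(args, stdout = PIPE, stderr=PIPE)
    text, err = pipe.communicate()
    text = text.decode()
    err = err.decode()
    # prefer standard output; fall back to standard error, then to a placeholder message
    if len(text) > 2:
        return text
    elif len(err) > 2:
        return err
    else:
        return "No output returned"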

## Getting Data

* Back to the [Table of Contents](#Table-of-Contents)

We will be using a sample of abstracts from NIH grants stored in the 'TextAnalysis' table in the 'homework' database to explore automated text analysis.

This table was created by taking a sample of abstracts from the grants located in the broader 'umetricsgrants' database.

For text analysis, we'll automatically derive a list of topics from these abstracts using MALLET, a Java-based text analysis tool that makes topic modeling very easy.  MALLET is primarily a command line tool and requires a specific format for its data.  We'll use our `terminal()` function to run it, and we'll create appropriately formatted text files for each abstract as part of this exercise.  You can read more about importing data into MALLET [on the MALLET web site's "Importing Data" page.](http://mallet.cs.umass.edu/import.php)

Let us first create a temporary directory in our home folder using the `terminal()` function.

A brief explanation of how the `terminal()` function works: Python can send commands to the terminal using "subprocess" so that you never have to leave the iPython notebook.  For your convenience, a `terminal()` function is defined above that implements this for you. It takes in a list of arguments where the first is the name of the command you want to run and subsequent items are the arguments you want to pass to that command.  It returns the output from executing the command. 
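To make the list-of-arguments convention concrete, here is a minimal, self-contained sketch. It does not reuse the notebook's `terminal()`; `run_command()` is a hypothetical stand-in built on `subprocess.run()`, and the command it launches is just the current Python interpreter printing a line, chosen so the example works on any system.

```python
import subprocess
import sys

def run_command(args):
    """Hypothetical stand-in for terminal(): run a command given as a list of strings."""
    result = subprocess.run(args, capture_output=True, text=True)
    # Return standard output if there is any, otherwise standard error.
    return result.stdout if result.stdout else result.stderr

# First item: the program to run. Remaining items: its arguments, one per list entry.
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output)
```

Each space-delimited piece of a command line becomes its own list item, which is the pattern used for every MALLET call in this notebook.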

The following `terminal()` call will make a temporary directory named "`temp`" in your current working directory.

In [None]:
terminal(['mkdir', 'temp'])
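Note that `mkdir` reports an error if the directory already exists, for example when this cell is run a second time. As an alternative sketch (not part of the original notebook), Python's `os.makedirs()` with `exist_ok=True` creates a directory only when it is missing. The `demo_root` location below is a throwaway directory used purely for illustration.

```python
import os
import tempfile

# Throwaway parent directory so this example does not touch your working directory.
demo_root = tempfile.mkdtemp()
temp_dir = os.path.join(demo_root, "temp")

# exist_ok=True makes the call safe to repeat: no error if the directory is already there.
os.makedirs(temp_dir, exist_ok=True)
os.makedirs(temp_dir, exist_ok=True)

created = os.path.isdir(temp_dir)
print(created)
```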

Now let's retrieve the abstracts and their IDs, and store each abstract in a file with its ID as the filename.

In [None]:
# Create MySQL connection
user = "jmorgan"
password = "hUDpr7TUpbhpsoQfw$cTWbyBb2WCvsgE"
database = "homework"

# invoke the connect() function, passing parameters in variables.
db = MySQLdb.connect( user = user, passwd = password, db = database )

# output basic database connection info.
print( db )

# create a database cursor.
cursor = db.cursor( MySQLdb.cursors.DictCursor )

Next, we'll set up a few functions that we'll use in this exercise.  The first is `writeFile()`, a function that takes in a filename and text, and creates a new file populated with the text.  Run the cell below to define `writeFile()`.

In [None]:
def writeFile(filename, data):

    with open(filename, "w") as f:
        
        f.write(str(data))
        
    #-- END with (automatically closes file) --#

We also wrote a function to do some initial cleaning of the abstract text - `cleanAbstract()`. This function:

- accepts an abstract's text
- removes words that would be very common in NIH abstracts, because we don't want them to bias the results
- removes stopwords (MALLET can also do that)
- removes punctuation
- returns the resulting cleaned string

Run the cell below to define `cleanAbstract()`.

In [None]:
def cleanAbstract(text):
    
    # common words to remove
    commonWords = ['study', 'project', 'experiment', 'abstract', 'description', 'studies', \
                  'abstracts', 'projects', 'experiments', 'descriptions']

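    # replace newlines, tabs, and similar white space with a single space,
    # so that words on either side of a line break are not joined together.
    text = re.sub('[\n\t\r\f]+', ' ', text)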
    
    # convert to all lower case.
    text = text.lower()
    
    # break text up into tokens (words)
    tokens = nltk.word_tokenize(text)
    
    # retrieve list of stop words.
    stop = stopwords.words('english')

    # remove stop words from list of tokens.
    tokens = [t for t in tokens if t not in stop]

    # remove punctuation from tokens
    exclude = set(string.punctuation)
    tokenNew=[]
    for s in tokens:
        snew = ''.join(ch for ch in s if ch not in exclude)
        if snew!="":
            tokenNew.append(snew)

    # remove common words
    tokenNew = [t for t in tokenNew if t not in commonWords]

    # tie tokens back together.
    abstract = ' '.join(tokenNew)

    return abstract

#-- END function cleanAbstract() --#
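To illustrate what this kind of cleaning does to a piece of text, here is a simplified, standard-library-only sketch of the same steps. It is not `cleanAbstract()` itself: `clean_text_sketch()` is a hypothetical helper, it splits on white space instead of using NLTK's tokenizer, and its stop word and common word sets are deliberately tiny placeholders.

```python
import re
import string

# Placeholder lists for illustration only; the notebook uses NLTK's English stop words.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "this", "will", "on"}
COMMON_WORDS = {"study", "project", "studies", "projects"}

def clean_text_sketch(text):
    # Collapse all runs of white space to single spaces and lower-case the text.
    text = re.sub(r"\s+", " ", text).lower()
    tokens = text.split()
    # Strip punctuation characters from every token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    # Drop empty tokens, stop words, and domain-specific common words.
    tokens = [t for t in tokens if t and t not in STOP_WORDS and t not in COMMON_WORDS]
    return " ".join(tokens)

cleaned = clean_text_sketch("This study will examine the effects of aging\non lung cancer.")
print(cleaned)
```

Only the content-bearing words survive, which is what we want MALLET to see when it builds topics.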

### Exercise 1

* Back to the [Table of Contents](#Table-of-Contents)

Retrieve the abstracts one by one from the database and write them to text files in the `temp` directory. For your convenience, the `writeFile()` function has already been created. You just need to call it with the path where you want the file stored and the contents of the abstract.

When writing files, write each to the `temp` directory we created inside the current directory (path is "./temp").  Set the name of the files to the application ID from their grant (stored in the "`APPLICATION_ID`" column in the `TextAnalysis` database table), followed by ".txt".

So, the path for a given file should be:

    ./temp/<application_id>.txt
    
The text of each abstract is stored in the "`ABSTRACT_TEXT`" column in the `TextAnalysis` table.  Make sure to clean the text using `cleanAbstract()` before you write it out to a file.

In [None]:
# First create the query that you need to get the abstracts
# (only rows that actually have abstract text, since cleanAbstract() needs a string)
query = 'SELECT * FROM TextAnalysis WHERE ABSTRACT_TEXT IS NOT NULL LIMIT 1000;'

#Execute the query
cursor.execute(query)

### BEGIN SOLUTION
#Fetch the results one by one and write them to a file
row = cursor.fetchone()
while (row is not None):
    ID = row['APPLICATION_ID']
    abstract = row['ABSTRACT_TEXT']
    abstract = cleanAbstract(abstract)
    filename = './temp/' + str(ID) + ".txt"
    writeFile(filename, abstract)
    row = cursor.fetchone()
### END SOLUTION

# clean up
cursor.close()
db.close()

In [None]:
# Test to see if file was successfully written
f = open('./temp/6187933.txt', 'r')
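Opening one file only confirms that a single abstract was written. As a rough additional check, you could count the `.txt` files in the directory. The sketch below is self-contained and runs against a throwaway directory with made-up file names; `count_text_files()` is a hypothetical helper, and in the notebook you would pass it `"./temp"`.

```python
import os
import tempfile

def count_text_files(directory):
    """Return the number of .txt files directly inside `directory`."""
    return sum(1 for name in os.listdir(directory) if name.endswith(".txt"))

# Demonstration on a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as demo_dir:
    for app_id in (101, 102, 103):  # made-up IDs, for illustration only
        with open(os.path.join(demo_dir, str(app_id) + ".txt"), "w") as f:
            f.write("example text")
    n_files = count_text_files(demo_dir)

print(n_files)
```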

## Generating Topics

* Back to the [Table of Contents](#Table-of-Contents)

We have now created a number of .txt files in the temp directory, each of which contains a single abstract.  We will be using the set of these abstracts together as a corpus of data for machine learning.

Our next task is to transform these individual files into a single file in MALLET format. To achieve this, we will use MALLET's `import-dir` command. This command can read in an entire directory, turn it into a MALLET file, and optionally strip out common English stopwords. Our command will look something like this:

    /bin/mallet/bin/mallet import-dir --input path/to/temp/directory --output data.mallet --keep-sequence --remove-stopwords

Let's break down this command into each of the separate tokens it contains (where tokens are words separated by spaces):

- **`/bin/mallet/bin/mallet`** ==> is the path to the MALLET program
- **`import-dir`** ==> the first argument to the program mallet specifies what command the program is being asked to do.  The `import-dir` command tells MALLET to import an entire directory of files into a MALLET data file.
- **`--input`** ==> "--" are used in MALLET to signify parameter names, usually followed by a space and a parameter value.  `--input` is a parameter used to tell MALLET the directory in which the corpus of data is located.
- **`path/to/temp/directory`** ==> Path to the directory that contains the corpus of data (the value for the parameter `--input`).
- **`--output`** ==> tells MALLET where to store the output
- **`data.mallet`** ==> name of file we'll store the MALLET data in (the value for the parameter `--output`).
- **`--keep-sequence`** ==> parameter that tells MALLET to store each document as a sequence of words, in their original order, rather than as a set of word counts.  MALLET's topic modeling commands require this format.  This is an example of a parameter that doesn't have an associated value.
- **`--remove-stopwords`** ==> parameter that tells MALLET to remove common English stopwords like "a", "an", and "the".  Another parameter with no subsequent value.
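
To see how a command line like this maps onto the list of arguments that `terminal()` expects, the sketch below uses `shlex.split()` from the Python standard library to break a command string into tokens. The `./temp` input path is filled in here as an assumption; substitute your own directory.

```python
import shlex

command = ("/bin/mallet/bin/mallet import-dir --input ./temp "
           "--output data.mallet --keep-sequence --remove-stopwords")

# shlex.split() splits on white space the way a shell would: one list item per token.
tokens = shlex.split(command)

for position, token in enumerate(tokens):
    print(position, token)
```

Parameters that take a value (such as `--input`) occupy two consecutive list items, while value-less parameters (such as `--keep-sequence`) occupy one.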


### Exercise 2

* Back to the [Table of Contents](#Table-of-Contents)

Now use the `terminal()` function to run the MALLET `import-dir` command on your "./temp" directory.  Remember, the `terminal()` function accepts a list of arguments: the command to be run is the first item in the list, and each subsequent space-delimited part of the command is its own item.  Given the above breakdown of the `import-dir` command, break that command into a list of arguments and invoke the command using `terminal()`, reading from your "./temp" directory and outputting the resulting MALLET data file to "data.mallet".

In [None]:
# store argument list in args[]
args = None

### BEGIN SOLUTION
args  = ['/bin/mallet/bin/mallet', 'import-dir', '--input',  './temp', '--output', \
         'data.mallet', '--keep-sequence', '--remove-stopwords']
### END SOLUTION

# run terminal() on args and print out results.
print( terminal( args ) )

In [None]:
# Test to see if file data.mallet was successfully written
f = open('./data.mallet', 'r')

If you go to your working directory now, you should find a file named "`data.mallet`".  This is the MALLET data file that we will use as input when we ask MALLET to generate topics based on a corpus of text.

We will use the `train-topics` command in MALLET to generate our very own topic models.

In the following example, we execute this command using its default settings:

In [None]:
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet']
print(terminal(args))

This command opens `data.mallet` and runs MALLET's topic modeling algorithm on the corpus of documents in it using default settings.  The algorithm is iterative: each iteration refines the topic assignments produced by the previous one, and MALLET prints out its progress as it goes.  By default, MALLET trains 10 topics and prints the top words for each of them every 50 iterations.  A good way to judge whether the algorithm has converged is to look at this output.  Each time it outputs topics, it lists, for each of the ten topics, the topic ID followed by the words most strongly associated with that topic.  If this output stops changing much between iterations, the algorithm has likely converged.

You can read more about the different options that can be used to fine tune the results [on the MALLET web site's "Topic Modeling" page.](http://mallet.cs.umass.edu/topics.php)

### Exercise 3

* Back to the [Table of Contents](#Table-of-Contents)

In the above example, we ran the base topic modeling algorithm but we didn't save the output anywhere.  If you look at the documentation pointed to above, it gives you different options to store the output.  Using this documentation as a guide, modify the MALLET command list used to invoke "`train-topics`" to output:

- topic keys
- topic composition of documents
- a serialized MALLET topic trainer object

Also add the option to:

- enable hyperparameter optimization
- increase the number of sampling iterations to 20,000
- increase the number of topics to 20

Store the output of `--output-doc-topics` in a file called `docTopics.txt`.

**_NOTE: the topic modeling in this code cell could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), it should still be running.  Give it some time._**

In [None]:
#  Modify the MALLET command to output topic keys, topic composition of documents, and a serialized MALLET topic trainer object.
# Add the option to enable hyperparameter optimization, increase the number of sampling iterations to 20,000,
# and increase the number of topics to 20.

# store argument list in args[]
args = None

### BEGIN SOLUTION
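# NOTE: --output-model writes the serialized topic trainer object; the file name
#    'topicModel.mallet' is our own choice, not one required by MALLET.
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10', \
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20', \
       '--num-iterations', '20000', '--output-model', 'topicModel.mallet']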
### END SOLUTION

# run terminal() on args and print out results.
print( terminal( args ) )

In [None]:
# Test to see if file docTopics.txt was successfully written
f = open('./docTopics.txt', 'r')

Now let's look at some results. Your execution of MALLET should have produced the following two output files:

- `topicKeys.txt` - a list of the topics detected in the abstracts, along with their weights and the words associated with each.
- `docTopics.txt` - for each file in the corpus, lists each of the detected topics and a relevance score that indicates how likely it is that a given abstract relates to that topic.

In `topicKeys.txt`, each topic gets a tab-delimited line in the file.  In a given topic's line, the first number is the numeric identifier of the topic (0, 1, 2, etc.), the second number gives an indication of the weight of that topic, and then after a tab, the line is completed with a list of the keywords associated with that topic.  An example (topic 14, weight 0.05863):

    14	0.05863	health care cancer data risk patients individuals outcomes genetic disease women testing aging older unreadable lung effects participants intervention community 
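
If you want to work with these lines in Python, a minimal parsing sketch might look like the following. It assumes the three-field, tab-delimited layout described above and uses an abbreviated copy of the example line; the variable names are illustrative.

```python
# Abbreviated copy of the example line above (tab-delimited: ID, weight, keywords).
line = "14\t0.05863\thealth care cancer data risk patients"

# Split the line into its three tab-delimited fields.
topic_id_text, weight_text, keyword_text = line.rstrip("\n").split("\t")

topic_id = int(topic_id_text)
weight = float(weight_text)
keywords = keyword_text.split()  # keywords are separated by spaces

print(topic_id, weight, keywords)
```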

In `docTopics.txt`, each abstract, represented by its file path in the original directory, has a line in the file that lists the topics associated with that abstract and a relevance score for each topic, in order of decreasing relevance.  The relevance score runs from 0 to 1, where 0 is not at all relevant and 1 is perfectly related.  Example:

    1	file:/home/jmorgan/nbgrader/courses/2015-fall-big_data/source/05.%20Text%20Analysis/./temp/6287560.txt	14	0.7923786992036733	16	0.10258021829417482	11	0.0414781353290313	12	0.04106461349323023	17	0.020602274154395795	1	3.1281525697325206E-4	7	2.5520223298362594E-4	15	2.0605374283754973E-4	2	1.4476870589053855E-4	18	1.3480683006926762E-4	6	1.216816985277047E-4	10	1.0585995154023939E-4	9	1.0155273496603757E-4	5	9.406085778593105E-5	3	8.487431366553085E-5	19	7.761498905199808E-5	0	7.675511604129449E-5	13	7.071872110668556E-5	4	5.972207592082584E-5	8	4.957229813413215E-5

In this example, the abstract with ID 6287560 (found in the file path) is most highly related to topic 14 (our example above) with 0.792... relevance score.  In aggregate, this output could help you to find connections between documents based on these detected topics that you might not have otherwise noticed.
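
As a rough sketch of how such a line could be read programmatically, the code below splits it into the document path and a list of (topic, score) pairs. It assumes the alternating topic/score layout shown in the example above; the line used here is shortened, with an abbreviated placeholder path and only the first three pairs.

```python
from pathlib import PurePosixPath

# Shortened version of the example line: placeholder path, first three topic/score pairs.
line = ("1\tfile:/path/to/temp/6287560.txt"
        "\t14\t0.7923786992036733\t16\t0.10258021829417482\t11\t0.0414781353290313")

fields = line.strip().split("\t")
doc_index = int(fields[0])

# The application ID is the file name without its ".txt" extension.
application_id = PurePosixPath(fields[1]).stem

# Remaining fields alternate between topic ID and relevance score.
pairs = [(int(fields[i]), float(fields[i + 1])) for i in range(2, len(fields), 2)]

# The most relevant topic is the pair with the highest score.
top_topic, top_score = max(pairs, key=lambda pair: pair[1])
print(application_id, top_topic, top_score)
```

Collecting `top_topic` for every line would give a simple way to group abstracts by their dominant topic.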

## Inferring Topics

* Back to the [Table of Contents](#Table-of-Contents)

You can use your newly trained model to infer topics for unseen documents. Since we used the first 1000 abstracts to train the model, let us use the model to infer topics for the 1001st abstract.

In [None]:
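# The connection was closed at the end of Exercise 1, so reconnect first.
db = MySQLdb.connect( user = user, passwd = password, db = database )
cursor = db.cursor( MySQLdb.cursors.DictCursor )
cursor.execute('SELECT * FROM TextAnalysis WHERE ABSTRACT_TEXT IS NOT NULL LIMIT 1 OFFSET 1000')
data = cursor.fetchone()
db.close()

# Create a new inference directory and write the cleaned abstract into it,
# so the new document is prepared the same way as the training data.
terminal(['mkdir', 'infer'])
writeFile('./infer/' + str(data["APPLICATION_ID"]) + ".txt", cleanAbstract(data["ABSTRACT_TEXT"]))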

The documentation for topic inference can be found [here](http://mallet.cs.umass.edu/topics.php).

### Exercise 4

* Back to the [Table of Contents](#Table-of-Contents)

Use the MALLET documentation to set up a call to `mallet` to infer topics for the file we just created in the `./infer` folder:

- First, we provide a mallet command that will re-run our topic model with an additional parameter to output a topic inference model specification to a file named `model.mallet`.
- Next, you'll create and run a mallet command that executes the `import-dir` command to  create a new MALLET data file named `one.mallet`, based on your `./infer` directory rather than your `./temp` directory, that will contain the article whose topics you want to infer.  Use the option `--use-pipe-from data.mallet` to specify our original data file as a training file for this corpus.
- Finally, you'll create and run a mallet command that executes the `infer-topics` command, running for 10,000 iterations, using `one.mallet` as the input, the model inferencer we created first as the inferencer, and that outputs the topics for the one abstract to a file named `inf-one.txt`.

As mentioned in the documentation, make sure that the new data is compatible with your training data. Use the option `--use-pipe-from [MALLET TRAINING FILE]` in the MALLET `import-dir` command to specify a training file. Store the inferred topics for the abstract in a file called `inf-one.txt`.

**_NOTE: Each of the topic modeling steps outlined above that will be implemented in this code cell could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), the code in this cell should still be running on the server.  Give it some time._**

In [None]:
# We will first need to rerun our model with the --inferencer-filename option
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10', \
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20', \
       '--num-iterations', '20000', '--inferencer-filename', 'model.mallet']
print(terminal(args))

# Now use import-dir to pull our one file into a mallet data file,
#    then use infer-topics to run the model.mallet inferencer on it.

### BEGIN SOLUTION
args = ['/bin/mallet/bin/mallet', 'import-dir', '--input', './infer', '--output', 'one.mallet', '--use-pipe-from', \
        'data.mallet']
print(terminal(args))

args = ['/bin/mallet/bin/mallet', 'infer-topics', '--input', 'one.mallet', '--inferencer', 'model.mallet', \
  '--output-doc-topics', 'inf-one.txt', '--num-iterations', '10000']
print(terminal(args))

### END SOLUTION

In [None]:
# Test to see if file inf-one.txt was successfully written
f = open('./inf-one.txt', 'r')

## Resources for Topic Modeling  <a id='resources'></a>

* Back to the [Table of Contents](#Table-of-Contents)

Below you will find some tutorials and resources for topic modeling.
- [General Introduction to Topic Modeling](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
- [Topic Modeling for Humanists](http://www.scottbot.net/HIAL/?p=19113)
- [Interpretation of Topic Models](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf)