## Topic Modeling Using MALLET <a id='table'></a>
For this exercise, we will use data contained in the "homework" database on the Big Data for Social Science Class Server. This notebook walks you through topic modeling of NIH abstracts using [MALLET](http://mallet.cs.umass.edu/topics.php).

## Table of Contents
- [Getting Data](#Getting-Data)
  - [Exercise 1](#Exercise-1)
- [Generating Topics](#Generating-Topics)
  - [Exercise 2](#Exercise-2)
  - [Exercise 3](#Exercise-3)
- [Inferencing Topics](#inferencing_topics)
  - [Exercise 4](#Exercise-4)
- [Resources](#resources)

## Getting Data

* Back to the [Table of Contents](#Table-of-Contents)

We will be using the NIH abstracts stored in the 'TextAnalysis' table in the 'homework' database. This table was created by taking a sample of abstracts from the broader 'umetricsgrants' database.

MALLET is a Java-based text analysis tool that makes topic modeling very easy. However, MALLET is primarily a command line tool and requires a specific format for its data. You can read more about importing data into MALLET [here.](http://mallet.cs.umass.edu/import.php)

We will create a text file for each individual abstract. Let us first create a temporary directory in our current working directory.

Python can send commands to the terminal using the 'subprocess' module, so you never have to leave the IPython notebook. We will use it to create an empty 'temp' directory. For your convenience, the terminal() function has already been created. It takes a list of arguments and returns the output from executing the command.

In [None]:
# Importing the modules we will use in this workbook
from subprocess import Popen, PIPE
import os
import MySQLdb
import string
import nltk
import re
from nltk.corpus import stopwords

In [None]:
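def terminal(args):
    """Run a command given as a list of arguments and return its output as a string."""
    pipe = Popen(args, stdout=PIPE, stderr=PIPE)
    # communicate() waits for the command to finish before returning
    text, err = pipe.communicate()
    text = text.decode()
    err = err.decode()
    # Prefer standard output; fall back to the error stream if stdout is empty
    if len(text) > 2:
        return text
    elif len(err) > 2:
        return err
    else:
        print("No output returned")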

The following call will make a temporary directory in your current working directory.

In [None]:
terminal(['mkdir', 'temp'])
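
Note that `mkdir` reports an error if the directory already exists, for example when the notebook is rerun. As an alternative sketch, the standard-library call `os.makedirs` with `exist_ok=True` creates the directory without leaving Python and is safe to repeat. The helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
import os

def ensure_dir(path):
    """Create the directory if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)

# Safe to call more than once.
temp_dir = ensure_dir('temp')
print(temp_dir)
```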

Now let's retrieve the abstracts and their ids, and store each abstract in a file with the id as the filename.

In [None]:
# Create MySQL connection
user = "<user>"
password = "<password>"
database = "homework"

# invoke the connect() function, passing parameters in variables.
db = MySQLdb.connect( user = user, passwd = password, db = database )

# output basic database connection info.
print( db )

cursor = db.cursor( MySQLdb.cursors.DictCursor )

Now let's create a function that takes in a filename and text, and creates a new file populated with the text.

In [None]:
def writeFile(filename, data):
    # The with-statement closes the file even if the write fails
    with open(filename, "w") as f:
        f.write(str(data))

We wrote a function to do some initial cleaning of the abstracts. You will notice that we are removing words that would be very common in NIH abstracts, because we don't want them to bias the results. We are also removing stopwords (MALLET can also do that), as well as punctuation.

In [None]:
def cleanAbstract(text):
    # Words that are very common in NIH abstracts and carry little topical meaning
    commonWords = ['study', 'project', 'experiment', 'abstract', 'description', 'studies',
                   'abstracts', 'projects', 'experiments', 'descriptions']
    # Replace line breaks and tabs with a space so adjacent words are not fused together
    text = re.sub('[\n\t\r\f]+', ' ', text).lower()
    tokens = nltk.word_tokenize(text)
    # Remove English stopwords
    stop = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop]
    # Strip punctuation characters from each token and drop tokens that become empty
    exclude = set(string.punctuation)
    tokenNew = []
    for s in tokens:
        snew = ''.join(ch for ch in s if ch not in exclude)
        if snew != "":
            tokenNew.append(snew)
    tokenNew = [t for t in tokenNew if t not in commonWords]
    abstract = ' '.join(tokenNew)
    return abstract
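
To see what the cleaning steps do, here is a simplified, self-contained sketch that applies the same sequence (lowercase, tokenize, drop stopwords, strip punctuation, drop common words) to one sentence. It is an illustration only: it splits on whitespace instead of using `nltk.word_tokenize`, and the tiny `STOP` set stands in for NLTK's English stopword list, so its output can differ from that of cleanAbstract(). The names `clean_sketch`, `STOP`, and `COMMON` are hypothetical.

```python
import re
import string

STOP = {'the', 'of', 'in', 'a', 'an', 'and', 'this', 'we', 'is'}   # stand-in stopword list
COMMON = {'study', 'studies', 'project', 'projects', 'abstract'}    # subset of commonWords

def clean_sketch(text):
    # Collapse line breaks and tabs into single spaces, then lowercase.
    text = re.sub(r'[\n\t\r\f]+', ' ', text).lower()
    # Whitespace tokenization (assumption: simpler than nltk.word_tokenize).
    tokens = [t for t in text.split() if t not in STOP]
    # Strip punctuation characters from each token.
    table = str.maketrans('', '', string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    # Drop tokens that became empty and words common to all abstracts.
    tokens = [t for t in tokens if t and t not in COMMON]
    return ' '.join(tokens)

cleaned = clean_sketch("In this study,\nwe examine the role of T-cells in asthma.")
print(cleaned)  # examine role tcells asthma
```

Note that stripping punctuation turns "T-cells" into "tcells"; whether that is desirable depends on your corpus.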

### Exercise 1

* Back to the [Table of Contents](#Table-of-Contents)

Retrieve the abstracts one by one from the database and write them to text files in the temp directory. For your convenience, the writeFile() function has already been created. You just need to call it with the full path of the filename and the contents of the abstract.

In [None]:
# First create the query that you need to get the abstracts (skipping rows with no abstract text)
query = 'SELECT * FROM TextAnalysis WHERE ABSTRACT_TEXT IS NOT NULL LIMIT 1000;'

#Execute the query
cursor.execute(query)

### BEGIN SOLUTION
#Fetch the results one by one and write them to a file
row = cursor.fetchone()
while (row is not None):
    ID = row['APPLICATION_ID']
    abstract = row['ABSTRACT_TEXT']
    abstract = cleanAbstract(abstract)
    filename = './temp/' + str(ID) + ".txt"
    writeFile(filename, abstract)
    row = cursor.fetchone()
### END SOLUTION

In [None]:
# Test to see if file was successfully written
f = open('./temp/6187933.txt', 'r')

## Generating Topics

* Back to the [Table of Contents](#Table-of-Contents)

We have created a number of .txt files in the temp directory. Each of these files is a single abstract, and the set of all these files together is a corpus of data. Our next task is to transform these individual files into a single MALLET-format file. To achieve this, we will use the import command. The import command can read in an entire directory, turn it into a MALLET file, and also strip out common English stopwords. Our command will look something like this:

/bin/mallet/bin/mallet import-dir --input path/to/temp/directory --output data.mallet --keep-sequence --remove-stopwords

Let's decompose this command:
- /bin/mallet/bin/mallet is the path to the MALLET program
- import-dir is a command that tells MALLET to import an entire directory
- --input tells MALLET where the corpus of data is located. In MALLET, option names begin with "--", and "-" takes the place of spaces within an option name
- path/to/temp/directory is the actual path of the corpus of data
- --output tells MALLET where to store the output
- --keep-sequence preserves each document as a sequence of words rather than a vector of word counts, which MALLET's topic modeling requires
- --remove-stopwords removes common English stopwords like a, an, the.
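
Because terminal() expects a list of arguments, one way to turn a command line like the one above into such a list is the standard-library function `shlex.split`, which splits on whitespace the way a shell would. This is only a sketch; the paths are the placeholder values used in this notebook.

```python
import shlex

command = ('/bin/mallet/bin/mallet import-dir --input ./temp '
           '--output data.mallet --keep-sequence --remove-stopwords')

# Each whitespace-separated piece becomes one list element.
args = shlex.split(command)
print(args)
```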


### Exercise 2

* Back to the [Table of Contents](#Table-of-Contents)

Now use the terminal() function to run the MALLET import command. Remember, terminal() takes in a list of arguments, so decompose the import command into a list of arguments.

In [None]:
### BEGIN SOLUTION
args  = ['/bin/mallet/bin/mallet', 'import-dir', '--input',  './temp', '--output', \
         'data.mallet', '--keep-sequence', '--remove-stopwords']
print(terminal(args))
### END SOLUTION

In [None]:
# Test to see if file data.mallet was successfully written
f = open('./data.mallet', 'r')

If you go to your working directory, you will find a data.mallet file. This is the input file that MALLET will use to generate topics.

You can use the train-topics command in MALLET to generate your very own topic models. We will execute this command below using only the default settings.

In [None]:
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet']
print(terminal(args))

This command opens data.mallet and runs the topic modeling algorithm on it using the default settings, reporting the topics periodically during training. With those defaults, MALLET trains 10 topics and prints the top words of each topic every 50 iterations. A good way to judge whether the algorithm has converged is to compare successive printouts: if the topics no longer change much, the algorithm has likely converged.
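
Note that terminal() waits for the command to finish before returning anything, so a long training run shows no progress until the end. A minimal sketch of a streaming alternative follows. It merges the error stream into standard output, on the assumption that progress messages may arrive on either stream. The function name `stream_command` is hypothetical, and the demonstration runs a small Python command rather than MALLET.

```python
import sys
from subprocess import Popen, PIPE, STDOUT

def stream_command(args):
    """Run a command and yield its output one line at a time as it is produced."""
    # stderr=STDOUT merges both streams so progress messages are not lost.
    with Popen(args, stdout=PIPE, stderr=STDOUT, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip('\n')

# Demonstration with a harmless command; replace the list with a MALLET call.
demo = [sys.executable, '-c', "print('iteration 50'); print('iteration 100')"]
lines = list(stream_command(demo))
for line in lines:
    print(line)
```

In a notebook you would loop over `stream_command(args)` directly and print each line as it arrives, instead of collecting the lines into a list first.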

You can read more about the different options that can be used to fine tune the results [here.](http://mallet.cs.umass.edu/topics.php)

### Exercise 3

* Back to the [Table of Contents](#Table-of-Contents)

We ran the topic modeling algorithm, but we didn't save the output anywhere. If you look at the documentation pointed to above, it gives you different options to store the output. Modify the MALLET command to output topic keys, topic composition of documents, and a serialized MALLET topic trainer object. Add the option to enable hyperparameter optimization, increase the number of sampling iterations to 20,000, and increase the number of topics to 20. Store the output of --output-doc-topics in a file called `docTopics.txt`

In [None]:
#  Modify the MALLET command to output topic keys, topic composition of documents, and a serialized MALLET topic trainer object.
# Add the option to enable hyperparameter optimization, increase the number of sampling iterations to 20,000,
# and increase the number of topics to 20.

### BEGIN SOLUTION
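# --output-model writes the serialized topic trainer object; the filename topicModel.mallet is our own choice
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10',
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20',
        '--num-iterations', '20000', '--output-model', 'topicModel.mallet']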
print(terminal(args))
### END SOLUTION

In [None]:
# Test to see if file docTopics.txt was successfully written
f = open('./docTopics.txt', 'r')

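Now let's look at some results. Open the topicKeys.txt and docTopics.txt files. In topicKeys.txt, the first number on each line is the topic number (for example, topic 0), and the second number is the Dirichlet parameter for that topic, which gives an indication of its weight. The rest of the line lists the top words of the topic.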

docTopics.txt shows what topics compose your corpus of data. For example, abstract id 630248 had topic 0 as its main topic, at about 52%. Using this output, you can find connections between documents that you might not have realized otherwise. 
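
To work with the topic keys in Python rather than reading the file by eye, the lines can be parsed into a small list of tuples. The sketch below assumes each line is tab-separated as topic number, weight, and a space-separated list of top words; this layout may vary between MALLET versions, so check your own file first. The sample text is made up for illustration and is not real output, and `parse_topic_keys` is a hypothetical helper.

```python
def parse_topic_keys(text):
    """Parse topic-key lines into (topic_id, weight, words) tuples."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split('\t', 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics

# Made-up sample in the assumed format (not real MALLET output).
sample = "0\t0.25\tcell protein gene\n1\t0.10\tpatients clinical treatment\n"
topics = parse_topic_keys(sample)

# Sort so the topic with the largest weight comes first.
for topic_id, weight, words in sorted(topics, key=lambda t: t[1], reverse=True):
    print(topic_id, weight, ' '.join(words))
```

To apply it to your own results, read topicKeys.txt into a string and pass that string to `parse_topic_keys`.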

## Inferencing Topics <a id='inferencing_topics'></a>

* Back to the [Table of Contents](#Table-of-Contents)

You can use your newly trained model to infer topics for unseen documents. Since we used the first 1000 abstracts to train the model, let us use the model to infer topics for the 1001st abstract.

In [None]:
cursor.execute('SELECT * FROM TextAnalysis LIMIT 1 OFFSET 1000')
data = cursor.fetchone()

# Create a new inference directory
terminal(['mkdir', 'infer'])
# Clean the abstract the same way as the training data before writing it out
writeFile('./infer/' + str(data["APPLICATION_ID"]) + ".txt", cleanAbstract(data["ABSTRACT_TEXT"]))

The documentation for topic inference can be found [here](http://mallet.cs.umass.edu/topics.php).

### Exercise 4

* Back to the [Table of Contents](#Table-of-Contents)

<a id='exercise_1'></a>Use the MALLET documentation to infer topics for the file we just wrote to the infer folder. As mentioned in the documentation, make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command import-dir to specify a training file. Store the inferred topics for the abstract in a file called `inf-one.txt`.

In [None]:
# We will first need to rerun our model with the --inferencer-filename option
args = ['/bin/mallet/bin/mallet', 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10', \
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20', \
       '--num-iterations', '20000', '--inferencer-filename', 'model.mallet']
print(terminal(args))
# Now import the one file that is in the infer folder and run the inferencer on it.
### BEGIN SOLUTION
args = ['/bin/mallet/bin/mallet', 'import-dir', '--input', './infer', '--output', 'one.mallet', '--use-pipe-from', \
        'data.mallet']
print(terminal(args))

# The input must be the file written by import-dir above (one.mallet)
args = ['/bin/mallet/bin/mallet', 'infer-topics', '--input', 'one.mallet', '--inferencer', 'model.mallet',
        '--output-doc-topics', 'inf-one.txt', '--num-iterations', '10000']
print(terminal(args))

### END SOLUTION

In [None]:
# Test to see if file inf-one.txt was successfully written
f = open('./inf-one.txt', 'r')

## Resources for Topic Modeling  <a id='resources'></a>

* Back to the [Table of Contents](#Table-of-Contents)

Below you will find some tutorials and resources for topic modeling.
- [General Introduction to Topic Modeling](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
- [Topic Modeling for Humanists](http://www.scottbot.net/HIAL/?p=19113)
- [Interpretation of Topic Models](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf)