# Topic Modeling DFR Data with Gensim

This tutorial introduces the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library with data obtained from JSTOR's [Data for Research](http://dfr.jstor.org) portal.


Assumptions:
- You have Python installed and configured so you can install additional packages. (If you don't know exactly how to do this, I recommend installing the [Anaconda Python Distribution](https://www.continuum.io/why-anaconda).)
- You have or can get some data from JSTOR (described below).
- You know a little something about [topic modeling](http://mcburton.net/blog/joy-of-tm/).

## Installing all the things

There is a small pile of libraries, scripts, and data we need to install before we can start causing trouble with topic models.

### First things first, get some DATA

JSTOR offers a "Data For Research" service which lets researchers download derived data from their digital collections. By "derived data" I mean they provide metadata, wordcounts, and ngram counts from a selection of articles (only 1000 for self-service users). 

The basic workflow is as follows:
1. Use the web-based search tool to narrow down a selection of articles.
2. Once the set of documents has been determined, submit a *Dataset Request.*
3. Specify the **Download Options** for the dataset request.
    - For the **Data Type** select "Citations" and "Word Counts" (you could request the others, but we won't be using those data in this tutorial).
    - For the **Output Format** select "CSV" (this is a much easier format to work with than XML).
    - Give your request a **Job Title** (this is for your own purposes so you can keep track of requests).
    - Set the max number of articles to 1000.
4. Submit the request and wait for the job status to be "Completed" on your [DFR Requests page](http://dfr.jstor.org/fsearch/myrequests).

Eventually you'll be able to download a ZIP file containing the following files:

```
.
├── citations.tsv
├── MANIFEST.txt
├── README.txt
└── wordcounts/
    ├── wordcounts_10.2307_800826.CSV
    ├── wordcounts_10.2307_800827.CSV
    └── ...
```
*Note: if you included bigrams, trigrams, quadgrams, or keyterms, you'll see those folders in the directory as well.*

The files in the `wordcounts` directory are the ones we are most interested in for this tutorial. These files look like this:

```
WORDCOUNTS,WEIGHT
the,83
of,65
english,32
a,31
in,28
and,24
composition,19
school,18
for,14
high,13
v,13
i,12
...
```

That is, a column of words (headed `WORDCOUNTS`) and a column of word counts (headed `WEIGHT`).
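
To make the file format concrete, here is a minimal sketch of reading one of these files into (word, count) pairs with Python's standard `csv` module. The sample text is copied from the excerpt above, and the function name `parse_wordcounts` is my own invention, not part of any library.

```python
import csv
import io

# the first few rows of a wordcounts file, copied from the excerpt above
SAMPLE = """WORDCOUNTS,WEIGHT
the,83
of,65
english,32
"""

def parse_wordcounts(text):
    """Return a list of (word, count) pairs from the text of a wordcounts CSV."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the WORDCOUNTS,WEIGHT header row
    # ignore empty rows, convert the count column to an integer
    return [(row[0], int(row[1])) for row in reader if row]

pairs = parse_wordcounts(SAMPLE)
print(pairs)
```

For a real file you would pass the contents of one of the `wordcounts_*.CSV` files instead of `SAMPLE`.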

OK, we have data!

### The Gensim Topic Modeling library

Gensim is the main player in this tutorial. It is a text analysis library for Python that implements many interesting algorithms for "distant reading." Gensim can do various kinds of [topic modeling](https://radimrehurek.com/gensim/tut2.html), [similarity searches](https://radimrehurek.com/gensim/tut3.html), [TF-IDF](https://radimrehurek.com/gensim/models/tfidfmodel.html), and even ["deep learning" with word2vec](http://rare-technologies.com/word2vec-tutorial/). Basically, `gensim` is like a Swiss Army knife for doing machine learning on text. It is worth spending some time reading the [tutorials](https://radimrehurek.com/gensim/tutorial.html), [API documentation](https://radimrehurek.com/gensim/apiref.html), and the [author's blog](http://rare-technologies.com/blog/).

In [None]:
# if Anaconda is on your PATH. If you don't know what that means, run this cell.
!pip install gensim

In [None]:
# if Anaconda is not on your PATH
!~/anaconda/bin/pip install gensim

### A Stopwords list

When doing bag-of-words text processing you need a [stopword](https://en.wikipedia.org/wiki/Stop_words) list so you can clean out all the words that lose their meaning when you slice up the sequences of text.

We are going to use the stopword list that is part of the [MALLET topic modeling toolkit](https://github.com/mimno/Mallet/blob/master/stoplists/en.txt).

I have already included the stopword list in the repository for your convenience.
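
As a quick illustration of what a stopword list does, here is a minimal sketch that filters a handful of tokens against a tiny stand-in list. The six stopwords and the token list are made up for the example; the real `en.txt` file is much longer.

```python
# a tiny stand-in stopword list; the real en.txt file is much longer
stopwords = {"the", "of", "a", "in", "and", "for"}

tokens = ["the", "composition", "of", "english", "in", "high", "school"]

# keep only the tokens that are not stopwords
content_words = [t for t in tokens if t not in stopwords]
print(content_words)
```

Using a `set` for the stopwords keeps each membership check fast, which matters once the corpus grows to millions of tokens.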

## Boilerplate

In [11]:
# Set the name of the directory containing your unzipped DFR data
DFR_DATA_DIRECTORY = "2015.10.29.JzHzfAzZ/" # CHANGE ME OR USE INCLUDED DEMO DATA
SINGLE_CORPUS_FILE = "all_documents.txt"
STOPWORD_FILE = "en.txt"

In [12]:
# import various libraries
import gensim
import glob
import logging


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Data Preparation

To do topic modeling, you need to reshape the data so that the gensim library can work with it. In practice, this means combining all the individual CSV files containing wordcounts into a single master file with one document per line.

In [13]:

with open(SINGLE_CORPUS_FILE, 'w') as f:
    for csv_file in glob.glob(DFR_DATA_DIRECTORY+"wordcounts/*.CSV"):
        with open(csv_file, 'r') as csvfile:
            csvfile.readline() # skip the header line (WORDCOUNTS,WEIGHT)
            document = ""
            for line in csvfile:
                word, count = line.split(',')
                reshaped_document = (word + " ") * int(count)
                document += reshaped_document
            f.write(document+"\n")
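
To see what the reshaping does to a single article, here is a small in-memory sketch of the same idea: each word is repeated as many times as its count. The helper name `expand_counts` and the three sample pairs are assumptions made for the example.

```python
def expand_counts(pairs):
    """Repeat each word `count` times and join everything with spaces."""
    return " ".join(" ".join([word] * count) for word, count in pairs)

line = expand_counts([("english", 3), ("school", 2), ("v", 1)])
print(line)
```

Word order is lost in this representation, which is fine for a bag-of-words model such as LDA.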

In [18]:
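with open(SINGLE_CORPUS_FILE) as f:
    # build the vocabulary from the master file, one document per line
    # (note that [1:] skips the first token on each line)
    dictionary = gensim.corpora.Dictionary(line.split()[1:] for line in f)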


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(42727 unique tokens: ['uplifter', 'targets', 'vievr', 'motoring', 'sulpment']...) from 1000 documents (total 1978518 corpus positions)


In [19]:
# load stopwords into a python list
with open(STOPWORD_FILE) as f:
    stopwords = f.read().split('\n')
len(stopwords)


525

In [20]:
# to remove the stopwords I need to know their ID numbers
stop_ids = [dictionary.token2id[word] 
                for word in stopwords 
                if word in dictionary.token2id]
# use the filter_tokens function to remove the stopwords
dictionary.filter_tokens(stop_ids)
dictionary.compactify() # run this whenever you remove tokens to clean up gaps

In [17]:
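# optional: filter out words that appear in fewer than 2 documents
# (commented out here; the log output below appears to be left over from an earlier run with it enabled)
#dictionary.filter_extremes(no_below=2, no_above=1, keep_n=None)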

INFO:gensim.corpora.dictionary:discarding 20117 tokens: [('uplifter', 1), ('vievr', 1), ('sulpment', 1), ('jamess', 1), ('pettifogging', 1), ('projectsul', 1), ('whig', 1), ('spicer', 1), ('lukewarm', 1), ('curbed', 1)]...
INFO:gensim.corpora.dictionary:keeping 22103 tokens which were in no less than 2 and no more than 1000 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(22103 unique tokens: ['targets', 'obligation', 'impersonally', 'egd', 'enlivens']...)
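
The `no_below` threshold counts documents, not total occurrences. The sketch below illustrates that idea in plain Python on three made-up documents; it is not gensim's implementation, just the same filtering rule written out by hand.

```python
from collections import Counter

# three tiny made-up documents
documents = [
    ["english", "school", "uplifter"],
    ["english", "teacher", "school"],
    ["english", "grammar", "teacher"],
]

# count the number of documents each word appears in (not total occurrences)
doc_freq = Counter(word for doc in documents for word in set(doc))

NO_BELOW = 2  # keep words found in at least this many documents

kept = sorted(w for w, df in doc_freq.items() if df >= NO_BELOW)
print(kept)
```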


In [21]:
len(dictionary)

42220

In [22]:
with open(SINGLE_CORPUS_FILE) as f:
    # convert each document into a bag-of-words list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(line.split()[1:]) for line in f]
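
If the bag-of-words format is unfamiliar, this plain-Python sketch shows roughly what each entry of `corpus` looks like: a sorted list of (token ID, count) pairs, with unknown words ignored. The vocabulary `token2id` and the helper `to_bow` are hypothetical stand-ins for illustration, not gensim code.

```python
from collections import Counter

# hypothetical vocabulary mapping each known word to an integer ID
token2id = {"english": 0, "school": 1, "teacher": 2}

def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "the" is not in the vocabulary, so it is dropped
bow = to_bow("english english school the teacher english".split(), token2id)
print(bow)
```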

## Training the model

Now that the data is in the right shape, we can finally "train" the model and generate a set of topics.

In [23]:
NUM_TOPICS = 20
NUM_PASSES = 4

In [24]:
# train the model, NOTE: THIS STEP TAKES A LONG TIME
topic_model = gensim.models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=NUM_PASSES)

INFO:gensim.models.ldamodel:using symmetric alpha at 0.05
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online LDA training, 20 topics, 4 passes over the supplied corpus of 1000 documents, updating model once every 1000 documents, evaluating perplexity every 1000 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:-12.863 per-word bound, 7451.5 perplexity estimate based on a held-out corpus of 1000 documents with 870584 words
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #1000/1000
INFO:gensim.models.ldamodel:topic #5 (0.050): 0.016*english + 0.008*school + 0.007*work + 0.006*literature + 0.006*high + 0.005*teachers + 0.005*teacher + 0.004*time + 0.004*study + 0.004*schools
INFO:gensim.models.ldamodel:topic #14 (0.050): 0.012*english + 0.008*school + 0.006*work + 0.005*teachers + 0.005*literature + 0.004*pupils + 0.004*teacher + 0.004*class + 0.004*college + 0.004*student
INFO:
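
Two numbers in the log above can be checked by hand. The symmetric alpha of 0.05 is simply 1 divided by the number of topics, and the perplexity estimate is consistent with 2 raised to the negative per-word bound. The sketch below redoes that arithmetic using the values printed in the log; the small gap from the logged 7451.5 comes from the bound being rounded to three decimals.

```python
NUM_TOPICS = 20
per_word_bound = -12.863  # rounded value reported in the log above

# symmetric prior: every topic gets the same weight
alpha = 1.0 / NUM_TOPICS

# perplexity as 2 to the power of the negative per-word bound
perplexity = 2 ** (-per_word_bound)

print(alpha)
print(round(perplexity))
```

Lower perplexity generally means the model is less "surprised" by the held-out words, so it is a rough way to compare runs with different settings.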

In [25]:
# print the top words of each topic, one topic per line
for topic in topic_model.show_topics(num_topics=-1, formatted=False):
    print(" ".join([word[1] for word in topic]))

english school university cents chicago high postage single copies literature
english literature work teacher teachers teaching school composition students time
english study teacher work teachers student literature composition life reading
english grammar school speech teaching work study language teacher words
english school work student speech college class teachers public speaking
english school literature work high teacher class time teachers study
school thou english high creon debate thee work thy speaking
english work teaching literature teacher teachers class school time journal
english literature college school composition teachers schools study reading high
english work class teacher composition pupils school time pupil students
english life time work man good illustrations literature school examinations
english school teachers high work committee schools college council association
cd english play school plays en men work women york
english bathe page squire schools bird wo
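
The printing loop above treats each topic as a list of (weight, word) pairs, which matches the gensim release used when this was written. My understanding is that newer gensim releases return `(topic_id, [(word, weight), ...])` pairs from `show_topics(formatted=False)` instead, so the loop needs a small change there. The sketch below assumes that newer shape and uses hand-made data (weights copied from the training log for topics #5 and #14) rather than a trained model; `topic_lines` is a hypothetical helper.

```python
def topic_lines(topics):
    """Join the top words of each (topic_id, [(word, weight), ...]) entry."""
    return [" ".join(word for word, _ in words) for _, words in topics]

# hand-made stand-in for topic_model.show_topics(num_topics=-1, formatted=False)
example_topics = [
    (5, [("english", 0.016), ("school", 0.008), ("work", 0.007)]),
    (14, [("english", 0.012), ("school", 0.008), ("work", 0.006)]),
]

for line in topic_lines(example_topics):
    print(line)
```

With a trained model you would pass the result of `show_topics` to `topic_lines` in place of `example_topics`, after checking which shape your installed version returns.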