# Topic Modeling Editors' Choice articles from *Digital Humanities Now*

This notebook steps through the process of analyzing articles from DHNow using a couple of techniques.

### Assemble a corpus of articles
First we'll get our articles from the file system. We'll use the `glob` module to grab just the numbered `.txt` files, that is, the ones whose names start with a digit.

In [1]:
import os
import glob
filenames = glob.glob('inputs/[0-9]*.txt')

We can quickly check to make sure we've got the right files by inspecting the `filenames` list we've just created.

Next we check file sizes and keep only the files larger than 300 bytes.

In [3]:
filenames = [x for x in filenames if os.path.getsize(x) > 300]
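
To see what the glob pattern and the size filter keep, here is a small self-contained sketch. It builds a throwaway directory with hypothetical file names (`1.txt`, `42.txt`, `notes.txt`) and made-up contents, so it does not touch the real `inputs/` folder.

```python
import glob
import os
import tempfile

# Build a throwaway directory with hypothetical files (not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    samples = {"1.txt": "x" * 500, "42.txt": "x" * 100, "notes.txt": "x" * 500}
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)

    # '[0-9]*.txt' matches names that start with a digit and end in .txt.
    matched = sorted(glob.glob(os.path.join(tmp, "[0-9]*.txt")))
    matched_names = [os.path.basename(p) for p in matched]

    # Keep only the files larger than 300 bytes.
    kept_names = [os.path.basename(p) for p in matched if os.path.getsize(p) > 300]

print(matched_names)
print(kept_names)
```

Here `notes.txt` is skipped by the pattern and `42.txt` is skipped by the size threshold.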

### LDA Topic Modeling
Now let's run the corpus through LDA. Here we'll use the gensim module, which provides a [standard API](https://radimrehurek.com/gensim/apiref.html) for working with corpora. The gensim module has its own implementation of LDA and also includes a wrapper for MALLET. If we want to use MALLET, the wrapper spares us from working with Java and the command line directly, and from keeping track of the temporary text files MALLET generates.

In [4]:
from gensim import corpora, models, utils
import nltk

We'll use gensim to build a corpus from our text files. The tokenization can be customized as needed. We also need a stoplist; any list will do, and here we'll use the English one from the nltk module.

In [5]:
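# Requires the NLTK stopwords data, e.g. via nltk.download("stopwords").
stoplist = set(nltk.corpus.stopwords.words("english"))

class MyCorpus(corpora.TextCorpus):
    def get_texts(self):
        # Yield one list of lowercased, de-accented, non-stopword tokens per file.
        for filename in filenames:
            with open(filename) as f:
                text = f.read()
            yield [x for x in utils.tokenize(text, lowercase=True, deacc=True, errors="ignore")
                   if x not in stoplist]

corpus = MyCorpus(filenames)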
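
The idea behind `get_texts` (lowercase, split into alphabetic tokens, drop stopwords) can be tried without gensim or nltk. The sketch below is only a stand-in: it uses a simple regular expression rather than `utils.tokenize`, a tiny hypothetical stoplist rather than nltk's, and `simple_tokens` is a made-up helper name.

```python
import re

# Hypothetical stoplist; the notebook uses nltk's English stopwords instead.
stoplist = {"the", "of", "and", "a", "in"}

def simple_tokens(text):
    """Lowercase the text and yield alphabetic tokens that are not stopwords."""
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        if token not in stoplist:
            yield token

tokens = list(simple_tokens("The History of Text Mining in 2014"))
print(tokens)
```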

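We've already removed stopwords; now let's prune the dictionary. Note that `no_below` is a minimum document count, so `no_below=1` keeps every word that appears in at least one document; dropping words that appear in just one document would take `no_below=2`. As written, the call still applies the other defaults of `filter_extremes`, which drop words found in more than half of the documents (`no_above=0.5`) and cap the vocabulary at 100,000 words (`keep_n=100000`).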

In [6]:
corpus.dictionary.filter_extremes(no_below=1)
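
The effect of a minimum document count can be sketched in plain Python. This is an illustration with hypothetical toy documents, not gensim's implementation, and `keep_at_least` is a made-up helper name.

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["topic", "model", "archive"],
    ["topic", "model", "map"],
    ["topic", "network"],
]

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def keep_at_least(min_docs):
    """Return the words that appear in at least `min_docs` documents."""
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)

print(keep_at_least(1))  # a threshold of 1 drops nothing
print(keep_at_least(2))  # a threshold of 2 drops words found in only one document
```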

In [7]:
print(corpus.dictionary)

Dictionary(90343 unique tokens: ['kaigaya', 'piwik', 'wallenberger', 'demasiado', 'ma³n']...)


The gensim module merely points to MALLET, which you'll still need to have installed, and you'll need to tell gensim where to find it.

In [8]:
mallet_path = '/Users/stakats/Development/mallet-2.0.7/bin/mallet'

Now we'll use the gensim wrapper to run MALLET on our newly created corpus. We can set whatever MALLET parameters we want right from here. Note that the `models.wrappers` module used below is only available in gensim releases before 4.0.

In [9]:
model = models.wrappers.LdaMallet(mallet_path, corpus, num_topics=40, optimize_interval=10, id2word=corpus.dictionary)

Now that we've run MALLET, we can get the paths to the doc-topics and topic-keys files it generated.

In [10]:
doctopics = model.fdoctopics()
topickeys = model.ftopickeys()

Now we'll dump the doctopics into a dataframe, and then add the filenames as a column because gensim doesn't keep track of this metadata.

In [11]:
import pandas as pd
import numpy as np
import itertools
import operator

We have to jump through a lot of hoops to rearrange the MALLET output into a form that's usable.

In [12]:
def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
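
As a quick check of what `grouper` does, the sketch below pairs up a flat list of alternating topic numbers and shares. The values are hypothetical and only mimic the layout of one doc-topics row.

```python
import itertools

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

# Hypothetical values from one doc-topics row: topic, share, topic, share, ...
values = ["4", "0.61", "0", "0.30", "7", "0.09"]
pairs = [(int(topic), float(share)) for topic, share in grouper(2, values)]
print(pairs)
```

Because the same iterator is repeated `n` times, each output tuple consumes `n` consecutive items from the input.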

In [13]:
doctopic_triples = []
mallet_docnames = []

In [14]:
with open(doctopics) as f:
    f.readline()  # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

In [15]:
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

In [16]:
num_docs = len(mallet_docnames)

In [17]:
num_topics = len(doctopic_triples) // len(mallet_docnames)

In [18]:
doctopic = np.zeros((num_docs, num_topics))

In [19]:
for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share

In [20]:
df = pd.DataFrame(doctopic_triples)

In [21]:
table = pd.pivot_table(df, values=2, index=[0], columns=[1], aggfunc=np.sum)
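
The pivot step turns the long list of (document, topic, share) triples into one row per document and one column per topic. Here is a minimal sketch with hypothetical triples; it passes `aggfunc="sum"` as a string, which should behave the same as `np.sum` for this purpose.

```python
import pandas as pd

# Hypothetical (document, topic, share) triples for two documents and two topics.
triples = [("0", 0, 0.7), ("0", 1, 0.3), ("1", 0, 0.2), ("1", 1, 0.8)]
df = pd.DataFrame(triples)

# Columns are labelled 0, 1, 2 by default: 0 = document, 1 = topic, 2 = share.
table = pd.pivot_table(df, values=2, index=[0], columns=[1], aggfunc="sum")
print(table)
```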

In [22]:
table.index.name = 'id'

In [23]:
table.reset_index(inplace=True)

In [24]:
table = table.applymap(float)

In [25]:
table = table.sort_values(['id'], ascending=[True])

In [26]:
indexed_df = table.set_index(['id'])

In [27]:
indexed_df

id,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,0.000021,0.011385,0.000077,0.001170,0.184208,0.123002,0.000166,0.000191,0.000576,0.000145,...,0.000132,0.000160,0.000243,0.000452,0.015714,0.000094,0.000276,0.000210,0.051770,0.000249
1,0.000086,0.490842,0.000320,0.004853,0.003526,0.003167,0.000688,0.000791,0.002389,0.000603,...,0.000549,0.000662,0.001009,0.001873,0.001799,0.000390,0.001146,0.000869,0.003474,0.001033
2,0.000051,0.002928,0.000188,0.164677,0.002078,0.001866,0.000405,0.162283,0.013855,0.000355,...,0.000323,0.000390,0.000594,0.225158,0.013507,0.000230,0.000676,0.000512,0.051837,0.013056
3,0.000116,0.034956,0.000428,0.006501,0.004724,0.004242,0.000921,0.001059,0.003200,0.000808,...,0.000735,0.000886,0.001351,0.002509,0.087305,0.000523,0.001536,0.029463,0.004653,0.001383
4,0.000126,0.007275,0.000468,0.038028,0.005162,0.004636,0.001007,0.001158,0.003497,0.000883,...,0.000804,0.031892,0.001477,0.002741,0.002633,0.000571,0.001678,0.001273,0.036009,0.001512
5,0.000062,0.003546,0.000228,0.003463,0.002516,0.002260,0.000491,0.000564,0.001705,0.000430,...,0.000392,0.000472,0.000720,0.001336,0.227400,0.000278,0.000818,0.060918,0.002479,0.000737
6,0.000012,0.054870,0.000046,0.235392,0.235203,0.000451,0.000098,0.000113,0.102646,0.000086,...,0.000078,0.000094,0.000144,0.000267,0.265047,0.000056,0.000163,0.000124,0.000495,0.003156
7,0.000032,0.001819,0.000117,0.001776,0.009022,0.001159,0.000252,0.000289,0.000874,0.000221,...,0.000201,0.000242,0.000369,0.000685,0.000658,0.000143,0.000420,0.000318,0.001271,0.000378
8,0.000110,0.327693,0.000405,0.006153,0.004470,0.030798,0.000872,0.001003,0.003029,0.000765,...,0.000696,0.000839,0.001279,0.082722,0.002281,0.000495,0.001453,0.001102,0.004404,0.001309
9,0.000116,0.006658,0.000428,0.006501,0.004724,0.004242,0.000921,0.001059,0.003200,0.000808,...,0.000735,0.000886,0.001351,0.002509,0.002410,0.142015,0.001536,0.001165,0.004653,0.001383


In [28]:
filenamesseries = pd.Series(filenames)

In [29]:
indexed_df.insert(0, 'filenames', filenamesseries, allow_duplicates=False)

In [30]:
indexed_df

id,filenames,0,1,2,3,4,5,6,7,8,...,30,31,32,33,34,35,36,37,38,39
0,inputs/1.txt,0.000021,0.011385,0.000077,0.001170,0.184208,0.123002,0.000166,0.000191,0.000576,...,0.000132,0.000160,0.000243,0.000452,0.015714,0.000094,0.000276,0.000210,0.051770,0.000249
1,inputs/10.txt,0.000086,0.490842,0.000320,0.004853,0.003526,0.003167,0.000688,0.000791,0.002389,...,0.000549,0.000662,0.001009,0.001873,0.001799,0.000390,0.001146,0.000869,0.003474,0.001033
2,inputs/100.txt,0.000051,0.002928,0.000188,0.164677,0.002078,0.001866,0.000405,0.162283,0.013855,...,0.000323,0.000390,0.000594,0.225158,0.013507,0.000230,0.000676,0.000512,0.051837,0.013056
3,inputs/1002.txt,0.000116,0.034956,0.000428,0.006501,0.004724,0.004242,0.000921,0.001059,0.003200,...,0.000735,0.000886,0.001351,0.002509,0.087305,0.000523,0.001536,0.029463,0.004653,0.001383
4,inputs/1005.txt,0.000126,0.007275,0.000468,0.038028,0.005162,0.004636,0.001007,0.001158,0.003497,...,0.000804,0.031892,0.001477,0.002741,0.002633,0.000571,0.001678,0.001273,0.036009,0.001512
5,inputs/1006.txt,0.000062,0.003546,0.000228,0.003463,0.002516,0.002260,0.000491,0.000564,0.001705,...,0.000392,0.000472,0.000720,0.001336,0.227400,0.000278,0.000818,0.060918,0.002479,0.000737
6,inputs/1008.txt,0.000012,0.054870,0.000046,0.235392,0.235203,0.000451,0.000098,0.000113,0.102646,...,0.000078,0.000094,0.000144,0.000267,0.265047,0.000056,0.000163,0.000124,0.000495,0.003156
7,inputs/101.txt,0.000032,0.001819,0.000117,0.001776,0.009022,0.001159,0.000252,0.000289,0.000874,...,0.000201,0.000242,0.000369,0.000685,0.000658,0.000143,0.000420,0.000318,0.001271,0.000378
8,inputs/1010.txt,0.000110,0.327693,0.000405,0.006153,0.004470,0.030798,0.000872,0.001003,0.003029,...,0.000696,0.000839,0.001279,0.082722,0.002281,0.000495,0.001453,0.001102,0.004404,0.001309
9,inputs/1012.txt,0.000116,0.006658,0.000428,0.006501,0.004724,0.004242,0.000921,0.001059,0.003200,...,0.000735,0.000886,0.001351,0.002509,0.002410,0.142015,0.001536,0.001165,0.004653,0.001383


In [31]:
indexed_df.to_csv(path_or_buf='doctopics.csv', sep='\t')

In [34]:
import shutil
shutil.copyfile(topickeys, 'topickeys.txt')

'topickeys.txt'