# MALLET through the command line

MALLET is a Java-based package for statistical natural language processing; it includes a toolkit for topic modeling.

Download and install MALLET:
http://mallet.cs.umass.edu/download.php

**For gensim to use MALLET correctly, put the MALLET folder in a directory that Python can access, such as /usr/. The examples below assume it is installed at /usr/mallet-2.0.8.**


The following examples were created with reference to:

- Getting Started with Topic Modeling and MALLET: https://programminghistorian.org/lessons/topic-modeling-and-mallet#mac-instructions
- Topic Modelling using LDA with MALLET: https://diliprajbaral.com/2017/06/04/topic-modelling-lda-mallet/
- The MALLET documentation: http://mallet.cs.umass.edu/topics.php

The data used is the Reuters-21578 "ApteMod" corpus for text categorization, as provided in the NLTK data. It can be downloaded from the GitHub subdirectory for this week.


## Import data

In [1]:
!/usr/mallet-2.0.8/bin/mallet import-dir --input ./nltk_data/corpora/reuters/training/ --output reuters.mallet --keep-sequence --remove-stopwords

Labels = 
   ./nltk_data/corpora/reuters/training/
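
The `!` prefix only works inside a notebook. As a minimal sketch, the same import could be launched from a plain Python script with `subprocess`. The helper name `build_import_cmd` is hypothetical, the install path is the one assumed in this notebook, and only flags already shown above are used.

```python
import subprocess

# Path used in this notebook; adjust it to your own installation.
MALLET_BIN = "/usr/mallet-2.0.8/bin/mallet"


def build_import_cmd(input_dir, output_file, remove_stopwords=True):
    """Build the argument list for `mallet import-dir` (hypothetical helper)."""
    cmd = [MALLET_BIN, "import-dir",
           "--input", input_dir,
           "--output", output_file,
           "--keep-sequence"]
    if remove_stopwords:
        cmd.append("--remove-stopwords")
    return cmd


cmd = build_import_cmd("./nltk_data/corpora/reuters/training/", "reuters.mallet")
print(" ".join(cmd))

# To actually run the import (requires MALLET at MALLET_BIN and the input data):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.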


### Note on stopwords

http://mallet.cs.umass.edu/import-stoplist.php

--remove-stopwords invokes a standard list of English stopwords that is included in the compiled Java code.

--extra-stopwords [filename] adds stopwords to the standard list. This option takes effect only if --remove-stopwords or --stoplist-file (or both) is also used.

--stoplist-file [filename] gives us complete control over the stoplist. The option can be used by itself, without --remove-stopwords. When it is given, the default in-memory list is not used, even if --remove-stopwords is also invoked.

The format for the stoplist: words are separated by spaces, tabs, or newlines
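
Because the format is just whitespace-separated words, a custom stoplist is easy to generate. The sketch below writes one word per line; the function name, the file name `extra_stopwords.txt`, and the example words are all illustrative assumptions, and lowercasing the words is a choice made here, not a MALLET requirement.

```python
from pathlib import Path


def write_stoplist(words, path):
    """Write unique, lowercased stopwords to `path`, one per line."""
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Illustrative abbreviations only; not taken from any official list.
written = write_stoplist(["mln", "dlrs", "pct", "Pct", " cts "], "extra_stopwords.txt")
print(written)
```

The resulting file could then be passed to --stoplist-file.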

The command below imports the data using the stopword list that ships with the Reuters corpus in the NLTK data.

In [2]:
!/usr/mallet-2.0.8/bin/mallet import-dir --input ./nltk_data/corpora/reuters/training/ --output reuters_self_stopwords.mallet --keep-sequence --stoplist-file ./nltk_data/corpora/reuters/stopwords

Labels = 
   ./nltk_data/corpora/reuters/training/


## Train the topic model

In [3]:
!/usr/mallet-2.0.8/bin/mallet train-topics --input reuters_self_stopwords.mallet --inferencer-filename inferencer.mallet --num-topics 10 --optimize-interval 10  --output-state topic-state.gz --output-topic-keys reuters_keys.txt --output-doc-topics reuters_composition.txt

Mallet LDA: 10 topics, 4 topic bits, 1111 topic mask
Data loaded.
max tokens: 732
total tokens: 536196
<10> LL/token: -8.47479
<20> LL/token: -8.17669
<30> LL/token: -8.08389
<40> LL/token: -8.03565

0	0.5	bank billion pct mln rate market stg rates banks money interest central marks dollar u.k dealers fed today week yen 
1	0.5	mln cts net loss dlrs shr profit qtr year revs note oper avg shrs sales prior share record div corp 
2	0.5	pct billion u.s year february january trade japan rise rose growth japanese deficit surplus exports fell imports december dollar prices 
3	0.5	gold pct spokesman plant union south company profits turnover statement expected group mine strike years year plans mining workers pay 
4	0.5	shares dlrs pct stock share mln company offer group common stake outstanding exchange corp securities investment buy bid cash tender 
5	0.5	corp sale unit company sell american u.s acquisition business subsidiary agreed canadian agreement group terms international buy pacific co

[beta: 0.0413] 
<300> LL/token: -7.72745
[beta: 0.04183] 
<310> LL/token: -7.72568
[beta: 0.04201] 
<320> LL/token: -7.72178
[beta: 0.04233] 
<330> LL/token: -7.71944
[beta: 0.04242] 
<340> LL/token: -7.71451

0	0.05938	bank pct market rate mln stg banks rates money billion dollar interest central dlrs exchange dealers fed week yen currency 
1	0.07919	mln net cts loss dlrs shr profit year qtr revs oper note share avg shrs sales includes corp gain mths 
2	0.06005	pct billion year january february mln rose dlrs rise growth fell december quarter compared deficit sales u.s surplus francs prices 
3	0.03382	gold south union mine spokesman strike copper china oil mining gulf ounces port company tons workers ships government pct shipping 
4	0.05229	shares offer pct dlrs share company stock group mln corp stake common board bid merger shareholders tender securities outstanding exchange 
5	0.07499	dlrs company mln corp unit sale acquisition pct sell american business subsidiary agreement assets 

[beta: 0.04315] 
<550> LL/token: -7.70375
[beta: 0.04318] 
<560> LL/token: -7.7022
[beta: 0.04319] 
<570> LL/token: -7.70261
[beta: 0.04319] 
<580> LL/token: -7.7032
[beta: 0.04333] 
<590> LL/token: -7.70149

0	0.05163	bank pct market rate stg mln rates banks money billion dollar exchange interest dlrs central dealers fed currency yen u.k 
1	0.07073	mln net cts loss dlrs shr profit year qtr revs oper note share avg shrs sales includes quarter corp gain 
2	0.05541	pct billion year mln january february rose dlrs rise growth fell quarter december compared sales earlier prices deficit increase u.s 
3	0.02929	gold union south mine china spokesman strike copper oil mining ounces workers tons company gulf port ships shipping tonnes plant 
4	0.04178	shares offer pct dlrs share company stock group corp stake mln common merger bid board tender shareholders securities outstanding exchange 
5	0.0721	dlrs company mln corp unit sale acquisition pct sell business agreement american subsidiary assets 

[beta: 0.0432] 
<800> LL/token: -7.6929
[beta: 0.04328] 
<810> LL/token: -7.69232
[beta: 0.0433] 
<820> LL/token: -7.69368
[beta: 0.04323] 
<830> LL/token: -7.69207
[beta: 0.04336] 
<840> LL/token: -7.69034

0	0.05116	bank pct market rate mln stg rates banks billion money dollar interest central exchange dealers fed dlrs currency yen week 
1	0.06666	mln cts net loss dlrs shr profit year qtr revs oper note share avg shrs sales includes corp gain mths 
2	0.0651	pct billion year mln dlrs january february rose quarter rise growth fell compared december sales earnings prices increase u.s earlier 
3	0.02812	gold south spokesman union mine strike china copper mining ounces port workers tons company ships gulf shipping ore plant tonnes 
4	0.04357	shares offer pct dlrs share stock company group mln stake corp common merger bid board tender securities shareholders outstanding exchange 
5	0.07168	dlrs mln company corp unit sale acquisition pct sell agreement business subsidiary assets group share

What the command above does:

- opens the reuters_self_stopwords.mallet file
- trains MALLET to find 10 topics (or any number you specify)
- stores a topic inferencer (inferencer.mallet), which can apply the trained topic model to new documents
- writes every word in the corpus and the topic it is assigned to into a compressed file (topic-state.gz)
- writes a text file listing the top keywords for each topic (reuters_keys.txt)
- writes a text file giving the proportion of each topic within each document you imported (reuters_composition.txt)
- --optimize-interval [NUMBER] turns on hyperparameter optimization, which lets the model fit the data better by allowing some topics to be more prominent than others. Optimizing every 10 iterations is reasonable.
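
The topic summaries printed in the training log have three tab-separated fields: topic id, topic weight (alpha), and the top words. Assuming reuters_keys.txt uses the same layout, a minimal sketch for loading it could look like this (the function name is hypothetical):

```python
def parse_topic_keys(lines):
    """Parse tab-separated lines: topic id, alpha, space-separated top words."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, alpha, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(alpha), words.split())
    return topics


# Two lines copied (and shortened) from the training log above.
sample = [
    "0\t0.05116\tbank pct market rate mln stg rates banks\n",
    "3\t0.02812\tgold south spokesman union mine strike\n",
]
topics = parse_topic_keys(sample)
for topic_id, (alpha, words) in sorted(topics.items()):
    print(topic_id, alpha, words[:3])

# With a real run, something like:
# with open("reuters_keys.txt", encoding="utf-8") as f:
#     topics = parse_topic_keys(f)
```

Once hyperparameter optimization is on, the alpha column differs per topic, so sorting topics by alpha gives a rough view of which topics are most prominent overall.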

### Infer topic composition of new documents

In [4]:
# Import the new documents into MALLET format.
# This is almost the same as importing documents for training, except for the --use-pipe-from option,
# which makes sure that the new data is compatible with the training data,
# i.e. the new data and the training data share the same alphabet mappings.
!/usr/mallet-2.0.8/bin/mallet import-dir --input ./nltk_data/corpora/reuters/test/ --output reuters_test.mallet --keep-sequence --remove-stopwords --use-pipe-from reuters_self_stopwords.mallet

Labels = 
   ./nltk_data/corpora/reuters/test/
 rewriting previous instance list, with ID = 7949ae3b1ac63c47:-1dc9272:16308c8a090:-7ff7


In [5]:
# infer the topic composition of the new documents and store it in test-topic-composition.txt

!/usr/mallet-2.0.8/bin/mallet infer-topics --input reuters_test.mallet --inferencer inferencer.mallet --output-doc-topics test-topic-composition.txt
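
The composition file can be loaded for further analysis. The sketch below assumes the layout that MALLET 2.0.8 appears to write: one row per document with a document index, the document name, and then one proportion per topic in topic order, all tab-separated, possibly preceded by a `#` header line. Older MALLET versions used a different layout, so check your file first. The function name and the sample rows are made up for illustration.

```python
def parse_doc_topics(lines):
    """Return {document name: [proportion of topic 0, topic 1, ...]}."""
    composition = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a header line, if present
        fields = line.split("\t")
        name = fields[1]
        composition[name] = [float(x) for x in fields[2:]]
    return composition


# Made-up rows in the assumed layout (3 topics for brevity).
sample = [
    "#doc name topic proportion ...",
    "0\tfile:/path/to/doc_a\t0.70\t0.20\t0.10",
    "1\tfile:/path/to/doc_b\t0.05\t0.15\t0.80",
]
composition = parse_doc_topics(sample)
for name, props in composition.items():
    top = max(range(len(props)), key=props.__getitem__)
    print(name, "-> dominant topic", top, "with proportion", props[top])
```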

# Gensim Wrapper for MALLET

Gensim has a wrapper for LDA from MALLET, "allowing both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed gibbs sampling from MALLET."

The following example is created with reference to https://radimrehurek.com/gensim/models/wrappers/ldamallet.html and https://rare-technologies.com/tutorial-on-mallet-in-python/ and http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb

In [6]:
import os
from gensim import corpora, models, utils

In [7]:
def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big string, closing the file afterwards
        with open(os.path.join(reuters_dir, fname)) as f:
            document = f.read()
        # tokenize the document into a list of lowercase tokens
        yield utils.simple_preprocess(document)

In [8]:
class ReutersCorpus(object):
    
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        # drop very rare and very common tokens (defaults: in fewer than 5 documents or in more than 50% of them)
        self.dictionary.filter_extremes()
        
    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)

In [9]:
# set up the streamed corpus
corpus = ReutersCorpus('./nltk_data/corpora/reuters/training/')

In [10]:
# train 10 LDA topics using MALLET
mallet_path = "/usr/mallet-2.0.8/bin/mallet"
model = models.wrappers.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
# Note: models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary) does not work;
# the class lives in models.wrappers, which exists only in gensim versions before 4.0.
# https://radimrehurek.com/gensim/models/wrappers/ldamallet.html

In [11]:
# now use the trained model to infer topics on a new document
doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
print(model[bow]) # print list of (topic_id, topic_weight) pairs

[(0, 0.1016949152542373), (1, 0.1016949152542373), (2, 0.0847457627118644), (3, 0.08851224105461393), (4, 0.11864406779661017), (5, 0.1374764595103578), (6, 0.0903954802259887), (7, 0.0847457627118644), (8, 0.0847457627118644), (9, 0.10734463276836158)]
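
The result is a plain list of (topic_id, weight) pairs, so ordinary Python is enough to pick out the dominant topics. A small sketch using the values printed above:

```python
# (topic_id, weight) pairs copied from the output above
doc_topics = [
    (0, 0.1016949152542373), (1, 0.1016949152542373),
    (2, 0.0847457627118644), (3, 0.08851224105461393),
    (4, 0.11864406779661017), (5, 0.1374764595103578),
    (6, 0.0903954802259887), (7, 0.0847457627118644),
    (8, 0.0847457627118644), (9, 0.10734463276836158),
]

# Most probable topic for the document
best_topic, best_weight = max(doc_topics, key=lambda pair: pair[1])
print(best_topic, round(best_weight, 4))  # prints: 5 0.1375

# Topic ids ranked by weight, highest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
print([topic_id for topic_id, _ in ranked[:3]])  # prints: [5, 4, 9]
```

The weights are close to uniform here because the example document is a single short sentence.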


In [12]:
model.print_topics(num_topics=10, num_words=10) 

[(0,
  '0.039*"trade" + 0.021*"japan" + 0.013*"told" + 0.013*"japanese" + 0.012*"officials" + 0.011*"countries" + 0.010*"united" + 0.010*"states" + 0.009*"foreign" + 0.009*"official"'),
 (1,
  '0.013*"world" + 0.013*"government" + 0.012*"analysts" + 0.011*"tax" + 0.010*"expected" + 0.010*"growth" + 0.009*"economic" + 0.009*"debt" + 0.009*"report" + 0.009*"added"'),
 (2,
  '0.055*"bank" + 0.030*"market" + 0.026*"rate" + 0.023*"dollar" + 0.020*"exchange" + 0.019*"stg" + 0.018*"rates" + 0.016*"banks" + 0.016*"interest" + 0.015*"money"'),
 (3,
  '0.033*"shares" + 0.025*"company" + 0.022*"offer" + 0.020*"corp" + 0.019*"group" + 0.018*"stock" + 0.015*"share" + 0.014*"common" + 0.012*"stake" + 0.011*"acquisition"'),
 (4,
  '0.041*"tonnes" + 0.018*"wheat" + 0.015*"mln" + 0.014*"sugar" + 0.013*"export" + 0.012*"week" + 0.012*"department" + 0.012*"grain" + 0.011*"corn" + 0.011*"agriculture"'),
 (5,
  '0.054*"oil" + 0.014*"gas" + 0.013*"gold" + 0.012*"crude" + 0.012*"prices" + 0.012*"price" + 0.0
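
The formatted topic strings are convenient to read but awkward to process. When only the strings are available, a regular expression can recover (word, weight) pairs. This is a sketch; the pattern and function name are my own and not part of gensim.

```python
import re

# Matches fragments such as 0.039*"trade"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic_string(topic):
    """Turn a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic)]


# Beginning of topic 0 as printed above.
topic_0 = '0.039*"trade" + 0.021*"japan" + 0.013*"told" + 0.013*"japanese"'
print(parse_topic_string(topic_0))
```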

In [13]:
# show_topics(num_topics=10, num_words=10, log=False, formatted=True)
#   Print the `num_words` most probable words for `num_topics` number of topics.
#   Set `num_topics=-1` to print all topics.
model.show_topics(num_topics=-1, num_words=20)

[(0,
  '0.039*"trade" + 0.021*"japan" + 0.013*"told" + 0.013*"japanese" + 0.012*"officials" + 0.011*"countries" + 0.010*"united" + 0.010*"states" + 0.009*"foreign" + 0.009*"official" + 0.009*"agreement" + 0.009*"minister" + 0.008*"ec" + 0.007*"government" + 0.007*"bill" + 0.007*"house" + 0.006*"committee" + 0.006*"industry" + 0.006*"european" + 0.006*"international"'),
 (1,
  '0.013*"world" + 0.013*"government" + 0.012*"analysts" + 0.011*"tax" + 0.010*"expected" + 0.010*"growth" + 0.009*"economic" + 0.009*"debt" + 0.009*"report" + 0.009*"added" + 0.008*"years" + 0.008*"current" + 0.008*"prices" + 0.008*"year" + 0.008*"long" + 0.008*"economy" + 0.008*"higher" + 0.007*"industry" + 0.007*"continue" + 0.007*"lower"'),
 (2,
  '0.055*"bank" + 0.030*"market" + 0.026*"rate" + 0.023*"dollar" + 0.020*"exchange" + 0.019*"stg" + 0.018*"rates" + 0.016*"banks" + 0.016*"interest" + 0.015*"money" + 0.013*"yen" + 0.012*"central" + 0.012*"currency" + 0.011*"foreign" + 0.011*"today" + 0.009*"fed" + 0.009

# pyLDAvis for interactive topic model visualization

pyLDAvis is a Python library for interactive topic model visualization.

Installation instructions for pyLDAvis: http://pyldavis.readthedocs.io/en/latest/readme.html#installation

The following example was created with reference to http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb

In [14]:
import pyLDAvis.gensim
import pyLDAvis

In [15]:
# convert the streamed corpus into an in-memory list
corpus_list = list(corpus)
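
A streamed corpus such as ReutersCorpus defines only `__iter__`: it yields documents lazily and can be iterated repeatedly, but it supports neither `len()` nor indexing. Materializing it with `list()` trades memory for both. The toy class below is a hypothetical stand-in that shows the difference without needing the Reuters data.

```python
class StreamedNumbers:
    """Toy stand-in for a streamed corpus: re-iterable, but has no length."""

    def __iter__(self):
        for n in range(3):
            # one tiny bag-of-words "document" per step
            yield [(n, 1)]


stream = StreamedNumbers()
print(hasattr(stream, "__len__"))  # prints: False

as_list = list(stream)
print(len(as_list), as_list[1])    # prints: 3 [(1, 1)]
```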

In [16]:
dictionary = corpus.dictionary
print(dictionary)

Dictionary(7203 unique tokens: ['after', 'again', 'against', 'all', 'almost']...)


In [17]:
# gensim.models.wrappers.ldamallet.malletmodel2ldamodel(mallet_model, gamma_threshold=0.001, iterations=50)
# Convert LdaMallet to LdaModel.
# This works by copying the training model weights (alpha, beta…) from a trained MALLET model into the gensim model.
# Parameters:
#   mallet_model (LdaMallet): trained MALLET model
#   gamma_threshold (float, optional): to be used for inference in the new LdaModel
#   iterations (int, optional): number of iterations to be used for inference in the new LdaModel
# Returns:
#   LdaModel: gensim native LDA
ldamodel = models.wrappers.ldamallet.malletmodel2ldamodel(model, gamma_threshold=0.001, iterations=50)

In [18]:
ldamodel.show_topics()

[(0,
  '0.000*"surrounding" + 0.000*"strengthen" + 0.000*"deliver" + 0.000*"thereby" + 0.000*"was" + 0.000*"albert" + 0.000*"crushers" + 0.000*"numbers" + 0.000*"assay" + 0.000*"amendments"'),
 (1,
  '0.000*"general" + 0.000*"onto" + 0.000*"filed" + 0.000*"deliverable" + 0.000*"stakes" + 0.000*"mentioned" + 0.000*"hkg" + 0.000*"economy" + 0.000*"chances" + 0.000*"miss"'),
 (2,
  '0.000*"wind" + 0.000*"timetable" + 0.000*"divided" + 0.000*"assays" + 0.000*"strike" + 0.000*"comprising" + 0.000*"extends" + 0.000*"ln" + 0.000*"attempting" + 0.000*"exchanges"'),
 (3,
  '0.000*"slowly" + 0.000*"puerto" + 0.000*"settlement" + 0.000*"announce" + 0.000*"strain" + 0.000*"resolving" + 0.000*"adjust" + 0.000*"saw" + 0.000*"drawdown" + 0.000*"french"'),
 (4,
  '0.000*"distorted" + 0.000*"chris" + 0.000*"charles" + 0.000*"harm" + 0.000*"heritage" + 0.000*"marshall" + 0.000*"meetings" + 0.000*"graphics" + 0.000*"served" + 0.000*"checked"'),
 (5,
  '0.000*"formally" + 0.000*"engaged" + 0.000*"hire" + 

In [19]:
vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus_list, dictionary)

In [20]:
pyLDAvis.display(vis_data)