# Part 5.2 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import gc, subprocess
import pandas as pd
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 120)

import datetime
print (datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

2019-02-02 19:54:24


**In this part, we will perform the following operations:**

1. use MALLET to train on the training set, producing a topic model and the topic modeling result files;
1. infer topics for the subsets, producing the inference result files.

**We do not think of the number of topics as a natural characteristic of a corpus. Corpora are not really combinations of multinomial distributions, so there is no "right" number of topics. We think of the number of topics as the scale of a map of the corpus: if we want a broad overview, we use a small number; if we want more detail, we use a larger one. The right number is the value that produces meaningful results and lets us accomplish our goal.**

**A wide range of values works well for us; here we train on the dataset to get a topic model with 200 topics.**

**Many metrics and tools can help tune the number of topics quantitatively, such as the [Hierarchical Dirichlet process](https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process), [ldatuning](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html) and [topic coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/); we leave this evaluation as future work.**

## 0 Latent Dirichlet Allocation

>**LDA**
>
>Latent Dirichlet Allocation, is an unsupervised generative model that assigns topic distributions to documents.
>
>At a high level, the model assumes that each document will contain several topics, so that there is topic overlap within a document. The words in each document contribute to these topics. The topics may not be known a priori, and needn't even be specified, but the **number** of topics must be specified a priori. Finally, there can be word overlap between topics, so several topics may share the same words.
>
>The model generates two **latent** (hidden) variables:
>1. A distribution over topics for each document
>2. A distribution over words for each topic
>
>After training, each document will have a discrete distribution over all topics, and each topic will have a discrete distribution over all words.
>
>It is best to demonstrate this with an example. A document about the presidential elections may have a high contribution from the topics "presidential elections", "america", "voting" but very low contributions from the topics "himalayan mountain range", "video games", "machine learning" (assuming the corpus is varied enough to contain such articles); the topic "presidential elections" may have top contributing words ["vote", "election", "people", "usa", "clinton", "trump", ...] whereas the top contributing words in the topic "himalayan mountain range" may be ["nepal", "everest", "china", "altitude", "river", "snow", ...]. This very rough example should give you an idea of what LDA aims to do.
>
>An important point to note: although I have named some topics in the example above, the model itself does not actually do any "naming" or classifying of topics. But by visually inspecting the top contributing words of a topic i.e. the discrete distribution over words for a topic, one can name the topics if necessary after training. We will show this more later.
>
>There are several ways to implement LDA; however, I will describe collapsed Gibbs sampling, as I usually find it the easiest way to understand the model.
>
>The model initialises by assigning every word in every document to a **random** topic. Then we iterate through each word, unassign its current topic, decrement the corpus-wide count for that topic, and reassign the word to a new topic based on the local probability of topic assignments in the current document and the global (corpus-wide) probability of the word's assignments to each topic. This may be hard to understand in words, so the equations are below.
>
>**The mathematics of collapsed Gibbs sampling (cut-back version)**
>
>Recall that when we iterate through each word in each document, we unassign its current topic assignment and reassign the word to a new topic. The topic we reassign the word to is based on the probabilities below.
>
>$$
>P\left(\text{document "likes" the topic}\right) \times P\left(\text{topic "likes" the word } w'\right)
>$$
>
>$$
>\Rightarrow \frac{n_{i,k}+\alpha}{N_i-1+K\alpha} \times \frac{m_{w',k}+\gamma}{\sum_{w\in V}m_{w,k} + V\gamma}
>$$
>
>where
>
>$n_{i,k}$ - number of word assignments to topic $k$ in document $i$
>
>$\alpha$ - smoothing parameter (hyper parameter - make sure probability is never 0)
>
>$N_i$ - number of words in document $i$
>
>$-1$ - don't count the current word you're on
>
>$K$ - total number of topics
>
>
>$m_{w',k}$ - number of assignments, corpus wide, of word $w'$ to topic $k$
>
>$\gamma$ - smoothing parameter (hyper parameter - make sure probability is never 0)
>
>$\sum_{w\in V}m_{w,k}$ - sum over all words in vocabulary currently assigned to topic $k$
>
>$V$ - size of vocabulary, i.e. number of distinct words corpus wide
>
>**Notes and Uses of LDA**
>
>LDA has many uses: understanding the variety of topics in a corpus (obviously), getting a better insight into the type of documents in a corpus (whether they are news, Wikipedia articles or business documents), quantifying the most used / most important words in a corpus, and even document similarity and recommendation.
>
>LDA does not work well with very short documents, like Twitter feeds, as explained here [[1]](https://pdfs.semanticscholar.org/f499/5dc2a4eb901594578e3780a6f33dee02dad1.pdf) [[2]](https://stackoverflow.com/questions/29786985/whats-the-disadvantage-of-lda-for-short-texts), which is why we dropped articles under 40 tokens previously. Very briefly, this is because the model infers parameters from observations, and if there are not enough observations (words) in a document, the model performs poorly. For short texts, although this is yet to be rigorously tested, it may be best to use a [biterm model](https://pdfs.semanticscholar.org/f499/5dc2a4eb901594578e3780a6f33dee02dad1.pdf).
>
>Unlike the word2vec algorithm, which performs extremely well with fully structured sentences, LDA is a bag-of-words model, meaning word order in a document doesn't count. This also means that stopwords and rare words should be excluded, so that the model doesn't overcompensate for very frequent and very rare words, neither of which contributes to general topics.
>
>**Hyperparameters**
>
>LDA has 2 hyperparameters: $\alpha$ and $\eta$ (here $\eta$ is the word smoothing parameter written as $\gamma$ in the sampling equation above)
>
>$\alpha$ - A low value for $\alpha$ means that documents have only a low number of topics contributing to them. A high value of $\alpha$ yields the inverse, meaning the documents appear more alike within a corpus.
>
>$\eta$ - A low value for $\eta$ means the topics have a low number of contributing words. A high value of $\eta$ yields the inverse, meaning topics will have word overlap and appear more alike.
>
>The values of $\alpha$ and $\eta$ really depend on the application, and may need to be tweaked several times before the desired results are found... even then, LDA is non-deterministic since parameters are randomly initialised, so the outcome of any run of the model can never be known in advance.
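
The sampling step described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation MALLET uses: documents are assumed to be lists of integer word ids, and the function name `gibbs_lda`, its parameters and the toy corpus are hypothetical. `gamma` plays the role of the word smoothing parameter (called $\eta$ in the hyperparameter notes).

```python
import random

def gibbs_lda(docs, K, V, alpha=0.1, gamma=0.01, iters=50, seed=0):
    """Toy collapsed Gibbs sampler; docs are lists of word ids in range(V)."""
    rng = random.Random(seed)
    n = [[0] * K for _ in docs]      # n[i][k]: tokens of document i assigned to topic k
    m = [[0] * K for _ in range(V)]  # m[w][k]: corpus-wide assignments of word w to topic k
    tot = [0] * K                    # tot[k]: sum over all words w of m[w][k]
    z = []                           # z[i][j]: topic of the j-th token of document i

    # Initialise by assigning every token to a random topic.
    for i, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(K)
            z[i].append(k)
            n[i][k] += 1
            m[w][k] += 1
            tot[k] += 1

    for _ in range(iters):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Unassign the current topic of this token.
                k = z[i][j]
                n[i][k] -= 1
                m[w][k] -= 1
                tot[k] -= 1
                # P(document "likes" topic t) * P(topic t "likes" word w).
                weights = [
                    (n[i][t] + alpha) / (len(doc) - 1 + K * alpha)
                    * (m[w][t] + gamma) / (tot[t] + V * gamma)
                    for t in range(K)
                ]
                # Draw the new topic in proportion to the weights and reassign.
                k = rng.choices(range(K), weights=weights)[0]
                z[i][j] = k
                n[i][k] += 1
                m[w][k] += 1
                tot[k] += 1
    return n, m

# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to occur together.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 4, 4]]
n, m = gibbs_lda(docs, K=2, V=6)
print(n)  # per-document topic counts
```

Normalising each row of `n` (after adding `alpha`) gives the per-document topic distribution, and normalising each topic column of `m` (after adding `gamma`) gives the per-topic word distribution.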

## 1 Training Topic model

**MALLET accepts either one instance per file or a single file with one instance per line. Of these, only the single file with one instance per line is practical for us; we already prepared the .csv file for training in Part 5.1.**

**In practice we run bash commands from the notebook only when the sample set is small (under 10% of the total); otherwise we run the command directly in a Linux terminal and leave the script running in the background.**
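
As a rough illustration of the "one instance per line" layout, the sketch below writes each document as a single tab-separated line. It assumes three fields per document (id, title, text), matching the preview shown in this notebook; the helper `write_instances` and the sample rows are hypothetical and are not the code used in Part 5.1.

```python
import os
import tempfile

def write_instances(rows, path):
    """Write (id, title, text) rows as one tab-separated instance per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Collapse tabs, newlines and repeated spaces inside each field,
            # so every document occupies exactly one line with three fields.
            fields = [" ".join(str(value).split()) for value in row]
            f.write("\t".join(fields) + "\n")

# Hypothetical sample rows containing a line break and a tab in the text.
rows = [
    (1, "First title", "Body text with a\nline break."),
    (2, "Second title", "Body text with a\ttab."),
]
path = os.path.join(tempfile.mkdtemp(), "instances.csv")
write_instances(rows, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # one line per document
```

Keeping every document on exactly one line is what lets `wc -l` report the number of instances.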

**Check contents:**

In [2]:
pathi = r'../data/dataset/sample/train/train.csv'
patho = r'../models/train/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 3025602


Unnamed: 0,0,1,2
0,1854213,TO OUR HEADERS.,TO OUR HEADERS.; We have to apologize to our. numerous for the delay which has occurred in ' getting out the first n...
1,1854215,Page 1 Advertisements Column 1,"v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\"" at the nominal rate of Threepence for ..."
2,1854221,ORIGINAL POETRY.,"ORIGINAL POETRY.:- ' FAREWELL T() ENGLAND./ t . Farewell, to happy England ! . *, , For other lands I roam,' /'• To ..."
3,1854224,OUR OWN RIVER-ORUAWHARO !,"OUR OWN RIVER-ORUAWHARO !There was heard a song ou the'chiming sea, A mingled breathing of hopo and glee ; W Voices ..."
4,1854232,Page 1 Advertisements Column 1,"NOTICE.—This Ne?vspaper may b? sent Free by Post (within Seven days of date,) to any part of , • Great Britain, New ..."


In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'train'
#bash ./model.sh -i '../data/dataset/sample/train/train.csv' -o '../models/train/' -p 'train'

In [4]:
# Write the training log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

**The output files are:**

1. `topicKeys.txt`: the top words of each topic;
1. the doc-topic file: the topic distribution of each document;
1. `inferencer.model`: the topic inferencer used for inferring subsets;
1. `stat.gz`: the corpus, with the topic that each word belongs to;
1. `diagnostics.xml`: diagnostic statistics.
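
The content of `model.sh` is not shown in this notebook, so the following is only a guess at how the training output files could map onto MALLET's `train-topics` options. The executable name `mallet`, the imported instance file `train.mallet`, the doc-topic file name `docTopics.txt` and the helper `train_topics_cmd` are assumptions for illustration; the other file names are taken from the list above. The sketch only builds the command and does not run it.

```python
from pathlib import Path

def train_topics_cmd(mallet_input, out_dir, num_topics=200,
                     doc_topics_name="docTopics.txt"):
    """Build (not run) a MALLET train-topics command line as a list of arguments."""
    out = Path(out_dir)
    return [
        "mallet", "train-topics",               # assumes `mallet` is on the PATH
        "--input", str(mallet_input),           # instances produced by `mallet import-file`
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(out / "topicKeys.txt"),
        "--output-doc-topics", str(out / doc_topics_name),  # hypothetical file name
        "--inferencer-filename", str(out / "inferencer.model"),
        "--output-state", str(out / "stat.gz"),
        "--diagnostics-file", str(out / "diagnostics.xml"),
    ]

cmd = train_topics_cmd("train.mallet", "../models/train/")
print(" ".join(cmd))
```

The list could be passed to `subprocess.run(cmd, check=True)` once the paths and the MALLET installation have been checked.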

## 2 Inferring Subset

**Besides analyzing and visualizing the topic model of the training dataset, we can extract several subsets from the training dataset, based on typical application scenarios, to focus on specific aspects or features. We run the inferencer on each subset to get a doc-topic matrix, which we use to analyze and visualize its topics.**
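
Inference with MALLET typically takes two steps: import the subset with the same pipe as the training instances, then apply the saved inferencer. The sketch below builds both commands without running them. It is an assumption about what the `infer` mode of `model.sh` might do, not its actual content; the helper `infer_cmds`, the instance file names and `docTopics.txt` are hypothetical.

```python
from pathlib import Path

def infer_cmds(subset_csv, train_instances, inferencer, out_dir,
               doc_topics_name="docTopics.txt"):
    """Build (not run) MALLET commands that infer topics for one subset."""
    out = Path(out_dir)
    instances = out / "subset.mallet"  # hypothetical name for the imported subset
    import_cmd = [
        "mallet", "import-file",
        "--input", str(subset_csv),
        "--output", str(instances),
        "--keep-sequence",
        # Reuse the training pipe so the subset shares the training vocabulary.
        "--use-pipe-from", str(train_instances),
    ]
    infer_cmd = [
        "mallet", "infer-topics",
        "--input", str(instances),
        "--inferencer", str(inferencer),
        "--output-doc-topics", str(out / doc_topics_name),  # hypothetical file name
    ]
    return import_cmd, infer_cmd

import_cmd, infer_cmd = infer_cmds(
    "../data/dataset/sample/subset/wwi/wwi.csv",
    "../models/train/train.mallet",      # hypothetical training instance file
    "../models/train/inferencer.model",
    "../models/wwi/",
)
print(" ".join(import_cmd))
print(" ".join(infer_cmd))
```

Reusing the training pipe matters because the inferencer can only score words that were in the training vocabulary.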

### 2.1 By Range of Time

**Check contents:**

In [2]:
pathi = r'../data/dataset/sample/subset/wwi/wwi.csv'
patho = r'../models/wwi/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 567878


Unnamed: 0,0,1,2
0,3024444,The New Year.,"The New Year.My Dear People,—-----t Although, the close of another civil year means little to us from a Church point..."
1,3025052,"S. Augustine's, Napier.","S. Augustine's, Napier.Vicar: Rev. Canon Tuke. Curate : Rev. C. L. Wilson. The Dawn of Day. has arrived at last and ..."
2,3026223,Tolago Bay.,Tolago Bay.Vicar: Rev. O. W. Davidson. The great effort to raise a substantial building fund is to take place on the...
3,3026372,P. & O. S.S. Morea.,"P. & O. S.S. Morea.My Dear People, I should like to take the opportunity of thanking the many friends who sent us su..."
4,3026678,BISHOP'S PRIZES.,"BISHOP'S PRIZES.The following have been awarded Bishop's Prizes : — Sunday. Schools.— Senior : None. Junior: 1, Kath..."


**Inferring:**

In [3]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [4]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

### 2.2 By Region

#### 2.2.1 Otago

**Check contents:**

In [5]:
pathi = r'../data/dataset/sample/subset/Otago/Otago.csv'
patho = r'../models/otago/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 374495


Unnamed: 0,0,1,2
0,1857847,Page 2 Advertisements Column 4,"MISCELLANEOUS. TAMES GOODALL, Bread and Fanqy Biscuit Maker, Has much pleasure in informing the inhabitants of Tokom..."
1,1857876,THE TELEGRAPH.,"THE TELEGRAPH.Jli . lon said that as he understood the chief business of the meeting was over, he would take the opp..."
2,1857878,Original Correspondence.,Original Correspondence.Oub Correspondence Column is at all times open to tie temperate discussion of questions of p...
3,1857993,Page 1 Advertisements Column 3,"jj STALLIONS. j NOTICE TO FARMERS, 'I AND BREEDERS OF DRAUHT HOUSES. | j kffllll^fJll npHE (Imported) PURK |j j&^l^^..."
4,1858007,Markets.,"Markets.WHOLESALE. — Adelaide, £32 per ton ; colonial, none — brown, 60s per cwt; crystal, 60s. Tea — £9 to £11 ; ha..."


**Inferring:**

In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [None]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

#### 2.2.2 Canterbury

**Check contents:**

In [None]:
pathi = r'../data/dataset/sample/subset/Canterbury/Canterbury.csv'
patho = r'../models/canterbury/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 282791


Unnamed: 0,0,1,2
0,1854361,CRICKET,"CRICKET[rbotbr's tblegrams— oopyright.]MeLBOURyB, January 1. The return mutch between Shaw and LUly white's Eleven a..."
1,1854363,THE HOME CRISIS.,"THE HOME CRISIS.(KETOIK's TBLIORAWB— COPYRIGHT ]London, Deosraber 30.Mr Chamberlain has arrived m London, and is con..."
2,1854367,LONDON MARKETS.,"LONDON MARKETS.Contois remain at par.New Zealand securities — Four per cent. Inscribed Stock are jQi higher, 9S J A ..."
3,1854372,Page 2 Advertisements Column 4,TltY KIRKPATMOK'S NEW SEASON'S JAM. SPECIAL NOTICE. TTURKPATBICK'S NEW SEASON'S J\\. JAM is made from NISLSON G»IoWN...
4,1854375,O.J.O. NEW YEAR M4KTISG.,"0.J.0. NEW YEAR M4KTISG.Chhistchurch, January 1About 300 people were present at tbe OJ.O. Meeting tt-lay. £4100 paer..."


**Inferring:**

In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [None]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

#### 2.2.3 Manawatu-Wanganui

**Check contents:**

In [None]:
pathi = r'../data/dataset/sample/subset/Manawatu-Wanganui/Manawatu-Wanganui.csv'
patho = r'../models/manawatu-wanganui/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 344669


Unnamed: 0,0,1,2
0,1891145,MARRIAGE.,"b\MARRIAGE.T^mon\\xe2\\x80\\x94 WHfTsi'H'-i- ofa \\xe2\\x80\\xa2 tW ' i 9(h -Se'pi' \\xe2\\x96\\xa0 tetriber, at All..."
1,1891249,PATEA.,"b\PATEA.! :'\\xe2\\x96\\xa0'\\xe2\\x96\\xa0 fir.! Sfept^mbyrl9. The steamer Wakatu, in entering the J> Wver this aft..."
2,1891282,SHOCKING DEATH.,"b'SH OCKING DEAT H .i \\xe2\\x96\\xa0\\\xe2\\x80\\xa2 / \""\\'W*rinWKi 4 AyH.J ! *\\' 1!!1 ;* *\"" ; , Wellinotpn, Sep..."
3,1891315,COMMITTEE MEETING.,"b\COMMITTEE ME ETING.; .j A meeting^ JJieidoif mjttete w*s then \\xe2\\x96\\xa0 (eld, jfttfti 'stpgiail Mv k\\xc2\\x..."
4,1891367,BISMARCHIS HEALTH DECLINING.,"b'BISMARCH IS HEALTH DECLINING., ; ,. | . ; .{-BY..TEI l EattA*H,; \\xc2\\x84... : -1,.; , .... 1 .. : \\xe2\\x80\\x..."


**Inferring:**

In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [None]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

#### 2.2.4 Wellington

**Check contents:**

In [None]:
pathi = r'../data/dataset/sample/subset/Wellington/Wellington.csv'
patho = r'../models/wellington/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 634731


Unnamed: 0,0,1,2
0,2564904,Page 1 Advertisements Column 7,"WANTED Known— Your teeth extracted without pain by pure gas (perfectly harmless) at the London Dental Company, corne..."
1,2564905,SPORTING.,"SPORTING.♦ THE C J.C. SPRING MEETING. [BY TELEGRAPH — PBESS ASSOCIATION.] Christchttkch, 12tli November. At the thir..."
2,2564921,COMMERCIAL & FINANCIAL.,"COMMERCIAL & FINANCIAL.|_PBBBS ASSOCIATION. J (Keoeived November 14, 9 a.m.) London, 13th November. The following ar..."
3,2564922,MAIL NOTICES,"MAIL NOTICESSubject to necessiiry alterations mails' will close at the Chief Post Office as under :â .Monday, 14th..."
4,2564929,Page 4 Advertisements Column 1,"ITVAVTD ANDERSON & SON, TEA MERCHANTS & FAMILY GROCERS 40, MOLESWORTH-STREET, WELLINGTON. J. & A. WILSON, FUNERAL DI..."


**Inferring:**

In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [None]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

### 2.3 By Label

**Check contents:**

In [None]:
pathi = r'../data/dataset/sample/subset/ads/ads.csv'
patho = r'../models/ads/'
print('Dataset size:', subprocess.check_output(['wc','-l', pathi]).split()[0].decode('utf-8'))
pd.read_table(pathi, header=None, nrows=5).head()

Dataset size: 841233


Unnamed: 0,0,1,2
0,1854215,Page 1 Advertisements Column 1,"v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\"" at the nominal rate of Threepence for ..."
1,1854232,Page 1 Advertisements Column 1,"NOTICE.—This Ne?vspaper may b? sent Free by Post (within Seven days of date,) to any part of , • Great Britain, New ..."
2,1854233,Page 1 Advertisements Column 2,"TVT-OTR PAPER, Bill Paper, Envelopes _LV Memorandum Books, Pens, Ink, &c., on sale at the \ Gazette Office.\"" O-OPE ..."
3,1854245,Page 1 Advertisements Column 1,"NOTICE.—Tim Newspaper may bs sent Free by Post(roithin Seven days of date,) to any part of Great Britain, New Zealan..."
4,1854264,Page 2 Advertisements Column 1,NOTICE is hereby given that in case the following persons neglectto fulfill1 the Conditions on which all allotments ...


**Inferring:**

In [None]:
%%capture capt
%%time
%%bash -s $pathi $patho
#! /bin/bash

bash ./model.sh -i $1 -o $2 -p 'infer'

In [None]:
# Write the inference log to a file, so MALLET's very long log is not printed in the notebook.
with open(patho+'log.txt', 'w') as f:
    f.write(capt.stdout)

---

In [None]:
gc.collect()