# Part 5.2 - Topic Modeling
---
### Papers Past Topic Modeling
<br/>

Ben Faulks - bmf43@uclive.ac.nz

Xiandong Cai - xca24@uclive.ac.nz

Yujie Cui - ycu23@uclive.ac.nz

In [1]:
import gc, subprocess
import pandas as pd
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 120)

**In this part, we perform the following operations:**

1. train a topic model on the training set with MALLET and save the result files;
1. infer topics for each subset with the trained model and save the result files.

## 1 Training Topic model

**MALLET accepts input either as one instance per file or as a single file with one instance per line. The only choice for us is a single file with one instance per line, and we already prepared the .csv file for training in Part 5.1.**
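
As a minimal sketch of the one-instance-per-line layout, the snippet below writes two made-up records as tab-separated id, title and text fields and reads them back with `pd.read_table`. The helper name `write_instances` and the sample rows are hypothetical; the real file was produced in Part 5.1 and may have been written differently.

```python
import csv
import io

import pandas as pd


def write_instances(rows, handle):
    """Write (id, title, text) tuples as one tab-separated instance per line."""
    for row in rows:
        # Tabs or newlines inside a field would break the one-line-per-instance layout.
        clean = [str(field).replace('\t', ' ').replace('\n', ' ') for field in row]
        handle.write('\t'.join(clean) + '\n')


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    (1, 'SHIPPING NEWS.', 'Arrived: the barque Mary,\nfrom Sydney.'),
    (2, 'Page 1 Advertisements', 'On sale\tat the Gazette Office.'),
]

buffer = io.StringIO()
write_instances(sample_rows, buffer)
buffer.seek(0)

# QUOTE_NONE keeps stray quotation marks in OCR text from being treated as field quoting.
frame = pd.read_table(buffer, header=None, quoting=csv.QUOTE_NONE)
print(frame)
```

Replacing embedded tabs and newlines before writing is one simple way to guarantee that each document stays on exactly one line.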

**Check contents:**

In [2]:
path = r'../data/dataset/sample/train/train.csv'
print('Dataset size:', subprocess.check_output(['wc','-l', path]).split()[0].decode('utf-8'))
pd.read_table(path, header=None, nrows=5).head()

Dataset size: 1814086


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\" at the nominal rate of Threepence for ... |
| 1 | 1854221 | ORIGINAL POETRY. | ORIGINAL POETRY.:- ' FAREWELL T() ENGLAND./ t . Farewell, to happy England ! . *, , For other lands I roam,' /'• To ... |
| 2 | 1854224 | OUR OWN RIVER-ORUAWHARO ! | OUR OWN RIVER-ORUAWHARO !There was heard a song ou the'chiming sea, A mingled breathing of hopo and glee ; W Voices ... |
| 3 | 1854233 | Page 1 Advertisements Column 2 | TVT-OTR PAPER, Bill Paper, Envelopes _LV Memorandum Books, Pens, Ink, &c., on sale at the \ Gazette Office.\" O-OPE ... |
| 4 | 1854245 | Page 1 Advertisements Column 1 | NOTICE.—Tim Newspaper may bs sent Free by Post(roithin Seven days of date,) to any part of Great Britain, New Zealan... |


**We do not think of the number of topics as a natural characteristic of a corpus. Real corpora are not really combinations of multinomial distributions, so there is no "right" number of topics. We think of the number of topics as the scale of a map of the corpus: for a broad overview we use a small number of topics, and for more detail we use a larger one. The right number is the value that produces meaningful results and lets us accomplish our goal.**

**A wide range of values works well for us. Here we train a topic model with 200 topics.**

**Many metrics and tools can help tune the number of topics quantitatively, such as [ldatuning](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html) and [topic coherence](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). We leave this evaluation to future work.**
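
To make the idea of topic coherence concrete, here is a small sketch of the UMass coherence score computed from document co-occurrence counts. It is a generic illustration on a made-up toy corpus, not the procedure used by ldatuning or by the linked article; the function name `umass_coherence` is our own.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence: sum of log((D(w_later, w_earlier) + 1) / D(w_earlier)).

    `top_words` is ordered from most to least probable word of the topic;
    `documents` is a list of token lists used as the reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` is always the higher-ranked word.
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # word absent from the reference corpus, skip the pair
        score += math.log((doc_freq(later, earlier) + 1) / denominator)
    return score


# Toy corpus (hypothetical): two shipping documents and two farming documents.
toy_docs = [
    ['ship', 'harbour', 'cargo'],
    ['ship', 'cargo', 'captain'],
    ['sheep', 'wool', 'farm'],
    ['wool', 'farm', 'market'],
]
coherent = umass_coherence(['ship', 'cargo', 'harbour'], toy_docs)
mixed = umass_coherence(['ship', 'wool', 'farm'], toy_docs)
print(coherent, mixed)
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words. Comparing the average score across models trained with different topic numbers is one way to tune that number.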

In [3]:
%%time
%%bash
#! /bin/bash

bash ./model.sh -i '../data/dataset/sample/train/train.csv' -o '../models/train/' -p 'train'
#%%capture capt

InputFile=../data/dataset/sample/train/train.csv
OutputDir=../models/train/
Process=train
CORES=12
SEED1=1
SEED2=1
TOPICS=200
ITERATION=2000
INTERVAL=40
BURNIN=300
IDFMIN=0
IDFMAX=8
22:38:57 :: Start import dataset...
Import new data for training.
22:48:43 :: Imported.
22:48:43 :: Start prune model...
22:55:09 :: Pruned.
22:55:09 :: Start training dataset...
08:31:02 :: Trained.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.0-8.0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000
82000
83000
84000
85000
86000
87000
88000
89000
90000
91000
92000
93000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139

CPU times: user 648 ms, sys: 368 ms, total: 1.02 s
Wall time: 9h 52min 4s
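
`model.sh` itself is not shown in this notebook. As a rough sketch of how the printed settings could map onto MALLET's command-line tools, the snippet below builds (but does not run) argument lists for `import-file`, `prune` and `train-topics`. The flags are standard MALLET options, but which of them `model.sh` really passes is an assumption, as are the file names `pruned.model` and `docTopics.txt`, the use of `SEED1` as the random seed, and the reading of `INTERVAL` and `BURNIN` as hyperparameter optimization settings.

```python
import shlex

# Settings copied from the parameter dump printed by model.sh.
P = {'CORES': 12, 'SEED1': 1, 'TOPICS': 200, 'ITERATION': 2000,
     'INTERVAL': 40, 'BURNIN': 300, 'IDFMIN': 0, 'IDFMAX': 8}
OUT = '../models/train/'

import_cmd = ['mallet', 'import-file',
              '--input', '../data/dataset/sample/train/train.csv',
              '--output', OUT + 'import.model',
              '--keep-sequence']

prune_cmd = ['mallet', 'prune',
             '--input', OUT + 'import.model',
             '--output', OUT + 'pruned.model',  # hypothetical file name
             '--min-idf', str(P['IDFMIN']),
             '--max-idf', str(P['IDFMAX'])]

train_cmd = ['mallet', 'train-topics',
             '--input', OUT + 'pruned.model',
             '--num-topics', str(P['TOPICS']),
             '--num-iterations', str(P['ITERATION']),
             '--optimize-interval', str(P['INTERVAL']),  # assumed meaning of INTERVAL
             '--optimize-burn-in', str(P['BURNIN']),     # assumed meaning of BURNIN
             '--num-threads', str(P['CORES']),
             '--random-seed', str(P['SEED1']),           # assumed meaning of SEED1
             '--output-topic-keys', OUT + 'topicKeys.txt',
             '--output-doc-topics', OUT + 'docTopics.txt',  # hypothetical file name
             '--inferencer-filename', OUT + 'inferencer.model',
             '--output-state', OUT + 'stat.gz',
             '--diagnostics-file', OUT + 'diagnostics.xml']

for cmd in (import_cmd, prune_cmd, train_cmd):
    print(' '.join(shlex.quote(part) for part in cmd))
```

Keeping the commands as lists makes it easy to pass them to `subprocess.run` later without worrying about shell quoting.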


In [4]:
# Write the training log to a file, so that MALLET's very long log is not printed in the notebook.
#with open('../models/train/log.txt', 'w') as f:
#    f.write(capt.stdout)
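
An alternative to `%%capture` is to send the process output straight to a log file. The sketch below shows the idea with a hypothetical helper `run_logged`; to keep it runnable anywhere, it logs a harmless Python one-liner instead of calling `model.sh`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(command, log_path):
    """Run `command`, write stdout and stderr to `log_path`, return the exit code."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open('w') as log_file:
        completed = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode


# Demo command standing in for the long-running training script.
demo_log = Path(tempfile.mkdtemp()) / 'log.txt'
exit_code = run_logged([sys.executable, '-c', 'print("1000"); print("2000")'], demo_log)
print(exit_code, demo_log.read_text().split())
```

For the training run, the command list would be the `bash ./model.sh -i ... -o ... -p 'train'` call from the cell above, with `../models/train/log.txt` as the log path.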

**The output files are:**

1. `topicKeys.txt`: top words of each topic;
1. the doc-topic file: topic distribution per document;
1. `inferencer.model`: topic inferencer for inferring subsets;
1. `stat.gz`: the corpus with the topic assigned to each word;
1. `diagnostics.xml`: diagnostic statistics.
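
As a sketch of how the topic words file can be loaded, the snippet below parses two made-up lines in the three-column, tab-separated layout that MALLET's `--output-topic-keys` option normally writes (topic id, topic weight, space-separated top words). The sample content and the column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical content in the usual topic-keys layout: id <TAB> weight <TAB> words.
sample_topic_keys = (
    "0\t0.0412\tship captain harbour cargo vessel\n"
    "1\t0.0275\tsheep wool farm market station\n"
)

topic_keys = pd.read_table(io.StringIO(sample_topic_keys), header=None,
                           names=['topic', 'weight', 'keywords'])
# Turn the space-separated string into a list of words per topic.
topic_keys['keywords'] = topic_keys['keywords'].str.split()
print(topic_keys)
```

With the real file, `io.StringIO(sample_topic_keys)` would be replaced by the path to `topicKeys.txt` in the model directory.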

## 2 Inferring Subset

**Besides analyzing and visualizing the topic model of the full training dataset, we can extract subsets from the training dataset to focus on specific aspects or features, following typical application scenarios. We run the trained inferencer on each subset to get a doc-topic matrix, which we then use to analyze and visualize topics.**
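
A minimal sketch of turning a doc-topic file into a matrix follows. It assumes the dense layout written by recent MALLET versions (document index, document name, then one proportion per topic, tab-separated). The two document ids are taken from the subset sample shown below, but the proportions and the three-topic width are made up.

```python
import io

import pandas as pd

# Hypothetical doc-topic lines: index <TAB> doc id <TAB> proportion of topic 0, 1, 2.
sample_doc_topics = (
    "0\t3024904\t0.70\t0.20\t0.10\n"
    "1\t3025071\t0.05\t0.15\t0.80\n"
)

raw = pd.read_table(io.StringIO(sample_doc_topics), header=None)
# Drop the running index, use the document id as the row label.
doc_topics = raw.drop(columns=0).set_index(1)
doc_topics.index.name = 'doc_id'
doc_topics.columns = range(doc_topics.shape[1])  # relabel columns as topic ids

# Dominant topic of each document.
dominant = doc_topics.idxmax(axis=1)
print(doc_topics)
print(dominant)
```

Each row sums to about 1, so averaging rows over a subset gives that subset's overall topic distribution.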

### 2.1 By Range of Time

**Check contents:**

In [5]:
path = r'../data/dataset/sample/subset/wwi/wwi.csv'
print('Dataset size:', subprocess.check_output(['wc','-l', path]).split()[0].decode('utf-8'))
pd.read_table(path, header=None, nrows=5).head()

Dataset size: 340011


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3024904 | Diocesan Paper. | Diocesan Paper.Axchdeacon Ruddock begs to acknowledge receipt of the following amounts for the Waiapu Chtjkch Gazett... |
| 1 | 3025071 | Taradale. | Taradale.Vicar : Rev. A. P. Clarke. Lay Reader: Mr McCuteheon. The Vicar and his family will be away from the Parish... |
| 2 | 3026239 | Te Puke. | Te Puke.Vicar: Rev. J. Hobbs. To tiie Parishioners— My Dear Friends,— We are of \Te Puke,\" which I understand , mea... |
| 3 | 3026663 | CERTIFICATES. | CERTIFICATES.The following have been awarded Certificates : — Nellie Goldsmith, Kathleen Cox, Stella Trenwith, Lilia... |
| 4 | 3027602 | Waipukurau. | Waipukurau.Vicar: Rev. F. W. Martin. Curate: Rev. H. GolUer. A meeting- of Parishioners of S. Mary's Church was held... |


**Inferring:**

In [6]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/wwi/' -p 'infer'
#%%capture capt

InputFile=../data/dataset/sample/subset/wwi/wwi.csv
OutputDir=../models/wwi/
Process=infer
TrainDir=../models/train
Inferencer=../models/train/inferencer.model
CORES=12
SEED1=1
SEED2=1
TOPICS=200
ITERATION=2000
INTERVAL=40
BURNIN=300
IDFMIN=0
IDFMAX=8
08:31:03 :: Start import dataset...
 Rewriting extended pipe from ../models/train/import.model
  Instance ID = 18d98806287d7863:3c05b5f:16884606bae:-7ffa
Import new data for inferring.
08:33:54 :: Imported.
08:33:54 :: Start prune model...
08:34:30 :: Pruned.
08:34:30 :: Start infering dataset...
08:46:09 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.0-8.0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000
82000
83000
84000
85000
86000
87000
88000
89000
90000
91000
92000
93000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139

CPU times: user 16 ms, sys: 24 ms, total: 40 ms
Wall time: 15min 6s
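
The log line `Rewriting extended pipe from ../models/train/import.model` is consistent with MALLET's `--use-pipe-from` option, which makes a new dataset reuse the vocabulary of the training data. The sketch below builds (without running) the import and inference commands under that assumption. The output file names and the choice to pass `ITERATION` and `SEED1` to `infer-topics` are hypothetical; `model.sh` may do this differently, and the pruning step is omitted here.

```python
import shlex

TRAIN_DIR = '../models/train/'
OUT_DIR = '../models/wwi/'

import_cmd = ['mallet', 'import-file',
              '--input', '../data/dataset/sample/subset/wwi/wwi.csv',
              '--output', OUT_DIR + 'import.model',   # hypothetical file name
              '--keep-sequence',
              # Reuse the training pipe so word ids match the trained model.
              '--use-pipe-from', TRAIN_DIR + 'import.model']

infer_cmd = ['mallet', 'infer-topics',
             '--input', OUT_DIR + 'import.model',
             '--inferencer', TRAIN_DIR + 'inferencer.model',
             '--output-doc-topics', OUT_DIR + 'docTopics.txt',  # hypothetical file name
             '--num-iterations', '2000',   # assumed: ITERATION from the parameter dump
             '--random-seed', '1']         # assumed: SEED1 from the parameter dump

for cmd in (import_cmd, infer_cmd):
    print(' '.join(shlex.quote(part) for part in cmd))
```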


In [7]:
# Write the inference log to a file, so that MALLET's very long log is not printed in the notebook.
#with open('../models/wwi/log.txt', 'w') as f:
#    f.write(capt.stdout)

### 2.2 By Region

**Check contents:**

In [8]:
path = r'../data/dataset/sample/subset/regions/regions.csv'
print('Dataset size:', subprocess.check_output(['wc','-l', path]).split()[0].decode('utf-8'))
pd.read_table(path, header=None, nrows=5).head()

Dataset size: 891892


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\" at the nominal rate of Threepence for ... |
| 1 | 1854221 | ORIGINAL POETRY. | ORIGINAL POETRY.:- ' FAREWELL T() ENGLAND./ t . Farewell, to happy England ! . *, , For other lands I roam,' /'• To ... |
| 2 | 1854224 | OUR OWN RIVER-ORUAWHARO ! | OUR OWN RIVER-ORUAWHARO !There was heard a song ou the'chiming sea, A mingled breathing of hopo and glee ; W Voices ... |
| 3 | 1854233 | Page 1 Advertisements Column 2 | TVT-OTR PAPER, Bill Paper, Envelopes _LV Memorandum Books, Pens, Ink, &c., on sale at the \ Gazette Office.\" O-OPE ... |
| 4 | 1854245 | Page 1 Advertisements Column 1 | NOTICE.—Tim Newspaper may bs sent Free by Post(roithin Seven days of date,) to any part of Great Britain, New Zealan... |


**Inferring:**

In [9]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/regions/' -p 'infer'
#%%capture capt

InputFile=../data/dataset/sample/subset/regions/regions.csv
OutputDir=../models/regions/
Process=infer
TrainDir=../models/train
Inferencer=../models/train/inferencer.model
CORES=12
SEED1=1
SEED2=1
TOPICS=200
ITERATION=2000
INTERVAL=40
BURNIN=300
IDFMIN=0
IDFMAX=8
08:46:11 :: Start import dataset...
 Rewriting extended pipe from ../models/train/import.model
  Instance ID = 18d98806287d7863:3c05b5f:16884606bae:-7ffa
Import new data for inferring.
08:52:50 :: Imported.
08:52:50 :: Start prune model...
08:55:08 :: Pruned.
08:55:08 :: Start infering dataset...
09:35:14 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.0-8.0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000
82000
83000
84000
85000
86000
87000
88000
89000
90000
91000
92000
93000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139

CPU times: user 60 ms, sys: 56 ms, total: 116 ms
Wall time: 49min 3s
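
Every run above reports `idf range = 0.0-8.0`, which reflects the `IDFMIN=0` and `IDFMAX=8` settings. As a rough illustration of what such a range means, the sketch below assumes the common definition idf = ln(N / df), where N is the number of documents and df is the number of documents containing a word. Whether MALLET's pruning uses exactly this formula is an assumption here; the function names are our own.

```python
import math

NUM_DOCS = 1814086  # size of the training set reported earlier


def idf(num_docs, doc_freq):
    """Inverse document frequency with natural log (assumed definition)."""
    return math.log(num_docs / doc_freq)


def kept(num_docs, doc_freq, idf_min=0.0, idf_max=8.0):
    """True if a word with this document frequency falls inside the idf range."""
    return idf_min <= idf(num_docs, doc_freq) <= idf_max


# Smallest document frequency that still satisfies idf <= 8 under this definition.
min_doc_freq = math.ceil(NUM_DOCS / math.exp(8.0))
print(min_doc_freq, kept(NUM_DOCS, min_doc_freq), kept(NUM_DOCS, min_doc_freq - 1))
```

Under this reading, the upper bound removes very rare words (often OCR noise), while the lower bound of 0 keeps even words that occur in every document.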


In [10]:
# Write the inference log to a file, so that MALLET's very long log is not printed in the notebook.
#with open('../models/regions/log.txt', 'w') as f:
#    f.write(capt.stdout)

### 2.3 By Label

**Check contents:**

In [11]:
path = r'../data/dataset/sample/subset/ads/ads.csv'
print('Dataset size:', subprocess.check_output(['wc','-l', path]).split()[0].decode('utf-8'))
pd.read_table(path, header=None, nrows=5).head()

Dataset size: 504386


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1854215 | Page 1 Advertisements Column 1 | v-/ .ADVERTISEMENTS. •- I Advertisements will he inserted in the y \Gazette\" at the nominal rate of Threepence for ... |
| 1 | 1854233 | Page 1 Advertisements Column 2 | TVT-OTR PAPER, Bill Paper, Envelopes _LV Memorandum Books, Pens, Ink, &c., on sale at the \ Gazette Office.\" O-OPE ... |
| 2 | 1854245 | Page 1 Advertisements Column 1 | NOTICE.—Tim Newspaper may bs sent Free by Post(roithin Seven days of date,) to any part of Great Britain, New Zealan... |
| 3 | 1854264 | Page 2 Advertisements Column 1 | NOTICE is hereby given that in case the following persons neglectto fulfill1 the Conditions on which all allotments ... |
| 4 | 1854289 | Page 1 Advertisements Column 1 | NOTICE.—This Newspaper may be sent Free ly Post (within Sevsn days of date,) to any part of Great Britain, New Zeala... |


**Inferring:**

In [12]:
%%time
%%bash -s $path
#! /bin/bash

bash ./model.sh -i $1 -o '../models/ads/' -p 'infer'
#%%capture capt

InputFile=../data/dataset/sample/subset/ads/ads.csv
OutputDir=../models/ads/
Process=infer
TrainDir=../models/train
Inferencer=../models/train/inferencer.model
CORES=12
SEED1=1
SEED2=1
TOPICS=200
ITERATION=2000
INTERVAL=40
BURNIN=300
IDFMIN=0
IDFMAX=8
09:35:16 :: Start import dataset...
 Rewriting extended pipe from ../models/train/import.model
  Instance ID = 18d98806287d7863:3c05b5f:16884606bae:-7ffa
Import new data for inferring.
09:40:18 :: Imported.
09:40:18 :: Start prune model...
09:41:29 :: Pruned.
09:41:29 :: Start infering dataset...
10:10:28 :: Inferred.


Training portion = 1.0
Validation portion = 0.0
Testing portion = 0.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.0-8.0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000
82000
83000
84000
85000
86000
87000
88000
89000
90000
91000
92000
93000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139

CPU times: user 56 ms, sys: 24 ms, total: 80 ms
Wall time: 35min 11s
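
The three inference runs can be compared by throughput. The sketch below uses only the dataset sizes and wall times printed above, and treats each line counted by `wc -l` as one document.

```python
# (number of documents, wall time in seconds) for each inference run above.
runs = {
    'wwi': (340011, 15 * 60 + 6),
    'regions': (891892, 49 * 60 + 3),
    'ads': (504386, 35 * 60 + 11),
}

# Documents processed per second, including import and pruning time.
throughput = {name: docs / seconds for name, (docs, seconds) in runs.items()}
for name, rate in throughput.items():
    print(f'{name}: {rate:.0f} docs/s')
```

Throughput differs between subsets, which may reflect differences in document length. That explanation is a guess and would need checking against the data.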


In [13]:
# Write the inference log to a file, so that MALLET's very long log is not printed in the notebook.
#with open('../models/ads/log.txt', 'w') as f:
#    f.write(capt.stdout)
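
Since the three inference cells differ only in the subset name, they could be driven from one loop. The sketch below builds the same `model.sh` calls as above for each subset without running them; the helper name `infer_command` is hypothetical.

```python
import shlex

subsets = ['wwi', 'regions', 'ads']


def infer_command(name):
    """Build the model.sh inference call for one subset, following the cells above."""
    return ['bash', './model.sh',
            '-i', f'../data/dataset/sample/subset/{name}/{name}.csv',
            '-o', f'../models/{name}/',
            '-p', 'infer']


commands = [infer_command(name) for name in subsets]
for cmd in commands:
    print(' '.join(shlex.quote(part) for part in cmd))
```

Passing each list to `subprocess.run` would execute the runs one after another, which matches how the cells were run here.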

---

In [14]:
gc.collect()

0