In [1]:
import logging
from gensim.models import EnsembleLda, LdaMulticore
from gensim.corpora import OpinosisCorpus
import os

Enable the ensemble logger to show what it is currently doing.

In [2]:
elda_logger = logging.getLogger(EnsembleLda.__module__)
elda_logger.setLevel(logging.INFO)
elda_logger.addHandler(logging.StreamHandler())

# Experiments on the Opinosis Dataset

Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Because it is so small, topics learned from it vary considerably between training runs.

[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions,_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348.
http://kavita-ganesan.com/opinosis-opinion-dataset/ https://github.com/kavgan/opinosis

## Preparing the corpus

First, download the Opinosis dataset. On Linux, for example, it can be done like this:

In [3]:
!mkdir -p ~/opinosis
!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip
!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis
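
On systems without `wget` and `unzip`, the same steps can be sketched with the Python standard library. `download_and_extract` is a hypothetical helper written for this notebook, not part of gensim; it skips the download if the archive is already present.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, target_dir):
    """Download a zip archive from `url` into `target_dir` and unpack it there."""
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last component of the URL.
    archive = target / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)

    # Unpack everything next to the archive, like `unzip ... -d target`.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Calling `download_and_extract('https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip', '~/opinosis')` should leave the dataset in the same place as the shell commands above.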

In [4]:
path = os.path.expanduser('~/opinosis/')

The corpus and the id2word mapping can be created with the OpinosisCorpus class provided in the package.
It preprocesses the data using the PorterStemmer and stopwords from the nltk package.

The class takes the path to the folder into which the zip file was extracted. That folder contains a 'summaries-gold' subfolder.

In [5]:
opinosis = OpinosisCorpus(path)
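
To make the two attributes used later (`corpus` and `id2word`) more concrete, here is a toy sketch of a bag-of-words corpus and its id-to-word mapping in plain Python. It is not the OpinosisCorpus implementation: `build_bow` and the tiny stopword set are hypothetical, and stemming is left out for brevity.

```python
from collections import Counter

# Hypothetical miniature stopword list, only for this illustration.
STOPWORDS = {"the", "is", "and", "a", "was", "very"}


def build_bow(documents):
    """Return (corpus, id2word) for a list of raw text documents."""
    word2id = {}
    id2word = {}
    corpus = []
    for text in documents:
        # Lowercase, split on whitespace, drop stopwords.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        # Assign a new integer id to every word seen for the first time.
        for token in tokens:
            if token not in word2id:
                word2id[token] = len(word2id)
                id2word[word2id[token]] = token
        # Each document becomes a sorted list of (word_id, count) pairs.
        counts = Counter(word2id[t] for t in tokens)
        corpus.append(sorted(counts.items()))
    return corpus, id2word


docs = ["the battery life is good", "battery was good and the screen is bright"]
corpus, id2word = build_bow(docs)
print(corpus[0])  # [(0, 1), (1, 1), (2, 1)]
print(id2word)    # {0: 'battery', 1: 'life', 2: 'good', 3: 'screen', 4: 'bright'}
```

The topic models consume documents in this `(word_id, count)` form, and the mapping is what turns word ids back into readable (stemmed) words when topics are printed.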

## Training

**Parameters**

**topic_model_kind** 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** reduce the time needed to train the models, as does the **masking_method** 'rank'. ldamulticore cannot fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to reach 95-100% CPU usage on my i5 3470.

Since the corpus is so small, a high **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories, but some of them are quite similar: for example, there are 3 categories about the batteries of portable products, and there are multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories.

The default for **min_samples** would be 64, half the number of models. Since that value returns at most 2 topics, and sometimes none, I set it to 32.

In [6]:
elda = EnsembleLda(corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,
                   passes=20, iterations=100, ensemble_workers=3, distance_workers=4,
                   topic_model_kind='ldamulticore', masking_method='rank', min_samples=32)

Generating 128 topic models...
Spawned worker to generate 42 topic models
Spawned worker to generate 43 topic models
Spawned worker to generate 43 topic models
Generating a 2560 x 2560 asymmetric distance matrix...
Spawned worker to generate 640 rows of the asymmetric distance matrix
Spawned worker to generate 640 rows of the asymmetric distance matrix
Spawned worker to generate 640 rows of the asymmetric distance matrix
Spawned worker to generate 640 rows of the asymmetric distance matrix
The given threshold of 0.11 covered on average 9.9% of tokens
The given threshold of 0.11 covered on average 9.8% of tokens
The given threshold of 0.11 covered on average 9.9% of tokens
The given threshold of 0.11 covered on average 9.8% of tokens
Fitting the clustering model
Generating stable topics
Generating classic gensim model representation based on results from the ensemble
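
The numbers in the log follow from the parameters: 128 models with 20 topics each give 2560 topics, hence the 2560 x 2560 distance matrix. The sketch below reproduces the per-worker counts. `split_evenly` is a hypothetical helper showing one splitting rule that is consistent with the log; it is not necessarily how gensim distributes the work internally.

```python
def split_evenly(total, workers):
    """Split `total` work items across `workers`; later workers absorb the remainder."""
    shares = []
    remaining = total
    for worker_id in range(workers):
        # Give this worker an equal share of whatever is still unassigned.
        share = remaining // (workers - worker_id)
        shares.append(share)
        remaining -= share
    return shares


num_models, num_topics = 128, 20
total_topics = num_models * num_topics

print(total_topics)                   # 2560 rows and columns in the distance matrix
print(split_evenly(num_models, 3))    # [42, 43, 43] models per ensemble worker
print(split_evenly(total_topics, 4))  # [640, 640, 640, 640] rows per distance worker
print(num_models // 2)                # 64, the min_samples default mentioned above
```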


In [7]:
# pretty print, note that the words are stemmed so they appear chopped off
for t in elda.print_topics(num_words=7):
    print('-', t[1].replace('*',' ').replace('"','').replace(' +',','), '\n')

- 0.132 staff, 0.082 servic, 0.081 friendli, 0.075 help, 0.021 good, 0.015 quick, 0.010 expens 

- 0.166 screen, 0.063 bright, 0.040 clear, 0.022 easi, 0.020 touch, 0.015 read, 0.014 new 

- 0.146 free, 0.043 park, 0.034 coffe, 0.031 wine, 0.025 even, 0.025 morn, 0.022 internet 

- 0.142 batteri, 0.099 life, 0.026 short, 0.020 charg, 0.018 kindl, 0.018 good, 0.017 perform 

- 0.123 room, 0.111 clean, 0.050 small, 0.035 comfort, 0.033 bathroom, 0.017 nice, 0.016 size 

- 0.147 seat, 0.086 comfort, 0.058 uncomfort, 0.045 firm, 0.031 long, 0.030 drive, 0.014 time 

- 0.161 mileag, 0.106 ga, 0.056 good, 0.028 expect, 0.019 hard, 0.018 road, 0.018 high 
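
The chain of `replace` calls in the cell above can be traced on a single string. The sample below is hand-written in the `weight*"word" + ...` format that gensim topic strings usually have; it is an illustration, not output captured from the model.

```python
def format_topic(topic_string):
    """Apply the same three replacements as the pretty-print loop."""
    step1 = topic_string.replace('*', ' ')  # 0.132*"staff"  -> 0.132 "staff"
    step2 = step1.replace('"', '')          # 0.132 "staff"  -> 0.132 staff
    step3 = step2.replace(' +', ',')        # staff + 0.082  -> staff, 0.082
    return step3


raw = '0.132*"staff" + 0.082*"servic" + 0.081*"friendli"'
print(format_topic(raw))  # 0.132 staff, 0.082 servic, 0.081 friendli
```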

