# Experimenting with LDA

This notebook contains some very simple experiments with existing LDA implementations from the `gensim` library.

**Author:** Eemeli Saari

**Created:** 10.3.2019

**Edited:** 14.3.2019

---

In [15]:
import sys
import os

In [16]:
import numpy as np
import pandas as pd

In [20]:
sys.path.append('../detector/sHDP')
sys.path.append('../detector/')

Here we're trying to decipher the functionality of the [sHDP](https://github.com/Ardavans/sHDP) Python code.

In [22]:
from HDP import models
from corpora import Corpora



In [24]:
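# Raw string, so the Windows backslashes are not treated as escape sequences
data_path = r'M:\Projects\KeyTopicDetection\parsed'
# Keep only the entries whose name contains 'NIPS'
paths = [os.path.join(data_path, f) for f in os.listdir(data_path) if 'NIPS' in f]
corpora = [Corpora(p, keep_tokens=False).load() for p in paths]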

In [25]:
N_DOCS = sum([c.n_docs for c in corpora])

In [26]:
N_DOCS

8250

The model takes a list of von Mises-Fisher components, which need to be initialized first.
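As a reminder of why the mean direction is normalized before it is passed in: a von Mises-Fisher distribution lives on the unit sphere, and its density depends on a point `x` only through `kappa * mu.dot(x)`. The sketch below is not the sHDP implementation; the function name `vmf_unnormalized_logpdf` and the value of `kappa` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean direction: any non-zero vector scaled to unit length.
mu = rng.random(50)
mu /= np.linalg.norm(mu)

# Hypothetical concentration; larger values pull the mass closer to mu.
kappa = 10.0


def vmf_unnormalized_logpdf(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution up to its normalizing constant."""
    return kappa * np.dot(mu, x)


# Another point on the unit sphere.
x = rng.random(50)
x /= np.linalg.norm(x)

print(np.linalg.norm(mu))
print(vmf_unnormalized_logpdf(mu, mu, kappa), vmf_unnormalized_logpdf(x, mu, kappa))
```

The density is highest at `x == mu` and falls off as the angle between `x` and `mu` grows, which is why these components suit unit-length word vectors.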

In [8]:
from core.core_distributions import vonMisesFisherLogNormal

In [12]:
np.random.seed(42)
num_dim = 50
K = 40

d = np.random.rand(num_dim,)
d = d/np.linalg.norm(d)
obs_hypparams = dict(mu_0=d,C_0=1,m_0=2,sigma_0=0.25)
%time components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]

None None [0.09998718 0.25380256 0.195413   0.15981779 0.04165071 0.04164428
 0.01550599 0.23123426 0.16047358 0.18902696 0.00549523 0.258927
 0.22222877 0.05668602 0.04853997 0.04896164 0.08122047 0.1400889
 0.11531198 0.07774649 0.16334016 0.03723926 0.07799089 0.09780391
 0.12175238 0.20961047 0.05330489 0.13727995 0.15815091 0.0124004
 0.16219009 0.0455231  0.01736616 0.25331435 0.25778499 0.21580964
 0.08131965 0.02607453 0.18266275 0.11750305 0.03257931 0.13219236
 0.00918035 0.24275205 0.06908387 0.17686686 0.08321435 0.13883729
 0.1459497  0.04934872] 1 2 0.25

Some of the model's hyperparameters (`alpha` and `gamma`) are also set here.

In [27]:
hdp = models.HDP(alpha=1, gamma=2, obs_distns=components, num_docs=N_DOCS)

However, the data that is fed to the model is in pickled format, which causes two issues:

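1. You should not unpickle third-party files on your computer, because unpickling can execute arbitrary code.
2. A pickle can depend on the environment it was created in (for example the operating system or the Python version), so it might not load elsewhere.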
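For the first issue, one cautious option is to look at a pickle's opcodes before loading it. The sketch below uses the standard library's `pickletools.genops`, which parses a pickle without executing it. The helper name `risky_opcodes` and the choice of which opcodes to flag are assumptions of this sketch, and an empty result is a hint rather than a guarantee of safety.

```python
import pickle
import pickletools

# Opcodes that can import or call objects while a pickle is being loaded.
RISKY = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
         "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def risky_opcodes(data):
    """Return the sorted names of risky opcodes found in a pickle byte string."""
    found = {op.name for op, arg, pos in pickletools.genops(data)}
    return sorted(found & RISKY)


# Plain containers of strings and numbers need no imports or calls.
plain = pickle.dumps({"word": [1, 2, 3]}, protocol=2)

# A function is pickled by reference, which requires an import on load.
by_reference = pickle.dumps(len, protocol=2)

print(risky_opcodes(plain))
print(risky_opcodes(by_reference))
```

A pickle that holds only plain data should report nothing here, while anything that references importable objects will be flagged for a closer look.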

But since I'll need to find out what the data looks like:

In [30]:
import pickle

In [33]:
with open('../detector/sHDP/data/nips/texts.pk', 'rb') as f:
    texts = pickle.load(f)

UnpicklingError: the STRING opcode argument must be quoted

It seems that the problem is the second one: the pickle was apparently created on Ubuntu and does not load in this (Windows) environment.

> Using VM Ubuntu 18.04 I was able to open the content.
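One common cause of this particular error message, which may or may not be what happened here, is a text-mode (protocol 0) pickle whose line endings were converted from `\n` to `\r\n`, for example by a version control tool on Windows. The sketch below reproduces the error with a tiny hand-written pickle; the byte string and the helper `try_load` are made up for illustration and have nothing to do with the actual `texts.pk` file.

```python
import pickle

# Hand-written protocol-0 pickle of the string 'hello', using the STRING
# opcode ('S') that Python 2 emits for str objects.
unix_bytes = b"S'hello'\np0\n."

# The same bytes after a Unix-to-Windows line-ending conversion.
windows_bytes = unix_bytes.replace(b"\n", b"\r\n")


def try_load(data):
    """Unpickle data, returning the error text instead of raising."""
    try:
        return pickle.loads(data)
    except pickle.UnpicklingError as exc:
        return "UnpicklingError: {}".format(exc)


print(try_load(unix_bytes))
print(try_load(windows_bytes))

# Undoing the conversion makes the pickle loadable again.
repaired = windows_bytes.replace(b"\r\n", b"\n")
print(try_load(repaired))
```

Replacing `\r\n` blindly is only reasonable for protocol-0 pickles; in a binary protocol the same byte pair could be part of the data. Pickles written by Python 2 may additionally need `encoding='latin1'` when loaded in Python 3.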

The content of `texts.pk` and `wordvec.pk` is easy to understand: a bag-of-words (BOW) representation of the text documents and a word-vector dictionary.
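The exact layout of the two files is not reproduced in this notebook, so the following is only a toy illustration of what a bag-of-words document and a word-vector dictionary can look like. The tokens, the `(word, count)` pair layout, and the 3-dimensional vectors are all assumptions and are not taken from the sHDP data.

```python
from collections import Counter

import numpy as np

# A made-up tokenized document.
tokens = ["topic", "model", "topic", "word", "model", "topic"]

# Bag of words: word order is discarded, only the counts remain.
bow = sorted(Counter(tokens).items())

# Toy word-vector dictionary; each vector is scaled to unit length,
# since von Mises-Fisher components are defined on the unit sphere.
rng = np.random.default_rng(0)
wordvec = {}
for word, _count in bow:
    v = rng.normal(size=3)
    wordvec[word] = v / np.linalg.norm(v)

print(bow)
print({word: np.round(vec, 3) for word, vec in wordvec.items()})
```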