<a href="https://colab.research.google.com/github/NikolasGialitsis/LDA2vec/blob/master/lda2vec_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup and Preprocessing

Install Dependencies



In [0]:
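# The requirements file and the %cd below assume the repository is already
# cloned to /content/lda2vec; uncomment the next line on a fresh runtime.
#!git clone https://github.com/maxent-ai/lda2vec
# Install spaCy before downloading its English model
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install jellyfish
!pip install -r /content/lda2vec/requirements.txt
%cd /content/lda2vec/
!pip install pylda2vec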

Import Libraries

In [0]:
import logging
import os
import os.path
import pickle
import shelve
import time

import chainer
from chainer import cuda
from chainer import serializers
import chainer.optimizers as O
import numpy as np
from sklearn.datasets import fetch_20newsgroups

from lda2vec import utils
from lda2vec import prepare_topics, print_top_words_per_topic, topic_coherence
from lda2vec import LDA2Vec
from lda2vec import preprocess, Corpus
# Note: in the library's preprocess.py the model load was changed to
# nlp = spacy.load("en_core_web_sm") and the `import ... as en` line was
# removed, to work around the "en not found" error.

logging.basicConfig()

**CUDA usage**\
 *To run this cell, set 'Hardware accelerator' to 'GPU' in the Colab Notebook Settings.*

In [0]:
gpu_id = int(os.getenv('CUDA_GPU', 0))
cuda.get_device(gpu_id).use()
print("Using GPU:" + str(gpu_id))

Using GPU:0
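
The cell above reads the GPU index from an environment variable and falls back to a default. The notebook uses the same pattern later for `n_topics`, `power` and `pretrained`. Below is a minimal sketch of that pattern in isolation. The helper names `env_int` and `env_flag` and the variable `CUDA_GPU_EXAMPLE_UNSET` are hypothetical and not part of lda2vec or Chainer.

```python
import os

def env_int(name, default):
    """Read an integer setting from the environment, with a fallback."""
    return int(os.getenv(name, default))

def env_flag(name, default):
    """Read a 0/1 flag. Going through int() matters: bool("0") would be True."""
    return bool(int(os.getenv(name, int(default))))

# Simulate settings exported before the notebook was launched
os.environ["n_topics"] = "12"
os.environ["pretrained"] = "0"

n_topics = env_int("n_topics", 20)
pretrained = env_flag("pretrained", True)
gpu_id = env_int("CUDA_GPU_EXAMPLE_UNSET", 0)  # hypothetical unset variable, so the default is used
print(n_topics, pretrained, gpu_id)
```

Environment values always arrive as strings, which is why the flag is converted with `int` before `bool`.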


**Fetch Data**\
Define the substrings that mark junk tokens\
>>**TODO**: replace the texts with the COVID-19 collection

In [0]:

# Fetch data
remove = ('headers', 'footers', 'quotes')
# TODO: replace with the COVID-19 collection
texts = fetch_20newsgroups(subset='train', remove=remove).data
# Remove tokens with these substrings
bad = set(["ax>", '`@("', '---', '===', '^^^'])


def clean(line):
    return ' '.join(w for w in line.split() if not any(t in w for t in bad))
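
A quick check of what `clean` does, on a made-up line of text: it drops every whitespace-separated token that contains one of the `bad` substrings, not just the substring itself.

```python
# Same filter as in the cell above, applied to an invented sample line
bad = {"ax>", '`@("', '---', '===', '^^^'}

def clean(line):
    return ' '.join(w for w in line.split() if not any(t in w for t in bad))

sample = "Header ----- keep this === line ^^^^ intact"
result = clean(sample)
print(result)
```

Here `-----` and `^^^^` are removed as well, because they contain `---` and `^^^` respectively.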
    

Preprocessing


*   set the maximum number of words per document
*   remove junk tokens (any token containing a substring from the 'bad' set above) and drop documents left empty



In [0]:
# Preprocess data
max_length = 10000   # Limit of 10k words per document
# Strip junk tokens and drop documents that end up empty
# (Python 3 strings are already unicode, which is what spaCy expects)
texts = [d for d in (clean(t) for t in texts) if len(d) > 0]


Compute the tokens (array) and the vocabulary (dictionary)\
OR\
load them if they were already computed and saved

In [0]:
tokens, vocab = preprocess.tokenize(texts, max_length, merge=False, n_threads=4)
#tokens = np.load("tokens.npy")
#vocab = np.load("vocab.npy", allow_pickle=True)  # a pickled dict needs allow_pickle=True on recent NumPy

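Run the next cell only if `vocab` was loaded from disk (it converts the loaded array back into a dictionary)

In [0]:
if isinstance(vocab, np.ndarray):
    vocab = vocab.tolist()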
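
The conversion is needed because a dictionary written with `np.save` comes back as a zero-dimensional object array rather than a `dict`. A small sketch with a made-up vocabulary and a temporary file (the file location is only for illustration):

```python
import os
import tempfile

import numpy as np

vocab = {0: "the", 1: "topic", 2: "model"}  # toy vocabulary, made up for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.npy")
    np.save(path, vocab)                       # stored as a 0-d object array
    loaded = np.load(path, allow_pickle=True)  # pickled objects need allow_pickle=True

print(type(loaded).__name__, loaded.shape)
restored = loaded.tolist()                     # 0-d object array -> the original dict
print(restored == vocab)
```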

Bag-of-Words and Normalization
* remove very rare words
* downsample frequent words
* build a compact index, without gaps for words absent from the dataset
* flatten to a 1D array of tokens

In [0]:
corpus = Corpus()
# Make a ranked list of rare vs frequent words
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=30)
# Convert the compactified arrays into bag of words arrays
bow = corpus.compact_to_bow(pruned)
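# Words tend to have power law frequency, so selectively
# downsample the most prevalent words.
# (Named `subsampled` so it does not shadow the clean() function defined
# earlier; note that the flattening below uses `pruned`, not this result.)
subsampled = corpus.subsample_frequent(pruned)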
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
doc_ids = np.arange(pruned.shape[0])
flattened, (doc_ids,) = corpus.compact_to_flat(pruned, doc_ids)
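
To make the steps in the cell above concrete, here is a plain-Python sketch of the same idea on two toy documents: rank tokens by frequency to get a gap-free index, drop rare tokens, and flatten to two aligned 1D lists. It is an illustration only, not the `Corpus` implementation; how the library represents skipped and out-of-vocabulary tokens is not modeled, and the token ids are made up.

```python
from collections import Counter

# Toy documents of sparse token ids (made-up numbers)
docs = [[907, 15, 15, 42], [15, 42, 42, 3001]]
min_count = 2

# 1. Count tokens and rank them: the most frequent token gets compact id 0
counts = Counter(tok for doc in docs for tok in doc)
ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
compact_id = {tok: i for i, tok in enumerate(ranked)}

# 2. Drop rare tokens and 3. flatten into aligned 1D lists
flattened, doc_ids = [], []
for doc_id, doc in enumerate(docs):
    for tok in doc:
        if counts[tok] >= min_count:
            flattened.append(compact_id[tok])
            doc_ids.append(doc_id)

print(compact_id)
print(flattened)
print(doc_ids)
```

Each position in `flattened` has a matching entry in `doc_ids`, which is the layout the training loop iterates over.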

Pre-trained Word Vectors\
**TODO**: add pre-trained words when given

In [0]:
assert flattened.min() >= 0
# Fill in the pretrained word vectors
n_dim = 300
fn_wordvc = '../../../../Downloads/vectors/GoogleNews-vectors-negative300.bin'
vectors, s, f = corpus.compact_word_vectors(vocab, filename=fn_wordvc)

Save all files (uncomment the `vectors` line if pre-trained vectors are included)

In [0]:
# Save all of the preprocessed files
with open('vocab.pkl', 'wb') as fh:
    pickle.dump(vocab, fh)
with open('corpus.pkl', 'wb') as fh:
    pickle.dump(corpus, fh)
np.save("flattened", flattened)
np.save("doc_ids", doc_ids)
np.save("pruned", pruned)
np.save("bow", bow)
#np.save("vectors", vectors)
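
Before relying on the saved files in a later session, it can be worth checking that an object survives the round trip. A minimal sketch with a toy dictionary and a temporary directory (both are stand-ins, not the notebook's real data or paths):

```python
import os
import pickle
import tempfile

vocab = {0: "the", 1: "topic"}  # toy stand-in for the real vocabulary

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pkl")
    # The with-block closes (and flushes) the file even if dump() raises
    with open(path, "wb") as fh:
        pickle.dump(vocab, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == vocab)
```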

In [0]:
print(tokens.shape)
print(vocab)

In [0]:
#data_dir = os.getenv('data_dir', '../data/')
fn_vocab = 'vocab.pkl'
fn_corpus = 'corpus.pkl'
fn_flatnd = 'flattened.npy'
fn_docids = 'doc_ids.npy'
#fn_vectors = 'vectors.npy'
with open(fn_vocab, 'rb') as fh:
    vocab = pickle.load(fh)
with open(fn_corpus, 'rb') as fh:
    corpus = pickle.load(fh)
flattened = np.load(fn_flatnd)
doc_ids = np.load(fn_docids)
#vectors = np.load(fn_vectors)

## Defining Model Parameters



>> Parameters 
1.   Number of documents
2.   Number of unique words
3.   Strength of the Dirichlet prior
4.   Number of topics
5.   Batch size
6.   Power for negative sampling
7.   Flag for initializing from pre-trained word vectors
8.   Sampling temperature
9.   Number of dimensions of the word vectors
10.  Number of tokens in each document
11.  Token frequencies
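
The negative-sampling power (item 6) is easiest to see on numbers. Assuming the usual word2vec convention, in which a word is drawn as a negative sample with probability proportional to its count raised to that power, a value of 0.75 shifts some probability mass from very frequent words to rare ones. The counts below are made up, and `sampling_probs` is a hypothetical helper, not an lda2vec function.

```python
counts = {"the": 100, "topic": 10, "dirichlet": 1}  # made-up frequencies
power = 0.75

def sampling_probs(counts, power):
    """Normalize count ** power into a probability distribution."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

raw = sampling_probs(counts, 1.0)       # plain unigram distribution
smoothed = sampling_probs(counts, power)
for w in counts:
    print(f"{w:10s} raw={raw[w]:.3f} power={smoothed[w]:.3f}")
```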



In [0]:
# Model Parameters
# Number of documents
n_docs = doc_ids.max() + 1
# Number of unique words in the vocabulary
n_vocab = flattened.max() + 1
# 'Strength' of the Dirichlet prior; 200.0 seems to work well
# (defined here but not referenced by the training cell in this notebook)
clambda = 200.0
# Number of topics to fit
n_topics = int(os.getenv('n_topics', 20))
batchsize = 4096
# Power for neg sampling
power = float(os.getenv('power', 0.75))
# Initialize with pretrained word vectors
pretrained = bool(int(os.getenv('pretrained', True)))
# Sampling temperature
temperature = float(os.getenv('temperature', 1.0))
# Number of dimensions in a single word vector
n_units = int(os.getenv('n_units', 300))
# Get the string representation for every compact key
words = corpus.word_list(vocab)[:n_vocab]
# How many tokens are in each document
doc_idx, lengths = np.unique(doc_ids, return_counts=True)
doc_lengths = np.zeros(doc_ids.max() + 1, dtype='int32')
doc_lengths[doc_idx] = lengths
# Count all token frequencies
tok_idx, freq = np.unique(flattened, return_counts=True)
term_frequency = np.zeros(n_vocab, dtype='int32')
term_frequency[tok_idx] = freq
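
The last two blocks of the cell above use `np.unique(..., return_counts=True)` and then scatter the counts into a zero-filled array, so that ids which never occur still get an entry of 0. A small example with made-up arrays:

```python
import numpy as np

doc_ids = np.array([0, 0, 0, 2, 2])     # document 1 lost all of its tokens
flattened = np.array([1, 0, 1, 3, 1])   # compact word ids, aligned with doc_ids

n_docs = doc_ids.max() + 1
n_vocab = flattened.max() + 1

# Tokens per document
doc_idx, lengths = np.unique(doc_ids, return_counts=True)
doc_lengths = np.zeros(n_docs, dtype='int32')
doc_lengths[doc_idx] = lengths          # ids that never occur keep a count of 0

# Occurrences per word id
tok_idx, freq = np.unique(flattened, return_counts=True)
term_frequency = np.zeros(n_vocab, dtype='int32')
term_frequency[tok_idx] = freq

print(doc_lengths.tolist())
print(term_frequency.tolist())
```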

In [0]:
for key in sorted(locals().keys()):
    val = locals()[key]
    if len(str(val)) < 100 and '<' not in str(val):
        print(key, val)

## Training the LDA2vec Model

In [0]:
model = LDA2Vec(n_documents=n_docs, n_document_topics=n_topics,
                n_units=n_units, n_vocab=n_vocab, counts=term_frequency,
                n_samples=15, power=power, temperature=temperature)

Reload a saved model if one exists, then initialize the word vectors from the pre-trained embeddings

In [0]:
if os.path.exists('lda2vec.hdf5'):
    print("Reloading from saved")
    serializers.load_hdf5("lda2vec.hdf5", model)

if pretrained:
    model.sampler.W.data[:, :] = vectors[:n_vocab, :]

Move the model to the GPU and set up the Adam optimizer with gradient clipping

In [0]:
model.to_gpu()
optimizer = O.Adam()
optimizer.setup(model)
clip = chainer.optimizer.GradientClipping(5.0)
optimizer.add_hook(clip)
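
The `GradientClipping(5.0)` hook rescales the gradients whenever their joint L2 norm exceeds the threshold. The plain-Python sketch below illustrates that rule on made-up numbers; it is not Chainer's code, and `clip_by_global_norm` is a hypothetical helper.

```python
import math

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient vectors so their joint L2 norm is <= threshold."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [[g * scale for g in vec] for vec in grads], norm

grads = [[3.0, 4.0], [12.0]]          # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, threshold=5.0)
new_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, round(new_norm, 6))
```

Gradients whose norm is already below the threshold pass through unchanged, so the hook only affects unusually large updates.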

In [0]:
j = 0
epoch = 0
fraction = batchsize * 1.0 / flattened.shape[0]
progress = shelve.open('progress.shelve')

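Training with CUDA


*   topic coherence is computed at the start of an epoch, and only when the batch counter `j` is a multiple of 100 greater than 100 (so not during a single-epoch run that starts from `j = 0`)
*   only one epoch is run here
*   the topics file and the model are saved once per epoch
*   each batch is one backpropagation step followed by an optimizer update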



In [0]:
n_epochs = 1
for epoch in range(n_epochs):
    data = prepare_topics(cuda.to_cpu(model.mixture.weights.W.data).copy(),
                          cuda.to_cpu(model.mixture.factors.W.data).copy(),
                          cuda.to_cpu(model.sampler.W.data).copy(),
                          words)
    top_words = print_top_words_per_topic(data)
    if j % 100 == 0 and j > 100:
        coherence = topic_coherence(top_words)
        # Use a separate loop variable so the batch counter j is not overwritten
        for topic_id in range(n_topics):
            print(topic_id, coherence[(topic_id, 'cv')])
        kw = dict(top_words=top_words, coherence=coherence, epoch=epoch)
        progress[str(epoch)] = pickle.dumps(kw)
    data['doc_lengths'] = doc_lengths
    data['term_frequency'] = term_frequency
    np.savez('topics.pyldavis', **data)
    print(epoch)
    for d, f in utils.chunks(batchsize, doc_ids, flattened):
        t0 = time.time()
        model.cleargrads()
        #optimizer.use_cleargrads(use=False)
        l = model.fit_partial(d.copy(), f.copy())
        print("after partial fitting:", l)
        prior = model.prior()
        loss = prior * fraction
        loss.backward()
        optimizer.update()
        msg = ("J:{j:05d} E:{epoch:05d} L:{loss:1.3e} "
               "P:{prior:1.3e} R:{rate:1.3e}")
        prior.to_cpu()
        loss.to_cpu()
        t1 = time.time()
        dt = t1 - t0
        rate = batchsize / dt
        logs = dict(loss=float(l), epoch=epoch, j=j,
                    prior=float(prior.data), rate=rate)
        print(msg.format(**logs))
        j += 1
    serializers.save_hdf5("lda2vec.hdf5", model)
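
The inner loop above walks over `doc_ids` and `flattened` in aligned mini-batches. The sketch below shows that batching on toy lists, under the assumption that `utils.chunks` yields matching slices of each array; the library version may differ, for example by shuffling the batch order. The `chunks` function here is a hypothetical stand-in.

```python
def chunks(n, *arrays):
    """Yield aligned slices of length n (the last one may be shorter)."""
    for start in range(0, len(arrays[0]), n):
        yield [a[start:start + n] for a in arrays]

doc_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
flattened = [4, 1, 7, 1, 3, 0, 1, 4, 4, 2]
batchsize = 4

# Share of the corpus covered by one full batch; the notebook multiplies the
# prior by this value before calling backward()
fraction = batchsize * 1.0 / len(flattened)

batches = list(chunks(batchsize, doc_ids, flattened))
for d, f in batches:
    print(d, f)
print(fraction)
```

Scaling the prior by `fraction` presumably keeps a term that covers all documents from being counted in full on every batch.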

In [0]:
#!zip -r myzip.zip /content/lda2vec
#from google.colab import files
#files.download('myzip.zip')

In [0]:
npz = np.load(open('/content/lda2vec/topics.pyldavis.npz', 'rb'))
dat = {k: v for (k, v) in npz.items()}
dat['vocab'] = dat['vocab'].tolist()

Print top 10 words for each topic

In [52]:
top_n = 10
topic_to_topwords = {}
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'Topic %i '  % j
    top_words = [dat['vocab'][i].strip()[:35] for i in top]
    msg += ' '.join(top_words)
    print(msg)
    topic_to_topwords[j] = top_words

Topic 0 5.25 105 originally exercise 040 chance approved curve tone loud
Topic 1 corporation tissue b toplevel equipped owners continues blood died even
Topic 2 garbage properties glad probes similarly * body horn opposite galaxy
Topic 3 k name tough procedures masters paper his : x/ leap
Topic 4 127 vs. ct significant priest buildings dc affect occurred adjust
Topic 5 check alone batf adl deeds load ammunition movement conflicts accused
Topic 6 cost specified layer peter r5 edmonton consists preserve particularly timer
Topic 7 quite am alcohol yours relation dog uh 05 cartridge performed
Topic 8 90 ") maple licensed /./lib custom interpret vertical among 25.00
Topic 9 voted converter constitution involving ads accurate power leaders teacher loving
Topic 10 odd floor islanders apologize male cooperation evaluation immediately freedom personal
Topic 11 80-bit export 75 wolverine accelerator repair strategy charged a trading
Topic 12 conclusions protocol object process hence pointer writ
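
The ranking in the cell above relies on `np.argsort`, which sorts in ascending order, so the index array is reversed before the first `top_n` entries are taken. A toy example with a made-up vocabulary and one made-up topic row:

```python
import numpy as np

vocab = ["space", "orbit", "hockey", "goal", "launch"]    # toy vocabulary
topic_to_word = np.array([0.30, 0.25, 0.05, 0.02, 0.38])  # one made-up topic row

top_n = 3
# argsort is ascending, so reverse it and keep the first top_n indices
top = np.argsort(topic_to_word)[::-1][:top_n]
top_words = [vocab[i] for i in top]
print(top_words)
```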

## Visualize Topics
>> Lower λ to surface rarer words that appear mainly in the selected topic

In [0]:
#!pip install pyLDAvis
import pyLDAvis
pyLDAvis.enable_notebook()
#warnings.filterwarnings('ignore')
prepared_data = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'],
                                 dat['doc_lengths'] * 1.0, dat['vocab'],
                                 dat['term_frequency'] * 1.0, mds='tsne')

In [58]:
pyLDAvis.display(prepared_data)
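
The effect of the λ slider can be reproduced by hand. LDAvis ranks the terms of a topic by relevance, λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), so λ = 1 ranks by probability within the topic and λ = 0 ranks by lift over the overall word frequency. The sketch below uses made-up probabilities for two words; it is not pyLDAvis code, and `relevance` and `ranking` are hypothetical helpers.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up probabilities: (p(w|t), p(w)) for a common and a topic-specific word
words = {"people": (0.05, 0.04), "ammunition": (0.02, 0.002)}

def ranking(lam):
    """Words sorted from most to least relevant for a given lambda."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

print(ranking(1.0))
print(ranking(0.0))
```

With λ = 1 the common word ranks first; with λ = 0 the topic-specific word overtakes it.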