<div align='left' style="width:29%;overflow:hidden;">
<a href='http://inria.fr'>
<img src='https://github.com/lmarti/jupyter_custom/raw/master/imgs/inr_logo_rouge.png' alt='Inria logo' title='Inria logo'/>
</a>
</div>

# lda2vec

lda2vec is an extension of word2vec and LDA that **jointly learns word, document and topic vectors**.

lda2vec builds on top of the **skip-gram model of word2vec** to generate word vectors.
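
As a reminder of what the skip-gram objective trains on, here is a minimal sketch that enumerates (pivot, context) pairs from a token list. The `skipgram_pairs` helper and the toy sentence are illustrative assumptions and are not part of the lda2vec library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (pivot, context) pairs for every token within `window` of the pivot."""
    pairs = []
    for i, pivot in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((pivot, tokens[j]))
    return pairs


tokens = ["virus", "binds", "host", "cell"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('virus', 'binds'), ('binds', 'virus'), ('binds', 'host'),
#  ('host', 'binds'), ('host', 'cell'), ('cell', 'host')]
```

Each pair is one training example: the model sees the pivot and is asked to predict the context word.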

With lda2vec, instead of using the word vector directly to **predict context words**, we leverage a **context vector** to make the predictions. This context vector is created as the sum of two other vectors: the **word vector** (generated by the skip-gram model) and the **document vector**.

The **document vector is a weighted combination** of topic vectors, built from two components: a **document weight vector**, which holds the weight of each topic in the document, and the **topic matrix**, which holds the vector embedding of each topic.
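
The two sums described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes and values are made up, the names `doc_weights`, `topic_matrix` and `word_vector` are hypothetical, and we assume the document weights are normalized with a softmax so that they behave like topic proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_units = 3, 4

# Hypothetical toy parameters; in lda2vec these are learned during training.
doc_weights = np.array([2.0, 0.5, -1.0])             # unnormalized topic weights of one document
topic_matrix = rng.normal(size=(n_topics, n_units))  # one embedding per topic
word_vector = rng.normal(size=n_units)               # skip-gram vector of the pivot word

# Assumption: a softmax turns the weights into proportions that sum to 1.
exp = np.exp(doc_weights - doc_weights.max())
proportions = exp / exp.sum()

# Document vector: topic embeddings weighted by the document's topic proportions.
doc_vector = proportions @ topic_matrix

# Context vector: the sum of the word vector and the document vector.
context_vector = word_vector + doc_vector

print(proportions.round(3))
print(context_vector.shape)
```

Because the proportions are non-negative and sum to 1, the document vector always lies inside the region spanned by the topic vectors, which is what makes the weights readable as a topic mixture.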

The power of lda2vec lies in the fact that it simultaneously learns word embeddings, document representations and topic representations.

![lda2vec](https://github.com/cemoody/lda2vec/raw/master/lda2vec_network_publish_text.gif)

In this notebook, we'll attempt to train an lda2vec model on the CORD-19 dataset.
First, we'll install the required libraries.

In [None]:
!pip install -q -r requirements.txt
!pip install pyLDAvis tensorflow chainer keras
!pip install --no-cache-dir git+https://github.com/cemoody/lda2vec.git@master#egg=lda2vec

In the following cell we define the lda2vec model and a training function, both based on the examples in the library repository.
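
The training step in the model code slides a window over the flattened token stream and must avoid predicting words across document boundaries. The NumPy sketch below isolates that masking logic. The toy arrays and the `masked_targets` helper are hypothetical, word dropout is left out so the result is deterministic, and we assume the sampler ignores targets equal to -1.

```python
import numpy as np

# Toy corpus flattened into one token stream: two documents of four tokens each.
doc_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
word_ids = np.array([5, 3, 7, 2, 9, 4, 6, 8])
window = 2

# Pivots are the positions that have a full window on both sides.
start, end = window, word_ids.shape[0] - window
doc_at_pivot = doc_ids[start:end]


def masked_targets(frame):
    """Targets `frame` positions away from each pivot; -1 where the target
    belongs to a different document than its pivot."""
    target = word_ids[start + frame:end + frame]
    same_doc = doc_ids[start + frame:end + frame] == doc_at_pivot
    weight = same_doc.astype('int32')
    # weight == 1 keeps the target index, weight == 0 replaces it with -1
    return target * weight + -1 * (1 - weight)


print(masked_targets(-1))  # [ 3  7 -1  9]
print(masked_targets(2))   # [-1 -1  6  8]
```

With `frame=-1`, the third pivot is the first token of document 1, so the word before it belongs to document 0 and is masked.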

In [None]:
# Ref.: https://github.com/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec_model.py
# Ref.: https://github.com/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec_run.py

# Author: Chris Moody <chrisemoody@gmail.com>
# License: MIT

import os
import os.path
import pickle
import time
import shelve

import chainer
import chainer.links as L
import chainer.functions as F
import chainer.optimizers as O
from chainer import cuda
from chainer import serializers
from chainer import Chain
import numpy as np

from lda2vec import utils
from lda2vec.utils import move
from lda2vec import EmbedMixture, prepare_topics, print_top_words_per_topic, topic_coherence, dirichlet_likelihood


class LDA2Vec(Chain):
    def __init__(self, n_documents=100, n_document_topics=10,
                 n_units=256, n_vocab=1000, dropout_ratio=0.5, train=True,
                 counts=None, n_samples=15, word_dropout_ratio=0.0,
                 power=0.75, temperature=1.0):
        em = EmbedMixture(n_documents, n_document_topics, n_units,
                          dropout_ratio=dropout_ratio, temperature=temperature)
        kwargs = {}
        kwargs['mixture'] = em
        kwargs['sampler'] = L.NegativeSampling(n_units, counts, n_samples,
                                               power=power)
        super(LDA2Vec, self).__init__(**kwargs)
        rand = np.random.random(self.sampler.W.data.shape)
        self.sampler.W.data[:, :] = rand[:, :]
        self.n_units = n_units
        self.train = train
        self.dropout_ratio = dropout_ratio
        self.word_dropout_ratio = word_dropout_ratio
        self.n_samples = n_samples

    def prior(self):
        dl1 = dirichlet_likelihood(self.mixture.weights)
        return dl1

    def fit_partial(self, rdoc_ids, rword_indices, window=5,
                    update_only_docs=False):
        doc_ids, word_indices = move(self.xp, rdoc_ids, rword_indices)
        pivot_idx = next(move(self.xp, rword_indices[window: -window]))
        pivot = F.embed_id(pivot_idx, self.sampler.W)
        if update_only_docs:
            pivot.unchain_backward()
        doc_at_pivot = rdoc_ids[window: -window]
        doc = self.mixture(next(move(self.xp, doc_at_pivot)),
                           update_only_docs=update_only_docs)
        loss = 0.0
        start, end = window, rword_indices.shape[0] - window
        context = (F.dropout(doc, self.dropout_ratio) +
                   F.dropout(pivot, self.dropout_ratio))
        for frame in range(-window, window + 1):
            # Skip predicting the current pivot
            if frame == 0:
                continue
            # Predict word given context and pivot word
            # The target starts before the pivot
            targetidx = rword_indices[start + frame: end + frame]
            doc_at_target = rdoc_ids[start + frame: end + frame]
            doc_is_same = doc_at_target == doc_at_pivot
            rand = np.random.uniform(0, 1, doc_is_same.shape[0])
            mask = (rand > self.word_dropout_ratio).astype('bool')
            weight = np.logical_and(doc_is_same, mask).astype('int32')
            # If weight is 1.0 then targetidx
            # If weight is 0.0 then -1
            targetidx = targetidx * weight + -1 * (1 - weight)
            target, = move(self.xp, targetidx)
            loss = self.sampler(context, target)
            loss.backward()
            if update_only_docs:
                # Wipe out any gradient accumulation on word vectors
                self.sampler.W.grad *= 0.0
        return loss.data
    
    
def lda2vec_run():
    gpu_id = int(os.getenv('CUDA_GPU', 0))
    cuda.get_device(gpu_id).use()
    print("Using GPU " + str(gpu_id))

    data_dir = os.getenv('data_dir', '../data/')
    fn_vocab = '{data_dir:s}/vocab.pkl'.format(data_dir=data_dir)
    fn_corpus = '{data_dir:s}/corpus.pkl'.format(data_dir=data_dir)
    fn_flatnd = '{data_dir:s}/flattened.npy'.format(data_dir=data_dir)
    fn_docids = '{data_dir:s}/doc_ids.npy'.format(data_dir=data_dir)
    fn_vectors = '{data_dir:s}/vectors.npy'.format(data_dir=data_dir)
    # Pickle files must be opened in binary mode under Python 3
    with open(fn_vocab, 'rb') as fh:
        vocab = pickle.load(fh)
    with open(fn_corpus, 'rb') as fh:
        corpus = pickle.load(fh)
    flattened = np.load(fn_flatnd)
    doc_ids = np.load(fn_docids)
    vectors = np.load(fn_vectors)

    # Model Parameters
    # Number of documents
    n_docs = doc_ids.max() + 1
    # Number of unique words in the vocabulary
    n_vocab = flattened.max() + 1
    # 'Strength' of the Dirichlet prior; 200.0 seems to work well
    clambda = 200.0
    # Number of topics to fit
    n_topics = int(os.getenv('n_topics', 20))
    batchsize = 4096
    # Power for neg sampling
    power = float(os.getenv('power', 0.75))
    # Initialize with pretrained word vectors
    pretrained = bool(int(os.getenv('pretrained', True)))
    # Sampling temperature
    temperature = float(os.getenv('temperature', 1.0))
    # Number of dimensions in a single word vector
    n_units = int(os.getenv('n_units', 300))
    # Get the string representation for every compact key
    words = corpus.word_list(vocab)[:n_vocab]
    # How many tokens are in each document
    doc_idx, lengths = np.unique(doc_ids, return_counts=True)
    doc_lengths = np.zeros(doc_ids.max() + 1, dtype='int32')
    doc_lengths[doc_idx] = lengths
    # Count all token frequencies
    tok_idx, freq = np.unique(flattened, return_counts=True)
    term_frequency = np.zeros(n_vocab, dtype='int32')
    term_frequency[tok_idx] = freq

    for key in sorted(locals().keys()):
        val = locals()[key]
        if len(str(val)) < 100 and '<' not in str(val):
            print(key, val)

    model = LDA2Vec(n_documents=n_docs, n_document_topics=n_topics,
                    n_units=n_units, n_vocab=n_vocab, counts=term_frequency,
                    n_samples=15, power=power, temperature=temperature)
    if os.path.exists('lda2vec.hdf5'):
        print("Reloading from saved")
        serializers.load_hdf5("lda2vec.hdf5", model)
    if pretrained:
        model.sampler.W.data[:, :] = vectors[:n_vocab, :]
    model.to_gpu()
    optimizer = O.Adam()
    optimizer.setup(model)
    clip = chainer.optimizer.GradientClipping(5.0)
    optimizer.add_hook(clip)

    j = 0
    epoch = 0
    fraction = batchsize * 1.0 / flattened.shape[0]
    progress = shelve.open('progress.shelve')
    for epoch in range(200):
        data = prepare_topics(cuda.to_cpu(model.mixture.weights.W.data).copy(),
                              cuda.to_cpu(model.mixture.factors.W.data).copy(),
                              cuda.to_cpu(model.sampler.W.data).copy(),
                              words)
        top_words = print_top_words_per_topic(data)
        if j % 100 == 0 and j > 100:
            coherence = topic_coherence(top_words)
            # Use a separate index so the batch counter j is not overwritten
            for k in range(n_topics):
                print(k, coherence[(k, 'cv')])
            kw = dict(top_words=top_words, coherence=coherence, epoch=epoch)
            progress[str(epoch)] = pickle.dumps(kw)
        data['doc_lengths'] = doc_lengths
        data['term_frequency'] = term_frequency
        np.savez('topics.pyldavis', **data)
        for d, f in utils.chunks(batchsize, doc_ids, flattened):
            t0 = time.time()
            optimizer.zero_grads()
            l = model.fit_partial(d.copy(), f.copy())
            prior = model.prior()
            loss = prior * fraction
            loss.backward()
            optimizer.update()
            msg = ("J:{j:05d} E:{epoch:05d} L:{loss:1.3e} "
                   "P:{prior:1.3e} R:{rate:1.3e}")
            prior.to_cpu()
            loss.to_cpu()
            t1 = time.time()
            dt = t1 - t0
            rate = batchsize / dt
            logs = dict(loss=float(l), epoch=epoch, j=j,
                        prior=float(prior.data), rate=rate)
            print(msg.format(**logs))
            j += 1
        serializers.save_hdf5("lda2vec.hdf5", model)


In [None]:
%load_ext autoreload
%autoreload 2

import lda2vec
dir(lda2vec)

The error above occurs because the library was written for Python 2.
We need to find a way to reuse it in a Python 3 kernel.