# Hierarchical Topic Modelling

In this notebook we extend our previous topic modelling approaches to model hierarchical topics.
Although single-level topic modelling helps when going through the vast number of papers in the CORD-19 dataset, our hypothesis is that hierarchical topic modelling will provide a much easier way to navigate them.

The following references have caught our attention:

- https://radimrehurek.com/gensim/models/hdpmodel.html: an implementation of Hierarchical Dirichlet Processes (HDP) using the topic modelling library `gensim`.
- https://github.com/joewandy/hlda: an implementation of Hierarchical Latent Dirichlet Allocation (hLDA) in Python.
- https://datascience.stackexchange.com/questions/128/latent-dirichlet-allocation-vs-hierarchical-dirichlet-process: a comparison between LDA and HDP.
- https://developer.squareup.com/blog/inferring-label-hierarchies-with-hlda/: a write-up of Square's experience using hLDA to hierarchically classify customer support articles.
- https://www.hindawi.com/journals/sp/2017/4382348/: a journal article.

LDA models documents as Dirichlet mixtures of a fixed number of topics, which are in turn modelled as Dirichlet mixtures of words.
hLDA is an adaptation of LDA that arranges topics in a tree, so that each topic can be refined into a new, distinct level of more specific topics.
HDP's main difference with respect to LDA is that the number of topics isn't a hyperparameter and is instead inferred from the data. We discard it here because it doesn't build a hierarchical topic structure.
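
To make the LDA description above more concrete, here is a minimal sketch of its generative process using only the Python standard library. The function names, the toy vocabulary and the hyperparameter values (`alpha`, `eta`) are illustrative assumptions, not part of any library used in this notebook.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_lda_corpus(num_docs, doc_length, num_topics, vocab,
                        alpha=0.5, eta=0.5, seed=0):
    """Generate a toy corpus following the LDA generative story (illustrative only)."""
    rng = random.Random(seed)
    # Each topic is a Dirichlet-distributed mixture over the vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Each document is a Dirichlet-distributed mixture over the fixed set of topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a topic for this word position, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[z])[0])
        corpus.append(words)
    return topics, corpus


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["virus", "protein", "vaccine", "hospital", "patient", "model"]
topics, corpus = generate_lda_corpus(num_docs=3, doc_length=8, num_topics=2, vocab=vocab)
```

Note that `num_topics` has to be fixed up front, which is exactly the hyperparameter that HDP does away with.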

We'll first try hierarchical topic modelling with hLDA and then compare it to a manual hierarchical modelling built with LDA.

## Hierarchical Latent Dirichlet Allocation (hLDA)

This technique was presented in the 2004 NeurIPS paper "Hierarchical Topic Models and the Nested Chinese Restaurant Process" by David Blei et al., available at https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf.
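
As a rough illustration of the nested Chinese Restaurant Process named in the paper's title, the sketch below samples a root-to-leaf path in a topic tree for each document. At every level, an existing child is chosen with probability proportional to the number of documents that already passed through it, and a new child is created with probability proportional to `gamma`. This is a simplified sketch of the prior over paths only, not the full hLDA sampler; the function name and the dictionary-based tree representation are assumptions made for this example.

```python
import random


def sample_ncrp_path(tree, num_levels, gamma, rng):
    """Sample one path through the tree and update the visit counts in place.

    `tree` maps a node (a tuple of child indices, with () as the root) to the
    number of documents that have passed through it.
    """
    path = ()
    tree[path] = tree.get(path, 0) + 1  # every document visits the root
    for _ in range(num_levels - 1):
        # Existing children of the current node.
        children = [p for p in tree if len(p) == len(path) + 1 and p[:-1] == path]
        # Existing children are weighted by their counts, a new child by gamma.
        weights = [tree[c] for c in children] + [gamma]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(children):
            child = path + (len(children),)  # open a new branch
        else:
            child = children[choice]
        tree[child] = tree.get(child, 0) + 1
        path = child
    return path


rng = random.Random(0)
tree = {}
# Ten documents, each assigned a path through a three-level tree.
paths = [sample_ncrp_path(tree, num_levels=3, gamma=1.0, rng=rng) for _ in range(10)]
```

Larger values of `gamma` make new branches more likely, which yields wider trees. In hLDA each node on a document's path is then associated with a topic.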

A quick Google search yields at least two implementations:

- https://github.com/blei-lab/hlda: implemented in C by the original authors. Last commit was in 2014.
- https://github.com/joewandy/hlda: implemented in Python. Last commit was in 2017.

We'll use the second one since it publishes a Jupyter Notebook with an example using the library.
First, we'll install it.

In [1]:
!pip install hlda

Collecting hlda
  Downloading https://files.pythonhosted.org/packages/a8/08/6287a6e93906b14d33ea3da2dd099d7a8d2f70ca270ca6fa5c0595b52919/hlda-0.2.tar.gz
Installing collected packages: hlda
  Running setup.py install for hlda ... done
Successfully installed hlda-0.2
You are using pip version 19.0.3, however version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [None]:
from risotto.references 

CORD19_DATASET_FOLDER = "./datasets/CORD-19-research-challenge"