# Analyzing the COVID-19 corpus with LDA and PCoA 

In this notebook, we fit Latent Dirichlet Allocation (LDA) to infer high-level topics and visualize them <br>
to get a sense of the general structure of the CORD-19 corpus. <br>
Based on a principal coordinate analysis (PCoA) of the Jensen-Shannon divergence between topics, <br>
the two main axes along which the topics are organized appear to be:
- PC1: To what extent the topic focuses on the propagation of the virus vs. the virus itself;
- PC2: To what extent the topic is related to individuals.

## Reading the data

In [None]:
import numpy as np
import pandas as pd
import json
import os

# Collect every JSON file of the dataset.
datafiles = []
for dirname, _, filenames in os.walk("/kaggle/input/CORD-19-research-challenge/"):
    for filename in filenames:
        if filename.endswith(".json"):
            datafiles.append(os.path.join(dirname, filename))

doc_ids = []
titles = []
abstracts = []
bodytexts = []
for file in datafiles:
    with open(file, 'r') as f:
        doc = json.load(f)
    doc_ids.append(doc['paper_id'])
    titles.append(doc['metadata']['title'])
    # Join paragraphs with a space so words at paragraph boundaries do not fuse.
    # .get() guards against files that have no 'abstract' field.
    abstracts.append(' '.join(item['text'] for item in doc.get('abstract', [])))
    bodytexts.append(' '.join(item['text'] for item in doc['body_text']))

# Only titles and abstracts are used for topic modeling below.
texts = np.array([t + ' ' + a for t, a in zip(titles, abstracts)], dtype='object')

## Preprocessing the text

We merge two-word phrases based on the rough approximation of pointwise mutual information proposed by <br>
Mikolov et al. in "Distributed Representations of Words and Phrases and their Compositionality".

In [None]:
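from nltk.tokenize import word_tokenize
# gensim.sklearn_api was removed in gensim 4.0, so this import needs gensim 3.x.
from gensim.sklearn_api.phrases import PhrasesTransformer

# Merge word pairs that occur at least 20 times and score above 100.
bigrams = PhrasesTransformer(min_count=20, threshold=100)
processed_texts = bigrams.fit_transform([word_tokenize(text) for text in texts])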
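
To make the merging criterion concrete, here is a minimal pure-Python sketch of the score from Mikolov et al., (count(a b) - min_count) / (count(a) * count(b)), scaled by the vocabulary size as we believe gensim's default scorer does. The helper `phrase_scores` and the toy sentences are illustrative assumptions, not part of the pipeline above, and the vocabulary is simplified to the set of distinct unigrams. Pairs whose score exceeds the threshold would be merged into a single token.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score each adjacent word pair with the Mikolov et al. heuristic."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be considered a phrase
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# Toy corpus (hypothetical): "public health" co-occurs far more often than any other pair.
toy_sentences = [
    ["public", "health", "response"],
    ["public", "health", "policy"],
    ["public", "health", "data"],
    ["health", "of", "the", "public"],
]
scores = phrase_scores(toy_sentences, min_count=1)
best_pair = max(scores, key=scores.get)
```

Subtracting `min_count` in the numerator discounts pairs that are seen only a few times, so rare coincidental pairs do not receive a high score.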

## Fitting LDA

We fit LDA with k=8 topics, a document-topic prior alpha=1/k and a topic-word prior beta=0.1, using the batch implementation in scikit-learn.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

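# Drop English stop words, terms found in more than 10% of the documents,
# and terms found in fewer than 20 documents.
vectorizer = CountVectorizer(stop_words="english", lowercase=True, max_df=0.1, min_df=20)
# CountVectorizer expects strings, so the token lists are joined back together.
dtm = vectorizer.fit_transform([" ".join(text) for text in processed_texts])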
k = 8
lda = LatentDirichletAllocation(n_components=k, 
                                learning_method="batch", 
                                doc_topic_prior=1/k,
                                topic_word_prior=0.1,
                                n_jobs=-1,
                                random_state=0)
lda.fit(dtm)
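
A quick way to inspect a fitted model is to list the highest-weighted words of each topic. The sketch below assumes a topic-word weight matrix shaped like `lda.components_` (one row per topic) and a vocabulary list such as the one returned by `vectorizer.get_feature_names_out()` in recent scikit-learn versions. The helper `top_words` and the toy numbers are hypothetical.

```python
def top_words(topic_word, vocab, n=3):
    """Return the n highest-weighted words for each topic row."""
    result = []
    for row in topic_word:
        # Indices of the vocabulary sorted by decreasing weight in this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result

# Toy weights (hypothetical): 2 topics over a 5-word vocabulary.
toy_vocab = ["transmission", "protein", "outbreak", "genome", "patients"]
toy_topic_word = [
    [5.0, 0.1, 4.0, 0.2, 1.0],
    [0.1, 6.0, 0.3, 5.5, 0.2],
]
toy_top = top_words(toy_topic_word, toy_vocab, n=2)
```

With the fitted model, the call would look like `top_words(lda.components_, vectorizer.get_feature_names_out(), n=10)`.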

## Visualizing the topics

We compute a 2D map of the topics by multidimensional scaling (PCoA), based on the Jensen-Shannon divergence between the <br>
per-topic distributions over words, using the implementation in pyLDAvis.

On this map, the two axes can be read as follows:
- PC1: To what extent the topic focuses on the propagation of the virus vs. the virus itself;
- PC2: To what extent the topic is related to individuals.

In [None]:
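# Newer pyLDAvis releases (3.4+, to our knowledge) expose this module as pyLDAvis.lda_model.
from pyLDAvis import enable_notebook, display
# The alias avoids shadowing scikit-learn's own `sklearn` package name.
from pyLDAvis import sklearn as pyldavis_sklearn

enable_notebook()
topic_model_info = pyldavis_sklearn.prepare(lda, dtm, vectorizer, mds='PCoA')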
display(topic_model_info)
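
For reference, the distance underlying the map can be sketched in a few lines. The Jensen-Shannon divergence between two word distributions p and q is the average of their Kullback-Leibler divergences to the mixture m = (p + q) / 2. The pure-Python helpers and toy distributions below are hypothetical and use natural logarithms; the pyLDAvis implementation may differ in its details.

```python
import math

def kl_divergence(p, q):
    """KL(p || q), skipping terms where p is zero (they contribute nothing)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-word distributions (hypothetical): topics 0 and 1 are similar, topic 2 is not.
toy_topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
# Pairwise divergence matrix, the kind of input a PCoA embedding starts from.
js_matrix = [[js_divergence(p, q) for q in toy_topics] for p in toy_topics]
```

Topics with similar word distributions get a small divergence and therefore end up close together on the 2D map, while dissimilar topics are pushed apart.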