In [11]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [13]:

# Load data
columns = ['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',
           'license', 'abstract', 'publish_time']
metadata = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', usecols=columns)
# Replace missing values with empty strings so the text steps below always receive a str
metadata.fillna('', inplace=True)



In [14]:
# Preprocess text data
def preprocess_text(text):
    return [token for token in simple_preprocess(text) if len(token) > 3]

data_words = metadata['abstract'].apply(preprocess_text)


Latent Dirichlet Allocation (LDA) is a probabilistic generative model: it assumes that each document was produced by first sampling a mixture of topics and then sampling every word from one of those topics. The key quantities in LDA are as follows:

P(w | z, β): Probability of word w given its topic assignment z and the per-topic word distributions β.
P(z | d, θ): Probability of topic z given document d and its topic distribution θ.
P(β): Prior probability of the word distributions β (a Dirichlet prior in standard LDA).
P(θ): Prior probability of the topic distribution θ (also a Dirichlet prior in standard LDA).
For a single document d with word positions n = 1, ..., N, these factors combine into the joint probability distribution over the words and all the latent variables in the model, which is given by:

P(w, z, θ, β | d) = P(β) * P(θ) * Π_n [ P(w_n | z_n, β) * P(z_n | d, θ) ]

The goal of inference in LDA is to compute the posterior distribution over the latent variables (θ, β, and the per-word topic assignments z), given a set of observed documents. This involves computing the conditional distribution over the topic assignment of each word in each document, given the other words in the document and the parameters of the model. Because the exact posterior is intractable, it is approximated with techniques such as variational inference or Gibbs sampling.

Overall, LDA provides a flexible framework for uncovering the latent topics in a large collection of text: each document is summarized by its topic mixture, and each topic by its most probable words.
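
The generative story described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names, the toy vocabulary, and the symmetric prior values `alpha` and `eta` are hypothetical and are not taken from gensim or from this notebook's trained model.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(vocab, num_topics, doc_length, alpha=0.5, eta=0.5, seed=0):
    """Generate one toy document following the LDA generative process."""
    rng = random.Random(seed)
    # beta[k] is the word distribution of topic k, drawn from the prior P(beta)
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    # theta is this document's topic distribution, drawn from the prior P(theta)
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words, topics = [], []
    for _ in range(doc_length):
        z = rng.choices(range(num_topics), weights=theta)[0]  # z ~ P(z | d, theta)
        w = rng.choices(vocab, weights=beta[z])[0]            # w ~ P(w | z, beta)
        topics.append(z)
        words.append(w)
    return words, topics, theta, beta


toy_vocab = ["virus", "vaccine", "patient", "model", "data", "cell"]
words, topics, theta, beta = generate_document(toy_vocab, num_topics=3, doc_length=20)
print(words)
print([round(p, 3) for p in theta])
```

Inference runs this story in reverse: only `words` is observed, and the model has to recover plausible values of `theta`, `beta`, and `topics`.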

In [15]:

# Create dictionary and corpus for LDA
dictionary = gensim.corpora.Dictionary(data_words)
corpus = [dictionary.doc2bow(doc) for doc in data_words]
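
To make the bag-of-words representation concrete, here is a small standard-library stand-in for what the dictionary and `doc2bow` step produces: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocabulary` and `to_bow` and the toy documents are hypothetical; gensim's `Dictionary` may assign different ids to the same tokens.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["virus", "protein", "virus"], ["vaccine", "dose", "virus"]]
toy_token2id = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is discarded at this point, which is why LDA is described as a bag-of-words model.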


In [16]:

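# Compute TF-IDF scores
# Note: this matrix is not used by the LDA model below, which is trained on the
# bag-of-words corpus built in the previous cell.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(metadata['abstract'])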



In [17]:
# Train LDA model
num_topics = 12
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=dictionary,
                                       num_topics=num_topics, 
                                       random_state=42,
                                       passes=10,
                                       workers=2)


In [18]:

# Print top topics and their associated words
for i, topic in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print('Topic {}: {}'.format(i, ', '.join([word for word, _ in topic])))



Topic 0: with, were, that, food, this, from, used, high, study, water
Topic 1: with, health, that, covid, their, pandemic, this, study, social, were
Topic 2: that, cells, cell, with, immune, expression, response, infection, this, inflammatory
Topic 3: with, patients, were, cancer, patient, after, treatment, this, case, surgery
Topic 4: were, with, covid, study, from, between, results, during, among, risk
Topic 5: covid, this, that, health, pandemic, have, public, from, disease, been
Topic 6: patients, covid, with, disease, severe, respiratory, clinical, sars, infection, acute
Topic 7: sars, that, with, virus, protein, viral, this, from, viruses, human
Topic 8: model, data, that, this, based, with, using, models, from, used
Topic 9: vaccine, vaccination, influenza, vaccines, sars, against, antibody, dose, infection, antibodies
Topic 10: care, were, with, patients, health, clinical, studies, patient, this, healthcare
Topic 11: sars, were, positive, with, samples, testing, detection, test
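
Many of the top words above ("with", "that", "this", "were", "from") are function words, because the preprocessing step filters only on token length. One possible remedy is to drop stop words before building the dictionary. The sketch below assumes a tiny illustrative stop list copied from the topic output and a regex tokenizer as a stand-in for `simple_preprocess`; the function name is hypothetical. In practice a fuller list, such as gensim's `gensim.parsing.preprocessing.STOPWORDS`, could be passed as `stopwords` instead.

```python
import re

# Illustrative stop list taken from the topic output above; far from exhaustive.
EXAMPLE_STOPWORDS = frozenset(
    {"with", "were", "that", "this", "from", "have", "been", "their"}
)


def preprocess_without_stopwords(text, stopwords=EXAMPLE_STOPWORDS, min_len=4):
    """Lowercase, keep alphabetic tokens of at least min_len characters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]


sample = "Patients with severe disease were admitted from this hospital."
print(preprocess_without_stopwords(sample))
# ['patients', 'severe', 'disease', 'admitted', 'hospital']
```

Retraining on tokens filtered this way would change the topics, so the output shown above would no longer apply.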

In [19]:
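# Compute coherence score
# Note: 'u_mass' averages log conditional co-occurrence probabilities, so its scores are
# normally negative (closer to 0 is better); a positive score is more typical of coherence='c_v'.
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words,
                                     dictionary=dictionary, coherence='u_mass')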
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score:', coherence_lda)

Coherence Score: 0.509997433570809
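
To give a feel for what a UMass-style coherence score measures, the following standard-library sketch scores one topic's top words by how often they co-occur in the same documents. It is a simplified illustration, not gensim's implementation: the function name, the smoothing constant `epsilon`, and the plain average over word pairs are assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs, epsilon=1e-12):
    """Average log((D(w_i, w_j) + epsilon) / D(w_j)) over pairs with w_j ranked above w_i."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_i, w_j) + epsilon) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


toy_docs = [
    ["virus", "protein", "cell"],
    ["virus", "protein"],
    ["vaccine", "dose"],
    ["virus", "vaccine"],
]
print(umass_coherence(["virus", "protein"], toy_docs))    # log(2/3), about -0.405
print(umass_coherence(["vaccine", "protein"], toy_docs))  # strongly negative: never co-occur
```

Because a pair can co-occur in at most as many documents as contain the higher-ranked word, each term is at most about zero, so scores of this kind are read as "closer to zero is more coherent".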
