# Topic Modeling Applied to the COVID-19 Dataset

**Developer: Ivair Puerari**


Topic modeling is a machine learning problem that aims to extract, from a collection of documents, the main topics that represent the subjects the collection covers [Blei, 2012].

*  Dataset: COVID-19 Open Research Dataset Challenge (CORD-19).
*  Topic model: Latent Dirichlet Allocation (LDA).



In [0]:
# Load the Drive helper and mount
from google.colab import drive

#This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import sys
import nltk
import string

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


nltk.download('wordnet')
lemma = WordNetLemmatizer()

stemmer = nltk.stem.PorterStemmer()

exclude_punctuation = set(string.punctuation)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



### Pre-Processing

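In LDA, documents are represented as bags of words.

To get there, we preprocess the documents by removing punctuation, numbers, short words (three characters or fewer) and stop words, and by reducing plurals and other inflected forms through lemmatization and stemming.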
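
As a minimal sketch of what a bag-of-words representation looks like, the snippet below counts tokens with the standard library's `collections.Counter`. The helper name `bag_of_words` and the sample tokens are illustrative assumptions, not part of the notebook, which builds this representation later with Gensim's `doc2bow`.

```python
from collections import Counter

def bag_of_words(tokens):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(tokens)

# Hypothetical, already-tokenized document (not taken from CORD-19).
sample = "virus cell protein virus cell virus".split()
bow = bag_of_words(sample)
print(bow.most_common())  # [('virus', 3), ('cell', 2), ('protein', 1)]
```

Only the counts survive: two documents with the same words in a different order get the same representation.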


In [0]:
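def clean(doc):
    # `stop` is the stop-word set built in a later cell; it must exist before clean() is called.
    # 1. Remove punctuation characters.
    removed_punctuation = ''.join(ch for ch in doc if ch not in exclude_punctuation)

    # 2. Drop purely numeric tokens and strip digits from the remaining ones.
    removed_digits = " ".join(''.join(ch for ch in w if not ch.isdigit())
                              for w in removed_punctuation.split() if not w.isdigit())

    # 3. Keep only words longer than three characters.
    remove_words = " ".join(w for w in removed_digits.split() if len(w) > 3)

    # 4. Lowercase and remove stop words.
    remove_stops = " ".join(w for w in remove_words.lower().split() if w not in stop)

    # 5. Lemmatize each word (for example, plural to singular).
    normalized = " ".join(lemma.lemmatize(word) for word in remove_stops.split())

    # 6. Stem each word.
    stemmed = " ".join(stemmer.stem(word) for word in normalized.split())

    # 7. Remove stop words again, since stemming can produce new matches.
    result = " ".join(w for w in stemmed.lower().split() if w not in stop)

    return result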

**Stop words** are words so common in the language, or in the subject area, that they carry little topical information.
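
The filtering idea can be sketched in a few lines. The stop-word set and the helper `remove_stop_words` below are hypothetical and only illustrate the principle; the notebook itself combines NLTK's English list with a custom file.

```python
# Hypothetical stop-word set for illustration only.
example_stop = {"the", "of", "in", "and", "is"}

def remove_stop_words(text, stop_words):
    """Lowercase the text and drop every token found in stop_words."""
    return [w for w in text.lower().split() if w not in stop_words]

tokens = remove_stop_words("The spread of the virus in the population", example_stop)
print(tokens)  # ['spread', 'virus', 'population']
```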

In [0]:
FileWithStopWords = '/content/drive/My Drive/Colab Notebooks/StopWords.txt'

In [0]:
nltk.download('stopwords')
stop = set(stopwords.words('english'))

with open(FileWithStopWords,'r',  encoding='UTF-8') as fr:
  for line in fr:
    stop.add(line.replace('\n', ''))
stop

In [0]:
fileToRead =  '/content/drive/My Drive/Colab Notebooks/covid_clean_zzz_1.txt'
fileToWrite = '/content/drive/My Drive/Colab Notebooks/covid_clean_zzz_2.txt'

In [0]:
# The with-blocks close both files automatically, so no explicit close() is needed.
with open(fileToWrite, 'w', encoding="utf-8") as fw:
    with open(fileToRead, 'r', encoding="utf-8") as fr:
        for line in fr:
            fw.write(clean(line) + '\n')

In [0]:
fileFinal =  '/content/drive/My Drive/Colab Notebooks/covid_clean_zzz_2.txt'

In [0]:
words = []

# Each line is one cleaned document; split it into a list of tokens.
with open(fileFinal, 'r', encoding='UTF-8') as fr:
    for line in fr:
        words.append(line.replace('\n', '').split())

In [0]:
words[0]

### Build the Topic Model

The model is built with the Gensim library.

In [0]:
import gensim

from gensim import corpora

from gensim.models import CoherenceModel

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

! pip install pyLDAvis
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim

*   **Create Dictionary**

In [0]:
noBelow = round((len(words) * 0.20))

dictionaryDocs = corpora.Dictionary(words)
dictionaryDocs.filter_extremes(no_below=noBelow, no_above=0.8, keep_n=10000)
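
To make the effect of `no_below` and `no_above` concrete, here is a plain-Python sketch of document-frequency filtering. It only approximates what `filter_extremes` does (it ignores `keep_n` and token ids), and `filter_by_doc_frequency` and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents and in at
    most `no_above` (a fraction) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus of five tokenized documents.
docs = [
    ["virus", "cell", "protein"],
    ["virus", "vaccin", "immun"],
    ["virus", "cell", "model"],
    ["virus", "vaccin", "popul"],
    ["virus", "cell", "epidem"],
]
kept = filter_by_doc_frequency(docs, no_below=2, no_above=0.8)
print(sorted(kept))  # ['cell', 'vaccin']
```

In the toy corpus, "virus" is dropped because it occurs in every document, and the words seen only once are dropped as too rare. With the notebook's settings, a word must appear in at least 20% and at most 80% of the documents to stay in the dictionary.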

*   **Create Corpus: Term Document Frequency**

In [0]:
doc_term_matrix = [dictionaryDocs.doc2bow(w) for w in words]

SOME_FIXED_SEED = 100

np.random.seed(SOME_FIXED_SEED)

* **Model**

In [0]:
ldamodel = gensim.models.LdaMulticore(corpus=doc_term_matrix,
                                        id2word=dictionaryDocs,
                                        num_topics=8, 
                                        random_state=SOME_FIXED_SEED,
                                        iterations = 500,
                                        alpha = 0.01,
                                        passes=50)

In [0]:
import pickle
with open('/content/drive/My Drive/Colab Notebooks/model_covid_8', 'wb') as handle:
    pickle.dump(ldamodel, handle)

* **Analysis of Topics**


* Eight topics were extracted from the texts about COVID-19.
* Each topic is printed as a sum of terms of the form [(weight of the word in the topic) * word].
* The words are ordered by weight, descending.
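
Strings in this format can be split back into (word, weight) pairs when the numbers are needed for further processing. The helper `parse_topic` below is a hypothetical sketch whose regular expression assumes exactly the format shown in the printed topics; Gensim's `show_topics(formatted=False)`, used later in this notebook, returns the pairs directly.

```python
import re

def parse_topic(topic_string):
    """Turn '0.126*"cell" + 0.047*"protein"' into [('cell', 0.126), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of Topic 0 as printed by the notebook.
topic0 = '0.126*"cell" + 0.047*"protein" + 0.038*"express"'
parsed = parse_topic(topic0)
print(parsed)  # [('cell', 0.126), ('protein', 0.047), ('express', 0.038)]
```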

In [0]:
for idx, topic in ldamodel.print_topics(-1):
  print("Topic: {} \nWords: {}".format(str(idx), [topic]))
  print("\n")

Topic: 0 
Words: ['0.126*"cell" + 0.047*"protein" + 0.038*"express" + 0.030*"viru" + 0.026*"viral" + 0.016*"inhibit" + 0.015*"bind" + 0.013*"receptor" + 0.013*"interact" + 0.013*"induc"']


Topic: 1 
Words: ['0.112*"vaccin" + 0.070*"antibodi" + 0.070*"immun" + 0.043*"antigen" + 0.034*"viru" + 0.027*"protect" + 0.025*"strain" + 0.025*"serum" + 0.022*"neutral" + 0.020*"cell"']


Topic: 2 
Words: ['0.036*"detect" + 0.035*"assay" + 0.024*"concentr" + 0.020*"incub" + 0.015*"determin" + 0.013*"measur" + 0.011*"wash" + 0.011*"acid" + 0.011*"cell" + 0.010*"temperatur"']


Topic: 3 
Words: ['0.095*"viru" + 0.043*"influenza" + 0.039*"respiratori" + 0.032*"viral" + 0.027*"detect" + 0.026*"pathogen" + 0.022*"child" + 0.022*"human" + 0.018*"transmiss" + 0.016*"symptom"']


Topic: 4 
Words: ['0.043*"model" + 0.018*"rate" + 0.016*"popul" + 0.015*"individu" + 0.011*"measur" + 0.010*"epidem" + 0.010*"predict" + 0.009*"transmiss" + 0.009*"period" + 0.008*"higher"']


Topic: 5 
Words: ['0.058*"sequenc" +


* LDAvis is a web-based visualization system with two core functionalities that enable users to understand the topic-term relationships in a fitted LDA model, and a number of extra features that provide additional perspectives on the model [Sievert and Shirley, 2014].


In [0]:
from matplotlib import pyplot as plt

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionaryDocs)

pyLDAvis.save_html(vis,'/content/drive/My Drive/Colab Notebooks/vis.html')
vis

* Coherence metric: C_v [Röder, Both and Hinneburg, 2015].



*   The model has a C_v coherence score of about 57.14%.



In [0]:
coherencemodel = CoherenceModel(model=ldamodel, texts=words, dictionary=dictionaryDocs, coherence='c_v')
print('\nCoherence Score: ', coherencemodel.get_coherence())


Coherence Score:  0.571411587762478


* Which topics are predominant (the top 5 hot topics)?
 




In [0]:
topics = ldamodel.show_topics(num_topics=11,num_words=10, formatted=False)

In [0]:
doctop=[]
for i in range(len(doc_term_matrix)):
   doctop.append(ldamodel.get_document_topics(doc_term_matrix[i],minimum_probability=.5))
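
With `minimum_probability=.5`, a document contributes a topic only when that topic accounts for at least half of it. The sketch below mimics that filter on hand-written distributions; `dominant_topics` and the sample values are assumptions for illustration, not Gensim code.

```python
def dominant_topics(doc_topics, minimum_probability=0.5):
    """Keep the (topic_id, probability) pairs at or above the threshold."""
    return [(t, p) for t, p in doc_topics if p >= minimum_probability]

# Hypothetical document-topic distributions (each sums to 1).
focused = [(0, 0.10), (3, 0.85), (6, 0.05)]
mixed = [(1, 0.40), (4, 0.35), (5, 0.25)]

print(dominant_topics(focused))  # [(3, 0.85)]
print(dominant_topics(mixed))    # []
```

Documents like `mixed` have no dominant topic, so they add nothing to the per-topic tally.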

In [0]:
tp={}
for i in range(len(doctop)):
  for j in range(len(doctop[i])):
      topic_id = doctop[i][j][0]
      # Count one more document for this topic (start from 0 if unseen).
      tp[topic_id] = tp.get(topic_id, 0) + 1
soma=0
for k in tp:
   soma+=tp[k]

In [0]:
tp

{0: 3565, 1: 551, 2: 1144, 3: 2025, 4: 1800, 5: 2404, 6: 3615, 7: 1969}
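
The counts above can be ranked to find the predominant topics. The snippet is a small sketch that copies the printed dictionary instead of recomputing it; the names `total` and `top5` are introduced here for illustration.

```python
# Document counts per dominant topic, copied from the output above.
tp = {0: 3565, 1: 551, 2: 1144, 3: 2025, 4: 1800, 5: 2404, 6: 3615, 7: 1969}

total = sum(tp.values())
# Sort topic ids by count, largest first, and keep the top five.
top5 = sorted(tp, key=tp.get, reverse=True)[:5]

for topic_id in top5:
    share = 100 * tp[topic_id] / total
    print("Topic {}: {} documents ({:.1f}%)".format(topic_id, tp[topic_id], share))
```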

* Hot topics about COVID-19:
1.   Topic 6
2.   Topic 0
3.   Topic 5
4.   Topic 3
5.   Topic 7

Topic **6** ( **Risk Public Health** ) | Topic 0 ( Viral expression ) | Topic 5 ( Viral Genomes ) | Topic 3 ( Respiratory viruses ) | Topic 7 ( Respiratory failure )
--- | --- | --- | --- | ---
"health" | "cell" | "sequence" | "virus" | "blood"
"public" | "protein" | "gene" | "influenza" | "lung"
"countries" | "express" | "protein" | "respiratory" | "tissue"
"risk" | "virus" | "virus" | "viral" | "therapies"
"care" | "viral" | "genome" | "detect" | "organ"
"provide" | "inhibit" | "structure" | "pathogen" | "acute"
"emergency" | "bind" | "strain" | "child" | "sign"
"hospital" | "receptor" | "mutation" | "human" | "chronic"
"community" | "interact" | "viral" | "transmiss" | "diagnosis"
"outbreak" | "induced" | "acid" | "symptom" | "bacterial"
 