---
#### **Task used as a query for external validation:**
*(https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=561).*



---
# Imports and Drive Mount Function:


In [1]:
#@markdown Numpy is imported for large arrays and math functions. <br /> Pandas is a data analysis library. <br /> Glob is used to recursively find files matching a defined pattern. <br /> Json is used for reading .json files. <br /> NLTK (Natural Language Toolkit) is used for various NLP practices. <br /> GenSim is used to import the Doc2Vec model, alongside the TaggedDocument object.
import numpy as np
import pandas as pd
# import glob
import json
import nltk
from gensim import corpora
from gensim.models import Doc2Vec
from gensim.models import ldamodel
from gensim.models.doc2vec import TaggedDocument
import time

In [2]:
#@markdown Function to mount a Google Drive location to the Colab runtime for file access. <br /> When called, the function returns 'root', the path to the project folder on the mounted drive.
def drive_mount():
    """ Mounts Google Drive to runtime and sets 'root' to folder containing pickled data. """
    from google.colab import drive
    drive.mount('/content/drive/', force_remount=True)
    root = "drive/My Drive/Data/project/"
    return root



---
# Initial Data Processing on Kaggle Platform:


In [3]:
#@markdown 'root' variable set to Kaggle's default input location. <br /> Metadata dataframe created from the database metadata file.
# root = '/kaggle/input/CORD-19-research-challenge'
# metadata = f'{root}/metadata.csv'

# metaDF = pd.read_csv(metadata, dtype=str)

In [4]:
#@markdown glob.glob used to find all files in the dataset with the .json extension. <br /> These are the full text documents in the dataset.
# jsonFiles = glob.glob(f'{root}/**/*.json', recursive=True)

In [5]:
#@markdown ReadFile class created to read an individual json file and pull its abstract, full text and id for use in a dataframe.
# class ReadFile:
#     def __init__(self, filePath):
#         with open(filePath) as file:
#             oF = json.load(file)
#             
#             self.abstract = []
#             self.body_text = []
#             self.paper_id = oF['paper_id']
# 
#             for en in oF['abstract']:
#                 self.abstract.append(en['text'])
#             for en in oF['body_text']:
#                 self.body_text.append(en['text'])
#                 
#             self.abstract = '\n'.join(self.abstract)
#             self.body_text = '\n'.join(self.body_text)
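
The extraction step above can be sketched without the dataset. This is a minimal illustration, assuming each file holds a `paper_id` plus `abstract` and `body_text` lists of `{"text": ...}` entries, as the class expects. The `read_paper` helper and the sample record are hypothetical. Unlike the class, it treats a missing `abstract` section as empty text instead of raising an error.

```python
import json

def read_paper(raw_json):
    """Collect the id, abstract and body text from one CORD-19 style JSON string."""
    paper = json.loads(raw_json)
    # .get() with a default tolerates records that have no 'abstract' section.
    abstract = "\n".join(p["text"] for p in paper.get("abstract", []))
    body_text = "\n".join(p["text"] for p in paper.get("body_text", []))
    return {"paper_id": paper["paper_id"], "abstract": abstract, "body_text": body_text}

# Hypothetical, hand-written record used only for illustration.
sample = json.dumps({
    "paper_id": "abc123",
    "abstract": [{"text": "First abstract paragraph."}],
    "body_text": [{"text": "Intro."}, {"text": "Methods."}],
})

record = read_paper(sample)
print(record["body_text"])
```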

In [6]:
#@markdown The previously found json files are iterated through. <br /> The ReadFile class pulls the text of each file, which is combined with the matching row of the metadata file to create an initial dataframe for the dataset.
# cordDict = {'paper_id': [], 'title': [], 'authors': [], 'abstract': [], 'body_text': []}

# for i, en in enumerate(jsonFiles):
#     if i % (len(jsonFiles) // 10) == 0:
#         print(f'Processing index: {i} of {len(jsonFiles)}')
#     
#     try:
#         jsnContent = ReadFile(en)
#     except Exception as e:
#         continue
#     
#     mdContent = metaDF.loc[metaDF['sha'] == jsnContent.paper_id]
#     if len(mdContent) == 0:
#         continue
#     
#     cordDict['paper_id'].append(jsnContent.paper_id)
#     cordDict['title'].append(mdContent['title'])
#     cordDict['authors'].append(mdContent['authors'])
#     cordDict['abstract'].append(jsnContent.abstract)
#     cordDict['body_text'].append(jsnContent.body_text)
#     
#     
# covidDF = pd.DataFrame(cordDict, columns=['paper_id', 'title', 'authors', 'abstract', 'body_text'])
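
The loop above filters the whole metadata table once per json file. One possible alternative, shown here as a sketch and not as the method used in this notebook, is to index the metadata by `sha` once and look each paper up in a dictionary. The two-row CSV, the `by_sha` index and the `lookup` helper are hypothetical stand-ins.

```python
import csv
import io

# Hypothetical two-row stand-in for metadata.csv; only the columns used here.
metadata_csv = io.StringIO(
    "sha,title,authors\n"
    "aaa111,Paper A,\"Smith, J.\"\n"
    "bbb222,Paper B,\"Jones, K.\"\n"
)

# Build the sha -> row index once, so each paper is a dictionary lookup
# rather than a scan over the whole metadata table.
by_sha = {row["sha"]: row for row in csv.DictReader(metadata_csv)}

def lookup(paper_id):
    """Return (title, authors) as plain strings, or None when the sha is unknown."""
    row = by_sha.get(paper_id)
    if row is None:
        return None
    return row["title"], row["authors"]

print(lookup("aaa111"))
print(lookup("missing"))
```

Storing plain strings this way would also change how titles are read back later, so the display code would need to match.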

In [7]:
#@markdown Dataframe saved by 'pickling', which serialises it to a single file so that the set can be transferred to a different environment. <br /> The resulting file is 2GB, down from 15GB.
# covidDF.to_pickle('CORD-19_Processed.pkl')



---
# Data Processing from Generated Pickle:


In [8]:
#@markdown Drive mounting function is called. <br /> The processed dataframe is opened from the pickle file on the drive.
root = drive_mount()
covidDF = pd.read_pickle(root + 'CORD-19_Processed.pkl')

Mounted at /content/drive/


In [9]:
#@markdown The first five rows of the dataframe.
covidDF.head()

Unnamed: 0,paper_id,title,authors,abstract,body_text
0,31977b1c9042daf1c57cd6fe348a00cf885a8bc1,220950 Is Returning to Work during the COVI...,"220950 Tan, Wanqiu; Hao, Fengyi; McIntyre, ...",,An outbreak of the Coronavirus Disease 2019 oc...
1,9a933dd382a3e5b553f6a9a2efbbb55546c62761,11695 A summary of second systemic pulmonar...,"11695 Yang, Xue-Yong; Jing, Xiao-Yong; Chen...",Background: There has been an increasing numbe...,With the steady improvement of the comprehensi...
2,8a33ea505c5bd08d6923f01a7839b19404584400,191431 First case of COVID-19 complicated w...,"191431 Zeng, Jia-Hui; Liu, Ying-Xia; Yuan, ...",Background Coronavirus disease 2019 has been d...,Background A series of unexplained pneumonia c...
3,832a043240a9b1025dbb011e9f413ce79c5a3ad1,28370 High-resolution CT features of COVID-...,"28370 Omar, Suzan; Motawea, Abdelghany Moha...",Background: Coronavirus (COVID-19) pneumonia e...,"On December 31, 2019, the World Health Organiz..."
4,a3eeaa271941b86bb7865a44add8d480ec3b2c9f,5577 Antibacterial activity of two phlorogl...,"5577 Lee, Hyang Burm; Kim, Jin Cheol; Lee, ...",The antimicrobial effect of solvent extracts f...,The thick-stemmed wood fern (Dryopteris crassi...


In [10]:
#@markdown Displays information about what is held in the current dataframe.
covidDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86632 entries, 0 to 86631
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   86632 non-null  object
 1   title      86632 non-null  object
 2   authors    86632 non-null  object
 3   abstract   86632 non-null  object
 4   body_text  86632 non-null  object
dtypes: object(5)
memory usage: 3.3+ MB


In [11]:
#@markdown Ten query documents are sampled from the original dataframe and then dropped from it, so that no query can also appear in the training set. <br /> A training sample of sampleSize entries is then drawn from the remaining documents into a new dataframe. <br /> The random_state argument is used to ensure reproducibility. <br /> The sampleSize variable is reused later when naming the saved model. <br /> Query documents are sampled before the training documents to ensure that the queries are the same across different sample sizes.
sampleSize = 5000

cQD = covidDF.sample(10, random_state=404)
covidDF = covidDF.drop(cQD.index)

cDF = covidDF.sample(sampleSize, random_state=404)

del covidDF

# For the external data test
with open(root + "ExternalQuery.txt") as ExtQuery:
    eQD = ExtQuery.read()
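
Why sampling the queries first keeps them fixed can be shown with the standard library alone. This is a sketch under assumptions: `split_queries` is a hypothetical helper, and `random.Random` stands in for the `random_state` argument of the pandas sampler, so the ids it draws will not match the notebook's.

```python
import random

def split_queries(ids, n_queries, sample_size, seed=404):
    """Draw the queries first, then draw the training sample from what is left."""
    queries = random.Random(seed).sample(ids, n_queries)
    query_set = set(queries)
    remaining = [i for i in ids if i not in query_set]
    training = random.Random(seed).sample(remaining, sample_size)
    return queries, training

ids = list(range(1000))  # hypothetical document ids
q_small, t_small = split_queries(ids, 10, 50)
q_large, t_large = split_queries(ids, 10, 500)

# The query draw never depends on sample_size, so both runs share the same queries.
print(q_small == q_large)
```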

In [12]:
#@markdown The first five rows of the new dataframe, distinctly different from the original.
cDF.head()

Unnamed: 0,paper_id,title,authors,abstract,body_text
69147,9241e0b97dce927be4592622e554fe81af5bfb2f,163773 Lack of MERS Coronavirus but Prevale...,"163773 Gautret, Philippe; Charrel, Rémi; Be...",,To the Editor: Saudi Arabia has reported the h...
78698,a08eaa8b4c47649f772cc42a869995b4884c07af,210509 Reducing droplet spread during airwa...,"210509 Au Yong, Phui S.; Chen, Xuanxuan Nam...",,Reducing droplet spread during airway manipula...
16316,c7f27f99be34d2c419e37fb6955eda022b0db893,178736 Phosphonic Acid Analogs of Fluorophe...,"178736 Wanat, Weronika; Talma, Michał; Dziu...",A library of phosphonic acid analogs of phenyl...,Aminopeptidases are a heterogeneous group of e...
47627,b1b3f0b30a6d0a58b198bc7d1f0f055c2d4645ca,6659 Development of reliable artificial liv...,"6659 Yoshiba, Makoto; Sekiyama, Kazuhiko; I...",A new artificial liver support system (ALSS) c...,"tients include exchange transfusion (ET) (1), ..."
14269,ea763d76c18c169dbea960bd078460b0e60445fb,222799 Equine arteritis virus: An overview ...,"222799 Chirnside, Ewan D. Name: authors, dt...",,Although equine viral arteritis has been known...


In [13]:
#@markdown The first five rows of the query dataframe.
cQD.head()

Unnamed: 0,paper_id,title,authors,abstract,body_text
45131,b7cf1edf68fb68d779471c95de07c5a4e61de6ea,17167 The Physical Burden of Immunopercepti...,"17167 Saghazadeh, Amene; Rezaei, Nima Name:...",The previous chapter introduced the ImmunoEmot...,"such as pemphigus [10, 11] . Further, human st..."
39378,44b33d82fccc1ac19cdc1456ad3cb083ea81d8eb,187841 Fewer cancer diagnoses during the CO...,"187841 Dinmohamed, Avinash G; Visser, Otto;...",,The dreadful consequences of coronavirus disea...
17681,2b1ca06156430eb4efd335dddc7062716b338399,192076 Repurposing and reshaping of hospita...,"192076 Her, Minyoung Name: authors, dtype: ...",During the extensive outbreak of coronavirus d...,The fear of coronavirus disease 2019 (COVID-19...
53007,89e162115392f120533361ea7ed71657dfb6e05e,203956 Porcine arterivirus activates the NF...,"203956 Lee, Sang-Myeong; Kleiboeker, Steven...",Nuclear factor-kappaB (NF-nB) is a critical re...,"PRRSV is an enveloped, positive-stranded RNA v..."
46746,8b53beb6dc64c3e96b829e264601b4774ad8deff,214611 A Mathematical Model for the Coverag...,"214611 Araújo, Eliseu J.; Chaves, Antônio A...",The Coverage Location Problem (CLP) seeks the ...,of additional facilities is excessive in some ...


In [14]:
#@markdown Null values are dropped. <br /> In this case the non-null count of the sampled data has not changed, so there were no null values present.
cDF.dropna(inplace=True)
cDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 69147 to 35172
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   5000 non-null   object
 1   title      5000 non-null   object
 2   authors    5000 non-null   object
 3   abstract   5000 non-null   object
 4   body_text  5000 non-null   object
dtypes: object(5)
memory usage: 234.4+ KB


In [15]:
#@markdown Install the language detection library to the colab environment.
!pip install langdetect



In [16]:
#@markdown After import, the seed is set to 0 so that detection gives the same results each time the program is run. <br /> The abstract of each entry in the dataframe is used in an attempt to detect its language; if detection fails on the abstract alone, the full text is used instead.
from langdetect import detect
from langdetect import DetectorFactory
DetectorFactory.seed = 0

def DetectLanguage(cDF):
    """Attempts to detect the language of each paper in the dataset."""
    language = []
    for i in range(0, len(cDF)):
      try:
        # Try the abstract first
        lg = detect(cDF.iloc[i]['abstract'])
      except Exception:
        try:
          # Fall back to the full text when the abstract cannot be classified
          lg = detect(cDF.iloc[i]['body_text'])
        except Exception:
          # "u" marks papers whose language could not be detected
          lg = "u"
      language.append(lg)
    return language


lang = DetectLanguage(cDF)
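
The fallback order can be exercised without installing langdetect. In this sketch `detect_with_fallback` and `toy_detector` are hypothetical: the toy detector only imitates a detector that raises on text it cannot classify, and its language rule is deliberately crude.

```python
def detect_with_fallback(abstract, body_text, detector, unknown="u"):
    """Try the abstract first, then the full text, else report 'unknown'."""
    for text in (abstract, body_text):
        try:
            return detector(text)
        except Exception:
            continue
    return unknown

def toy_detector(text):
    """Hypothetical stand-in for a real detector: fails on empty text."""
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if " the " in f" {text.lower()} " else "xx"

print(detect_with_fallback("", "The virus spread quickly.", toy_detector))
print(detect_with_fallback("", "", toy_detector))
```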

In [17]:
#@markdown Language detection results are then added to the dataframe, and non-English results are omitted. <br /> This is done to later improve training results, as there are significantly more English documents in the dataset sample than any other language. <br /> The information shows that 149 papers have been omitted (5000 before filtering, 4851 after).
cDF['language'] = lang
del lang
cDF = cDF[cDF['language'] == 'en']
cDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4851 entries, 69147 to 35172
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   4851 non-null   object
 1   title      4851 non-null   object
 2   authors    4851 non-null   object
 3   abstract   4851 non-null   object
 4   body_text  4851 non-null   object
 5   language   4851 non-null   object
dtypes: object(6)
memory usage: 265.3+ KB




---
# Preparing data for Doc2Vec:


In [None]:
#@markdown The full text is decapitalised.
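def DecapitaliseDoc(docs):
    """Performs text decapitalisation on the full text."""
    # Assign the whole column at once: writing through docs.iloc[i][...] is
    # chained assignment and may only modify a temporary copy of the row.
    docs['body_text'] = docs['body_text'].str.lower()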

DecapitaliseDoc(cDF)
DecapitaliseDoc(cQD)

eQD = eQD.lower()

In [19]:
#@markdown Followed by the tokenisation of the full text. <br /> In this case word tokenisation is used to separate the words into a list.
nltk.download('punkt')

def TokenizeDoc(docs):
    """Performs word tokenization on the full text."""
    docs['body_text'] = docs['body_text'].apply(nltk.word_tokenize)

TokenizeDoc(cDF)
TokenizeDoc(cQD)

eQD = nltk.word_tokenize(eQD)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [20]:
#@markdown Stop words and any symbol characters are removed from the tokenised text.
nltk.download('stopwords')
from nltk.corpus import stopwords
sw = stopwords.words('english')
symbols = ["!",'"',"#","%","&","'","(",")","*","+",",","-",".","/",":",";","<","=",">","?","@","[","\\","]","^","_","`","{","|","}","~","–"]
swsym = sw + symbols

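def RemoveSymSW(docs, symsw):
    """Removes symbols and stopwords from tokenized full text."""
    # A set gives constant-time membership tests; the column is assigned in one
    # step instead of through chained indexing on individual rows.
    symsw = set(symsw)
    docs['body_text'] = docs['body_text'].apply(lambda words: [word for word in words if word not in symsw])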

RemoveSymSW(cDF, swsym)
RemoveSymSW(cQD, swsym)

eQD = [word for word in eQD if word not in swsym]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
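
The filtering step can be illustrated on a single token list. This is a sketch with assumptions: `STOP_WORDS` is a small hypothetical subset of a stop-word list, and `string.punctuation` is used for the symbols, which also contains "$" (absent from the list above). The `remove_stop_tokens` helper is hypothetical.

```python
import string

STOP_WORDS = {"the", "of", "and", "in", "a"}   # hypothetical subset
SYMBOLS = set(string.punctuation) | {"–"}      # single-character tokens to drop

def remove_stop_tokens(tokens, blocked):
    """Keep only the tokens that are not in the blocked set."""
    return [tok for tok in tokens if tok not in blocked]

tokens = ["the", "spread", "of", "sars-cov-2", ",", "in", "2020", "."]
print(remove_stop_tokens(tokens, STOP_WORDS | SYMBOLS))
# ['spread', 'sars-cov-2', '2020']
```

Only whole tokens are compared, so a hyphen inside "sars-cov-2" survives while a standalone "-" token would be removed.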


In [21]:
del sw
del symbols
del swsym
cDF['body_text'].head()

69147    [editor, saudi, arabia, reported, highest, num...
78698    [reducing, droplet, spread, airway, manipulati...
16316    [aminopeptidases, heterogeneous, group, enzyme...
47627    [tients, include, exchange, transfusion, et, 1...
14269    [although, equine, viral, arteritis, known, al...
Name: body_text, dtype: object

In [22]:
cDF.head()

Unnamed: 0,paper_id,title,authors,abstract,body_text,language
69147,9241e0b97dce927be4592622e554fe81af5bfb2f,163773 Lack of MERS Coronavirus but Prevale...,"163773 Gautret, Philippe; Charrel, Rémi; Be...",,"[editor, saudi, arabia, reported, highest, num...",en
78698,a08eaa8b4c47649f772cc42a869995b4884c07af,210509 Reducing droplet spread during airwa...,"210509 Au Yong, Phui S.; Chen, Xuanxuan Nam...",,"[reducing, droplet, spread, airway, manipulati...",en
16316,c7f27f99be34d2c419e37fb6955eda022b0db893,178736 Phosphonic Acid Analogs of Fluorophe...,"178736 Wanat, Weronika; Talma, Michał; Dziu...",A library of phosphonic acid analogs of phenyl...,"[aminopeptidases, heterogeneous, group, enzyme...",en
47627,b1b3f0b30a6d0a58b198bc7d1f0f055c2d4645ca,6659 Development of reliable artificial liv...,"6659 Yoshiba, Makoto; Sekiyama, Kazuhiko; I...",A new artificial liver support system (ALSS) c...,"[tients, include, exchange, transfusion, et, 1...",en
14269,ea763d76c18c169dbea960bd078460b0e60445fb,222799 Equine arteritis virus: An overview ...,"222799 Chirnside, Ewan D. Name: authors, dt...",,"[although, equine, viral, arteritis, known, al...",en


In [23]:
#@markdown The tagged documents are generated by passing the processed text through, alongside a unique index tag to later identify vectors.
cTD = []
for i in range(0, len(cDF)):
  cTD.append(TaggedDocument(words=cDF.iloc[i]['body_text'], tags=[i]))
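
Because each tag is the document's position in the sample and not its dataframe index label, a tag returned by the model can be used directly as a positional lookup. The sketch below shows that mapping with a hypothetical `Tagged` tuple standing in for the tagged document object, and made-up ids and tokens.

```python
from collections import namedtuple

# Hypothetical stand-in with a 'words' field and a 'tags' field.
Tagged = namedtuple("Tagged", ["words", "tags"])

paper_ids = ["id_a", "id_b", "id_c"]             # hypothetical paper ids
token_lists = [["virus"], ["vaccine"], ["mask"]]  # hypothetical processed texts

tagged = [Tagged(words=w, tags=[i]) for i, w in enumerate(token_lists)]

# A returned tag is a position in the list, so it maps straight back to the paper.
returned_tag = tagged[2].tags[0]
print(paper_ids[returned_tag])
```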

---
# Topic Modelling with Latent Dirichlet Allocation (LDA):
##### (For Experiment Result Analysis)

In [24]:
#@markdown Dictionary and Corpus objects created for training LDA.
cordDictionary = corpora.Dictionary(cDF['body_text'])
cordCorpus = [cordDictionary.doc2bow(txt) for txt in cDF['body_text']]
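
What the dictionary and corpus contain can be sketched in plain Python. This is an illustration of the bag-of-words idea only: `build_dictionary` and `doc2bow` are hypothetical helpers, and the ids they assign need not match the ones gensim would assign.

```python
from collections import Counter

def build_dictionary(docs):
    """Map every distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the dictionary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["virus", "spread", "virus"], ["vaccine", "virus"]]  # hypothetical texts
token2id = build_dictionary(docs)
corpus = [doc2bow(doc, token2id) for doc in docs]
print(corpus)
```

Tokens that never appeared when the dictionary was built are silently dropped, which is worth remembering when unseen query documents are passed through the topic model.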

In [25]:
#@markdown LDA is trained.
#cordLdamodel = ldamodel.LdaModel(cordCorpus, num_topics = 10, id2word = cordDictionary, passes = 15)
#cordTopics = cordLdamodel.print_topics(num_words = 5)
#for topic in cordTopics:
#  print(topic)

In [26]:
#cordLdamodel.save(root + "cord19_ldamodel.gensim")

---
# Training Doc2Vec model:


In [27]:
#@markdown Doc2Vec trained for 10 epochs across the sample set. <br /> Current standard arguments: <br /> Distributed Memory / Distributed Bag of Words = 1 (DM) <br /> Vector Size: 100 <br /> Alpha: 0.025 <br /> Min Alpha Value: 0.00025 <br /> Min Count: 4 <br /> Workers (cores): 2 <br /> Seed: 0
n_epochs = 10

model = Doc2Vec(dm=1,
                vector_size=100,
                alpha=0.025, 
                min_alpha=0.00025,
                min_count=4,
                workers=2,
                seed=0)
  
model.build_vocab(cTD)

print("Training...")
tStart = time.time()
for epoch in range(n_epochs):
    print('Epoch: {0}'.format(epoch + 1))
    model.train(cTD, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

tStop = time.time()
model.save(root + "cord19_doc2vec_" + str(sampleSize) + ".model")
print("Training complete, model saved to 'root' dir.")
print("\nTime taken: " + str(tStop - tStart))

Training...
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Epoch: 10


Training complete, model saved to 'root' dir.

Time taken: 250.94368362426758
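
The learning rates produced by the manual epoch loop can be worked out by hand. The sketch below is pure arithmetic and rests on an assumption about the library, not a confirmed trace: that each `train()` call decays the rate linearly from `model.alpha` to `model.min_alpha`. Under that assumption the first epoch runs from 0.025 down to 0.00025, and every later epoch uses a constant rate. `single_call_schedule` shows, for comparison, an even linear decay across all epochs; both function names are hypothetical.

```python
def manual_loop_schedule(alpha=0.025, min_alpha=0.00025, step=0.0002, n_epochs=10):
    """(start, end) learning rate per epoch for the manual loop above."""
    schedule = []
    for _ in range(n_epochs):
        schedule.append((alpha, min_alpha))
        alpha -= step          # mirrors: model.alpha -= 0.0002
        min_alpha = alpha      # mirrors: model.min_alpha = model.alpha
    return schedule

def single_call_schedule(alpha=0.025, min_alpha=0.00025, n_epochs=10):
    """(start, end) per epoch if the decay were spread evenly over all epochs."""
    span = alpha - min_alpha
    return [(alpha - span * e / n_epochs, alpha - span * (e + 1) / n_epochs)
            for e in range(n_epochs)]

manual = manual_loop_schedule()
single = single_call_schedule()
for epoch, (start, end) in enumerate(manual, start=1):
    print(f"epoch {epoch}: {start:.5f} -> {end:.5f}")
```

If the assumption holds, the manual loop ends at a rate of about 0.0232, far above the configured minimum of 0.00025.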


---
# Testing the model:


In [28]:
#@markdown The saved trained model is loaded from the root directory. <br /> The model is used to generate a vector for the query document. <br /> The generated vector is printed (100 dimensions/weights, matching the vector size used in training).
queryDoc = 0

model = Doc2Vec.load(root + "cord19_doc2vec_" + str(sampleSize) + ".model")

examplequery = cQD.iloc[queryDoc]['body_text']

exampleresult = model.infer_vector(examplequery)
print(exampleresult)


[-8.5137147e-01  4.6744540e-02  1.0375755e+00 -3.2759149e-02
 -5.9045553e-01  7.6166064e-01  1.3554773e-01 -1.3657941e+00
  8.4505540e-01  1.1720208e-02 -9.5752925e-02  5.3222561e-01
 -2.6433158e-01 -3.5895300e-01 -1.0904936e+00  1.2114434e+00
 -1.8200487e-01 -1.4309647e+00 -1.0479684e+00 -7.8429538e-01
 -2.4806674e-01 -3.0446100e-01  4.5302689e-01  5.8414572e-01
 -3.9317906e-01 -6.8678623e-01  3.7504968e-01  8.6531943e-01
  2.0536977e-01  1.1583893e-01 -9.4963139e-01 -1.2580934e-01
 -1.3036656e-01  1.8386500e-01  6.1950070e-01 -3.1189901e-01
 -5.2053696e-01  1.4115907e+00  8.7202388e-01 -6.8742549e-04
 -4.0260985e-01  1.4399241e+00  6.3211513e-01  4.2968237e-01
  8.6646581e-01 -2.3050738e-02 -1.2346877e+00 -8.7277925e-01
  8.9015895e-01 -7.7062225e-01 -6.9486880e-01 -1.1893155e-02
  2.4307296e-02  3.2460389e-01 -2.7876809e-01  4.7375473e-01
 -7.9824907e-01 -6.0878450e-01 -9.9215400e-01  1.4510233e+00
  6.6173816e-01  8.2003959e-03 -1.6593148e-01 -3.3723527e-01
  6.8867379e-01 -2.14447

In [29]:
#@markdown The 'most_similar' function is called to return the 10 document vectors closest to the query vector, each paired with its cosine similarity to the query.
simmatch = model.docvecs.most_similar([exampleresult])
resultindex = []
for i in range(0, 10):
  resultindex.append(simmatch[i][0])

simmatch


[(3662, 0.6128749847412109),
 (791, 0.6097314357757568),
 (4542, 0.5580143928527832),
 (4807, 0.5553766489028931),
 (3610, 0.549929678440094),
 (1488, 0.5438060760498047),
 (1171, 0.5382349491119385),
 (3874, 0.5380638837814331),
 (2001, 0.5368372201919556),
 (504, 0.5353505611419678)]
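
The ranking idea behind these scores can be sketched with the standard library. This is not the library's implementation: `cosine` and `most_similar` are hypothetical helpers, and the two-dimensional vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query, doc_vectors, topn=10):
    """Return (tag, score) pairs for the topn vectors closest to the query."""
    scored = [(tag, cosine(query, vec)) for tag, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

doc_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical
ranked = most_similar([1.0, 0.1], doc_vectors, topn=2)
print(ranked)
```

Cosine similarity depends only on the angle between vectors, so a long document and a short one pointing in the same direction score the same.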

In [30]:
#@markdown Using the index of the document vectors, information can be pulled from the dataset about the documents that the program has returned. <br /> Most of the returned papers appear to be about mental health and the psychological and social effects of the COVID-19 pandemic.
for result in resultindex:
  print("   Title:" + cDF.iloc[result]['title'].to_string(index=0))
  print("Paper ID: " + cDF.iloc[result]['paper_id'] + "\n")

   Title: Symptomatology, assessment, and treatment of a...
Paper ID: 102f1245baa5cd32901a08b71bb4ec95038cf0a6

   Title: Physical Exercise Potentials Against Viral Dis...
Paper ID: f7f5cedd2f6a1cd613a0ffb6cf41d1f0513df21f

   Title: Going digital: how technology use may influenc...
Paper ID: eacea071c23a29d707e9ee01231f17deec30967c

   Title: Health, Psychosocial, and Social issues emanat...
Paper ID: d5e421c92b7aa0d1c9d46ed25d4467345bb5e38a

   Title: Risk factors for psychological distress during...
Paper ID: 3763e2be3ed374fce2c7beac959f76d030b11c93

   Title: COVID-19: the perfect vector for a mental heal...
Paper ID: 50f5808a80117884e88353c31f2e78666faa1389

   Title: Immune Response in the Brain: Glial Response a...
Paper ID: 1fbebd0c74d5e15a4fea4d206bbc4816e6b00bd4

   Title: Letter to editor: CoVID-19 pandemic and sleep ...
Paper ID: ff9c8bbe9e5ff7a4b2155ba1d477c3712e4edb04

   Title: Mental health in the UK during the COVID-19 pa...
Paper ID: 69362cbc24f42d095d571cd3bff3368ed7

---
# Linking matched papers to their DOI:

In [31]:
metaDF = pd.read_csv(root + 'metadata.csv', dtype=str)

In [32]:
#@markdown Matched papers are relocated using the metadata file to display links and titles.
mdContent = metaDF.loc[metaDF['sha'] == cQD.iloc[queryDoc]['paper_id']]
print("Query Document")
print("  DOI: https://www.doi.org/" + mdContent['doi'].to_string(index=0).strip())
print("Title:" + mdContent['title'].to_string(index=0))
print("----------\n\n")

for result in resultindex:
  mdContent = metaDF.loc[metaDF['sha'] == cDF.iloc[result]['paper_id']]
  print("  DOI: https://www.doi.org/" + mdContent['doi'].to_string(index=0).strip())
  print("Title:" + mdContent['title'].to_string(index=0))
  print("\n")

Query Document
  DOI: https://www.doi.org/10.1007/978-3-030-10620-1_10
Title: The Physical Burden of Immunoperception
----------
  DOI: https://www.doi.org/10.1016/j.jgo.2020.06.011
Title: Symptomatology, assessment, and treatment of a...
  DOI: https://www.doi.org/10.3389/fmed.2020.00379
Title: Physical Exercise Potentials Against Viral Dis...
  DOI: https://www.doi.org/10.31887/dcns.2020.22.2/mhoehe
Title: Going digital: how technology use may influenc...
  DOI: https://www.doi.org/NaN
Title: Health, Psychosocial, and Social issues emanat...
  DOI: https://www.doi.org/10.1111/bjhp.12455
Title: Risk factors for psychological distress during...
  DOI: https://www.doi.org/10.1192/bjb.2020.60
Title: COVID-19: the perfect vector for a mental heal...
  DOI: https://www.doi.org/10.1016/s1567-7443(07)10014-4
Title: Immune Response in the Brain: Glial Response a...
  DOI: https://www.doi.org/10.1007/s10072-020-04523-1
Title: Letter to editor: CoVID-19 pandemic and sleep ...
  DOI: https://www
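
One of the links above ends in `NaN` because that paper has no DOI in the metadata. A small guard could skip such entries; the sketch below is one hypothetical way to do it (`doi_link` is not part of the notebook) and treats `None`, empty strings and NaN values as missing.

```python
def doi_link(doi):
    """Return a resolvable DOI link, or None when no DOI is recorded."""
    if doi is None:
        return None
    text = str(doi).strip()
    # A missing value read from a CSV typically shows up as NaN.
    if not text or text.lower() == "nan":
        return None
    return "https://www.doi.org/" + text

print(doi_link("10.1111/bjhp.12455"))
print(doi_link(float("nan")))
```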

---
# Topic Modelling the Matched Papers:

In [33]:
#@markdown LDA model is loaded. <br /> Matched papers are passed through the topic model to estimate the topic of each paper.
cordLdamodel = ldamodel.LdaModel.load(root + "cord19_ldamodel.gensim")

def GetTopics(docText):
    docTextBow = cordDictionary.doc2bow(docText)
    print(cordLdamodel.get_document_topics(docTextBow))
    print("\n")


print("Title:" + cQD.iloc[queryDoc]['title'].to_string(index=0))
GetTopics(cQD.iloc[queryDoc]['body_text'])

for result in resultindex:
  print("Title:" + cDF.iloc[result]['title'].to_string(index=0))
  resultDoc = cDF.iloc[result]['body_text']
  GetTopics(resultDoc)


Title: The Physical Burden of Immunoperception
[(0, 0.22756946), (1, 0.088375784), (2, 0.3677839), (7, 0.2497352), (8, 0.04677461)]


Title: Symptomatology, assessment, and treatment of a...
[(0, 0.5184808), (7, 0.44310802), (8, 0.03230556)]


Title: Physical Exercise Potentials Against Viral Dis...
[(0, 0.2535256), (1, 0.06406279), (2, 0.37213448), (4, 0.030791046), (7, 0.27909902)]


Title: Going digital: how technology use may influenc...
[(0, 0.6989858), (2, 0.048365157), (9, 0.23322244)]


Title: Health, Psychosocial, and Social issues emanat...
[(0, 0.8562752), (4, 0.12103212), (9, 0.020104663)]


Title: Risk factors for psychological distress during...
[(0, 0.80919665), (3, 0.019970015), (7, 0.16764393)]


Title: COVID-19: the perfect vector for a mental heal...
[(0, 0.83087534), (7, 0.12123847), (9, 0.04756403)]


Title: Immune Response in the Brain: Glial Response a...
[(2, 0.99263716)]


Title: Letter to editor: CoVID-19 pandemic and sleep ...
[(0, 0.784495), (7, 0.21396624)]
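
The topic lists above can be compared numerically as well as by eye. The sketch below uses the query's distribution and the first match's distribution exactly as printed; `dominant_topic` and `topic_overlap` are hypothetical helpers, and the overlap measure (sum of the smaller weight on each shared topic) is an illustrative choice, not a metric used elsewhere in this notebook.

```python
def dominant_topic(topics):
    """Return the id of the topic with the largest weight."""
    return max(topics, key=lambda pair: pair[1])[0]

def topic_overlap(a, b):
    """Sum, over shared topics, of the smaller of the two weights."""
    a, b = dict(a), dict(b)
    return sum(min(weight, b[topic]) for topic, weight in a.items() if topic in b)

# Distributions copied from the printed output above.
query = [(0, 0.22756946), (1, 0.088375784), (2, 0.3677839), (7, 0.2497352), (8, 0.04677461)]
first_match = [(0, 0.5184808), (7, 0.44310802), (8, 0.03230556)]

print(dominant_topic(query), dominant_topic(first_match))
print(round(topic_overlap(query, first_match), 4))
```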

---
# LDA Visualisation:

In [34]:
#@markdown LDA Visualisation library for viewing topic distances and word frequency metrics.
#!pip install pyLDAvis

In [35]:
#import pyLDAvis.gensim
#cordLdavis = pyLDAvis.gensim.prepare(cordLdamodel, cordCorpus, cordDictionary, sort_topics=False)
#pyLDAvis.display(cordLdavis)

---
# External QD Test:

In [39]:
extvec = model.infer_vector(eQD)
extmatch = model.docvecs.most_similar([extvec])

extmatch


[(4481, 0.8333190679550171),
 (1325, 0.784773588180542),
 (2935, 0.7823768854141235),
 (2368, 0.76727294921875),
 (2032, 0.7645853161811829),
 (1784, 0.7562088370323181),
 (4522, 0.7410877346992493),
 (1382, 0.7398072481155396),
 (1965, 0.7388125658035278),
 (2454, 0.7346237897872925)]

In [40]:
extindex = []
for i in range(0, 10):
  extindex.append(extmatch[i][0])

In [42]:
GetTopics(eQD)

[(0, 0.17933074), (1, 0.11614219), (5, 0.05417085), (7, 0.120852746), (9, 0.5258793)]




In [41]:
for result in extindex:
  mdContent = metaDF.loc[metaDF['sha'] == cDF.iloc[result]['paper_id']]
  print("  DOI: https://www.doi.org/" + mdContent['doi'].to_string(index=0).strip())
  print("Title:" + mdContent['title'].to_string(index=0))
  resultDoc = cDF.iloc[result]['body_text']
  GetTopics(resultDoc)

  DOI: https://www.doi.org/10.1016/0166-3542(95)90006-3
Title: Related elsevier virology titles contents alert
[(2, 0.22099984), (3, 0.4284627), (5, 0.051053412), (6, 0.28233925)]


  DOI: https://www.doi.org/10.1148/radiol.2020200236
Title: CT Imaging of the 2019 Novel Coronavirus (2019...
[(1, 0.18287195), (7, 0.80896294)]


  DOI: https://www.doi.org/10.1016/j.isci.2020.101270
Title: Influenza virus-induced oxidized DNA activates...
[(2, 0.12112391), (6, 0.8538716)]


  DOI: https://www.doi.org/10.1016/j.clim.2020.108542
Title: Intestinal microbiome transfer, a novel therap...
[(0, 0.39467093), (1, 0.11409597), (4, 0.0950611), (7, 0.13701819), (9, 0.25476682)]


  DOI: https://www.doi.org/10.1002/jmv.25888
Title: Concomitant neurological symptoms observed in ...
[(1, 0.48138458), (7, 0.5118346)]


  DOI: https://www.doi.org/10.1016/j.tmaid.2020.101790
Title: Temperature and the difference in impact of SA...
[(0, 0.8873587), (5, 0.050821785), (7, 0.029127926), (9, 0.01602189)]


  DO