**Steps To Download Dataset From Kaggle**

In [1]:
! pip install -q kaggle

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"debuggedcoder","key":"<redacted>"}'}

In [3]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
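
The three shell commands above can also be expressed in Python, which makes the step safe to re-run. The following is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` and the placeholder token are assumptions made for illustration, and the demonstration writes to a temporary directory rather than the real `~/.kaggle`.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, config_dir):
    """Copy a Kaggle API token into config_dir and restrict it to the owner."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`: no error on re-run
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    target.chmod(0o600)  # same permissions as `chmod 600`
    return target


# Demonstration in a throwaway directory with a placeholder token (not a real key).
workdir = Path(tempfile.mkdtemp())
fake_token = workdir / "kaggle.json"
fake_token.write_text('{"username": "example-user", "key": "example-key"}')
installed = install_kaggle_token(fake_token, workdir / ".kaggle")
print(installed.name, oct(stat.S_IMODE(installed.stat().st_mode)))
```

In Colab the call would presumably be `install_kaggle_token("kaggle.json", Path.home() / ".kaggle")`.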

In [4]:
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

Downloading CORD-19-research-challenge.zip to /content
100% 18.4G/18.4G [01:52<00:00, 237MB/s]
100% 18.4G/18.4G [01:52<00:00, 175MB/s]


In [None]:
! unzip -q CORD-19-research-challenge.zip -d /content/CORD-19-research-challenge

### Importing Necessary Libraries For Preprocessing

In [9]:
import numpy as np
import pandas as pd 
import glob
import json
import math

In [10]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Loading Metadata

In [25]:
meta_df = pd.read_csv('/content/CORD-19-research-challenge/metadata.csv')
meta_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


#### Get All Papers With COVID-19-Related Content Using the Following Keys: "covid", "coronavirus", "cov", "coronaviruses", "SARS-CoV-2"

In [26]:
# Keep papers whose abstract contains any keyword (matching is case-sensitive).
keywords = ['covid', 'coronavirus', 'cov', 'coronaviruses', 'SARS-CoV-2']
# na=False treats a missing abstract as "no match" instead of propagating NaN into the mask.
mask = meta_df.abstract.str.contains('|'.join(keywords), na=False)
meta_df = meta_df[mask].drop_duplicates(subset='sha')
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122235 entries, 0 to 1056657
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cord_uid          122235 non-null  object 
 1   sha               122234 non-null  object 
 2   source_x          122235 non-null  object 
 3   title             122235 non-null  object 
 4   doi               119316 non-null  object 
 5   pmcid             105894 non-null  object 
 6   pubmed_id         100356 non-null  object 
 7   license           122235 non-null  object 
 8   abstract          122235 non-null  object 
 9   publish_time      122235 non-null  object 
 10  authors           121948 non-null  object 
 11  journal           110113 non-null  object 
 12  mag_id            0 non-null       float64
 13  who_covidence_id  0 non-null       object 
 14  arxiv_id          3196 non-null    object 
 15  pdf_json_files    122234 non-null  object 
 16  pmc_json_files    9
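
The keyword filter above is case-sensitive, so an abstract that only says "COVID-19" in upper case is not matched by the lowercase keys. The sketch below illustrates this on a small invented DataFrame (the rows are hypothetical, not taken from CORD-19); it combines the keys into a single regular expression, which should match the same rows as chaining the individual `str.contains` calls, and uses `na=False` so that missing abstracts count as non-matches.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "sha": ["a1", "a1", "b2", "c3", "d4"],
    "abstract": [
        "A study of coronavirus spread",
        "A study of coronavirus spread",   # duplicate sha
        "COVID-19 outcomes in adults",     # upper case only
        None,                              # missing abstract
        "SARS-CoV-2 genome analysis",
    ],
})

keywords = ["covid", "coronavirus", "cov", "coronaviruses", "SARS-CoV-2"]
pattern = "|".join(keywords)

# Case-sensitive match, as in the notebook cell.
sensitive = toy[toy.abstract.str.contains(pattern, na=False)]
sensitive = sensitive.drop_duplicates(subset="sha")

# Case-insensitive variant, shown only for comparison.
insensitive = toy[toy.abstract.str.contains(pattern, case=False, na=False)]
insensitive = insensitive.drop_duplicates(subset="sha")

print(len(sensitive), len(insensitive))
```

Whether the case-insensitive variant is preferable depends on how broad the selection should be; it would change the row counts reported in this notebook.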

### Loading All JSON Data

In [27]:
all_json = meta_df.pdf_json_files.tolist()
all_json = ['/content/CORD-19-research-challenge/' + str(x) for x in all_json]

In [28]:
len(all_json)

122235
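
Because `str(x)` turns a missing `pdf_json_files` value into the string `'nan'`, the list can contain paths that do not exist on disk. A quick way to see how many of the constructed paths are usable is to partition them before parsing. This is a minimal sketch; the helper name `split_existing` and the demonstration paths are assumptions, not part of the original notebook.

```python
import os
import tempfile


def split_existing(paths):
    """Partition paths into those present on disk and those missing."""
    present = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return present, missing


# Demonstration with one real temporary file and two placeholder paths.
root = tempfile.mkdtemp()
real_file = os.path.join(root, "paper.json")
with open(real_file, "w") as handle:
    handle.write("{}")

candidates = [real_file, os.path.join(root, "nan"), os.path.join(root, "absent.json")]
present, missing = split_existing(candidates)
print(f"{len(present)} found, {len(missing)} missing")
```

In the notebook this would be called as `split_existing(all_json)`.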

In [29]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'


In [30]:
first_row = FileReader(all_json[0])
print(first_row)

d1aafb70c066a2068b02786f8929fd9c900897fb: Objective: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, J... Mycoplasma pneumoniae is a common cause of upper and lower respiratory tract infections. It remains one of the most frequent causes of atypical pneumonia particu-larly among young adults. [1, 2, 3, 4,...
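
To check the parsing logic without the full dataset, the same steps can be run against a tiny hand-written file. The sketch below is an assumption-laden illustration: `read_paper` is a hypothetical function that mirrors what `FileReader` does, the sample document contains only the three keys that `FileReader` accesses, and `.get(..., [])` is added so that a file without an `abstract` key yields an empty string instead of raising.

```python
import json
import os
import tempfile


def read_paper(file_path):
    """Return (paper_id, abstract, body_text) from one parsed-paper JSON file."""
    with open(file_path, encoding="utf-8") as file:
        content = json.load(file)
    # Join the paragraph texts with newlines, as FileReader does.
    abstract = "\n".join(entry["text"] for entry in content.get("abstract", []))
    body_text = "\n".join(entry["text"] for entry in content.get("body_text", []))
    return content["paper_id"], abstract, body_text


# Hypothetical miniature document shaped like the fields FileReader accesses.
sample = {
    "paper_id": "example-id",
    "abstract": [{"text": "First abstract paragraph."}, {"text": "Second."}],
    "body_text": [{"text": "Introduction."}, {"text": "Methods."}],
}
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample, tmp)

paper_id, abstract, body_text = read_paper(tmp.name)
os.remove(tmp.name)
print(paper_id, repr(abstract), repr(body_text))
```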


In [31]:

dict_ = {'paper_id': [], 'abstract': [], 'body_text': []}
for idx, entry in enumerate(all_json):
    try:
        content = FileReader(entry)
        if (idx % 1000 == 0):
            print(f"processing {idx} of {len(all_json)} ")
        dict_['paper_id'].append(content.paper_id)
        dict_['abstract'].append(content.abstract)
        dict_['body_text'].append(content.body_text)
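    except (OSError, ValueError, KeyError):
        # Skip papers whose parse file is missing, is not valid JSON, or lacks an expected key.
        continue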
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text'])
df_covid.head()

processing 0 of 122235 
processing 1000 of 122235 
processing 2000 of 122235 
processing 3000 of 122235 
processing 4000 of 122235 
processing 5000 of 122235 
processing 6000 of 122235 
processing 7000 of 122235 
processing 8000 of 122235 
processing 9000 of 122235 
processing 10000 of 122235 
processing 11000 of 122235 
processing 12000 of 122235 
processing 13000 of 122235 
processing 14000 of 122235 
processing 15000 of 122235 
processing 16000 of 122235 
processing 17000 of 122235 
processing 18000 of 122235 
processing 19000 of 122235 
processing 20000 of 122235 
processing 21000 of 122235 
processing 22000 of 122235 
processing 23000 of 122235 
processing 24000 of 122235 
processing 25000 of 122235 
processing 26000 of 122235 
processing 27000 of 122235 
processing 28000 of 122235 
processing 29000 of 122235 
processing 30000 of 122235 
processing 31000 of 122235 
processing 32000 of 122235 
processing 33000 of 122235 
processing 34000 of 122235 
processing 35000 of 122235 
proce

Unnamed: 0,paper_id,abstract,body_text
0,d1aafb70c066a2068b02786f8929fd9c900897fb,Objective: This retrospective chart review des...,Mycoplasma pneumoniae is a common cause of upp...
1,03203ab50eb64271a9e825f94a1b1a6c46ea14b3,Viral recombination can dramatically impact ev...,As increasing numbers of full-length viral seq...
2,d450fc8885843d48772df9a898552302f8c80b98,"Sequencing pathogen genomes is costly, demandi...",Draft sequencing requires that the order of ba...
3,4ba79e54ecf81b30b56461a6aec2094eaf7b7f06,Background and methods: Human metapneumovirus ...,Respiratory viruses play an important role in ...
4,ccc36b04ad5c71de61967624f7f739e868d7c0a5,Monoclonal antibodies that strongly neutralize...,Development of a humanized monoclonal antibody...


In [32]:
final_df = pd.merge(meta_df, df_covid, how = 'inner', left_on = 'sha', right_on = 'paper_id')
final_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract_x,publish_time,...,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,paper_id,abstract_y,body_text
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,...,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,d1aafb70c066a2068b02786f8929fd9c900897fb,Objective: This retrospective chart review des...,Mycoplasma pneumoniae is a common cause of upp...
1,zowp10ts,03203ab50eb64271a9e825f94a1b1a6c46ea14b3,PMC,Recombination Every Day: Abundant Recombinatio...,10.1371/journal.pbio.0030089,PMC1054884,15737066,cc-by,Viral recombination can dramatically impact ev...,2005-03-01,...,,,,document_parses/pdf_json/03203ab50eb64271a9e82...,document_parses/pmc_json/PMC1054884.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,,03203ab50eb64271a9e825f94a1b1a6c46ea14b3,Viral recombination can dramatically impact ev...,As increasing numbers of full-length viral seq...
2,t40ybhgb,d450fc8885843d48772df9a898552302f8c80b98,PMC,Draft versus finished sequence data for DNA an...,10.1093/nar/gki896,PMC1266063,16243783,no-cc,"Sequencing pathogen genomes is costly, demandi...",2005-10-20,...,,,,document_parses/pdf_json/d450fc8885843d48772df...,document_parses/pmc_json/PMC1266063.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,,d450fc8885843d48772df9a898552302f8c80b98,"Sequencing pathogen genomes is costly, demandi...",Draft sequencing requires that the order of ba...
3,qva0jt86,4ba79e54ecf81b30b56461a6aec2094eaf7b7f06,PMC,Relevance of human metapneumovirus in exacerba...,10.1186/1465-9921-6-150,PMC1334186,16371156,cc-by,BACKGROUND AND METHODS: Human metapneumovirus ...,2005-12-21,...,,,,document_parses/pdf_json/4ba79e54ecf81b30b5646...,document_parses/pmc_json/PMC1334186.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,,4ba79e54ecf81b30b56461a6aec2094eaf7b7f06,Background and methods: Human metapneumovirus ...,Respiratory viruses play an important role in ...
4,oluq7v0h,ccc36b04ad5c71de61967624f7f739e868d7c0a5,PMC,Development of a humanized monoclonal antibody...,10.1038/nm1240,PMC1458527,15852016,no-cc,Neutralization of West Nile virus (WNV) in viv...,2005-04-24,...,,,,document_parses/pdf_json/ccc36b04ad5c71de61967...,document_parses/pmc_json/PMC1458527.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,,ccc36b04ad5c71de61967624f7f739e868d7c0a5,Monoclonal antibodies that strongly neutralize...,Development of a humanized monoclonal antibody...
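
Both frames carry a column named `abstract`, which is why the merged result shows `abstract_x` (from the metadata) and `abstract_y` (from the parsed JSON). The sketch below reproduces that behavior on two small invented frames; the values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Invented rows for illustration only.
meta = pd.DataFrame({
    "sha": ["a1", "b2", "c3"],
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["META ABSTRACT A", "META ABSTRACT B", "META ABSTRACT C"],
})
parsed = pd.DataFrame({
    "paper_id": ["a1", "c3", "z9"],
    "abstract": ["Parsed abstract A", "Parsed abstract C", "Parsed abstract Z"],
    "body_text": ["Body A", "Body C", "Body Z"],
})

# The inner join keeps only rows whose sha has a matching paper_id.
merged = pd.merge(meta, parsed, how="inner", left_on="sha", right_on="paper_id")
print(list(merged.columns))
print(len(merged))
```

Passing `suffixes=("_meta", "_parsed")` to `pd.merge` would give the overlapping columns more descriptive names, at the cost of changing the column selected in the next cell.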


In [33]:
selected_columns = ['paper_id','title', 'abstract_x', 'body_text']
final_df = final_df[selected_columns]

In [34]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111539 entries, 0 to 111538
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   paper_id    111539 non-null  object
 1   title       111539 non-null  object
 2   abstract_x  111539 non-null  object
 3   body_text   111539 non-null  object
dtypes: object(4)
memory usage: 4.3+ MB


In [35]:
final_df.sample(1).abstract_x.values

array(['BACKGROUND: COVID-19 has created havoc in healthcare systems worldwide, including shortages in equipment and supplies for dialysis in the acute setting. METHODS: We compared our planning and experience at a tertiary care academic medical center to recommendations in the literature. RESULTS: Published literature and our experience underscored the need to plan for adequate dialysis equipment, particularly for continuous renal replacement therapy in the ICU setting, adequate nursing, and flexible scheduling of chronic patients to accommodate the surge in acute patients. We discovered other “shortages” not mentioned in the literature: shortages in the number of portable reverse osmosis (RO) machines needed to prepare dialysis water, inadequate number of rooms in units designated for COVID-19 patients with plumbing for dialysis, and lack of temperature blending valves on sinks that necessitated using cold water only, and damaging the RO membranes. We identified the need for cooperat

In [54]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
final_df.to_csv('/content/drive/MyDrive/Anas New Dataset/selected_122235_data.csv', index=True)
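
Writing with `index=True` stores the row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back without further options. The sketch below shows the difference on an invented two-row frame, using in-memory buffers instead of Google Drive.

```python
import io

import pandas as pd

# Invented rows for illustration only.
frame = pd.DataFrame({"paper_id": ["a1", "b2"], "title": ["Paper A", "Paper B"]})

with_index = io.StringIO()
frame.to_csv(with_index, index=True)
with_index.seek(0)

without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

If the index is kept in the file, reading it with `pd.read_csv(path, index_col=0)` restores it as the index rather than as an extra column.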
