# Data Exploration of a sample of 1000 articles from CORD-19 dataset on Kaggle

## 1) Import libraries and load data

In [3]:
# Import necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
import matplotlib.pyplot as plt
import seaborn as sns

# Read Excel data file with 1000 articles
df = pd.read_excel('/home/caba/code/Glonnet/Sci_papers/raw_data/papers.xlsx')

# Display the first 5 rows of the dataframe
df.head()

Unnamed: 0,paper_id,title,abstract,full-text
0,9f05493734433385ff6082f0d4368b93a9a6b3df,Journey of cystatins from being mere thiol pro...,Abstract\n\nCystatins are thiol proteinase inh...,Proteases\n\nProteases are enzymes that irreve...
1,76b6934f3ead3c415d98b18154c3223178bfbc4f,Dear Editor,,\n\nAccording to the World Health Organization...
2,97c2f4235923a5366f93d98a48f88b68e1941b21,Accurate Representations of the Microphysical ...,Abstract\n\nAerosols and droplets from expirat...,I. INTRODUCTION\n\nThe transmission of respira...
3,d327e343b221e242373be5e86d02a03afe4d250b,Journal Pre-proof RECOMENDACIONES DE CONSENSO ...,Abstract\n\nLa gran afectación pulmonar produc...,INTRODUCCION\n\nLa rápida progresión de la pan...
4,088e0fabcaf75dea9bc0e42ed2a85dc0cf677e02,Myofascial Release of the Hamstrings Improves ...,"Abstract\n\nCitation: Itotani, K.; Kawahata, K...","Introduction\n\nIn 2020, COVID-19 had an impac..."


In [4]:
row_count = df.shape[0]
print(f"Number of rows in df: {row_count}")

Number of rows in df: 1000


## 2) Continue with cleaned df (remove articles with empty title, abstract or full-text)

In [None]:
# Continue only with articles that have values in title and abstract
df = df.dropna(subset=['title', 'abstract'])
df.head()

Unnamed: 0,paper_id,title,abstract,full-text
0,9f05493734433385ff6082f0d4368b93a9a6b3df,Journey of cystatins from being mere thiol pro...,Abstract\n\nCystatins are thiol proteinase inh...,Proteases\n\nProteases are enzymes that irreve...
2,97c2f4235923a5366f93d98a48f88b68e1941b21,Accurate Representations of the Microphysical ...,Abstract\n\nAerosols and droplets from expirat...,I. INTRODUCTION\n\nThe transmission of respira...
3,d327e343b221e242373be5e86d02a03afe4d250b,Journal Pre-proof RECOMENDACIONES DE CONSENSO ...,Abstract\n\nLa gran afectación pulmonar produc...,INTRODUCCION\n\nLa rápida progresión de la pan...
4,088e0fabcaf75dea9bc0e42ed2a85dc0cf677e02,Myofascial Release of the Hamstrings Improves ...,"Abstract\n\nCitation: Itotani, K.; Kawahata, K...","Introduction\n\nIn 2020, COVID-19 had an impac..."
6,48fb0f2ee0b3fd564e6988e24f443e54996ee24d,Deep Sequence Modeling for Pressure Controlled...,Abstract\n\nThis paper presents a deep neural ...,Introduction\n\nThe mechanical ventilator is a...


In [25]:
# Number of articles left in our sample dataset
row_count = df.shape[0]
print(f"Number of rows in df: {row_count}")

Number of rows in df: 656


In [7]:
# Check content of a particular article
df[df.paper_id == '97c2f4235923a5366f93d98a48f88b68e1941b21']['title'].values[0]

'Accurate Representations of the Microphysical Processes Occurring during the Transport of Exhaled Aerosols and Droplets'

## 3) Baseline to evaluate Document Retrieval with RAG

- Retrieve number of articles of a given topic e.g. "vaccine" + "gender" in abstract of articles.
- RAG should at a minimum be able to identify the number of documents that mention these sample topics ("vaccine" and "gender") either in the title or abstract of an article.  
- These articles serve as first baseline to see if RAG can roughly identify different topics and in this case retrieve successfully articles that correspond to the topics "vaccine" and "gender".
- For a query about "List research articles on covid 19 vaccine related to gender" thats also the minimum list of articles our system should retrieve.

In [None]:
# Baseline number of articles for our RAG model for three sample Covid-19 reasearch topics: "vaccine" , "gender", "vaccine + gender"
num_vaccine_papers = vaccine_papers.shape[0]
print(f"Number of vaccine papers: {num_vaccine_papers}")

num_gender_papers = gender_papers.shape[0]
print(f"Number of gender/sex papers: {num_gender_papers}")

num_vaccine_gender_papers = vaccine_gender_papers.shape[0]
print(f"Number of vaccine papers related to gender: {num_vaccine_gender_papers}")

Number of vaccine papers: 70
Number of gender/sex papers: 18
Number of vaccine papers related to gender: 4


In [None]:
# For us to "roughly" check if RAG is retrieving a reasonable number of articles on "gender"/ "sex"
# Filter the dataframe for papers where the abstract contains the word 'gender' or 'sex'
gender_papers = df[df['abstract'].str.contains('gender' or 'sex', case=False, na=False)][['paper_id', 'title', 'abstract']]

# Display the filtered dataframe
gender_papers

Unnamed: 0,paper_id,title,abstract
18,d763a5d11cd12c3a973712c65064d3ebf7feeb5f,Public's perceptions of urban identity of Thes...,Abstract\n\nUrban identity (UI) is a multi-fac...
87,c715cb4f57f8b7678370f720f5b0cbddc2e4fbd7,Public Perception of COVID-19 Vaccines on Twit...,"Abstract\n\nText Word Count: 2,971 using relev..."
165,4c35240959b6db6d7b16e5552232275d5c0ae2c4,Increasing hip fracture volume following repea...,Abstract\n\nBackground Older age groups were i...
175,922a1754f95a739dbfcafacd71c8ff35333415a3,How has COVID-19 modified training and mood in...,Abstract\n\nBackground: Coronavirus disease 20...
300,88de67dcb3bbf297a26774f1e8a7483fafa5be59,Clinical Heterogeneity in ME/CFS. A Way to Und...,Abstract\n\nThe aim of present paper is to ide...
319,78cea81841181e70452effadfaf7ebac0d8065c8,Assessing preventive health behaviors from COV...,Abstract\n\nBackground: Coronavirus disease 20...
376,19a16062447188ef64e651ec009a8fbff11ec372,COVID-19 stressors and symptoms of depression ...,Abstract\n\nPurpose To examine associations be...
464,f99ce27c805506e8d51bf55775fccc7fd0c63fb9,Gender Bias: Another Rising Curve to Flatten?,Abstract\n\nThe COVID-19 pandemic and the uphe...
483,3e53918a21883b8decbc0b8f49ecccc6ae18025f,Relative Incidence of Office Visits and Cumula...,Abstract\n\nWe performed a retrospective analy...
517,656043e6a3ff882f09cef66f0b42f59661e44fbe,Structural equation modeling test of the pre- ...,Abstract\n\nBased on the Theory of Health Acti...


In [None]:
# For us to "roughly" check if RAG is retrieving a reasonable number of articles on "vaccines"
# Filter the dataframe for papers where the abstract contains the word 'vaccine'
vaccine_papers = df[df['abstract'].str.contains('vaccine', case=False, na=False)][['paper_id', 'title', 'abstract']]

# Display the filtered dataframe
vaccine_papers

Unnamed: 0,paper_id,title,abstract
61,1c3021528dea5b342a90fa28d3a4477315602eb9,CLINICAL EXPERIMENTAL VACCINE RESEARCH K O R E...,Abstract\n\nPurpose: This study aims to compar...
70,c5db08a4925ae08d69530ee3c2bb263f8fd9fca4,Characterization of spike glycoprotein of SARS...,"Abstract\n\nSince 2002, beta coronaviruses (Co..."
79,ab8241179f1277eb3c5b90f9e1d2bc4bd974aadf,Journal Pre-proof Assessing the efficacy of in...,Abstract\n\nPlease cite this article as: Trevo...
83,ff42e013c390fdf9c34101eb13d93817dbe0371f,R E V I E W Advances in Neutralization Assays ...,Abstract\n\nThe coronavirus disease 2019 (COVI...
87,c715cb4f57f8b7678370f720f5b0cbddc2e4fbd7,Public Perception of COVID-19 Vaccines on Twit...,"Abstract\n\nText Word Count: 2,971 using relev..."
...,...,...,...
940,e283c964691d950836fd62790ff4c032b8b1a56a,molecules Evaluation of the Inhibitory Effects...,"Abstract\n\nDiNap [(E)-1-(2-hydroxy-4,6-dimeth..."
946,84ef3ad08ba26c368ee84953c55027054f049cb9,Vaccines and myasthenia gravis: a comprehensiv...,Abstract\n\nIntroduction Myasthenia gravis (MG...
947,e64cdcafab3a127a035f29739ed7f27a48352c8b,To appear in: Vaccine,Abstract\n\nThe safety and immunogenicity of a...
980,595b4da638d33d86ed49b55090e444a9aad479fd,Development of antibodies to feline IFN-g as t...,Abstract\n\nAn understanding of the nature of ...


In [9]:
vaccine_papers.shape

(70, 3)

In [27]:
# Filter the vaccine_papers for papers where the abstract contains the word 'gender'
vaccine_gender_papers = vaccine_papers[vaccine_papers['abstract'].str.contains('gender' or 'sex', case=False, na=False)]

# Display a sample set of the filtered dataframe
vaccine_gender_papers.head()

Unnamed: 0,paper_id,title,abstract
87,c715cb4f57f8b7678370f720f5b0cbddc2e4fbd7,Public Perception of COVID-19 Vaccines on Twit...,"Abstract\n\nText Word Count: 2,971 using relev..."
319,78cea81841181e70452effadfaf7ebac0d8065c8,Assessing preventive health behaviors from COV...,Abstract\n\nBackground: Coronavirus disease 20...
483,3e53918a21883b8decbc0b8f49ecccc6ae18025f,Relative Incidence of Office Visits and Cumula...,Abstract\n\nWe performed a retrospective analy...
691,5f6688f081900f0c2e24e7f96ee04821d2df3709,Spanish Version of the Attitude Towards COVID-...,Abstract\n\nThe negative attitude to vaccines ...


In [28]:
vaccine_gender_papers.shape

(4, 3)

In [29]:
vaccine_gender_papers[gender_papers.paper_id== '78cea81841181e70452effadfaf7ebac0d8065c8']['abstract'].values[0]

"Abstract\n\nBackground: Coronavirus disease 2019 (COVID-19) is a new viral disease that has caused a pandemic in the world. Due to the lack of vaccines and definitive treatment, preventive behaviors are the only way to overcome the disease. Therefore, the present study aimed to determine the preventive behaviors from the disease based on constructs of the health belief model.In the present cross-sectional study during March 11-16, 2020, 750 individuals in Golestan Province of Iran were included in the study using the convenience sampling and they completed the questionnaires through cyberspace. Factor scores were calculated using the confirmatory factor analysis. The effects of different factors were separately investigated using the univariate analyses, including students sample t-test, ANOVA, and simple linear regression. Finally, the effective factors were examined by the multiple regression analysis at a significant level of 0.05 and through Mplus 7 and SPSS 16.Results: The partic