<h1 align=center> COVID-19 Research Challenge 💉🦠</h1>

| [<img src="https://avatars0.githubusercontent.com/u/18689888" width="150px;" height="150px;"/><br /><sub><b>Amr M. Kayid</b></sub>](https://github.com/AmrMKayid)| [<img src="https://avatars2.githubusercontent.com/u/25725667" width="150px;" height="150px;"/><br /><sub><b>Omar ElSayed</b></sub>](https://github.com/OmarElSayed97/) | [<img src="https://avatars2.githubusercontent.com/u/25728207" width="150px;" height="150px;"/><br /><sub><b>Sama AlShareef</b></sub>](https://github.com/SamaAlshareef) | [<img src="https://avatars1.githubusercontent.com/u/36242784" width="150px;" height="150px;"/><br /><sub><b>Basma Afifi</b></sub>](https://github.com/BasmaAfifi) | [<img src="https://avatars1.githubusercontent.com/u/25587733" width="150px;" height="150px;"/><br /><sub><b>Dahlia Magdi</b></sub>](https://github.com/dahliakarass) | 
| :---: | :---: | :---: | :---: | :---: | 
| amr.kayid@student.guc.edu.eg | omar.elsayedmohamed@student.guc.edu.eg | sama.elsherif@student.guc.edu.eg | basma.afifi@student.guc.edu.eg | dahlia.karass@student.guc.edu.eg |
| **37-15594** | **37-6537** |  **37-0705** |  **37-0620** |  **37-5960** | 
| **T06** | **T07** | **T03** | **T03** | **DMET-T1** | 

<h1 align=center> Overview </h1>

<h2 align=center> This project explores the dataset of the COVID-19 Research Challenge and visualizes the most common words appearing in the papers, giving readers a quick overview of the topics they cover. We also built a better presentation of the papers in the dataset, so that researchers can browse them directly inside the Jupyter notebook without searching the internet. Lastly, we created an interactive search engine for looking up specific words, papers, and challenge tasks, returning the most recent papers with the highest scores for the given keywords. </h2>

# Libraries & Dependencies

We created a small library called [recovid](https://github.com/amrmkayid/recovid.git) so that we can focus on analysing and visualizing the dataset without writing every code snippet in the Jupyter notebook.

### The library consists of these submodules:
- **data**: classes for representing a single paper and a collection of research papers inside the Jupyter notebook
- **process**: NLP techniques for preprocessing and cleaning the input text
- **utils**: some utilities and visualization methods used in the notebook
- **search**: contains classes and methods for building the interactive search engine
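
To give a feel for the kind of cleaning the **process** submodule performs, here is a minimal sketch in plain Python. It is not recovid's actual code: the function name `clean_text` and the tiny stop-word list are assumptions made for illustration, and the real library most likely relies on NLTK (the notebook downloads its `punkt`, `stopwords`, and `wordnet` resources).

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}


def clean_text(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


example_tokens = clean_text("The Transmission of SARS-CoV-2 in Hospitals!")
print(example_tokens)
```

Note that this naive punctuation rule splits hyphenated names such as `SARS-CoV-2` into separate tokens; a real pipeline may want to protect such terms.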

PS: Our custom library must be installed before it can be used. Installing it automatically installs the other dependencies, and importing it brings in all the libraries needed for this project.

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
!pip install git+https://github.com/amrmkayid/recovid.git
# !pip uninstall recovid -y

In [None]:
#@title Libraries & Dependencies

from recovid import *
import recovid.data as data
import recovid.process as process
import recovid.search as search
import recovid.utils as utils

# Configs

In [None]:
#@title Configuration

plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (25, 10)
warnings.filterwarnings("ignore")

nltk.download("punkt")
nltk.download("stopwords")
nltk.download('wordnet')

ROOT_PATH = Path("/kaggle/input/CORD-19-research-challenge/")
METADATA_PATH = ROOT_PATH / "metadata.csv"
ROOT_PATH, METADATA_PATH

## Loading metadata
> We are going to explore the metadata, clean the dataframe, check how to extract useful information, and do some basic visualizations

In [None]:
metadata = pd.read_csv(
    METADATA_PATH,
    dtype={
        "doi": str,
        "title": str,
        "pubmed_id": str,
        "Microsoft Academic Paper ID": str,
    },
)
metadata.head(3)

In [None]:
metadata.info()

In [None]:
# Visualizing null values in each column
metadata.isna().sum().plot(kind="bar", stacked=True)

In [None]:
metadata.isna().sum()

In [None]:
# Distribution of title length

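# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(metadata["title"].str.len().dropna(), kde=True)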
plt.title("Distribution of title length")
plt.show()

In [None]:
#@title Visualizing Most Common Words from Title

utils.most_common_words_from_title(metadata)
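
The internals of `utils.most_common_words_from_title` are not shown in this notebook. As a rough idea of what such a helper has to do, the sketch below counts title words with `collections.Counter`; the function name `top_title_words`, its default stop words, and the toy titles are all hypothetical.

```python
from collections import Counter


def top_title_words(titles, n=3, stop_words=frozenset({"of", "the", "in", "and", "a"})):
    """Return the n most frequent non-stop words across all titles."""
    counts = Counter()
    for title in titles:
        if not isinstance(title, str):  # skip missing titles (None or NaN)
            continue
        counts.update(w for w in title.lower().split() if w not in stop_words)
    return counts.most_common(n)


toy_titles = [
    "Coronavirus transmission in hospitals",
    "Origin of the novel coronavirus",
    None,
    "Coronavirus vaccine trials and transmission",
]
top_words = top_title_words(toy_titles, n=2)
print(top_words)
```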

In [None]:
#@title Visualizing Most Common Journals

utils.most_common_journals(metadata)

In [None]:
# Set the abstract to the paper title if it is null
metadata.abstract = metadata.abstract.fillna(metadata.title)

# dropna returns a new DataFrame, so the result must be assigned to take effect
metadata = metadata.dropna(subset=["publish_time", "journal"])
print("Number of articles before removing duplicates: %s " % len(metadata))

# A row is a duplicate only if its key fields are present and it repeats an earlier (title, abstract) pair
duplicate_paper = ~(
    metadata.title.isnull() | metadata.abstract.isnull() | metadata.publish_time.isnull()
) & metadata.duplicated(subset=["title", "abstract"])
metadata = metadata[~duplicate_paper].reset_index(drop=True)
print("Number of articles AFTER removing duplicates: %s " % len(metadata))
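
The duplicate rule in the cell above is compact, so it may help to see the same logic written out on a toy list of records. This is only an illustrative sketch in plain Python: the function name `drop_duplicate_papers` and the sample records are made up. It treats missing values as equal when comparing keys, which is also how `DataFrame.duplicated` behaves.

```python
def drop_duplicate_papers(records):
    """Keep the first record for each (title, abstract) pair.

    Records missing a title, abstract, or publish_time are never dropped,
    mirroring the boolean mask used in the notebook cell.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec.get("title"), rec.get("abstract"))
        complete = all(rec.get(f) is not None for f in ("title", "abstract", "publish_time"))
        is_repeat = key in seen
        seen.add(key)  # every record counts as "seen", complete or not
        if complete and is_repeat:
            continue  # a complete record repeating an earlier pair is dropped
        kept.append(rec)
    return kept


toy_records = [
    {"title": "A", "abstract": "x", "publish_time": "2020-01-01"},
    {"title": "A", "abstract": "x", "publish_time": "2020-02-01"},  # complete repeat
    {"title": "A", "abstract": "x", "publish_time": None},  # repeat, but incomplete
    {"title": "B", "abstract": "y", "publish_time": "2020-03-01"},
]
kept_records = drop_duplicate_papers(toy_records)
print(len(kept_records))
```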

## Creating an interactive class for research paper presentation

In [None]:
papers = data.ResearchPapers(metadata)

In [None]:
paper = papers[0]
print(f'Example paper \n\nTitle: {paper.title()} \n\nAuthors: {paper.authors(split=True)} \n\nAbstract: {paper.abstract()} \n\n')

In [None]:
# Summary for a single paper
paper

## Rendering the whole paper as an HTML page inside the Jupyter notebook

In [None]:
paper.html()

In [None]:
display(HTML(paper.text()))

# Creating an interactive Search Engine

In [None]:
search_engine = search.SearchEngine(metadata)
search_engine

In [None]:
keywords = 'virus pandemic' #@param {type:"string"}
results = search_engine.search(keywords, 50)
results.results.sort_values(by=['publish_time'], ascending=False).head(5)
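
The cell above follows a two-step pattern: retrieve the best-scoring papers for the keywords, then order those hits by publication date. The ranking function inside `search.SearchEngine` is not shown here, so the sketch below swaps in a deliberately simple score (the number of keyword occurrences in title and abstract) purely to illustrate the pattern. `toy_search` and the toy papers are hypothetical.

```python
def toy_search(papers, keywords, top_k=50):
    """Score each paper by keyword occurrences and return the top_k hits."""
    terms = keywords.lower().split()
    scored = []
    for paper in papers:
        tokens = (paper["title"] + " " + paper["abstract"]).lower().split()
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            scored.append({**paper, "score": score})
    # Highest score first, then keep only the best top_k papers.
    scored.sort(key=lambda p: p["score"], reverse=True)
    return scored[:top_k]


toy_papers = [
    {"title": "Pandemic influenza virus", "abstract": "The virus spread quickly",
     "publish_time": "2010-05-01"},
    {"title": "Novel virus outbreak", "abstract": "Early reports of a pandemic",
     "publish_time": "2020-02-15"},
    {"title": "Bat ecology", "abstract": "Habitat survey", "publish_time": "2018-01-01"},
    {"title": "Virus structure", "abstract": "Protein folding study",
     "publish_time": "2019-07-01"},
]

hits = toy_search(toy_papers, "virus pandemic", top_k=2)
# ISO-formatted dates sort correctly as strings, so the newest paper comes first.
most_recent_first = sorted(hits, key=lambda p: p["publish_time"], reverse=True)
print([p["title"] for p in most_recent_first])
```

Scoring before sorting by date matters: sorting the whole collection by date first would surface recent but irrelevant papers.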

# Creating an Autocomplete Search bar with ranking by score

In [None]:
search_terms = 'virus pandemic' #@param {type:"string"}
searchbar = widgets.interactive(lambda search_terms: search.search_papers(search_engine, search_terms), search_terms=search_terms)
searchbar

# COVID Research Tasks 

In [None]:
tasks = [
    ('What is known about transmission, incubation, and environmental stability?',
     'transmission incubation environment coronavirus'),
    ('What do we know about COVID-19 risk factors?', 'risk factors'),
    ('What do we know about virus genetics, origin, and evolution?',
     'genetics origin evolution'),
    ('What has been published about ethical and social science considerations',
     'ethics ethical social'),
    ('What do we know about diagnostics and surveillance?',
     'diagnose diagnostic surveillance'),
    ('What has been published about medical care?', 'medical care'),
    ('What do we know about vaccines and therapeutics?',
     'vaccines vaccine vaccinate therapeutic therapeutics')
]
tasks = pd.DataFrame(tasks, columns=['Task', 'Keywords'])

# Kaggle/Colab widget searching

In [None]:
results = interact(lambda task: search.show_task(search_engine, tasks, task), task=tasks.Task.tolist());
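
The dropdown presumably works by mapping the selected task question to its keyword string and passing that string to the search engine; the actual body of `search.show_task` is not shown in this notebook. A minimal sketch of the lookup step, with a hypothetical helper `keywords_for_task`, could look like this:

```python
# Two (question, keywords) pairs copied from the tasks table above.
toy_tasks = [
    ("What do we know about COVID-19 risk factors?", "risk factors"),
    ("What has been published about medical care?", "medical care"),
]


def keywords_for_task(tasks, task):
    """Return the keyword string registered for the selected task question."""
    lookup = dict(tasks)
    if task not in lookup:
        raise KeyError(f"Unknown task: {task!r}")
    return lookup[task]


selected = "What has been published about medical care?"
query = keywords_for_task(toy_tasks, selected)
print(query)
```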