This project analyzes the CORD-19 metadata dataset and presents findings through a Streamlit web application.
The CORD-19 dataset contains information about COVID-19 research papers. This analysis focuses on:
- Publication trends over time
- Top publishing journals
- Word frequency in paper titles
- Distribution by data source
The `metadata.csv` file from the CORD-19 dataset includes:
- Paper titles and abstracts
- Publication dates
- Authors and journals
- Source information
Download from: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
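As a rough sketch of how the file might be loaded, the snippet below reads only the fields listed above with pandas. The column names (`title`, `abstract`, `publish_time`, `authors`, `journal`, `source_x`) and the helper `load_metadata` are assumptions made for illustration; check them against the header of your downloaded copy. The in-memory sample rows are made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Column names are assumed to correspond to the fields listed above;
# verify them against the header of your copy of metadata.csv.
COLUMNS = ["title", "abstract", "publish_time", "authors", "journal", "source_x"]


def load_metadata(path_or_buffer):
    """Read only the columns needed for the analysis, all as strings."""
    return pd.read_csv(path_or_buffer, usecols=COLUMNS, dtype=str)


# Tiny in-memory stand-in for metadata.csv (made-up rows, illustration only).
sample_csv = io.StringIO(
    "cord_uid,title,abstract,publish_time,authors,journal,source_x\n"
    "a1,Example title one,Some abstract,2020-03-15,Doe J,Example Journal,PMC\n"
    "b2,Example title two,,2020,Roe R,,bioRxiv\n"
)
df = load_metadata(sample_csv)
print(df.shape)  # (2, 6)
```

With the real dataset, the call would be `load_metadata("metadata.csv")`. Restricting the read to a few columns also reduces memory use on a file of this size.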
To set up the project:
- Clone this repository
- Create a virtual environment: `python -m venv .venv`
- Activate the environment (Windows): `.venv\Scripts\activate`
- Install dependencies: `pip install -r requirements.txt`
Run the `cord19_analysis.ipynb` notebook for data exploration and analysis.
Run the Streamlit app: `streamlit run app.py`
The app allows interactive filtering by year range and data source.
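The filtering step could look roughly like the sketch below. It is not the code in `app.py`. The function `filter_papers`, the derived `year` column, and the `source_x` column name are assumptions for illustration. The filter is kept as a plain pandas function so it can be run outside Streamlit; the comments show how widget values might feed it.

```python
import pandas as pd


def filter_papers(df, year_range, sources):
    """Keep rows whose year is within year_range (inclusive) and whose source is selected."""
    first_year, last_year = year_range
    in_range = df["year"].between(first_year, last_year)
    in_sources = df["source_x"].isin(sources)
    return df[in_range & in_sources]


# In a Streamlit script the two arguments would typically come from widgets, e.g.
#   year_range = st.slider("Year range", 2019, 2022, (2020, 2021))
#   sources = st.multiselect("Source", options, default=options)

# Made-up rows, for illustration only.
papers = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C"],
        "year": [2019, 2020, 2021],
        "source_x": ["PMC", "bioRxiv", "PMC"],
    }
)
selected = filter_papers(papers, (2020, 2021), ["PMC"])
print(selected["title"].tolist())  # ['Paper C']
```

Keeping the filter separate from the widget code makes it easy to test without starting the app.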
Key findings:
- Most COVID-19 papers were published in 2020
- Top journals include various medical and scientific publications
- Common words in titles include covid, coronavirus, and sars
- Papers come primarily from sources such as PMC and bioRxiv
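A title word count of the kind behind the common-words finding can be sketched with the standard library alone. The helper `title_word_counts`, the short stop-word list, and the sample titles are illustrative assumptions, not the notebook's actual code.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would likely use a longer one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_word_counts(titles):
    """Count lowercase alphabetic words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        if not isinstance(title, str):
            continue  # skip missing values such as None or NaN
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up titles, for illustration only.
titles = [
    "Clinical features of COVID-19 in children",
    "The coronavirus outbreak and COVID-19 transmission",
    None,
]
counts = title_word_counts(titles)
print(counts.most_common(1))  # [('covid', 2)]
```

The resulting counter is a plain word-to-frequency mapping, which is the kind of input the WordCloud library's `generate_from_frequencies` method accepts.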
Challenges encountered:
- Handling large dataset size
- Dealing with missing values in abstracts
- Parsing dates in various formats
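One possible way to approach these three challenges together is sketched below: read the CSV in chunks, fill missing abstracts with an empty string, and try several date formats in turn. The helpers `parse_year` and `load_in_chunks`, the list of date formats, and the column names are assumptions for illustration; the project's notebook may handle these differently.

```python
import io
from datetime import datetime

import pandas as pd

# Date formats assumed for illustration; extend the tuple if others appear.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")


def parse_year(value):
    """Return the year from a date string in any known format, or None."""
    if not isinstance(value, str):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).year
        except ValueError:
            continue
    return None


def load_in_chunks(source, chunksize=50_000):
    """Clean the CSV chunk by chunk so only cleaned rows are kept in memory."""
    pieces = []
    for chunk in pd.read_csv(source, dtype=str, chunksize=chunksize):
        chunk["abstract"] = chunk["abstract"].fillna("")  # keep the row, blank the gap
        chunk["year"] = chunk["publish_time"].map(parse_year)
        pieces.append(chunk.dropna(subset=["year"]))  # drop rows with unparseable dates
    result = pd.concat(pieces, ignore_index=True)
    result["year"] = result["year"].astype(int)
    return result


# Made-up rows, for illustration only.
sample = io.StringIO(
    "title,abstract,publish_time\n"
    "Paper A,Some text,2020-03-15\n"
    "Paper B,,2020\n"
    "Paper C,More text,not a date\n"
)
clean = load_in_chunks(sample, chunksize=2)
print(clean[["title", "year"]].values.tolist())  # [['Paper A', 2020], ['Paper B', 2020]]
```

Dropping rows with unparseable dates is only one choice; keeping them with a missing year would also work if those papers matter for the non-temporal analyses.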
Technologies used:
- Python
- Pandas
- Matplotlib
- Seaborn
- Streamlit
- WordCloud