This project provides an interactive data exploration and visualization platform for analyzing the CORD-19 research dataset. It consists of a Jupyter notebook for exploratory data analysis and a Streamlit web application for interactive visualizations and insights.
The COVID-19 Research Dataset Analysis project aims to:
- Facilitate exploration of COVID-19 research publications metadata
- Identify publication trends and patterns over time
- Analyze top contributing journals and institutions
- Provide accessible visualizations for research insights
- Enable data-driven understanding of the scientific response to COVID-19
Jupyter Notebook (COVID.ipynb):
- Purpose: Exploratory data analysis and initial insights
- Data Flow: Load → Clean → Transform → Analyze → Visualize
- Output: Static charts and analysis summaries
- Audience: Data scientists, researchers, analysts
Streamlit Application (streamlit_app.py):
- Purpose: Interactive web-based dashboard
- Architecture: Single-page app with tabbed interface
- Data Caching: Optimized performance with Streamlit's caching
- Deployment: Locally runnable web application
- Data Loading & Cleaning: Handles CORD-19 metadata with proper missing value treatment
- Time-based Analysis: Publication trends over time (2019-2024)
- Journal Analysis: Top publishing journals and word frequency patterns
- Visualizations: Line plots, bar charts, word clouds, frequency plots
- Statistical Summaries: Descriptive statistics for numerical features
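The loading, cleaning, and time-based steps listed above could look roughly like the sketch below. It is an illustration, not the notebook's actual code: the `load_and_clean` helper, the inline sample rows, and the `"Unknown"` placeholder for missing journals are all assumptions. Because `publish_time` values can mix full dates with bare years, the sketch takes the year from the leading four digits instead of parsing full dates.

```python
import io

import pandas as pd

# Tiny inline stand-in for metadata.csv (made-up rows, not real CORD-19 records).
SAMPLE_CSV = """cord_uid,title,abstract,publish_time,authors,journal
a1,Viral load dynamics,Some text,2020-03-15,Doe J,Journal A
a2,,No title here,2020-07-01,Roe K,Journal B
a3,Coronavirus outbreak review,,2021,Lee S,
a4,Undated study,Text,,Kim H,Journal A
"""


def load_and_clean(source) -> pd.DataFrame:
    """Load CORD-19-style metadata, handle missing values, and derive a year column."""
    df = pd.read_csv(source, dtype=str)
    # Rows without a title or a publication date cannot be used in the analyses.
    df = df.dropna(subset=["title", "publish_time"])
    # Assumed policy: keep rows with a missing abstract or journal, using placeholders.
    df = df.fillna({"abstract": "", "journal": "Unknown"})
    # Take the leading four digits as the year ("2020-03-15" and "2021" both work).
    year = df["publish_time"].str.extract(r"^(\d{4})", expand=False)
    df = df.assign(year=pd.to_numeric(year, errors="coerce"))
    df = df.dropna(subset=["year"]).astype({"year": int})
    return df.reset_index(drop=True)


cleaned = load_and_clean(io.StringIO(SAMPLE_CSV))
papers_per_year = cleaned["year"].value_counts().sort_index()
print(papers_per_year)
```

For the real dataset, pass the path to `metadata.csv` instead of the in-memory sample.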
- Dashboard Layout: Multi-tab interface (Overview, Publications, Journals, Word Analysis)
- Interactive Controls: Sliders for customizable data display
- Metric Cards: Key statistics display (total papers, time range, word counts)
- Dynamic Charts: Matplotlib and Streamlit built-in charts
- Data Table Views: Scrollable tables for detailed examination
- Python 3.7+
- Internet connection for dependency installation
- Create Virtual Environment:
python -m venv venv
venv\Scripts\activate # Windows
- Install Dependencies:
pip install -r requirements.txt
- Place the metadata.csv file in the project root directory
- Ensure the CSV contains the CORD-19 dataset columns (cord_uid, title, abstract, publish_time, authors, journal)
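Before running either application, the column requirement can be checked with a few lines of standard-library Python. This is a hedged sketch: the `missing_columns` helper is hypothetical and not part of the project. It reads only the header row, so it stays fast on a large file.

```python
import csv
from pathlib import Path

# Columns the project expects, as listed above.
REQUIRED_COLUMNS = {"cord_uid", "title", "abstract", "publish_time", "authors", "journal"}


def missing_columns(csv_path) -> set:
    """Return the required column names that are absent from the CSV header row."""
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - set(header)


if __name__ == "__main__":
    path = Path("metadata.csv")  # assumed location: project root
    if not path.exists():
        print(f"{path} not found; place it in the project root directory.")
    else:
        missing = missing_columns(path)
        if missing:
            print(f"Missing columns: {sorted(missing)}")
        else:
            print("All required columns present.")
```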
jupyter notebook COVID.ipynb
Process Flow:
- Import libraries
- Load and examine data structure
- Perform data cleaning
- Transform dates and add derived columns
- Generate analyses (publications by year, top journals, word frequencies)
- Create visualizations
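As an illustration of the word-frequency step in the flow above, the sketch below counts title words with `collections.Counter`. The `title_word_counts` helper, the stop-word list, and the sample titles are assumptions made for the example; the notebook may tokenize and filter differently.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; the notebook may use another).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_word_counts(titles, top_n=10):
    """Count lowercase alphabetic words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        # Digits and punctuation act as separators, so "COVID-19" yields "covid".
        words = re.findall(r"[a-z]+", str(title).lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up titles for demonstration only.
sample_titles = [
    "Clinical features of COVID-19 in children",
    "SARS coronavirus and the immune response",
    "Modelling the spread of COVID-19",
]
print(title_word_counts(sample_titles, top_n=3))
```

The resulting list of `(word, count)` pairs can be passed to a bar chart, or converted to a dict for a word cloud.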
streamlit run streamlit_app.py
Usage Guide:
- Overview Tab: View dataset statistics and sample publications
- Publications Tab: Analyze temporal trends and publication counts
- Journals Tab: Explore top publishing journals with customizable counts
- Word Analysis Tab: Examine title word frequencies and cloud visualizations
- Adjust slider controls to view different numbers of results
- Use multiselect filters for year-based analysis
- Resize browser window for responsive layout
The applications reveal key patterns in COVID-19 research:
- Accelerated publication growth from 2020 onwards
- Medical and scientific journals as primary publishers
- Thematic focus on COVID, SARS, viral topics in titles
- Increasing publication volume reflecting pandemic urgency
- Dependencies: pandas, matplotlib, seaborn, streamlit, wordcloud
- Data Size: Designed to handle datasets of 1M+ rows
- Performance: Streamlit caching prevents redundant data processing
- Compatibility: Tested on Windows 11 with Python 3.9-3.13
For detailed code documentation, see the inline comments in the respective files.