CORD-19 Data Explorer
This project provides a Python script for analyzing and visualizing a subset of the COVID-19 Open Research Dataset (CORD-19). The script has two main parts: a command-line analysis workflow and a web-based data explorer built with the Streamlit framework.
Features
Data Loading: Automatically downloads and loads the CORD-19 metadata.
Data Cleaning: Handles missing values and prepares the data for analysis.
Data Analysis:
Counts papers by publication year.
Identifies the top publishing journals.
Finds the most frequent words in paper titles.
Visualizations: Generates various plots to visualize key insights, including a word cloud of titles, a bar chart of publications over time, and a chart of the top journals.
Interactive Web App: A Streamlit application that allows users to explore the data with an interactive slider for filtering by publication year.
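As an illustration of the title word-frequency step listed above, here is a minimal sketch that uses only the Python standard library. The function name top_title_words, the STOP_WORDS set, and the sample titles are all assumptions made for this example; the actual script may implement this step differently, for instance with pandas.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real script may use a different one.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "on", "with"}


def top_title_words(titles, n=5):
    """Return the n most common non-stop-words across the given titles."""
    counts = Counter()
    for title in titles:
        if not title:  # skip missing or empty titles
            continue
        # Keep hyphenated tokens such as "covid-19" together.
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up titles, for illustration only.
sample_titles = [
    "Clinical features of COVID-19 in children",
    "COVID-19 and the risk of stroke",
    None,
    "Transmission of SARS-CoV-2 in households",
]

print(top_title_words(sample_titles, n=3))
```

Skipping empty titles inside the function keeps the sketch usable even before the cleaning step has run.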
Getting Started
Prerequisites
To run the script, you need Python installed along with the following libraries:
pandas
requests
matplotlib
wordcloud
streamlit
You can install these dependencies using pip:
pip install pandas requests matplotlib wordcloud streamlit
How to Run
- Command-Line Analysis
The script contains a main execution block that, when run directly, performs the data loading, cleaning, and visualization steps and displays the plots in separate windows.
To run this part of the script, save the code as cord_19_analysis.py and execute it from your terminal:
python cord_19_analysis.py
The script will print analysis outputs to the console and display the generated plots.
- Streamlit Web App
To run the interactive web application, you must first have Streamlit installed. The run_streamlit_app() function is designed to be the entry point for the app.
To launch the app, save the code as app.py and run the following command in your terminal:
streamlit run app.py
This will open a new tab in your web browser with the interactive CORD-19 Data Explorer.
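The year filtering behind the app's slider can be sketched as a plain function. This is a hedged illustration, not the script's actual code: filter_by_year, the "year" key, and the sample records are hypothetical. In a Streamlit app, a range slider created with st.slider and a tuple default returns a (start, end) tuple that could be passed straight in as year_range.

```python
def filter_by_year(records, year_range):
    """Keep records whose 'year' falls inside the inclusive (start, end) range."""
    start, end = year_range
    return [
        r for r in records
        if r.get("year") is not None and start <= r["year"] <= end
    ]


# Made-up records, for illustration only.
sample_records = [
    {"title": "Paper A", "year": 2019},
    {"title": "Paper B", "year": 2020},
    {"title": "Paper C", "year": 2021},
    {"title": "Paper D", "year": None},  # missing year is dropped
]

selected = filter_by_year(sample_records, (2020, 2021))
print([r["title"] for r in selected])
```

Keeping the filter separate from the UI code makes it easy to check without launching the app.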
Code Structure
The script is organized into logical parts to make it easy to understand and modify:
load_data(): Responsible for fetching the dataset.
clean_data(): Cleans and prepares the DataFrame.
analyze_and_visualize(): Performs the core analysis and generates static plots.
run_streamlit_app(): Contains the full code for the Streamlit web application.
if __name__ == '__main__': The main block that controls which part of the script is executed.
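To show how these parts might fit together, here is a minimal skeleton with placeholder bodies. The function names come from the list above, but every body, the main() helper, and the sample record are assumptions for illustration; the real functions download the dataset and draw plots.

```python
def load_data():
    """Placeholder: the real function fetches the CORD-19 metadata."""
    return [
        {"title": "Example paper", "publish_time": "2020-03-01"},
        {"title": None, "publish_time": "2020-04-15"},
    ]


def clean_data(records):
    """Placeholder: drop records that have no title."""
    return [r for r in records if r.get("title")]


def analyze_and_visualize(records):
    """Placeholder: the real function prints summaries and shows plots."""
    print(f"{len(records)} paper(s) after cleaning")
    return len(records)


def main():
    """Hypothetical helper that chains the command-line workflow."""
    records = clean_data(load_data())
    return analyze_and_visualize(records)


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

The guard at the bottom lets other modules import the functions without triggering the analysis.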