Download and load the data
Download only the metadata.csv file from the CORD-19 dataset
Load it into a pandas DataFrame
Examine the first few rows and data structure
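
The loading steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name load_metadata is hypothetical, the column names are assumed to match the CORD-19 metadata file, and a tiny inline sample stands in for the real download so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for metadata.csv so the sketch runs without the real download.
# The column names are an assumption about the CORD-19 metadata layout.
SAMPLE_CSV = """cord_uid,title,abstract,publish_time,journal,source_x
a1,Coronavirus spread models,We model spread.,2020-03-01,The Lancet,PMC
b2,Vaccine trial results,,2021-01-15,Nature,Medline
c3,Older SARS study,A study of SARS.,2003-05-20,,PMC
"""


def load_metadata(path_or_buffer):
    """Read the metadata CSV into a DataFrame (hypothetical helper)."""
    # low_memory=False reads the file in one pass, which avoids
    # mixed-dtype warnings on large CSV files.
    return pd.read_csv(path_or_buffer, low_memory=False)


# For real use, pass the path to the downloaded file, e.g. "metadata.csv".
df = load_metadata(io.StringIO(SAMPLE_CSV))

print(df.head())  # first few rows
df.info()         # column names, non-null counts, dtypes
```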
Check the DataFrame dimensions (rows, columns)
Identify data types of each column
Check for missing values in important columns
Generate basic statistics for numerical columns
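
One possible way to run these exploration checks is shown below. The sample rows and the choice of title, abstract and journal as the "important" columns are assumptions made for illustration; pubmed_id is included only so that there is a numerical column to summarize.

```python
import pandas as pd

# Illustrative sample; the real DataFrame would come from metadata.csv.
df = pd.DataFrame({
    "title": ["Paper A", "Paper B", None, "Paper D"],
    "abstract": ["Some text.", None, "More text.", None],
    "journal": ["Nature", None, "BMJ", "Nature"],
    "pubmed_id": [101, 102, 103, 104],
})

# Dimensions as (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column
print(df.dtypes)

# Missing values in the columns assumed to matter most
key_columns = ["title", "abstract", "journal"]
missing_counts = df[key_columns].isnull().sum()
print(missing_counts)

# describe() summarizes numerical columns only by default
numeric_summary = df.describe()
print(numeric_summary)
```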
Identify columns with many missing values
Decide how to handle missing values (removal or filling)
Create a cleaned version of the dataset
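
A hedged sketch of one cleaning strategy follows. The 80% threshold, the decision to drop rows without a title, and the fill values are all assumptions chosen for illustration; other choices may suit the analysis better.

```python
import pandas as pd

# Illustrative sample with different amounts of missing data per column.
df = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "abstract": ["x y", "z", None, "w"],
    "journal": ["J1", None, None, "J2"],
    "mag_id": [None, None, None, None],
})

# Share of missing values per column, worst first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed rule: drop columns that are more than 80% empty
sparse_columns = missing_share[missing_share > 0.8].index

cleaned = (
    df.drop(columns=sparse_columns)
      .dropna(subset=["title"])                         # removal: rows without a title
      .fillna({"abstract": "", "journal": "Unknown"})   # filling: placeholder values
)
print(cleaned)
```

Keeping the result in a separate variable leaves the original DataFrame untouched, so the two versions can be compared.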
Convert date columns to datetime format
Extract year from publication date for time-based analysis
Create new columns if needed (e.g., abstract word count)
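
The preparation steps might look like this. The column names publish_time and abstract are assumed to match the CORD-19 metadata file, and the new column names year and abstract_word_count are illustrative.

```python
import pandas as pd

# Illustrative sample, including an unparseable and a missing date.
df = pd.DataFrame({
    "publish_time": ["2020-03-01", "2021-01-15", "not a date", None],
    "abstract": ["We model spread.", "", "A study of SARS in bats.", None],
})

# errors="coerce" turns unparseable values into NaT instead of raising
df["publish_time"] = pd.to_datetime(df["publish_time"], errors="coerce")

# Nullable integer dtype keeps years as whole numbers despite missing dates
df["year"] = df["publish_time"].dt.year.astype("Int64")

# Word count of each abstract; missing abstracts count as zero words
df["abstract_word_count"] = df["abstract"].fillna("").str.split().str.len()

print(df)
```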
Count papers by publication year
Identify top journals publishing COVID-19 research
Find most frequent words in titles (using simple word frequency)
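
These three analyses can be sketched with value_counts and a Counter. The helper top_title_words, the small stopword set, and the sample rows are hypothetical; a real analysis would likely use a fuller stopword list.

```python
import re
from collections import Counter

import pandas as pd

# Illustrative sample; assumes a year column has already been extracted.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2019],
    "journal": ["Nature", "The Lancet", "Nature", None],
    "title": [
        "COVID-19 vaccine trial",
        "The spread of COVID-19",
        "Vaccine safety in adults",
        "A coronavirus review",
    ],
})

# Papers per publication year, in chronological order
papers_per_year = df["year"].value_counts().sort_index()

# Journals with the most papers (missing journals are ignored)
top_journals = df["journal"].value_counts().head(10)

# Small illustrative stopword set
STOPWORDS = {"the", "of", "in", "a", "an", "and", "for", "to", "on"}


def top_title_words(titles, n=10):
    """Return the n most common non-stopword words across the titles."""
    counts = Counter()
    for title in titles.dropna():
        # Lowercase, then keep alphanumeric words, allowing inner hyphens
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


print(papers_per_year)
print(top_journals)
print(top_title_words(df["title"], n=5))
```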
Plot number of publications over time
Create a bar chart of top publishing journals
Generate a word cloud of paper titles
Plot distribution of paper counts by source
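
A possible layout for the four plots, using matplotlib through the pandas plotting interface. This is a sketch under assumptions: the sample rows are made up, the column names are assumed to match the cleaned dataset, and the word cloud relies on the third-party wordcloud package, which is skipped if it is not installed.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample; the real data would be the cleaned DataFrame.
df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2021, 2021],
    "journal": ["BMJ", "Nature", "BMJ", "Nature", "Nature", "The Lancet"],
    "source_x": ["PMC", "Medline", "PMC", "WHO", "PMC", "Medline"],
    "title": [
        "Coronavirus review", "COVID-19 vaccine trial", "Spread of COVID-19",
        "Vaccine safety", "Pandemic response", "COVID-19 in children",
    ],
})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Publications over time
df["year"].value_counts().sort_index().plot(
    ax=axes[0, 0], marker="o", title="Publications over time"
)

# Top publishing journals
df["journal"].value_counts().head(10).plot.bar(
    ax=axes[0, 1], title="Top journals"
)

# Paper counts by source
df["source_x"].value_counts().plot.bar(
    ax=axes[1, 0], title="Papers by source"
)

# Word cloud of titles, only if the optional package is available
try:
    from wordcloud import WordCloud

    text = " ".join(df["title"].dropna())
    cloud = WordCloud(width=600, height=400, background_color="white").generate(text)
    axes[1, 1].imshow(cloud.to_array(), interpolation="bilinear")
    axes[1, 1].set_title("Title word cloud")
except ImportError:
    axes[1, 1].set_title("Title word cloud (wordcloud not installed)")
axes[1, 1].axis("off")

fig.tight_layout()
```

Calling fig.savefig with a file name of your choice writes the figure to disk.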
Create a basic layout with title and description
Add interactive widgets (sliders, dropdowns)
Display your visualizations in the app
Show a sample of the data
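
The app steps could be sketched as below. The list does not name a framework, so Streamlit is an assumption here; the function names filter_by_year and render_app, the page title, and the widget labels are hypothetical as well. The filtering logic is kept in a plain function so it can be checked without a running app.

```python
import pandas as pd


def filter_by_year(df, start, end):
    """Keep rows whose year lies in [start, end]; rows with no year are dropped."""
    return df[df["year"].between(start, end)]


def render_app(df):
    """Draw the page. Assumes Streamlit is installed and df has year and journal columns."""
    import streamlit as st  # imported here so filter_by_year works without Streamlit

    # Basic layout: title and description
    st.title("CORD-19 Data Explorer")
    st.write("Simple exploration of COVID-19 research papers.")

    # Interactive widgets: a year-range slider and a dropdown
    # (the slider needs at least two distinct years in the data)
    years = df["year"].dropna().astype(int)
    first, last = int(years.min()), int(years.max())
    start, end = st.slider(
        "Publication year range", min_value=first, max_value=last, value=(first, last)
    )
    top_n = st.selectbox("Number of top journals", [5, 10, 20])

    subset = filter_by_year(df, start, end)

    # Visualizations
    st.subheader("Publications per year")
    st.bar_chart(subset["year"].value_counts().sort_index().to_frame("papers"))
    st.subheader("Top journals")
    st.bar_chart(subset["journal"].value_counts().head(top_n).to_frame("papers"))

    # Sample of the data
    st.subheader("Sample of the data")
    st.dataframe(subset.head(20))


# Illustrative data for trying out the filter on its own
sample = pd.DataFrame({
    "year": [2019, 2020, 2021, None],
    "journal": ["BMJ", "Nature", "Nature", "The Lancet"],
})
print(filter_by_year(sample, 2020, 2021))
```

To see the page, call render_app on the cleaned DataFrame at the end of the script and start it with the streamlit run command.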