# CORD-19 Metadata Analysis & Streamlit App

This notebook guides you through loading, exploring, cleaning, and visualizing the CORD-19 metadata.csv dataset, and building a simple Streamlit app to present your findings.

**Outline:**
1. Install and Import Required Libraries
2. Download and Load the Dataset
3. Explore Data Structure and Basic Statistics
4. Handle Missing Data and Clean Dataset
5. Prepare Data for Analysis
6. Analyze Publication Trends by Year
7. Identify Top Journals
8. Word Frequency Analysis in Titles
9. Create Visualizations
10. Build Streamlit Application

---

## 1. Install and Import Required Libraries

Install the necessary Python packages and import them for use in this notebook.

In [5]:
%pip install pandas matplotlib seaborn streamlit wordcloud

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import streamlit as st


## 2. Download and Load the Dataset

Download the `metadata.csv` file from the CORD-19 dataset (Kaggle) and load it into a pandas DataFrame.

In [6]:
import pandas as pd

# Load the metadata.csv file (ensure it is in the working directory)
df = pd.read_csv('metadata.csv')

# Display the first few rows
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'metadata.csv'
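
The `FileNotFoundError` above means `metadata.csv` was not in the notebook's working directory when the cell ran. One way to make that failure easier to diagnose is to check for the file before reading it. The sketch below is only an illustration: `load_metadata` and its `path` and `columns` parameters are hypothetical names, not part of pandas or of the CORD-19 tooling.

```python
from pathlib import Path

import pandas as pd


def load_metadata(path="metadata.csv", columns=None):
    """Load the CSV at `path`, failing early with a clear message if it is missing.

    Hypothetical helper for this notebook; the name and parameters are assumptions.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found. Download metadata.csv and place it "
            "in the working directory, or pass its full path."
        )
    # usecols=None reads every column; a list restricts the read to those columns.
    # low_memory=False reads the file in one pass so column types are inferred consistently.
    return pd.read_csv(csv_path, usecols=columns, low_memory=False)
```

Calling `df = load_metadata()` would then replace the plain `pd.read_csv('metadata.csv')` call, and passing a short `columns` list may reduce memory use if the full file is large.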

## 3. Explore Data Structure and Basic Statistics

Examine the DataFrame shape, column data types, and generate basic statistics for numerical columns.

In [None]:
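# DataFrame shape (rows, columns); print it, because a notebook cell
# only displays the value of its last expression
print(df.shape)

# DataFrame info (column types, non-null counts); info() prints directly
df.info()

# Basic statistics for numerical columns (last expression, so it is displayed)
df.describe()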

NameError: name 'df' is not defined

## 4. Handle Missing Data and Clean Dataset

Identify columns with missing values, decide on removal or filling strategies, and create a cleaned version of the dataset.

In [None]:
# Count missing values per column, most incomplete first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Drop rows with missing 'title' or 'publish_time' (essential for analysis).
# .copy() makes df_clean an independent DataFrame, so the column assignments
# below do not trigger SettingWithCopyWarning
df_clean = df.dropna(subset=['title', 'publish_time']).copy()

# Optionally fill missing abstracts with empty string
df_clean['abstract'] = df_clean['abstract'].fillna('')

df_clean.head()
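
Raw missing counts are easier to act on next to percentages, since a column that is 90% empty calls for a different strategy than one that is 1% empty. The following is a minimal sketch on made-up sample data; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most incomplete first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only, not taken from metadata.csv
sample = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "abstract": [None, None, "text", None],
    "journal": ["J1", "J2", "J3", "J4"],
})
print(missing_summary(sample))
```

Applied to the real DataFrame, `missing_summary(df)` can help decide which columns to drop rows on and which to fill.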

## 5. Prepare Data for Analysis

Convert date columns to datetime, extract publication year, and create new columns such as abstract word count.

In [None]:
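# Convert publish_time to datetime; values that cannot be parsed become NaT
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')
# Extract the year as a nullable integer, so rows with NaT get <NA>
# instead of turning the whole column into floats (2020.0, 2021.0, ...)
df_clean['year'] = df_clean['publish_time'].dt.year.astype('Int64')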

# Create abstract word count column
df_clean['abstract_word_count'] = df_clean['abstract'].apply(lambda x: len(str(x).split()))

df_clean[['title', 'publish_time', 'year', 'abstract_word_count']].head()
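
With `errors='coerce'`, any `publish_time` value that pandas cannot parse silently becomes `NaT`, and the paper drops out of the yearly counts. If your copy of the file contains year-only strings alongside full dates (an assumption worth checking, not something this notebook verifies), a fallback is to take the leading four digits straight from the raw string. This swaps datetime parsing for a regular expression; `extract_year` is a hypothetical helper.

```python
import pandas as pd


def extract_year(publish_time):
    """Take the leading four-digit year from each value; anything else becomes <NA>."""
    # astype(str) turns missing values into 'None'/'nan', which the pattern does not match
    years = publish_time.astype(str).str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(years, errors="coerce").astype("Int64")


# Toy values for illustration only
raw = pd.Series(["2020-03-15", "2019", "not a date", None])
print(extract_year(raw))
```

Apply it to the raw string column (before the `pd.to_datetime` conversion), then compare the result with the `year` column above to see how many rows the datetime conversion lost.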

## 6. Analyze Publication Trends by Year

Count the number of papers published each year and prepare data for visualization.

In [None]:
# Count papers by publication year
year_counts = df_clean['year'].value_counts().sort_index()
year_counts

## 7. Identify Top Journals

Find and display the journals with the most COVID-19 research papers.

In [None]:
# Top journals by publication count
top_journals = df_clean['journal'].value_counts().head(10)
top_journals
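
`value_counts()` treats strings that differ only in case or surrounding whitespace as different journals, which can split one journal across several rows. Assuming the `journal` column has that kind of inconsistency (not verified here), a light normalization step before counting might look like this; `top_journals_normalized` is a hypothetical helper.

```python
import pandas as pd


def top_journals_normalized(journals, n=10):
    """Count journals after trimming whitespace and lowercasing the names."""
    cleaned = journals.dropna().astype(str).str.strip().str.lower()
    # Discard names that are empty once whitespace is removed
    cleaned = cleaned[cleaned != ""]
    return cleaned.value_counts().head(n)


# Toy values for illustration only
sample_journals = pd.Series(["PLoS One", "plos one", " PLoS One ", "Nature", None, ""])
print(top_journals_normalized(sample_journals))
```

Lowercasing loses the original capitalization, so this form suits counting better than display.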

## 8. Word Frequency Analysis in Titles

Perform a simple word frequency analysis on paper titles to find the most common words.

In [None]:
# Simple word frequency analysis for titles
from collections import Counter
import re

# Combine all titles into one string
all_titles = ' '.join(df_clean['title'].dropna().astype(str))
# Remove punctuation and split into words
words = re.findall(r'\b\w+\b', all_titles.lower())
# Remove common stopwords
stopwords = set(['the', 'and', 'of', 'in', 'to', 'for', 'a', 'on', 'with', 'by', 'an', 'at', 'from', 'is', 'as', 'are', 'that', 'this', 'be', 'or', 'we', 'can', 'covid', '19'])
filtered_words = [w for w in words if w not in stopwords]
word_freq = Counter(filtered_words)
word_freq.most_common(20)
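
The same counting can be wrapped in a reusable function that processes one title at a time instead of joining every title into a single string first. This is a sketch on made-up titles, not the notebook's method: `title_word_counts`, `stopwords`, and `min_length` are hypothetical names. It matches alphabetic runs only, so numeric tokens such as `19` never appear and `covid-19` contributes just `covid`.

```python
from collections import Counter
import re


def title_word_counts(titles, stopwords=frozenset(), min_length=3):
    """Count lowercase alphabetic words across titles, skipping stopwords and short words."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", str(title).lower()):
            if len(word) >= min_length and word not in stopwords:
                counts[word] += 1
    return counts


# Toy titles for illustration only
sample_titles = [
    "Clinical features of patients infected with coronavirus",
    "Coronavirus transmission in hospital patients",
    "A model of viral transmission",
]
counts = title_word_counts(sample_titles, stopwords={"the", "and", "with"})
print(counts.most_common(3))
```

The `min_length` filter removes most one- and two-letter function words without listing them, which keeps the stopword set short.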

## 9. Create Visualizations

Plot publications over time, a bar chart of the top journals, a word cloud of titles, and the distribution of paper counts by source.

In [None]:
# Publications by year
plt.figure(figsize=(8,5))
year_counts.plot(kind='bar')
plt.title('Publications by Year')
plt.xlabel('Year')
plt.ylabel('Number of Papers')
plt.tight_layout()
plt.show()

# Top journals bar chart
plt.figure(figsize=(10,5))
top_journals.plot(kind='bar')
plt.title('Top Journals Publishing COVID-19 Research')
plt.xlabel('Journal')
plt.ylabel('Number of Papers')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Word cloud of paper titles
wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(filtered_words))
plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Paper Titles')
plt.show()

# Distribution of paper counts by source
plt.figure(figsize=(8,5))
df_clean['source_x'].value_counts().plot(kind='bar')
plt.title('Distribution of Paper Counts by Source')
plt.xlabel('Source')
plt.ylabel('Number of Papers')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 10. Build Streamlit Application

Create a Streamlit app with a title, a description, and an interactive year-range slider, then display the visualizations and a sample of the data.

In [None]:
# Streamlit app code (save as app.py for deployment)
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def run_app():
    st.title("CORD-19 Data Explorer")
    st.write("Simple exploration of COVID-19 research papers")
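    # Load data
    df = pd.read_csv('metadata.csv')
    df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')
    # Drop rows without a parseable date so 'year' can be a plain integer
    df = df.dropna(subset=['publish_time']).copy()
    df['year'] = df['publish_time'].dt.year.astype(int)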
    # Interactive year range
    min_year, max_year = int(df['year'].min()), int(df['year'].max())
    year_range = st.slider("Select year range", min_year, max_year, (min_year, max_year))
    df_filtered = df[(df['year'] >= year_range[0]) & (df['year'] <= year_range[1])]
    st.write(f"Number of papers: {len(df_filtered)}")
    # Publications by year
    year_counts = df_filtered['year'].value_counts().sort_index()
    fig, ax = plt.subplots()
    year_counts.plot(kind='bar', ax=ax)
    ax.set_title('Publications by Year')
    st.pyplot(fig)
    # Top journals
    top_journals = df_filtered['journal'].value_counts().head(10)
    fig2, ax2 = plt.subplots()
    top_journals.plot(kind='bar', ax=ax2)
    ax2.set_title('Top Journals')
    st.pyplot(fig2)
    # Word cloud
    all_titles = ' '.join(df_filtered['title'].dropna().astype(str))
    wc = WordCloud(width=800, height=400, background_color='white').generate(all_titles)
    fig3, ax3 = plt.subplots(figsize=(10,5))
    ax3.imshow(wc, interpolation='bilinear')
    ax3.axis('off')
    st.pyplot(fig3)
    # Show sample data
    st.write("Sample Data:")
    st.dataframe(df_filtered[['title', 'journal', 'publish_time']].head(10))

# When this code is saved as app.py, uncomment the lines below and launch it with:
#   streamlit run app.py
# if __name__ == "__main__":
#     run_app()
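
The filtering and counting inside `run_app` can also live in plain pandas functions, which can be checked without starting Streamlit. The sketch below assumes a DataFrame that already has a numeric `year` column, as `run_app` builds; `filter_by_year` and `top_values` are hypothetical helper names.

```python
import pandas as pd


def filter_by_year(df, start, end):
    """Keep rows whose 'year' lies in [start, end]; rows with a missing year are dropped."""
    return df[df["year"].between(start, end)]


def top_values(df, column, n=10):
    """Return the n most frequent values in `column`."""
    return df[column].value_counts().head(n)


# Toy data for illustration only
papers = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "journal": ["J1", "J1", "J2", "J3"],
    "year": [2019, 2020, 2020, 2021],
})
recent = filter_by_year(papers, 2020, 2021)
print(len(recent))
print(top_values(recent, "journal", n=2))
```

Inside the app, `filter_by_year(df, *year_range)` could replace the boolean mask. Wrapping the CSV loading step in a function decorated with `st.cache_data` is a further option, so the file is not re-read each time the slider moves.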