# CORD-19 Data Analysis

This notebook provides a beginner-friendly analysis of the **CORD-19 research dataset** (COVID-19 Open Research Dataset).  
We focus on the `metadata.csv` file, which contains information about research papers such as:

- Titles and abstracts  
- Authors and journals  
- Publication dates  
- Sources  

The goals of this analysis are:
1. Explore the structure of the dataset.  
2. Clean and prepare the data.  
3. Perform basic visualizations (publications per year, top journals, word frequency).  
4. Reflect on key insights.  


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re


In [None]:
# Load metadata.csv (sample or full file)
df = pd.read_csv('../data/metadata.csv')

# Display first 5 rows
df.head()
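
The full `metadata.csv` can be large, so it may help to read only the columns the analysis uses. The following is a minimal sketch, not part of the original notebook: the helper name `load_metadata` and the in-memory sample rows are hypothetical, and the column list assumes the file contains the fields discussed in this notebook.

```python
import io
import pandas as pd

# Columns this analysis relies on (assumed to exist in metadata.csv).
COLUMNS = ["title", "abstract", "authors", "journal", "publish_time", "source_x"]

def load_metadata(path_or_buffer, columns=COLUMNS):
    """Read only the needed columns, all as strings, to keep memory use down."""
    return pd.read_csv(path_or_buffer, usecols=columns, dtype=str)

# Tiny in-memory stand-in for metadata.csv (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "cord_uid,title,abstract,authors,journal,publish_time,source_x\n"
    "a1,Paper one,Some text,Doe J,Journal A,2020-03-01,PMC\n"
    "b2,Paper two,,Roe R,,2020,Elsevier\n"
)
sample = load_metadata(sample_csv)
print(sample.shape)
```

With the real file, the same helper could be called as `load_metadata('../data/metadata.csv')`. Reading everything as strings postpones date parsing to the cleaning step, where failures are easier to inspect.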


In [None]:
# Shape of dataset (rows, columns)
df.shape


In [None]:
# General info about columns and data types
df.info()


In [None]:
# Check for missing values
df.isnull().sum()


The output shows that several columns contain missing values, especially `abstract` and `journal`.  
The rest of this analysis focuses on the following columns:  

- `title`  
- `abstract`  
- `authors`  
- `journal`  
- `publish_time`  
- `source_x`  
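
To see how much of each selected column is missing, the raw counts can be paired with percentages. This is an illustrative sketch only: `missing_summary` is a hypothetical helper and the demo frame contains made-up values, not CORD-19 data.

```python
import pandas as pd

def missing_summary(frame, columns):
    """Return count and percentage of missing values for the given columns."""
    subset = frame[columns]
    summary = pd.DataFrame({
        "missing": subset.isnull().sum(),
        "percent": (subset.isnull().mean() * 100).round(1),
    })
    # Most incomplete columns first.
    return summary.sort_values("missing", ascending=False)

# Made-up demo data, for illustration only.
demo = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "abstract": [None, None, "text", "text"],
    "journal": ["J1", None, None, None],
})
report = missing_summary(demo, ["title", "abstract", "journal"])
print(report)
```

On the real data this could be called as `missing_summary(df, [...])` with the six columns listed above.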


In [None]:
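# Drop rows without title or publish_time; copy so that later column
# assignments modify an independent frame rather than a view of df
df_clean = df.dropna(subset=['title', 'publish_time']).copy()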

# Convert publish_time to datetime
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')

# Drop rows where conversion failed (copy again to avoid SettingWithCopyWarning)
df_clean = df_clean.dropna(subset=['publish_time']).copy()

# Extract year
df_clean['year'] = df_clean['publish_time'].dt.year

# Add abstract word count
df_clean['abstract_word_count'] = df_clean['abstract'].fillna("").apply(lambda x: len(x.split()))

df_clean.head()
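
Depending on the pandas version, `pd.to_datetime` with `errors='coerce'` may turn partial dates (for example a bare year) into `NaT`, which silently drops those rows. If only the year is needed, one alternative is to extract it from the string directly. This is a sketch under the assumption that valid `publish_time` values start with a four-digit year; `extract_year` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def extract_year(publish_time):
    """Pull a four-digit year from the start of each publish_time string."""
    # Values that do not start with four digits become missing.
    years = publish_time.astype(str).str.extract(r"^(\d{4})", expand=False)
    # Nullable integer dtype keeps missing years as <NA> instead of floats.
    return pd.to_numeric(years, errors="coerce").astype("Int64")

# Made-up inputs covering full dates, bare years, and invalid entries.
raw = pd.Series(["2020-03-15", "2019", None, "not a date", "2021-01"])
years = extract_year(raw)
print(years.tolist())
```

Comparing `years.isna().sum()` against the number of `NaT` values produced by `pd.to_datetime` is a quick way to check how many rows the stricter parser would discard.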


In [None]:
year_counts = df_clean['year'].value_counts().sort_index()

plt.figure(figsize=(10,6))
sns.barplot(x=year_counts.index, y=year_counts.values, palette='viridis')
plt.title("Publications by Year")
plt.xlabel("Year")
plt.ylabel("Number of Papers")
plt.show()


In [None]:
top_journals = df_clean['journal'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_journals.values, y=top_journals.index, palette='magma')
plt.title("Top 10 Journals Publishing COVID-19 Research")
plt.xlabel("Number of Papers")
plt.ylabel("Journal")
plt.show()


In [None]:
# Combine all titles
titles = ' '.join(df_clean['title'].dropna().tolist()).lower()

# Extract words
words = re.findall(r'\b\w+\b', titles)

# Count word frequency
word_counts = Counter(words)
common_words = word_counts.most_common(20)

# Plot (separate names avoid overwriting the full `words` list above)
top_words, top_counts = zip(*common_words)
plt.figure(figsize=(12,6))
sns.barplot(x=list(top_counts), y=list(top_words), palette='coolwarm')
plt.title("Top 20 Most Frequent Words in Titles")
plt.xlabel("Frequency")
plt.ylabel("Word")
plt.show()
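
The pattern `\b\w+\b` splits hyphenated terms such as "covid-19" into separate tokens, and the count above also includes common function words. A possible refinement is sketched below; it is not the notebook's method. The stop-word list is a small illustrative assumption, `top_title_words` is a hypothetical helper, and the demo titles are made up.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def top_title_words(titles, n=20, stop_words=STOP_WORDS):
    """Count title words, keeping hyphenated terms whole and skipping stop words."""
    counts = Counter()
    for title in titles:
        # \w+(?:-\w+)* matches "covid-19" and "sars-cov-2" as single tokens.
        tokens = re.findall(r"\w+(?:-\w+)*", title.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Made-up titles, for illustration only.
demo_titles = [
    "The impact of COVID-19 on health",
    "SARS-CoV-2 and the COVID-19 pandemic",
    "Health systems in a pandemic",
]
print(top_title_words(demo_titles, n=3))
```

On the real data the function could be called with `df_clean['title'].dropna()`, and its result passed to the same bar plot.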


In [None]:
source_counts = df_clean['source_x'].value_counts()

plt.figure(figsize=(10,6))
sns.barplot(x=source_counts.index, y=source_counts.values, palette='pastel')
plt.title("Distribution of Papers by Source")
plt.xlabel("Source")
plt.ylabel("Number of Papers")
plt.show()
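
If a single `source_x` entry can list several sources, counting the raw strings treats each combination as its own category. The sketch below assumes such entries are separated by semicolons, which should be verified against the actual file; `count_individual_sources` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd

def count_individual_sources(source_column, sep=";"):
    """Split multi-source entries and count each source separately."""
    return (
        source_column.dropna()
        .str.split(sep)      # "Medline; PMC" -> ["Medline", " PMC"]
        .explode()           # one row per individual source
        .str.strip()         # remove whitespace left by the split
        .value_counts()
    )

# Made-up values, for illustration only.
demo_sources = pd.Series(["PMC", "Medline; PMC", "Elsevier; Medline; PMC", None])
counts = count_individual_sources(demo_sources)
print(counts)
```

Because a paper with several sources is counted once per source, the totals from this approach can exceed the number of papers.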


## Conclusion & Reflection

From this basic analysis, we observed:

- The majority of papers in the cleaned dataset were published between **2020 and 2021**, reflecting how sharply research output rose once the pandemic began.  
- Journals such as *Lancet, Nature Medicine,* and *Science* contributed significantly to publishing COVID-19 research.  
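- Frequent words in titles included "COVID-19", "SARS-CoV-2", "health", and "pandemic". Note that the `\b\w+\b` pattern used for tokenizing splits hyphenated terms, so "COVID-19" is counted as "covid" and "19", and common words such as "of" and "the" are not filtered out.  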
- Sources like **PMC** and **Elsevier** provided many papers in this dataset.  

### Reflection
- A challenge was dealing with **missing values** (e.g., abstracts and journals were often missing).  
- Parsing dates also required cleaning since some rows had invalid formats.  
- I learned how to clean, analyze, and visualize real-world datasets using **pandas, matplotlib, and seaborn**.  
- This assignment also introduced me to **Streamlit** for building an interactive data app.  
