# CORD-19 Dataset Analysis

This notebook analyzes the `metadata.csv` file from the CORD-19 dataset, which contains information about COVID-19 research papers. It covers data loading, exploration, cleaning, and visualization, and points to a companion Streamlit application.

## Required Libraries
- pandas
- matplotlib
- seaborn
- wordcloud
- streamlit (for the app)

Install them using: `pip install pandas matplotlib seaborn wordcloud streamlit`

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

## Part 1: Data Loading and Basic Exploration

In [None]:
# Load the metadata.csv file
df = pd.read_csv('metadata.csv', low_memory=False)

# Display basic information
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()

In [None]:
# Data types and missing values
print("Data types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum())

In [None]:
# Basic statistics for numerical columns
df.describe()
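
Raw missing-value counts are hard to compare across columns, so it can help to view them as percentages as well. The sketch below is one way to do that; `missing_summary` is a hypothetical helper name, not part of pandas or of this notebook, and the toy frame only stands in for `df`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        # isna().mean() is the fraction of missing rows in each column
        "percent": (df.isna().mean() * 100).round(1),
    })
    # Keep only columns that actually have missing values, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the real metadata (illustrative values only)
toy = pd.DataFrame({
    "cord_uid": ["1", "2", "3", "4"],
    "title": ["a", None, "c", "d"],
    "abstract": [None, None, "x", "y"],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(df)` can inform which columns are worth keeping before the cleaning step.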

## Part 2: Data Cleaning and Preparation

In [None]:
# Create a copy for cleaning
df_clean = df.copy()

# Handle missing values
# Drop rows with missing titles (essential)
df_clean = df_clean.dropna(subset=['title'])

# Fill missing abstracts with empty string
df_clean['abstract'] = df_clean['abstract'].fillna('')

# Convert publish_time to datetime
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')

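# Extract year from publish_time.
# Unparseable dates become NaT, which would turn the year into a float
# (e.g. 2020.0); the nullable Int64 dtype keeps whole-number years.
df_clean['publish_year'] = df_clean['publish_time'].dt.year.astype('Int64')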

# Create abstract word count
df_clean['abstract_word_count'] = df_clean['abstract'].apply(lambda x: len(str(x).split()))

print(f"Cleaned dataset shape: {df_clean.shape}")
df_clean.head()
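
One caveat on the date conversion: if `publish_time` mixes full dates with year-only strings (an assumption worth checking against your copy of `metadata.csv`), `pd.to_datetime(..., errors='coerce')` may turn some of them into `NaT`, depending on the pandas version. A possible fallback, sketched here with a hypothetical `extract_year` helper, is to read the leading four digits directly from the raw string.

```python
import pandas as pd


def extract_year(publish_time: pd.Series) -> pd.Series:
    """Pull a leading 4-digit year out of raw date strings."""
    # astype(str) makes every value a string; anything that does not
    # start with four digits simply fails the regex below.
    years = publish_time.astype(str).str.extract(r"^(\d{4})", expand=False)
    # Non-matching rows become missing; Int64 keeps the years as integers
    return pd.to_numeric(years, errors="coerce").astype("Int64")


# Illustrative values only, not taken from the dataset
raw = pd.Series(["2020-03-15", "2019", None, "not a date"])
print(extract_year(raw))
```

This only recovers the year, so it complements the datetime column rather than replacing it.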

## Part 3: Data Analysis and Visualization

In [None]:
# Publications by year
yearly_counts = df_clean['publish_year'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
yearly_counts.plot(kind='bar')
plt.title('Number of Publications by Year')
plt.xlabel('Year')
plt.ylabel('Number of Papers')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Top journals
top_journals = df_clean['journal'].value_counts().head(10)

plt.figure(figsize=(12, 6))
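# barh draws the first row at the bottom, so sort ascending
# to place the most frequent journal at the top of the chart
top_journals.sort_values().plot(kind='barh')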
plt.title('Top 10 Journals Publishing COVID-19 Research')
plt.xlabel('Number of Papers')
plt.ylabel('Journal')
plt.tight_layout()
plt.show()
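
Journal names in bibliographic metadata are often spelled inconsistently (case, stray whitespace), which can split one journal across several bars. The sketch below shows one possible normalization before counting; `top_journals_normalized` is a hypothetical helper, and lowercasing is a deliberately blunt choice that sacrifices display names for consistent grouping.

```python
import pandas as pd


def top_journals_normalized(journals: pd.Series, n: int = 10) -> pd.Series:
    """Count journals after trimming, collapsing whitespace, and lowercasing."""
    cleaned = (
        journals.dropna()
        .astype(str)
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )
    # Drop entries that were only whitespace
    cleaned = cleaned[cleaned != ""]
    return cleaned.value_counts().head(n)


# Illustrative values only
sample = pd.Series(["The Lancet", "the lancet ", "BMJ", None, "  "])
print(top_journals_normalized(sample))
```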

In [None]:
# Word cloud of titles
# Cast to str so a stray non-string title cannot break the join
titles_text = ' '.join(df_clean['title'].dropna().astype(str))

wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(titles_text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Paper Titles')
plt.show()
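
A word cloud is easy to scan but hard to read numbers from. As a complement, the most frequent title words can be counted directly. This is a minimal sketch using only the standard library; `top_title_words` and the small `STOPWORDS` set are assumptions made for illustration, not the stop-word list that `WordCloud` uses.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; extend as needed
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "on", "with", "an"}


def top_title_words(titles, n=10, stopwords=STOPWORDS):
    """Return the n most common words across titles as (word, count) pairs."""
    counts = Counter()
    for title in titles:
        if not isinstance(title, str):
            continue  # skip missing or non-string titles
        # Keep hyphenated tokens such as "covid-19" together
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(w for w in words if w not in stopwords and len(w) > 1)
    return counts.most_common(n)


# Illustrative titles only
titles = ["COVID-19 and the lung", "Lung injury in COVID-19", None]
print(top_title_words(titles, n=2))
```

In the notebook this could be called as `top_title_words(df_clean['title'])`.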

In [None]:
# Distribution by source
source_counts = df_clean['source_x'].value_counts()

plt.figure(figsize=(10, 6))
source_counts.plot(kind='pie', autopct='%1.1f%%')
plt.title('Distribution of Papers by Source')
plt.ylabel('')
plt.show()
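
If `source_x` has many distinct values, the pie chart fills up with thin, unreadable slices. One option is to keep the largest categories and fold the rest into a single slice. The helper name `collapse_small_categories` and the `"Other"` label are assumptions for this sketch.

```python
import pandas as pd


def collapse_small_categories(counts: pd.Series, top_n: int = 5,
                              other_label: str = "Other") -> pd.Series:
    """Keep the top_n largest counts and sum the remainder into one entry."""
    counts = counts.sort_values(ascending=False)
    head = counts.iloc[:top_n]
    tail_total = counts.iloc[top_n:].sum()
    if tail_total > 0:
        head = pd.concat([head, pd.Series({other_label: tail_total})])
    return head


# Illustrative counts only, not real dataset figures
example = pd.Series({"PMC": 50, "Medline": 30, "WHO": 15, "arXiv": 3, "bioRxiv": 2})
print(collapse_small_categories(example, top_n=3))
```

The result can be plotted the same way as `source_counts`, for example with `.plot(kind='pie', autopct='%1.1f%%')`.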

## Part 4: Streamlit Application

The Streamlit app code is in a separate file: `streamlit_app.py`

Run it with: `streamlit run streamlit_app.py`