# CORD-19 Research Dataset Analysis

## Assignment Overview
This notebook provides a comprehensive analysis of the CORD-19 research dataset, focusing on COVID-19 research papers metadata. We'll explore the data, clean it, create visualizations, and prepare components for a Streamlit application.

### Learning Objectives:
- Practice loading and exploring real-world datasets
- Learn basic data cleaning techniques
- Create meaningful visualizations
- Build components for interactive web applications
- Present data insights effectively

## Part 1: Import Required Libraries
Import all necessary libraries for data analysis, visualization, and text processing.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Text processing and word clouds
from wordcloud import WordCloud
from collections import Counter
import re

# Date and time handling
from datetime import datetime, date

# Warning control
import warnings

# For downloading data
import requests
import os

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
warnings.filterwarnings('ignore')

# Set style for matplotlib
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

## Part 2: Download and Load the CORD-19 Dataset

**Note**: The full CORD-19 dataset is very large, so this notebook generates a synthetic sample dataset that mimics the metadata structure. To analyze real papers, download the `metadata.csv` file from Kaggle.

### Instructions for downloading the real dataset:
1. Go to https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
2. Download only the `metadata.csv` file
3. Place it in the `data/` directory one level above this notebook (the loading code reads `../data/metadata.csv`)
4. Uncomment the real data loading code below
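
Once `metadata.csv` is in place, the loading step might look like the sketch below. The helper name `load_cord19_metadata` and the selected columns are illustrative assumptions, not part of the assignment. The demo writes a two-row stand-in file to a temporary directory so the snippet runs without the real download.

```python
import os
import tempfile

import pandas as pd


def load_cord19_metadata(path, columns=None):
    """Hypothetical helper: read the metadata CSV, or return None if it is absent."""
    if not os.path.exists(path):
        return None
    # low_memory=False makes pandas infer each column's dtype from the whole
    # file instead of chunk by chunk, which avoids mixed-type columns.
    return pd.read_csv(path, usecols=columns, low_memory=False)


# Demo with a tiny stand-in file (NOT real CORD-19 data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "metadata.csv")
    pd.DataFrame({
        "cord_uid": ["demo-1", "demo-2"],
        "title": ["Example title A", "Example title B"],
        "journal": ["Example Journal", None],
    }).to_csv(demo_path, index=False)

    demo_df = load_cord19_metadata(demo_path, columns=["cord_uid", "title"])
    missing_df = load_cord19_metadata(os.path.join(tmp_dir, "absent.csv"))

print(demo_df.shape)
```

Returning `None` for a missing file lets the notebook fall back to the synthetic sample instead of stopping with an error.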

In [None]:
# For demonstration, we'll create a sample dataset
# In practice, you would load the real CORD-19 metadata

def create_sample_cord19_data(n_samples=1000):
    """
    Create a sample dataset that mimics the CORD-19 metadata structure.
    This is for demonstration purposes only.
    """
    np.random.seed(42)
    
    # Sample journals
    journals = [
        'Nature', 'Science', 'The Lancet', 'New England Journal of Medicine',
        'Cell', 'PLOS ONE', 'Nature Medicine', 'Journal of Virology',
        'Virology', 'Nature Communications', 'BMJ', 'JAMA',
        'Proceedings of the National Academy of Sciences', 'Nature Microbiology',
        'Clinical Infectious Diseases', 'Emerging Infectious Diseases'
    ]
    
    # Sample COVID-related terms for titles
    covid_terms = [
        'COVID-19', 'SARS-CoV-2', 'coronavirus', 'pandemic', 'vaccine',
        'antiviral', 'treatment', 'symptoms', 'transmission', 'respiratory',
        'infection', 'immunity', 'antibody', 'outbreak', 'diagnosis',
        'therapeutic', 'prevention', 'epidemiology', 'public health', 'clinical'
    ]
    
    # Generate sample data
    data = {
        'cord_uid': [f'cord-{i:06d}' for i in range(n_samples)],
        'title': [f'{np.random.choice(covid_terms)} {np.random.choice(["study", "analysis", "research", "investigation", "treatment", "vaccine", "therapy"])} in {np.random.choice(["patients", "population", "healthcare workers", "elderly", "children"])}: {np.random.choice(["a systematic review", "clinical trial", "observational study", "meta-analysis", "case series"])}' for _ in range(n_samples)],
        'abstract': [f'This study investigates {np.random.choice(covid_terms).lower()} in the context of pandemic response. Methods included analysis of {np.random.randint(50, 5000)} participants over {np.random.randint(1, 24)} months.' for _ in range(n_samples)],
        'authors': [f'Author{i % 100}, J.; Smith, A.; Johnson, B.' for i in range(n_samples)],
        'journal': np.random.choice(journals, n_samples),
        'publish_time': pd.date_range(start='2019-12-01', end='2023-12-31', periods=n_samples),
        'source_x': np.random.choice(['PMC', 'Elsevier', 'arXiv', 'bioRxiv', 'medRxiv'], n_samples),
        'pmcid': [f'PMC{np.random.randint(1000000, 9999999)}' if np.random.random() > 0.3 else None for _ in range(n_samples)],
        'pubmed_id': [np.random.randint(10000000, 99999999) if np.random.random() > 0.2 else None for _ in range(n_samples)],
        'license': np.random.choice(['cc-by', 'cc-by-nc', 'cc-by-sa', 'els-covid', 'arxiv'], n_samples),
        'has_full_text': np.random.choice([True, False], n_samples, p=[0.7, 0.3])
    }
    
    return pd.DataFrame(data)

# Create sample data
print("Creating sample CORD-19 dataset...")
df = create_sample_cord19_data(2000)

# Uncomment below to load real CORD-19 data
# df = pd.read_csv('../data/metadata.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

## Part 3: Basic Data Exploration
Let's examine the structure and basic properties of our dataset.

In [None]:
# Display basic information about the dataset
print("=== DATASET OVERVIEW ===")
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")
print("\n=== COLUMN NAMES ===")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
# Display first few rows
print("=== FIRST 5 ROWS ===")
display(df.head())

In [None]:
# Data types and memory usage
print("=== DATA TYPES AND MEMORY USAGE ===")
df.info(memory_usage='deep')

In [None]:
# Summary statistics for all columns (include='all' also covers non-numeric ones)
print("=== BASIC STATISTICS ===")
display(df.describe(include='all'))

## Part 4: Data Cleaning and Missing Value Handling
Identify and handle missing values in the dataset.

In [None]:
# Check for missing values
print("=== MISSING VALUES ANALYSIS ===")
missing_data = df.isnull().sum().sort_values(ascending=False)
missing_percentage = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percentage
})

print(missing_df[missing_df['Missing Count'] > 0])

In [None]:
# Visualize missing data
plt.figure(figsize=(12, 6))
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

if len(missing_counts) > 0:
    plt.bar(range(len(missing_counts)), missing_counts.values)
    plt.xticks(range(len(missing_counts)), missing_counts.index, rotation=45)
    plt.title('Missing Values by Column')
    plt.ylabel('Count of Missing Values')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found in the dataset!")

In [None]:
# Create a cleaned version of the dataset
print("=== DATA CLEANING ===")
df_clean = df.copy()

# Remove rows where title is missing (critical field)
initial_rows = len(df_clean)
df_clean = df_clean.dropna(subset=['title'])
print(f"Removed {initial_rows - len(df_clean)} rows with missing titles")

# Fill missing abstracts with placeholder
df_clean['abstract'] = df_clean['abstract'].fillna('Abstract not available')

# Fill missing journal names
df_clean['journal'] = df_clean['journal'].fillna('Unknown Journal')

# For other missing values, we'll keep them as NaN for now
print(f"Cleaned dataset shape: {df_clean.shape}")
print(f"Rows retained: {len(df_clean)/initial_rows*100:.1f}%")
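
To see what the cleaning steps do, it can help to run them on a frame small enough to check by eye. The sketch below uses a made-up four-row frame (the values are illustrative, not taken from the dataset) and applies the same three operations: drop rows without a title, then fill missing abstracts and journals with placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Paper A", None, "Paper C", "Paper D"],
    "abstract": ["Some text", "More text", None, None],
    "journal": ["BMJ", None, "JAMA", "BMJ"],
})

# isna().mean() gives the fraction of missing values per column.
summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
})

# Same steps as the cleaning cell, applied to the toy frame.
cleaned = toy.dropna(subset=["title"]).copy()
cleaned["abstract"] = cleaned["abstract"].fillna("Abstract not available")
cleaned["journal"] = cleaned["journal"].fillna("Unknown Journal")

print(summary)
print(cleaned)
```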

## Part 5: Data Type Conversion and Feature Engineering
Convert data types and create new features for analysis.

In [None]:
# Convert publish_time to datetime if it's not already
print("=== DATA TYPE CONVERSION ===")
if not pd.api.types.is_datetime64_any_dtype(df_clean['publish_time']):
    # errors='coerce' turns unparseable dates into NaT instead of raising
    df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')
    print("Converted publish_time to datetime")
else:
    print("publish_time is already datetime format")

# Extract year from publication date
df_clean['publication_year'] = df_clean['publish_time'].dt.year
print("Created publication_year feature")

# Extract month for seasonal analysis
df_clean['publication_month'] = df_clean['publish_time'].dt.month
df_clean['publication_month_name'] = df_clean['publish_time'].dt.month_name()
print("Created publication_month features")

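# Create abstract word count feature (placeholder abstracts count as 0 words)
has_abstract = df_clean['abstract'] != 'Abstract not available'
df_clean['abstract_word_count'] = df_clean['abstract'].str.split().str.len().where(has_abstract, 0)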
print("Created abstract_word_count feature")

# Create title word count feature
df_clean['title_word_count'] = df_clean['title'].str.split().str.len()
print("Created title_word_count feature")

# Create title length feature
df_clean['title_length'] = df_clean['title'].str.len()
print("Created title_length feature")

print(f"\nDataset now has {df_clean.shape[1]} columns (added {df_clean.shape[1] - df.shape[1]} new features)")
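
The synthetic sample already stores `publish_time` as datetimes, so the conversion branch only matters for data read from a CSV, where dates arrive as strings. The sketch below shows, on made-up values, how `errors='coerce'` behaves and how the extracted year can stay an integer even when some dates are missing.

```python
import pandas as pd

raw = pd.Series(["2020-03-15", "2021-07-01", "not a date", None])

# errors='coerce' replaces unparseable entries with NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# .dt.year returns floats when NaT is present; the nullable Int64 dtype
# keeps whole-number years while still allowing missing values.
years = parsed.dt.year.astype("Int64")

print(parsed.isna().sum())
print(years.tolist())
```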

In [None]:
# Display the new features
print("=== NEW FEATURES SUMMARY ===")
new_features = ['publication_year', 'publication_month', 'publication_month_name', 
                'abstract_word_count', 'title_word_count', 'title_length']

for feature in new_features:
    if feature in df_clean.columns:
        print(f"\n{feature}:")
        if pd.api.types.is_numeric_dtype(df_clean[feature]):
            print(f"  Range: {df_clean[feature].min()} - {df_clean[feature].max()}")
            print(f"  Mean: {df_clean[feature].mean():.1f}")
        else:
            print(f"  Unique values: {df_clean[feature].nunique()}")
            print(f"  Sample values: {list(df_clean[feature].value_counts().head(3).index)}")

## Part 6: Publication Year Analysis
Analyze the distribution of papers by publication year to understand research trends.

In [None]:
# Analyze publications by year
print("=== PUBLICATION YEAR ANALYSIS ===")
year_counts = df_clean['publication_year'].value_counts().sort_index()
print("Papers by year:")
for year, count in year_counts.items():
    if pd.notna(year):
        print(f"  {int(year)}: {count:,} papers")

print(f"\nTotal papers with valid publication year: {year_counts.sum():,}")
print(f"Peak year: {year_counts.idxmax()} with {year_counts.max():,} papers")

In [None]:
# Create publication year distribution visualization
plt.figure(figsize=(12, 6))
year_counts_clean = year_counts.dropna()
plt.bar(year_counts_clean.index, year_counts_clean.values, color='skyblue', edgecolor='navy', alpha=0.7)
plt.title('Distribution of COVID-19 Research Papers by Publication Year', fontsize=16, fontweight='bold')
plt.xlabel('Publication Year', fontsize=12)
plt.ylabel('Number of Papers', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for year, count in year_counts_clean.items():
    plt.text(year, count + max(year_counts_clean) * 0.01, str(count), 
             ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## Part 7: Journal and Source Analysis
Identify the top journals and sources publishing COVID-19 research.

In [None]:
# Analyze top journals
print("=== TOP JOURNALS ANALYSIS ===")
top_journals = df_clean['journal'].value_counts().head(15)
print("Top 15 journals by publication count:")
for i, (journal, count) in enumerate(top_journals.items(), 1):
    print(f"{i:2d}. {journal}: {count:,} papers")

print(f"\nTotal unique journals: {df_clean['journal'].nunique():,}")

In [None]:
# Analyze sources
print("\n=== SOURCE ANALYSIS ===")
source_counts = df_clean['source_x'].value_counts()
print("Papers by source:")
for source, count in source_counts.items():
    percentage = (count / len(df_clean)) * 100
    print(f"  {source}: {count:,} papers ({percentage:.1f}%)")
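
The percentage loop above can also be written with `value_counts(normalize=True)`. This is a minimal sketch on a made-up list of sources, offered as an alternative rather than a replacement for the cell above.

```python
import pandas as pd

sources = pd.Series(["PMC", "PMC", "arXiv", "PMC", "medRxiv", "arXiv", "PMC", "PMC"])

counts = sources.value_counts()
# normalize=True returns proportions that sum to 1; scale to percentages.
shares = sources.value_counts(normalize=True).mul(100).round(1)

# Both Series are indexed by source name, so they align automatically.
table = pd.DataFrame({"papers": counts, "percent": shares})
print(table)
```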

In [None]:
# Visualize top journals
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Top journals bar chart
top_10_journals = df_clean['journal'].value_counts().head(10)
ax1.barh(range(len(top_10_journals)), top_10_journals.values, color='lightcoral')
ax1.set_yticks(range(len(top_10_journals)))
ax1.set_yticklabels([j[:50] + '...' if len(j) > 50 else j for j in top_10_journals.index])
ax1.set_xlabel('Number of Papers')
ax1.set_title('Top 10 Journals Publishing COVID-19 Research', fontweight='bold')
ax1.grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(top_10_journals.values):
    ax1.text(v + max(top_10_journals.values) * 0.01, i, str(v), 
             va='center', fontweight='bold')

# Source distribution pie chart
ax2.pie(source_counts.values, labels=source_counts.index, autopct='%1.1f%%', 
        colors=plt.cm.Set3.colors, startangle=90)
ax2.set_title('Distribution of Papers by Source', fontweight='bold')

plt.tight_layout()
plt.show()

## Part 8: Title Text Analysis and Word Frequency
Analyze the most common words and terms in paper titles.

In [None]:
# Extract and analyze words from titles
print("=== TITLE TEXT ANALYSIS ===")

def extract_words(text_series, min_length=3):
    """
    Extract words from text, removing common stop words and short words
    """
    # Common stop words to remove
    stop_words = {
        'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
        'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
        'above', 'below', 'between', 'among', 'throughout', 'despite', 'towards',
        'upon', 'concerning', 'a', 'an', 'as', 'are', 'was', 'were', 'been', 'be',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should',
        'may', 'might', 'must', 'can', 'shall', 'is', 'this', 'that', 'these', 'those'
    }
    
    all_words = []
    for text in text_series.dropna():
        # Convert to lowercase and extract words
        words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
        # Filter words
        filtered_words = [w for w in words if len(w) >= min_length and w not in stop_words]
        all_words.extend(filtered_words)
    
    return all_words

# Extract words from titles
title_words = extract_words(df_clean['title'])
word_freq = Counter(title_words)

print(f"Total words extracted: {len(title_words):,}")
print(f"Unique words: {len(word_freq):,}")
print("\nTop 20 most frequent words in titles:")
for i, (word, count) in enumerate(word_freq.most_common(20), 1):
    print(f"{i:2d}. {word}: {count:,} occurrences")
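
One consequence of the letters-only pattern in `extract_words` is that hyphenated terms are split: "COVID-19" is counted as "covid" and "SARS-CoV-2" as "sars" and "cov". The sketch below compares that pattern with an alternative that keeps hyphenated tokens whole. The second pattern is an assumption for illustration, not something the assignment requires.

```python
import re
from collections import Counter

titles = [
    "COVID-19 vaccine study in patients",
    "SARS-CoV-2 transmission in healthcare workers",
    "COVID-19 treatment analysis",
]

# Pattern used in extract_words: runs of letters only.
letters_only = re.compile(r"\b[a-zA-Z]+\b")
# Assumed alternative: a letter, then letters/digits, with optional -parts.
keep_hyphens = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")

split_counts = Counter()
whole_counts = Counter()
for title in titles:
    lowered = title.lower()
    split_counts.update(letters_only.findall(lowered))
    whole_counts.update(keep_hyphens.findall(lowered))

print(split_counts.most_common(3))
print(whole_counts.most_common(3))
```

Either choice is defensible. The split form groups "COVID-19" with other spellings of "covid", while the hyphen-preserving form keeps the names as they appear in titles.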

In [None]:
# Visualize word frequency
top_words = word_freq.most_common(15)
words, counts = zip(*top_words)

plt.figure(figsize=(12, 8))
bars = plt.barh(range(len(words)), counts, color='mediumseagreen')
plt.yticks(range(len(words)), words)
plt.xlabel('Frequency')
plt.title('Most Frequent Words in COVID-19 Research Paper Titles', fontsize=16, fontweight='bold')
plt.gca().invert_yaxis()  # Show highest frequency at top
plt.grid(axis='x', alpha=0.3)

# Add value labels
for i, (bar, count) in enumerate(zip(bars, counts)):
    plt.text(count + max(counts) * 0.01, i, str(count), 
             va='center', fontweight='bold')

plt.tight_layout()
plt.show()

## Part 9: Time Series Visualization of Publications
Create detailed time series plots showing publication trends over time.

In [None]:
# Create monthly publication timeline
df_clean['year_month'] = df_clean['publish_time'].dt.to_period('M')
monthly_counts = df_clean['year_month'].value_counts().sort_index()

# Convert Period index to datetime for plotting
monthly_counts.index = monthly_counts.index.to_timestamp()

plt.figure(figsize=(15, 8))
plt.plot(monthly_counts.index, monthly_counts.values, linewidth=2, color='darkblue', marker='o', markersize=4)
plt.title('COVID-19 Research Publications Over Time (Monthly)', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of Papers Published', fontsize=12)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)

# Highlight peak months
peak_month = monthly_counts.idxmax()
peak_count = monthly_counts.max()
plt.annotate(f'Peak: {peak_count} papers\n{peak_month.strftime("%B %Y")}', 
             xy=(peak_month, peak_count), xytext=(10, 10),
             textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.7),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

plt.tight_layout()
plt.show()

print(f"Peak publication month: {peak_month.strftime('%B %Y')} with {peak_count} papers")
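
`value_counts()` only returns months that contain at least one paper, so on sparse data the line plot would connect across empty months as if nothing were missing. A possible remedy, sketched here on three made-up dates, is to reindex the counts over every month in the observed range.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-02"]))

monthly = dates.dt.to_period("M").value_counts().sort_index()

# Build every month between the first and last observed month,
# then fill months that had no papers with 0.
all_months = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly_filled = monthly.reindex(all_months, fill_value=0)

print(monthly_filled)
```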

In [None]:
# Create cumulative publications plot
cumulative_counts = monthly_counts.cumsum()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))

# Monthly publications
ax1.plot(monthly_counts.index, monthly_counts.values, linewidth=2, color='darkred', marker='o', markersize=3)
ax1.set_title('Monthly COVID-19 Research Publications', fontsize=14, fontweight='bold')
ax1.set_ylabel('Papers per Month')
ax1.grid(True, alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# Cumulative publications
ax2.plot(cumulative_counts.index, cumulative_counts.values, linewidth=3, color='darkgreen')
ax2.fill_between(cumulative_counts.index, cumulative_counts.values, alpha=0.3, color='lightgreen')
ax2.set_title('Cumulative COVID-19 Research Publications', fontsize=14, fontweight='bold')
ax2.set_xlabel('Date')
ax2.set_ylabel('Total Papers Published')
ax2.grid(True, alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"Total papers in dataset: {cumulative_counts.iloc[-1]:,}")

## Part 10: Journal Publication Distribution Charts
Create detailed visualizations of journal publication patterns.

In [None]:
# Analyze journal publication patterns over time
top_5_journals = df_clean['journal'].value_counts().head(5).index
journal_year_data = df_clean[df_clean['journal'].isin(top_5_journals)].groupby(['publication_year', 'journal']).size().unstack(fill_value=0)

# Create stacked area chart
# Pass an explicit Axes so pandas draws into this figure instead of creating a second, empty one
fig, ax = plt.subplots(figsize=(14, 8))
journal_year_data.plot(kind='area', stacked=True, alpha=0.7, ax=ax)
plt.title('Publication Trends by Top 5 Journals Over Time', fontsize=16, fontweight='bold')
plt.xlabel('Publication Year', fontsize=12)
plt.ylabel('Number of Papers', fontsize=12)
plt.legend(title='Journal', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
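
The `groupby(...).size().unstack(fill_value=0)` chain that builds `journal_year_data` is easier to follow on a small example. The frame below is made up for illustration.

```python
import pandas as pd

papers = pd.DataFrame({
    "publication_year": [2020, 2020, 2020, 2021, 2021],
    "journal": ["BMJ", "BMJ", "JAMA", "JAMA", "JAMA"],
})

# size() counts rows per (year, journal) pair; unstack() pivots the
# journal level into columns, and fill_value=0 covers missing pairs.
pivot = (
    papers.groupby(["publication_year", "journal"])
    .size()
    .unstack(fill_value=0)
)

print(pivot)
```

Each row of the result is a year and each column a journal, which is the shape the stacked area chart expects.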

In [None]:
# Compare journals by paper count, average title length, and average abstract length
journal_stats = df_clean.groupby('journal').agg({
    'cord_uid': 'count',
    'title_length': 'mean',
    'abstract_word_count': 'mean'
}).rename(columns={'cord_uid': 'paper_count'})

# Keep journals with at least 10 papers, then take the 20 largest by paper count
journal_stats_filtered = journal_stats[journal_stats['paper_count'] >= 10].nlargest(20, 'paper_count')

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 12))

# 1. Paper count vs average title length
scatter = ax1.scatter(journal_stats_filtered['paper_count'], 
                     journal_stats_filtered['title_length'],
                     alpha=0.6, s=100, c='purple')
ax1.set_xlabel('Number of Papers')
ax1.set_ylabel('Average Title Length (characters)')
ax1.set_title('Journal Paper Count vs Average Title Length')
ax1.grid(True, alpha=0.3)

# 2. Top journals by paper count
top_journals_count = journal_stats_filtered.nlargest(10, 'paper_count')
ax2.barh(range(len(top_journals_count)), top_journals_count['paper_count'], color='orange')
ax2.set_yticks(range(len(top_journals_count)))
ax2.set_yticklabels([j[:30] + '...' if len(j) > 30 else j for j in top_journals_count.index])
ax2.set_xlabel('Number of Papers')
ax2.set_title('Top 10 Journals by Paper Count')
ax2.grid(axis='x', alpha=0.3)

# 3. Distribution of papers per journal
all_journal_counts = df_clean['journal'].value_counts()
ax3.hist(all_journal_counts.values, bins=30, alpha=0.7, color='green', edgecolor='black')
ax3.set_xlabel('Papers per Journal')
ax3.set_ylabel('Number of Journals')
ax3.set_title('Distribution of Papers per Journal')
ax3.set_yscale('log')
ax3.grid(True, alpha=0.3)

# 4. Journal diversity over time
yearly_journal_diversity = df_clean.groupby('publication_year')['journal'].nunique()
ax4.plot(yearly_journal_diversity.index, yearly_journal_diversity.values, 
         marker='o', linewidth=2, color='red')
ax4.set_xlabel('Publication Year')
ax4.set_ylabel('Number of Unique Journals')
ax4.set_title('Journal Diversity by Year')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Total unique journals: {df_clean['journal'].nunique():,}")
print(f"Journals with only one paper: {sum(all_journal_counts == 1):,} ({sum(all_journal_counts == 1)/len(all_journal_counts)*100:.1f}%)")

## Part 11: Word Cloud Generation from Titles
Create visual word clouds to represent the most common terms in research titles.

In [None]:
# Create word cloud from titles
print("=== GENERATING WORD CLOUD ===")

# Combine all titles into one text
all_titles_text = ' '.join(df_clean['title'].dropna().astype(str))

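# Combine wordcloud's built-in stop words with domain-specific terms.
# Passing a custom set to WordCloud replaces its defaults, so merge them explicitly.
from wordcloud import STOPWORDS
additional_stopwords = STOPWORDS | {
    'study', 'analysis', 'research', 'using', 'based', 'case', 'patients', 'patient',
    'clinical', 'systematic', 'review', 'meta', 'observational', 'trial', 'results',
    'among', 'associated', 'association', 'factors', 'risk', 'retrospective'
}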

# Generate word cloud
wordcloud = WordCloud(
    width=1200, 
    height=600, 
    background_color='white',
    max_words=100,
    stopwords=additional_stopwords,
    colormap='viridis',
    relative_scaling=0.5,
    min_font_size=10
).generate(all_titles_text)

# Display word cloud
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of COVID-19 Research Paper Titles', fontsize=20, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print(f"Word cloud generated from {len(df_clean['title'].dropna()):,} titles")

In [None]:
# Create separate word clouds for different years
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
axes = axes.flatten()

# Get top 4 years by paper count
top_years = df_clean['publication_year'].value_counts().head(4).index

for i, year in enumerate(top_years):
    if pd.isna(year):
        continue
        
    year_titles = df_clean[df_clean['publication_year'] == year]['title'].dropna()
    year_text = ' '.join(year_titles.astype(str))
    
    if len(year_text.strip()) > 0:
        year_wordcloud = WordCloud(
            width=600, 
            height=400, 
            background_color='white',
            max_words=50,
            stopwords=additional_stopwords,
            colormap='Set2',
            relative_scaling=0.5
        ).generate(year_text)
        
        axes[i].imshow(year_wordcloud, interpolation='bilinear')
        axes[i].axis('off')
        axes[i].set_title(f'{int(year)} ({len(year_titles)} papers)', fontsize=14, fontweight='bold')

plt.suptitle('Word Clouds by Publication Year', fontsize=18, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 12: Source Distribution Analysis
Analyze the distribution of papers across different sources and repositories.

In [None]:
# Comprehensive source analysis
print("=== COMPREHENSIVE SOURCE ANALYSIS ===")

# Source distribution
source_stats = df_clean.groupby('source_x').agg({
    'cord_uid': 'count',
    'has_full_text': 'sum',  # summing booleans counts the True values
    'title_length': 'mean',
    'abstract_word_count': 'mean'
}).rename(columns={
    'cord_uid': 'paper_count',
    'has_full_text': 'full_text_count'
})

source_stats['full_text_percentage'] = (source_stats['full_text_count'] / source_stats['paper_count']) * 100

print("Source statistics:")
for source in source_stats.index:
    stats = source_stats.loc[source]
    print(f"\n{source}:")
    # Row values are floats after .loc, so cast the counts back to int for display
    print(f"  Papers: {int(stats['paper_count']):,}")
    print(f"  Full text available: {int(stats['full_text_count']):,} ({stats['full_text_percentage']:.1f}%)")
    print(f"  Avg title length: {stats['title_length']:.1f} characters")
    print(f"  Avg abstract words: {stats['abstract_word_count']:.1f}")
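
The per-source statistics can also be expressed with pandas named aggregation, which sets the output column names directly and removes the need for a separate `rename`. This is a sketch on a made-up five-row frame and covers only the two count columns.

```python
import pandas as pd

papers = pd.DataFrame({
    "cord_uid": ["a", "b", "c", "d", "e"],
    "source_x": ["PMC", "PMC", "PMC", "arXiv", "arXiv"],
    "has_full_text": [True, True, False, True, False],
})

# Named aggregation: output_column=(input_column, function).
# Summing a boolean column counts its True values.
stats = papers.groupby("source_x").agg(
    paper_count=("cord_uid", "count"),
    full_text_count=("has_full_text", "sum"),
)
stats["full_text_percentage"] = stats["full_text_count"] / stats["paper_count"] * 100

print(stats)
```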

In [None]:
# Create comprehensive source visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Source distribution pie chart
colors = plt.cm.Set3.colors
wedges, texts, autotexts = ax1.pie(source_stats['paper_count'], 
                                  labels=source_stats.index, 
                                  autopct='%1.1f%%',
                                  colors=colors,
                                  startangle=90)
ax1.set_title('Distribution of Papers by Source', fontweight='bold')

# 2. Full text availability by source
ax2.bar(source_stats.index, source_stats['full_text_percentage'], color='lightblue', edgecolor='navy')
ax2.set_title('Full Text Availability by Source', fontweight='bold')
ax2.set_ylabel('Percentage with Full Text')
ax2.set_ylim(0, 100)
ax2.tick_params(axis='x', rotation=45)
for i, v in enumerate(source_stats['full_text_percentage']):
    ax2.text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

# 3. Source publication trends over time
source_year = df_clean.groupby(['publication_year', 'source_x']).size().unstack(fill_value=0)
source_year.plot(kind='line', ax=ax3, marker='o', linewidth=2)
ax3.set_title('Publication Trends by Source Over Time', fontweight='bold')
ax3.set_xlabel('Publication Year')
ax3.set_ylabel('Number of Papers')
ax3.legend(title='Source', bbox_to_anchor=(1.05, 1), loc='upper left')
ax3.grid(True, alpha=0.3)

# 4. Average metrics by source
x_pos = np.arange(len(source_stats.index))
width = 0.35

# Normalize metrics for comparison
norm_title_length = source_stats['title_length'] / source_stats['title_length'].max() * 100
norm_abstract_words = source_stats['abstract_word_count'] / source_stats['abstract_word_count'].max() * 100

ax4.bar(x_pos - width/2, norm_title_length, width, label='Title Length (normalized)', alpha=0.7)
ax4.bar(x_pos + width/2, norm_abstract_words, width, label='Abstract Words (normalized)', alpha=0.7)
ax4.set_title('Average Metrics by Source (Normalized)', fontweight='bold')
ax4.set_ylabel('Normalized Score (0-100)')
ax4.set_xticks(x_pos)
ax4.set_xticklabels(source_stats.index, rotation=45)
ax4.legend()
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Part 13: Create Streamlit Application Structure
Build the foundation for our interactive Streamlit web application.

In [None]:
# Save processed data for Streamlit app
print("=== PREPARING DATA FOR STREAMLIT APP ===")

# Create a sample of the data for the app (to keep it responsive)
sample_size = min(1000, len(df_clean))
df_sample = df_clean.sample(n=sample_size, random_state=42)

# Save the sample data (create the output directory if it does not exist)
os.makedirs('../data', exist_ok=True)
df_sample.to_csv('../data/cord19_sample.csv', index=False)
print(f"Saved sample dataset with {len(df_sample):,} records to '../data/cord19_sample.csv'")

# Also save key statistics for the app
app_stats = {
    'total_papers': len(df_clean),
    'date_range': f"{df_clean['publish_time'].min().strftime('%Y-%m-%d')} to {df_clean['publish_time'].max().strftime('%Y-%m-%d')}",
    'unique_journals': df_clean['journal'].nunique(),
    'top_journal': df_clean['journal'].value_counts().index[0],
    'peak_year': int(df_clean['publication_year'].value_counts().index[0]),  # built-in int so JSON stores a number
    'sources': list(df_clean['source_x'].unique())
}

import json
with open('../data/app_stats.json', 'w') as f:
    json.dump(app_stats, f, indent=2, default=str)

print("Saved app statistics to '../data/app_stats.json'")
print("\nKey statistics for the app:")
for key, value in app_stats.items():
    print(f"  {key}: {value}")
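
A detail worth knowing when saving statistics: values that pandas returns as NumPy integers are not JSON-serializable on their own, so `default=str` writes them as strings. The sketch below, using made-up values, shows the difference between relying on `default=str` and converting to built-in types first.

```python
import json

import numpy as np

stats = {"peak_year": np.int32(2021), "total_papers": 2000}

# default=str is called for objects json cannot encode, so the NumPy
# integer is written as the string "2021".
as_string = json.loads(json.dumps(stats, default=str))

# Converting to built-in types first keeps the value numeric.
native = {key: int(value) for key, value in stats.items()}
as_number = json.loads(json.dumps(native))

print(type(as_string["peak_year"]).__name__, type(as_number["peak_year"]).__name__)
```

If the app later compares or sorts on the year, the numeric form avoids an extra conversion when the file is read back.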

## Part 14: Summary and Key Findings
Summarize our analysis and key insights from the CORD-19 dataset.

In [None]:
# Generate comprehensive summary
print("=== CORD-19 DATASET ANALYSIS SUMMARY ===")
print("\n" + "="*60)
print("KEY FINDINGS AND INSIGHTS")
print("="*60)

# Dataset overview
print(f"\n📊 DATASET OVERVIEW:")
print(f"   • Total papers analyzed: {len(df_clean):,}")
print(f"   • Date range: {df_clean['publish_time'].min().strftime('%B %Y')} - {df_clean['publish_time'].max().strftime('%B %Y')}")
print(f"   • Unique journals: {df_clean['journal'].nunique():,}")
print(f"   • Data sources: {', '.join(df_clean['source_x'].unique())}")

# Publication trends
print(f"\n📈 PUBLICATION TRENDS:")
peak_year = df_clean['publication_year'].value_counts().index[0]
peak_count = df_clean['publication_year'].value_counts().iloc[0]
print(f"   • Peak publication year: {int(peak_year)} ({peak_count:,} papers)")

monthly_peak = df_clean['year_month'].value_counts().index[0]
monthly_peak_count = df_clean['year_month'].value_counts().iloc[0]
print(f"   • Peak publication month: {monthly_peak} ({monthly_peak_count:,} papers)")

# Journal insights
print(f"\n📚 JOURNAL INSIGHTS:")
top_journal = df_clean['journal'].value_counts().index[0]
top_journal_count = df_clean['journal'].value_counts().iloc[0]
print(f"   • Top publishing journal: {top_journal} ({top_journal_count:,} papers)")

single_paper_journals = sum(df_clean['journal'].value_counts() == 1)
print(f"   • Journals with only one paper: {single_paper_journals:,} ({single_paper_journals/df_clean['journal'].nunique()*100:.1f}% of all journals)")

# Content analysis
print(f"\n📝 CONTENT ANALYSIS:")
avg_title_length = df_clean['title_length'].mean()
avg_abstract_words = df_clean['abstract_word_count'].mean()
print(f"   • Average title length: {avg_title_length:.1f} characters")
print(f"   • Average abstract word count: {avg_abstract_words:.1f} words")

top_words = [word for word, _ in word_freq.most_common(5)]
print(f"   • Most frequent title terms: {', '.join(top_words)}")

# Source analysis
print(f"\n🗄️ SOURCE ANALYSIS:")
for source in df_clean['source_x'].value_counts().index:
    count = df_clean['source_x'].value_counts()[source]
    percentage = (count / len(df_clean)) * 100
    full_text_pct = (df_clean[df_clean['source_x'] == source]['has_full_text'].sum() / count) * 100
    print(f"   • {source}: {count:,} papers ({percentage:.1f}%), {full_text_pct:.1f}% with full text")

# Research focus areas (based on title analysis)
print(f"\n🔬 RESEARCH FOCUS AREAS (based on title analysis):")
covid_keywords = ['covid', 'coronavirus', 'sars', 'pandemic', 'vaccine', 'treatment', 'clinical', 'patients']
for keyword in covid_keywords:
    if keyword in word_freq:
        count = word_freq[keyword]
        percentage = (count / len(title_words)) * 100
        print(f"   • {keyword.capitalize()}: {count:,} mentions ({percentage:.1f}% of all title words)")

print("\n" + "="*60)
print("ANALYSIS COMPLETE - DATA READY FOR STREAMLIT APP")
print("="*60)

## Next Steps

This analysis provides a comprehensive foundation for understanding the CORD-19 research dataset. The key components have been prepared for integration into a Streamlit application:

### What we've accomplished:
1. ✅ **Data Loading & Exploration** - Loaded and examined the dataset structure
2. ✅ **Data Cleaning** - Handled missing values and prepared clean data
3. ✅ **Feature Engineering** - Created new features for analysis
4. ✅ **Publication Analysis** - Analyzed trends over time and by year
5. ✅ **Journal Analysis** - Identified top publishers and publication patterns
6. ✅ **Text Analysis** - Extracted word frequencies from titles and word counts from abstracts
7. ✅ **Visualizations** - Created comprehensive charts and graphs
8. ✅ **Word Clouds** - Generated visual representations of key terms
9. ✅ **Source Analysis** - Analyzed distribution across data sources
10. ✅ **Data Export** - Prepared datasets for Streamlit application

### Ready for Streamlit App Development:
- Sample dataset saved: `../data/cord19_sample.csv`
- Statistics saved: `../data/app_stats.json`
- All visualizations and analysis functions ready for integration

The next step is to create the Streamlit application that will make this analysis interactive and accessible to users.
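
As a starting point for that app, the filtering logic can be kept in a plain pandas function that is easy to test outside Streamlit. The function name `filter_papers` and its parameters are hypothetical, and the demo frame is made up. The sketch assumes the app will let users pick a year range and a set of journals.

```python
import pandas as pd


def filter_papers(df, year_range, journals=None):
    """Hypothetical helper: keep rows inside a year range and journal list."""
    start_year, end_year = year_range
    # between() is inclusive on both ends by default.
    mask = df["publication_year"].between(start_year, end_year)
    if journals:  # an empty selection means "all journals"
        mask &= df["journal"].isin(journals)
    return df[mask]


demo = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "journal": ["BMJ", "JAMA", "BMJ", "Cell"],
    "publication_year": [2020, 2021, 2022, 2021],
})

subset = filter_papers(demo, (2021, 2022), journals=["BMJ", "JAMA"])
print(subset)
```

In a Streamlit script, `year_range` could come from `st.slider` called with a tuple default and `journals` from `st.multiselect`, with the filtered frame passed on to the charts.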