# CORD-19 Dataset Analysis

This notebook analyzes the CORD-19 research dataset metadata. We explore the dataset structure, clean the data, and build visualizations of COVID-19 publication patterns.

## Dataset Overview
The CORD-19 dataset contains metadata for COVID-19 research papers including:
- Paper titles and abstracts
- Publication dates
- Authors and journals
- Source information
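
Before any analysis, it can help to confirm that a loaded frame actually exposes the fields listed above. The following is a minimal sketch, assuming the column names used by the metadata file in this notebook (`title`, `abstract`, `publish_time`, `authors`, `journal`, `source_x`). The helper `missing_columns` and the one-row sample frame are hypothetical and only for illustration.

```python
import pandas as pd

# Columns this notebook relies on (names assumed from the metadata file)
EXPECTED_COLUMNS = ["title", "abstract", "publish_time", "authors", "journal", "source_x"]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Tiny illustrative frame (hypothetical data), deliberately lacking `journal`
sample = pd.DataFrame({
    "title": ["Paper A"],
    "abstract": ["Some abstract"],
    "publish_time": ["2020-01-11"],
    "authors": ["Author One, Author Two"],
    "source_x": ["medRxiv"],
})

print(missing_columns(sample))
```

An empty list means every expected column is present, so the later cells can index them safely.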


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
        print("Using default matplotlib style")

# Set seaborn style and palette
sns.set_style("whitegrid")
sns.set_palette("husl")

# Set default figure size
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")


Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.2
Matplotlib version: 3.10.6
Seaborn version: 0.13.2


## 1. Data Loading and Initial Exploration


In [8]:
# Load the dataset
df = pd.read_csv('cord19_metadata.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()


Dataset Shape: (1000, 12)

Column Names:
['cord_uid', 'title', 'abstract', 'authors', 'journal', 'publish_time', 'source_x', 'has_full_text', 'pdf_json_files', 'pmc_json_files', 'url', 'doi']

First few rows:


| | cord_uid | title | abstract | authors | journal | publish_time | source_x | has_full_text | pdf_json_files | pmc_json_files | url | doi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cord_000000 | COVID-19 Research Paper 0: Preventive Clinical | This study investigates COVID-19 symptoms usin... | Author 0 Williams, Author 1 Garcia | PLOS ONE | 2021-12-11 | Research Square | False | | | https://example.com/paper/0 | 10.1000/000000 |
| 1 | cord_000001 | COVID-19 Research Paper 1: Virological Epidemi... | This study investigates COVID-19 prevention us... | Author 0 Davis, Author 1 Jones, Author 2 Williams | PNAS | 2022-09-18 | Elsevier | True | pdf_1.json | pmc_1.json | https://example.com/paper/1 | |
| 2 | cord_000002 | COVID-19 Research Paper 2: Preventive Preventive | This study investigates COVID-19 transmission ... | Author 0 Williams | JAMA | 2020-01-11 | Elsevier | False | pdf_2.json | pmc_2.json | https://example.com/paper/2 | |
| 3 | cord_000003 | COVID-19 Research Paper 3: Mathematical Modeli... | This study investigates COVID-19 epidemiology ... | Author 0 Jones, Author 1 Jones, Author 2 Garci... | BMJ | 2020-11-20 | Research Square | False | | pmc_3.json | https://example.com/paper/3 | 10.1000/000003 |
| 4 | cord_000004 | COVID-19 Research Paper 4: Therapeutic Clinical | This study investigates COVID-19 transmission ... | Author 0 Miller | Nature | 2022-12-04 | Research Square | True | pdf_4.json | | https://example.com/paper/4 | 10.1000/000004 |


In [9]:
# Dataset information and data types
print("Dataset Info:")
df.info()
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print(f"\nMissing Values Percentage:")
print((df.isnull().sum() / len(df)) * 100)


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   cord_uid        1000 non-null   object
 1   title           1000 non-null   object
 2   abstract        980 non-null    object
 3   authors         1000 non-null   object
 4   journal         985 non-null    object
 5   publish_time    1000 non-null   object
 6   source_x        1000 non-null   object
 7   has_full_text   1000 non-null   bool  
 8   pdf_json_files  494 non-null    object
 9   pmc_json_files  498 non-null    object
 10  url             1000 non-null   object
 11  doi             672 non-null    object
dtypes: bool(1), object(11)
memory usage: 87.0+ KB

Data Types:
cord_uid          object
title             object
abstract          object
authors           object
journal           object
publish_time      object
source_x          object
has_full_text       bool


In [None]:
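# Summary statistics. The frame has no numeric columns (only object and bool),
# so request a summary of all columns explicitly (count, unique, top, freq).
print("Basic Statistics:")
df.describe(include='all')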


## 2. Data Cleaning


In [11]:
# Create a copy for cleaning
df_clean = df.copy()

# Convert publish_time to datetime
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')

# Extract year and month for analysis
df_clean['publish_year'] = df_clean['publish_time'].dt.year
df_clean['publish_month'] = df_clean['publish_time'].dt.month

# Handle missing values in abstracts by replacing them with a placeholder string
df_clean['abstract'] = df_clean['abstract'].fillna('No abstract available')

# Handle missing journal values
df_clean['journal'] = df_clean['journal'].fillna('Unknown Journal')

# Handle missing DOI values
df_clean['doi'] = df_clean['doi'].fillna('No DOI available')

print("Data cleaning completed!")
print(f"Date range: {df_clean['publish_time'].min()} to {df_clean['publish_time'].max()}")
print(f"Years covered: {sorted(df_clean['publish_year'].dropna().astype(int).unique().tolist())}")


Data cleaning completed!
Date range: 2020-01-01 00:00:00 to 2022-12-31 00:00:00
Years covered: [2020, 2021, 2022]
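
The cleaning cell parses dates with `errors='coerce'`, which silently turns unparseable values into `NaT`. The sketch below, using made-up input values, shows one way to count how many entries were present but could not be parsed, so that coercion does not hide data problems.

```python
import pandas as pd

# Illustrative input: two valid dates, one malformed string, one missing value
raw = pd.Series(["2021-12-11", "2020-01-11", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Present in the input but lost during parsing
unparseable = raw.notna() & parsed.isna()
print(f"Unparseable dates: {unparseable.sum()}")

# With any NaT present, .dt.year becomes a float column containing NaN
print(parsed.dt.year)
```

In this notebook every `publish_time` value parsed, so the derived year and month columns stay integer-typed.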


## 3. Data Analysis


In [12]:
# Publication trends by year
publication_by_year = df_clean['publish_year'].value_counts().sort_index()
print("Publications by Year:")
print(publication_by_year)

# Top journals
top_journals = df_clean['journal'].value_counts().head(10)
print("\nTop 10 Journals:")
print(top_journals)

# Source distribution
source_distribution = df_clean['source_x'].value_counts()
print("\nSource Distribution:")
print(source_distribution)

# Full text availability
full_text_availability = df_clean['has_full_text'].value_counts()
print("\nFull Text Availability:")
print(full_text_availability)


Publications by Year:
publish_year
2020    341
2021    343
2022    316
Name: count, dtype: int64

Top 10 Journals:
journal
The Lancet                      96
Science                         95
PLOS ONE                        92
Clinical Infectious Diseases    79
Journal of Medical Virology     77
Nature                          74
Emerging Infectious Diseases    73
BMJ                             72
Nature Medicine                 70
PNAS                            68
Name: count, dtype: int64

Source Distribution:
source_x
Elsevier           177
bioRxiv            171
Research Square    168
medRxiv            166
WHO                165
PubMed Central     153
Name: count, dtype: int64

Full Text Availability:
has_full_text
False    529
True     471
Name: count, dtype: int64
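
Raw counts like those above are often easier to compare as shares. A minimal sketch, using a short made-up list of sources, pairs `value_counts()` with `value_counts(normalize=True)` to put counts and percentages side by side:

```python
import pandas as pd

# Illustrative data only
sources = pd.Series([
    "Elsevier", "bioRxiv", "Elsevier", "WHO",
    "Elsevier", "bioRxiv", "medRxiv", "WHO",
])

counts = sources.value_counts()
shares = sources.value_counts(normalize=True).mul(100).round(1)

# Both Series share the same index, so they align row by row
table = pd.DataFrame({"papers": counts, "share_pct": shares})
print(table)
```

The same pattern applies to `df_clean['source_x']` or `df_clean['journal']`.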


In [13]:
# Author analysis
# Count number of authors per paper
df_clean['author_count'] = df_clean['authors'].str.split(',').str.len()

print("Author Statistics:")
print(f"Average authors per paper: {df_clean['author_count'].mean():.2f}")
print(f"Median authors per paper: {df_clean['author_count'].median():.2f}")
print(f"Max authors per paper: {df_clean['author_count'].max()}")
print(f"Min authors per paper: {df_clean['author_count'].min()}")

# Abstract length analysis
# Note: the 20 rows filled with 'No abstract available' count as 21-character abstracts
df_clean['abstract_length'] = df_clean['abstract'].str.len()
print(f"\nAbstract Statistics:")
print(f"Average abstract length: {df_clean['abstract_length'].mean():.0f} characters")
print(f"Median abstract length: {df_clean['abstract_length'].median():.0f} characters")


Author Statistics:
Average authors per paper: 3.56
Median authors per paper: 4.00
Max authors per paper: 6
Min authors per paper: 1

Abstract Statistics:
Average abstract length: 135 characters
Median abstract length: 138 characters
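
Counting authors with `str.split(',').str.len()` treats an empty string as one author and yields `NaN` for missing values. That does not matter here because `authors` has no missing entries, but a more defensive count is sketched below. The helper `count_authors` is hypothetical, and it still assumes that individual names contain no commas.

```python
import pandas as pd

def count_authors(authors: pd.Series) -> pd.Series:
    """Count comma-separated names, ignoring blank pieces; missing values count as 0."""
    def _count(value):
        if pd.isna(value):
            return 0
        return sum(1 for name in str(value).split(",") if name.strip())
    return authors.map(_count)

# Illustrative data only, including empty, missing, and messy entries
sample = pd.Series([
    "Author 0 Williams, Author 1 Garcia",
    "Author 0 Miller",
    "",
    None,
    "A, , B,",
])

print(count_authors(sample).tolist())
```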


## 4. Data Visualizations


In [None]:
# Set up the plotting style
plt.rcParams['figure.figsize'] = (16, 12)
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Publications by Year
if not publication_by_year.empty:
    axes[0, 0].bar(publication_by_year.index, publication_by_year.values, 
                   color='skyblue', edgecolor='navy', alpha=0.7)
    axes[0, 0].set_title('Publications by Year', fontsize=14, fontweight='bold')
    axes[0, 0].set_xlabel('Year')
    axes[0, 0].set_ylabel('Number of Publications')
    axes[0, 0].grid(True, alpha=0.3)
    # Format x-axis to show integers
    axes[0, 0].set_xticks(publication_by_year.index)
    axes[0, 0].set_xticklabels(publication_by_year.index.astype(int))

# 2. Top 10 Journals
if not top_journals.empty:
    top_journals_plot = top_journals.head(10)
    y_pos = range(len(top_journals_plot))
    axes[0, 1].barh(y_pos, top_journals_plot.values, 
                    color='lightcoral', edgecolor='darkred', alpha=0.7)
    axes[0, 1].set_yticks(y_pos)
    axes[0, 1].set_yticklabels(top_journals_plot.index)
    axes[0, 1].set_title('Top 10 Journals', fontsize=14, fontweight='bold')
    axes[0, 1].set_xlabel('Number of Publications')
    axes[0, 1].grid(True, alpha=0.3)

# 3. Source Distribution
if not source_distribution.empty:
    # Handle case where there might be too many sources
    if len(source_distribution) > 10:
        # Show the top 9 sources and group the rest as "Others" (10 slices in total)
        top_sources = source_distribution.head(9)
        others_count = source_distribution.iloc[9:].sum()
        plot_sources = pd.concat([top_sources, pd.Series([others_count], index=['Others'])])
        plot_labels = plot_sources.index
        plot_values = plot_sources.values
    else:
        plot_labels = source_distribution.index
        plot_values = source_distribution.values
    
    axes[1, 0].pie(plot_values, labels=plot_labels, autopct='%1.1f%%', startangle=90)
    axes[1, 0].set_title('Source Distribution', fontsize=14, fontweight='bold')

# 4. Author Count Distribution
if 'author_count' in df_clean.columns and not df_clean['author_count'].empty:
    # Ensure we have valid data for histogram
    valid_author_counts = df_clean['author_count'].dropna()
    if not valid_author_counts.empty:
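        lo, hi = valid_author_counts.min(), valid_author_counts.max()
        # One bin per integer count, centered on the integer; cap at 20 bins for wide ranges
        bins = np.arange(lo - 0.5, hi + 1.5) if hi - lo < 20 else 20
        axes[1, 1].hist(valid_author_counts, bins=bins, 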
                       color='lightgreen', edgecolor='darkgreen', alpha=0.7)
        axes[1, 1].set_title('Distribution of Author Count per Paper', fontsize=14, fontweight='bold')
        axes[1, 1].set_xlabel('Number of Authors')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
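
The "top categories plus Others" grouping used for the pie chart can be factored into a small reusable function. This is a sketch only: `top_n_with_others` is a hypothetical helper, not part of pandas, and the counts are made up.

```python
import pandas as pd

def top_n_with_others(counts: pd.Series, n: int = 9, label: str = "Others") -> pd.Series:
    """Keep the n largest categories and sum the remainder into one `label` entry."""
    ordered = counts.sort_values(ascending=False)
    if len(ordered) <= n:
        return ordered
    head = ordered.iloc[:n]
    rest = ordered.iloc[n:].sum()
    return pd.concat([head, pd.Series({label: rest})])

# Illustrative counts only
counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 5, "E": 5})
grouped = top_n_with_others(counts, n=2)
print(grouped)
```

Because the remainder is summed rather than dropped, the grouped series keeps the same total as the input, so pie-chart percentages remain valid.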


In [18]:
# Additional visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Monthly publication trends
try:
    monthly_publications = df_clean.groupby(['publish_year', 'publish_month']).size().reset_index(name='count')
    if not monthly_publications.empty:
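        # pd.to_datetime assembles dates only from columns named year/month/day
        date_parts = monthly_publications.rename(
            columns={'publish_year': 'year', 'publish_month': 'month'}
        )[['year', 'month']].assign(day=1)
        monthly_publications['date'] = pd.to_datetime(date_parts)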
        
        axes[0, 0].plot(monthly_publications['date'], monthly_publications['count'], 
                       marker='o', linewidth=2, markersize=4, color='steelblue')
        axes[0, 0].set_title('Monthly Publication Trends', fontsize=14, fontweight='bold')
        axes[0, 0].set_xlabel('Date')
        axes[0, 0].set_ylabel('Number of Publications')
        axes[0, 0].grid(True, alpha=0.3)
        axes[0, 0].tick_params(axis='x', rotation=45)
    else:
        axes[0, 0].text(0.5, 0.5, 'No monthly data available', ha='center', va='center', transform=axes[0, 0].transAxes)
        axes[0, 0].set_title('Monthly Publication Trends', fontsize=14, fontweight='bold')
except Exception as e:
    axes[0, 0].text(0.5, 0.5, f'Error: {str(e)}', ha='center', va='center', transform=axes[0, 0].transAxes)
    axes[0, 0].set_title('Monthly Publication Trends', fontsize=14, fontweight='bold')

# 2. Full text availability
if not full_text_availability.empty:
    full_text_labels = ['Available', 'Not Available']
    full_text_counts = [full_text_availability.get(True, 0), full_text_availability.get(False, 0)]
    colors = ['lightgreen', 'lightcoral']
    
    # Only plot if we have data
    if sum(full_text_counts) > 0:
        axes[0, 1].pie(full_text_counts, labels=full_text_labels, autopct='%1.1f%%', 
                      colors=colors, startangle=90)
    axes[0, 1].set_title('Full Text Availability', fontsize=14, fontweight='bold')
else:
    axes[0, 1].text(0.5, 0.5, 'No full text data available', ha='center', va='center', transform=axes[0, 1].transAxes)
    axes[0, 1].set_title('Full Text Availability', fontsize=14, fontweight='bold')

# 3. Abstract length distribution
if 'abstract_length' in df_clean.columns:
    valid_abstracts = df_clean['abstract_length'].dropna()
    if not valid_abstracts.empty:
        # Use a reasonable number of bins based on data range
        max_bins = min(30, len(valid_abstracts.unique()))
        axes[1, 0].hist(valid_abstracts, bins=max_bins, 
                       color='lightblue', edgecolor='navy', alpha=0.7)
        axes[1, 0].set_title('Abstract Length Distribution', fontsize=14, fontweight='bold')
        axes[1, 0].set_xlabel('Abstract Length (characters)')
        axes[1, 0].set_ylabel('Frequency')
        axes[1, 0].grid(True, alpha=0.3)
    else:
        axes[1, 0].text(0.5, 0.5, 'No abstract length data', ha='center', va='center', transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('Abstract Length Distribution', fontsize=14, fontweight='bold')
else:
    axes[1, 0].text(0.5, 0.5, 'Abstract length not calculated', ha='center', va='center', transform=axes[1, 0].transAxes)
    axes[1, 0].set_title('Abstract Length Distribution', fontsize=14, fontweight='bold')

# 4. Publications by source
if not source_distribution.empty:
    # Limit to top sources to avoid overcrowding
    top_sources = source_distribution.head(10)
    x_pos = range(len(top_sources))
    
    axes[1, 1].bar(x_pos, top_sources.values, color='orange', edgecolor='darkorange', alpha=0.7)
    axes[1, 1].set_xticks(x_pos)
    axes[1, 1].set_xticklabels(top_sources.index, rotation=45, ha='right')
    axes[1, 1].set_title('Publications by Source (Top 10)', fontsize=14, fontweight='bold')
    axes[1, 1].set_ylabel('Number of Publications')
    axes[1, 1].grid(True, alpha=0.3)
else:
    axes[1, 1].text(0.5, 0.5, 'No source data available', ha='center', va='center', transform=axes[1, 1].transAxes)
    axes[1, 1].set_title('Publications by Source', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
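
An alternative way to build the monthly series is to bucket the parsed dates by calendar month directly, without separate year and month columns. A minimal sketch with made-up dates, assuming the input has already been converted with `pd.to_datetime`:

```python
import pandas as pd

# Illustrative dates only, including one missing value
dates = pd.to_datetime(pd.Series([
    "2020-01-11", "2020-01-25", "2020-03-02", "2021-12-11", None,
]))

# Drop missing dates, then group by calendar month
valid = dates.dropna()
monthly = valid.groupby(valid.dt.to_period("M")).size()

# Convert the PeriodIndex to month-start timestamps for plotting
monthly.index = monthly.index.to_timestamp()
print(monthly)
```

Months with no publications do not appear in the result. If a continuous axis is needed, reindexing against a full monthly range with a fill value of 0 is one option.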


## 5. Key Insights and Summary


In [None]:
# Generate summary statistics and insights
print("=== CORD-19 Dataset Analysis Summary ===\n")

print(f"📊 Dataset Overview:")
print(f"   • Total papers: {len(df_clean):,}")

# Handle date range safely
try:
    min_date = df_clean['publish_time'].min()
    max_date = df_clean['publish_time'].max()
    if pd.notna(min_date) and pd.notna(max_date):
        print(f"   • Date range: {min_date.strftime('%Y-%m-%d')} to {max_date.strftime('%Y-%m-%d')}")
    else:
        print("   • Date range: Unable to determine")
except Exception:
    print("   • Date range: Unable to determine")

# Handle years safely
try:
    years = sorted(df_clean['publish_year'].dropna().unique())
    print(f"   • Years covered: {years}")
except Exception:
    print("   • Years covered: Unable to determine")

print(f"\n📚 Publication Patterns:")
try:
    if not publication_by_year.empty:
        print(f"   • Most active year: {publication_by_year.idxmax()} ({publication_by_year.max()} papers)")
    else:
        print("   • Most active year: Unable to determine")
except Exception:
    print("   • Most active year: Unable to determine")

try:
    if 'author_count' in df_clean.columns:
        avg_authors = df_clean['author_count'].mean()
        print(f"   • Average authors per paper: {avg_authors:.1f}")
    else:
        print("   • Average authors per paper: Not calculated")
except Exception:
    print("   • Average authors per paper: Unable to calculate")

try:
    if 'abstract_length' in df_clean.columns:
        avg_abstract = df_clean['abstract_length'].mean()
        print(f"   • Average abstract length: {avg_abstract:.0f} characters")
    else:
        print("   • Average abstract length: Not calculated")
except Exception:
    print("   • Average abstract length: Unable to calculate")

print(f"\n🏆 Top Publishing Sources:")
try:
    if not source_distribution.empty:
        for i, (source, count) in enumerate(source_distribution.head(3).items(), 1):
            percentage = (count / len(df_clean)) * 100
            print(f"   {i}. {source}: {count} papers ({percentage:.1f}%)")
    else:
        print("   No source data available")
except Exception as e:
    print(f"   Error processing sources: {str(e)}")

print(f"\n📖 Journal Distribution:")
try:
    # Note: this count includes the 'Unknown Journal' placeholder used for missing values
    journal_count = df_clean['journal'].nunique()
    print(f"   • Total unique journals: {journal_count}")
    
    if not top_journals.empty:
        print(f"   • Top journal: {top_journals.index[0]} ({top_journals.iloc[0]} papers)")
    else:
        print("   • Top journal: Unable to determine")
except Exception:
    print("   • Journal information: Unable to process")

print(f"\n📄 Full Text Availability:")
try:
    if not full_text_availability.empty:
        true_count = full_text_availability.get(True, 0)
        false_count = full_text_availability.get(False, 0)
        total_count = true_count + false_count
        
        if total_count > 0:
            full_text_pct = (true_count / total_count) * 100
            print(f"   • Papers with full text: {true_count} ({full_text_pct:.1f}%)")
            print(f"   • Papers without full text: {false_count} ({100-full_text_pct:.1f}%)")
        else:
            print("   • No full text data available")
    else:
        print("   • No full text data available")
except Exception as e:
    print(f"   • Error processing full text data: {str(e)}")

print(f"\n🔍 Data Quality:")
try:
    # Measure missingness on the original frame rather than by matching placeholder strings
    missing_abstract_pct = df['abstract'].isna().mean() * 100
    missing_journal_pct = df['journal'].isna().mean() * 100
    print(f"   • Missing abstracts: {missing_abstract_pct:.1f}%")
    print(f"   • Missing journal info: {missing_journal_pct:.1f}%")
except Exception:
    print("   • Data quality metrics: Unable to calculate")

print("\n✅ Analysis completed successfully!")


=== CORD-19 Dataset Analysis Summary ===

📊 Dataset Overview:
   • Total papers: 1,000
   • Date range: 2020-01-01 to 2022-12-31
   • Years covered: [np.int32(2020), np.int32(2021), np.int32(2022)]

📚 Publication Patterns:
   • Most active year: 2021 (343 papers)
   • Average authors per paper: 3.6
   • Average abstract length: 135 characters

🏆 Top Publishing Sources:
   1. Elsevier: 177 papers (17.7%)
   2. bioRxiv: 171 papers (17.1%)
   3. Research Square: 168 papers (16.8%)

📖 Journal Distribution:
   • Total unique journals: 14
   • Top journal: The Lancet (96 papers)

📄 Full Text Availability:
   • Papers with full text: 471 (47.1%)
   • Papers without full text: 529 (52.9%)

🔍 Data Quality:
   • Missing abstracts: 2.0%
   • Missing journal info: 1.5%

✅ Analysis completed successfully!
