# 🦠 COVID-19 Data Collection & Initial Processing

## 📋 Notebook Overview
This notebook handles the collection and initial processing of COVID-19 data from the following sources:
- **Our World in Data (OWID)**: Comprehensive global dataset
- **Johns Hopkins University**: Time series data for cases, deaths, and recoveries

It also runs **data validation and quality checks** (missing values, duplicates, date consistency, and sanity checks on cumulative counts) on the downloaded data.

**Objective**: Download, clean, and prepare COVID-19 data for analysis

**Author**: Your Name  
**Date**: September 2025  
**Duration**: ~30-45 minutes

## 📚 Step 1: Import Required Libraries

First, we'll import all the necessary Python libraries for data collection, processing, and basic analysis.

In [None]:
# Standard libraries
import sys
import os
from pathlib import Path
import warnings
from datetime import datetime, timedelta

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Add the src directory to Python path so we can import our custom modules
sys.path.append('../src')

# Import our custom COVID-19 data utilities
try:
    from data_fetcher import CovidDataFetcher
    from data_processor import CovidDataProcessor
    print("✅ Successfully imported custom COVID-19 utilities")
except ImportError as e:
    print(f"❌ Error importing custom utilities: {e}")
    print("Make sure the src/ directory contains data_fetcher.py and data_processor.py")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Configure matplotlib
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 6)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("🚀 All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
import matplotlib  # pyplot has no __version__ attribute; the top-level package does
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Current working directory: {os.getcwd()}")
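
The `sys.path.append('../src')` call above resolves relative to the current working directory, so the custom imports fail if the notebook is started from a different folder. A minimal sketch of a more defensive variant is shown here. The helper name `add_src_to_path` is hypothetical, and it assumes the project keeps its modules in a folder named `src` at or above the notebook directory.

```python
import sys
from pathlib import Path


def add_src_to_path(start: Path, dirname: str = "src") -> Path:
    """Walk up from `start` until a `dirname` folder is found, then add it to sys.path.

    Hypothetical helper: assumes a folder called `src` exists at or above `start`.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        src_dir = candidate / dirname
        if src_dir.is_dir():
            src_str = str(src_dir)
            # Avoid adding the same entry twice when the cell is re-run
            if src_str not in sys.path:
                sys.path.insert(0, src_str)
            return src_dir
    raise FileNotFoundError(f"No '{dirname}' directory found at or above {start}")
```

In the notebook this would be called as `add_src_to_path(Path.cwd())` before the `from data_fetcher import ...` lines.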

## 🌐 Step 2: Initialize Data Fetcher

Now we'll set up our COVID-19 data fetcher to download data from reliable sources.

In [None]:
# Initialize the COVID-19 data fetcher
fetcher = CovidDataFetcher(data_dir="../data/raw")

# Show information about available data sources
print("📊 Available COVID-19 Data Sources:")
print("=" * 60)
fetcher.get_data_info()

# Check if we have any existing data
print("\n" + "="*60)
print("📁 Checking for existing data files...")

raw_data_path = Path("../data/raw")
if raw_data_path.exists():
    existing_files = list(raw_data_path.glob("*.csv"))
    if existing_files:
        print(f"Found {len(existing_files)} existing data files:")
        for file in existing_files:
            file_size = file.stat().st_size / (1024 * 1024)  # Size in MB
            modified_time = datetime.fromtimestamp(file.stat().st_mtime)
            print(f"  📄 {file.name} ({file_size:.1f} MB, modified: {modified_time.strftime('%Y-%m-%d %H:%M')})")
    else:
        print("No existing data files found. We'll download fresh data.")
else:
    print("Data directory doesn't exist yet. It will be created when we download data.")

print("\n🚀 Data fetcher initialized successfully!")
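
The listing above reports when each file was last modified, which is the information needed to decide whether to download again. As a rough sketch, a staleness check could look like the following. The function `is_stale` and the one-day default are assumptions for illustration and are not part of `CovidDataFetcher`.

```python
from datetime import datetime, timedelta
from pathlib import Path


def is_stale(path: Path, max_age: timedelta = timedelta(days=1)) -> bool:
    """Return True if `path` is missing or was last modified more than `max_age` ago."""
    if not path.exists():
        return True
    modified = datetime.fromtimestamp(path.stat().st_mtime)
    return datetime.now() - modified > max_age
```

A download step could then be skipped for any file where `is_stale(file)` is `False`.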

## 📥 Step 3: Download COVID-19 Dataset from Kaggle Source

This dataset includes multiple files with different views of COVID-19 data from Johns Hopkins and Worldometer sources. We'll start with the main country-level daily file, then download the latest country-wise summary and the global day-wise file.

In [None]:
# Download the main clean complete COVID-19 dataset
print("🔄 Downloading COVID-19 clean complete dataset...")
print("This dataset contains daily country-level data without province/state details")
print("This may take a few minutes depending on your internet connection...")

try:
    # Fetch the clean complete dataset (country-level daily data)
    covid_data = fetcher.fetch_clean_complete_data(save_local=True)
    
    print(f"\n✅ Successfully downloaded COVID-19 data!")
    print(f"📊 Dataset shape: {covid_data.shape[0]:,} rows × {covid_data.shape[1]} columns")
    
    # Handle different date column names
    date_col = 'Date' if 'Date' in covid_data.columns else 'date'
    if date_col in covid_data.columns:
        print(f"📅 Date range: {covid_data[date_col].min()} to {covid_data[date_col].max()}")
    
    # Handle different country column names
    country_col = 'Country/Region' if 'Country/Region' in covid_data.columns else 'Country'
    if country_col in covid_data.columns:
        print(f"🌍 Number of locations: {covid_data[country_col].nunique()}")
    
    # Show a quick preview of the data
    print(f"\n👀 First 5 rows:")
    display(covid_data.head())
    
    # Show column names
    print(f"\n📋 Available columns ({len(covid_data.columns)} total):")
    for i, col in enumerate(covid_data.columns):
        print(f"  {i+1:2d}. {col}")
    
    # Also download the latest country summary for comparison
    print(f"\n📊 Downloading latest country-wise summary...")
    country_latest = fetcher.fetch_country_wise_latest(save_local=True)
    print(f"✅ Country summary: {country_latest.shape[0]} countries")
    
    # Download global day-wise data
    print(f"\n🌍 Downloading global day-wise data...")
    day_wise_data = fetcher.fetch_day_wise_data(save_local=True)
    print(f"✅ Global day-wise data: {day_wise_data.shape[0]} days")
    
except Exception as e:
    print(f"❌ Error downloading COVID-19 data: {str(e)}")
    print("Please check your internet connection and try again.")
    print("You may also need to check if the data source URLs are still active.")
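
The cell above picks between `'Date'`/`'date'` and `'Country/Region'`/`'Country'` inline. The same idea can be factored into a small helper so later cells do not hard-code one spelling. This is a sketch only; `resolve_column` is a hypothetical name and the sample frame is made up for illustration.

```python
from typing import Iterable, Optional

import pandas as pd


def resolve_column(df: pd.DataFrame, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate name present in `df.columns`, or None if none match."""
    for name in candidates:
        if name in df.columns:
            return name
    return None


# Illustrative frame using the raw-style column names
sample = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Date": ["2020-01-22", "2020-01-23"],
})
date_col = resolve_column(sample, ["date", "Date"])
country_col = resolve_column(sample, ["location", "Country/Region", "Country"])
print(date_col, country_col)
```

Returning `None` instead of raising lets the caller decide whether a missing column is fatal or just means a section of the report is skipped.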

## 🔍 Step 4: Initial Data Exploration

Let's take a closer look at our data to understand its structure and quality.

In [None]:
# Basic dataset information
print("📊 DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {covid_data.shape[0]:,} rows × {covid_data.shape[1]} columns")
print(f"Memory usage: {covid_data.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
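# Column names differ between the raw and cleaned data ('Date' vs 'date',
# 'Country/Region' or 'Country' vs 'location'), so resolve them before use
date_col = next((c for c in ('date', 'Date') if c in covid_data.columns), None)
location_col = next((c for c in ('location', 'Country/Region', 'Country') if c in covid_data.columns), None)

if date_col:
    print(f"Date range: {covid_data[date_col].min()} to {covid_data[date_col].max()}")
if location_col:
    print(f"Number of unique locations: {covid_data[location_col].nunique()}")

# Check data types
print(f"\n📋 DATA TYPES")
print("=" * 30)
data_types = covid_data.dtypes.value_counts()
for dtype, count in data_types.items():
    print(f"{dtype}: {count} columns")

# Show sample locations
print(f"\n🌍 SAMPLE LOCATIONS (first 20)")
print("=" * 40)
if location_col:
    unique_locations = covid_data[location_col].unique()
    for i, location in enumerate(unique_locations[:20], 1):
        print(f"{i:2d}. {location}")

    if len(unique_locations) > 20:
        print(f"... and {len(unique_locations) - 20} more locations")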

# Basic statistics for key numerical columns
key_columns = ['new_cases', 'new_deaths', 'total_cases', 'total_deaths']
available_key_columns = [col for col in key_columns if col in covid_data.columns]

if available_key_columns:
    print(f"\n📈 BASIC STATISTICS FOR KEY METRICS")
    print("=" * 50)
    display(covid_data[available_key_columns].describe())
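
Beyond overall shape and date range, it can help to see how much history each location actually has. The sketch below summarises first date, last date, and row count per location. `location_coverage` is a hypothetical helper, and the column names are passed in because the raw data may use `'Date'` and `'Country/Region'`.

```python
import pandas as pd


def location_coverage(df: pd.DataFrame, location_col: str, date_col: str) -> pd.DataFrame:
    """Summarise first date, last date, and row count for each location."""
    # Parse to datetime so min/max compare dates rather than strings
    dates = pd.to_datetime(df[date_col])
    coverage = (
        dates.groupby(df[location_col])
        .agg(first_date="min", last_date="max", n_rows="count")
        .sort_values("n_rows")
    )
    return coverage


# Illustrative data: location B is missing one day
sample = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-22", "2020-01-24"],
})
coverage = location_coverage(sample, "location", "date")
print(coverage)
```

Sorting by `n_rows` puts the locations with the least data first, which is usually where gaps show up.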

## ❓ Step 5: Data Quality Assessment

Let's check for missing values and data quality issues that we'll need to address.

In [None]:
# Check for missing values
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 40)

missing_values = covid_data.isnull().sum()
missing_percentage = (missing_values / len(covid_data)) * 100

# Create a summary of missing values
missing_summary = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percentage.values
}).sort_values('Missing_Percentage', ascending=False)

# Only show columns with missing values
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]

if len(missing_summary) > 0:
    print(f"Found missing values in {len(missing_summary)} columns:")
    display(missing_summary.head(15))  # Show top 15 columns with missing values
    
    if len(missing_summary) > 15:
        print(f"... and {len(missing_summary) - 15} more columns with missing values")
else:
    print("✅ No missing values found in the dataset!")

# Check for duplicate rows
duplicate_count = covid_data.duplicated().sum()
print(f"\n🔁 DUPLICATE ROWS: {duplicate_count}")

if duplicate_count > 0:
    print(f"⚠️  Found {duplicate_count} duplicate rows that may need to be removed")
else:
    print("✅ No duplicate rows found")

# Check date consistency
print(f"\n📅 DATE ANALYSIS")
print("=" * 20)
# The raw file may name the column 'Date' and store it as text, so resolve
# the name and parse to datetime before doing date arithmetic
date_col = 'date' if 'date' in covid_data.columns else 'Date'
dates = pd.to_datetime(covid_data[date_col], errors='coerce')
print(f"Date column type: {covid_data[date_col].dtype}")
print(f"Earliest date: {dates.min()}")
print(f"Latest date: {dates.max()}")
print(f"Date range: {(dates.max() - dates.min()).days} days")

# Check for invalid dates (nulls or unparseable values in the date column)
null_dates = dates.isnull().sum()
print(f"Missing dates: {null_dates}")

# Quick sanity check on data values
print(f"\n🔢 DATA SANITY CHECKS")
print("=" * 25)

# Check for negative values in cumulative columns
cumulative_columns = ['total_cases', 'total_deaths']
for col in cumulative_columns:
    if col in covid_data.columns:
        negative_count = (covid_data[col] < 0).sum()
        if negative_count > 0:
            print(f"⚠️  {col}: {negative_count} negative values found")
        else:
            print(f"✅ {col}: No negative values")

print(f"\n📊 DATA QUALITY SUMMARY")
print("=" * 30)
total_cells = covid_data.shape[0] * covid_data.shape[1]
missing_cells = covid_data.isnull().sum().sum()
data_completeness = ((total_cells - missing_cells) / total_cells) * 100

print(f"Overall data completeness: {data_completeness:.1f}%")
print(f"Total cells: {total_cells:,}")
print(f"Missing cells: {missing_cells:,}")
print(f"Complete cells: {total_cells - missing_cells:,}")
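
The sanity check above only looks for negative values. Cumulative columns such as `total_cases` should also never decrease from one day to the next within a location, so a drop usually signals a retroactive correction. A minimal sketch of that check follows; `find_cumulative_drops` is a hypothetical helper and the sample data is invented.

```python
import pandas as pd


def find_cumulative_drops(
    df: pd.DataFrame, location_col: str, date_col: str, value_col: str
) -> pd.DataFrame:
    """Return rows where a cumulative column is lower than on the previous date for the same location."""
    ordered = df.sort_values([location_col, date_col])
    # Day-over-day change within each location; the first row per location is NaN
    change = ordered.groupby(location_col)[value_col].diff()
    flagged = ordered.assign(change=change)
    return flagged.loc[flagged["change"] < 0, [location_col, date_col, value_col, "change"]]


# Illustrative data: location A drops from 5 to 4 on the third day
sample = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "total_cases": [1, 5, 4, 2, 2, 3],
})
drops = find_cumulative_drops(sample, "location", "date", "total_cases")
print(drops)
```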

## 🧹 Step 6: Data Cleaning and Processing

Now let's clean and process our data using our custom data processor.

In [None]:
# Initialize the data processor
processor = CovidDataProcessor(raw_data_dir="../data/raw", processed_data_dir="../data/processed")

print("🧹 Starting data cleaning and processing...")

try:
    # Clean the COVID-19 data
    cleaned_data = processor.clean_covid_data(covid_data)
    
    print(f"\n✅ Data cleaning completed!")
    print(f"📊 Original shape: {covid_data.shape[0]:,} rows × {covid_data.shape[1]} columns")
    print(f"📊 Cleaned shape: {cleaned_data.shape[0]:,} rows × {cleaned_data.shape[1]} columns")
    
    # Show what changed
    rows_removed = covid_data.shape[0] - cleaned_data.shape[0]
    columns_added = cleaned_data.shape[1] - covid_data.shape[1]
    
    if rows_removed > 0:
        print(f"🗑️  Removed {rows_removed:,} rows (likely duplicates or data corrections)")
    elif rows_removed < 0:
        print(f"➕ Added {abs(rows_removed):,} rows (likely from aggregation)")
    if columns_added > 0:
        print(f"➕ Added {columns_added} new derived columns (net)")
    elif columns_added < 0:
        print(f"➖ Dropped {abs(columns_added)} columns (net)")
    
    # Show sample of cleaned data
    print(f"\n👀 Sample of cleaned data:")
    display(cleaned_data.head())
    
    # Show the new/modified columns
    new_columns = set(cleaned_data.columns) - set(covid_data.columns)
    if new_columns:
        print(f"\n✨ New columns added during processing:")
        for col in sorted(new_columns):
            print(f"  • {col}")
    
    # Show data types
    print(f"\n📋 Data types after cleaning:")
    for col, dtype in cleaned_data.dtypes.items():
        print(f"  {col}: {dtype}")
    
except Exception as e:
    print(f"❌ Error during data cleaning: {str(e)}")
    # If cleaning fails, we'll use the original data
    cleaned_data = covid_data.copy()
    print("Using original data for now...")
    
    # Still try to standardize column names manually
    if 'Country/Region' in cleaned_data.columns:
        cleaned_data = cleaned_data.rename(columns={'Country/Region': 'location'})
    if 'Date' in cleaned_data.columns:
        cleaned_data['Date'] = pd.to_datetime(cleaned_data['Date'])
        cleaned_data = cleaned_data.rename(columns={'Date': 'date'})
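
The internals of `clean_covid_data` are not shown in this notebook. One derived column a cleaning step commonly adds is a daily count computed from a cumulative total. The sketch below shows one way to do that under stated assumptions: the first day of each location takes the cumulative value itself, and negative differences caused by corrections are clipped to zero. Both choices, and the name `add_daily_from_cumulative`, are illustrative rather than what the processor necessarily does.

```python
import pandas as pd


def add_daily_from_cumulative(
    df: pd.DataFrame, location_col: str, date_col: str, total_col: str, new_col: str
) -> pd.DataFrame:
    """Add `new_col` holding the day-over-day increase of `total_col` within each location."""
    out = df.sort_values([location_col, date_col]).copy()
    daily = out.groupby(location_col)[total_col].diff()
    # Assumption: the first row per location has no previous day, so use the total itself
    daily = daily.fillna(out[total_col])
    # Assumption: downward corrections are clipped to zero rather than kept as negatives
    out[new_col] = daily.clip(lower=0)
    return out


sample = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"]),
    "total_cases": [1, 3, 2, 5, 5],
})
result = add_daily_from_cumulative(sample, "location", "date", "total_cases", "new_cases")
print(result)
```

Clipping hides corrections, so an analysis that cares about reporting revisions may prefer to keep the negative values and flag them instead.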

## 📈 Step 7: Create Processed Datasets

Let's create different views of our data that will be useful for analysis.

In [None]:
# Create different views of the data for analysis
print("📊 Creating processed datasets for analysis...")

datasets_created = {}

try:
    # 1. Country summary (latest data for each country)
    print("\n1️⃣ Creating country summary...")
    country_summary = processor.create_country_summary(cleaned_data)
    datasets_created['country_summary'] = country_summary
    print(f"   ✅ Country summary: {country_summary.shape[0]} countries")
    
    # Show top 10 countries by total cases
    print("\n🏆 Top 10 countries by total cases:")
    display(country_summary.head(10))
    
    # 2. Time series for major countries
    print("\n2️⃣ Creating time series for major countries...")
    major_countries_ts = processor.create_time_series_data(cleaned_data)
    datasets_created['major_countries_timeseries'] = major_countries_ts
    print(f"   ✅ Major countries time series: {major_countries_ts.shape[0]:,} rows")
    
    # 3. Global aggregated data
    print("\n3️⃣ Creating global aggregated data...")
    global_data = processor.aggregate_global_data(cleaned_data)
    datasets_created['global_timeseries'] = global_data
    print(f"   ✅ Global time series: {global_data.shape[0]} days")
    
    # Show latest global stats
    latest_global = global_data.iloc[-1]
    print(f"\n🌍 Latest global statistics ({latest_global['date'].strftime('%Y-%m-%d')}):")
    print(f"   Total cases: {latest_global['total_cases']:,.0f}")
    print(f"   Total deaths: {latest_global['total_deaths']:,.0f}")
    print(f"   Case fatality rate: {latest_global['case_fatality_rate']:.2f}%")
    print(f"   New cases (7-day avg): {latest_global['new_cases_7day_avg']:,.0f}")
    
    print(f"\n✅ Successfully created {len(datasets_created)} processed datasets!")
    
except Exception as e:
    print(f"❌ Error creating processed datasets: {str(e)}")
    # Create minimal datasets as fallback
    datasets_created['cleaned_data'] = cleaned_data
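
The global view printed above relies on columns such as `new_cases_7day_avg` and `case_fatality_rate`. How `aggregate_global_data` computes them is not shown here, so the following is only a sketch of one plausible approach: sum across locations per date, take a 7-day rolling mean, and express deaths as a percentage of cases. The function name `aggregate_global` and the `min_periods=1` choice are assumptions.

```python
import pandas as pd


def aggregate_global(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Sum per-location metrics by date and add rolling averages and a case fatality rate."""
    metrics = ["total_cases", "total_deaths", "new_cases", "new_deaths"]
    daily = df.groupby(date_col)[metrics].sum().sort_index().reset_index()
    # Assumption: min_periods=1 so the first six days still get an average
    daily["new_cases_7day_avg"] = daily["new_cases"].rolling(7, min_periods=1).mean()
    daily["new_deaths_7day_avg"] = daily["new_deaths"].rolling(7, min_periods=1).mean()
    # Avoid division by zero: dates with no cases get NaN instead of inf
    cases = daily["total_cases"].where(daily["total_cases"] > 0)
    daily["case_fatality_rate"] = daily["total_deaths"] / cases * 100
    return daily


sample = pd.DataFrame({
    "location": ["A", "B", "A", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]),
    "total_cases": [10, 30, 20, 60],
    "total_deaths": [1, 1, 2, 2],
    "new_cases": [10, 30, 10, 30],
    "new_deaths": [1, 1, 1, 1],
})
global_daily = aggregate_global(sample)
print(global_daily)
```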

## 💾 Step 8: Save Processed Data

Let's save our cleaned and processed datasets for use in subsequent analysis notebooks.

In [None]:
# Save all processed datasets
print("💾 Saving processed datasets...")

# Add the full cleaned dataset to our collection
datasets_created['covid_cleaned'] = cleaned_data

try:
    # Save using our processor
    processor.save_processed_data(datasets_created)
    
    print(f"\n✅ Successfully saved {len(datasets_created)} datasets!")
    
    # Show what we saved
    processed_dir = Path("../data/processed")
    if processed_dir.exists():
        saved_files = list(processed_dir.glob("*.csv"))
        print(f"\n📁 Saved files in {processed_dir}:")
        
        total_size = 0
        for file in saved_files:
            file_size = file.stat().st_size / (1024 * 1024)  # Size in MB
            total_size += file_size
            print(f"   📄 {file.name} ({file_size:.1f} MB)")
        
        print(f"\n📊 Total size of processed data: {total_size:.1f} MB")
    
    # Generate and display data quality report
    print(f"\n📋 Final Data Quality Report:")
    print("=" * 40)
    quality_report = processor.get_data_quality_report(cleaned_data)
    
    print(f"Total rows: {quality_report['total_rows']:,}")
    print(f"Total columns: {quality_report['total_columns']}")
    print(f"Date range: {quality_report['date_range']['start']} to {quality_report['date_range']['end']}")
    print(f"Countries/locations: {quality_report['countries']}")
    print(f"Duplicate rows: {quality_report['duplicate_rows']}")
    
    # Show columns with significant missing data
    missing_data = {k: v for k, v in quality_report['missing_values'].items() if v > 0}
    if missing_data:
        print(f"\nColumns with missing values:")
        for col, count in sorted(missing_data.items(), key=lambda x: x[1], reverse=True)[:10]:
            percentage = (count / quality_report['total_rows']) * 100
            print(f"   {col}: {count:,} ({percentage:.1f}%)")
    
except Exception as e:
    print(f"❌ Error saving processed data: {str(e)}")
    print("You may need to create the processed data directory manually.")
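
The error message above suggests creating the processed directory by hand. A save routine can avoid that by creating the directory itself. The sketch below writes each DataFrame in a dictionary to `<name>.csv`; `save_datasets` is a hypothetical stand-in and is not the implementation of `processor.save_processed_data`.

```python
from pathlib import Path
from typing import Dict, List

import pandas as pd


def save_datasets(datasets: Dict[str, pd.DataFrame], out_dir: Path) -> List[Path]:
    """Write each DataFrame to `<out_dir>/<name>.csv`, creating the directory if needed."""
    # parents=True creates missing parent folders; exist_ok=True makes re-runs safe
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in datasets.items():
        target = out_dir / f"{name}.csv"
        frame.to_csv(target, index=False)
        written.append(target)
    return written
```

With the notebook's variables, a call would look like `save_datasets(datasets_created, Path("../data/processed"))`.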

## 📊 Step 9: Quick Visualization Preview

Let's create a quick preview visualization to make sure our data looks reasonable.

In [None]:
# Create a quick visualization to verify our data looks correct
print("📊 Creating quick data validation plots...")

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('COVID-19 Data Collection - Quick Validation Plots', fontsize=16, fontweight='bold')

try:
    # 1. Global daily cases over time
    if 'global_timeseries' in datasets_created:
        global_data = datasets_created['global_timeseries']
        axes[0, 0].plot(global_data['date'], global_data['new_cases_7day_avg'], 
                       color='blue', linewidth=2, alpha=0.8)
        axes[0, 0].set_title('Global Daily Cases (7-day average)')
        axes[0, 0].set_ylabel('New Cases')
        axes[0, 0].grid(True, alpha=0.3)
        axes[0, 0].tick_params(axis='x', rotation=45)
    
    # 2. Top 10 countries by total cases
    if 'country_summary' in datasets_created:
        top_10 = datasets_created['country_summary'].head(10)
        axes[0, 1].barh(range(len(top_10)), top_10['Total_Cases'], color='lightcoral')
        axes[0, 1].set_yticks(range(len(top_10)))
        axes[0, 1].set_yticklabels(top_10['Country'])
        axes[0, 1].set_title('Top 10 Countries - Total Cases')
        axes[0, 1].set_xlabel('Total Cases')
        # Invert y-axis so highest is at top
        axes[0, 1].invert_yaxis()
    
    # 3. Global daily deaths over time
    if 'global_timeseries' in datasets_created:
        axes[1, 0].plot(global_data['date'], global_data['new_deaths_7day_avg'], 
                       color='red', linewidth=2, alpha=0.8)
        axes[1, 0].set_title('Global Daily Deaths (7-day average)')
        axes[1, 0].set_ylabel('New Deaths')
        axes[1, 0].grid(True, alpha=0.3)
        axes[1, 0].tick_params(axis='x', rotation=45)
    
    # 4. Case fatality rates for top 10 countries
    if 'country_summary' in datasets_created:
        axes[1, 1].bar(range(len(top_10)), top_10['Case_Fatality_Rate'], 
                      color='orange', alpha=0.7)
        axes[1, 1].set_xticks(range(len(top_10)))
        axes[1, 1].set_xticklabels(top_10['Country'], rotation=45, ha='right')
        axes[1, 1].set_title('Case Fatality Rate - Top 10 Countries')
        axes[1, 1].set_ylabel('CFR (%)')
        axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Quick validation plots created successfully!")
    print("📊 The data appears to be loaded and processed correctly.")
    
except Exception as e:
    print(f"⚠️  Could not create validation plots: {str(e)}")
    print("This doesn't affect the data collection, but you may want to check the data manually.")

# Show some quick statistics
print(f"\n📈 QUICK STATS SUMMARY")
print("=" * 30)

if 'global_timeseries' in datasets_created:
    latest_global = datasets_created['global_timeseries'].iloc[-1]
    print(f"Latest global data ({latest_global['date'].strftime('%Y-%m-%d')}):")
    print(f"  • Total cases: {latest_global['total_cases']:,.0f}")
    print(f"  • Total deaths: {latest_global['total_deaths']:,.0f}")
    print(f"  • New cases (7-day avg): {latest_global['new_cases_7day_avg']:,.0f}")

if 'country_summary' in datasets_created:
    country_summary = datasets_created['country_summary']
    print(f"\nCountry analysis:")
    print(f"  • Countries analyzed: {len(country_summary)}")
    print(f"  • Most affected country: {country_summary.iloc[0]['Country']}")
    print(f"  • Highest CFR: {country_summary['Case_Fatality_Rate'].max():.2f}%")

print(f"\n🎉 Data collection and processing completed successfully!")
print(f"Ready for detailed analysis in the next notebooks!")

## 🎯 Summary & Next Steps

### ✅ What We Accomplished

1. **📥 Data Collection**: Downloaded the COVID-19 dataset, including country-level daily data, the latest country-wise summary, and global day-wise data
2. **🔍 Data Exploration**: Analyzed the structure and quality of our dataset
3. **🧹 Data Cleaning**: Processed and cleaned the data for analysis
4. **📊 Dataset Creation**: Created multiple processed datasets for different types of analysis
5. **💾 Data Storage**: Saved all processed data for use in subsequent notebooks
6. **✅ Validation**: Verified data quality with quick visualization checks

### 📊 Datasets Created

- **`covid_cleaned.csv`**: Full cleaned dataset with all countries and metrics
- **`country_summary.csv`**: Latest statistics for each country
- **`major_countries_timeseries.csv`**: Time series data for major countries
- **`global_timeseries.csv`**: Global aggregated data by date

### 🚀 Next Steps

1. **Exploratory Analysis** (`02_exploratory_analysis.ipynb`): Deep dive into patterns and trends
2. **Time Series Analysis** (`03_trend_analysis.ipynb`): Analyze temporal patterns and seasonality
3. **Geographic Analysis** (`04_geographic_analysis.ipynb`): Compare countries and regions

### 📁 Files Ready for Analysis

All processed data files are saved in `../data/processed/` and ready for the next phase of analysis!