# COVID-19 Data Analytics Project
## Analysis of Johns Hopkins COVID-19 Dataset (February 9, 2020)

This notebook analyzes a single-day snapshot of the early COVID-19 pandemic phase, including:
- Data inspection and cleaning
- Visualization with bar charts, scatter plots, and pie charts
- Analysis of top affected regions
- Exploration of relationships between confirmed cases, deaths, and recoveries
- Investigation of China's dominance in the early phase
- Regional differences in recovery and mortality rates

## Import Required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set style for better-looking plots
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]

print("Libraries imported successfully!")

## Load and Inspect Data

In [None]:
# Load the COVID-19 data
df_raw = pd.read_csv('data/covid_02_09_2020.csv')

print("COVID-19 DATA INSPECTION (February 9, 2020)")
print("=" * 50)
print(f"Dataset shape: {df_raw.shape}")
print(f"Columns: {list(df_raw.columns)}")

# Display first few rows
print("\nFirst 10 rows:")
df_raw.head(10)

In [None]:
# Basic statistics
print("Basic Statistics:")
df_raw.describe()

In [None]:
# Check for missing values
print("Missing Values:")
df_raw.isnull().sum()
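
Before cleaning, it can help to confirm that the file actually contains the columns the later cells read. The following is a minimal sketch under stated assumptions: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the two-row CSV is made up to stand in for the real file.

```python
import io
import pandas as pd

# Columns that the cleaning and plotting cells in this notebook read.
REQUIRED_COLUMNS = ['Province/State', 'Country/Region', 'Confirmed', 'Deaths', 'Recovered']

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Made-up sample standing in for the real CSV; it deliberately lacks 'Recovered'.
sample_csv = io.StringIO(
    "Province/State,Country/Region,Confirmed,Deaths\n"
    "Hubei,Mainland China,100,5\n"
    ",Japan,10,0\n"
)
sample = pd.read_csv(sample_csv)
absent = missing_columns(sample)
print(absent)
```

Calling `missing_columns(df_raw)` right after loading would give an early, readable failure instead of a `KeyError` further down.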

## Data Cleaning

In [None]:
# Clean the data
df = df_raw.copy()

# Coerce count columns to numeric and treat missing or unparseable values as 0
numeric_cols = ['Confirmed', 'Deaths', 'Recovered']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)

# Handle missing geographic data
df['Province/State'] = df['Province/State'].fillna('')

# Standardize country names
df['Country/Region'] = df['Country/Region'].replace('Mainland China', 'China')

# Create combined location column
df['Location'] = df.apply(
    lambda row: f"{row['Province/State']}, {row['Country/Region']}" 
    if row['Province/State'] else row['Country/Region'], 
    axis=1
)

# Calculate recovery and mortality rates
df['Mortality_Rate'] = np.where(
    df['Confirmed'] > 0, 
    (df['Deaths'] / df['Confirmed']) * 100, 
    0
)

df['Recovery_Rate'] = np.where(
    df['Confirmed'] > 0, 
    (df['Recovered'] / df['Confirmed']) * 100, 
    0
)

print(f"Cleaned dataset shape: {df.shape}")
print(f"Total confirmed cases globally: {df['Confirmed'].sum():,.0f}")
print(f"Total deaths globally: {df['Deaths'].sum():,.0f}")
print(f"Total recovered globally: {df['Recovered'].sum():,.0f}")
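
To see what the cleaning steps do to individual rows, the sketch below applies the same logic to a made-up three-row frame. The values are illustrative only. As a variation on the cell above, it builds `Location` with a vectorized `np.where` rather than a row-wise `apply`, and it computes the rate only on rows that have confirmed cases.

```python
import numpy as np
import pandas as pd

# Made-up rows: one with a province, one without, one with no data at all.
toy = pd.DataFrame({
    'Province/State': ['Hubei', None, None],
    'Country/Region': ['Mainland China', 'Japan', 'Nowhere'],
    'Confirmed': ['200', '20', None],
    'Deaths': [10, 0, None],
    'Recovered': [50, 5, None],
})

# Coerce counts to numbers; missing values become 0.
for col in ['Confirmed', 'Deaths', 'Recovered']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce').fillna(0)

toy['Province/State'] = toy['Province/State'].fillna('')
toy['Country/Region'] = toy['Country/Region'].replace('Mainland China', 'China')

# Join province and country only when a province is present.
toy['Location'] = np.where(
    toy['Province/State'] != '',
    toy['Province/State'] + ', ' + toy['Country/Region'],
    toy['Country/Region'],
)

# Divide only where Confirmed > 0; rows with no cases keep a rate of 0.
has_cases = toy['Confirmed'] > 0
toy['Mortality_Rate'] = 0.0
toy.loc[has_cases, 'Mortality_Rate'] = (
    toy.loc[has_cases, 'Deaths'] / toy.loc[has_cases, 'Confirmed'] * 100
)

print(toy[['Location', 'Confirmed', 'Mortality_Rate']])
```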

## Visualization 1: Bar Charts - Top Affected Regions

In [None]:
# Create comprehensive bar charts
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 15))
fig.suptitle('COVID-19 Analysis - Top Affected Regions (Feb 9, 2020)', fontsize=18, fontweight='bold')

# Top 10 confirmed cases
top_confirmed = df.nlargest(10, 'Confirmed')
bars1 = ax1.bar(range(len(top_confirmed)), top_confirmed['Confirmed'], 
               color='steelblue', alpha=0.7)
ax1.set_title('Top 10 Regions by Confirmed Cases', fontweight='bold', fontsize=14)
ax1.set_xlabel('Region')
ax1.set_ylabel('Confirmed Cases')
ax1.set_xticks(range(len(top_confirmed)))
ax1.set_xticklabels(top_confirmed['Location'], rotation=45, ha='right')

# Add value labels
for i, bar in enumerate(bars1):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
            f'{int(height):,}', ha='center', va='bottom', fontsize=10)

# Top 10 deaths
top_deaths = df.nlargest(10, 'Deaths')
bars2 = ax2.bar(range(len(top_deaths)), top_deaths['Deaths'], 
               color='crimson', alpha=0.7)
ax2.set_title('Top 10 Regions by Deaths', fontweight='bold', fontsize=14)
ax2.set_xlabel('Region')
ax2.set_ylabel('Deaths')
ax2.set_xticks(range(len(top_deaths)))
ax2.set_xticklabels(top_deaths['Location'], rotation=45, ha='right')

# Top 10 recovered
top_recovered = df.nlargest(10, 'Recovered')
bars3 = ax3.bar(range(len(top_recovered)), top_recovered['Recovered'], 
               color='forestgreen', alpha=0.7)
ax3.set_title('Top 10 Regions by Recovered Cases', fontweight='bold', fontsize=14)
ax3.set_xlabel('Region')
ax3.set_ylabel('Recovered Cases')
ax3.set_xticks(range(len(top_recovered)))
ax3.set_xticklabels(top_recovered['Location'], rotation=45, ha='right')

# Country-level comparison
country_summary = df.groupby('Country/Region').agg({
    'Confirmed': 'sum',
    'Deaths': 'sum', 
    'Recovered': 'sum'
}).reset_index()
top_countries = country_summary.nlargest(8, 'Confirmed')

x = np.arange(len(top_countries))
width = 0.25

ax4.bar(x - width, top_countries['Confirmed'], width, label='Confirmed', color='steelblue', alpha=0.7)
ax4.bar(x, top_countries['Deaths'], width, label='Deaths', color='crimson', alpha=0.7)
ax4.bar(x + width, top_countries['Recovered'], width, label='Recovered', color='forestgreen', alpha=0.7)

ax4.set_title('Top Countries - Confirmed, Deaths, Recovered', fontweight='bold', fontsize=14)
ax4.set_xlabel('Country')
ax4.set_ylabel('Cases')
ax4.set_xticks(x)
ax4.set_xticklabels(top_countries['Country/Region'], rotation=45, ha='right')
ax4.legend()

plt.tight_layout()
plt.show()

## Visualization 2: Scatter Plots - Relationships Between Variables

In [None]:
# Create scatter plots for correlation analysis
df_scatter = df[df['Confirmed'] > 0].copy()

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle('COVID-19 Correlation Analysis (Feb 9, 2020)', fontsize=16, fontweight='bold')

# Deaths vs Confirmed
ax1.scatter(df_scatter['Confirmed'], df_scatter['Deaths'], alpha=0.6, color='red', s=60)
ax1.set_xlabel('Confirmed Cases')
ax1.set_ylabel('Deaths')
ax1.set_title('Deaths vs Confirmed Cases (Log Scale)', fontweight='bold')
ax1.set_xscale('log')
ax1.set_yscale('log')

if len(df_scatter) > 1:
    corr_cd = np.corrcoef(df_scatter['Confirmed'], df_scatter['Deaths'])[0, 1]
    ax1.text(0.05, 0.95, f'Correlation: {corr_cd:.3f}', 
            transform=ax1.transAxes, bbox=dict(boxstyle="round", facecolor='wheat'))

# Recovered vs Confirmed
ax2.scatter(df_scatter['Confirmed'], df_scatter['Recovered'], alpha=0.6, color='green', s=60)
ax2.set_xlabel('Confirmed Cases')
ax2.set_ylabel('Recovered Cases')
ax2.set_title('Recovered vs Confirmed Cases (Log Scale)', fontweight='bold')
ax2.set_xscale('log')
ax2.set_yscale('log')

if len(df_scatter) > 1:
    corr_cr = np.corrcoef(df_scatter['Confirmed'], df_scatter['Recovered'])[0, 1]
    ax2.text(0.05, 0.95, f'Correlation: {corr_cr:.3f}', 
            transform=ax2.transAxes, bbox=dict(boxstyle="round", facecolor='wheat'))

# Mortality Rate vs Confirmed
ax3.scatter(df_scatter['Confirmed'], df_scatter['Mortality_Rate'], alpha=0.6, color='orange', s=60)
ax3.set_xlabel('Confirmed Cases')
ax3.set_ylabel('Mortality Rate (%)')
ax3.set_title('Mortality Rate vs Confirmed Cases (Log X-Axis)', fontweight='bold')
ax3.set_xscale('log')

# Recovery Rate vs Confirmed
ax4.scatter(df_scatter['Confirmed'], df_scatter['Recovery_Rate'], alpha=0.6, color='purple', s=60)
ax4.set_xlabel('Confirmed Cases')
ax4.set_ylabel('Recovery Rate (%)')
ax4.set_title('Recovery Rate vs Confirmed Cases (Log X-Axis)', fontweight='bold')
ax4.set_xscale('log')

plt.tight_layout()
plt.show()
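
The correlation annotated on the first two panels is Pearson's r on the raw counts, while the axes themselves are logarithmic. When one region is far larger than the rest, the raw-count value can be driven almost entirely by that single point. The sketch below uses made-up numbers (not the dataset) to compare it with two alternatives: Pearson on `log1p`-transformed counts, and a Spearman rank correlation computed as Pearson on the ranks.

```python
import numpy as np
import pandas as pd

# Made-up counts: one very large region and several small ones.
confirmed = pd.Series([30000, 1200, 900, 400, 150, 40, 12])
deaths = pd.Series([800, 6, 0, 2, 1, 0, 1])

# Pearson on raw counts, as in the scatter-plot cell.
pearson_raw = confirmed.corr(deaths)

# Pearson on log1p-transformed counts; log1p keeps zero-death rows defined.
pearson_log = np.log1p(confirmed).corr(np.log1p(deaths))

# Spearman rank correlation, computed as Pearson on the ranks.
spearman = confirmed.rank().corr(deaths.rank())

print(f"raw: {pearson_raw:.3f}, log1p: {pearson_log:.3f}, rank: {spearman:.3f}")
```

On data shaped like this, the raw value sits close to 1 while the other two are noticeably lower, so it may be worth reporting more than one measure alongside a log-scaled plot.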

## Visualization 3: Pie Charts - Regional Distribution

In [None]:
# Create pie charts for regional analysis
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('COVID-19 Regional Distribution (Feb 9, 2020)', fontsize=16, fontweight='bold')

# Top countries by confirmed cases
country_summary = df.groupby('Country/Region')['Confirmed'].sum().sort_values(ascending=False)
top_countries = country_summary.head(6)
others = country_summary.tail(-6).sum()

if others > 0:
    pie_data = list(top_countries) + [others]
    pie_labels = list(top_countries.index) + ['Others']
else:
    pie_data = list(top_countries)
    pie_labels = list(top_countries.index)

colors1 = plt.cm.Set3(np.linspace(0, 1, len(pie_data)))
ax1.pie(pie_data, labels=pie_labels, autopct='%1.1f%%', colors=colors1, startangle=90)
ax1.set_title('Confirmed Cases by Country', fontweight='bold')

# China's provinces
china_data = df[df['Country/Region'] == 'China']
if len(china_data) > 1:
    china_provinces = china_data.groupby('Province/State')['Confirmed'].sum().sort_values(ascending=False)
    top_provinces = china_provinces.head(5)
    others_china = china_provinces.tail(-5).sum()
    
    if others_china > 0:
        pie_data2 = list(top_provinces) + [others_china]
        pie_labels2 = list(top_provinces.index) + ['Other Provinces']
    else:
        pie_data2 = list(top_provinces)
        pie_labels2 = list(top_provinces.index)
        
    colors2 = plt.cm.Reds(np.linspace(0.3, 1, len(pie_data2)))
    ax2.pie(pie_data2, labels=pie_labels2, autopct='%1.1f%%', colors=colors2, startangle=90)
    ax2.set_title('China: Confirmed Cases by Province', fontweight='bold')

# Deaths by country
country_deaths = df.groupby('Country/Region')['Deaths'].sum().sort_values(ascending=False)
country_deaths = country_deaths[country_deaths > 0]

if len(country_deaths) > 0:
    colors3 = plt.cm.Reds(np.linspace(0.4, 1, len(country_deaths)))
    ax3.pie(country_deaths, labels=country_deaths.index, autopct='%1.1f%%', colors=colors3, startangle=90)
    ax3.set_title('Deaths by Country', fontweight='bold')

# Recoveries by country
country_recovered = df.groupby('Country/Region')['Recovered'].sum().sort_values(ascending=False)
country_recovered = country_recovered[country_recovered > 0]

if len(country_recovered) > 0:
    colors4 = plt.cm.Greens(np.linspace(0.4, 1, len(country_recovered)))
    ax4.pie(country_recovered, labels=country_recovered.index, autopct='%1.1f%%', colors=colors4, startangle=90)
    ax4.set_title('Recovered Cases by Country', fontweight='bold')

plt.tight_layout()
plt.show()
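
The first two pie charts share one pattern: keep the largest few categories and fold the rest into a single slice. A small helper could express that once. This is a sketch only; `top_n_with_others` is a hypothetical name and the totals are made up.

```python
import pandas as pd

def top_n_with_others(series, n, other_label='Others'):
    """Keep the n largest values and sum the remainder into one extra entry."""
    ordered = series.sort_values(ascending=False)
    top = ordered.iloc[:n]
    remainder = ordered.iloc[n:].sum()
    # Add the combined slice only when something is actually left over.
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top

# Made-up totals, for illustration only.
totals = pd.Series({'A': 500, 'B': 40, 'C': 30, 'D': 20, 'E': 10})
grouped = top_n_with_others(totals, n=3)
print(grouped)
```

The result can be passed to `ax.pie(grouped, labels=grouped.index, ...)` in place of the separate data and label lists.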

## Key Insights Analysis

In [None]:
# Generate comprehensive insights
total_confirmed = df['Confirmed'].sum()
total_deaths = df['Deaths'].sum()
total_recovered = df['Recovered'].sum()

print("=" * 60)
print("KEY INSIGHTS AND ANALYSIS")
print("=" * 60)

print("\n📊 GLOBAL SNAPSHOT (February 9, 2020)")
print(f"   Total Confirmed Cases: {total_confirmed:,.0f}")
print(f"   Total Deaths: {total_deaths:,.0f}")
print(f"   Total Recovered: {total_recovered:,.0f}")
print(f"   Global Mortality Rate: {(total_deaths/total_confirmed)*100:.2f}%")
print(f"   Global Recovery Rate: {(total_recovered/total_confirmed)*100:.2f}%")

# China's dominance
# Filter df directly: country_summary was reassigned to a Series in the
# pie-chart cell, so it no longer has a 'Country/Region' column.
china_cases = df.loc[df['Country/Region'] == 'China', 'Confirmed'].sum()
china_percentage = (china_cases / total_confirmed) * 100

print("\n🇨🇳 CHINA'S DOMINANCE IN EARLY PHASE")
print(f"   China's Cases: {china_cases:,.0f}")
print(f"   Percentage of Global Cases: {china_percentage:.1f}%")
print(f"   China clearly dominated the early pandemic with {china_percentage:.1f}% of all cases")

# Top affected regions
print("\n🔥 TOP 5 MOST AFFECTED REGIONS:")
top_regions = df.nlargest(5, 'Confirmed')
for i, region in enumerate(top_regions.itertuples(), 1):
    print(f"   {i}. {region.Location}: {region.Confirmed:,.0f} cases")

# Regional patterns (restricted to regions with at least 10 confirmed cases)
print("\n📈 REGIONAL MORTALITY AND RECOVERY PATTERNS:")
significant_regions = df[df['Confirmed'] >= 10].copy()

if len(significant_regions) > 0:
    print("   Regions with highest mortality rates (≥10 cases):")
    high_mortality = significant_regions.nlargest(3, 'Mortality_Rate')
    for region in high_mortality.itertuples():
        if region.Mortality_Rate > 0:
            print(f"     • {region.Location}: {region.Mortality_Rate:.1f}% ({region.Deaths:.0f}/{region.Confirmed:.0f})")

    print("\n   Regions with highest recovery rates (≥10 cases):")
    high_recovery = significant_regions.nlargest(3, 'Recovery_Rate')
    for region in high_recovery.itertuples():
        if region.Recovery_Rate > 0:
            print(f"     • {region.Location}: {region.Recovery_Rate:.1f}% ({region.Recovered:.0f}/{region.Confirmed:.0f})")

## Summary and Conclusions

Based on our analysis of the COVID-19 data from February 9, 2020:

### Key Findings:

1. **China's Early Dominance**: China accounted for over 99% of global cases, with Hubei province being the epicenter
2. **Regional Concentration**: The outbreak was heavily concentrated in Chinese provinces, particularly Hubei
3. **Mortality Patterns**: Overall mortality rate (total deaths divided by total confirmed cases) was around 2.3%, with regional variations
4. **Recovery Patterns**: Recovery rates (recovered divided by confirmed cases) varied significantly by region
5. **Global Spread**: While China dominated, cases were beginning to appear in other countries, signaling international spread
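
The overall mortality rate in finding 3 is a pooled figure (total deaths over total confirmed cases). That is not the same as averaging the per-region `Mortality_Rate` values, because small regions with few cases can swing an unweighted average. A minimal sketch with made-up numbers (not taken from the dataset) shows the difference:

```python
import pandas as pd

# Made-up regions: one large outbreak and two small ones.
regions = pd.DataFrame({
    'Confirmed': [1000, 10, 10],
    'Deaths': [30, 0, 1],
})
regions['Mortality_Rate'] = regions['Deaths'] / regions['Confirmed'] * 100

# Pooled rate: total deaths over total confirmed cases.
pooled_rate = regions['Deaths'].sum() / regions['Confirmed'].sum() * 100

# Unweighted mean of the per-region rates.
mean_of_rates = regions['Mortality_Rate'].mean()

print(f"pooled: {pooled_rate:.2f}%, mean of regional rates: {mean_of_rates:.2f}%")
```

Here the pooled rate is about 3.04% while the mean of the regional rates is about 4.33%, so the two should not be used interchangeably when comparing regions.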

### Insights:
- The pandemic was still in its early phase, with China as the primary affected country
- Mortality and recovery rates differed across regions
- The concentration of reported cases in a few regions points to the importance of early detection and reporting
- Regional differences may reflect local healthcare responses and reporting practices, though a single-day snapshot cannot separate these effects