# ARTI406 - Machine Learning
# Assignment 1: Exploratory Data Analysis (EDA)

## Dataset: Global COVID-19 Statistics – February 2026 Snapshot

### Dataset Description
This dataset provides a point-in-time view of COVID-19 metrics across **238 countries and territories** as of February 16, 2026. It includes key indicators such as:
- Confirmed cases and deaths (total and per million population)
- Active cases and new daily cases/deaths
- Testing data (total tests and tests per million)
- Continent classification and population figures

**Source:** Extracted via the `covid-193.p.rapidapi.com` API (api-sports.io) on February 16, 2026.

**Use Case:** This dataset suits cross-sectional analysis of regional patterns in COVID-19 spread, mortality, and testing disparities across the globe. Because it is a single-day snapshot, it cannot show how these metrics changed over time.

---
## Step 1: Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look cleaner
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)

print("Libraries imported successfully!")

---
## Step 2: Load Dataset

In [None]:
# Load Dataset
df = pd.read_csv("covid19_global_statistics_2026.csv")

# Display first 5 rows
df.head()

In [None]:
# Display last 5 rows
df.tail()

---
## Step 3: Basic Dataset Information

### Number of Rows and Columns

In [None]:
# Shape of the dataset
print("Shape (rows, columns):", df.shape)
print("Number of rows:       ", df.shape[0])
print("Number of columns:    ", df.shape[1])

### Data Types of Columns

In [None]:
# Data types
df.dtypes

In [None]:
# Convert 'date' to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

print("Updated data types:")
df.dtypes

---
## Step 4: Check Missing Values

In [None]:
# Count missing values per column
missing = df.isna().sum()
missing_pct = (df.isna().mean() * 100).round(2)

missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %', ascending=False)
print(missing_df)

In [None]:
# Visualize missing values
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Values Heatmap (Yellow = Missing)")
plt.tight_layout()
plt.show()

print("\nObservation: Several columns have substantial missing data — especially 'new_cases', 'new_deaths',")
print("'total_tests', and 'tests_per_million'. This reflects inconsistent reporting across countries.")
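
Whether the gaps are concentrated in particular regions can be checked directly rather than assumed. The following is a minimal sketch, using a made-up toy frame in place of `df` (the values are illustrative and not taken from the dataset), that computes the share of missing values per column within each continent:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "continent":   ["Africa", "Africa", "Europe", "Europe"],
    "total_tests": [np.nan, np.nan, 5_000_000, np.nan],
    "new_cases":   [np.nan, 12, 30, 45],
})

# Flag missing cells, then average the flags within each continent.
missing_flags = toy.drop(columns="continent").isna()
missing_share = (missing_flags.groupby(toy["continent"]).mean() * 100).round(1)

print(missing_share)
```

Running the same two lines on `df` instead of `toy` would give one row per continent and one column per field, which makes any regional pattern in the missing data visible.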

---
## Step 5: Check Duplicate Rows

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

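if duplicates == 0:
    print("No fully identical rows found.")

# Identical-row checks do not prove each country appears once, so check the key column too
print("Repeated country names:", df['country'].duplicated().sum())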
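
Rows can also be redundant without being duplicates. The sketch below assumes the export might contain aggregate rows, such as a world total labelled `All` or continent totals whose `country` equals their `continent`. Whether this file actually contains such rows is not confirmed here, and the `split_aggregates` helper, its `extra_labels` parameter, and the toy data are all hypothetical. If such rows exist, they would be double-counted in continent sums and would crowd the "top 15" rankings.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "country":     ["All", "Europe", "France", "Germany", "Kenya"],
    "continent":   ["All", "Europe", "Europe", "Europe", "Africa"],
    "total_cases": [1000, 700, 400, 300, 50],
})


def split_aggregates(frame, extra_labels=("All",)):
    """Separate rows that look like aggregates from country-level rows.

    A row is flagged when its country equals its continent, or when the
    country name is one of `extra_labels` (an assumed label list).
    """
    is_aggregate = (
        frame["country"].eq(frame["continent"])
        | frame["country"].isin(extra_labels)
    )
    return frame[~is_aggregate].copy(), frame[is_aggregate].copy()


countries, aggregates = split_aggregates(toy)

print("Aggregate-looking rows:", aggregates["country"].tolist())
print("Country-level rows:    ", countries["country"].tolist())
```

If the check finds nothing on the real data, the dataframe can be used as is; otherwise the country-level frame is the safer input for grouped sums and rankings.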

---
## Step 6: Descriptive Statistical Summary

In [None]:
# Statistical summary for all columns
df.describe(include='all')

In [None]:
# Summary of numeric columns only
numeric_cols = ['total_cases', 'total_deaths', 'active_cases', 'cases_per_million',
                'deaths_per_million', 'total_tests', 'tests_per_million', 'population']

df[numeric_cols].describe().round(2)

**Key Observations from Statistical Summary:**
- `total_cases` ranges from a few hundred to hundreds of millions, an enormous spread across countries
- `deaths_per_million` is widely spread, which may reflect differences in demographics, healthcare capacity, and death reporting
- `tests_per_million` varies widely, consistent with testing disparities between wealthier and lower-income countries
- Standard deviations far above the means indicate strong right skew, with a few countries with very large counts (such as the USA and India) acting as outliers
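
The skew described above can be quantified instead of read off the `describe()` table. The sketch below uses invented values and a hypothetical `skew_summary` helper. It reports the mean, median, coefficient of variation (standard deviation divided by mean), and sample skewness for each column:

```python
import pandas as pd

# Toy stand-in for one numeric column; values are invented for illustration.
toy = pd.DataFrame({"total_cases": [500, 2_000, 8_000, 40_000, 1_000_000]})


def skew_summary(frame, columns):
    """Return mean, median, std, skewness and coefficient of variation per column."""
    stats = frame[columns].agg(["mean", "median", "std", "skew"]).T
    stats["cv"] = stats["std"] / stats["mean"]
    return stats


summary = skew_summary(toy, ["total_cases"])
print(summary.round(2))
```

A mean far above the median, a coefficient of variation above 1, and a large positive skewness all point to the same thing: a long right tail driven by a few very large values.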

---
## Step 7: Univariate Analysis

### Distribution of Total Cases

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['total_cases'].dropna(), bins=40, kde=True, color='steelblue')
plt.title("Distribution of Total Cases")
plt.xlabel("Total Cases")
plt.ylabel("Number of Countries")
plt.tight_layout()
plt.show()

print("Observation: Strongly right-skewed — most countries have relatively low total cases,")
print("while a few large countries (e.g., USA, India) drive extreme values.")

### Distribution of Cases Per Million (Normalized)

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['cases_per_million'].dropna(), bins=40, kde=True, color='darkorange')
plt.title("Distribution of Cases Per Million")
plt.xlabel("Cases Per Million")
plt.ylabel("Number of Countries")
plt.tight_layout()
plt.show()

print("Observation: Even after normalizing by population, the distribution is still right-skewed.")
print("Small island territories with high infection rates contribute to the right tail.")

### Distribution of Deaths Per Million

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['deaths_per_million'].dropna(), bins=40, kde=True, color='crimson')
plt.title("Distribution of Deaths Per Million")
plt.xlabel("Deaths Per Million")
plt.ylabel("Number of Countries")
plt.tight_layout()
plt.show()

### Boxplot for Outlier Detection

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

sns.boxplot(y=df['cases_per_million'], ax=axes[0], color='steelblue')
axes[0].set_title("Cases Per Million")

sns.boxplot(y=df['deaths_per_million'], ax=axes[1], color='crimson')
axes[1].set_title("Deaths Per Million")

sns.boxplot(y=df['tests_per_million'], ax=axes[2], color='seagreen')
axes[2].set_title("Tests Per Million")

plt.suptitle("Boxplots – Outlier Detection", fontsize=14)
plt.tight_layout()
plt.show()

print("Observation: All three metrics show significant outliers above the upper fence.")
print("These are typically small island territories or countries with very high testing/reporting rates.")
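
The "upper fence" mentioned in the observation is the usual boxplot rule, Q3 + 1.5 × IQR. As a minimal sketch, with invented values and a hypothetical `upper_fence_outliers` helper, the points a boxplot draws above the whisker can be listed explicitly:

```python
import numpy as np
import pandas as pd


def upper_fence_outliers(series, k=1.5):
    """Return (values above Q3 + k * IQR, the fence itself)."""
    values = series.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    fence = q3 + k * (q3 - q1)
    return values[values > fence], fence


# Toy stand-in for a per-million column; values are invented for illustration.
toy = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 1000, np.nan],
                name="cases_per_million")

outliers, fence = upper_fence_outliers(toy)
print(f"Upper fence: {fence}")
print("Values above the fence:", outliers.tolist())
```

Applied to `df['cases_per_million']` and the other two columns, this gives the number of flagged rows per metric. Indexing `df` with the returned index then shows which countries they are.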

---
## Step 8: Bivariate Analysis

### Total Cases by Continent

In [None]:
continent_cases = df.groupby('continent')['total_cases'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
continent_cases.plot(kind='bar', color=sns.color_palette('Set2', len(continent_cases)))
plt.title("Total COVID-19 Cases by Continent")
plt.ylabel("Total Cases")
plt.xlabel("Continent")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(continent_cases)

### Total Deaths by Continent

In [None]:
continent_deaths = df.groupby('continent')['total_deaths'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
continent_deaths.plot(kind='bar', color=sns.color_palette('Reds_r', len(continent_deaths)))
plt.title("Total COVID-19 Deaths by Continent")
plt.ylabel("Total Deaths")
plt.xlabel("Continent")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(continent_deaths)

### Top 15 Countries by Total Cases

In [None]:
top15_cases = df.nlargest(15, 'total_cases')[['country', 'total_cases', 'continent']]

plt.figure(figsize=(12, 6))
sns.barplot(data=top15_cases, x='total_cases', y='country', palette='Blues_r')
plt.title("Top 15 Countries by Total COVID-19 Cases")
plt.xlabel("Total Cases")
plt.ylabel("Country")
plt.tight_layout()
plt.show()

print(top15_cases.to_string(index=False))

### Top 15 Countries by Deaths Per Million (Mortality Rate)

In [None]:
top15_deaths = df.nlargest(15, 'deaths_per_million')[['country', 'deaths_per_million', 'continent']]

plt.figure(figsize=(12, 6))
sns.barplot(data=top15_deaths, x='deaths_per_million', y='country', palette='Reds_r')
plt.title("Top 15 Countries by Deaths Per Million")
plt.xlabel("Deaths Per Million")
plt.ylabel("Country")
plt.tight_layout()
plt.show()

print(top15_deaths.to_string(index=False))

### Cases Per Million vs Deaths Per Million (Scatter Plot)

In [None]:
df_clean = df.dropna(subset=['cases_per_million', 'deaths_per_million'])

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_clean, x='cases_per_million', y='deaths_per_million',
                hue='continent', alpha=0.7, s=80)
plt.title("Cases Per Million vs Deaths Per Million by Continent")
plt.xlabel("Cases Per Million")
plt.ylabel("Deaths Per Million")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

print("Observation: There is a positive relationship between cases and deaths per million.")
print("European countries tend to cluster at higher deaths per million (match the color in the legend).")

### Testing Coverage vs Cases Per Million

In [None]:
df_test = df.dropna(subset=['tests_per_million', 'cases_per_million'])

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_test, x='tests_per_million', y='cases_per_million',
                hue='continent', alpha=0.7, s=80)
plt.title("Tests Per Million vs Cases Per Million")
plt.xlabel("Tests Per Million")
plt.ylabel("Cases Per Million")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

print("Observation: Countries with more testing tend to report more cases. This likely reflects better detection,")
print("not necessarily a higher infection rate. Low-testing countries may significantly undercount cases.")

---
## Step 9: Correlation Matrix

In [None]:
corr_cols = ['total_cases', 'total_deaths', 'active_cases', 'cases_per_million',
             'deaths_per_million', 'total_tests', 'tests_per_million', 'population']

corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            linewidths=0.5, square=True)
plt.title("Correlation Matrix – COVID-19 Metrics")
plt.tight_layout()
plt.show()

print("Key Correlations:")
print("- total_cases and total_deaths: strong positive correlation (expected)")
print("- total_cases and total_tests: positive (more testing → more detected cases)")
print("- population and total_cases: moderate positive (larger countries have more total cases)")
print("- cases_per_million and deaths_per_million: positive but weaker than raw totals")
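
`DataFrame.corr()` defaults to Pearson correlation, which is sensitive to the extreme values seen in the earlier boxplots. As a hedged illustration on invented data, the sketch below compares Pearson with the rank-based Spearman coefficient. Spearman is a swapped-in alternative here, not the method used in this notebook:

```python
import pandas as pd

# Toy data: cases rise steadily with tests, but one row has an extreme test count.
toy = pd.DataFrame({
    "tests_per_million": [10, 20, 30, 40, 5_000],
    "cases_per_million": [1, 8, 27, 64, 70],
})

pearson = toy.corr(method="pearson").loc["tests_per_million", "cases_per_million"]
spearman = toy.corr(method="spearman").loc["tests_per_million", "cases_per_million"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Because the toy relationship is perfectly monotonic, Spearman is 1 while Pearson is pulled down by the single extreme row. Running both on `df[corr_cols]` shows which correlations in the heatmap depend on a few large countries.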

---
## Step 10: Continent-Level Analysis (Grouped Comparison)

In [None]:
# Average cases and deaths per million by continent
continent_avg = df.groupby('continent')[['cases_per_million', 'deaths_per_million', 'tests_per_million']].mean().round(0)
print("Average Metrics by Continent:")
print(continent_avg.sort_values('cases_per_million', ascending=False))

In [None]:
# Boxplot: Cases per million across continents
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='continent', y='cases_per_million', palette='Set3')
plt.title("Cases Per Million Across Continents")
plt.xlabel("Continent")
plt.ylabel("Cases Per Million")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

In [None]:
# Boxplot: Deaths per million across continents
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='continent', y='deaths_per_million', palette='Set1')
plt.title("Deaths Per Million Across Continents")
plt.xlabel("Continent")
plt.ylabel("Deaths Per Million")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

print("Observation: Europe shows notably higher median deaths per million compared to Africa and Asia.")
print("This could be due to older populations, early pandemic impact, or better death reporting.")
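
The observation refers to medians, while the table at the start of this step reports means. With skewed data the two can differ a lot, so a side-by-side comparison is useful. A minimal sketch on invented values:

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "continent":          ["Europe", "Europe", "Europe", "Africa", "Africa", "Africa"],
    "deaths_per_million": [100, 200, 3000, 10, 20, 30],
})

# Mean and median per continent in one table.
by_continent = toy.groupby("continent")["deaths_per_million"].agg(["mean", "median"])
print(by_continent)
```

In the toy data a single high value lifts the European mean well above its median, while the African mean and median coincide. The same `agg(["mean", "median"])` call on `df` shows where continent averages are driven by outliers.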

---
## Step 11: Case Fatality Rate (CFR) Analysis

In [None]:
# Calculate Case Fatality Rate (CFR)
df['case_fatality_rate'] = (df['total_deaths'] / df['total_cases'] * 100).round(3)

# Top 15 countries by CFR (countries with at least 1000 cases)
df_cfr = df[df['total_cases'] >= 1000].nlargest(15, 'case_fatality_rate')

plt.figure(figsize=(12, 6))
sns.barplot(data=df_cfr, x='case_fatality_rate', y='country', palette='YlOrRd_r')
plt.title("Top 15 Countries by Case Fatality Rate (CFR %) — min. 1000 cases")
plt.xlabel("Case Fatality Rate (%)")
plt.ylabel("Country")
plt.tight_layout()
plt.show()

print(df_cfr[['country', 'total_cases', 'total_deaths', 'case_fatality_rate']].to_string(index=False))

In [None]:
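# Global CFR, using only rows where both totals are reported
cfr_rows = df.dropna(subset=['total_cases', 'total_deaths'])
global_cfr = cfr_rows['total_deaths'].sum() / cfr_rows['total_cases'].sum() * 100
print(f"Global Case Fatality Rate: {global_cfr:.3f}%")

# CFR by continent from summed totals (avoids groupby.apply, which warns in recent pandas)
continent_totals = cfr_rows.groupby('continent')[['total_deaths', 'total_cases']].sum()
cfr_continent = (continent_totals['total_deaths'] / continent_totals['total_cases'] * 100
                 ).round(3).sort_values(ascending=False)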

print("\nCase Fatality Rate by Continent:")
print(cfr_continent)
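
The global and continent figures above pool deaths and cases before dividing, which is not the same as averaging the per-country `case_fatality_rate` values. A small sketch with two invented countries shows how far apart the two can be:

```python
import pandas as pd

# Two invented countries: a small one with a high CFR and a large one with a low CFR.
toy = pd.DataFrame({
    "country":      ["A", "B"],
    "total_cases":  [100, 10_000],
    "total_deaths": [10, 10],
})

# Pooled CFR: sum first, then divide, so large countries carry more weight.
pooled_cfr = toy["total_deaths"].sum() / toy["total_cases"].sum() * 100

# Mean of per-country CFRs: divide first, then average, so every country counts equally.
mean_country_cfr = (toy["total_deaths"] / toy["total_cases"] * 100).mean()

print(f"Pooled CFR:           {pooled_cfr:.3f}%")
print(f"Mean per-country CFR: {mean_country_cfr:.3f}%")
```

Here the pooled value is about 0.198% and the mean of per-country values is 5.050%. Which one is appropriate depends on whether the question is about the typical case or the typical country.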

---
## Step 12: Testing Disparity Analysis

In [None]:
# Average tests per million by continent
testing_by_continent = df.groupby('continent')['tests_per_million'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
testing_by_continent.plot(kind='bar', color=sns.color_palette('Greens_r', len(testing_by_continent)))
plt.title("Average Tests Per Million by Continent")
plt.ylabel("Tests Per Million")
plt.xlabel("Continent")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

print(testing_by_continent.round(0))

In [None]:
# Top 15 most tested countries (per million)
top_tested = df.dropna(subset=['tests_per_million']).nlargest(15, 'tests_per_million')[['country', 'continent', 'tests_per_million']]

plt.figure(figsize=(12, 6))
sns.barplot(data=top_tested, x='tests_per_million', y='country', palette='Greens_r')
plt.title("Top 15 Countries by Tests Per Million")
plt.xlabel("Tests Per Million")
plt.ylabel("Country")
plt.tight_layout()
plt.show()

print(top_tested.to_string(index=False))

---
## Step 13: Pie Chart – Share of Global Cases by Continent

In [None]:
continent_share = df.groupby('continent')['total_cases'].sum()

plt.figure(figsize=(8, 8))
plt.pie(continent_share, labels=continent_share.index, autopct='%1.1f%%',
        colors=sns.color_palette('Set2', len(continent_share)), startangle=140)
plt.title("Share of Global COVID-19 Cases by Continent")
plt.tight_layout()
plt.show()

---
## Step 14: Summary of EDA Findings

### Key Insights from Exploratory Data Analysis

**1. Data Quality Issues:**
- Multiple columns have significant missing data (`new_cases`, `new_deaths`, `total_tests`, `tests_per_million`)
- Missing data appears more prevalent in African and smaller island nations, which may reflect weaker reporting infrastructure
- No fully duplicated rows were found

**2. Distribution & Outliers:**
- Total cases and deaths are heavily right-skewed: a few large nations (USA, India, France) dominate raw counts
- Per-million metrics reduce size bias but still show significant outliers (small territories with high infection rates)

**3. Continental Patterns:**
- **Europe and North America** have the highest cases and deaths per million
- **Africa** has the lowest reported cases and deaths per million — likely reflecting underreporting due to limited testing
- **Asia and South America** are in the middle range

**4. Mortality (CFR):**
- The global Case Fatality Rate is relatively low, but some countries show much higher CFRs, which can reflect limited testing (mild cases go uncounted) as well as weaker healthcare
- Europe has a higher CFR than Asia, possibly due to older populations and early pandemic impact

**5. Testing Disparity:**
- Wealthier continents (Europe, North America) conducted significantly more tests per million
- Low-testing countries likely undercount cases, making direct comparisons unreliable without adjustment

**6. Correlation:**
- Strong positive correlation between `total_cases` and `total_deaths` (r ≈ 0.9+)
- Population correlates positively with raw totals (moderately with `total_cases`) but not with normalized rates
- More testing correlates with more detected cases — detection bias must be considered in modeling

### Implications for Machine Learning
- Before modeling, missing values must be handled (imputation or removal)
- Outlier treatment or log-transformation is needed for skewed numeric features
- Per-million metrics are better features than raw totals to avoid population bias
- Testing coverage (`tests_per_million`) should be considered as a confounder in any predictive model
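
The first two points above can be sketched as one small preprocessing step. This is an illustration under stated assumptions, not a prescribed pipeline: the `prepare_features` helper is hypothetical, the data are invented, and median imputation plus `log1p` is only one of several reasonable choices. In a real model the median should be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "cases_per_million": [1_000.0, 50_000.0, np.nan, 400_000.0],
    "tests_per_million": [20_000.0, np.nan, 900_000.0, 3_000_000.0],
})


def prepare_features(frame, columns):
    """Median-impute each column, keep a missingness flag, and log-transform."""
    out = pd.DataFrame(index=frame.index)
    for col in columns:
        out[f"{col}_was_missing"] = frame[col].isna()
        filled = frame[col].fillna(frame[col].median())
        out[f"log_{col}"] = np.log1p(filled)  # log(1 + x) is defined at zero
    return out


features = prepare_features(toy, ["cases_per_million", "tests_per_million"])
print(features.round(3))
```

Keeping the `_was_missing` flags lets a model distinguish reported values from imputed ones, which matters here because missingness itself may be related to reporting capacity.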