# Project 10: COVID-19 Cases Time Series Analysis

This notebook performs a time-series analysis of the COVID-19 pandemic using the confirmed-cases dataset from Johns Hopkins University (JHU). The goal is to load, process, and visualize the data to understand how the pandemic progressed over time, identify major waves, and compare trends across countries.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

## 2. Data Loading and Preprocessing

In [None]:
# Load the dataset
try:
    df = pd.read_csv('data/time_series_covid19_confirmed_global.csv')
    print("Data loaded successfully.")
except FileNotFoundError:
    # Re-raise with guidance: every later cell depends on df being defined.
    raise FileNotFoundError("Data file not found. Please download the JHU dataset and place it in the 'data/' directory.")

df.head()

In [None]:
# Drop unnecessary columns
df_cleaned = df.drop(columns=['Province/State', 'Lat', 'Long'])

# Melt the dataframe to convert it from wide to long format
df_long = df_cleaned.melt(id_vars=['Country/Region'], var_name='Date', value_name='Cumulative Cases')

# Convert 'Date' column to datetime objects (JHU column headers use month/day/two-digit year, e.g. 1/22/20)
df_long['Date'] = pd.to_datetime(df_long['Date'], format='%m/%d/%y')

df_long.head()
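
To make the wide-to-long reshape concrete, here is a minimal sketch on a made-up two-region table. The `toy_wide` and `toy_long` names and all values are illustrative assumptions, not JHU data.

```python
import pandas as pd

# Hypothetical miniature of the JHU layout: one row per region, one column per date.
toy_wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# melt() stacks the date columns into (Date, Cumulative Cases) pairs.
toy_long = toy_wide.melt(id_vars=['Country/Region'], var_name='Date', value_name='Cumulative Cases')

# The headers are month/day/two-digit year, so an explicit format avoids guessing.
toy_long['Date'] = pd.to_datetime(toy_long['Date'], format='%m/%d/%y')

print(toy_long)
```

Two regions and two date columns become four rows, one per region and date.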

### 2.1 Aggregating Global Data

In [None]:
# Group by date to get total global cases
global_cases = df_long.groupby('Date')['Cumulative Cases'].sum().reset_index()

# Calculate daily new cases
global_cases['New Cases'] = global_cases['Cumulative Cases'].diff().fillna(0)

# Calculate 7-day rolling average of new cases
global_cases['7-Day Rolling Avg'] = global_cases['New Cases'].rolling(window=7).mean()

global_cases.tail()
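
The two derived columns can be checked by hand on a short series. The sketch below uses invented cumulative counts (the `cumulative` values are an assumption for illustration) to show what `diff()` and a 7-day `rolling()` window return.

```python
import pandas as pd

# Hypothetical cumulative counts for eight consecutive days.
cumulative = pd.Series([0, 10, 30, 60, 100, 150, 210, 280])

# diff() turns the running total into daily increments; the first day has no predecessor.
new_cases = cumulative.diff().fillna(0)

# A 7-day window needs seven observations, so the first six results are NaN.
rolling_avg = new_cases.rolling(window=7).mean()

print(new_cases.tolist())
print(rolling_avg.tolist())
```

If the leading NaN values are unwanted, passing `min_periods=1` to `rolling()` averages over however many days are available so far.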

## 3. Global Trend Visualization

In [None]:
# Plot cumulative global cases
plt.figure(figsize=(12, 6))
plt.plot(global_cases['Date'], global_cases['Cumulative Cases'], label='Cumulative Cases')
plt.title('Global Cumulative COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Cases')
plt.legend()
plt.show()

In [None]:
# Plot daily new cases and the 7-day rolling average
plt.figure(figsize=(12, 6))
plt.bar(global_cases['Date'], global_cases['New Cases'], label='Daily New Cases', color='lightgray')
plt.plot(global_cases['Date'], global_cases['7-Day Rolling Avg'], label='7-Day Rolling Average', color='red')
plt.title('Global Daily New COVID-19 Cases and 7-Day Rolling Average')
plt.xlabel('Date')
plt.ylabel('Number of New Cases')
plt.legend()
plt.show()

The 7-day rolling average smooths out day-to-day noise, including weekly reporting cycles, which makes the major waves of the pandemic easier to see.
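
One reason a 7-day window suits daily case counts is that each window contains every weekday exactly once, so a repeating weekly reporting pattern cancels out. The sketch below illustrates this on synthetic data; the `weekly_pattern` values are an assumption chosen only to mimic lower weekend reporting.

```python
import numpy as np
import pandas as pd

# Hypothetical reporting pattern (Monday to Sunday), repeated for four weeks.
weekly_pattern = np.array([120, 110, 100, 100, 90, 40, 30])
daily = pd.Series(
    np.tile(weekly_pattern, 4),
    index=pd.date_range('2021-01-04', periods=28, freq='D'),
)

# Every 7-day window covers each weekday once, so the weekly cycle averages out.
smoothed = daily.rolling(window=7).mean().dropna()

# The raw series varies a lot; the smoothed series is flat up to rounding error.
print(daily.std(), smoothed.std())
```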

## 4. Country-Specific Analysis

In [None]:
# Group by country and date
country_cases = df_long.groupby(['Country/Region', 'Date'])['Cumulative Cases'].sum().reset_index()

# Calculate new cases per country
country_cases['New Cases'] = country_cases.groupby('Country/Region')['Cumulative Cases'].diff().fillna(0)

# Calculate 7-day rolling average per country
country_cases['7-Day Rolling Avg'] = country_cases.groupby('Country/Region')['New Cases'].transform(lambda x: x.rolling(7).mean())

countries_to_compare = ['US', 'India', 'Brazil', 'United Kingdom']
comparison_df = country_cases[country_cases['Country/Region'].isin(countries_to_compare)]
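
Grouping by country before calling `diff()` matters because the long table stacks countries one after another. A minimal sketch with two invented countries (the `toy` frame is an assumption for illustration) shows what goes wrong without the grouping.

```python
import pandas as pd

# Hypothetical long-format table with two countries stacked one after the other.
toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Cumulative Cases': [100, 150, 210, 5, 8, 12],
})

# A plain diff subtracts A's last total from B's first, giving a spurious negative value.
naive = toy['Cumulative Cases'].diff().fillna(0)

# A grouped diff restarts at each country boundary.
grouped = toy.groupby('Country/Region')['Cumulative Cases'].diff().fillna(0)

print(naive.tolist())
print(grouped.tolist())
```

The same reasoning applies to the rolling average: computing it per group keeps one country's final days out of the next country's first window.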

In [None]:
# Plot the comparison
plt.figure(figsize=(14, 8))
sns.lineplot(data=comparison_df, x='Date', y='7-Day Rolling Avg', hue='Country/Region')
plt.title('COVID-19 New Cases (7-Day Rolling Average) by Country')
plt.xlabel('Date')
plt.ylabel('7-Day Avg. New Cases')
plt.legend(title='Country')
plt.show()

## 5. Conclusion

This analysis provided a clear visual overview of the COVID-19 pandemic's progression. Key takeaways include:

1.  **Global Waves:** The 7-day rolling average made the major global waves of infection visible as distinct peaks in the daily new-case series.
2.  **Country-Specific Trajectories:** The pandemic unfolded differently across countries, with nations experiencing major waves at different times and scales, as seen in the comparison plot.
3.  **Data Preprocessing:** The initial data was in a 'wide' format, and a significant part of the work involved preprocessing it into a 'long' time-series format suitable for analysis and visualization with tools like Pandas and Matplotlib.
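
The notebook identifies waves visually. As a possible follow-up, peaks in a smoothed series could also be located programmatically. The sketch below is one simple approach under stated assumptions: the two-hump `smoothed` series is synthetic, and the threshold of 500 is an arbitrary illustrative choice, not a value derived from the JHU data.

```python
import numpy as np
import pandas as pd

# Hypothetical smoothed series with two humps standing in for two waves.
days = np.arange(200)
smoothed = pd.Series(
    1000 * np.exp(-((days - 50) / 15) ** 2) + 1500 * np.exp(-((days - 140) / 20) ** 2)
)

# A point is a local maximum if it exceeds both neighbours; the threshold drops minor bumps.
is_peak = (
    (smoothed > smoothed.shift(1))
    & (smoothed > smoothed.shift(-1))
    & (smoothed > 500)
)
peak_days = smoothed.index[is_peak].tolist()

print(peak_days)
```

On real data the smoothed curve is less regular, so a minimum distance between peaks or a prominence criterion would likely be needed as well.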