### **COVID-19 End-to-End Analysis (Time-Series + Predictive Modeling)**

### Summary – End-to-End Data Science Project 

This day focuses on completing a full, end-to-end data science project using a real dataset.  
The workflow includes data cleaning, exploratory data analysis, visualizations, feature preparation, model training, evaluation, and final reporting.  

---

**Goal:** Explore COVID-19 trends, clean the dataset, perform EDA, engineer new features, and build a simple predictive model for future confirmed cases.


---
#### Dataset: covid_19_clean_complete.csv 
**Author:** Saba Arif



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load and Inspect Data

### What I Did
- Loaded the COVID dataset and reviewed columns, shape, and first rows.

### What the Output Shows
- Confirms dataset structure and available features.

### Quick Interpretation
- Dataset includes daily cumulative counts for each country/region.


In [None]:
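df = pd.read_csv("datasets/covid_19_clean_complete.csv")
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # available features
df.head()                   # last expression in the cell, so the notebook renders it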


## Clean Column Names & Parse Dates

### What I Did
- Standardized column names to lowercase.
- Converted the date column to datetime.
- Removed rows with missing date values.

### What the Output Shows
- Cleaned dataset with consistent structure.

### Insights
- Date handling is essential for time-series analysis.


In [None]:
df.columns = [c.lower().strip() for c in df.columns]
df['date'] = pd.to_datetime(df['date'], errors='coerce')

df = df.dropna(subset=['date'])
df.info()
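
To show why `errors='coerce'` is paired with `dropna`, here is a minimal sketch on made-up rows. The `toy` frame and its values are hypothetical and not taken from the dataset. Unparseable strings become `NaT` and are then removed.

```python
import pandas as pd

# Hypothetical toy rows; the second date string is deliberately invalid.
toy = pd.DataFrame({
    "date": ["2020-01-22", "not a date", "2020-01-24"],
    "confirmed": [1, 2, 3],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
n_bad = int(toy["date"].isna().sum())

# Drop the rows whose date could not be parsed.
toy_clean = toy.dropna(subset=["date"])
print(n_bad, len(toy_clean))
```

Counting the `NaT` values before dropping them is a cheap way to confirm that only a few rows are lost to parsing problems.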


## Aggregate Data by Country and Date

### What I Did
- Grouped by date and country/region.
- Calculated total confirmed, deaths, recovered, and active per day.

### What the Output Shows
- A consolidated dataset ideal for trend analysis and modeling.

### Insights
- Aggregation collapses province-level rows into a single row per country per date, which reveals national-level trends.


In [None]:
df_country = (
    df.groupby(['date', 'country/region'])[['confirmed','deaths','recovered','active']]
      .sum()
      .reset_index()
)

df_country.head()
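
As a small illustration of what the aggregation does, the sketch below uses invented rows in which one country is reported as two provinces on the same date. All names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows: country "A" is split into two provinces on one date.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-01"]),
    "country/region": ["A", "A", "B"],
    "province/state": ["North", "South", None],
    "confirmed": [10, 5, 100],
})

# Same pattern as the notebook: sum the counts per date and country.
toy_country = (
    toy.groupby(["date", "country/region"])[["confirmed"]]
       .sum()
       .reset_index()
)
print(toy_country)
```

Note that `groupby` drops rows whose grouping key is missing by default, so a missing country name would silently remove that row from the totals.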


## Feature Engineering: New Cases and New Deaths

### What I Did
- Sorted values by country and date.
- Created:
  - `new_confirmed` = day-to-day difference
  - `new_deaths` = day-to-day difference

### What the Output Shows
- Daily metrics required for meaningful EDA and modeling.

### Insights
- Daily differences expose changes in the growth rate that cumulative totals, dominated by their upward trend, tend to hide.


In [None]:
df_country = df_country.sort_values(['country/region', 'date'])

df_country['new_confirmed'] = df_country.groupby('country/region')['confirmed'].diff()
df_country['new_deaths'] = df_country.groupby('country/region')['deaths'].diff()

df_country.head()
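
The grouped `diff` leaves a `NaN` on each country's first day, and a downward correction in the cumulative counts would show up as a negative daily value. The sketch below demonstrates both on invented numbers. Filling with 0 and clipping at 0 is an assumption made here for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical cumulative counts for two countries; "A" has a downward
# correction on its third day (15 -> 14).
toy = pd.DataFrame({
    "country/region": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "confirmed": [10, 15, 14, 100, 100, 130],
}).sort_values(["country/region", "date"])

# Day-to-day difference within each country, as in the notebook.
raw_diff = toy.groupby("country/region")["confirmed"].diff()

# Assumption: treat the unknown first-day value as 0 and clip negatives to 0.
toy["new_confirmed"] = raw_diff.fillna(0).clip(lower=0)
print(toy)
```

Whether to clip, keep, or flag negative values depends on how the corrections should be interpreted, so this is a modeling choice rather than a fixed rule.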


## Monthly Confirmed Cases Trend

### What I Did
- Filtered for a primary country (e.g., US).
- Resampled daily new confirmed cases to monthly sums.

### What the Output Shows
- Long-term COVID-19 waves by month.

### Quick Interpretation
- Helps identify peak periods and decline phases.


In [None]:
country = "US"
df_us = df_country[df_country['country/region'] == country].copy()
df_us = df_us.set_index('date')

monthly_confirmed = df_us['new_confirmed'].resample('M').sum()

plt.figure(figsize=(12,5))
monthly_confirmed.plot(marker='o')
plt.title(f"Monthly New Confirmed Cases — {country}")
plt.ylabel("New Cases")
plt.grid(True)
plt.show()
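
Recent pandas releases emit a deprecation warning for the `'M'` resampling alias in favor of `'ME'`. One version-independent way to get the same monthly totals is to group on the calendar month of the index, sketched here on invented daily values.

```python
import pandas as pd

# Hypothetical daily new-case counts spanning a month boundary
# (Jan 30, Jan 31, Feb 1, Feb 2).
idx = pd.date_range("2020-01-30", periods=4, freq="D")
daily = pd.Series([5, 7, 2, 4], index=idx, name="new_confirmed")

# Group by calendar month instead of relying on a resample alias.
monthly = daily.groupby(daily.index.to_period("M")).sum()
print(monthly)
```

The result is indexed by month periods rather than month-end timestamps. The first and last months of a series may be partial, so their totals are not directly comparable with full months.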


## Top Countries by Total Confirmed Cases

### What I Did
- Aggregated maximum confirmed cases per country.
- Created a bar chart of the top 10 countries.

### What the Output Shows
- Global comparison of total case counts.

### Insights
- Highlights most impacted countries.


In [None]:
top_confirmed = (
    df_country.groupby('country/region')['confirmed']
    .max()
    .sort_values(ascending=False)
    .head(10)
)

top_confirmed.plot(kind='bar', figsize=(10,5))
plt.title("Top 10 Countries by Confirmed Cases")
plt.ylabel("Confirmed Cases")
plt.show()


## Modeling: Predict Next-Day Confirmed Cases

### What I Did
- Created lag features (previous day's values).
- Selected features and target.
- Split into train/test.

### What the Output Shows
- Model-ready dataset with lag-based features.

### Insights
- Lag features capture temporal relationships in the trend.


In [None]:
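# Lag features: the previous day's cumulative values
df_us['confirmed_lag1'] = df_us['confirmed'].shift(1)
df_us['deaths_lag1'] = df_us['deaths'].shift(1)

# Drop only the rows where a lag feature is missing (the first day)
df_us_model = df_us.dropna(subset=['confirmed_lag1', 'deaths_lag1']).copy()

X = df_us_model[['confirmed_lag1', 'deaths_lag1']]
y = df_us_model['confirmed']  # target: same-day cumulative confirmed count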

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
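
The lag and split steps can be sketched on a tiny invented series. `add_lags` is a hypothetical helper introduced for this illustration, and the manual slice only mirrors the effect of `shuffle=False`, which holds out the most recent rows.

```python
import pandas as pd

def add_lags(frame, column, lags):
    """Hypothetical helper: return a copy with `<column>_lag<k>` columns added."""
    out = frame.copy()
    for k in lags:
        out[f"{column}_lag{k}"] = out[column].shift(k)
    return out

# Hypothetical cumulative series on five consecutive days.
toy = pd.DataFrame(
    {"confirmed": [1, 3, 6, 10, 15]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Rows without a full set of lags (the first two days) are dropped.
toy_lagged = add_lags(toy, "confirmed", lags=[1, 2]).dropna()

# Chronological split: earlier rows train, the most recent rows test.
split = int(len(toy_lagged) * 0.8)
train, test = toy_lagged.iloc[:split], toy_lagged.iloc[split:]
print(train)
print(test)
```

Keeping the split chronological matters because a shuffled split would let the model train on days that come after the ones it is tested on.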


## Model Training and Evaluation

### What I Did
- Trained a linear regression model.
- Evaluated using R² score and mean absolute error (MAE).

### What the Output Shows
- Model performance on predicting the next day's cumulative confirmed count.

### Insights
- Indicates whether the model captures temporal trends sufficiently.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("R² Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
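
A useful reference point for these metrics is a naive persistence baseline, which predicts that tomorrow's cumulative count equals today's. The sketch below computes MAE and R² for that baseline on an invented series, using plain NumPy. The `mae` and `r2` helpers and the numbers are hypothetical.

```python
import numpy as np

def mae(y_true, y_hat):
    """Mean absolute error."""
    y_true, y_hat = np.asarray(y_true, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y_true - y_hat)))

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_hat = np.asarray(y_true, float), np.asarray(y_hat, float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical cumulative confirmed counts on six consecutive days.
cumulative = np.array([100, 110, 125, 145, 170, 200], dtype=float)

y_true = cumulative[1:]    # the value to predict
y_naive = cumulative[:-1]  # persistence: previous day's value

baseline_mae = mae(y_true, y_naive)
baseline_r2 = r2(y_true, y_naive)
print(baseline_mae, baseline_r2)
```

With the notebook's own variables, the equivalent check would be `mean_absolute_error(y_test, X_test['confirmed_lag1'])`. If the regression does not clearly beat that number, the lag features are adding little beyond carrying the last value forward.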


## Final Insights & Next Steps

### Key Findings
- Monthly confirmed trends clearly show major waves of COVID-19 in the selected country.
- Daily new cases and new deaths provide a realistic view of pandemic progression.
- Top-country comparison helps position national trends globally.

### Model Performance
- The model achieved an R² of 0.986 on the held-out test period. Because the target is the cumulative confirmed count, which is strongly correlated with its previous-day value, a high R² is expected and does not by itself demonstrate forecasting skill.
- The MAE of ~60k is substantial in absolute terms and should be compared against a naive "tomorrow equals today" baseline. It points to the limitations of simple lag-based regression for forecasting.
- The simple regression model captures the overall trend but misses sudden spikes.
- Possible improvements: rolling averages, more lag features, moving-window regression.

### Next Steps
- Add mobility data or vaccination data.
- Build a forecasting model using ARIMA or LSTM.
