### **Purpose**

Evaluate the predictive performance of a COVID-19 case forecasting model using regression metrics (MAE and R²), to judge its reliability for public health trend monitoring.

### **1. Import libraries**

In [12]:
import pandas as pd
import numpy as np
import joblib

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### **2. Load dataset**

In [13]:
df = pd.read_csv("/content/sample_data/India_COVID19_Statewise_TimeSeries_Analytics_2021.csv")

In [17]:
df.head(1)

| Unnamed: 0 | Date | State_UT | Population | New_Cases | New_Deaths | New_Recoveries | Total_Cases | Total_Deaths | Total_Recoveries | Active_Cases |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | Andaman and Nicobar | 57755036 | 477 | 10 | 395 | 477 | 10 | 395 | 72 |


In [14]:
# Convert Date
df["Date"] = pd.to_datetime(df["Date"])

In [15]:
# Sort for time consistency
df = df.sort_values(["State_UT", "Date"])

### **3. Feature Engineering (Lag-Based Predictors)**

Since COVID data is time-dependent, lag features were created to capture temporal patterns.

In [26]:
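# Creating lag features: New_Cases from 1 row and 7 rows earlier
# Note: shift() runs over the whole frame, so the first rows of each state
# take their lag values from the end of the previous state's series.
df["Lag_1_New_Cases"] = df["New_Cases"].shift(1)
df["Lag_7_New_Cases"] = df["New_Cases"].shift(7)

# Remove rows with null values, including those created by shifting
df = df.dropna()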
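Because the frame holds one time series per state, a per-state shift keeps lag values from crossing state boundaries. The following is a minimal sketch on a made-up toy frame (the values and the `Lag_1_global` / `Lag_1_by_state` column names are illustrative assumptions, not part of the dataset). It contrasts a plain `shift` with a grouped one.

```python
import pandas as pd

# Toy frame with two states, three days each (made-up numbers, not from the dataset).
toy = pd.DataFrame({
    "State_UT": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "New_Cases": [10, 20, 30, 100, 200, 300],
}).sort_values(["State_UT", "Date"])

# Plain shift: the first row of state B inherits the last value of state A.
toy["Lag_1_global"] = toy["New_Cases"].shift(1)

# Grouped shift: each state's first row has no predecessor, so it stays NaN.
toy["Lag_1_by_state"] = toy.groupby("State_UT")["New_Cases"].shift(1)

# Drop only the rows where the grouped lag is missing.
toy_clean = toy.dropna(subset=["Lag_1_by_state"])
print(toy)
```

Passing `subset=` to `dropna` limits the filter to the lag column, so rows are not discarded because of unrelated missing values elsewhere in the frame.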



### **4. Feature & Target Definition**

The two lag columns are the predictors (X), and the same-day New_Cases count is the target (y).

In [27]:
# Features
X = df[["Lag_1_New_Cases", "Lag_7_New_Cases"]]

# Target variable
y = df["New_Cases"]


### **5. Time-Aware Train-Test Split**

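For time-series modelling, shuffling must be avoided, so shuffle=False is used. Because the frame is sorted by State_UT and then Date, this holds out the last 20% of rows in that order (the alphabetically last states) rather than the most recent dates of every state.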

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    shuffle=False
)
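An alternative that keeps every state in both sets is to split on a date cutoff. The sketch below runs on a made-up toy frame; the 80% cutoff rule and the names `unique_dates`, `cutoff`, `train`, and `test` are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy frame: 2 states x 10 days (made-up values, not from the dataset).
dates = pd.date_range("2021-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "State_UT": ["A"] * 10 + ["B"] * 10,
    "Date": list(dates) * 2,
    "New_Cases": list(range(20)),
}).sort_values(["State_UT", "Date"])

# Hypothetical cutoff: the first date after 80% of the distinct dates.
unique_dates = toy["Date"].drop_duplicates().sort_values()
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

# Everything before the cutoff trains; the cutoff date and later is held out.
train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]
print(len(train), len(test))
```

With this split, every test row is later in time than every training row, and both states appear in each set.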

### **6. Model Selection & Training**

Random Forest Regressor was selected due to:

• Ability to capture non-linear relationships

• Robustness to noise

• Strong baseline performance

In [29]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)

### **7. Model Evaluation**

In [30]:
from sklearn.metrics import mean_absolute_error, r2_score

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("R2 Score:", r2)

Mean Absolute Error: 20.139215191655772
R2 Score: -0.3214624453113517
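To put an MAE figure in context, it can help to compare it with a persistence forecast, which predicts each day's count as the previous day's value. The sketch below uses made-up counts; the names `cases`, `y_persist`, and `baseline_mae` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up daily counts for illustration only.
cases = np.array([100.0, 110.0, 105.0, 120.0, 130.0])

# Persistence forecast: each day's prediction is the previous day's observed value.
y_true = cases[1:]
y_persist = cases[:-1]

baseline_mae = mean_absolute_error(y_true, y_persist)
print("Persistence MAE:", baseline_mae)
```

In this notebook, the analogous comparison would be `mean_absolute_error(y_test, X_test["Lag_1_New_Cases"])`, since the lag-1 column is itself a persistence forecast.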


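Evaluation Metrics:

• MAE → Average absolute difference between predicted and actual New_Cases, in cases per row

• R² → Proportion of the target's variance explained by the model

The R² here is negative (-0.32), which means the model predicts the test set worse than a constant equal to the test-set mean would.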
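Both metrics can be reproduced by hand, which also shows how R² can fall below zero. This is a minimal sketch on made-up values (not the notebook's test set); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only (not from the notebook's test set).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([40.0, 30.0, 20.0, 10.0])  # deliberately poor predictions

# MAE: mean of the absolute residuals.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, r2_manual)
print(mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Whenever the squared error of the predictions exceeds the squared deviation of the targets from their own mean, the ratio is greater than 1 and R² becomes negative.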

### **8. Model Serialization (Production Readiness)**

To make the model reusable without retraining, the fitted estimator was serialized with pickle to a .pkl file.

In [31]:
import pickle

with open("covid_forecasting_model.pkl", "wb") as file:
    pickle.dump(model, file)

### **9. Model Loading (Deployment Simulation)**

In [32]:
with open("covid_forecasting_model.pkl", "rb") as file:
    loaded_model = pickle.load(file)

predictions = loaded_model.predict(X_test)

This round trip confirms that:

• The model can be reused without retraining

• The saved file can be loaded by an API or deployment service
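One way to check that a reloaded model behaves like the original is to compare their predictions directly. The sketch below uses synthetic data and an in-memory round trip (`pickle.dumps` / `pickle.loads`) instead of a file; `X_demo`, `y_demo`, `demo_model`, and `restored` are hypothetical names, not objects from the notebook.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 2))
y_demo = X_demo[:, 0] * 3 + X_demo[:, 1]

demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

# Serialize to bytes and back (in-memory equivalent of the file round trip).
restored = pickle.loads(pickle.dumps(demo_model))

# Matching predictions indicate the fitted state survived serialization.
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match:", same)
```

In the notebook, the same check would compare `predictions` with `y_pred`. Note that unpickling can execute arbitrary code, so only pickle files from a trusted source should be loaded.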