# COVID-19 Malaysia Data Preprocessing & Splitting

This notebook loads Malaysia’s COVID-19 datasets (cases, deaths, hospital, ICU, tests, vaccination), merges them on `date`, adds 7-day and 14-day lag features, filters to a fixed date range, shuffles the rows, and splits the result into **5 training sets** and **1 testing set**.


In [None]:
# ========================
# Import libraries
# ========================
import pandas as pd
import os


## Load CSV Files

We load each dataset from a CSV file in the working directory and parse the `date` column as datetime.


In [None]:
cases = pd.read_csv("cases_malaysia.csv", parse_dates=["date"])
deaths = pd.read_csv("deaths_malaysia.csv", parse_dates=["date"])
hospital = pd.read_csv("hospital.csv", parse_dates=["date"])
icu = pd.read_csv("icu.csv", parse_dates=["date"])
tests = pd.read_csv("tests_malaysia.csv", parse_dates=["date"])
vax = pd.read_csv("vax_malaysia.csv", parse_dates=["date"])
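
As a quick check of what `parse_dates=["date"]` does, here is a minimal sketch that reads a small in-memory CSV instead of the real files. The two rows are invented for illustration and are not taken from the actual datasets.

```python
import io

import pandas as pd

# Invented two-row CSV standing in for a file such as cases_malaysia.csv
csv_text = "date,cases_new\n2021-07-15,100\n2021-07-16,120\n"

toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# With parse_dates, the column is a datetime dtype rather than plain strings
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])
print(toy.dtypes)
print(is_datetime)
```

A datetime `date` column is what lets the later steps merge, sort, and filter by date reliably.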


## Select Useful Features

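We keep only the relevant columns from each dataset. The `hospital` and `icu` tables are summed per date so that each has a single row per day, and the `rtk-ag` column in `tests` is renamed to `rtk_ag`.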


In [None]:
cases = cases[["date", "cases_new", "cases_active"]]
deaths = deaths[["date", "deaths_new"]]
hospital = hospital.groupby("date", as_index=False)[["admitted_covid", "hosp_covid"]].sum()
icu = icu.groupby("date", as_index=False)[["icu_covid", "vent_covid"]].sum()
tests = tests.rename(columns={"rtk-ag": "rtk_ag"})[["date", "rtk_ag", "pcr"]]
vax = vax[[
    "date", "daily_partial", "daily_full", "daily_booster",
    "cumul_partial", "cumul_full", "cumul_booster"
]]
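
The `groupby("date")` plus `.sum()` step on `hospital` and `icu` suggests those files hold several rows per date. The sketch below assumes one row per state, which the notebook does not state explicitly; the `state` column and all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical per-state rows; values are invented for illustration only
toy_hospital = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-15", "2021-07-15", "2021-07-16", "2021-07-16"]),
    "state": ["Johor", "Selangor", "Johor", "Selangor"],
    "admitted_covid": [10, 30, 12, 28],
    "hosp_covid": [100, 300, 110, 290],
})

# Same aggregation as the notebook: one summed row per date
national = toy_hospital.groupby("date", as_index=False)[["admitted_covid", "hosp_covid"]].sum()
print(national)
```

After aggregation the table has one row per date, which is what the later merge on `date` relies on.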


## Merge Datasets

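We left-join every dataset onto `cases` by `date`. The merged DataFrame keeps every date present in `cases`; a date missing from another dataset gets `NaN` in that dataset's columns.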


In [None]:
df = cases.merge(deaths, on="date", how="left")
df = df.merge(hospital, on="date", how="left")
df = df.merge(icu, on="date", how="left")
df = df.merge(tests, on="date", how="left")
df = df.merge(vax, on="date", how="left")
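
A left join silently duplicates rows if the right-hand table has repeated dates. As a hedged sketch (not part of the original pipeline), pandas' `validate="one_to_one"` argument to `merge` can guard against that. The toy frames below are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-15", "2021-07-16"]),
    "cases_new": [100, 120],
})
# Right table that is missing 2021-07-16
right_ok = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-15"]),
    "deaths_new": [5],
})
# Right table with a duplicated date
right_dup = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-15", "2021-07-15"]),
    "deaths_new": [5, 6],
})

# Clean case: row count is preserved, the missing date becomes NaN
merged = left.merge(right_ok, on="date", how="left", validate="one_to_one")

# Duplicate case: validate raises instead of duplicating rows
try:
    left.merge(right_dup, on="date", how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print(duplicate_detected)
```

The `NaN` produced in the clean case is the kind of value that the later `dropna()` call removes.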


## Add Lag Features

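We add **7-day** and **14-day lag features** for trend analysis. Each lag is a row shift, so it equals a 7- or 14-day lag only when the rows are in date order with one row per day.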


In [None]:
lag_features = [
    "cases_new", "cases_active", "admitted_covid", "hosp_covid",
    "icu_covid", "vent_covid", "rtk_ag", "pcr",
    "daily_partial", "daily_full", "daily_booster",
    "cumul_partial", "cumul_full", "cumul_booster"
]

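# Lags are row shifts, so make sure rows are in date order first
df = df.sort_values("date").reset_index(drop=True)

for col in lag_features:
    df[f"{col}_lag7"] = df[col].shift(7)
    df[f"{col}_lag14"] = df[col].shift(14)

# Drop rows with any NaN: the first 14 rows (from lagging) and
# any dates missing from one of the merged datasets
df = df.dropna().reset_index(drop=True)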
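
To make the lag step concrete, here is a minimal sketch on a made-up 10-day series. It also includes a simple gap check, which is an addition for illustration and not part of the notebook.

```python
import pandas as pd

# Invented daily series: cases_new is simply 0..9
toy = pd.DataFrame({
    "date": pd.date_range("2021-07-01", periods=10, freq="D"),
    "cases_new": range(10),
})
toy = toy.sort_values("date").reset_index(drop=True)

# shift(7) moves values down by 7 rows, leaving NaN in the first 7 rows
toy["cases_new_lag7"] = toy["cases_new"].shift(7)

# A row shift is a 7-day lag only if consecutive rows are exactly 1 day apart
has_no_gaps = bool(toy["date"].diff().dropna().eq(pd.Timedelta(days=1)).all())

print(toy)
print(has_no_gaps)
```

The row for 2021-07-08 ends up holding the value from 2021-07-01, and the first seven rows are the ones a subsequent `dropna()` would discard.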


## Filter Date Range

We keep only data between **15 July 2021** and **15 January 2022**.


In [None]:
start_date = "2021-07-15"
end_date   = "2022-01-15"

df = df[(df["date"] >= start_date) & (df["date"] <= end_date)].reset_index(drop=True)
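
Both bounds of the filter are inclusive. The sketch below applies the same window to a synthetic daily calendar (not the real data) using `Series.between`, an equivalent alternative to the two comparisons, and counts the days it covers.

```python
import pandas as pd

start_date = "2021-07-15"
end_date = "2022-01-15"

# Synthetic calendar that extends past both ends of the window
calendar = pd.DataFrame({"date": pd.date_range("2021-07-01", "2022-01-31", freq="D")})

# between() includes both endpoints by default, matching >= and <=
window = calendar[calendar["date"].between(start_date, end_date)].reset_index(drop=True)

print(len(window))  # 185
```

So the filtered dataset has at most 185 rows; it has fewer if any date in the window was dropped earlier because of missing values.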


## Shuffle Rows

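We shuffle the rows to remove any chronological ordering, using a fixed `random_state` so the shuffle is reproducible. The lag features were computed before this step, so they are unaffected.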


In [None]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
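
A small sketch of why `random_state=42` matters: shuffling the same toy frame twice with the same seed gives the same order, and no rows are lost or duplicated. The `day` column is a made-up stand-in for the real rows.

```python
import pandas as pd

toy = pd.DataFrame({"day": range(10)})

# Two shuffles with the same seed produce identical row orders
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

same_order = shuffled_a.equals(shuffled_b)
same_rows = sorted(shuffled_a["day"]) == list(range(10))

print(same_order, same_rows)
```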


## Split into 6 Parts

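We divide the dataset into **6 parts**. The first five have `n // 6` rows each, and the last takes all remaining rows, so it can be slightly larger: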
- **5 parts for training**
- **1 part for testing**


In [None]:
n = len(df)
split_size = n // 6

splits = [df.iloc[i*split_size:(i+1)*split_size] for i in range(5)]
splits.append(df.iloc[5*split_size:])  # last split takes remainder
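
As a worked example of the split sizes, the sketch below assumes the filtered data has 185 rows, that is, one row for every day from 2021-07-15 to 2022-01-15. The real count may be lower if rows were dropped for missing values.

```python
import pandas as pd

# Assumed row count: one row per day in the filtered window
toy = pd.DataFrame({"row": range(185)})

n = len(toy)
split_size = n // 6  # 185 // 6 = 30

splits = [toy.iloc[i * split_size:(i + 1) * split_size] for i in range(5)]
splits.append(toy.iloc[5 * split_size:])  # last split takes the remainder

sizes = [len(s) for s in splits]
print(sizes)  # [30, 30, 30, 30, 30, 35]
```

Under that assumption the training sets hold 150 rows in total and the testing set holds 35.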


## Save to CSV Files

We save:
- `set_1.csv` to `set_5.csv` → inside **training_set/**
- `set_6.csv` → inside **testing_set/**


In [None]:
os.makedirs("training_set", exist_ok=True)
os.makedirs("testing_set", exist_ok=True)

for i, split in enumerate(splits, 1):
    if i < 6:
        split.to_csv(f"training_set/set_{i}.csv", index=False)
    else:
        split.to_csv(f"testing_set/set_{i}.csv", index=False)

print("✅ Data successfully split and saved!")
print(f"Training sets: {split_size*5} rows, Testing set: {len(splits[-1])} rows")
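
One way to confirm the files were written as intended is to read them back and count rows. The sketch below does this with toy splits in a temporary directory so it does not touch the real output folders; the folder and file names mirror the ones used above.

```python
import os
import tempfile

import pandas as pd

# Six toy splits of 3 rows each, standing in for the real ones
toy_splits = [pd.DataFrame({"x": range(3)}) for _ in range(6)]
row_counts = {}

with tempfile.TemporaryDirectory() as root:
    for folder in ("training_set", "testing_set"):
        os.makedirs(os.path.join(root, folder), exist_ok=True)

    # Same naming scheme as the notebook: sets 1-5 train, set 6 test
    for i, split in enumerate(toy_splits, 1):
        folder = "training_set" if i < 6 else "testing_set"
        split.to_csv(os.path.join(root, folder, f"set_{i}.csv"), index=False)

    # Read every file back and record its row count
    for folder in ("training_set", "testing_set"):
        for name in sorted(os.listdir(os.path.join(root, folder))):
            path = os.path.join(root, folder, name)
            row_counts[f"{folder}/{name}"] = len(pd.read_csv(path))

print(row_counts)
```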
