# Notebook 1: Preparing the Dataset for Time Series Modeling

This notebook prepares the rainfall and temperature dataset for machine learning modeling.
The dataset starts in wide format: each row holds one region and one year, with the monthly values spread across columns.
We reshape it into long format, create lag features for rainfall and temperature, and export the cleaned dataset for modeling.

Because we build a separate forecast for each region, structuring each region's time series correctly is essential.

### Step 1: Load the dataset

In [1]:
import pandas as pd


df = pd.read_csv("../1_datasets/Final_dataset/final_merged_dataset.csv")
df.head()



Unnamed: 0,REGION,YEAR,JAN_RAIN,FEB_RAIN,MAR_RAIN,APR_RAIN,MAY_RAIN,JUN_RAIN,JUL_RAIN,AUG_RAIN,...,APR_TEMP,MAY_TEMP,JUN_TEMP,JUL_TEMP,AUG_TEMP,SEP_TEMP,OCT_TEMP,NOV_TEMP,DEC_TEMP,ANN_TEMP
0,Central,1990,0.0,0.0,0.0,0.0,0.002,0.026667,1.387333,0.429333,...,30.844667,33.960222,33.907,32.037444,32.395111,32.977667,32.436444,29.210333,27.665444,28.028111
1,Central,1991,0.0,0.0,0.0,0.047333,0.216,0.036667,0.713333,1.087333,...,33.137556,35.484,34.458778,32.991222,31.911333,33.060556,32.269778,27.483222,22.874889,28.717444
2,Central,1992,0.0,0.0,0.0,0.003,0.292333,0.145,1.118,2.190333,...,30.957444,32.862667,33.887667,32.620111,30.592222,31.677111,31.273667,26.608444,21.895444,26.553889
3,Central,1993,0.0,0.0,0.000333,0.119333,0.646333,0.173667,3.025,2.957667,...,30.436333,32.704222,32.951444,31.131556,30.119111,30.464778,30.612667,28.435111,24.960222,26.655667
4,Central,1994,0.0,0.0,0.0,0.0,0.389333,0.144667,2.592,1.556667,...,31.737556,33.064222,33.141778,30.683111,30.624778,31.021778,31.593778,25.942444,22.093889,27.520667


### Step 2: Reshape Dataset to Long Format

The current columns look like this:

JAN_RAIN, FEB_RAIN, ...
JAN_TEMP, FEB_TEMP, ...

We need one row per region, year, and month:

Region, Year, Month, Rainfall, Temperature

Lag features can only be computed once each observation sits on its own row in time order.

In [2]:
# Identify month names in order
months = ["JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"]

# Create Rainfall long format
rain_long = df.melt(
    id_vars=["REGION","YEAR"],
    value_vars=[m + "_RAIN" for m in months],
    var_name="Month_Rain",
    value_name="Rainfall"
)

# Create Temperature long format
temp_long = df.melt(
    id_vars=["REGION","YEAR"],
    value_vars=[m + "_TEMP" for m in months],
    var_name="Month_Temp",
    value_name="Temperature"
)

# Extract month from column names
rain_long["Month"] = rain_long["Month_Rain"].str.split("_").str[0]
temp_long["Month"] = temp_long["Month_Temp"].str.split("_").str[0]

# Merge on region, year and month
df_long = pd.merge(
    rain_long[["REGION","YEAR","Month","Rainfall"]],
    temp_long[["REGION","YEAR","Month","Temperature"]],
    on=["REGION","YEAR","Month"]
)

df_long.head()



Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature
0,Central,1990,JAN,0.0,23.057667
1,Central,1991,JAN,0.0,22.685222
2,Central,1992,JAN,0.0,21.373889
3,Central,1993,JAN,0.0,22.352778
4,Central,1994,JAN,0.0,25.334111
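
To make the reshape concrete, here is a minimal sketch on a toy frame with only two months. The values, the `toy_months` list, and the `melt_variable` helper are invented for illustration and are not part of the notebook's dataset or code. The `validate="one_to_one"` argument is an optional safeguard, assumed here, that makes `pd.merge` raise an error if any (region, year, month) key is duplicated.

```python
import pandas as pd

# Toy wide-format frame; all values are made up for illustration only.
toy = pd.DataFrame({
    "REGION": ["A", "A", "B"],
    "YEAR": [2000, 2001, 2000],
    "JAN_RAIN": [1.0, 2.0, 3.0],
    "FEB_RAIN": [4.0, 5.0, 6.0],
    "JAN_TEMP": [20.0, 21.0, 22.0],
    "FEB_TEMP": [23.0, 24.0, 25.0],
})
toy_months = ["JAN", "FEB"]  # hypothetical shortened month list


def melt_variable(frame, suffix, value_name):
    """Melt the <MONTH>_<suffix> columns into a single long column."""
    long = frame.melt(
        id_vars=["REGION", "YEAR"],
        value_vars=[f"{m}_{suffix}" for m in toy_months],
        var_name="col",
        value_name=value_name,
    )
    # "JAN_RAIN" -> "JAN"
    long["Month"] = long["col"].str.split("_").str[0]
    return long.drop(columns="col")


toy_long = pd.merge(
    melt_variable(toy, "RAIN", "Rainfall"),
    melt_variable(toy, "TEMP", "Temperature"),
    on=["REGION", "YEAR", "Month"],
    validate="one_to_one",  # raise if a key appears more than once
)

# Each wide row should expand into one long row per month.
assert len(toy_long) == len(toy) * len(toy_months)
```

The same row-count check (number of wide rows times 12) could be applied to `df_long` as a quick sanity test after the merge.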


### Step 3: Convert Month to Numerical Order and Sort

In [3]:
month_order = {
    "JAN":1,"FEB":2,"MAR":3,"APR":4,"MAY":5,"JUN":6,
    "JUL":7,"AUG":8,"SEP":9,"OCT":10,"NOV":11,"DEC":12
}

df_long["Month_Num"] = df_long["Month"].map(month_order)

df_long = df_long.sort_values(["REGION","YEAR","Month_Num"]).reset_index(drop=True)
df_long.head(14)


Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature,Month_Num
0,Central,1990,JAN,0.0,23.057667,1
1,Central,1990,FEB,0.0,22.018,2
2,Central,1990,MAR,0.0,25.144778,3
3,Central,1990,APR,0.0,30.844667,4
4,Central,1990,MAY,0.002,33.960222,5
5,Central,1990,JUN,0.026667,33.907,6
6,Central,1990,JUL,1.387333,32.037444,7
7,Central,1990,AUG,0.429333,32.395111,8
8,Central,1990,SEP,0.44,32.977667,9
9,Central,1990,OCT,0.387,32.436444,10
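
The numeric mapping matters because sorting on the month abbreviation itself would be alphabetical (APR, AUG, DEC, ...), not chronological. As an optional alternative, the sketch below builds a proper datetime column from the year and month number and sorts on that. The `Date` column and the toy values are assumptions for illustration; the notebook itself sorts on `YEAR` and `Month_Num`.

```python
import pandas as pd

# Toy long-format rows, deliberately out of order; values are illustrative.
toy = pd.DataFrame({
    "REGION": ["A", "A", "A"],
    "YEAR": [2000, 2000, 1999],
    "Month": ["FEB", "JAN", "DEC"],
})

month_order = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}
toy["Month_Num"] = toy["Month"].map(month_order)

# pd.to_datetime accepts a frame with "year", "month" and "day" columns.
toy["Date"] = pd.to_datetime(
    toy[["YEAR", "Month_Num"]]
    .rename(columns={"YEAR": "year", "Month_Num": "month"})
    .assign(day=1)  # first day of the month as a placeholder
)

toy = toy.sort_values(["REGION", "Date"]).reset_index(drop=True)
```

A datetime column is also convenient later for plotting or for splitting train and test sets by date.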


### Step 4: Create a Time Index

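A single integer time index per region gives the models one continuous, ordered time column. `cumcount` numbers the rows within each region from 0 in their sorted order, so the index matches calendar time only if no months are missing.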

In [4]:
df_long["Time"] = df_long.groupby("REGION").cumcount()
df_long.head()


Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature,Month_Num,Time
0,Central,1990,JAN,0.0,23.057667,1,0
1,Central,1990,FEB,0.0,22.018,2,1
2,Central,1990,MAR,0.0,25.144778,3,2
3,Central,1990,APR,0.0,30.844667,4,3
4,Central,1990,MAY,0.002,33.960222,5,4
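
Since `cumcount` simply counts rows, it may be worth confirming that the counter agrees with the calendar. The sketch below, on invented toy data where region B is missing one month, compares `Time` with the month offset computed from `YEAR` and `Month_Num`. The names `abs_month`, `expected`, and `gaps` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Toy data; region B skips month 3 on purpose.
toy = pd.DataFrame({
    "REGION": ["A"] * 3 + ["B"] * 3,
    "YEAR": [2000] * 6,
    "Month_Num": [1, 2, 3, 1, 2, 4],
})
toy["Time"] = toy.groupby("REGION").cumcount()

# Absolute month number, then offset from each region's first observation.
abs_month = toy["YEAR"] * 12 + toy["Month_Num"]
expected = abs_month - abs_month.groupby(toy["REGION"]).transform("min")

# Rows where the row counter and the calendar offset disagree.
gaps = toy[toy["Time"] != expected]
```

An empty `gaps` frame would suggest the series is complete; any rows in it mark the first observation after a missing month, where shifted lags would no longer line up with the intended calendar distance.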


### Step 5: Create Lag Features

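We generate lags of 1, 2, 3, and 12 months for both rainfall and temperature. The shift is applied within each region so that values from one region never fill another region's lags.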

In [5]:
lags = [1,2,3,12]

df_lags = df_long.copy()

for lag in lags:
    df_lags[f"Rain_lag_{lag}"] = df_lags.groupby("REGION")["Rainfall"].shift(lag)
    df_lags[f"Temp_lag_{lag}"] = df_lags.groupby("REGION")["Temperature"].shift(lag)

df_lags.head(15)


Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature,Month_Num,Time,Rain_lag_1,Temp_lag_1,Rain_lag_2,Temp_lag_2,Rain_lag_3,Temp_lag_3,Rain_lag_12,Temp_lag_12
0,Central,1990,JAN,0.0,23.057667,1,0,,,,,,,,
1,Central,1990,FEB,0.0,22.018,2,1,0.0,23.057667,,,,,,
2,Central,1990,MAR,0.0,25.144778,3,2,0.0,22.018,0.0,23.057667,,,,
3,Central,1990,APR,0.0,30.844667,4,3,0.0,25.144778,0.0,22.018,0.0,23.057667,,
4,Central,1990,MAY,0.002,33.960222,5,4,0.0,30.844667,0.0,25.144778,0.0,22.018,,
5,Central,1990,JUN,0.026667,33.907,6,5,0.002,33.960222,0.0,30.844667,0.0,25.144778,,
6,Central,1990,JUL,1.387333,32.037444,7,6,0.026667,33.907,0.002,33.960222,0.0,30.844667,,
7,Central,1990,AUG,0.429333,32.395111,8,7,1.387333,32.037444,0.026667,33.907,0.002,33.960222,,
8,Central,1990,SEP,0.44,32.977667,9,8,0.429333,32.395111,1.387333,32.037444,0.026667,33.907,,
9,Central,1990,OCT,0.387,32.436444,10,9,0.44,32.977667,0.429333,32.395111,1.387333,32.037444,,
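
The reason for shifting inside `groupby("REGION")` can be seen on a small example. This sketch uses invented values and two hypothetical column names, `lag1_grouped` and `lag1_naive`, to compare a per-region shift with a plain shift over the whole column.

```python
import pandas as pd

# Two regions stacked in one frame; values are illustrative.
toy = pd.DataFrame({
    "REGION": ["A", "A", "B", "B"],
    "Rainfall": [1.0, 2.0, 10.0, 20.0],
})

# Shift within each region: the first row of every region has no predecessor.
toy["lag1_grouped"] = toy.groupby("REGION")["Rainfall"].shift(1)

# Shift over the whole column: region B's first row inherits region A's last value.
toy["lag1_naive"] = toy["Rainfall"].shift(1)
```

In the naive version, the first row of region B receives region A's last value as its lag, which is exactly the leakage the grouped shift avoids.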


### Step 6: Drop Rows with Missing Lag Values

In [6]:
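# The first 12 months of each region have no lag-12 value; dropna() removes every row that contains a NaN
df_lags = df_lags.dropna().reset_index(drop=True)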
df_lags.head()


Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature,Month_Num,Time,Rain_lag_1,Temp_lag_1,Rain_lag_2,Temp_lag_2,Rain_lag_3,Temp_lag_3,Rain_lag_12,Temp_lag_12
0,Central,1991,JAN,0.0,22.685222,1,12,0.0,27.665444,0.0,29.210333,0.387,32.436444,0.0,23.057667
1,Central,1991,FEB,0.0,24.895333,2,13,0.0,22.685222,0.0,27.665444,0.0,29.210333,0.0,22.018
2,Central,1991,MAR,0.0,27.808889,3,14,0.0,24.895333,0.0,22.685222,0.0,27.665444,0.0,25.144778
3,Central,1991,APR,0.047333,33.137556,4,15,0.0,27.808889,0.0,24.895333,0.0,22.685222,0.0,30.844667
4,Central,1991,MAY,0.216,35.484,5,16,0.047333,33.137556,0.0,27.808889,0.0,24.895333,0.002,33.960222
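
If the original columns contain no missing values, each region should lose exactly as many rows as the largest lag (12 in this notebook). The sketch below illustrates that bookkeeping on toy data with a largest lag of 2. The frame and the names `before`, `after`, and `dropped` are assumptions for illustration.

```python
import pandas as pd

# Toy data: two regions with four observations each; values are illustrative.
toy = pd.DataFrame({
    "REGION": ["A"] * 4 + ["B"] * 4,
    "Rainfall": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

toy_lags = [1, 2]  # shortened lag list for the example
for lag in toy_lags:
    toy[f"Rain_lag_{lag}"] = toy.groupby("REGION")["Rainfall"].shift(lag)

before = toy.groupby("REGION").size()
after = toy.dropna().groupby("REGION").size()

# Rows removed per region; expected to equal max(toy_lags) when no other NaNs exist.
dropped = before - after
```

A larger count for some region would indicate NaNs in the original rainfall or temperature columns, which `dropna()` also removes.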


### Step 7: Save the Processed Dataset

In [7]:
df_lags.to_csv("../4_data_analysis/model_datasets/model_ready_dataset.csv", index=False)
df_lags.head()


Unnamed: 0,REGION,YEAR,Month,Rainfall,Temperature,Month_Num,Time,Rain_lag_1,Temp_lag_1,Rain_lag_2,Temp_lag_2,Rain_lag_3,Temp_lag_3,Rain_lag_12,Temp_lag_12
0,Central,1991,JAN,0.0,22.685222,1,12,0.0,27.665444,0.0,29.210333,0.387,32.436444,0.0,23.057667
1,Central,1991,FEB,0.0,24.895333,2,13,0.0,22.685222,0.0,27.665444,0.0,29.210333,0.0,22.018
2,Central,1991,MAR,0.0,27.808889,3,14,0.0,24.895333,0.0,22.685222,0.0,27.665444,0.0,25.144778
3,Central,1991,APR,0.047333,33.137556,4,15,0.0,27.808889,0.0,24.895333,0.0,22.685222,0.0,30.844667
4,Central,1991,MAY,0.216,35.484,5,16,0.047333,33.137556,0.0,27.808889,0.0,24.895333,0.002,33.960222
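
Two practical details of the export are sketched below on a toy frame written to a temporary directory. First, `to_csv` does not create missing folders, so the parent directory is created beforehand. Second, `index=False` keeps the row index out of the file; without it, reloading with `read_csv` produces an extra `Unnamed: 0` column. The paths and frame here are hypothetical and only stand in for the notebook's output location.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed dataset; values are illustrative.
toy = pd.DataFrame({"REGION": ["A"], "YEAR": [2000], "Rainfall": [1.5]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "model_datasets"  # hypothetical output folder
    out_dir.mkdir(parents=True, exist_ok=True)  # create it if it does not exist

    # Saved without the index, as in the notebook.
    toy.to_csv(out_dir / "toy.csv", index=False)
    reloaded = pd.read_csv(out_dir / "toy.csv")

    # Saved with the default index=True for comparison.
    toy.to_csv(out_dir / "toy_with_index.csv")
    reloaded_with_index = pd.read_csv(out_dir / "toy_with_index.csv")

columns_without_index = list(reloaded.columns)
columns_with_index = list(reloaded_with_index.columns)
```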
