# Model is Overfitting

## Key Issues Observed:

* Training loss is decreasing (from 138.5 to 2.1) but validation loss is increasing (from 4.6 to 14.7) - this is a clear sign of overfitting
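* Very low accuracy values (around 0.0002 for training and 0 for validation); accuracy is a classification metric and is not meaningful for a continuous regression target, so judge the model by loss and RMSE instead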
* Increasing gap between training and validation RMSE

## Suggested Measures:

1. **Regularization Techniques:**
* Add dropout layers (start with 0.2-0.3)
* Add L1/L2 regularization to your layers
* Try batch normalization

2. **Data Preprocessing:**
* Normalize/standardize your input data
* Check for and handle outliers
* Ensure proper train/validation split methodology for time series (maintain temporal order)
* Consider differencing or detrending if your data has strong trends

3. **Model Architecture:**
* Reduce model complexity if you have a deep network
* If using LSTM/GRU, adjust the number of units
* Try simpler architectures first (fewer layers)
* Consider adding skip connections

4. **Training Process:**
* Implement early stopping (your model starts overfitting after epoch 4)
* Reduce the learning rate (try a learning-rate scheduler)
* Experiment with different batch sizes
* Use cross-validation adapted for time series (e.g., walk-forward splits that always validate on data later than the training window)

5. **Feature Engineering:**
* Add relevant lag features
* Include domain-specific features
* Consider adding cyclical encodings for seasonal data
* Create rolling statistics features
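
The early-stopping suggestion can be illustrated with a small, framework-free sketch. The helper `early_stopping_epoch` and the intermediate validation-loss values are hypothetical (only the endpoints 4.6 and 14.7 echo the figures above); in Keras the same behavior is normally obtained with the `EarlyStopping` callback rather than hand-written code.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses) - 1, best_epoch


# Illustrative validation losses only (not taken from the actual run)
val_losses = [4.6, 4.1, 3.9, 3.8, 5.2, 7.9, 11.0, 14.7]
result = early_stopping_epoch(val_losses, patience=3)
print(result)  # (6, 3): stop at epoch 6, keep the weights from epoch 3
```

Restoring the weights from `best_epoch` matters as much as stopping: without it, the model keeps the parameters from the last, already overfitted, epoch.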

In [1]:
import os
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1.0 Data Ingestion

In [2]:
df = pd.read_csv('../artifacts/dataset/01-hourly_historical_analyzed_data.csv')
df = df.drop(columns=['hour', 'day', 'month', 'year'])

df1 = df.copy()
df1.index = pd.to_datetime(df1['time'], format='%Y-%m-%d %H:%M:%S')

rain = df1['rain']
rain_df = pd.DataFrame({'Rain': rain})

rain_df[rain_df['Rain'] != 0]['Rain']

rain_df['Seconds'] = rain_df.index.map(pd.Timestamp.timestamp)

# 2.0 Data Pre-processing

## 2.1 Convert: Time stamp -> Cyclic Signals

### Summary:

The cyclical component of a time series is a long-term variation in data that repeats in a systematic way over time. It's characterized by rises and falls that are not fixed in period, and are usually at least two years in duration. The cyclical component is often represented by a wave-shaped curve that shows alternating periods of expansion and contraction. 

Here are some examples of cyclical components in time series:
* **Canadian lynx data:** This data shows population cycles of about 10 years. 
* **Stock market:** The stock market cycles between periods of high and low values, but there is no set amount of time between those fluctuations. 
* **Home prices:** There is a cyclical effect due to the market, but there is also a seasonal effect because most people would rather move in the summer.

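Cyclical components are different from seasonal components, which are variations that occur periodically in the data and are usually associated with changes in seasons, days of the week, or hours of the day. The daily and yearly sine/cosine signals built in this section have fixed periods, so strictly speaking they encode seasonal patterns.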

#### Research/References:

1. [Components of Time Series Analysis](https://www.toppr.com/guides/business-mathematics-and-statistics/time-series-analysis/components-of-time-series/#:~:text=The%20variations%20in%20a%20time,called%20the%20'Business%20Cycle'.)
2. [Cyclic and seasonal time series by Rob J Hyndman](https://robjhyndman.com/hyndsight/cyclicts/)
3. [Time Series Analysis by Arief Wicaksono](https://medium.com/@ariefwcks303/time-series-analysis-bb61d1d1b3d5)

### How to Convert:

To encode a timestamp as a cyclical feature, compute the sine and cosine of its position within the relevant cycle (e.g., hour within the day, month within the year). This maps each time point onto the unit circle, where one full rotation corresponds to one cycle, so the end of one cycle sits next to the start of the next. The approach is often called sinusoidal (or cyclical) encoding.

#### Steps:

1. **Convert timestamp to datetime object:** Use your programming language's datetime functions to convert the timestamp into a datetime object, allowing you to easily extract components like hour, day, month, etc.
2. **Calculate cyclical features:**
   * **Extract relevant time unit:** Depending on your analysis, extract the specific time unit that represents the cycle (e.g., hour for daily cycles, month for yearly cycles). 
   * **Normalize the unit:** Divide the extracted time unit by the length of the cycle (e.g., divide the hour by 24, not 23, to get a value in [0, 1)). Dividing by the maximum value would map the first and last points of the cycle to the same position.
   * **Apply sine and cosine functions:** Multiply the normalized value by 2π and calculate its sine and cosine. Together, these two values represent the cyclical position of your timestamp.
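
The steps above can be sketched as follows. The helper name `encode_cyclical` is hypothetical, and the sketch divides by the cycle length (24 for hours) so that hour 23 and hour 0 end up adjacent on the circle but remain distinct.

```python
import numpy as np


def encode_cyclical(values, period):
    """Map values in [0, period) onto the unit circle as (sin, cos) pairs."""
    values = np.asarray(values, dtype=float)
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)


hours = np.arange(24)
hour_sin, hour_cos = encode_cyclical(hours, period=24)

# Distance on the circle between 23:00 and 00:00 equals that between 00:00 and 01:00
d_wrap = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
d_step = np.hypot(hour_sin[1] - hour_sin[0], hour_cos[1] - hour_cos[0])
print(np.isclose(d_wrap, d_step))
```

Both components are needed: sine alone cannot distinguish, for example, 03:00 from 09:00, because both map to the same sine value.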

#### Research/References:

1. [Components of Time Series](https://ming-zhao.github.io/Business-Analytics/html/docs/time_series/components.html#:~:text=A%20seasonal%20behavior%20is%20very,seasonal%20(or%20cyclical)%20effects.)
2. [Feature engineering of timestamp for time series analysis](https://datascience.stackexchange.com/questions/107215/feature-engineering-of-timestamp-for-time-series-analysis)
3. [SQL: How can I generate a time series from timestamp data and calculate cumulative sums across different event types?](https://stackoverflow.com/questions/76295454/sql-how-can-i-generate-a-time-series-from-timestamp-data-and-calculate-cumulati)
4. [Time Series Analysis Through Vectorization](https://www.pinecone.io/learn/time-series-vectors/)
5. [Cyclical features in time series forecasting](https://skforecast.org/0.9.0/faq/cyclical-features-time-series#:~:text=Basis%20functions:%20Basis%20functions%20are,a%20piecewise%20combination%20of%20polynomials.)
6. [Cyclical Encoding: An Alternative to One-Hot Encoding for Time Series Features](https://towardsdatascience.com/cyclical-encoding-an-alternative-to-one-hot-encoding-for-time-series-features-4db46248ebba#:~:text=It's%20fairly%20easy%20to%20transform,dt.)

In [3]:
day = 24 * 60 * 60
year = (365.2425) * day

rain_df['Day sin'] = np.sin(rain_df['Seconds'] * (2 * np.pi / day))
rain_df['Day cos'] = np.cos(rain_df['Seconds'] * (2 * np.pi / day))
rain_df['Year sin'] = np.sin(rain_df['Seconds'] * (2 * np.pi / year))
rain_df['Year cos'] = np.cos(rain_df['Seconds'] * (2 * np.pi / year))

rain_df = rain_df.drop(['Seconds'], axis=1)

df2 = pd.concat([df1, rain_df], axis=1)
df2 = df2.drop(['time','rain'], axis=1)

## 2.2 Create: Lagged Features

### Summary:

A lag feature in a time series is a feature that contains the value of a time series at a previous time point. The user sets the lag, or the number of periods in the past, for the feature. For example, a lag of 1 means the feature contains the previous time point value, while a lag of 3 means the feature contains the value three time points before.

Lag features can be created by shifting the original data by one or more time steps. For example, if you have a daily time series of sales, you can create a lagged feature that shows the sales of the previous day, the same day last week, or the same day last year.

In Python, you can create lag features with the pandas `shift` method. For example, `X[my_variable].shift(freq="1H", axis=0)` shifts the index of `my_variable` forward by one hour, so each timestamp is paired with the value observed one hour earlier.

The `LagFeatures` transformer in Feature-engine has the same functionality as pandas `shift()`, but it can create multiple lags at the same time.
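
A minimal sketch of lag-feature creation with `shift`, assuming a regularly spaced hourly index. The helper `add_lag_features` and the toy rainfall values are made up for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd


def add_lag_features(df, column, lags):
    """Return a copy of df with one extra column per requested lag."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down by `lag` rows; the first rows become NaN
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out


index = pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Rain": [0.0, 0.2, 0.0, 1.5, 0.3, 0.0]}, index=index)

lagged = add_lag_features(demo, "Rain", lags=[1, 3])
print(lagged)
```

The leading `NaN` rows have no history to draw on and are usually dropped (for example with `dropna()`) before training.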

#### Research/References:

1. [What are lagged features?](https://www.hopsworks.ai/dictionary/lagged-features#:~:text=A%20lagged%20feature%20is%20created,at%20the%20current%20time%20point.)
2. [How can you use lagged features to capture temporal dependencies in time series data?: LinkedIn](https://www.linkedin.com/advice/0/how-can-you-use-lagged-features-capture-temporal-ks4kc#:~:text=Lagged%20features%20are%20features%20that,the%20same%20day%20last%20year.)
3. [LagFeatures: Automating lag feature creation](https://feature-engine.trainindata.com/en/1.8.x/user_guide/timeseries/forecasting/LagFeatures.html)
4. [Time Series as Features by Ryan Holbrook and Alexis Cook](https://www.kaggle.com/code/ryanholbrook/time-series-as-features)
5. [Introduction to feature engineering for time series forecasting by Francesca Lazzeri](https://medium.com/data-science-at-microsoft/introduction-to-feature-engineering-for-time-series-forecasting-620aa55fcab0)
6. [Lagged features for time series forecasting: Scikit Learn](https://scikit-learn.org/1.5/auto_examples/applications/plot_time_series_lagged_features.html)
7. [Lag features for time-series forecasting in AutoML](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automl-forecasting-lags?view=azureml-api-2)
8. [Analyzing the Impact of Lagged Features in Time Series Forecasting: A Linear Regression Approach](https://cubed.run/blog/analyzing-the-impact-of-lagged-features-in-time-series-forecasting-a-linear-regression-approach-730aaa99dfd6)

In [4]:
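def df_to_X_y(df, window_size=6):
    """Build sliding windows of `window_size` rows and the next-step target."""
    df_as_np = df.to_numpy()
    X = []
    y = []

    for i in range(len(df_as_np) - window_size):
        # Input: `window_size` consecutive rows with all feature columns
        X.append(df_as_np[i: i + window_size])

        # Target: column 0 of the row immediately after the window.
        # NOTE: df2 is built with pd.concat([df1, rain_df]), so 'Rain' is not
        # column 0 there; confirm that column 0 is the intended target.
        y.append(df_as_np[i + window_size][0])

    return np.array(X), np.array(y)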

# 24 hour window
X2, y2 = df_to_X_y(df2, window_size=24)
X2.shape, y2.shape

((219144, 24, 15), (219144,))
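
For a dataset of this size, the same windows can be built without a Python loop. The sketch below is an alternative rather than the notebook's method: `make_windows` is a hypothetical helper based on NumPy's `sliding_window_view`, and `target_col=0` mirrors the column-0 convention used by `df_to_X_y`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, window_size, target_col=0):
    """Vectorized windows: X[i] = values[i:i+window_size], y[i] = next row's target."""
    values = np.asarray(values)
    # Shape: (n - window_size + 1, n_features, window_size)
    windows = sliding_window_view(values, window_size, axis=0)
    # Drop the last window (no row follows it) and put time before features
    X = windows[:-1].transpose(0, 2, 1)
    y = values[window_size:, target_col]
    return X, y


toy = np.arange(20).reshape(10, 2)  # 10 time steps, 2 features
X_toy, y_toy = make_windows(toy, window_size=3)
print(X_toy.shape, y_toy.shape)  # (7, 3, 2) (7,)
```

`X` here is a read-only view of the original array, so call `.copy()` on it before any in-place scaling.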

## 2.3 Rolling Window Split

In [None]:
def rolling_window_split(X, y, window_size, horizon=1, step=1):
    """
    Creates rolling window splits for time series data.
    
    Parameters:
    - X: Feature matrix
    - y: Target values
    - window_size: Size of the training window
    - horizon: Forecast horizon (how many steps ahead to predict)
    - step: Number of steps to move the window forward
    
    Returns:
    - Generator of (X_train, X_val, y_train, y_val) splits
    """
    total_samples = len(X)
    
    # Create splits
    for i in range(0, total_samples - window_size - horizon + 1, step):
        # Define split points
        train_end = i + window_size
        val_end = train_end + horizon
        
        # Split the data
        X_train = X[i:train_end]
        X_val = X[train_end:val_end]
        y_train = y[i:train_end]
        y_val = y[train_end:val_end]
        
        yield X_train, X_val, y_train, y_val
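
To see which rows each split covers, the loop bounds of `rolling_window_split` can be reproduced on indices alone. This is a sketch on a toy length of 10; `rolling_window_indices` is a hypothetical helper, not part of the notebook.

```python
def rolling_window_indices(n_samples, window_size, horizon=1, step=1):
    """Yield ((train_start, train_end), (val_start, val_end)) half-open ranges."""
    for start in range(0, n_samples - window_size - horizon + 1, step):
        train_end = start + window_size
        yield (start, train_end), (train_end, train_end + horizon)


splits = list(rolling_window_indices(10, window_size=4, horizon=2, step=2))
print(splits)
# [((0, 4), (4, 6)), ((2, 6), (6, 8)), ((4, 8), (8, 10))]
```

Each validation range starts exactly where its training range ends, so no future data leaks into training. With `step=1` the number of splits grows with the dataset length, so a larger `step` is likely needed if the model is refit on every split.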

In [None]:
def split_time_series_data(X, y, train_ratio=0.8, val_ratio=0.1):
    """
    Splits time series data into training, validation, and testing sets sequentially.

    Parameters:
    - X: Features (numpy array, DataFrame).
    - y: Labels (numpy array, Series).
    - train_ratio: Proportion of data for training (default 0.8).
    - val_ratio: Proportion of data for validation (default 0.1).

    Returns:
    - X_train, X_val, X_test, y_train, y_val, y_test: Sequentially split data.
    """
    # Calculate the number of samples for each split
    n = len(X)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)

    # Split data sequentially
    X_train, y_train = X[:train_end], y[:train_end]
    X_val, y_val = X[train_end:val_end], y[train_end:val_end]
    X_test, y_test = X[val_end:], y[val_end:]

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train2, X_val2, X_test2, y_train2, y_val2, y_test2 = split_time_series_data(X2, y2)
X_train2.shape, y_train2.shape, X_val2.shape, y_val2.shape, X_test2.shape, y_test2.shape

In [None]:
rain_training_mean2 = np.mean(X_train2[:, :, 0])
rain_training_std2 = np.std(X_train2[:, :, 0])

def preprocess_standardize2(X):
    X[:, :, 0] = (X[:, :, 0] - rain_training_mean2) / rain_training_std2
    return X

def preprocess__standardize_output(y):
    # Check if y is 1D or 2D
    if len(y.shape) == 2:  # If it's 2D (like a column vector), you can index it
        y[:, 0] = (y[:, 0] - rain_training_mean2) / rain_training_std2
    else:  # If it's 1D, you don't need the extra index
        y = (y - rain_training_mean2) / rain_training_std2
    return y

preprocess_standardize2(X_train2)
preprocess_standardize2(X_val2)
preprocess_standardize2(X_test2)

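# For 1D targets the function returns a new array instead of modifying y
# in place, so the result must be assigned back or the scaling is lost.
y_train2 = preprocess__standardize_output(y_train2)
y_val2 = preprocess__standardize_output(y_val2)
y_test2 = preprocess__standardize_output(y_test2)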

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Reshape the 3D array to 2D (combine samples and timesteps)
num_samples, num_timesteps, num_features = X_train2.shape
X_train2_reshaped = X_train2.reshape(-1, num_features)

# Fit and transform the scaler on training data
X_train2_scaled = scaler.fit_transform(X_train2_reshaped)

# Reshape back to 3D
X_train2 = X_train2_scaled.reshape(num_samples, num_timesteps, num_features)

# Repeat for validation and test datasets
X_val2_reshaped = X_val2.reshape(-1, num_features)
X_val2_scaled = scaler.transform(X_val2_reshaped)
X_val2 = X_val2_scaled.reshape(X_val2.shape)

X_test2_reshaped = X_test2.reshape(-1, num_features)
X_test2_scaled = scaler.transform(X_test2_reshaped)
X_test2 = X_test2_scaled.reshape(X_test2.shape)

print("Scaling completed!")
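
Because the target is standardized with the training mean and standard deviation, model predictions come out in standardized units and need the inverse transform before they are reported or compared with raw values. A minimal sketch, in which the helper names and the toy target values are assumptions:

```python
import numpy as np


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using training statistics."""
    return (np.asarray(values, dtype=float) - mean) / std


def destandardize(values, mean, std):
    """Invert standardize(): map scaled values back to the original units."""
    return np.asarray(values, dtype=float) * std + mean


# Toy training target (illustrative values only)
train_target = np.array([0.0, 0.0, 1.2, 0.4, 0.0, 2.4])
mean, std = train_target.mean(), train_target.std()

scaled = standardize(train_target, mean, std)
restored = destandardize(scaled, mean, std)
print(np.allclose(restored, train_target))
```

The same `mean` and `std` computed on the training split should be reused for the validation and test targets and for de-scaling predictions, so that no information from later data enters the scaling.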

In [None]:
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import *
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam