## Long Short-Term Memory on Weather Station Data

- Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to model sequential data by maintaining a memory of past information while selectively updating and forgetting information over time.

- LSTMs are particularly effective at capturing long-range dependencies in sequences, which makes them well suited to tasks such as natural language processing, speech recognition, and time series forecasting. They mitigate the vanishing gradient problem of traditional RNNs through a gating mechanism (input, forget, and output gates) that lets them learn and retain information over longer periods.

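- In this notebook, the data is windowed by a step size k. For example, with k (step) = 100, each input is a window of 100 consecutive data points reshaped to (100, 1): the model reads those 100 points to predict the value for the next (101st) step. In the code in Step 5.0, each window is labeled with the `downgrade` value of the row it was cut from.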
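The windowing described above can be sketched on a toy sequence. This is a minimal illustration, assuming a 1-D sequence and a small k; the helper name `make_windows` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(sequence: np.ndarray, k: int) -> np.ndarray:
    """Return every length-k window of a 1-D sequence, shaped (n_windows, k, 1)."""
    # sliding_window_view yields an array of shape (len(sequence) - k + 1, k)
    windows = sliding_window_view(sequence, window_shape=k)
    # add a trailing feature axis so each window looks like (k, 1) to an LSTM
    return windows.reshape(windows.shape[0], k, 1)


toy_sequence = np.arange(6, dtype=float)  # hypothetical data: 0, 1, ..., 5
toy_windows = make_windows(toy_sequence, k=4)
print(toy_windows.shape)
print(toy_windows[:, :, 0])
```

A sequence of length n yields n - k + 1 windows, so a larger k means fewer but longer inputs.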

In [None]:
# import dependencies
import pandas as pd
import numpy as np
import sqlalchemy as sq
import sys
import os
import pickle
from imblearn.combine import SMOTEENN
from imblearn.ensemble import (  # type: ignore
    RUSBoostClassifier,
)

from sklearn.metrics import (  # type: ignore
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout

sys.path.append("../../")
os.chdir("../../")
from ModelBuilderMethods import getConn, extractYears

In [None]:
# unlimited line output
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

### <u>**Step 1**</u>: Data Selection

In this step, we choose the tables and attributes to work with. Further aggregation and feature engineering can also be done here to support the research question.

In particular, for this notebook we load the following data and merge it (on year and district) into a single table:
- Monthly weather station data
- Ergot data (the `downgrade` target)

In [None]:
# Set the query text
weatherStationQuery = sq.text(
    """
    SELECT * from dataset_cross_monthly_station
"""
)

ergotTargetQuery = sq.text(
    """
    SELECT year, district, downgrade from ergot_sample_feat_eng
"""
)

In [None]:
conn = getConn()

stationDf = pd.read_sql(weatherStationQuery, conn)
ergotTargetDf = pd.read_sql(ergotTargetQuery, conn)

conn.close()
del conn

In [None]:
tempdf = stationDf

# merge on year and district
datasetDf = pd.merge(ergotTargetDf, tempdf, on=["year", "district"], how="left")
del ergotTargetDf
del tempdf

In [None]:
# encode district
datasetDf["district"] = datasetDf["district"].astype("category")

temp = pd.get_dummies(datasetDf["district"], prefix="district", drop_first=True)
datasetDf = pd.concat([datasetDf, temp], axis=1)

datasetDf = datasetDf.drop(columns=["district"])

del temp

### <u>**Step 2**</u>: Splitting dataset

- We split the dataset into train and test sets by year (1995 - 2015 for training, 2016 - 2020 for testing). Splitting by year rather than at random keeps later years out of the training set, which matters because this is time series data.
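A year-based split can be sketched with a boolean mask. The notebook's `extractYears` helper comes from `ModelBuilderMethods` and its implementation is not shown here, so the function below (`split_by_year`, a hypothetical name) is only an assumption about the intended behaviour: keep rows whose `year` falls within an inclusive range.

```python
import pandas as pd


def split_by_year(df: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Keep rows whose 'year' lies in [start, end], inclusive on both ends."""
    mask = df["year"].between(start, end)  # between() is inclusive by default
    return df.loc[mask].reset_index(drop=True)


# hypothetical toy table standing in for the merged dataset
toy = pd.DataFrame({"year": [2014, 2015, 2016, 2017], "downgrade": [0, 1, 0, 1]})
toy_train = split_by_year(toy, 1995, 2015)
toy_test = split_by_year(toy, 2016, 2020)
print(len(toy_train), len(toy_test))
```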

In [None]:
# train 1995 - 2015 test 2016 - 2020
trainDf = extractYears(datasetDf, 1995, 2015)
testDf = extractYears(datasetDf, 2016, 2020)
del datasetDf

In [None]:
# drop year
trainDf = trainDf.drop(columns=["year"])
testDf = testDf.drop(columns=["year"])

### <u>**Step 3**</u>: [Balancing the dataset](https://imbalanced-learn.org/stable/)

- Our dataset is unbalanced, which can bias both training and evaluation. The balancing step reduces that bias and so gives more reliable results.
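To make the idea of rebalancing concrete, here is a minimal sketch of random oversampling with NumPy for a binary target. It is a deliberately simpler stand-in, not what the notebook uses: `SMOTEENN` synthesizes new minority samples with SMOTE and then cleans the result with Edited Nearest Neighbours. The function name `random_oversample` and the toy data are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()  # rows needed to close the gap
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# hypothetical toy data: 8 negatives, 2 positives
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Resampling is applied to the training split only, so that the test split keeps its original class distribution.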

In [None]:
# pre balancing check
# print value counts downgrade
print(trainDf["downgrade"].value_counts())
print(testDf["downgrade"].value_counts())

In [None]:
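# count NaN values per column
print(trainDf.isna().sum())
# replace NaN with 0 in both splits so the model never receives NaN inputs
trainDf = trainDf.fillna(0)
testDf = testDf.fillna(0)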

In [None]:
balancer = SMOTEENN(sampling_strategy=1, random_state=42)
balancedTrainDfX, balancedTrainDfY = balancer.fit_resample(
    trainDf.drop(columns="downgrade"), trainDf["downgrade"]
)

In [None]:
# post balancing check
# print value counts downgrade
print(balancedTrainDfY.value_counts())

### <u>**Step 4**</u>: Regularization / Normalization
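Feature scaling puts inputs on comparable ranges, which generally helps neural networks train. No scaler is applied in this notebook; the scikit-learn options that could be used here are: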

1. [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)             
2. [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)  
3. [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)  
4. [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)  
5. [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)  
6. [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)  
7. [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)  
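As an illustration of how a scaler is fitted on the training split and reused on the test split, here is a hand-rolled min-max scaling sketch in NumPy. It follows the per-feature formula (x - min) / (max - min) that `MinMaxScaler` uses for its default range, but the helper names `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np


def fit_min_max(X_train):
    """Learn per-column minimum and range from the training data only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range


def apply_min_max(X, col_min, col_range):
    """Scale X with statistics learned earlier; works for train and test alike."""
    return (X - col_min) / col_range


# hypothetical toy data
X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_toy = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(X_train_toy)
X_train_scaled = apply_min_max(X_train_toy, col_min, col_range)
X_test_scaled = apply_min_max(X_test_toy, col_min, col_range)
print(X_train_scaled)
print(X_test_scaled)  # test values may fall outside [0, 1]
```

Fitting on the training split alone keeps information from the test years out of the preprocessing step.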

In [None]:
def printMetrics(model_name, y_true, y_pred):
    """Print classification metrics; y_pred must hold class labels, not raw scores."""
    print(model_name)
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision: ", precision_score(y_true, y_pred))
    print("Recall: ", recall_score(y_true, y_pred))
    print("F1: ", f1_score(y_true, y_pred))
    print("ROC AUC: ", roc_auc_score(y_true, y_pred))
    print("Classification Report: \n", classification_report(y_true, y_pred))
    print()

### <u>**Step 5**</u>: Long Short Term Memory Model

##### <u>**Step 5.0**</u>: Create input-output pair

In [None]:
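def create_io_pair(
    X_train: np.ndarray, Y_train: np.ndarray, k=100
) -> "tuple[np.ndarray, np.ndarray]":
    """
    Slide a window of length k over every row of X_train.

    k: time step (window length)
    return: (input, output) pairs from given data; each window is paired
        with the label of the row it was cut from
    """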
    windows = []
    windows_y = []

    for i, sequence in enumerate(X_train):
        len_seq = len(sequence)
        for window_start in range(0, len_seq - k + 1):
            window_end = window_start + k
            window = sequence[window_start:window_end]
            windows.append(window)
            windows_y.append(Y_train[i])
    return (np.array(windows), np.array(windows_y))

In [None]:
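# cast to float32 so mixed column types (e.g. one-hot district flags) give a numeric array
X_train = np.array(trainDf.drop(columns=["downgrade"]), dtype=np.float32)
Y_train = np.array(trainDf["downgrade"], dtype=np.float32)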
x_train, y_train = create_io_pair(X_train, Y_train, k=100)
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)

In [None]:
print(x_train.shape, y_train.shape)

In [None]:
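# cast to float32 so mixed column types (e.g. one-hot district flags) give a numeric array
X_test = np.array(testDf.drop(columns=["downgrade"]), dtype=np.float32)
Y_test = np.array(testDf["downgrade"], dtype=np.float32)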
x_test, y_test = create_io_pair(X_test, Y_test, k=100)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)

In [None]:
print(x_test.shape, y_test.shape)

##### <u>**Step 5.1**</u>: Initialize the model

In [None]:
def LSTM_model(n_input, n_output, units=50, dropout_rate=0.2, optimizer="adam"):
    # using sequential to build LSTM model
    model = Sequential()

    # Adding the first LSTM layer and some Dropout regularisation
    model.add(LSTM(units=units, return_sequences=True, input_shape=(n_input, 1)))
    model.add(Dropout(dropout_rate))

    # Adding a second LSTM layer and some Dropout regularisation
    model.add(LSTM(units=units, return_sequences=True))
    model.add(Dropout(dropout_rate))

    # Adding a third LSTM layer and some Dropout regularisation
    model.add(LSTM(units=units, return_sequences=True))
    model.add(Dropout(dropout_rate))

    # Adding a fourth LSTM layer and some Dropout regularisation
    model.add(LSTM(units=units))
    model.add(Dropout(dropout_rate))

    # Adding the output layer
    model.add(Dense(units=n_output))

    # Compiling the RNN
    model.compile(optimizer=optimizer, loss="mean_absolute_error")

    return model

In [None]:
model = LSTM_model(
    n_input=x_train.shape[1],  # window length k
    n_output=x_train.shape[2],  # equals 1 here: one predicted value per window
    units=50,
    dropout_rate=0.2,
    optimizer="adam",
)

In [None]:
model.summary()

##### <u>**Step 5.2**</u>: Fit the training data to the model

In [None]:
history = model.fit(
    x_train, y_train, epochs=100, batch_size=64
)  # note: this run is terminated when the input size is too large

##### <u>**Step 5.3**</u>: Test the model on the testing dataset

In [None]:
prediction_bal_LSTM = model.predict(x_test)

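##### <u>**Step 5.4**</u>: Evaluate the model with the following metrics:
- ACCURACY: fraction of all predictions that are correct.
- PRECISION: fraction of predicted positives that are actually positive.
- RECALL: fraction of actual positives that the model identifies.
- F1: harmonic mean of precision and recall.
- ROC AUC: area under the ROC curve; 0.5 corresponds to random ranking and 1.0 to perfect separation.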
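The metrics above can be computed by hand from confusion-matrix counts, which also shows why continuous model outputs have to be turned into class labels first. The sketch below is plain Python; the 0.5 threshold, the helper name `binary_metrics`, and the toy scores are assumptions for illustration only.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Threshold raw scores into 0/1 labels, then derive metrics from the counts."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# hypothetical labels and raw model scores
toy_true = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.3, 0.8, 0.6, 0.1]
toy_metrics = binary_metrics(toy_true, toy_scores)
print(toy_metrics)
```

Moving the threshold trades precision against recall, so it is worth choosing with the imbalance of `downgrade` in mind.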

In [None]:
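# The model outputs one continuous value per window, so threshold it at 0.5
# (an assumed cut-off) and compare against the per-window labels in y_test.
printMetrics(
    "LSTM unbalanced train set",
    y_test,
    (prediction_bal_LSTM.ravel() >= 0.5).astype(int),
)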