## Gradient Boosting Model on Satellite Data

- Gradient boosting builds an additive model in a forward stage-wise fashion, which allows the optimization of arbitrary differentiable loss functions.
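
To make the stage-wise idea concrete, here is a minimal from-scratch sketch for squared-error loss, where the negative gradient is simply the residual. The helper names (`fit_stump`, `boost`) and the toy data are illustrative assumptions; scikit-learn's `GradientBoostingClassifier` fits regression trees to the gradient of a log-loss instead.

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Forward stage-wise boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # stage 0: constant prediction
    stumps = []
    preds = [base] * len(xs)
    losses = []
    for _ in range(n_stages):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses


# Invented 1-D toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0, 0.3, 0.2]
predict, losses = boost(xs, ys)
print(losses[0], losses[-1])
```

Each stage only adds a new weak learner and never revisits earlier ones, which is what "forward stage-wise" refers to.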

In [26]:
# dependencies
import pandas as pd
import sqlalchemy as sq
import sys, os
import pickle
from imblearn.combine import SMOTEENN, SMOTETomek
from xgboost import XGBClassifier
from sklearn.ensemble import (  # type: ignore
    GradientBoostingClassifier,
)
from imblearn.ensemble import (  # type: ignore
    RUSBoostClassifier,
)

from sklearn.metrics import (  # type: ignore
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
)

sys.path.append("../../")
os.chdir("../../")
from ModelBuilderMethods import getConn, extractYears

In [27]:
# unlimited line output
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

### <u>**Step 1**</u>: Data Selection

In this step, we choose the tables and attributes to use. Further aggregation or feature engineering can be done here to support the research question.

For this notebook, we load the following data and merge it (on year and district) into a single table:
- weather data from satellite
- ergot data (downgrade)

In [28]:
weatherSatQuery = sq.text(
    """
    SELECT * from dataset_cross_monthly_sat
"""
)

ergotTargetQuery = sq.text(
    """
    SELECT year, district, downgrade from ergot_sample_feat_eng
"""
)

In [29]:
conn = getConn("./.env")

# stationDf = pd.read_sql(weatherStationQuery, conn)
satelliteDf = pd.read_sql(weatherSatQuery, conn)
ergotTargetDf = pd.read_sql(ergotTargetQuery, conn)

conn.close()
del conn

In [30]:
# satellite weather features form the right-hand table
tempdf = satelliteDf

# left merge on year and district: every ergot row is kept, even without matching weather data
datasetDf = pd.merge(ergotTargetDf, tempdf, on=["year", "district"], how="left")
del ergotTargetDf
del tempdf
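
A left merge keeps every row of the left table and fills the right-hand columns with `NaN` where no `(year, district)` match exists, which is one plausible source of the missing values counted in Step 3. The toy frames below contain invented data for illustration; `indicator=True` is a standard `pd.merge` option that labels where each row came from.

```python
import pandas as pd

# Toy stand-ins for the ergot target table and the satellite weather table (values are invented).
target = pd.DataFrame(
    {"year": [2000, 2000, 2001], "district": [1, 2, 1], "downgrade": [False, True, False]}
)
weather = pd.DataFrame(
    {"year": [2000, 2001], "district": [1, 1], "1:mean_temp": [-12.5, -9.0]}
)

merged = pd.merge(target, weather, on=["year", "district"], how="left", indicator=True)

# Rows marked "left_only" had no weather record and carry NaN in the weather columns.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Checking the `_merge` column before filling missing values shows whether the gaps come from unmatched keys or from missing measurements.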

Then, the district number is encoded using [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

One-hot (also called "one-of-K" or "dummy") encoding takes a categorical feature and creates one binary column per category. The scikit-learn `OneHotEncoder` linked above returns a sparse matrix or dense array (depending on the `sparse_output` parameter). This notebook uses `pd.get_dummies` instead, with `drop_first=True` so that the first category is dropped and acts as the baseline.


In [31]:
# encode district
datasetDf["district"] = datasetDf["district"].astype("category")

temp = pd.get_dummies(datasetDf["district"], prefix="district", drop_first=True)
datasetDf = pd.concat([datasetDf, temp], axis=1)

datasetDf = datasetDf.drop(columns=["district"])

del temp
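
As a small illustration of what `drop_first=True` does, the sketch below encodes an invented district column. With K categories it yields K-1 indicator columns, and the dropped first category is represented by all indicators being zero.

```python
import pandas as pd

# Invented district ids, for illustration only.
districts = pd.Series([10, 20, 30, 10], name="district").astype("category")

dummies = pd.get_dummies(districts, prefix="district", drop_first=True)

# Three categories produce two columns; district 10 is the implicit baseline.
print(list(dummies.columns))
```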

### <u>**Step 2**</u>: Splitting dataset

- We split the dataset into training and test sets by year (1995 to 2015 for training, 2016 to 2020 for testing). Because this is time-series data, a chronological split keeps later observations out of the training set.

In [32]:
# train 1995 - 2015 test 2016 - 2020
trainDf = extractYears(datasetDf, 1995, 2015)
testDf = extractYears(datasetDf, 2016, 2020)
del datasetDf
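
The implementation of `extractYears` lives in `ModelBuilderMethods` and is not shown here. Assuming it selects rows whose `year` falls in an inclusive range, a hypothetical stand-in (`extract_years` below is not the project's helper) could look like this:

```python
import pandas as pd


def extract_years(df: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Hypothetical stand-in: keep rows with start <= year <= end (inclusive)."""
    return df.loc[df["year"].between(start, end)].copy()


# Invented rows, for illustration only.
toy = pd.DataFrame(
    {"year": [2014, 2015, 2016, 2020], "downgrade": [False, True, False, True]}
)
train = extract_years(toy, 1995, 2015)
test = extract_years(toy, 2016, 2020)

# A chronological split should leave no year in both sets.
assert set(train["year"]).isdisjoint(test["year"])
```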

In [33]:
# drop year
trainDf = trainDf.drop(columns=["year"])
testDf = testDf.drop(columns=["year"])

### <u>**Step 3**</u>: [Balancing the dataset](https://imbalanced-learn.org/stable/)

- Our dataset is imbalanced, which can bias training and evaluation toward the majority class. Balancing the training set helps reduce this bias and gives more reliable results.


In [34]:
# pre balancing check
# print value counts downgrade
print(trainDf["downgrade"].value_counts())
print(testDf["downgrade"].value_counts())

downgrade
False    122202
True       2082
Name: count, dtype: int64
downgrade
False    26307
True      1016
Name: count, dtype: int64
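
The counts printed above translate into positive rates of roughly 1.7% in the training set and 3.7% in the test set. The short calculation below reproduces those figures and also shows the accuracy that a trivial always-`False` predictor would reach on the test set, which is why accuracy alone is a weak yardstick here.

```python
# Class counts copied from the value_counts() output above.
train_counts = {False: 122202, True: 2082}
test_counts = {False: 26307, True: 1016}


def positive_rate(counts):
    """Share of samples labelled True."""
    return counts[True] / (counts[True] + counts[False])


train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)

# Accuracy of a model that always predicts False on the test set.
majority_baseline_accuracy = 1 - test_rate

print(f"train positive rate: {train_rate:.4f}")
print(f"test positive rate:  {test_rate:.4f}")
print(f"always-False test accuracy: {majority_baseline_accuracy:.4f}")
```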


In [35]:
# count nan
print(trainDf.isna().sum())
# set nan to 0
trainDf = trainDf.fillna(0)

downgrade                    0
1:min_temp_x              1246
1:max_temp_x              1246
1:mean_temp_x             1246
1:min_dew_point_temp      1246
1:max_dew_point_temp      1246
1:mean_dew_point_temp     1246
1:min_humidex             1246
1:max_humidex             1246
1:mean_humidex            1246
1:min_precip              1246
1:max_precip              1246
1:mean_precip             1246
1:min_rel_humid           1246
1:max_rel_humid           1246
1:mean_rel_humid          1246
1:min_stn_press           1246
1:max_stn_press           1246
1:mean_stn_press          1246
1:min_visibility          1246
1:max_visibility          1246
1:mean_visibility         1246
1:max_temp_y              1246
1:min_temp_y              1246
1:mean_temp_y             1246
1:min_total_rain          1246
1:max_total_rain          1246
1:mean_total_rain         1246
1:min_total_snow          1246
1:max_total_snow          1246
1:mean_total_snow         1246
1:min_total_precip        1246
1:max_to

In [36]:
balancer = SMOTEENN(sampling_strategy=1, random_state=42)
balancedTrainDfX, balancedTrainDfY = balancer.fit_resample(
    trainDf.drop(columns="downgrade"), trainDf["downgrade"]
)

In [37]:
# post balancing check
# print value counts downgrade
print(balancedTrainDfY.value_counts())

downgrade
False    115179
True      23757
Name: count, dtype: int64
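
SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours cleaning, so the final class counts need not match the requested ratio once the cleaning step has removed samples. The sketch below illustrates only the SMOTE interpolation idea on invented 2-D points; `synthesize` is a hypothetical helper and not imbalanced-learn's implementation.

```python
import math
import random


def synthesize(minority, k=2, n_new=5, seed=42):
    """Create synthetic points on segments between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding the sample itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Invented minority-class samples on the corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = synthesize(minority_points)
print(new_points)
```

Because every synthetic point lies between two existing minority samples, SMOTE densifies the minority region rather than duplicating rows.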


### <u>**Step 4**</u>: Regularization / Normalization
scikit-learn provides several scalers and transformers that could be applied at this step:  

1. [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)             
2. [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)  
3. [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)  
4. [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)  
5. [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)  
6. [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)  
7. [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)  
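
For reference, two of the options in the list reduce to simple formulas: min-max scaling maps a column to [0, 1] via (x - min) / (max - min), and standard scaling subtracts the mean and divides by the standard deviation. The sketch below applies both to an invented temperature column using only the standard library. Tree-based models such as the ones trained in Step 5 split on thresholds, so they are largely insensitive to this kind of monotonic rescaling.

```python
from statistics import mean, pstdev

max_temp = [-5.0, 0.0, 10.0, 25.0]  # invented values for illustration


def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(min_max_scale(max_temp))
print(standard_scale(max_temp))
```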

In [38]:
# df = pd.DataFrame()
# scaled = scaleColumns(df, ['max_temp'], None, 1)

Categorical values can be handled with [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).  


In [39]:
# encoded = encodeColumns(df, ['max_temp'], None)

In [40]:
def printMetrics(model_name, y_true, y_pred):
    """Print common classification metrics for hard (True/False) predictions."""
    print(model_name)
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision: ", precision_score(y_true, y_pred))
    print("Recall: ", recall_score(y_true, y_pred))
    print("F1: ", f1_score(y_true, y_pred))
    # ROC AUC is computed from hard labels here, not from predicted probabilities
    print("ROC AUC: ", roc_auc_score(y_true, y_pred))
    print("Classification Report: \n", classification_report(y_true, y_pred))
    print()

### <u>**Step 5**</u>: Gradient Boosting Classifier Model

##### <u>**Step 5.1**</u>: Initialize the model

In [41]:
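ESTIMATORS = 400  # maximum number of boosting stages
DEPTH = 40  # maximum depth of each tree
CORES = -1  # use all available cores (XGBoost only)
# the two constants below are defined but not passed to any model in this cell
MINSAMPLESPLIT = 8
MINSAMPLELEAF = 4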

gradient_boosting_model = GradientBoostingClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    verbose=1,
    n_iter_no_change=200,
)
balanced_gradient_boosting_model = GradientBoostingClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    verbose=1,
    n_iter_no_change=200,
)
rusboost_model = RUSBoostClassifier(
    n_estimators=ESTIMATORS, random_state=42, sampling_strategy=0.5
)
balanced_rusboost_model = RUSBoostClassifier(
    n_estimators=ESTIMATORS, random_state=42, sampling_strategy=0.5
)
xgboost_model = XGBClassifier(
    n_estimators=ESTIMATORS, random_state=42, max_depth=DEPTH, verbosity=1, n_jobs=CORES
)
balanced_xgboost_model = XGBClassifier(
    n_estimators=ESTIMATORS, random_state=42, max_depth=DEPTH, verbosity=1, n_jobs=CORES
)
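
`n_iter_no_change` turns on early stopping in scikit-learn's gradient boosting: a fraction of the training data (`validation_fraction`, 0.1 by default) is held out, and training stops once the validation score stops improving for that many consecutive iterations. The sketch below demonstrates the mechanism on synthetic data with deliberately small, illustrative settings that are not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, imbalanced toy data (not the satellite dataset).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=200,  # upper bound on boosting stages
    max_depth=3,
    n_iter_no_change=5,  # stop after 5 stages without validation improvement
    validation_fraction=0.1,  # share of the training data held out for that check
    random_state=42,
)
model.fit(X, y)

# n_estimators_ reports how many stages were actually fitted.
print(model.n_estimators_)
```

With a patience of 200 out of at most 400 stages, as configured above, early stopping can only trigger in the second half of training.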

##### <u>**Step 5.2**</u>: Fit the training data to the model

In [42]:
gradient_boosting_model.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
balanced_gradient_boosting_model.fit(balancedTrainDfX, balancedTrainDfY)

      Iter       Train Loss   Remaining Time 
         1           0.1563           76.55m
         2           0.1517           76.15m
         3           0.1484           76.18m
         4           0.1459           76.06m
         5           0.1438           75.35m
         6           0.1422           74.86m
         7           0.1408           74.84m
         8           0.1397           73.82m
         9           0.1387           72.84m
        10           0.1378           72.05m
        20           0.1336           66.49m
        30           0.1324           62.26m
        40           0.1320           57.46m
        50           0.1318           77.18m
        60           0.1318           77.62m


eval procedure

In [None]:
rusboost_model.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
balanced_rusboost_model.fit(balancedTrainDfX, balancedTrainDfY)

In [None]:
xgboost_model.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
balanced_xgboost_model.fit(balancedTrainDfX, balancedTrainDfY)

##### <u>**Step 5.3**</u>: Test the model on the testing dataset

In [None]:
# set nan to 0
testDf = testDf.fillna(0)

In [None]:
# get predictions

predictions_gradient_boosting = gradient_boosting_model.predict(
    testDf.drop(columns="downgrade")
)
predictions_balanced_gradient_boosting = balanced_gradient_boosting_model.predict(
    testDf.drop(columns="downgrade")
)
predictions_rusboost = rusboost_model.predict(testDf.drop(columns="downgrade"))
predictions_balanced_rusboost = balanced_rusboost_model.predict(
    testDf.drop(columns="downgrade")
)
predictions_xgboost = xgboost_model.predict(testDf.drop(columns="downgrade"))
predictions_balanced_xgboost = balanced_xgboost_model.predict(
    testDf.drop(columns="downgrade")
)

In [None]:
print(pd.DataFrame(predictions_gradient_boosting).value_counts())
print(pd.DataFrame(predictions_balanced_gradient_boosting).value_counts())
print(pd.DataFrame(predictions_rusboost).value_counts())
print(pd.DataFrame(predictions_balanced_rusboost).value_counts())
print(pd.DataFrame(predictions_xgboost).value_counts())
print(pd.DataFrame(predictions_balanced_xgboost).value_counts())

##### <u>**Step 5.4**</u>: Evaluate models based on different metrics:
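- ACCURACY: share of all predictions that are correct, (TP + TN) / (TP + FP + TN + FN).
- PRECISION: share of predicted downgrades that are actual downgrades, TP / (TP + FP).
- RECALL: share of actual downgrades that the model finds, TP / (TP + FN).
- F1: harmonic mean of precision and recall.
- ROC AUC: area under the ROC curve, which plots the true positive rate against the false positive rate.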
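
All five metrics can be derived from the four confusion-matrix counts. The sketch below computes them in plain Python for invented labels. Note that when ROC AUC is computed from hard True/False predictions, as `printMetrics` does, it reduces to the mean of recall and specificity (balanced accuracy), whereas passing scores from `predict_proba` would evaluate the full ranking.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Compute the five metrics from confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # ROC AUC of a single operating point equals balanced accuracy
        "roc_auc_hard_labels": (recall + specificity) / 2,
    }


# Invented labels: 2 positives, 6 negatives.
y_true = [True, True, False, False, False, False, False, False]
y_pred = [True, False, True, False, False, False, False, False]
result = metrics(y_true, y_pred)
print(result)
```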

In [None]:
# get accuracy precision recall f1 roc_auc
printMetrics(
    "sk GB imbalanced train set", testDf["downgrade"], predictions_gradient_boosting
)
printMetrics(
    "imb GB balanced train set",
    testDf["downgrade"],
    predictions_balanced_gradient_boosting,
)
printMetrics("sk RUS imbalanced train set", testDf["downgrade"], predictions_rusboost)
printMetrics(
    "imb RUS balanced train set", testDf["downgrade"], predictions_balanced_rusboost
)
printMetrics("sk XGB imbalanced train set", testDf["downgrade"], predictions_xgboost)
printMetrics(
    "imb XGB balanced train set", testDf["downgrade"], predictions_balanced_xgboost
)

##### <u>**Step 5.5**</u>: Export metric results to CSV
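
No export code is shown for this step. One possible approach, sketched below with the standard-library `csv` module, writes one row per model. The function name `export_metrics`, the file name, and the metric values are placeholders rather than results from this notebook.

```python
import csv
import os
import tempfile


def export_metrics(rows, path):
    """Write a list of metric dictionaries to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder values only; real rows would come from the evaluation step.
rows = [
    {"model": "example_model_a", "accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "roc_auc": 0.0},
    {"model": "example_model_b", "accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "roc_auc": 0.0},
]

# Hypothetical output location in the system temp directory.
output_path = os.path.join(tempfile.gettempdir(), "metrics_example.csv")
export_metrics(rows, output_path)
```

Keeping one row per model and one column per metric makes the file easy to reload with `pd.read_csv` for later comparison.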