# NOAA: Detecting SST Anomalies Using an Unsupervised Learning Approach

***Introduction***

This notebook builds on our learnings from the noaa_evaluation EDA by training an unsupervised model to detect SST anomalies. We then cross-validate its results against our Random Forest Regressor, using a time-series structure, to strengthen the conclusions we have drawn so far. We do this by comparing the timestamps of the detected SST anomalies with the prediction results from the Random Forest Regressor, which will be tested on the NOAA dataset.

Please keep in mind that the noaa_evaluation notebook is a mandatory prerequisite for this notebook, as we build on its fundamental learnings and results.

***Contents***

This notebook is broken down into three main sections: Feature Engineering, Unsupervised Training, and Cross-Validation of Models.

- Feature Engineering: This section builds the necessary feature variables that are used throughout the unsupervised training.

- Unsupervised Training: This section covers the selection and training of our unsupervised model, with which we attempt to accurately detect SST anomalies.

- Cross-Validation of Models: This section takes the model from the previous section, cross-validates its results against our Random Forest Regressor, and tests its predictive accuracy to further demonstrate the model's strength.


In [None]:
# Standard Libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import sys

sys.path.append('..')

# Custom Tools
from load import load_noaa_station_data
from utils import create_noaa_date_column, create_noaa_seasonal_column, convert_to_numeric

# Statistics
from scipy.stats import zscore

# Model selection, metrics and preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Unsupervised Models (anomaly detection and clustering)
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans, DBSCAN

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

First, let's read the dataset that we pulled and combined for multiple regions and stations in our Multi-Set Analysis (Section 2). It has already been run through the mini data transformation pipeline that we wrote, so the Season and Date columns are already appended, with the appropriate data type casts applied:

In [None]:
combined_regional_df = pd.read_csv('../../data/noaa/pulled_data/coral_reef_data_combined.csv')

print('Number of rows:', len(combined_regional_df))
print(combined_regional_df.describe())
print(combined_regional_df.dtypes)

Number of rows: 177043
                YYYY             MM             DD       SST_MIN  \
count  177043.000000  177043.000000  177043.000000  177043.00000   
mean     2004.700525       6.487887      15.722932      26.78480   
std        11.662090       3.452751       8.797549       1.96167   
min      1985.000000       1.000000       1.000000      17.09000   
25%      1995.000000       3.000000       8.000000      25.65000   
50%      2005.000000       6.000000      16.000000      27.23000   
75%      2015.000000       9.000000      23.000000      28.24000   
max      2025.000000      12.000000      31.000000      31.44000   

             SST_MAX    SST@90th_HS   SSTA@90th_HS      90th_HS>0  \
count  177043.000000  177043.000000  177043.000000  177043.000000   
mean       28.117110      27.775844       0.533198       0.187283   
std         1.601725       1.627570       0.623586       0.373325   
min        21.400000      21.240000      -2.350300       0.000000   
25%        27.11000

## Section 1: EDA Revision & Extended Analysis

### Sub-sections:

- Sub-section 1.1: EDA Revision

- Sub-section 1.2: Deeper Time-Series Analysis

### Sub-section 1.1: EDA Revision

***EDA Recap***

To give a short recap of our previous findings: in the Single-Set Analysis, and particularly in the distribution analysis of sub-section 1.3, we found a normal distribution when analysing SSTA@90th_HS>0.

We built on that finding by grouping the data by season and re-plotting the distribution for each season in the subsequent sub-section 1.4. From this analysis we were able to conclude that the SST anomaly distribution was in fact ***not*** influenced by seasonal changes.
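As a rough illustration of that seasonal comparison, the sketch below groups a synthetic anomaly column by season and compares the groups. The data, the season labels and the column name `SSTA` are assumptions made for the example, and the Kruskal-Wallis test is an addition here; the original analysis compared the re-plotted distributions visually.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(1)
seasons = ["Winter", "Spring", "Summer", "Autumn"]  # hypothetical labels

# Synthetic stand-in data (not NOAA data): every season shares one distribution.
df = pd.DataFrame({
    "Season": rng.choice(seasons, size=2000),
    "SSTA": rng.normal(loc=0.5, scale=0.6, size=2000),
})

# Per-season summary statistics; similar rows suggest little seasonal influence.
summary = df.groupby("Season")["SSTA"].agg(["count", "mean", "std"])
print(summary)

# Kruskal-Wallis H-test: a large p-value gives no evidence that the seasons differ.
groups = [values.to_numpy() for _, values in df.groupby("Season")["SSTA"]]
stat, p_value = kruskal(*groups)
print(f"H = {stat:.3f}, p = {p_value:.3f}")
```

On the real dataset, the same grouping would presumably be applied to the existing Season column and the anomaly column of interest.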

Another key insight from that notebook comes from sub-section 1.5, where we ran a time-series analysis and pinpointed the time delays, measured in days, at which these values had their maximum correlation with the BAA severity levels. We will leverage these results and build the majority of the next sub-section on them.
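To make the idea of a maximum-correlation time delay concrete, here is a minimal sketch on synthetic data. The helper `best_lag`, the series names and the 14-day delay are all hypothetical and chosen for illustration; this is not the code used in the earlier notebook, and the synthetic response is continuous rather than an ordinal severity level.

```python
import numpy as np
import pandas as pd

def best_lag(driver, response, max_lag=60):
    """Return (lag, correlation) for the lag in days that maximises the correlation."""
    correlations = {
        # Compare the driver from `lag` days earlier with today's response.
        lag: driver.shift(lag).corr(response)
        for lag in range(max_lag + 1)
    }
    lag = max(correlations, key=lambda k: correlations[k])
    return lag, correlations[lag]

# Synthetic stand-in data: the response follows the driver with a 14-day delay.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=400, freq="D")
ssta = pd.Series(rng.normal(size=400), index=dates).rolling(7, min_periods=1).mean()
severity = ssta.shift(14) + rng.normal(scale=0.05, size=400)

lag, corr = best_lag(ssta, severity)
print(f"best lag: {lag} days, correlation: {corr:.3f}")
```

Shifting the driver forward and correlating it with the unshifted response keeps the interpretation simple: a positive lag means the driver leads the response by that many days.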

***Building on Those Concepts***

In this section we leverage the findings from sub-sections 1.4 and 1.5 by trying to engineer a model that will help detect SST anomalies before they happen.

The target of this build is not only to further strengthen the conclusions drawn from our prior analysis, but also to build an extended model that allows individuals and scientists to predict SST anomalies before they occur, ultimately creating an early warning system for potential SST anomalies that we hope to combine with our RFR prediction model from the previous notebook.
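As a sketch of what unsupervised anomaly detection could look like, the example below fits scikit-learn's `IsolationForest` (one of the models imported at the top of this notebook) to synthetic data with an injected warm spike. The feature names, the `contamination` value and the data are assumptions for illustration, not the final model. The sketch only flags anomalies after they appear; an early warning system would additionally need features built from lagged values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic daily anomaly values (not NOAA data) with a short injected warm spike.
ssta = pd.Series(rng.normal(loc=0.5, scale=0.3, size=500))
ssta.iloc[300:305] += 3.0

# Hypothetical features: the current value and its change over the previous 7 days.
features = pd.DataFrame({
    "ssta": ssta,
    "ssta_change_7d": ssta.diff(7),
}).dropna()

# contamination is the assumed share of anomalous rows; 0.02 is an arbitrary choice.
model = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = model.fit_predict(features)  # -1 = flagged as anomalous, 1 = normal

flagged = features.index[labels == -1]
print(list(flagged))
```

The `contamination` parameter directly controls how many rows are flagged, so on real data it would need to be chosen with care, for example by comparing flagged dates against known bleaching events.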

Before we dive into the feature engineering of our unsupervised learning model, we need to revisit one part of the EDA from the first notebook: the time-series analysis from sub-sections 1.5 and 2.5. We must do this because it is the foundation of the model.

### Sub-section 1.2: Deeper Time-Series Analysis

In this revisit, we take a deeper dive into the time-series lags of the correlations that the values have, not only with the BAA severity levels, but also with each other.

As we will be focusing particularly on the SST anomalies within this section, we would like to do a time-series analysis around the

## Feature Engineering

Coming int

## Model Training
