# Train a Random Forest Classifier to predict next-day rain in Australia

## Executive Summary

In this notebook, I preprocess the data, split it into a training set and a test set, and then train and fine-tune a Random Forest Classifier to predict whether it will rain tomorrow based on today's weather readings. 

**Approach for hyperparameter optimisation**

To fine-tune an "out-of-the-box" Random Forest Classifier, the first step is to evaluate a wide range of values for each hyperparameter with RandomizedSearchCV. Doing so gives me a rough idea of the ideal hyperparameter ranges, based on the best parameters found through the random search. 

Given the ideal ranges to concentrate the search on, the next step is to explicitly specify every combination of hyperparameters to try with GridSearchCV. Instead of randomly sampling from a set of hyperparameter combinations (like RandomizedSearchCV), GridSearchCV evaluates every combination I define, and so returns the best-performing combination within that grid. 

**Results**

After fine-tuning hyperparameters with RandomizedSearchCV and GridSearchCV, I obtained a new model with a cross-validated accuracy of 0.916 on the resampled training set, slightly higher than the 0.909 of the "out-of-the-box" version. On the held-out test set, the tuned model reaches an accuracy of about 0.848. This new model has then been serialised, ready to be used for predictions. 

I have also documented several lessons and future improvements to facilitate learning about the project. 

## 1. Load required Python libraries and data

For this project, I am using 10 years of daily weather observations from many locations across Australia, obtained from [Kaggle](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package). The dataset has been saved to Google Cloud Storage to allow ease of access and to mimic a central data repository (e.g. a data lake) for a business. 

In [23]:
from google.cloud import storage
import os
import pandas as pd
import numpy as np
from numpy import asarray
from datetime import datetime
from random import randint

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")

# Data preparation
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ML model training
import sklearn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Model evaluation & selection
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

# Fine-tune model performance
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# Save ML model
import pickle

In [2]:
# Display all columns without truncation in dataframes
pd.set_option('display.max_columns', 500)

In [3]:
# Load weather observation data
weatherAUS_path = "gs://australiarain-aiplatform/input/weatherAUS.csv"
weather_df = pd.read_csv(weatherAUS_path)
weather_df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


## 2. Data Preprocessing

### 2.1. Date

In [4]:
# Convert Date column to the correct data type
weather_df['Date'] = pd.to_datetime(weather_df['Date'])

# Extract Month & Year from Date column
weather_df['Month'] = pd.DatetimeIndex(weather_df['Date']).month
weather_df['Year'] = pd.DatetimeIndex(weather_df['Date']).year

### 2.2. Handle missing values for numerical features

Weather observations vary greatly by location and season. Therefore, it would be flawed to impute missing numerical values with the mean of the entire population. I decided to impute missing values for all numerical features with the forward-fill method, since the data appears to be sorted by location and then by date, so the previous row is usually the previous day at the same location. 

In [5]:
# Forward fill missing values for all numerical features
numerical_cols = ['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 'Rainfall', 
                  'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 
                  'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm']
# DataFrame.ffill() is the non-deprecated equivalent of fillna(method='ffill')
weather_df[numerical_cols] = weather_df[numerical_cols].ffill()
weather_df.isnull().sum()

Date                 0
Location             0
MinTemp              0
MaxTemp              0
Rainfall             0
Evaporation       6049
Sunshine          6049
WindGustDir      10326
WindGustSpeed        0
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am         0
WindSpeed3pm         0
Humidity9am          0
Humidity3pm          0
Pressure9am          0
Pressure3pm          0
Cloud9am             0
Cloud3pm             2
Temp9am              0
Temp3pm              0
RainToday         3261
RainTomorrow      3267
Month                0
Year                 0
dtype: int64
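One likely reason some gaps remain is that forward fill can only copy a value from an earlier row, so a location whose first rows are empty stays empty. A plain `ffill` over the whole frame can also carry the last reading of one location into the first rows of the next. The sketch below uses a made-up `toy` frame (the locations and values are hypothetical) to compare a global forward fill with one restricted to each location:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two locations stacked one after the other
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B"],
    "Sunshine": [np.nan, 5.0, np.nan, np.nan, np.nan, 7.0],
})

# Global forward fill: the last value of A leaks into the first rows of B
global_fill = toy["Sunshine"].ffill()

# Per-location forward fill: values never cross a location boundary,
# so leading gaps inside each location remain missing
per_location_fill = toy.groupby("Location")["Sunshine"].ffill()

print(pd.DataFrame({"global": global_fill, "per_location": per_location_fill}))
```

Whether the per-location variant is preferable depends on how much cross-location carry-over is acceptable for the problem at hand.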

There are 2 records with missing Cloud3pm. I will remove them, since 2 records are insignificant relative to the size of the dataset. 

However, Evaporation and Sunshine still have many more missing values after applying the forward-fill method. Assuming that weather observations are similar at nearby locations, I will impute the remaining missing values for these numerical features with observations from neighbouring places. Specifically, Albury's incomplete observations will be imputed with Canberra's readings, and Badgerys Creek's null values will be replaced with Sydney's observations. 

In [6]:
# Remove 2 records having null Cloud3pm
weather_df = weather_df[weather_df['Cloud3pm'].notnull()]

In [7]:
# Extract observations at Canberra
canberra_observations = weather_df.query('Location == "Canberra"')[['Date','Evaporation', 'Sunshine']]

# Extract Albury's records
albury_observations = weather_df.query('Location == "Albury"')

# Merge the 2 dataframes on Date (inner join: only dates present in both are kept)
# and replace Albury's Evaporation and Sunshine with Canberra's readings
albury_merge = pd.merge(albury_observations, canberra_observations, on = 'Date')
albury_merge['Evaporation'] = albury_merge['Evaporation_y']
albury_merge['Sunshine'] = albury_merge['Sunshine_y']
albury_merge = albury_merge.drop(columns = ['Evaporation_x', 'Sunshine_x', 'Evaporation_y', 'Sunshine_y'])
albury_merge

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Month,Year,Evaporation,Sunshine
0,2008-12-03,Albury,12.9,25.7,0.0,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,8.0,2.0,21.0,23.2,No,No,12,2008,10.2,13.2
1,2008-12-04,Albury,9.2,28.0,0.0,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,8.0,2.0,18.1,26.5,No,No,12,2008,11.0,10.8
2,2008-12-05,Albury,17.5,32.3,1.0,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No,12,2008,6.6,8.1
3,2008-12-06,Albury,14.6,29.7,0.2,WNW,56.0,W,W,19.0,24.0,55.0,23.0,1009.2,1005.4,7.0,8.0,20.6,28.9,No,No,12,2008,6.4,9.4
4,2008-12-07,Albury,14.3,25.0,0.0,W,50.0,SW,W,20.0,24.0,49.0,19.0,1009.6,1008.2,1.0,8.0,18.1,24.6,No,No,12,2008,12.4,12.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3033,2017-06-21,Albury,1.2,15.2,0.4,ENE,15.0,,NNE,0.0,2.0,100.0,62.0,1029.4,1026.7,8.0,7.0,2.9,14.3,No,No,6,2017,1.6,2.8
3034,2017-06-22,Albury,0.8,13.4,0.0,W,17.0,S,,6.0,0.0,100.0,66.0,1029.4,1025.9,8.0,1.0,3.6,13.3,No,No,6,2017,1.6,2.8
3035,2017-06-23,Albury,1.1,11.9,0.0,SE,44.0,SSE,SSE,9.0,2.0,100.0,81.0,1022.3,1017.7,8.0,1.0,2.7,10.2,No,No,6,2017,1.6,2.8
3036,2017-06-24,Albury,1.1,14.1,0.2,WSW,28.0,SW,W,4.0,15.0,100.0,49.0,1018.8,1017.2,7.0,6.0,3.9,13.1,No,No,6,2017,1.6,2.8


In [8]:
# Extract observations at Sydney
sydney_observations = weather_df.query('Location == "Sydney"')[['Date','Evaporation', 'Sunshine']]

# Extract Badgerys Creek's records
badgery_observations = weather_df.query('Location == "BadgerysCreek"')

# Merge the 2 dataframes on Date (inner join: only dates present in both are kept)
# and replace Badgerys Creek's Evaporation and Sunshine with Sydney's readings
badgery_merge = pd.merge(badgery_observations, sydney_observations, on = 'Date')
badgery_merge['Evaporation'] = badgery_merge['Evaporation_y']
badgery_merge['Sunshine'] = badgery_merge['Sunshine_y']
badgery_merge = badgery_merge.drop(columns = ['Evaporation_x', 'Sunshine_x', 'Evaporation_y', 'Sunshine_y'])
badgery_merge

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Month,Year,Evaporation,Sunshine
0,2009-01-01,BadgerysCreek,13.3,34.2,0.0,W,61.0,NNE,,11.0,0.0,67.0,82.0,1005.6,1018.8,7.0,8.0,21.0,8.8,No,No,1,2009,9.8,12.9
1,2009-01-02,BadgerysCreek,14.7,26.1,0.0,SE,46.0,SE,SE,7.0,24.0,59.0,54.0,1012.9,1013.5,7.0,8.0,20.7,22.2,No,No,1,2009,11.0,5.9
2,2009-01-03,BadgerysCreek,13.6,22.3,0.0,NNE,30.0,ESE,NE,6.0,15.0,57.0,51.0,1021.9,1019.2,7.0,8.0,17.9,21.7,No,No,1,2009,9.0,0.5
3,2009-01-04,BadgerysCreek,17.7,31.2,0.0,NE,39.0,NNE,N,9.0,15.0,62.0,43.0,1018.7,1013.6,7.0,8.0,22.0,30.6,No,No,1,2009,5.4,11.3
4,2009-01-05,BadgerysCreek,15.5,38.8,0.0,SW,50.0,NNE,W,7.0,17.0,67.0,19.0,1013.2,1007.6,7.0,8.0,22.7,37.6,No,No,1,2009,10.0,12.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3004,2017-06-21,BadgerysCreek,4.1,19.0,0.2,SSE,26.0,SW,SE,6.0,11.0,99.0,61.0,1026.5,1025.7,7.0,8.0,8.5,16.4,No,No,6,2017,2.0,7.8
3005,2017-06-22,BadgerysCreek,6.8,18.3,0.0,SW,17.0,SW,NNW,11.0,6.0,92.0,55.0,1028.8,1024.6,7.0,8.0,10.7,17.9,No,No,6,2017,2.0,9.2
3006,2017-06-23,BadgerysCreek,3.8,16.8,0.2,SW,17.0,,N,0.0,7.0,100.0,64.0,1021.0,1015.1,7.0,8.0,6.8,16.0,No,No,6,2017,2.4,2.7
3007,2017-06-24,BadgerysCreek,2.7,18.8,0.0,SSW,24.0,NW,WSW,4.0,7.0,96.0,40.0,1017.7,1015.4,7.0,8.0,8.6,18.5,No,No,6,2017,1.4,9.3
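The two cells above use an inner merge, so days without a matching reading at the neighbouring location are dropped. A possible alternative, sketched below under my own assumptions, is a left merge that keeps every row of the target location and only fills values that are actually missing. The helper name `impute_from_neighbour` and the toy data are hypothetical and are not part of the original workflow.

```python
import numpy as np
import pandas as pd

def impute_from_neighbour(df, target, neighbour, cols):
    """Fill missing `cols` for `target` rows with same-day values from `neighbour`."""
    donor = df.loc[df["Location"] == neighbour, ["Date"] + cols]
    rows = df.loc[df["Location"] == target]
    # Left merge keeps every target row, even when the donor has no reading that day
    merged = rows.merge(donor, on="Date", how="left", suffixes=("", "_donor"))
    for col in cols:
        # Only fill gaps; existing target readings are kept
        merged[col] = merged[col].fillna(merged[col + "_donor"])
    return merged.drop(columns=[col + "_donor" for col in cols])

# Hypothetical toy data: Albury has 3 days, Canberra only 2
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                            "2020-01-01", "2020-01-02"]),
    "Location": ["Albury", "Albury", "Albury", "Canberra", "Canberra"],
    "Sunshine": [np.nan, 4.0, np.nan, 9.0, 8.0],
})

albury_filled = impute_from_neighbour(toy, "Albury", "Canberra", ["Sunshine"])
print(albury_filled)
```

In this sketch the first Albury day takes Canberra's value, the second keeps its own reading, and the third stays missing because Canberra has no observation for that date.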


In [9]:
# Delete all Albury and Badgerys Creek's records (currently having missing values)
weather_df = weather_df[weather_df['Sunshine'].notnull()]
weather_df.shape

# Append Albury and Badgerys Creek's records (with missing values imputed)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
weather_df = pd.concat([weather_df, albury_merge, badgery_merge], ignore_index = True)

weather_df.shape

(145458, 25)

### 2.3. Handle missing labels

In [10]:
# Remove all records without labels (i.e. RainTomorrow is null)
weather_df = weather_df[weather_df['RainTomorrow'].notnull()]

### 2.4. Choose 25% of the dataset for model training (to speed up the process) 

In [11]:
# Randomly sample 25% of the dataset to speed up the model training process
weather_df_condensed = weather_df.sample(frac = .25)
weather_df_condensed

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Month,Year
105249,2016-08-17,Albany,9.0,18.7,0.0,1.4,1.5,,33.0,NNE,,41.0,37.0,70.0,71.0,999.2,994.9,6.0,8.0,17.0,14.0,No,Yes,8,2016
42993,2008-12-20,Tuggeranong,8.0,21.4,0.0,1.6,2.8,SSE,41.0,SSW,NNE,13.0,13.0,57.0,36.0,1018.3,1016.2,8.0,8.0,11.8,20.9,No,No,12,2008
5928,2017-03-28,CoffsHarbour,20.0,29.0,0.0,3.8,3.2,NNE,39.0,NW,NNE,15.0,28.0,74.0,71.0,1013.6,1011.7,8.0,8.0,24.6,28.0,No,No,3,2017
76297,2012-10-10,Dartmoor,2.4,12.7,3.2,0.4,2.3,SW,57.0,,NNW,0.0,9.0,100.0,99.0,1012.9,1010.4,7.0,7.0,6.3,8.5,Yes,Yes,10,2012
36337,2017-01-16,Williamtown,17.9,29.7,0.0,47.0,7.4,,35.0,NE,ENE,17.0,31.0,67.0,52.0,1016.8,1013.4,8.0,8.0,25.0,29.0,No,No,1,2017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24421,2008-11-21,Sydney,16.7,27.7,0.2,7.0,11.8,,37.0,WSW,E,19.0,31.0,34.0,46.0,1001.2,996.4,1.0,3.0,22.1,23.7,No,No,11,2008
111187,2016-05-29,PearceRAAF,10.8,18.9,0.4,8.4,4.4,E,26.0,N,S,6.0,6.0,92.0,58.0,1019.5,1017.5,8.0,8.0,13.7,17.1,No,No,5,2016
114753,2008-12-12,Perth,14.0,25.8,0.0,6.8,13.1,SSW,39.0,SE,SW,4.0,22.0,63.0,47.0,1017.3,1015.9,1.0,1.0,19.1,23.8,No,No,12,2008
106963,2013-01-02,Witchcliffe,17.3,26.1,0.6,8.4,1.5,WNW,39.0,W,WSW,15.0,13.0,56.0,51.0,1013.0,1012.1,3.0,8.0,21.7,23.1,No,No,1,2013


### 2.5. Split dataset into a training set and a test set

In [12]:
# Separate label (y) and predicting features (X)
X = weather_df_condensed.drop(columns = ['RainTomorrow', 'Date'])
y = weather_df_condensed['RainTomorrow']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
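Because rainy days are the minority class, it may be worth checking that the split keeps the same share of rainy days in both sets. The sketch below is not what the cell above does; it shows the optional `stratify` argument of `train_test_split` on made-up labels (the 22% rain rate and the `X_toy`/`y_toy` arrays are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 days, 22% of them followed by rain
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array(["Yes"] * 220 + ["No"] * 780)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(f"Rain rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```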

### 2.6. Handle missing values and encoding for categorical values

As mentioned earlier, missing values for categorical features appear systematically. Therefore, I will treat missing values as another value and encode it accordingly. 

Most scikit-learn estimators, including the Random Forest Classifier used here, expect numeric inputs. This means that categorical features must be encoded as numbers before a model can be fitted and evaluated. 

I chose one-hot encoding for RainToday, which has only a few categories and no implied order. For Location and the three wind direction features (WindGustDir, WindDir9am and WindDir3pm), which have many more categories, I used binary encoding to keep the number of new columns small. 

In [13]:
# One-hot encoding with category_encoders library

encoder = ce.OneHotEncoder(cols = ['RainToday'], return_df = True)
X_train_transformed = encoder.fit_transform(X_train, y_train)
X_test_transformed = encoder.transform(X_test)
X_train_transformed

# Binary encoding with category_encoders library
encoder = ce.BinaryEncoder(cols = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm'], return_df = True)
X_train_transformed = encoder.fit_transform(X_train_transformed, y_train)
X_test_transformed = encoder.transform(X_test_transformed)
X_train_transformed

Unnamed: 0,Location_0,Location_1,Location_2,Location_3,Location_4,Location_5,Location_6,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir_0,WindGustDir_1,WindGustDir_2,WindGustDir_3,WindGustDir_4,WindGustDir_5,WindGustSpeed,WindDir9am_0,WindDir9am_1,WindDir9am_2,WindDir9am_3,WindDir9am_4,WindDir9am_5,WindDir3pm_0,WindDir3pm_1,WindDir3pm_2,WindDir3pm_3,WindDir3pm_4,WindDir3pm_5,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday_1,RainToday_2,RainToday_3,Month,Year
30785,0,0,0,0,0,0,1,19.1,38.5,0.0,8.6,10.3,0,0,0,0,0,1,31.0,0,0,0,0,0,1,0,0,0,0,0,1,9.0,13.0,43.0,17.0,1018.8,1013.8,6.0,7.0,27.1,36.9,1,0,0,11,2009
2951,0,0,0,0,0,1,0,5.3,21.3,0.0,11.0,3.9,0,0,0,0,1,0,31.0,0,0,0,0,1,0,0,0,0,0,1,0,9.0,11.0,72.0,39.0,1024.2,1021.0,7.0,4.0,12.3,20.9,1,0,0,4,2017
8643,0,0,0,0,0,1,1,9.6,18.9,0.0,1.0,10.6,0,0,0,0,1,1,44.0,0,0,0,0,1,1,0,0,0,0,0,1,17.0,30.0,71.0,50.0,1012.5,1010.6,8.0,8.0,12.8,17.7,1,0,0,6,2016
76414,0,0,0,0,1,0,0,11.0,20.2,0.0,3.2,8.1,0,0,0,1,0,0,35.0,0,0,0,1,0,0,0,0,0,0,1,1,9.0,19.0,86.0,51.0,1029.7,1028.0,7.0,7.0,12.6,19.3,1,0,0,4,2013
21782,0,0,0,0,1,0,1,8.8,25.8,0.2,3.4,8.0,0,0,0,1,0,1,48.0,0,0,0,1,0,1,0,0,0,1,0,0,2.0,33.0,76.0,55.0,1016.3,1013.4,7.0,2.0,16.4,23.7,1,0,0,10,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4907,0,1,0,1,0,1,0,12.9,20.6,15.2,3.8,3.2,0,0,1,1,0,0,35.0,0,0,1,1,0,1,0,0,1,1,0,1,19.0,9.0,65.0,53.0,1013.8,1013.0,7.0,7.0,16.2,19.9,0,1,0,6,2014
23147,0,0,0,0,1,0,1,12.0,19.8,0.0,2.9,8.0,0,0,0,1,0,1,35.0,0,0,1,0,1,0,0,0,1,0,0,1,22.0,20.0,63.0,66.0,1029.1,1029.5,8.0,8.0,18.7,18.2,1,0,0,10,2014
139681,0,0,0,1,1,1,0,6.3,11.1,13.4,4.0,9.4,0,0,0,0,0,1,56.0,0,0,0,1,1,1,0,0,0,0,0,1,26.0,20.0,77.0,57.0,1014.2,1015.2,8.0,8.0,6.7,10.7,0,1,0,8,2009
128849,0,1,0,1,1,1,1,11.4,20.8,1.6,0.8,5.4,0,0,1,0,1,0,19.0,0,0,1,0,1,1,0,0,0,1,1,1,7.0,2.0,91.0,76.0,1018.5,1013.5,7.0,1.0,16.1,19.4,0,1,0,4,2014
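To make the difference between the two encodings concrete, here is a hand-rolled sketch of the idea behind binary encoding. It is an illustration only and does not reproduce the exact column layout of `category_encoders.BinaryEncoder`; the function `binary_encode` is hypothetical. Each category is mapped to an integer (0 is reserved for missing values), and the integer is then written out as bits.

```python
import pandas as pd

def binary_encode(series, prefix):
    """Encode a categorical Series as binary digits, one column per bit."""
    categories = sorted(series.dropna().unique())
    # Code 0 is reserved for missing values, real categories start at 1
    codes = {category: i + 1 for i, category in enumerate(categories)}
    n_bits = max(1, len(categories).bit_length())
    ints = series.map(codes).fillna(0).astype(int).to_numpy()
    columns = {
        f"{prefix}_{bit}": (ints >> (n_bits - 1 - bit)) & 1
        for bit in range(n_bits)
    }
    return pd.DataFrame(columns, index=series.index)

directions = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
wind = pd.Series(directions + [None])  # 16 compass points plus one missing value

encoded = binary_encode(wind, "WindDir")
one_hot = pd.get_dummies(wind)

print("Binary-encoded columns:", encoded.shape[1])
print("One-hot columns:", one_hot.shape[1])
```

With 16 compass directions, one-hot encoding needs 16 columns, whereas this binary scheme needs 5 and still gives every category (and the missing marker) a distinct bit pattern.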


### 2.7. Feature Scaling

In [14]:
scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test_transformed)
X_train_transformed

array([[ 0.        , -0.76136029, -0.7475876 , ..., -0.10176703,
         1.34110535, -1.47824954],
       [ 0.        , -0.76136029, -0.7475876 , ..., -0.10176703,
        -0.69930628,  1.6632862 ],
       [ 0.        , -0.76136029, -0.7475876 , ..., -0.10176703,
        -0.11633153,  1.27059423],
       ...,
       [ 0.        , -0.76136029, -0.7475876 , ..., -0.10176703,
         0.46664322, -1.47824954],
       [ 0.        ,  1.31343861, -0.7475876 , ..., -0.10176703,
        -0.69930628,  0.48521029],
       [ 0.        ,  1.31343861, -0.7475876 , ..., -0.10176703,
        -1.28228103, -0.69286561]])
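As a small illustration of why the scaler is fitted on the training set only, the sketch below standardises made-up numbers (the arrays are hypothetical). The test values are transformed with the training mean and standard deviation, so they are not centred on zero themselves.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

# Fit on the training set only, then reuse the same statistics for the test set
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print("Training mean used for scaling:", scaler.mean_[0])
print("Scaled test values:", test_scaled.ravel())
```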

### 2.8. Correct imbalanced dataset by oversampling with SMOTE

It is observed that the number of days with RainTomorrow = Yes is much lower than that of RainTomorrow = No. Such an imbalanced dataset could potentially explain the poor prediction performance for RainTomorrow = Yes indicated in the confusion matrix across all ML models. 

One approach to address imbalanced datasets is to oversample the minority class. There are 2 methods to oversample the minority class. 
1. Duplicate examples in the minority class: simple to implement but does not add any new information to the model 
2. Synthesise new examples from the existing samples with Synthetic Minority Oversampling Technique (SMOTE)

I decided to go with SMOTE because it adds new information to the model, which hopefully translates to better performance. To avoid data leakage (when information that would not be available at prediction time is used when building the model), I will only resample the training dataset instead of the entire dataset. 

In [16]:
X_resampled, y_resampled = SMOTE().fit_resample(X_train_transformed, y_train)
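For intuition, the core idea of SMOTE is to create a new minority sample somewhere on the line segment between an existing minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that interpolation step, not the imbalanced-learn implementation; the function `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest neighbours
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    # Move a random fraction of the way from the base point to the neighbour
    gaps = rng.random((n_new, 1))
    return X_minority[base_idx] + gaps * (X_minority[neigh_idx] - X_minority[base_idx])

X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0], [2.5, 2.0]])
synthetic = smote_like_samples(X_min, n_new=10)
print(synthetic.round(2))
```

Every synthetic point lies between two real minority samples, which is why SMOTE adds variation that simple duplication does not.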

## 3. Recap: Experiment with different binary classification models

In [16]:
seed = 7
models = []
models.append(('SGD', SGDClassifier()))
models.append(('LOG', LogisticRegression()))
models.append(('RDF', RandomForestClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('GDT', GradientBoostingClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('SVM', SVC()))

results = []
names = []
scoring = 'accuracy'
for name, model in models: 
    kfold = KFold(n_splits = 10, random_state = seed, shuffle = True)
    cv_results = cross_val_score(model, X_train_transformed, y_train, cv = kfold, scoring = scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

SGD: 0.837014 (0.006097)
LOG: 0.842623 (0.005233)
RDF: 0.850640 (0.003497)
ADA: 0.840829 (0.004276)
GDT: 0.847458 (0.004427)
KNN: 0.813876 (0.004152)
GNB: 0.795748 (0.005746)
SVM: 0.853822 (0.005198)


Based on accuracy, simpler models such as Gaussian Naive Bayes, K-Nearest Neighbors and Stochastic Gradient Descent Classifier performed worse than the more sophisticated methods. Out of the 8 models that I have experimented with, Random Forest Classifier and Support Vector Machine resulted in the highest accuracy. However, SVM took a lot longer to train than Random Forest. 

To optimise for speed, I would prioritise fine-tuning Random Forest Classifier first.

## 4. Fine-tune Random Forest Classifier

I will use RandomizedSearchCV and GridSearchCV to further fine-tune my hyperparameters for the Random Forest model. The code and the approach are adapted from [this article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) written by Will Koehrsen.

**Overall approach for fine-tuning**

According to the article, since I do not have a concrete idea of the best hyperparameters for Random Forest Classifier, the first step is to evaluate a wide range of values for each hyperparameter with RandomizedSearchCV. Doing so gives me a rough idea of the ideal hyperparameter ranges, based on the best parameters found through the random search. 

Given the ideal ranges to concentrate the search on, the next step is to explicitly specify every combination of hyperparameters to try with GridSearchCV. Instead of randomly sampling from a set of hyperparameter combinations (like RandomizedSearchCV), GridSearchCV evaluates every combination I define, and so returns the best-performing combination within that grid. 

**Diminishing returns and when to stop fine-tuning**

Before fine-tuning, it is noted that accuracy, precision and recall of the existing Random Forest model are 0.90 or above. Such performance is acceptable for a general rain forecast solution. 

Although I can continue trying different combinations of hyperparameters to improve the model performance, there will come a point where hyperparameter tuning reaches diminishing returns. In other words, the performance improvement will not be worth the time and effort spent on fine-tuning. Therefore, if the model performance does not improve by more than 2% after the first round of RandomizedSearchCV and GridSearchCV, I will stop fine-tuning. 

### 4.1. Initial state of Random Forest Classifier

To monitor the performance improvement owing to fine-tuning, I will first establish a benchmark by calculating various metrics for the default Random Forest Classifier. 

In [17]:
# Calculate accuracy for the base model 
start_time = datetime.now()
forest_clf = RandomForestClassifier(random_state = 42)
rdf_baseline = forest_clf.fit(X_resampled, y_resampled)
y_train_pred = cross_val_predict(forest_clf, X_resampled, y_resampled, cv = 10)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

base_model_accuracy = accuracy_score(y_resampled, y_train_pred)
base_model_accuracy

Total running time: 235.977824


0.909326953603338
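One caveat about this benchmark, offered as a possibility rather than a diagnosis: `cross_val_predict` runs here on data that has already been resampled, so synthetic points derived from a validation sample's neighbours can end up in the training folds, which tends to make the score optimistic. A more conservative variant resamples inside each fold and validates on untouched data. The sketch below shows the pattern on toy data, using simple random duplication instead of SMOTE to stay dependency-free; `oversample_minority` and the toy dataset are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes have the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra_idx = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])

# Hypothetical imbalanced dataset (roughly 80% / 20%)
X_toy, y_toy = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X_toy, y_toy):
    # Resample the training fold only; the validation fold keeps its original balance
    X_bal, y_bal = oversample_minority(X_toy[train_idx], y_toy[train_idx], rng)
    clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_bal, y_bal)
    fold_scores.append(accuracy_score(y_toy[val_idx], clf.predict(X_toy[val_idx])))

print("Mean accuracy on untouched validation folds:", np.mean(fold_scores))
```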

### 4.2. Narrow the search with RandomizedSearchCV 

Although there are many hyperparameters that could be fine-tuned, based on [scikit-learn documentation](https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters), I will focus on fine-tuning the following 6 parameters for Random Forest Classifier.

Below is a quick recap on what each hyperparameter means according to [this article](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/).

**1. n_estimators: the number of trees in the forest**
- The larger the better, but also the longer it will take to compute.
- Results will stop getting significantly better beyond a critical number of trees

**2. max_features: the size of the random subsets of features to consider when splitting a node**
- Sets the maximum number of features considered when looking for the best split at each node.
- The lower the greater the reduction of variance, but also the greater the increase in bias. 
- Empirically good defaults, according to the scikit-learn documentation, are max_features=None (all features) for regression and max_features="sqrt" (a random subset of size sqrt(n_features)) for classification.

**3. max_depth: the maximum number of levels in each decision tree**
- Can also be defined as the longest path between the root node and a leaf node
- As the max depth of the decision tree increases, performance on the test set increases initially, but after a certain point it starts to decrease. This is because the tree starts to overfit the training set, hence the model is not able to generalise to the unseen points in the test set. 

**4. min_samples_split: the minimum number of data points placed in the node before the node is split**
- A node that holds at least min_samples_split observations and is not pure can be split further into subnodes
- The default value is 2. However, a value of 2 means that a tree often keeps on splitting until the nodes are completely pure. As a result, the tree grows in size and therefore overfits the data.
- By increasing the value of min_samples_split, we can reduce the number of splits that happen in the decision tree and therefore prevent the model from overfitting. However, if the value is too high, the model will start to underfit. 

**5. min_samples_leaf: the minimum number of data points required in a leaf node**
- Specifies the minimum number of samples that should be present in the leaf node after splitting a node. 
- As the hyperparameter increases, this helps to prevent overfitting. However, if the value is too high, the model would drift towards the realm of underfitting.

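**6. bootstrap: method for sampling data points**
- If True, each tree is trained on a bootstrap sample drawn with replacement from the training set. If False, the whole training set is used to build each tree.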
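To see how the depth and leaf-size settings constrain a tree, the sketch below fits two single decision trees on a noisy toy dataset (the dataset and the chosen values are assumptions for illustration). The unconstrained tree keeps splitting until its leaves are pure, while the constrained one is stopped early.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: 10% of the labels are flipped at random
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)

# Default settings: the tree grows until every leaf is pure
loose = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Constrained tree: at most 5 levels and at least 20 samples per leaf
strict = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=20, random_state=0
).fit(X_toy, y_toy)

print("Unconstrained tree: depth", loose.get_depth(), "leaves", loose.get_n_leaves())
print("Constrained tree:   depth", strict.get_depth(), "leaves", strict.get_n_leaves())
print("Training accuracy:", loose.score(X_toy, y_toy), "vs", strict.score(X_toy, y_toy))
```

The unconstrained tree fits the training data almost perfectly, noise included, which is the overfitting behaviour that the hyperparameters above are meant to limit.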

In [18]:
# Create a parameter grid to sample from during fitting

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1400, num = 7)]
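# Number of features to consider at every split
# Note: for classifiers 'auto' means the same as 'sqrt'; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, so newer versions only accept 'sqrt' here
max_features = ['auto', 'sqrt']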
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 70, num = 7)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 3, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, None], 'min_samples_split': [5, 10, 15], 'min_samples_leaf': [2, 3, 4], 'bootstrap': [True, False]}
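As a quick sanity check on the size of the search, the snippet below counts the combinations in the random grid above and in the narrower grid used later for GridSearchCV. Both dictionaries are copied from this notebook; only the counting code is new.

```python
from math import prod

random_grid = {
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400],
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, None],
    "min_samples_split": [5, 10, 15],
    "min_samples_leaf": [2, 3, 4],
    "bootstrap": [True, False],
}
param_grid = {
    "bootstrap": [False],
    "max_depth": [50, 60, 70],
    "max_features": ["auto"],
    "min_samples_leaf": [2, 3],
    "min_samples_split": [4, 5, 6],
    "n_estimators": [600, 800, 1000, 1200],
}

# Number of distinct combinations = product of the number of options per hyperparameter
n_random = prod(len(values) for values in random_grid.values())
n_grid = prod(len(values) for values in param_grid.values())

print("Combinations in the random grid:", n_random)
print("Share covered by 100 random draws:", round(100 / n_random, 3))
print("Combinations in the narrowed grid:", n_grid)
```

The random grid holds 2,016 combinations, so the 100 sampled candidates cover roughly 5% of it, while the narrowed grid holds 72 combinations that GridSearchCV evaluates exhaustively.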


In [19]:
# Use the random grid to search for best hyperparameters

# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

start_time = datetime.now()
# Fit the random search model
rf_random.fit(X_resampled, y_resampled)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

# View the best parameters from fitting the random search
rf_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Total running time: 4398.513673


{'n_estimators': 1000,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 50,
 'bootstrap': False}
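`best_params_` only reports the single winning candidate. To judge which ranges are worth a closer look, it can help to inspect the runner-up candidates as well, which are available in the `cv_results_` attribute of the fitted search. The sketch below demonstrates this on a small toy search (the dataset and the deliberately tiny grid are assumptions chosen to keep it fast).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical toy data and a deliberately tiny grid
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
toy_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=toy_grid,
    n_iter=5, cv=3, random_state=42,
)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best first
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
top = results[["params", "mean_test_score", "std_test_score"]].head(3)
print(top.to_string(index=False))
```

If the top few candidates share similar values for a hyperparameter, that is a reasonable hint for where to centre the subsequent grid search.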

In [20]:
# Calculate accuracy score for the best parameters obtained from RandomizedSearchCV

start_time = datetime.now()
forest_clf = RandomForestClassifier(random_state = 42
                                    , n_estimators = 1000
                                    , max_depth = 50
                                    , min_samples_split = 5
                                    , min_samples_leaf = 2
                                    , max_features = 'auto'
                                    , bootstrap = False)
rdf_random_search = forest_clf.fit(X_resampled, y_resampled)
y_train_pred = cross_val_predict(forest_clf, X_resampled, y_resampled, cv = 10)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

random_search_accuracy = accuracy_score(y_resampled, y_train_pred)
random_search_accuracy

Total running time: 3635.506409


0.9144178874325366

### 4.3. Implement GridSearchCV

In [21]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [50, 60, 70],
    'max_features': ['auto'],
    'min_samples_leaf': [2, 3],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [600, 800, 1000, 1200]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [24]:
# Fit the grid search to the data
start_time = datetime.now()
grid_search.fit(X_resampled, y_resampled)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

# View the best parameters from fitting the grid search
grid_search.best_params_
best_grid = grid_search.best_estimator_

Fitting 3 folds for each of 72 candidates, totalling 216 fits
Total running time: 4809.46868


In [25]:
best_grid

RandomForestClassifier(bootstrap=False, max_depth=50, min_samples_leaf=2,
                       min_samples_split=4, n_estimators=1200)

In [26]:
# Calculate accuracy score for the best parameters obtained from GridSearchCV

start_time = datetime.now()
forest_clf = best_grid
rdf_grid_search = forest_clf.fit(X_resampled, y_resampled)
y_train_pred = cross_val_predict(forest_clf, X_resampled, y_resampled, cv = 10)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

grid_search_accuracy = accuracy_score(y_resampled, y_train_pred)
grid_search_accuracy

Total running time: 4430.081335


0.915687786294163

In [28]:
print("Base Model Accuracy: ", base_model_accuracy)
print("RandomSearchCV's Best Model Accuracy: ", random_search_accuracy)
print("GridSearchCV's Best Model Accuracy: ", grid_search_accuracy)

Base Model Accuracy:  0.909326953603338
RandomSearchCV's Best Model Accuracy:  0.9144178874325366
GridSearchCV's Best Model Accuracy:  0.915687786294163


## 5. Evaluate Random Forest models with test data

Before deciding on an optimal Random Forest model, let's see how the 2 new models perform on unseen data. I am using the [score method](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score), which returns the mean accuracy on the given test data and labels. 

In [31]:
# Check whether the RandomSearch model is overfitting
rdf_random_search.score(X_test_transformed, y_test)

0.8479606188466948

In [32]:
# Check whether the GridSearch model is overfitting
rdf_grid_search.score(X_test_transformed, y_test)

0.8478199718706048

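The cross-validated accuracy on the resampled training set (about 0.915) is higher than the accuracy on the test set (about 0.848). The two figures are not directly comparable, because cross-validation was run on SMOTE-resampled data with balanced classes, while the test set keeps the original class distribution; part of the gap may also reflect some overfitting. The two tuned models score almost identically on the test set, and fine-tuning with RandomizedSearchCV and GridSearchCV does not increase the accuracy to a great extent. Therefore, I will opt to retain the Random Forest model obtained from GridSearchCV as the best model to be used for predictions. 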
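Accuracy alone can hide poor performance on the minority class, so it may be worth reporting recall and a confusion matrix on the test set as well. The sketch below uses made-up labels (the 78/22 split is a hypothetical example, not a result from this notebook) to show how a model that never predicts rain can still reach a high accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test labels: 78 dry days and 22 rainy days
y_true = np.array(["No"] * 78 + ["Yes"] * 22)
# A trivial model that always predicts "No"
y_pred = np.array(["No"] * 100)

accuracy = accuracy_score(y_true, y_pred)
rain_recall = recall_score(y_true, y_pred, pos_label="Yes")
matrix = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])

print("Accuracy:", accuracy)
print("Recall for RainTomorrow = Yes:", rain_recall)
print("Confusion matrix (rows = actual, columns = predicted):")
print(matrix)
```

The same three calls could be applied to the predictions of the tuned model on `X_test_transformed` and `y_test` to complement the accuracy scores above.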

In real life, as more data is collected, it is advisable to build a full-fledged ML pipeline that automates the steps of the ML life cycle. In other words, when new training data becomes available, it should trigger a workflow that includes data validation, preprocessing, model training, analysis, and deployment. 

## 6. Save Random Forest model using Pickle

Since I now have a Random Forest model to predict next-day rain, the next step is to serialise the trained model and save the serialised format to a file. In doing so, I can later load this file to deserialise my model and use it to make new predictions. The tool I will be using is Python's pickle module. 

In [33]:
filename = 'randomforestmodel.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rdf_grid_search, f)

The best Random Forest model is serialised and saved as the file randomforestmodel.sav, which can later be deserialised to make new predictions with unseen data. Instructions and code on how to do so can be found in [this article](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/).
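
As a rough sketch of the round trip, the example below pickles a small stand-in model, loads it back and uses it for predictions. The synthetic data, the tiny forest and the temporary directory are assumptions made so the example runs on its own. In this notebook the object being saved is `rdf_grid_search` and the inputs would be preprocessed weather readings.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (assumption: the real model is rdf_grid_search).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Write to a temporary directory so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), "randomforestmodel.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another session: load the model and predict.
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

predictions = loaded_model.predict(X[:5])
print(predictions)

# The loaded model should behave exactly like the one that was saved.
same = bool((loaded_model.predict(X) == model.predict(X)).all())
print("Loaded model matches original:", same)
```

Two caveats apply when reusing a pickled model: only unpickle files from a source you trust, and load the file with the same scikit-learn version that was used to save it.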

## 7. Lessons Learned and Recommended Future Improvements

### 7.1. Lessons Learned

Looking back at this ML project, which I started on 10 April 2021, I have picked up some crucial lessons along the way about this binary classification problem. Do take them with a pinch of salt because the world of Machine Learning and Data Science is so diverse and nuanced. 

1. Exploratory Data Analysis (EDA)

    - Starting with EDA is a must to obtain a solid background about the problem and the dataset. There is no point in wasting time on the wrong problem and/or the wrong dataset.
    - EDA should be done with a clear purpose and structure: 
        - Load data & basic exploration (data types, descriptive statistics, null values): Is the dataset fit for purpose? Does it need to be preprocessed?
        - Univariate analysis: It's all about data characteristics and data quality
            - Categorical features: missing values, invalid & inconsistent values to be fixed
            - Numerical features: skewness, outliers, missing values, invalid values (based on max and min)
        - Bivariate analysis and multivariate analysis: It's all about the predictive power. Which features should be retained, discarded or kept in view (KIV) for further feature engineering?
        - Key takeaways and next steps: Should we proceed with the given task and dataset? If yes, how do we proceed with the task?


2. Model Selection for Binary Classification Problem

    - Remember to look beyond accuracy as the only evaluation metric for a binary classification model. Alternatives include precision, recall, F1 score and ROC AUC. Select the evaluation metric based on the business context (is it more important to minimise false positives or false negatives?).
    - Always glance through the confusion matrix.
    - Simple models are much faster to train but might result in lower accuracy. Conversely, sophisticated ensemble methods are slow to train but their prediction performance could be superior. Therefore, it is always a matter of trying things out and deciding which factor is more important (speed vs accuracy, simplicity vs sophistication). 


3. Hyperparameter Fine-tuning

    - It is all about controlled experiments and knowing when to stop (because you can go on forever with fine-tuning and end up with an overfitted model). 
    - Define a threshold for performance improvement. If further fine-tuning no longer improves performance by more than the threshold, halt the fine-tuning because you are probably hitting the point of diminishing returns. This is very similar to the concept of early stopping in BigQuery ML. 
    - Start with RandomizedSearchCV to narrow down a reasonable range of hyperparameter values. Then use GridSearchCV to identify the best performing model. 
    - As it simply took too long for my local machine to iterate through different combinations of hyperparameters for RandomizedSearchCV and GridSearchCV, I used 2 options to rescue myself: 
        - Migrate the workload to a cloud platform (e.g. AWS, GCP, Azure). Of course it will cost some money if you need it frequently, but it is definitely faster and more scalable than running it on a bigger machine. For a one-off project, free trials from cloud platforms could be more than sufficient. In my case, Jupyter Notebooks on Google AI Platform absolutely rock (fast to provision, JupyterLab and standard Python libraries readily installed, plus direct linkage to GitHub).
        - Reduce the volume of data used for training. For example, randomly sample 25% of the data and use it to train the model. But again, this is a deliberate trade-off of accuracy for speed. 
        

4. Evaluate the model with test data

    - After spending all the time and effort to select and fine-tune an ML model, evaluating it with test data is a must to check for underfitting or overfitting. 
    - Stick to the same evaluation metric. It does not make sense to choose an "out-of-the-box" model for its accuracy and then evaluate the same model on test data using F1 score.
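
The points above about looking beyond accuracy and sticking to the same metrics can be combined in one small helper that reports several metrics at once and is applied unchanged to both the training predictions and the test predictions. This is only a sketch: the `report` function is a hypothetical helper, and the labels, predictions and probabilities below are made-up values rather than results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels and predictions (1 = rain tomorrow, 0 = no rain).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
# Hypothetical predicted probabilities of rain, needed for ROC AUC.
y_score = [0.1, 0.2, 0.1, 0.3, 0.4, 0.7, 0.9, 0.8, 0.4, 0.3]


def report(y_true, y_pred, y_score):
    """Return the confusion matrix counts and several metrics in one dict."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tn": int(tn),
        "fp": int(fp),
        "fn": int(fn),
        "tp": int(tp),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


metrics = report(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

In this toy example the accuracy is 0.7 while the recall is only 0.5: half of the rainy days are missed even though most predictions are correct. Accuracy alone would hide that.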
     

### 7.2. Recommended Future Improvements

Here are a few things that I wish I had done and would love to revisit when time permits. 
1. Experiment with further feature engineering
2. Leverage more functions to streamline my code and make it easier to understand
3. Experiment with fine-tuning a Support Vector Machine model
4. Build an end-to-end ML pipeline to productionise the ML model and build a rain forecast app with weather observation data streamed from the Bureau of Meteorology (BOM) website or other sources