<a href="https://colab.research.google.com/github/G1useppe/ma5851_capstone_T0CE/blob/main/MACHINE_LEARNING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Setup

In [None]:
from google.colab import drive #mounting google drive to access processing output

import pandas as pd #general
import numpy as np
from pprint import pprint

from sklearn.ensemble import RandomForestRegressor as rfR #machine learning
from sklearn.model_selection import train_test_split as tts 
from sklearn.model_selection import RandomizedSearchCV as rsCV
from sklearn import metrics

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


To commence the third and final part of this workflow, we can bring in the data generated from our cleaning, wrangling and feature extraction.

In [None]:
path = '/content/drive/MyDrive/feature_data.xlsx'
feature_data = pd.read_excel(path, na_values = 'NaN')
feature_data = feature_data.drop(["Unnamed: 0"], axis = 1)
feature_data.head()

Unnamed: 0,Salary FTE,Capital,State_ACT,State_NSW,State_NT,State_QLD,State_SA,State_TAS,State_VIC,State_WA,...,desc_helping,desc_pipeline,desc_data management,desc_user,desc_insurance,desc_bachelor degree,desc_aws,desc_marketing,desc_energy,desc_sale
0,103308.0,False,0,1,0,0,0,0,0,0,...,0.0596,0.0,0.0,0.0,0.054629,0.0,0.0,0.0,0.0,0.0
1,109738.5,False,0,0,0,0,0,0,1,0,...,0.0,0.0,0.091901,0.071474,0.0,0.0,0.0,0.0,0.0,0.0
2,64924.5,False,0,0,0,0,0,0,1,0,...,0.0,0.076432,0.0,0.075535,0.0,0.0,0.0,0.0,0.0,0.0
3,79186.8,False,0,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079768,0.0
4,76648.0,True,0,0,0,0,0,0,1,0,...,0.0,0.0,0.45719,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# 4.1 Machine Learning Structure

## Train/Test Split

There was no need to deviate from conventional wisdom for the train/test split, so an 80/20 split was used.

In [None]:
salary = feature_data[feature_data["Salary FTE"].notnull()]
no_salary = feature_data[feature_data["Salary FTE"].isnull()]

In [None]:
y = list(salary["Salary FTE"])
X = salary.drop(["Salary FTE"], axis = 1) 

In [None]:
X_train, X_test, y_train, y_test = tts(X, y, test_size = 0.2, random_state = 52)

## 4.1.1 Random Forest Selection Rationale and Structure

The random forest regressor was selected as the machine learning model because of its strong performance in the surveyed literature. In the work of Jackman & Reid (xxxx), it is the strongest performing model among the five regressors. In the work of Martin et al. (xxxx), the random forest outperforms all other models in a classification problem with the exception of voting classifiers, which are outside the scope of the course material. Additionally, Biau and Scornet (2015) praise the random forest algorithm for its 'ability to deal with small sample sizes and high-dimensional feature spaces', which will be the case for the testing data set.

To understand the structure of the random forest, we must first understand the structure of a decision tree. In a machine learning context, a decision tree is composed of nodes that are trained to split observations so that the resulting groups are as dissimilar to each other as possible. The number of nodes that make up a tree, and the manner in which the nodes behave, are constrained by user-specified parameters. A random forest is a collection of decision trees that operate as a community to make a decision. In the case of our regressor, each tree in the forest makes a prediction for the case it is passed, and the random forest returns the average prediction across the trees. The underlying concept is a simple one: strength in numbers. It is key to the process that the decision trees are grown separately and are largely uncorrelated, so that each tree can sample a fraction of the data, grow, and contribute to the community aggregation.
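
The averaging step described above can be checked directly. The following is a minimal sketch on synthetic data (the data set and the variable names are assumptions for illustration, not the salary features): it collects the prediction of every fitted tree and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the salary features (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is exposed through `estimators_`; build one row of predictions per tree.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# The forest's regression output is the arithmetic mean across its trees.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)
print(np.allclose(manual_average, forest_prediction))
```

Individual rows of `per_tree` differ from one another, which is the diversity the aggregation relies on.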

The software to implement the machine learning will be RandomForestRegressor from scikit-learn. 

## 4.1.2 Hyperparameters

### Default Random Forest

To set a baseline for our machine learning investigation, and more importantly for our hyperparameter optimization, a model was produced using the default parameters from sklearn. The default parameters were printed and the model was subsequently trained on the training set.

In [None]:
default_rf = rfR()
pprint(default_rf.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [None]:
default_rf.fit(X_train, y_train)

RandomForestRegressor()

The helper function below was written to avoid repetition and to generate some data to add to a data frame for the evaluation.

In [None]:
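def evaluator(model, X, y):
  # Report MAE and "accuracy" (100 minus the mean absolute percentage error) for a fitted model
  y = np.asarray(y)
  predictions = model.predict(X)
  errors = np.abs(predictions - y)   # absolute error per observation, in AUD
  mae = np.mean(errors)              # mean absolute error
  mape = 100 * np.mean(errors / y)   # mean absolute percentage error
  accuracy = 100 - mape
  print("Accuracy: " + str(round(accuracy, 2)) + "%")
  print("Mean Absolute Error: AUD " + str(round(mae, 2)))
  print("These results have been saved in a list")
  return [str(model), X, accuracy, mae]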
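
To make the two reported figures concrete, here is a small worked example of the same arithmetic. The three salaries and predictions are hypothetical values chosen so the numbers are easy to follow; they are not taken from the data set.

```python
import numpy as np

# Hypothetical salaries and predictions in AUD (illustrative values only).
actual = np.array([100_000.0, 80_000.0, 50_000.0])
predicted = np.array([90_000.0, 84_000.0, 50_000.0])

errors = np.abs(predicted - actual)      # 10000, 4000, 0
mae = errors.mean()                      # (10000 + 4000 + 0) / 3
mape = 100 * np.mean(errors / actual)    # 100 * (0.10 + 0.05 + 0.00) / 3
accuracy = 100 - mape                    # "accuracy" as defined by the helper

print(round(mae, 2), round(mape, 2), round(accuracy, 2))
```

Note that "accuracy" here is simply 100 minus the mean absolute percentage error, so it weights an error on a low salary more heavily than the same dollar error on a high salary.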

In [None]:
default_rf_metrics = evaluator(default_rf, X_train, y_train)

Accuracy: 93.61%
Mean Absolute Error: AUD 5253.18
These results have been saved in a list


The initial performance of the model appears very strong, but it is caveated by potential overfitting: the figures above are computed on the training set, there is no cross-validation, and the max_depth field is unconstrained.

In [None]:
feature_importance = pd.Series(data = default_rf.feature_importances_, index = X_train.columns)
feature_importance.sort_values(ascending = False)[0:10]

desc_collaborate               0.198491
title_analyst                  0.098962
desc_leave                     0.071788
desc_scientific                0.060944
desc_strait islander           0.046349
desc_vision                    0.035617
desc_scientist                 0.033894
company_health                 0.026475
desc_torres strait islander    0.018457
desc_operational               0.013687
dtype: float64

The variable importance for this model produced a surprising result: the "collaborate" feature from the job description is roughly twice as important as the next most important feature (0.198 against 0.099 for "title_analyst").
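
Impurity-based importances from scikit-learn are normalised to sum to one, so each value can be read as a share of the total and the gap between two features is a simple ratio. A minimal sketch on synthetic data follows; the data and the feature names are hypothetical, not the salary features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with two informative features out of six (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2, random_state=0)
columns = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importance = pd.Series(forest.feature_importances_, index=columns)

top = importance.nlargest(3)  # equivalent to sort_values(ascending=False)[0:3]
print(top)
print("sum of all importances:", importance.sum())
print("ratio of first to second:", top.iloc[0] / top.iloc[1])
```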

### First Random Grid

The first step in improving our model is tuning the hyperparameters. The chosen tuning implementation was the Randomized Search Cross-Validation (rsCV) method from scikit-learn, in which a random subset of a parameter grid is iterated through with cross-validation. In our case, we propose a grid with 768 different parameter combinations (4 × 2 × 6 × 4 × 4 × 1). Given the provided constraints, rsCV will randomly choose 100 candidates and fit 8 folds on each, for 800 fits in total. The advantage of this method is the ability to narrow down a very wide hyperparameter search and provide a rough idea of where the best hyperparameters sit, without sacrificing the integrity of the fit. It is designed to be an intermediate tune before a more focused tune down the line.

In [None]:
n_estimators = [100, 200, 400, 800]
max_features = ['sqrt', 'auto']
max_depth = [5, 10, 20, 40, 80, 160]
min_samples_leaf = [1, 2, 4, 8]
min_samples_split = [2, 4, 8, 16]
bootstrap = [False]

In [None]:
params_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'bootstrap': bootstrap
}
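
The size of the search space is the product of the lengths of the candidate lists, and the number of model fits is the number of sampled candidates multiplied by the number of folds. A quick check using the same values as the grid above (the variable names in this sketch are illustrative):

```python
from math import prod

# The same candidate values as the grid defined above.
params_grid_demo = {
    'n_estimators': [100, 200, 400, 800],
    'max_features': ['sqrt', 'auto'],
    'max_depth': [5, 10, 20, 40, 80, 160],
    'min_samples_leaf': [1, 2, 4, 8],
    'min_samples_split': [2, 4, 8, 16],
    'bootstrap': [False],
}

# Number of distinct combinations: 4 * 2 * 6 * 4 * 4 * 1
n_combinations = prod(len(values) for values in params_grid_demo.values())

# A randomized search cannot sample more candidates than the grid contains.
n_iter, cv = 100, 8
n_fits = min(n_iter, n_combinations) * cv
print(n_combinations, n_fits)
```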

In [None]:
random_grid = rsCV(estimator = default_rf, param_distributions = params_grid, n_iter = 100, cv = 8, verbose = 2, n_jobs = -1)

In [None]:
random_grid.fit(X_train, y_train)
random_grid.best_params_

Fitting 8 folds for each of 100 candidates, totalling 800 fits


{'n_estimators': 100,
 'min_samples_split': 4,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 80,
 'bootstrap': False}
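
Besides `best_params_`, a fitted search object also exposes `best_score_`, the mean score of the best candidate over the held-out folds, which is a fairer guide to generalisation than a score computed on the training set. The sketch below uses synthetic data and a deliberately small grid so that it runs quickly; the data, the grid, the MAE scoring choice and the `random_state` values are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a small grid keep this sketch quick to run (assumptions).
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 6], 'min_samples_leaf': [1, 4]}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=4,
    cv=4,
    scoring='neg_mean_absolute_error',  # sklearn maximises scores, so MAE is negated
    random_state=0,                     # fixes which candidates are sampled
)
search.fit(X_demo, y_demo)

cv_mae = -search.best_score_                             # mean MAE over held-out folds
train_mae = abs(search.predict(X_demo) - y_demo).mean()  # in-sample MAE of the refit model
print(search.best_params_, round(cv_mae, 2), round(train_mae, 2))
```

Setting `random_state` on the search makes the sampled candidates repeatable between runs.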

In [None]:
random_grid_metrics = evaluator(random_grid, X_train, y_train)

Accuracy: 98.91%
Mean Absolute Error: AUD 991.73
These results have been saved in a list


Whilst this result appears to be only a slight improvement, there is an increased sense of confidence that this is a robust model, as the issues in the default model have been mitigated. Whilst max_depth took on a high value (80, the second-highest available), it is now constrained. Furthermore, the 8-fold cross-validation used to select the hyperparameters lends weight to the view that this model is not overfit, although the figures above are still computed on the training set.

### Refined Random Grid

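Using the best hyperparameters from the random grid to govern the selection, a new set of hyperparameters is proposed. rsCV is used to trial all 96 combinations (2 × 2 × 4 × 2 × 3 × 1), again across 8 folds. Because n_iter is set to 192, which exceeds the size of the grid, every combination is evaluated.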

In [None]:
refined_grid = {
    'n_estimators': [100, 150],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [120, 160, 240, 320],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 3, 4],
    'bootstrap': [False]
}

In [None]:
refined_grid = rsCV(estimator = default_rf, param_distributions = refined_grid, cv = 8, n_iter=192, verbose = 2, n_jobs = -1)

In [None]:
refined_grid.fit(X_train, y_train)
refined_grid.best_params_

Fitting 8 folds for each of 96 candidates, totalling 768 fits




{'n_estimators': 150,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 240,
 'bootstrap': False}

In [None]:
refined_grid_metrics =  evaluator(refined_grid, X_train, y_train)

Accuracy: 100.0%
Mean Absolute Error: AUD 0.0
These results have been saved in a list


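This is certainly a staggering result, and the refinement offers a drastic improvement over the first tuning pass. Comparing the best parameters of the two searches, n_estimators increased from 100 to 150, min_samples_split decreased from 4 to 2 and max_depth increased from 80 to 240, while max_features, min_samples_leaf and bootstrap were unchanged. Because this score is computed on the training set, and with bootstrap = False, min_samples_leaf = 1 and min_samples_split = 2 each tree is free to fit the training observations exactly, it should be read alongside the test-set result reported under Model Testing.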

In [None]:
feature_importance = pd.Series(data = refined_grid.best_estimator_.feature_importances_, index = X_train.columns)
feature_importance.sort_values(ascending = False)

desc_leave            0.052052
desc_collaborate      0.042866
title_analyst         0.035644
desc_scientist        0.031964
desc_similar          0.020918
                        ...   
company_murdoch       0.000000
title_graduate        0.000000
company_rmit          0.000000
company_department    0.000000
company_university    0.000000
Length: 309, dtype: float64

### Model Testing 

In [None]:
test_metrics = evaluator(refined_grid, X_test, y_test)

Accuracy: 89.74%
Mean Absolute Error: AUD 10427.12
These results have been saved in a list
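
The gap between the training score and the test score is what one would expect from trees that are allowed to grow until every leaf holds a single observation without bootstrapping. The following sketch illustrates the effect on synthetic data; the data set, the split seed and the number of trees are assumptions for illustration and do not reproduce the salary results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression data (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same style of settings as the refined search selected: no bootstrapping, unrestricted leaves.
memorising_rf = RandomForestRegressor(
    n_estimators=20, max_features='sqrt', bootstrap=False,
    min_samples_leaf=1, min_samples_split=2, random_state=0,
).fit(X_tr, y_tr)

# Every tree sees every training row and isolates it in its own leaf,
# so the training error collapses while the test error does not.
train_mae = np.mean(np.abs(memorising_rf.predict(X_tr) - y_tr))
test_mae = np.mean(np.abs(memorising_rf.predict(X_te) - y_te))
print(round(train_mae, 6), round(test_mae, 2))
```

For this reason the held-out test figures, rather than the training figures, are the ones that describe how the model behaves on unseen job listings.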


## 4.1.3 A note on computation

During the proposal for this work, sources of GPU compute were discussed for the machine learning component. The decision to use a random forest regressor from sklearn eliminated the need for a GPU, as this implementation runs on the CPU only.

# 4.2 Evaluation