In [1]:
import datetime
import math

import numpy as np
import pandas as pd
from dateutil.parser import parse

# Limit the size of displayed DataFrames
pd.set_option("display.max_rows", 15)
pd.set_option("display.max_columns", 15)

from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

In [2]:
# read df pickle
df_alg = pd.read_pickle("objects/df_alg")
df_cons = pd.read_pickle("objects/df_cons4")

# Modelling and Preliminary Results

---

Because specific algorithms were needed, the modelling was carried out in Weka. Two selected models and their evaluation scores are shown below.

### Decision Tree Classifier: J48 (C4.5) 

![J48](Weka_results/Seminar2/tree.J48-C0.25-M4-10F.png)

![DecisionTree](Weka_results/Seminar2/tree1.png)

![tree](Weka_results/Seminar2/tree2.png)

### Ensemble Classifier: Random Forest

![randomForest](Weka_results/Seminar2/RandomForest1.png)

![randomForest](Weka_results/Seminar2/RandomForest2.png)

# Scikit-learn Analysis

In [3]:
# Change date to month only
df_cons["date"] = df_alg["date"].dt.month_name()

# Create a DataFrame for scikit-learn: drop sampling station, sampling depth and sampling method
df_cons_SL = df_cons.drop(columns=["sampling station", "sampling depth", "sampling method"])
df_cons_SL

Unnamed: 0,date,PSP,DSP,DSP_like,ASP,Dinophysis caudata,Dinophysis fortii,...,NO3-N,PO4-P,SiO3-Si,O2,pH,Soca,lipophylic_toxins
0,May,1206.0,68.0,,,27.0,0.0,...,6.73,0.15,4.91,230.449997,8.00,1471.231,
3,May,4188.0,17.0,,,8.0,0.0,...,1.12,0.12,3.02,250.100006,8.15,1471.231,
6,June,0.0,27.0,,,16.0,3.0,...,1.23,0.22,1.82,274.209991,8.15,3304.045,
7,June,324.0,23.0,,,3.0,0.0,...,2.52,0.09,0.83,275.109985,8.22,2263.119,
8,June,0.0,20.0,,,8.0,0.0,...,,,,,,1873.218,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1519,December,0.0,20.0,0.0,2500.0,0.0,10.0,...,4.69,0.10,9.21,212.506997,8.17,4071.444,neg
1517,December,0.0,0.0,0.0,2100.0,0.0,0.0,...,2.34,0.01,5.26,217.990202,8.17,4071.444,neg
1522,January,20.0,40.0,0.0,200.0,10.0,20.0,...,2.73,0.07,6.58,237.135858,8.17,1709.376,
1521,January,0.0,0.0,0.0,2100.0,0.0,0.0,...,1.26,0.04,3.68,228.221284,8.19,1709.376,


In [4]:
# Preprocessing for NN in scikit_learn
# one-hot encoding of date feature
df_cons_SL = pd.get_dummies(df_cons_SL, columns=["date"])

# Remove missing values
df_cons_SL = df_cons_SL.dropna(how="any").copy()

# Class distribution
df_cons_SL["lipophylic_toxins"].value_counts()

neg    232
poz     22
Name: lipophylic_toxins, dtype: int64

In [5]:
# [Continuation...] Preprocessing for NN in scikit_learn
X = df_cons_SL.copy().drop("lipophylic_toxins", axis=1)
y = df_cons_SL["lipophylic_toxins"]

# sklearn label encoding
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y)
print(f"Class labels pre-transform: {list(le.classes_)}")
y = le.transform(y)
print(f"Class labels post-transform: {np.unique(y)}")

# scaling numeric values
from sklearn.preprocessing import StandardScaler
scaled_array = StandardScaler().fit_transform(X)
X = pd.DataFrame(scaled_array, columns=X.columns)

Class labels pre-transform: ['neg', 'poz']
Class labels post-transform: [0 1]


In [6]:
# Address class imbalance with SMOTE (Synthetic Minority Oversampling Technique) combined with random undersampling
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# summarize class distribution
counter = Counter(y)
print('Original dataset shape: %s' % counter)

# define SMOTE pipeline (oversampling instances with poz (minority class) and undersampling those with neg label)
over = SMOTE(sampling_strategy=0.5)
under = RandomUnderSampler(sampling_strategy=0.8)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

# transform the dataset
X_res, y_res = pipeline.fit_resample(X, y)

# summarize the new class distribution
counter = Counter(y_res)
print('Resampled dataset shape: %s' % counter)

ModuleNotFoundError: No module named 'imblearn'
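
As a rough sanity check of the two `sampling_strategy` values, the class counts that the pipeline should produce can be worked out by hand. The sketch below assumes imbalanced-learn's documented meaning of a float `sampling_strategy` in the binary case (the target minority-to-majority ratio after the respective step). The helper name `expected_counts` is hypothetical; it only does the arithmetic and does not call imblearn.

```python
def expected_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Estimate class counts after SMOTE followed by random undersampling.

    Assumption: a float sampling_strategy is the target ratio
    n_minority / n_majority after the respective resampling step.
    """
    # SMOTE adds synthetic minority samples until minority / majority == over_ratio
    n_minority_after = int(n_majority * over_ratio)
    # Undersampling drops majority samples until minority / majority == under_ratio
    n_majority_after = int(n_minority_after / under_ratio)
    return n_majority_after, n_minority_after


# Class distribution from the value_counts() output above: 232 neg, 22 poz
n_neg, n_poz = expected_counts(232, 22, over_ratio=0.5, under_ratio=0.8)
print(f"neg: {n_neg}, poz: {n_poz}")
```

Inside a cross-validation training fold the absolute numbers are smaller, but the same two ratios apply.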

## Model Training and Evaluation

### Random Forest Model

#### Model evaluation (Random Forest)

In [None]:
# Model evaluation with the pipeline of SMOTE oversampling and undersampling on the training dataset only (within each cross-validation fold)!
# Evaluate k (SMOTE) parameter. 

from numpy import mean
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Cross-validation of model with ROC AUC with SMOTE pipeline  


# Find best performing k-value for SMOTE
k_values = list(range(1,11))
RF_auc_best_k = (None, 0)  # (best k, best mean score); `_` is only defined inside IPython
for k in k_values:   
    # define pipeline
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
    clf = RandomForestClassifier(n_estimators=100, max_depth=None, max_features=None)
    over = SMOTE(sampling_strategy=0.5, k_neighbors=k)
    under = RandomUnderSampler(sampling_strategy=0.8)
    steps = [('over', over), ('under', under), ('clf', clf)]
    pipeline = Pipeline(steps=steps)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, score))
    if score > RF_auc_best_k[1]:
        RF_auc_best_k = (k, score)

print(f">>Best k-value: k={RF_auc_best_k[0]} with Mean ROC AUC on resampled dataset: {round(RF_auc_best_k[1], 2)}")  

# Cross-validation of model with ROC AUC without SMOTE pipeline  
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
RF_auc_score = mean(scores)
print(f">>Mean ROC AUC on unsampled dataset: {round(RF_auc_score, 2)}\n")

# Cross-validation of model with Recall with SMOTE pipeline
# Find best performing k-value for SMOTE
RF_recall_best_k = (None, 0)
for k in k_values:
    # rebuild the pipeline so that SMOTE really uses the current k
    over = SMOTE(sampling_strategy=0.5, k_neighbors=k)
    pipeline = Pipeline(steps=[('over', over), ('under', under), ('clf', clf)])
    scores = cross_val_score(pipeline, X, y, scoring='recall', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean Recall: %.3f' % (k, score))
    if score > RF_recall_best_k[1]:
        RF_recall_best_k = (k, score)

print(f">>Best k-value: k={RF_recall_best_k[0]} with Mean Recall on resampled dataset: {round(RF_recall_best_k[1], 2)}")

# Cross-validation of model with Recall without SMOTE pipeline  
scores = cross_val_score(clf, X, y, scoring='recall', cv=cv, n_jobs=-1)
RF_recall_score = mean(scores)
print(f">>Mean Recall on unsampled dataset: {round(RF_recall_score, 2)}")

#### Feature importance (Random Forest)

In [None]:
# Feature importance of model (RandomForest) with three methods!
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2)
plt.subplots_adjust(wspace=1.3)

# Split data and fit model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=100, max_depth=None, max_features=None)
rf.fit(X_train, y_train)

# Get feature importance with Gini importance (mean decreased impurity)
# print(rf.feature_importances_)
gini_sorted_idx = rf.feature_importances_.argsort()
x1 = X.columns[gini_sorted_idx]
y1 = rf.feature_importances_[gini_sorted_idx]
ax1.barh(x1, y1)
ax1.set_title("Gini Feature Importance")

# Get feature importance with permutation-based feature importance: each feature is randomly shuffled and the
# change in the model's performance is measured. The features whose shuffling degrades performance the most are the most important.
perm_importance = permutation_importance(rf, X_test, y_test)
perm_sorted_idx = perm_importance.importances_mean.argsort()
x2 = X.columns[perm_sorted_idx]
y2 = perm_importance.importances_mean[perm_sorted_idx]
ax2.barh(x2, y2)
ax2.set_title("Permutation Importance Random Forest")
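
To make the idea behind permutation importance concrete, here is a minimal hand-rolled sketch. It is not scikit-learn's implementation: the function name, the toy data and the stand-in "model" are all hypothetical, and accuracy is assumed as the performance measure.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling column j breaks its link with the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances[j] = np.mean(drops)
    return importances


# Toy data (hypothetical): the label depends on column 0 only
toy_rng = np.random.default_rng(1)
X_toy = toy_rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)


def toy_predict(X):
    """Stand-in for a fitted model: thresholds column 0."""
    return (X[:, 0] > 0).astype(int)


toy_importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(toy_importances)
```

Only the first column should receive a clearly positive score, because shuffling the other two columns cannot change the predictions of this stand-in model.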

In [None]:
# Get feature importance with SHAP
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
RF_shap = shap.summary_plot(shap_values, X_test, plot_type="bar")

### Neural Network Model

#### Model Evaluation (MLP)

In [None]:
# Model evaluation with the pipeline of SMOTE oversampling and undersampling on the training dataset only (within each cross-validation fold)!
# Evaluate k (SMOTE) parameter. 
from sklearn.neural_network import MLPClassifier

# Cross-validation of model with ROC AUC with SMOTE pipeline
# define the classifier, the undersampler and the cross-validation scheme
clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(20,10), max_iter=3000, random_state=1)
under = RandomUnderSampler(sampling_strategy=0.8)
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)

# Find best performing k-value for SMOTE
k_values = list(range(1,11))
MLP_auc_best_k = (None, 0)
for k in k_values:
    # build the pipeline inside the loop so that SMOTE really uses the current k
    over = SMOTE(sampling_strategy=0.5, k_neighbors=k)
    pipeline = Pipeline(steps=[('over', over), ('under', under), ('clf', clf)])
    # evaluate pipeline
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, score))
    if score > MLP_auc_best_k[1]:
        MLP_auc_best_k = (k, score)

print(f">>Best k: k={MLP_auc_best_k[0]} with Mean ROC AUC on resampled dataset: {round(MLP_auc_best_k[1], 2)}")

# Cross-validation of model with ROC AUC without SMOTE pipeline  
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
MLP_auc_score = mean(scores)
print(f">>Mean ROC AUC on unsampled dataset: {round(MLP_auc_score, 2)}\n")

# Cross-validation of model with Recall with SMOTE pipeline
# Find best performing k-value for SMOTE
MLP_recall_best_k = (None, 0)
for k in k_values:
    # rebuild the pipeline so that SMOTE really uses the current k
    over = SMOTE(sampling_strategy=0.5, k_neighbors=k)
    pipeline = Pipeline(steps=[('over', over), ('under', under), ('clf', clf)])
    # evaluate pipeline
    scores = cross_val_score(pipeline, X, y, scoring='recall', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean Recall: %.3f' % (k, score))
    if score > MLP_recall_best_k[1]:
        MLP_recall_best_k = (k, score)

print(f">>Best k: k={MLP_recall_best_k[0]} with Mean Recall on resampled dataset: {round(MLP_recall_best_k[1], 2)}")

# Cross-validation of model with Recall without SMOTE pipeline  
scores = cross_val_score(clf, X, y, scoring='recall', cv=cv, n_jobs=-1)
MLP_recall_score = mean(scores)
print(f">>Mean Recall on unsampled dataset: {round(MLP_recall_score, 2)}")

In [None]:
# Grid search over the MLP hidden-layer sizes
from sklearn.model_selection import GridSearchCV

NN = MLPClassifier(solver='lbfgs', max_iter=3000)
parameters = {"hidden_layer_sizes": [(3,), (3, 3), (3, 3, 3)]}
gs = GridSearchCV(NN, parameters, scoring="roc_auc", cv=3) # "recall"
gs.fit(X_res, y_res)

In [None]:
gs.cv_results_

#### Feature Importance (MLP)

In [None]:
# Feature importance of the model (MLP) with permutation importance (no cross-validation!)
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from matplotlib import pyplot as plt
fig, ax2 = plt.subplots(1, 1)

# Split data and fit model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
MLP = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(20,10), max_iter=3000, random_state=1)
MLP.fit(X_train, y_train)

# Get feature importance with permutation-based feature importance: each feature is randomly shuffled and the
# change in the model's performance is measured. The features whose shuffling degrades performance the most are the most important.
perm_importance = permutation_importance(MLP, X_test, y_test)
perm_sorted_idx = perm_importance.importances_mean.argsort()
x2 = X.columns[perm_sorted_idx]
y2 = perm_importance.importances_mean[perm_sorted_idx]
ax2.barh(x2, y2)
ax2.set_title("Permutation Importance MLP")

In [None]:
# explain the model's predictions using SHAP
import shap
# explainer = shap.KernelExplainer(clf.predict_proba, X_train)
explainer = shap.KernelExplainer(MLP.predict_proba, shap.sample(pd.DataFrame(X_train, columns=X.columns), 100))
shap_values = explainer.shap_values(X_test)

In [None]:
# Summarise the SHAP values over all test instances (a Shapley value is a feature's average marginal contribution to a prediction, taken over all possible feature coalitions).
shap.summary_plot(shap_values, X_test)
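
For intuition about what a Shapley value is, the sketch below computes exact Shapley values for a tiny two-feature toy "model" by averaging each feature's marginal contribution over all feature orderings. The feature names `A` and `B` and the output values are hypothetical. This is the textbook definition; to my understanding `KernelExplainer` approximates such values by sampling coalitions rather than enumerating them.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical model output for every subset of "known" features
toy_output = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.2,
    frozenset({"A", "B"}): 1.0,
}

phi = shapley_values(["A", "B"], toy_output.__getitem__)
print(phi)
```

The values add up to the difference between the full output and the empty-coalition output (1.0 here), which is the additivity property the SHAP plots rely on.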

### Conclusion

In [None]:
# Summary table of prediction results
RF_recall = round(RF_recall_best_k[1], 2)
RF_auc = round(RF_auc_best_k[1], 2)
MLP_recall = round(MLP_recall_best_k[1], 2)
MLP_auc = round(MLP_auc_best_k[1], 2)

summary = pd.DataFrame(
    [
        (
            "RF",
            RF_recall_score,
            RF_auc_score,
        ),
        (
            "MLP",
            MLP_recall_score,
            MLP_auc_score,
        ),
        (
            "RF (smote)",
            RF_recall,
            RF_auc,
        ),
        (
            "MLP (smote)",
            MLP_recall,
            MLP_auc,
        ),
        (
            "Decision tree (J48)*",
            0.56,
            0.18,
        ),
    ],
    columns=("Model", "Recall", "ROC AUC"),
).set_index("Model")

print("Table summarising the prediction results of the used classifiers, both with and without SMOTE resampling:\n")
summary.round(2)

As can be seen, resampling with SMOTE improved the results substantially, especially recall. The highest recall and the highest ROC AUC were both achieved by Random Forest on the resampled data, so both metrics suggest Random Forest as the better classifier for this particular problem. Recall is a crucial metric because it indicates what fraction of the truly positive instances the model detected. Since the models predict toxins in shellfish (food), it is crucial that as few positives as possible are missed.
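
To illustrate why recall matters more than accuracy here, the following minimal sketch computes both on a small set of hypothetical labels (1 = poz, 0 = neg). The labels are invented for illustration and are not results from this notebook.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive instances that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


def accuracy(y_true, y_pred):
    """Fraction of all instances that were predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: 4 toxic samples, 6 non-toxic samples
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

print(f"recall:   {recall(y_true_toy, y_pred_toy):.2f}")    # 2 of 4 positives found
print(f"accuracy: {accuracy(y_true_toy, y_pred_toy):.2f}")  # 7 of 10 correct
```

An accuracy of 0.70 looks acceptable, yet half of the toxic samples are missed, which recall exposes directly.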

Because SMOTE resampling (oversampling and undersampling) was combined with cross-validation, it was crucial to do the resampling within each fold, on the training part only, to avoid data leakage and to validate on the original (unsampled) data. In addition, I optimised the models with regard to the k-value of SMOTE, all of which added some complexity. For the hyperparameters of Random Forest and MLP, various settings were tried and the best-performing ones were chosen.
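
The leakage problem can be shown with a small, self-contained sketch. It uses plain duplication as a crude stand-in for SMOTE (SMOTE interpolates new points instead of copying, but synthetic points derived from samples that later land in the test set leak in the same way). The sample ids and helper names are hypothetical.

```python
import random


def oversample(rows, copies):
    """Duplicate every row `copies` times (a crude stand-in for SMOTE)."""
    return [row for row in rows for _ in range(copies)]


def split(rows, test_fraction, rng):
    """Shuffle the rows and split them into (train, test)."""
    rows = list(rows)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


rng = random.Random(0)
minority_ids = [f"poz_{i}" for i in range(6)]  # hypothetical sample ids

# Wrong order: resample the whole dataset, then split
train, test = split(oversample(minority_ids, 10), 0.25, rng)
leaked_before = set(train) & set(test)

# Right order: split first, then resample the training part only
train, test = split(minority_ids, 0.25, rng)
train = oversample(train, 10)
leaked_after = set(train) & set(test)

print(f"samples in both train and test, resample-then-split: {len(leaked_before)}")
print(f"samples in both train and test, split-then-resample: {len(leaked_after)}")
```

Placing the resampling steps inside the pipeline that is passed to `cross_val_score` gives the second behaviour, because the samplers are then applied only when the pipeline is fitted on each training fold.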

The J48 decision tree was run in Weka on a slightly different dataset (missing values were not removed in order to use as many instances as possible, and 10-fold rather than 3-fold cross-validation was used because the dataset was larger, etc.). These results are therefore not directly comparable, but they are provided as a reference to indicate the performance of this algorithm.

As can be seen in the feature importance bar plots above, similar features ranked at the top despite the use of two different classification algorithms and two different feature ranking methods. Considering just the three highest-ranking features of each feature ranking method for both algorithms (RF and MLP), the features that overlap are DSP, DSP_like, ASP, Dinophysis fortii and Dinophysis caudata. These can be shown to the domain experts for validation and interpretation.
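
One possible way to extract such an overlap programmatically is sketched below. The feature names `f1` to `f5` and all scores are hypothetical placeholders, not values from this notebook, and "overlap" is read here as "appears in the top three of at least two rankings", which is only one of several reasonable definitions.

```python
from collections import Counter


def top_k(importances, k=3):
    """Names of the k features with the largest importance score."""
    return set(sorted(importances, key=importances.get, reverse=True)[:k])


# Hypothetical scores, for illustration only
rankings = {
    "RF Gini": {"f1": 0.30, "f2": 0.25, "f3": 0.05, "f4": 0.20, "f5": 0.10},
    "RF permutation": {"f1": 0.12, "f2": 0.02, "f3": 0.01, "f4": 0.08, "f5": 0.09},
    "MLP permutation": {"f1": 0.10, "f2": 0.11, "f3": 0.09, "f4": 0.02, "f5": 0.01},
}

# Count in how many rankings each feature reaches the top three
counts = Counter(name for scores in rankings.values() for name in top_k(scores))
shared = sorted(name for name, n in counts.items() if n >= 2)
print(shared)
```

In the notebook, the dictionaries could be built from `rf.feature_importances_` and `perm_importance.importances_mean` together with `X.columns`.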