# Building and Assessing ML Models

## 1. Import needed modules

In [None]:
using Downloads
using CSV
using MLJ
using DataFrames
using GLM
using ROCCurves
using FreqTables
using StatsPlots
using Statistics  # provides mean, used in the gain and lift charts

## 2. Data loading, pre-processing and splitting into train and test subsets

### Data loading & pre-processing

In [None]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/australian/australian.dat"
Downloads.download(url, "australian.csv")

In [None]:
dataset = CSV.read("australian.csv", DataFrame, delim=' ';
                header = append!([string("V", i) for i in 0:13], ["class"]))

In [None]:
dataset.V3 = ifelse.(dataset.V3 .== 1, 0, 1)
dataset.V11 = ifelse.(dataset.V11 .== 1, 0, 1)
dataset.V13 = log.(dataset.V13) 
dataset

In [None]:
training_fraction = 0.6
train, test = partition(eachindex(dataset.class), training_fraction, shuffle=true)

In [None]:
train_dataset = dataset[train,:]
test_dataset = dataset[test,:]

In [None]:
size.([train_dataset, test_dataset])

## 3. Building logistic regression model
[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)
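
As a reminder, logistic regression models the probability of the positive class as the logistic function of a linear combination of the features. The notation below ($\beta_0$ for the intercept, $\beta_j$ for the coefficient of feature $x_j$, $k$ for the number of features) is an assumption introduced here for illustration:

$$
P(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)}}
$$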

In [None]:
logistic(x) = 1 / (1 + exp(-x))
x = collect(-6:.1:6)
plot(x, logistic.(x), legend=false)
vline!([0], line=:dash, color=:grey)
hline!([1/2], line=:dash, color=:grey)

In [None]:
model_log_reg_fit = glm(Term(:class) ~ sum(Term.(Symbol.(names(dataset[:, Not(:class)])))),
                        train_dataset, Binomial())

Model coefficients:

In [None]:
coef(model_log_reg_fit)

### Prediction

In [None]:
train_pred = GLM.predict(model_log_reg_fit)

In [None]:
test_pred = GLM.predict(model_log_reg_fit, test_dataset)

In [None]:
ideal_pred = test_dataset.class

In [None]:
random_pred = rand(length(test_dataset.class))

## 4. Assessing model performance


### Descriptive analysis - confusion matrix and related metrics

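❗ Remember that the class coding (0, 1, ...) and the placement of actual versus predicted values on rows and columns differ between sources, so always check the orientation of a confusion matrix before reading it. In this notebook, rows hold predicted classes and columns hold actual classes.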

<img src="https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png" width=400>

<img src="https://miro.medium.com/max/1780/1*LQ1YMKBlbDhH9K6Ujz8QTw.jpeg"  width=400>
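
To make the orientation concrete, here is a minimal sketch that builds a confusion matrix by hand with rows for predicted classes and columns for actual classes, assuming a 0.5 cut-off. The scores and labels are made up for illustration and are not the notebook's data.

```julia
# Toy scores and labels, made up for illustration (not the notebook's data)
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
actual = [1, 1, 1, 0, 0, 0]

predicted = Int.(scores .> 0.5)   # apply a 0.5 cut-off

# Rows: predicted class (0, 1); columns: actual class (0, 1)
conf = [count((predicted .== p) .& (actual .== a)) for p in 0:1, a in 0:1]

TN, FN = conf[1, 1], conf[1, 2]   # first row: predicted negative
FP, TP = conf[2, 1], conf[2, 2]   # second row: predicted positive
```

With this layout the diagonal holds the correct predictions, so swapping rows and columns leaves accuracy unchanged but exchanges precision and recall.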

**Performance measures derived from confusion matrix:**

- Accuracy - percentage of correct predictions

`ACC = (TP + TN)/(TP + FP + TN + FN)`

- Precision - percentage of positive predictions which were actually correct

`PREC = TP / (TP + FP)`

-  Recall - what percentage of actual positives were predicted correctly
 (Recall = Sensitivity = Hit rate = True Positive Rate (TPR))
 
`REC = TP / (TP + FN)`

- Specificity - what percentage of actual negatives were predicted correctly (Specificity = True Negative Rate)

`TNR = TN / (TN + FP)`

- F1 Score - the harmonic mean of precision and recall (also called the balanced F-score)

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/1bf179c30b00db201ce1895d88fe2915d58e6bfd)
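
The formulas above can be checked on a small worked example. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow; F1 is computed as `2 * PREC * REC / (PREC + REC)`.

```julia
# Hypothetical counts, chosen only to illustrate the formulas
TP, FP, TN, FN = 40, 10, 45, 5

acc  = (TP + TN) / (TP + FP + TN + FN)   # 85 / 100
prec = TP / (TP + FP)                    # 40 / 50
rec  = TP / (TP + FN)                    # 40 / 45
tnr  = TN / (TN + FP)                    # 45 / 55
f1   = 2 * prec * rec / (prec + rec)     # equals 2TP / (2TP + FP + FN) = 80 / 95
```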

Calculating confusion matrices for predictions on the train and test data, as well as for the random and wizard (ideal) models, using a 0.5 cut-off threshold:

In [None]:
conf_mat_train = freqtable(train_pred .> 0.5, train_dataset.class)

In [None]:
conf_mat_test = freqtable(test_pred .> 0.5, test_dataset.class)

In [None]:
conf_mat_ideal = freqtable(ideal_pred .> 0.5, test_dataset.class)

In [None]:
conf_mat_random = freqtable(random_pred .> 0.5, test_dataset.class)

Writing a function to calculate accuracy, precision, recall and f1-score:

In [None]:
function quality_report(mat::AbstractMatrix)
    # mat is oriented as built above: rows = predicted (false, true), columns = actual (0, 1)
    acc = (mat[1,1] + mat[2,2]) / sum(mat)   # (TN + TP) / all observations
    prec = mat[2,2] / sum(mat[2,:])          # TP / (TP + FP)
    rec = mat[2,2] / sum(mat[:,2])           # TP / (TP + FN)
    f1 = 2 * prec * rec / (prec + rec)       # harmonic mean of precision and recall
    println("----Classification quality report----")
    println("Accuracy: ", round(acc*100,digits=2), "%")
    println("Precision: ", round(prec*100,digits=2), "%")
    println("Recall: ", round(rec*100,digits=2), "%")
    println("F1-score: ", round(f1*100,digits=2), "%")
end

Printing the reports for both datasets, as well as for the wizard and random models. The results are consistent with what the ROC curves show later in this notebook:

In [None]:
println("Train set")
quality_report(conf_mat_train)
println("\nTest set")
quality_report(conf_mat_test)
println("\nWizard model:")
quality_report(conf_mat_ideal)
println("\nRandom model:")
quality_report(conf_mat_random)

### Visual analysis of a model

#### ROC curve

Calculating and plotting ROC curves for both training and test datasets as well as for wizard and random models:

In [None]:
train_fpr, train_tpr = ROCCurves.roc(train_pred, train_dataset.class)
test_fpr, test_tpr = ROCCurves.roc(test_pred, test_dataset.class)
ideal_fpr, ideal_tpr = ROCCurves.roc(ideal_pred, test_dataset.class)
random_fpr, random_tpr = ROCCurves.roc(random_pred, dataset.class[test])

plot(test_fpr, test_tpr, label="Test", xlabel="False Positive Rate (FPR)",
    ylabel="True Positive Rate (TPR)",
    title ="Receiver Operating Characteristic (ROC) curve", linewidth=2, legend=:bottomright)
plot!(train_fpr, train_tpr, label="Train", linewidth=2)
plot!(ideal_fpr, ideal_tpr, label="Wizard", linewidth=2)
plot!(random_fpr, random_tpr, label="Random", linewidth=2)
Plots.abline!(1, 0, line=:dash, label = "TPR=FPR")

The ideal model's ROC curve passes through the point $(0, 1)$, which corresponds to identifying every $Y=1$ case while making no errors. The better the model, the closer its ROC curve comes to this ideal point, and the larger the area under the curve, which is a numerical measure of model performance presented below.
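
To show how ROC points arise, here is a minimal hand-rolled sketch. `manual_roc` is a hypothetical helper, not part of the ROCCurves API, and it assumes labels coded 0/1 and no tied scores: observations are sorted by decreasing score, and each step down the list lowers the threshold by one observation.

```julia
# Hypothetical helper (not the ROCCurves API): one ROC point per threshold,
# assuming labels are coded 0/1 and there are no tied scores.
function manual_roc(scores, labels)
    order = sortperm(scores, rev=true)   # highest score first
    sorted = labels[order]
    # Cumulative share of positives / negatives captured so far
    tpr = [0.0; cumsum(sorted .== 1) ./ count(==(1), labels)]
    fpr = [0.0; cumsum(sorted .== 0) ./ count(==(0), labels)]
    return fpr, tpr
end

# Toy data, made up for illustration
fpr, tpr = manual_roc([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
```

Each positive observation moves the curve up and each negative one moves it to the right, so a model that ranks all positives first climbs straight to $(0, 1)$.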

### AUC - Area Under Curve

Calculating AUC for the training and test data and for the wizard and random models:

In [None]:
println("AUC metric on train dataset is equal to: ", auc_roc(train_fpr, train_tpr))
println("AUC metric on test dataset is equal to: ", auc_roc(test_fpr, test_tpr))
println("AUC metric of wizard model is equal to: ", auc_roc(ideal_fpr, ideal_tpr))
println("AUC metric of random model is equal to: ", auc_roc(random_fpr, random_tpr))

- a train AUC noticeably higher than the test AUC is a sign of overfitting
- the wizard model has an AUC equal to 1 and the random model an AUC of around 0.5
- simulating the ROC curve of a random model on the same number of observations as the test set gives a sense of the sampling error in the test-set ROC curve
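
The two benchmark values can be verified with a minimal trapezoidal-rule sketch. `trapezoid_auc` is a hypothetical helper written for illustration, not the `auc_roc` function used above, and it assumes the x-coordinates are sorted in non-decreasing order.

```julia
# Hypothetical helper: trapezoidal area under a curve given as points (x, y),
# with x sorted in non-decreasing order.
function trapezoid_auc(x, y)
    return sum((x[i+1] - x[i]) * (y[i] + y[i+1]) / 2 for i in 1:length(x)-1)
end

auc_wizard   = trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])   # curve through (0, 1)
auc_diagonal = trapezoid_auc([0.0, 1.0], [0.0, 1.0])             # line TPR = FPR
```

The curve through $(0, 1)$ encloses the whole unit square, while the diagonal $TPR = FPR$ encloses exactly half of it.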

### Gain chart

In [None]:
test_rpp = collect(0:length(test_tpr)-1)./(length(test_tpr)-1)
train_rpp = collect(0:length(train_tpr)-1)./(length(train_tpr)-1)

plot(test_rpp, test_tpr, label="Test", xlabel="Rate of Positive Predictions (RPP)",
     ylabel="True Positive Rate (TPR)", title= "Gain chart", linewidth=2, legend=:bottomright)
plot!(train_rpp, train_tpr, label="Train", linewidth=2)
plot!(test_rpp, ideal_tpr, label="Wizard", linewidth=2)
plot!(test_rpp, random_tpr, label="Random", linewidth=2)
vline!([mean(dataset.class[test])] ,line=:dash, label = "P(Y=1)")

- the wizard model's gain curve rises linearly and reaches its maximum of $TPR = 1$ at $RPP = P(Y=1)$
- the random model's gain curve lies close to the 45-degree line
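
These two benchmark curves can be written in closed form. Writing $p = P(Y=1)$ (a shorthand introduced here), and treating the random model in expectation:

$$
TPR_{wizard}(RPP) = \min\left(\frac{RPP}{p},\ 1\right), \qquad TPR_{random}(RPP) = RPP
$$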

### Lift chart

In [None]:
plot(test_rpp, test_tpr ./ test_rpp, label="Test", xlabel="Rate of Positive Predictions (RPP)",
     ylabel="LIFT = TPR / RPP", title= "Lift chart", linewidth=2, legend=:topright,
     ylim = (1, 0.25+1/mean(dataset.class[test])))
plot!(train_rpp, train_tpr ./ train_rpp, label="Train", linewidth=2)
plot!(test_rpp, ideal_tpr ./ test_rpp, label="Wizard", linewidth=2)
plot!(test_rpp, random_tpr ./ test_rpp, label="Random", linewidth=2)
Plots.abline!(0, 1/mean(dataset.class[test]),  
     line=:dash, label = "1 / P(Y=1)")

- Lift measures how many times higher the model's TPR is than that of a random model at a given RPP, e.g. LIFT = 2 means that the model identifies twice as many $Y=1$ cases as a random model.
- The wizard model's LIFT equals $1/P(Y=1)$ for $RPP < P(Y=1)$ and afterwards decreases hyperbolically towards $LIFT = 1$, which is the benchmark value for a random model.
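
In formula form, with $p = P(Y=1)$ as a shorthand introduced here and the random model treated in expectation:

$$
LIFT(RPP) = \frac{TPR(RPP)}{RPP}, \qquad LIFT_{wizard}(RPP) = \min\left(\frac{1}{p},\ \frac{1}{RPP}\right), \qquad LIFT_{random}(RPP) = 1
$$

The $1/RPP$ branch is the hyperbolic decrease: once all positives have been found, $TPR$ stays at 1 while $RPP$ keeps growing.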

#### Score-density plots
Splitting the predicted scores on the test dataset by actual class:

In [None]:
test_pred_1 = test_pred[test_dataset.class .== 1]
test_pred_0 = test_pred[test_dataset.class .== 0];

Visualizing the model's scores on a histogram with two series, one for each value of the `class` column:

In [None]:
histogram(test_pred_1, normalize=true, bins=10, label=1)
density!(test_pred_1, label=1, linewidth=2)
histogram!(test_pred_0, normalize=true, bins=10, label=0, seriesalpha=0.7)
density!(test_pred_0, label=0, linewidth=2)


The less the two distributions overlap, the better the predictive model.

#### Wizard model - ideal predictions

In [None]:
ideal_pred_1 = ideal_pred[test_dataset.class .== 1]
ideal_pred_0 = ideal_pred[test_dataset.class .== 0];

In [None]:
histogram(ideal_pred_1.-0.5, normalize=true, bins=10, label=1)
density!(ideal_pred_1, label=1, bandwidth=.2, linewidth=2)
histogram!(ideal_pred_0.-0.5, normalize=true, bins=10, label=0, seriesalpha=0.7)
density!(ideal_pred_0, label=0, bandwidth=0.2, linewidth=2)


Non-overlapping scores for $Y=1$ and $Y=0$ are equivalent to a perfect model. The slight overlap of the kernel densities above results from the large bandwidth hyperparameter used in the kernel density estimation procedure; empirically there is no overlap.

#### Random model

In [None]:
random_pred_1 = random_pred[test_dataset.class .== 1]
random_pred_0 = random_pred[test_dataset.class .== 0];

In [None]:
histogram(random_pred_1, normalize=true, bins=10, label=1)
density!(random_pred_1, label=1, linewidth=2)
histogram!(random_pred_0, normalize=true, bins=10, label=0, seriesalpha=0.7)
density!(random_pred_0, label=0, linewidth=2)

The two distributions overlap heavily and are difficult to distinguish. This is a sign of very poor predictive performance.

*Preparation of this workshop has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*

![SGH & NAWA](../logo.png)