# Simple binary classification

In this notebook we study a binary classification problem by applying several techniques to a sonar-based object detection dataset.

## Problem statement
The Sonar Dataset involves predicting whether an object is a mine or a rock given the
strength of sonar returns at different angles. It is a binary (2-class) classification problem. The
classes are not perfectly balanced (111 mines and 97 rocks). There are 208 observations with 60 input
variables and 1 output variable.

The file "sonar.mines" contains 111 patterns obtained by bouncing sonar
signals off a metal cylinder at various angles and under various
conditions.  The file "sonar.rocks" contains 97 patterns obtained from
rocks under similar conditions.  The transmitted sonar signal is a
frequency-modulated chirp, rising in frequency.  The data set contains
signals obtained from a variety of different aspect angles, spanning 90
degrees for the cylinder and 180 degrees for the rock.

## Code requirements

In [99]:
using DataFrames
using DataFramesMeta # DataFrame manipulation macros
using CSV
using StatsBase
using GLM # Regressions
using MLJ # Knn
using LIBSVM #SVM
using Printf
using NearestNeighborModels # Knn
using MLJBase #Knn
using DecisionTree

# Step 1

The data set is processed and divided into two parts:
- A training section consisting of 70 random samples of objects classified as mines and 70 more of objects classified as rocks.
- A testing section containing the remaining elements of the dataset.
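The balanced split described above can be sketched on index vectors alone, using only the standard library. This is a minimal illustration and not the implementation used in this notebook: the helper name `balanced_split` and the toy label vector are hypothetical, chosen to mirror the class counts of the Sonar data.

```julia
using Random

# Hypothetical helper: pick n random row indices per class for one part
# and return all remaining row indices as the other part.
function balanced_split(labels::AbstractVector, n::Integer; rng=Random.default_rng())
    picked = Int[]
    for class in unique(labels)
        idx = findall(==(class), labels)
        n <= length(idx) || error("class $class has only $(length(idx)) observations")
        append!(picked, shuffle(rng, idx)[1:n])
    end
    rest = setdiff(1:length(labels), picked)
    return sort(picked), rest
end

# Toy labels with the same class counts as the Sonar data (111 mines, 97 rocks)
labels = vcat(fill("M", 111), fill("R", 97))
train_idx, test_idx = balanced_split(labels, 70; rng=MersenneTwister(1))
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

With 70 rows per class, the first part holds 140 rows and the second part the remaining 68 (41 mines and 27 rocks).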

In [113]:
myPath = "Data/sonar.csv"
function trainingSelection(n, path=myPath)
    
    sonarRaw = CSV.read(path, DataFrame)
    sonar = sort!(sonarRaw, [:61]) # Sort by the class column so that all "M" rows precede the "R" rows
    
    k = size(sonar,1)
    m = findfirst(isequal("R"), sonar[!,:61]) # Index of the first rock: rows 1:(m-1) are mines, rows m:k are rocks
    
    if (n > (k-m+1) || n > (m-1))
        println("There is insufficient data for a balanced sample of $n observations per class")
        return nothing
    end
    
    RowsMine = sample(1:(m-1), n, replace=false, ordered=true) # Random sample of row indices classified as M
    sonarMine = sonar[RowsMine, :] # DataFrame with the rows selected by RowsMine
    
    RowsRock = sample(m:k, n, replace=false, ordered=true) # Random sample of row indices classified as R
    sonarRock = sonar[RowsRock, :] # DataFrame with the rows selected by RowsRock
    
    testData = sonar[Not(union(RowsMine, RowsRock)),:] # DataFrame with the rows that are in neither RowsMine nor RowsRock
    
    trainingData = vcat(sonarMine, sonarRock); # Concatenation of sonarMine and sonarRock (training set)
    
    return trainingData, testData # training = 140 -> 70 and 70, test = 68 -> 41 and 27
    
end

# Split the data and recode the class labels

out = trainingSelection(70)
training = out[1]
testing = out[2]

training."f61".= replace.(training."f61", "M" => "1")
training."f61".= replace.(training."f61", "R" => "0")
training."f61" = parse.(Float64, training."f61")

testing."f61".= replace.(testing."f61", "M" => "1")
testing."f61".= replace.(testing."f61", "R" => "0")
testing."f61" = parse.(Float64, testing."f61");


Additionally, the class labels are recoded as numbers: mines become the positive class (M -> 1) and rocks the negative class (R -> 0). The resulting training and test sets are then written to two new files so that the same split can be reused in later experiments.
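As a small illustration of the recoding step, the same mapping can be written as a single broadcast. This is a sketch on a toy vector, not the notebook's own code: the `labels` vector is hypothetical, and note that this variant maps every value other than "M" to 0.0.

```julia
# Toy label column standing in for the class column of the dataset (hypothetical data)
labels = ["M", "R", "R", "M"]

# Map "M" (mine) to 1.0 and everything else (here only "R", rock) to 0.0
encoded = ifelse.(labels .== "M", 1.0, 0.0)
println(encoded)  # prints [1.0, 0.0, 0.0, 1.0]
```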

## Dump Data

In [114]:
CSV.write("Data/training.csv", training)
CSV.write("Data/test.csv", testing);

# Step 2

**Accuracy** is the metric used to evaluate the performance of the models considered below.

Accuracy is one of the standard metrics for evaluating classification models in ML. It is the fraction of predictions that the model got right. Formally, we define it as follows:

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
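The definition can be checked on a small example. The sketch below uses hypothetical label vectors (not taken from the Sonar data) and computes accuracy both from the four counts and directly from the predictions. The function name `accuracy` is introduced here for illustration only.

```julia
# Accuracy from the four confusion-matrix counts
accuracy(tp, tn, fp, fn) = (tp + tn) / (tp + tn + fp + fn)

# Accuracy computed directly from predicted and true labels
accuracy(predicted::AbstractVector, actual::AbstractVector) =
    count(predicted .== actual) / length(actual)

# Hypothetical labels: 1 = positive class, 0 = negative class
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = count((predicted .== 1) .& (actual .== 1))  # 3
tn = count((predicted .== 0) .& (actual .== 0))  # 3
fp = count((predicted .== 1) .& (actual .== 0))  # 1
fn = count((predicted .== 0) .& (actual .== 1))  # 1

println(accuracy(tp, tn, fp, fn))       # prints 0.75
println(accuracy(predicted, actual))    # prints 0.75
```

Both forms agree because TP + TN is the number of correct predictions and the denominator is the total number of predictions.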

# Step 3

With the data processed and separated into training and testing sets, we proceed to study the following algorithms applied to this case:

## Linear regression

In [122]:
fm = @formula(f61 ~ f1+f2+f3+f4+f5+f6+f7+f8+f9+f10+f11+f12+f13+f14+f15+f16+f17+f18+f19+f20+f21+f22+f23+f24+f25+f26+f27+f28+f29+f30+f31+f32+f33+f34+f35+f36+f37+f38+f39+f40+f41+f42+f43+f44+f45+f46+f47+f48+f49+f50+f51+f52+f53+f54+f55+f56+f57+f58+f59+f60)
linearRegressor = lm(fm, training)
prediction = GLM.predict(linearRegressor, testing)
prediction_class = [if x < 0.5 0 else 1 end for x in prediction]
# Compute accuracy on the test set
@printf "Accuracy: %.3f%%\n" mean((prediction_class .== testing."f61"))*100

Accuracy: 69.118%


## Logistic regression

In [123]:
# Note: ProbitLink() fits a probit model; a logistic regression in the strict sense would use LogitLink()
logit = glm(fm, training, Binomial(), ProbitLink())
prediction = GLM.predict(logit,testing)
prediction_class = [if x < 0.5 0 else 1 end for x in prediction]
# Compute accuracy on the test set
@printf "Accuracy: %.3f%%\n" mean((prediction_class .== testing."f61"))*100

Accuracy: 76.471%


## SVM

In [117]:
X = Matrix(training[:,1:60])'
y = training."f61"

test = Matrix(testing[:,1:60])'
ytest = testing."f61"

model = svmtrain(X, y)
ŷ, decision_values = svmpredict(model, test)

# Compute accuracy
@printf "Accuracy: %.3f%%\n" mean((ŷ .== ytest))*100


Accuracy: 75.000%


## KNN

In [118]:
using NearestNeighborModels, MLJBase

complete = vcat(training,testing)
X = MLJ.table(Matrix(complete[:,1:60]))
y = categorical(complete."f61")

knnc = KNNClassifier() 
knnc_mach = machine(knnc, X, y) # MLJ Machine
MLJBase.fit!(knnc_mach, rows=1:140) # train machine on a subset of the wrapped data `X`

p = predict_mode(knnc_mach, rows=141:208)
# Compute accuracy on the test set
@printf "Accuracy: %.3f%%\n" mean((p .== testing."f61"))*100

Accuracy: 79.412%


┌ Info: Training Machine{KNNClassifier,…}.
└ @ MLJBase /home/angel/.julia/packages/MLJBase/rMXo2/src/machines.jl:423


## Decision Tree

In [119]:
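model = DecisionTree.DecisionTreeClassifier(max_depth=2)
# Qualify fit! and predict: MLJ, MLJBase and GLM export functions with the same names
DecisionTree.fit!(model, Matrix(training[:,1:60]), training."f61")

p = DecisionTree.predict(model, Matrix(testing[:,1:60]))
prediction_class = [if x < 0.5 0 else 1 end for x in p]

# Compute accuracy on the test set
@printf "Accuracy: %.3f%%\n" mean((prediction_class .== testing."f61"))*100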

Accuracy: 64.706%


## Best performance

Each of the implementations above reports its accuracy on the test data. In this case, the KNN algorithm classified the test data best, with an accuracy of 79.412%.
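The comparison can be summarised in a few lines. The sketch below simply collects the accuracies printed by the cells above and converts them back into counts of correct predictions, assuming the test set has the 68 observations noted in Step 1. The `results` vector and variable names are introduced here for illustration.

```julia
# Test accuracies (%) as reported by the cells above
results = [
    "Linear regression"   => 69.118,
    "Logistic regression" => 76.471,
    "SVM"                 => 75.000,
    "KNN"                 => 79.412,
    "Decision tree"       => 64.706,
]
n_test = 68  # assumed size of the test set (208 observations minus 140 for training)

# Print the models from best to worst, with the implied number of correct predictions
for (name, acc) in sort(results; by=last, rev=true)
    correct = round(Int, acc / 100 * n_test)
    println(rpad(name, 20), acc, "%  (", correct, " of ", n_test, " correct)")
end

# Pick the model with the highest reported accuracy
best_name, best_acc = results[argmax(last.(results))]
println("Best model: ", best_name)
```

Since the test set is small, one observation corresponds to roughly 1.5 percentage points, so the gap between neighbouring models amounts to only a few predictions.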

# Conclusions

...