# Simple binary classification

In this notebook we study a binary classification problem, comparing several techniques on a dataset for sonar-based object detection.

## Problem statement
The Sonar Dataset involves predicting whether an object is a mine or a rock given the
strength of sonar returns at different angles. It is a binary (2-class) classification problem. The
classes are not perfectly balanced (111 mines versus 97 rocks). There are 208 observations with 60 input
variables and 1 output variable.

The file "sonar.mines" contains 111 patterns obtained by bouncing sonar
signals off a metal cylinder at various angles and under various
conditions.  The file "sonar.rocks" contains 97 patterns obtained from
rocks under similar conditions.  The transmitted sonar signal is a
frequency-modulated chirp, rising in frequency.  The data set contains
signals obtained from a variety of different aspect angles, spanning 90
degrees for the cylinder and 180 degrees for the rock.

## Code requirements

In [2]:
using DataFrames
using CSV
using StatsBase
using Printf
using ScikitLearn
using ScikitLearn: fit!, predict
using StatsBase: sample

# Step 1

The data set is processed and divided into two parts:
- A training section consisting of 60 random samples of objects classified as mines and 60 more of objects classified as rocks.
- A testing section containing the remaining 88 elements of the dataset.

In [3]:
myPath = "Data/sonar.csv"

# Draw a balanced training set (n mines and n rocks); the remaining rows form the test set
function trainingSelection(n, path=myPath)
    
    sonarRaw = CSV.read(path, DataFrame)
    sonar = sort!(sonarRaw, [61]) # Sort by the label column (column 61) so that all "M" rows precede the "R" rows
    
    k = size(sonar,1) # Total number of rows
    m = findfirst(isequal("R"), sonar[!,61]) # Index of the first rock row
    
    # After sorting there are m-1 mines and k-m+1 rocks
    if (n > (k-m+1) || n > (m-1))
        println("There is insufficient data for a balanced sample of $n rows per class.")
        return nothing
    end
    
    RowsMine = sample(1:(m-1), n, replace=false, ordered=true) # Random sample of row indices labelled M
    sonarMine = sonar[RowsMine, :] # DataFrame with the rows selected by RowsMine
    
    RowsRock = sample(m:k, n, replace=false, ordered=true) # Random sample of row indices labelled R
    sonarRock = sonar[RowsRock, :] # DataFrame with the rows selected by RowsRock
    
    testData = sonar[Not(union(RowsMine, RowsRock)),:] # DataFrame with the rows that are in neither RowsMine nor RowsRock
    
    trainingData = vcat(sonarMine, sonarRock); # Concatenation of sonarMine and sonarRock (training set)
    
    return trainingData, testData
    
end

out = trainingSelection(60) # Balanced sample: 60 mines and 60 rocks for training
training = out[1] # Training set
testing = out[2] # Test set

# Recode the label column: mines ("M") -> 1.0, rocks ("R") -> -1.0
training.f61 = ifelse.(training.f61 .== "M", 1.0, -1.0)
testing.f61 = ifelse.(testing.f61 .== "M", 1.0, -1.0);

print(testing."f1")

[0.0335, 0.0307, 0.0116, 0.0331, 0.0428, 0.0094, 0.0228, 0.0363, 0.0261, 0.0162, 0.0249, 0.027, 0.0209, 0.0374, 0.0443, 0.0968, 0.079, 0.0731, 0.0164, 0.0707, 0.0526, 0.0721, 0.0654, 0.0207, 0.0209, 0.0131, 0.0117, 0.0258, 0.0217, 0.0163, 0.0221, 0.0411, 0.013, 0.018, 0.0635, 0.0201, 0.034, 0.0209, 0.0368, 0.0315, 0.0056, 0.0392, 0.0129, 0.0366, 0.0116, 0.0131, 0.0272, 0.0187, 0.0323, 0.0522, 0.0303, 0.01, 0.0317, 0.0164, 0.0039, 0.0079, 0.009, 0.0126, 0.0293, 0.0177, 0.0189, 0.0084, 0.0311, 0.0333, 0.0068, 0.0093, 0.0373, 0.0119, 0.0131, 0.0293, 0.0152, 0.0216, 0.0225, 0.013, 0.0067, 0.0216, 0.0208, 0.0139, 0.0202, 0.0239, 0.0336, 0.0409, 0.0188, 0.0856, 0.0126, 0.0253, 0.0025, 0.0291]

The cell above also recodes the two classes as positive cases (mines -> 1) and negative cases (rocks -> -1). Both parts are then written to new files so that the same split can be reused in further tests.

## Dump Data

In [4]:
CSV.write("Data/training.csv", training)
CSV.write("Data/test.csv", testing);

# Step 2: Metrics

## Confusion Matrix

A confusion matrix is a tabular summary of the ground-truth labels versus the model predictions. In the convention used here, each row represents the instances of a predicted class and each column represents the instances of an actual class. The confusion matrix is not itself a performance metric, but it is the basis from which other metrics are computed.
Each cell in the confusion matrix represents an evaluation factor:

- True Positive (TP): the number of positive-class samples the model predicted correctly.
- True Negative (TN): the number of negative-class samples the model predicted correctly.
- False Positive (FP): the number of negative-class samples the model incorrectly predicted as positive. This corresponds to a Type I error in statistical nomenclature; which error counts as Type I depends on the choice of the null hypothesis.
- False Negative (FN): the number of positive-class samples the model incorrectly predicted as negative. This corresponds to a Type II error in statistical nomenclature; its position in the confusion matrix also depends on the choice of the null hypothesis.
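
As a minimal sketch of how the four cells can be counted for labels encoded as 1 (mine) and -1 (rock), the helper below compares true and predicted labels element-wise. The function name `confusion_counts` and the toy vectors are assumptions made for illustration; they are not part of the notebook's data or code.

```julia
# Count the four confusion-matrix cells. `positive` is the label treated as
# the positive class (1 = mine in this notebook's encoding).
function confusion_counts(y_true::AbstractVector, y_pred::AbstractVector; positive = 1)
    length(y_true) == length(y_pred) || throw(ArgumentError("length mismatch"))
    tp = count((y_true .== positive) .& (y_pred .== positive))  # positives predicted positive
    tn = count((y_true .!= positive) .& (y_pred .!= positive))  # negatives predicted negative
    fp = count((y_true .!= positive) .& (y_pred .== positive))  # negatives predicted positive
    fn = count((y_true .== positive) .& (y_pred .!= positive))  # positives predicted negative
    return (tp = tp, tn = tn, fp = fp, fn = fn)
end

# Toy labels, made up for illustration (not taken from the sonar data)
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]

cells = confusion_counts(y_true, y_pred)
println(cells)
```

The four counts always add up to the number of samples, which is a quick sanity check on any confusion matrix.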

## Selected Metric

**Accuracy** is the metric used to evaluate the performance of the models considered below.

Accuracy is one of the standard metrics for evaluating classification models. It is the fraction of predictions that the model got right. Formally:

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
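
The formula can be checked on a small made-up example. The helper `accuracy` and the toy label vectors below are illustrative assumptions. For labels encoded as 1 and -1, the mean of the element-wise matches (the idiom `mean(prediction .== y_test)` used in the cells of this notebook) gives the same value as the formula.

```julia
using Statistics: mean

# Accuracy from the four confusion-matrix cells
accuracy(tp, tn, fp, fn) = (tp + tn) / (tp + tn + fp + fn)

# Toy labels, made up for illustration: 4 of the 6 predictions are correct
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]

acc_cells  = accuracy(2, 2, 1, 1)      # TP = 2, TN = 2, FP = 1, FN = 1 for the vectors above
acc_direct = mean(y_pred .== y_true)   # fraction of matching labels

println(acc_cells ≈ acc_direct)
```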

# Step 3

With the data processed and separated into training and testing sets, we now apply the following algorithms to this problem.

In [5]:
x_train = Array(training)[:, 1:60] # Separate the feature variables (columns 1-60) from the labels (column 61)
y_train = Array(training)[:,61]

x_test = Array(testing)[:, 1:60]
y_test = Array(testing)[:, 61];

## Linear regression

In [6]:
@sk_import linear_model: LinearRegression

linReg_model = LinearRegression()

fit!(linReg_model, x_train, y_train)

prediction = predict(linReg_model, x_test)
prediction_class = [if x < 0 -1 else 1 end for x in prediction]
@printf "Accuracy: %.3f%%\n" mean((prediction_class .== y_test))*100

Accuracy: 51.136%
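
Linear regression is not a classifier by itself: the cell above fits real-valued scores to the 1 / -1 targets and then thresholds them at zero. A minimal sketch of the same idea in plain Julia follows, using a least-squares solve on made-up one-dimensional data. The matrix `X`, the targets `y` and the variable names are assumptions for illustration, and the backslash solve stands in for the scikit-learn estimator rather than reproducing it.

```julia
using LinearAlgebra

# Toy design matrix: a column of ones (intercept) and one feature
X = [1.0 0.0;
     1.0 1.0;
     1.0 2.0;
     1.0 3.0]
y = [-1.0, -1.0, 1.0, 1.0]   # targets encoded as -1 / 1

w = X \ y                    # least-squares solution of X*w ≈ y
scores = X * w               # real-valued regression outputs

# Same thresholding rule as in the notebook cell: negative score -> -1, otherwise 1
labels = [s < 0 ? -1 : 1 for s in scores]
println(labels)
```

Because the regression output is an unbounded score rather than a class, the threshold at zero is a modelling choice that only makes sense with the symmetric -1 / 1 encoding.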


## Logistic regression

In [7]:
@sk_import linear_model: LogisticRegression

logReg_model = LogisticRegression()

fit!(logReg_model, x_train, y_train)

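prediction = predict(logReg_model, x_test)
# predict already returns the class labels (-1.0 or 1.0), so the threshold below leaves them unchanged
prediction_class = [if x < 0 -1 else 1 end for x in prediction]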

@printf "Accuracy: %.3f%%\n" mean((prediction_class .== y_test))*100

Accuracy: 57.955%
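
For intuition, a logistic model assigns the positive class when the estimated probability σ(w·x + b) is at least 0.5, which is equivalent to w·x + b ≥ 0. The sketch below illustrates that decision rule with hypothetical weights `w` and bias `b`; they are not the coefficients fitted in the cell above, and `classify` is a name introduced only for this example.

```julia
using LinearAlgebra: dot

sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical parameters, chosen only for illustration
w = [2.0, -1.0]
b = 0.5

# Positive class (1) when the estimated probability is at least 0.5, else -1
classify(x) = sigmoid(dot(w, x) + b) >= 0.5 ? 1 : -1

label_a = classify([1.0, 1.0])   # linear score  1.5, probability above 0.5
label_b = classify([0.0, 2.0])   # linear score -1.5, probability below 0.5
println((label_a, label_b))
```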


In [8]:
m1 = size(y_train,1) # Split the training data in order to tune the hyperparameters of SVM and KNN
function t1_t2_selector(m,t)
    # Draw t distinct row indices from 1:m, returned in ascending order
    sam = sample(1:m, t, replace=false, ordered=true)
    return sam
end

r = t1_t2_selector(m1,30) # Sample of 30 rows used to tune the hyperparameters

t1 = training[r,:]
t2 = training[Not(r),:]
    
x_htrain = Array(t1)[:, 1:60]
y_htrain = Array(t1)[:,61]

x_train = Array(training)[:, 1:60]
y_train = Array(training)[:,61];

## SVM

In [15]:
@sk_import svm: SVC

aciertos_c = []
for i in 0.1:0.1:100 # Search for the hyperparameter C between 0.1 and 100
    svm = SVC( C = i)
    fit!(svm, x_htrain, y_htrain)

    svm_prediccion = predict(svm, x_test)

    svm_point = mean((svm_prediccion .== y_test))*100

    push!(aciertos_c, svm_point)
end

# Value of C selected from the accuracy scores. Grid index j corresponds to C = 0.1*j,
# so this expression returns the grid value one step above the best-scoring one.
Cop = 0.1*(1 + argmax(aciertos_c))
println("The value of C is: $Cop")


svm = SVC( C = Cop) 
fit!(svm, x_train, y_train) 
svm_prediction = predict(svm, x_test)

@printf "Accuracy: %.3f%%\n" mean((svm_prediction .== y_test))*100

The value of C is: 7.800000000000001
Accuracy: 90.909%
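
The search above can also be written so that the chosen value is read directly from the grid instead of being rebuilt from its index. The sketch below is one way to do this. `best_hyperparameter` and `toy_score` are hypothetical names, and `toy_score` merely stands in for "fit a model with this value on one split and return its accuracy on another".

```julia
# Evaluate every candidate and return the best one together with its score.
# `score` is any function mapping a hyperparameter value to a number to maximise.
function best_hyperparameter(candidates, score)
    scores = [score(c) for c in candidates]
    return candidates[argmax(scores)], maximum(scores)
end

Cs = 0.1:0.1:1.0                  # candidate grid (shortened for the example)
toy_score(c) = 1 - abs(c - 0.4)   # hypothetical score that peaks at c = 0.4

best_C, best_score = best_hyperparameter(Cs, toy_score)
println("Best C: $best_C")
```

In the notebook, the scoring function could fit on `t1` and evaluate on `t2`, the part of the training data not used for fitting, so that the test set is touched only for the final accuracy. This is an alternative arrangement, not what the cells above do.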




## KNN

In [10]:
@sk_import neighbors: KNeighborsClassifier

aciertos_c = []
for i in 1:10 # Search for the hyperparameter n_neighbors between 1 and 10
    knn = KNeighborsClassifier(n_neighbors = i)
    fit!(knn,x_htrain, y_htrain)
    knn_prediccion = predict(knn, x_test)

    knn_point = mean((knn_prediccion .== y_test))*100
    push!(aciertos_c, knn_point)
end

knn = KNeighborsClassifier(n_neighbors = argmax(aciertos_c))
fit!(knn, x_train, y_train)
knn_prediction = predict(knn, x_test)

@printf "Accuracy: %.3f%%\n" mean((knn_prediction .== y_test))*100

Accuracy: 75.000%
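
To make the role of `n_neighbors` concrete, here is a minimal sketch of the k-nearest-neighbours rule in plain Julia: find the `k` training points closest to a query and take a majority vote of their labels. The function `knn_predict`, the Euclidean distance and the toy data are assumptions for illustration; this is not the scikit-learn implementation, whose defaults and tie-breaking may differ.

```julia
using LinearAlgebra: norm

# Predict the label of query point x by majority vote among its k nearest
# training rows. Labels are assumed to be encoded as 1 / -1, so the sign of
# their sum is the majority; a tie (possible for even k) is resolved as 1.
function knn_predict(X_train, y_train, x, k)
    dists = [norm(X_train[i, :] .- x) for i in 1:size(X_train, 1)]
    nearest = partialsortperm(dists, 1:k)   # indices of the k smallest distances
    return sum(y_train[nearest]) >= 0 ? 1 : -1
end

# Toy data, made up for illustration: two points near the origin, two near (1, 1)
X_toy = [0.0 0.0;
         0.1 0.0;
         1.0 1.0;
         0.9 1.0]
y_toy = [-1, -1, 1, 1]

pred_low  = knn_predict(X_toy, y_toy, [0.05, 0.1], 3)
pred_high = knn_predict(X_toy, y_toy, [1.0, 0.9], 3)
println((pred_low, pred_high))
```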


## Decision Tree

In [11]:
@sk_import tree: DecisionTreeClassifier

arb_modelo = DecisionTreeClassifier()

fit!(arb_modelo, x_train, y_train) 

arb_prediccion = predict(arb_modelo, x_test);

@printf "Accuracy: %.3f%%\n" mean((arb_prediccion .== testing."f61"))*100

Accuracy: 76.136%


## Best performance

Every model above was scored with the same metric, accuracy, on the same test set. In this run the SVM classified the test data best, with an accuracy of 90.9%, followed by the decision tree (76.1%), KNN (75.0%), logistic regression (58.0%) and linear regression (51.1%).