# The ScikitLearn.jl library

The Scikit-learn library is an open source machine learning library developed for the Python programming language, the first version of which dates back to 2010. It implements a large number of machine learning models, related to tasks such as classification, regression, clustering or dimensionality reduction. These models include Support Vector Machines (SVM), decision trees, random forests, or k-means. It is currently one of the most widely used libraries in the field of machine learning, due to the large number of functionalities it offers as well as its ease of use, since it provides a uniform interface for training and using models. The documentation for this library is available at https://scikit-learn.org/stable/.

For Julia, the ScikitLearn.jl library implements this interface and the algorithms contained in the scikit-learn library, supporting both Julia's own models and those of the scikit-learn library. The latter is done by means of the PyCall.jl library, which allows code written in Python to be executed from Julia in a transparent way for the user, who only needs to have ScikitLearn.jl installed. Documentation for this library can be found at https://scikitlearnjl.readthedocs.io/en/latest/.

As mentioned above, this library provides a uniform interface for training different models. This is reflected in the fact that the names of the functions for creating and training models will be the same regardless of the models to be developed. In the assignments of this course, in addition to ANNs, the following models available in the scikit-learn library will be used:

- Support Vector Machines (SVM)
- Decision trees
- kNN

In order to use these models, it is first necessary to import the library (using ScikitLearn, which must be previously installed with

```Julia
import Pkg;
Pkg.add("ScikitLearn"))
```

The scikit-learn library offers more than 100 types of  different models. To import the models to be used, you can use @sk_import. In this way, the following lines import respectively the first 3 models mentioned above that will be used in the practices of this subject:

```Julia
@sk_import svm: SVC
@sk_import tree: DecisionTreeClassifier
@sk_import neighbours: KNeighborsClassifier
```

When training a model, the first step is to generate it. This is done with a different function for each model. This function receives as parameters the model's own parameters. Below are 3 examples, one for each type of model that will be used in these course assignments:

```Julia
model = SVC(kernel="rbf", degree=3, gamma=2, C=1);
model = DecisionTreeClassifier(max_depth=4, random_state=1);
model = KNeighborsClassifier(3);
```

An explanation of the parameters accepted by each of these functions can be found in the library documentation. In the particular case of decision trees, as can be seen, one of these parameters is called `random_state`. This parameter controls the randomness in a particular part of the tree construction process, namely in the selection of features to split a node of the tree. The Scikit-Learn library uses a random number generator in this part, which is updated with each call, so that different calls to this function (together with its subsequent calls to the `fit!` function) to train the model will result in different models. To control the randomness of this process and make it deterministic, it is best to give it an integer value as shown in the example. Thus, the creation of a decision tree with a set of desired inputs and outputs and a given set of hyperparameters is a deterministic process. In general, it is more advisable to be able to control the randomness of the whole model development process (cross-validation, etc.) by means of a random seed that is set at the beginning of the whole process.

Once created, any of these models can be adjusted with the `fit!` function.

### Question

What does the fact that the name of this function ends in bang (!) indicate?

`By default, Julia passes the function parameters by value, not allowing them to be modified inside the function. Whenever we are calling a function with a ! we are indicating that the parameters passed are passed by reference, thus they can be modified inside the function. As we are calling fit! with a bang, we are assuming that the parameters we are passing to the fit function can be modified (i.e. the model after the fit! execution will be a trained model).`

Contrary to the Flux library, where it was necessary to write the ANN training loop, in this library the loop is already implemented, and it is called automatically when the `fit!` function is executed. Therefore, it is not necessary to write the code for the training loop.

### Question

As in the case of ANNs, a loop is necessary for training several models. Where in the code (inside or outside the loop) will you need to create the model? Which models will need to be trained several times and which ones only once? Why?

`Only artificial neural networks models will need to be trained several times, as they depend on random seeds to perform the train. Support vector machines, decision trees or k-nn models are all mathematical deterministic models that no matter how many times they are trained they will always output the same result. If we desire to train our models several times the model must be created inside the loop.`

An example of the use of this function can be seen in the following line:

```Julia
fit!(model, trainingInputs, trainingTargets);
```

As can be seen, the first argument of this function is the model, the second is an array of inputs, and the third is a vector of desired outputs. It is important to realise that this parameter with the desired outputs is not an array like in the case of ANNs but a vector whose each element will correspond to the label associated to that pattern, and can be of any type: integer, string, etc. The main reason for this is that there are some models that do not accept desired outputs with the one-hot-encoding.

An important issue to consider is the layout of the data to be used. As has been shown in previous assignments, the patterns must be arranged in columns to train an ANN, being each row an attribute. Outside the world of ANNs, and therefore with the rest of the techniques to be used in this course, the patterns are usually assumed to be arranged in rows, and therefore each column in the input matrix corresponds to an attribute, being a much more intuitive way.

### Question

Which condition must the matrix of inputs and the vector of desired outputs passed as an argument to this function fulfil?

`As stated earlier, when training ANNs the input matrix will have a pattern per column and each row will be a feature of such pattern while for KNN, SVM and DT models the input matrix will have a pattern per row where each column will be a feature of such pattern. The output matrix will be a vector where the element i correspond to the expected output for the input pattern i. In the case of ANNs the vector must have numeric elements, while for KNN, SVM and DT can be any type. Suffice to say that both the input and output matrix will have to have the same number of patterns.`

Finally, once the model has been trained, it can be used to make predictions. This is done by means of the predict function. An example of its use is shown below:

```Julia
testOutputs = predict(model, testInputs);
```

The model being used is an in-memory structure with different fields, and it can be very useful to look up the contents of these fields. To see which fields each model has, you can write the following:

```Julia
println(keys(model));
```

Depending on the type of model, there will be different fields. For example, for a kNN, the following fields, among others, could be consulted:

```Julia
model.n_neighbors
model.metric
model.weights
```

For an SVM, some other interesting fields could be the following:

```Julia
model.C
model.support_vectors_
model.support_
model.support_
```

In the case of an SVM, a particularly interesting function is `decision_function`, which returns the distances to the hyperplane of the passed patterns. This is useful, for example, to implement a "one-against-all" strategy to perform multi-class classification. An example of the use of this function is shown below:

```Julia
distances = decision_function(model, inputs);
```

### Question

In the case of using decision trees or kNN, a corresponding function is not necessary to perform the "one-against-all" strategy, why?

`The decision function returns an array where each member i indicates if the new input is to the 'left' or 'right' to the i hyperplane. As k-nearest neighbour algorithm does not need for hyperplanes, we wont need this function. `

However, the SVM implementation in the Scikit-Learn library already allows multi-class classification, so it is not necessary to use a "one-against-all" strategy for these cases.

Finally, it should be noted that these models usually receive pre-processed inputs and outputs, with the most common pre-processing being the normalisation already described in a previous assignment. Therefore, the developed normalisation functions should also be used on the data to be used by these models.

In this assignment, you are asked to develop a function called ```modelCrossValidation``` based on the functions developed in previous assignments that allows to validate models in the selected classification problem using the three techniques described here.

This function should perform cross-validation and use the metrics deemed most appropriate for the specific problem. This cross-validation can be done by modifying the code developed in the previous assignment.

This function must receive the following parameters:

- Algorithm to be trained, among the 4 used in this course, together with its parameter. The most important parameters to specify for each technique are:
    </br>
    
    - ANN
        - Architecture (number of hidden layers and number of neurons in each hidden layer) and transfer funtion in each layer. In "shallow" networks such as those used in this course, the transfer function has less impact, so a standard one, shuch as `tansig` or `logsig`, can be used.
        - Learning rate
        - Ratio of patterns used for validation
        - Number of consecutive iterations without improving the validation loss to stop the process
        - Number of times each ANN is trained.
        
        ### Question
        
        Why should a linear transfer function not be used for neurons in the hidden layers?
        
        ```
        In order for our model to be correctly trained artificial neural networks employ whats called the backpropagation algorithm. This algorithm adjusts the model weights in order to attempt to reduce the loss function. If we employ linear transfer functions in our model, then the backpropagation algorithm will not be able to know what connection weights adjust.```
        
        ### Question
        
        The other models do not have the number of times to train them as a parameter. Why? If you train several times, Which statistical properties will the results of these trainings have?
        
        ```Because the reason of training an ANN several times is the non-deterministic nature of ANNs. SVMs, KNNs or DTs are determinsitic, and, therefore, do not need to average the results obtained from repeated trainings.```
        
        
    </br>  
    
    - SVM
        - Kernel (and kernel-specific parameters)
        - C
        
    - Decision trees
        - Maximum tree depth
        
    - kNN
        - k (number of neighbours to be considered)
        
        
   </br> 
- Already standardised input and desired outputs matrices.
    </br>  
    - As stated above, the desired outputs must be indicated as a vector where each element is the label corresponding to each pattern (therefore, of type `Array{Any,1}`). In the case of ANN training, the desired outputs shall be encoded as done in previous assignments.
    </br>  
    - As previously described, in the case of using techniques such as SVM, decision trees or kNN, the one-hot-encoding configuration will not be used. In these cases, the `confusionMatrix` function developed in a previous assignment will be used to calculate the metrics, which accepts as input two vectors (outputs and desired outputs) of type `Array{Any,1}`.
    
    ### Question
    
    Has it been necessary to standardise the desired outputs? Why?
    
    ```We do not need to standarise the outputs because the classifiers we are employing are insensitive to the scale of the outputs.```

    </br> 
- Cross-validation indices. It is important to note that, as in the previous assignment, the partitioning of the patterns in each fold need to be done outside this function, because this allows this same partitioning to be used then training other models. In this way, cross-validation is performed with the same data and the same partitions in all classes.

Since most of the code will be the same, do not develop 4 different functions, one for each model, but only one function. Inside it, at the time of generation the model in each fold, and depending on the model, the following changes should be made:

- If the model is an ANN, the desired outputs shall be encoded by means of the code developed in previous assignments. As this model is non-deterministic, it will be nevessary to make a new loop to train several ANNs, splitting the training data into training and validation (if validation set is used) and calling the function defined in previous assignments to create and traing an ANN.

- If the model is not an ANN, the code that trains the model shall be developed. This code shall be the same for each of the rematining 3 types of models (SVM, decision trees, and KNN), with the line where the model is called being the only difference.

In turn, this function should return, at least, the values for the selected metrics. Once this function has been developed, the experimental part of the assignment begins. The objective is to determine which model with a specific combination of hyperparameters offers the best results, for which the above function will be run for each of the 4 types of models, and for each model it will be run with different values in its hyperparameters.

- The results obtained should be documented in the report to be produced, for which it will be useful to show the results in tabular and/or graphical form.

- When it comes to displaying a confusion matrix in the report, an important question is which one to show given that a lot of trainings have been performed. The cross-validation technique does not generate a final model, but allows comparing different algorithms and configurations to choose the model or parameter configuration that returns the best results. Once chosen, it is necessary to train a "final" model from scratch by using all the patterns as the training set, that is, without separating patterns for testing. In this way, the performance of this model and configuration is expected to be slightly higher than that obtained through cross-validation, since more patterns have been used to train it. This is the final model that would be used in production, and from which a confusion matrix can be obtained.

In [3]:
using ScikitLearn

@sk_import svm: SVC
@sk_import tree: DecisionTreeClassifier
@sk_import neighbors: KNeighborsClassifier


┌ Info: Running `conda install -y -c conda-forge 'libstdcxx-ng>=3.4,<11.4'` in root environment
└ @ Conda /home/poli/.julia/packages/Conda/x2UxR/src/Conda.jl:127


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done


┌ Info: Running `conda install -y -c conda-forge 'libstdcxx-ng>=3.4,<11.4'` in root environment
└ @ Conda /home/poli/.julia/packages/Conda/x2UxR/src/Conda.jl:127


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done


┌ Info: Running `conda install -y -c conda-forge 'libstdcxx-ng>=3.4,<11.4'` in root environment
└ @ Conda /home/poli/.julia/packages/Conda/x2UxR/src/Conda.jl:127


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done


PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

In [5]:
# Test of the symbol type
# as we can see if we employ eval(symbol) 
# we get the reference

println(typeof(:SVC))
println(typeof(eval(:SVC)))

Symbol
PyCall.PyObject


In [82]:
using Flux;
using Flux.Losses;
using Dates;
using Statistics;
using Random;

#############################################################
############# FUNCTIONS FROM PREVIOUS NOTEBOOKS #############
#############  (crossvalidation functions were  #############
#############    remade to work as expected)    #############
#############################################################

####### ANN RELATED
####### FUNCTIONS

function holdOut(N::Int, P::Real)    
    # generate random index vector
    index_vector=Random.randperm(MersenneTwister(Dates.datetime2epochms(Dates.now())), N)
    cut_point = floor(Int,N*P)
    cut_set = index_vector[1:cut_point]
    train_set = index_vector[cut_point:length(index_vector)]
    return train_set, cut_set
end

function buildClassANN(numInputs::Int, topology::AbstractArray{<:Int,1}, numOutputs::Int;
                    transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology))) 
    ann = Chain()
    numInputsLayer = numInputs
    for numOutputLayers = topology
        ann = Chain(ann..., Dense(numInputsLayer, numOutputLayers, σ))
        numInputsLayer = numOutputLayers
    end
    if (numOutputs == 1)
        ann = Chain(ann..., Dense(numInputsLayer, 1, σ))
    else
        ann = Chain(ann..., Dense(numInputsLayer, numOutputs, identity))
        ann = Chain(ann..., softmax)
    end
    return ann
end

# Function to train classification artificial neural networks
function trainClassANN(
        topology::AbstractArray{<:Int,1},  
        trainingDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}; 
        validationDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}= 
                    (Array{eltype(trainingDataset[1]),2}(undef,0,0), falses(0,0)), 
        testDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}= 
                    (Array{eltype(trainingDataset[1]),2}(undef,0,0), falses(0,0)), 
        transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology)), 
        maxEpochs::Int=1000, 
        minLoss::Real=0.0, 
        learningRate::Real=0.01,  
        maxEpochsVal::Int=20, 
        showText::Bool=false)
    
    # Create ANN and loss function for classification problems
    training_inputs, training_outputs = trainingDataset
    validation_inputs, validation_outputs = validationDataset
    test_inputs, test_outputs = testDataset
    
    input_feats_size, output_classes_size = size(training_inputs,2), size(training_outputs,2)
    
    ann = buildClassANN(input_feats_size, topology, output_classes_size)
    loss(x, y) = (size(y,1) == 1) ? Losses.binarycrossentropy(ann(x),y) : Losses.crossentropy(ann(x),y)
    
    # Compute base array
    training_losses = Float64[]
    validation_losses = Float64[]
    test_losses = Float64[]
    
    training_accuracies = Float64[]
    validation_accuracies = Float64[]
    test_accuracies = Float64[]

    # Metrics computation inner function
    
    current_epoch = 0  
    current_epoch_val = 0

    function compute_metrics()
        training_loss = loss(training_inputs', training_outputs')
        validation_loss = loss(validation_inputs', validation_outputs')
        test_loss = loss(test_inputs', test_outputs')
        
        training_ann_outputs = ann(training_inputs')
        validation_ann_outputs = ann(validation_inputs')
        test_ann_outputs = ann(test_inputs')
        
        training_accuracy = accuracy(training_ann_outputs', training_outputs)
        validation_accuracy = accuracy(validation_ann_outputs', validation_outputs)
        test_accuracy = accuracy(test_ann_outputs', test_outputs)

        if showText
            println("Epoch ", current_epoch)
            println("Training loss: ", training_loss, ", Training accuracy: ", 100*training_accuracy," %")
            if length(validation_inputs) > 1
                println("Validation loss: ", validation_loss, ", Validation accuracy: ", 100*validation_accuracy, " %") 
            end
            if length(test_inputs) > 1
                println("Test loss: ", test_loss, ", Test accuracy: ", 100*test_accuracy," %")
            end
        end
        return (training_loss, training_accuracy, validation_loss, 
            validation_accuracy, test_loss, test_accuracy)
    end
    
    
    # Compute initial metrics
    (training_loss, training_accuracy, validation_loss, 
        validation_accuracy, test_loss, test_accuracy) = compute_metrics()
    
    push!(training_losses, training_loss)
    push!(validation_losses, validation_loss)
    push!(test_losses, test_loss)
    
    push!(training_accuracies, training_accuracy)
    push!(validation_accuracies, validation_accuracy)
    push!(test_accuracies, test_accuracy)
    
        
    # Store initial ANN as the 'best'
    best_validation_loss = validation_loss;
    final_ann = deepcopy(ann);
    
    # Training loop
    while (current_epoch < maxEpochs) && (training_loss > minLoss) && (current_epoch_val < maxEpochsVal)
        current_epoch += 1
        
        Flux.train!(loss, Flux.params(ann), [(training_inputs', training_outputs')], ADAM(learningRate))
        (training_loss, training_accuracy, validation_loss, 
            validation_accuracy, test_loss, test_accuracy) = compute_metrics();
        
        push!(training_losses, training_loss)
        push!(validation_losses, validation_loss)
        push!(test_losses, test_loss)
        
        # Check for early stop (only if we have a validation dataset)
        if length(validation_inputs) > 1
            if (validation_loss < best_validation_loss)
                # reset the number of validation epochs, because we have an improved metric
                # and store current ann as best
                if showText
                    println("[->] Found new best model: old_val_loss=",validation_loss,", new_val_loss=",best_validation_loss)
                end
                current_epoch_val = 0;
                best_validation_loss = validation_loss;
                final_ann = deepcopy(ann);
            else
                current_epoch_val += 1;
            end
        end
        
    end
    
    return (final_ann, training_losses, validation_losses, test_losses, training_accuracies, validation_accuracies, test_accuracies)
end



function trainClassANN(
        topology::AbstractArray{<:Int,1}, 
        trainingDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}, 
        kFoldIndices::	Array{Int64,1}; 
        transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology)), 
        maxEpochs::Int=1000, 
        minLoss::Real=0.0, 
        learningRate::Real=0.01, 
        repetitionsTraining::Int=1, 
        validationRatio::Real=0.0, 
        maxEpochsVal::Int=20)

    
    # Transform the kFoldIndexes into a set to remove duplicate and
    # compute the number of folds as the size of that set
    numFolds = size(unique(kFoldIndices))[1]
    
    # create our metrics vectors
    train_losses_array = []
    train_accuracy_array = []
    
    test_losses_array  = []
    test_accuracy_array = []
    
    validation_losses_array = []
    validation_accuracy_array = []
        
    for numFold = 1:numFolds
        # Extract the train and test dataset using the k-folds
        train_in  = trainingDataset[1][kFoldIndices.!=numFold,:]
        train_out = trainingDataset[2][kFoldIndices.!=numFold,:]
        
        test_in   = trainingDataset[1][kFoldIndices.==numFold,:]
        test_out  = trainingDataset[2][kFoldIndices.==numFold,:]
        
        # If we are using a validation dataset, perform a holdout
        train_idx, val_idx = holdOut(size(train_in,1), validationRatio)
            
        train_dataset      = train_in[train_idx,:], train_out[train_idx,:]
        validation_dataset = train_in[val_idx,:], train_out[val_idx,:]
        test_dataset       = test_in, test_out

        for i=1:repetitionsTraining
            (final_ann, train_losses, validation_losses, test_losses, training_accuracies, 
                validation_accuracies, test_accuracies) = trainClassANN(topology, 
                                                                        train_dataset, 
                                                                        validationDataset=validation_dataset, 
                                                                        testDataset=test_dataset, 
                                                                        transferFunctions=transferFunctions, 
                                                                        maxEpochs=maxEpochs, 
                                                                        minLoss=minLoss, 
                                                                        learningRate=learningRate, 
                                                                        maxEpochsVal=maxEpochsVal,
                                                                        showText=true)

            # Append the data to our arrays
            append!(train_losses_array, train_losses)
            append!(train_accuracy_array, training_accuracies)

            append!(test_losses_array, test_losses)
            append!(test_accuracy_array, test_accuracies)

            append!(validation_losses_array, validation_losses)
            append!(validation_accuracy_array, validation_accuracies)
        end
    end
                
    return (train_losses_array, train_accuracy_array, test_losses_array, test_accuracy_array,
        validation_losses_array, validation_accuracy_array)
end


####### CONFUSION MATRIX 
####### RELATED FUNCTIONS



function accuracy(outputs::AbstractArray{Bool,2}, targets::AbstractArray{Bool,2}) 

    if (size(targets,2)==1)
        return accuracy(outputs[:,1], targets[:,1])
    else
        classComparison = targets .== outputs
        correctClassifications = all(classComparison, dims=2)
        return mean(correctClassifications)
    end
end

function accuracy(outputs::AbstractArray{<:Real,2}, targets::AbstractArray{Bool,2}, threshold::Real=0.5)
    if (size(targets,2)==1)
        return accuracy(outputs[:,1], targets[:,1])
    else
        classified_outputs=classifyOutputs(outputs)
        return accuracy(classified_outputs, targets)
    end
end

function classifyOutputs(outputs::AbstractArray{<:Real,2}; 
                        threshold::Real=0.5)
    if size(outputs, 2) == 1
        output = dataset .>= threshold
    else
        (_,indicesMaxEachInstance) = findmax(outputs, dims=2);
        bool_outputs = falses(size(outputs));
        bool_outputs[indicesMaxEachInstance] .= true
    end
    return bool_outputs
end

function confusionMatrix(outputs::AbstractArray{Bool,1}, targets::AbstractArray{Bool,1})
    
    tp = sum(outputs .& targets)      # select all true outputs that are true on target
    fp = sum(outputs .& .!targets)    # select all true outputs that are false on target
    tn = sum(.!outputs .& .!targets)  # select all false outputs that are false on target
    fn = sum(.!outputs .& targets)    # select all false outputs that are true on target
    
    conf_matrix = [tn fp; fn tp]
    
    accu= (tn+tp)/(tn+tp+fn+fp)
    erra= (fp+fn)/(tn+tp+fn+fp)
    reca= (tn==length(targets)) ? (tp/(fn+tp)) : 1
    spec= (tp==length(targets)) ? (tn/(fp+tn)) : 1
    prec= (tn==length(targets)) ? (tp/(tp+fp)) : 1
    npre= (tp==length(targets)) ? (tn/(tn+fn)) : 1
    
    f1 = (reca==prec==0) ? 2*(prec*reca/prec+reca) : 0
    
    return accu,erra,reca,spec,prec,npre,f1,conf_matrix
end

function confusionMatrix(outputs::AbstractArray{<:Real,1},targets::AbstractArray{Bool,1}; threshold::Real=0.5)
    outputs_boolean = outputs .> threshold
    return confusionMatrix(outputs_boolean, targets)
end


#########  ONE HOT ENCODING
#########  RELATED FUNCTIONS
function oneHotEncoding(feature::AbstractArray{<:Any,1}, classes::AbstractArray{<:Any,1})
    numClasses = length(unique(classes))

    if (numClasses == 2)
        oneHot = Array{Bool,2}(undef, size(feature,1), 1)
        oneHot[:,1] .= (feature.==classes[1])
    else
        oneHot = Array{Bool,2}(undef, size(feature,1), numClasses)
        for numClass = 1:numClasses
            oneHot[:,numClass] .= (feature.==classes[numClass])
        end
    end
    return oneHot
end
function oneHotEncoding(feature::AbstractArray{<:Any,1})
    return oneHotEncoding(feature, unique(feature))
end

#########  CROSSVALIDATION
#########  FUNCTIONS


## default k-fold
function crossvalidation(N::Int64, k::Int64)
    indices = repeat(1:k, Int64(ceil(N/k)))
    indices = indices[1:N]
    shuffle!(indices)
    return indices
end;

## k-fold with balanced k-sets
function crossvalidation(targets::AbstractArray{Bool,2}, k::Int64)
    # compute the nubmer of elements in our targets dataset
    n_rows = size(targets,1)
    
    indexes = zeros(Int64, size(targets,1))
    for class in eachcol(targets)
        n_elements = sum(class)
        current_class_indexes = crossvalidation(n_elements, k)
        
        i = 1
        j = 1
        for element in class
            if element == true 
                indexes[i] = current_class_indexes[j]
                j+=1
            end
            i+=1
        end
    end
    return indexes;
end

## k-fold with balanced k-sets and one-hot wrapper
function crossvalidation(targets::AbstractArray{<:Any,1}, k::Int64)
    targets = oneHotEncoding(targets, unique(targets));
    crossValidationIndices = crossvalidation(size(targets,1), k);
    
    return crossValidationIndices;
end;

#########  NORMALISATION
#########  FUNCTIONS

function stats(outputs)
    minimum = mapslices(Statistics.minimum, outputs; dims=1)[1]
    maximum = mapslices(Statistics.maximum, outputs; dims=1)[1]
    mean = mapslices(Statistics.mean, outputs; dims=1)[1]
    std = mapslices(Statistics.std, outputs; dims=1)[1]
    return [minimum, maximum, mean, std]
end

function calculateMinMaxNormalizationParameters(dataset::AbstractArray{<:Real,2})
    # function that takes a real matrix (i.e. array of reals with dimension 2)
    # this matrix is the data-set to our problem, where each row is a sample and each column is an attribute
    # return a 2-tuple of matrixes where each row is the minimum and maximum respectivelly
    
    min_matrix = []
    max_matrix = []
    
    for column in eachcol(dataset)
        r = stats(column)
        if min_matrix == [] || max_matrix == []
            min_matrix = r[1]
            max_matrix = r[2]
        else
            min_matrix = vcat(min_matrix, r[1])
            max_matrix = hcat(max_matrix, r[2])
        end
    end
    return reshape(min_matrix, (4,1)), reshape(max_matrix, (4,1))
end

function normalizeMinMax( dataset::AbstractArray{<:Real,2})
    # x scaled = x - min(x) / max(x) - min(x)
    min, max = calculateMinMaxNormalizationParameters(dataset)
    out = zeros(size(dataset, 1), size(dataset, 2))
    for i in axes(dataset, 1)
        for j in axes(dataset, 2)
            cmin, cmax = min[j], max[j]
            out[i,j] = dataset[i,j] - cmin / (cmax - cmin)
        end
    end
    
    return out
end
function normalizeZeroMean( dataset::AbstractArray{<:Real,2}) 
    mean, std = calculateZeroMeanNormalizationParameters(dataset)
    out = zeros(size(dataset, 1), size(dataset, 2))
    for i in axes(dataset, 1)
        for j in axes(dataset, 2)
            cmean, cstd = mean[j], std[j]
            out[i,j] = dataset[i,j] - cmean / cstd
        end
    end
    return out
end

function encode_categories(targets)
    if (length(unique(targets)) > 2)
        cats = unique(targets) .== permutedims(targets)
        return cats'
    else
        cats = targets .== unique(targets)[1]
        return cats
    end
end


encode_categories (generic function with 1 method)

In [207]:
function modelCrossValidation(modelType::Symbol,
        modelHyperparameters::Dict,
        inputs::AbstractArray{<:Real,2},
        targets::AbstractArray{<:Any,1},
        crossValidationIndices::Array{Int64,1})
    
    # Extract data from inputs and targets
    n_inputs  = size(inputs, 1)
    n_feats   = size(inputs, 2)
    n_classes = size(targets,2)
    
    
    # compute crossvalidation data
    
    # Transform the kFoldIndexes into a set to remove duplicate and
    # compute the number of folds as the size of that set
    numFolds = size(unique(crossValidationIndices),1)
    
    # create our metrics vectors
    # as we will not do a validation hold-out we will not create validation_xxx_arrays
    train_losses_array = []
    test_losses_array  = []
    train_accuracies_array = []
    test_accuracies_array = []
    
    # Build the model
    
    if modelType == :ANN
        println("Artificial Neural Network")
        
        targets = oneHotEncoding(targets)
        
        # get hyperparameters
        architecture = modelHyperparameters["architecture"]
        lr           = modelHyperparameters["lr"]
        val_ratio    = modelHyperparameters["val_ratio"]
        epochs       = modelHyperparameters["epochs"]
        early_stop   = modelHyperparameters["early_stop_epochs"]
        n_train      = modelHyperparameters["n_train"]
        # as we have already developed a crossvalidation function for
        # flux-built anns we will just return it here
        outputs = 
        (train_losses_array, train_accuracy_array, test_losses_array, test_accuracy_array,
        validation_losses_array, validation_accuracy_array) = trainClassANN(
            architecture, 
            (inputs, targets),
            crossValidationIndices, 
            validationRatio=val_ratio, 
            learningRate=lr, 
            maxEpochsVal=early_stop, 
            repetitionsTraining=n_train, 
            maxEpochs=epochs)

        return (train_losses_array, train_accuracy_array, test_losses_array, test_accuracy_array,
        validation_losses_array, validation_accuracy_array)
        
        
    elseif modelType == :SVC
        println("Support Vector Machine")
        
        kernel = modelHyperparameters["kernel"]
        pol_degree = modelHyperparameters["degree"]
        gamma = modelHyperparameters["gamma"]
        c_val = modelHyperparameters["c"]
        
        model = SVC(kernel=kernel, degree=pol_degree, gamma=gamma, C=c_val);

    elseif modelType == :DecisionTreeClassifier
        println("Decision Tree")
        
        max_depth = modelHyperparameters["max_depth"]
        
        model = DecisionTreeClassifier(max_depth=max_depth)
            
    elseif modelType == :KNeighborsClassifier
        print("K-Nearest Neighbors")
            
        n_neighbours = modelHyperparameters["n_neighbours"]
        
        model = KNeighborsClassifier(n_neighbours);
    else
        println("Unknown model type")
    end
    
    # now the model is built, perform training
    # this is performed the same way as with crossvalidation ANN
    
    test_accuracies = []
    classes = unique(targets)
    for numFold = 1:numFolds
        # Extract the train and test dataset using the k-folds
        train_input  = inputs[crossValidationIndices.!=numFold,:]
        train_target = targets[crossValidationIndices.!=numFold,:][:,1]
        
        test_input  = inputs[crossValidationIndices.==numFold,:]
        test_target  = targets[crossValidationIndices.==numFold,:][:,1]
        
        fit!(model, train_input, vec(train_target))
        test_outputs=predict(model, test_input)

        
        onehot_outputs = oneHotEncoding(test_outputs, classes)
        onehot_targets = oneHotEncoding(test_target, classes)
        append!(test_accuracies,accuracy(onehot_outputs, onehot_targets))
    end
    return test_accuracies
end

modelCrossValidation (generic function with 1 method)

In [9]:
using DelimitedFiles 


#### Read the data
dataset = readdlm("iris.data",',');

inputs = dataset[:,1:4];
inputs = convert(Array{Float32,2}, inputs); 
norm_input = normalizeMinMax(inputs)
targets = dataset[:,5];

#### Create the k-folds
k = 10
cross_val_indexes = crossvalidation(targets, k)

:Done

:Done

In [209]:
symbol = :KNeighborsClassifier

hyperparameters                    = Dict()

hyperparameters["n_neighbours"]    = 2

modelCrossValidation(symbol, hyperparameters, inputs, targets, cross_val_indexes)

K-Nearest Neighbors

10-element Vector{Any}:
 1.0
 0.8666666666666667
 1.0
 0.8666666666666667
 1.0
 0.9333333333333333
 0.9333333333333333
 0.9333333333333333
 0.9333333333333333
 1.0

In [210]:
symbol = :SVC

hyperparameters              = Dict()

hyperparameters["kernel"]    = "poly"
hyperparameters["degree"]    = 3
hyperparameters["gamma"]     = "auto"
hyperparameters["c"]         = 10

modelCrossValidation(symbol, hyperparameters, inputs, targets, cross_val_indexes)

Support Vector Machine


10-element Vector{Any}:
 1.0
 0.8
 1.0
 0.8
 1.0
 0.8666666666666667
 0.9333333333333333
 1.0
 1.0
 1.0

In [123]:
a = [1 2 3; 
    4 5 6]
b = [true; false]
s = (a,b)

print(a[b,:]);
s[1][b,:], s[2][b,:] 

[1 2 3]

([1 2 3], Bool[1;;])

In [192]:
## First train: ANNs

symbol = :ANN

hyperparameters = Dict()
hyperparameters["architecture"]=[2, 5]
hyperparameters["lr"]=0.001
hyperparameters["val_ratio"]=0.1
hyperparameters["epochs"]=10                
hyperparameters["early_stop_epochs"]=5
hyperparameters["n_train"]=1

modelCrossValidation(symbol, hyperparameters, inputs, targets, cross_val_indexes)

Artificial Neural NetworkEpoch 0
Training loss: 1.1058108, Training accuracy: 34.146341463414636 %
Validation loss: 1.1474886, Validation accuracy: 23.076923076923077 %
Test loss: 1.0411938, Test accuracy: 40.0 %
Epoch 1
Training loss: 1.104888, Training accuracy: 34.146341463414636 %
Validation loss: 1.1462607, Validation accuracy: 23.076923076923077 %
Test loss: 1.0421405, Test accuracy: 40.0 %
[->] Found new best model: old_val_loss=1.1462607, new_val_loss=1.1474886
Epoch 2
Training loss: 1.1039745, Training accuracy: 34.146341463414636 %
Validation loss: 1.1450423, Validation accuracy: 23.076923076923077 %
Test loss: 1.0430951, Test accuracy: 40.0 %
[->] Found new best model: old_val_loss=1.1450423, new_val_loss=1.1462607
Epoch 3
Training loss: 1.1030706, Training accuracy: 34.146341463414636 %
Validation loss: 1.143833, Validation accuracy: 23.076923076923077 %
Test loss: 1.0440577, Test accuracy: 40.0 %
[->] Found new best model: old_val_loss=1.143833, new_val_loss=1.1450423
Epoc

(Any[1.1058107614517212, 1.1048879623413086, 1.103974461555481, 1.103070616722107, 1.1021769046783447, 1.1012932062149048, 1.1004186868667603, 1.0995550155639648, 1.0987004041671753, 1.0978556871414185  …  1.106742024421692, 1.1060404777526855, 1.1053588390350342, 1.1046972274780273, 1.1040548086166382, 1.103432297706604, 1.1028281450271606, 1.1022428274154663, 1.1016758680343628, 1.101127028465271], Any[0.34146341463414637, 0.3333333333333333, 0.3252032520325203, 0.34146341463414637, 0.3170731707317073, 0.3008130081300813, 0.3089430894308943, 0.34959349593495936, 0.37398373983739835, 0.3008130081300813], Any[1.0411938428878784, 1.0421404838562012, 1.0430951118469238, 1.0440577268600464, 1.0450286865234375, 1.046007752418518, 1.0469954013824463, 1.047991156578064, 1.0487675666809082, 1.0495514869689941  …  1.0514878034591675, 1.0530509948730469, 1.0546318292617798, 1.0562301874160767, 1.0578460693359375, 1.0594792366027832, 1.0611220598220825, 1.062780737876892, 1.0644563436508179, 1.0

### Learn Julia

In this assignment, it is necessary to pass parameters which are dependent on the model. To do this, the simplest way is to create a variable of type Dictionary (actually the type is `Dict`) which works in a similar way to Python. For example, to specify the parameters of an SVM, you could create a variable as follows:

```Julia
parameters = Dict("kernel" => "rbf", "degree" => 3, "gamma" => 2, "C" => 1);
```

Another way of defining such a variable could be the following:

```Julia
parameters = Dict();

parameters["kernel"] = "rbf";
parameters["kernelDegree"] = 3;
parameters["kernelGamma"] = 2;
parameters["C"] = 1;
```

Once inside the function to be developed, the model parameters can be used to create the model objet as follows:

```Julia
model = SVC(kernel=parameters["kernel"], 
    degree=parameters["kernelDegree"], 
    gamma=parameters["kernelGamma"], 
    C=parameters["C"]);
```

In the same way, something similar could be done for decision trees and kNN.

Another type of Julia that may be interesting for this assignment is the `Symbol` type. An object of this type can be any symbol you want, simply by typing its name after a colon (":"). In this practice, you can use it to indicate which model you want to train, for example `:ANN`, `:SVM`, `:DecisionTree` or `:kNN`.