# Multiclass classification

When solving a classification problem, many existing machine learning models allow only two classes to be separated, usually referred to as "positive" and "negative". Positive patterns are commonly those related to what is to be detected, such as disease, alarm, or a type of object in an image. Negative patterns are often characterised by the absence of this characteristic that positive patterns have. To develop an ANN that classifies into two classes, a single neuron is needed in the output layer, with a logarithmic (or similar) sigmoidal transfer function, such that the output of the ANN will be between 0 and 1, and can be interpreted as the ANN's certainty in classifying a pattern as "positive". The classification into "negative" or "positive" is done in a simple way, by applying a threshold which is typically 0.5, although this can be changed.
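
As a minimal sketch of the thresholding step just described, the following lines apply a logistic sigmoid to some made-up output-neuron values and compare the result with the usual threshold of 0.5. The names `sigmoid`, `rawOutputs`, `certainty` and `isPositive` are illustrative assumptions, not part of the assignment code.

```Julia
# Logistic sigmoid: maps any real value to the interval (0, 1)
sigmoid(x) = 1 / (1 + exp(-x))

# Hypothetical raw values reaching the output neuron for four patterns
rawOutputs = [-2.0, -0.1, 0.3, 4.0]

# Certainty of the ANN that each pattern is "positive"
certainty = sigmoid.(rawOutputs)

# Apply the threshold (0.5 by default, although it can be changed)
threshold = 0.5
isPositive = certainty .>= threshold   # [false, false, true, true]
```

Raising the threshold makes the system more conservative when reporting "positive" patterns; lowering it has the opposite effect.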

However, there are many occasions when a system that is able to classify into more than two classes is desired. A simple example is a system that wants to classify an image according to whether a dog, cat or mouse, or some other type of animal is observed. In this case, you want to develop a 4-class classification system: "dog"/"cat"/"mouse"/"other". If an ANN to distinguish between these 3 animals is required, an output neuron for each class is needed, including the "other" class (4 output neurons in total).

In the multiclass classification scheme, as has been done in previous assignments, an encoding called one-hot-encoding is generally used, which is based on obtaining a boolean value for each pattern and each class, in such a way that each boolean value will be equal to 1 if that pattern belongs to that class, and 0 otherwise. When training an ANN with this scheme, each output neuron can be understood as a model specialised in classifying in a given class. In this type of networks, a linear transfer function is usually used in the output layer, whereby negative outputs indicate that a neuron does not classify the pattern into that class (i.e. from the point of view of that class it classifies it as "negative"), and positive outputs indicate that a neuron classifies the pattern as that class (i.e. from the point of view of that class it classifies it as "positive"). The absolute value of a neuron's output indicates that neuron's confidence in the classification. Finally, the softmax function receives these classification values and transforms them in such a way that they are between 0 and 1, and add up to 1, interpreted as the probability of belonging to each class. The pattern will be classified into the class whose output value is the highest. The softmax function is defined as follows: 

$$
softmax(y^i) = \frac{e^{y^i}}{\sum_j{e^{y^j}}}
$$

where $y^i$ is the output of the $i$-th neuron. For example, in a 3-class classification problem, if the outputs from the 3 neurons are `[2, 1, 0.2]`, they would classify the inputs as belonging to their respective classes, although the first one with much greater certainty. After applying the softmax function, the respective probabilities will be `[0.65, 0.24, 0.11]`, so the pattern will be classified as the first class.
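
The numbers of this example can be checked with a few lines of Julia. The function below is a hand-written stand-in that follows the formula above for a single vector (Flux exports its own `softmax`; the name `softmaxVector` is an assumption used here only to avoid that dependency).

```Julia
# Hand-written softmax for one vector of neuron outputs
function softmaxVector(y::AbstractVector{<:Real})
    # Subtracting the maximum avoids overflow and does not change the result
    expValues = exp.(y .- maximum(y))
    return expValues ./ sum(expValues)
end

y = [2, 1, 0.2]
probabilities = softmaxVector(y)          # ≈ [0.65, 0.24, 0.11]
total = sum(probabilities)                # ≈ 1
predictedClass = argmax(probabilities)    # 1, i.e. the first class
```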

In this way, the softmax function converts the real values produced by the output neurons into probability values, so that the more negative a value is (the more certainty of not belonging to that class), the closer it is to 0, and the more positive a value is (the more certainty of belonging to that class), the closer it is to 1. As indicated above, the sum of output probabilities will be equal to 1. Because of this fact, a fourth special class "other" is needed in the example above and in any other example where a pattern may not belong to any of the predefined classes.

### Question

Why is this extra class necessary when using the softmax function?

`The "other" class is necessary when using the softmax function because the softmax function transforms the output values of the neural network into probabilities that always sum up to 1. In a multiclass classification problem, the output values of the neural network represent the confidence of the network in classifying the input pattern into each of the predefined classes. If the input pattern does not belong to any of the predefined classes, all those probabilities should be close to 0, but the softmax function still distributes a total probability of 1 among the predefined classes, so at least one of them necessarily receives a considerable probability.` 

`In addition, the softmax function may assign a high probability to one of the negative output values, leading to a false classification, when the output values are negative and not close to each other. Therefore, an extra "other" class is needed to ensure that the model can detect when no classification is returned by the model. When the other neurons return negative values for their classes, the neuron assigned to the "other" class will be the one with the highest value, and after applying the softmax, it will be the one with the highest probability, indicating that the input pattern does not belong to any of the predefined classes. `

`In summary, since the softmax outputs always sum up to 1, the "other" class is necessary so that this probability can be assigned somewhere when the pattern belongs to none of the predefined classes, which prevents false classifications when all the output values are negative.` 

Tip: write in Julia `softmax([-1, -1, -0.2])`, and interpret the inputs (what does the vector `[-1, -1, -0.2]` represent and how is it interpreted?) and outputs of the function (how much do the values add up to? what does each say?). To use this function, import it from the Flux library.

```Julia
using Flux: softmax;
x = softmax([-1, -1, -0.2])
s = sum(x)
x,s
```

([0.23665609135556676, 0.23665609135556676, 0.5266878172888664], 0.9999999999999999)

### Question

Might it not be necessary to create the additional class? What modification would have to be made to the ANN? How would the output be interpreted? How would the output class be generated based on the outputs of the output neurons?

`To modify the ANN for multiclass classification without creating an additional "other" class, there are two main approaches.` 

`The first approach is to keep the softmax function, which is only valid if there is an output neuron for every possible class in the domain, so that no pattern falls outside the predefined classes. Each neuron uses a linear transfer function, and the output values are passed through the softmax function to obtain probabilities that sum up to 1. The output class is generated by selecting the class with the highest probability.`

`The second approach is to generate independent outputs with one neuron per category considered, using a sigmoid function on every neuron of the last layer instead of a softmax. The outputs are interpreted as the confidence of each neuron that the pattern belongs to the class it models. The final category is the class that corresponds to the highest confidence among all the independent predictions. Alternatively, one-vs-all classification can be used, where each model uses just one neuron with the sigmoid activation. The final output can be the class that corresponds to the classifier with the positive result among all the models. With this approach it is not necessary to cover all the classes of the domain.`

`In the second approach, if all the confidences are very low and do not reach the minimum threshold, all the results will be negative, indicating that no classification was performed or that the "other" class is detected.`

### Question

In general, how does the output of a model have to be in order not to need this fourth class?

`To avoid the need for the "other" class in a multiclass classification problem, the output of the model needs to satisfy certain conditions.` 

`First, the output values of the model for each input pattern must sum up to 1, and each output value of the model must represent the probability of the input pattern belonging to the corresponding class. If these conditions are satisfied, then the output of the model can be interpreted as the probability of the input pattern belonging to each of the predefined classes, and the class with the highest probability can be selected as the output class.`

`If using a softmax function in the last layer, then there must be a neuron for every possible class in the domain. If that is not possible, independent confidences for each class must be generated using a sigmoid function on every neuron. This gives independent confidences for every class among which the greatest one is considered as the final result. `

`In summary, to avoid the need for the "other" class in a multiclass classification problem, the output of the model needs to satisfy the conditions that the output values sum up to 1 and represent the probability of the input pattern belonging to the corresponding class. If using a softmax function, there must be a neuron for every possible class in the domain. If that is not possible, independent confidences for each class must be generated using a sigmoid function on every neuron.`

### Question

Does a kNN model need this fourth class?

`A kNN model does not necessarily need the "other" class in a multiclass classification problem. The kNN algorithm assigns a class to a new input pattern based on the class of the k nearest neighbors in the training set. If the input pattern is not similar to any of the training patterns, then the kNN algorithm may not be able to assign a class to the input pattern.`

### Question

How many classes would be necessary if an ANN wanted to recognise those 3 types of animals, and, if it is not one of them, to say whether it is an animal or not? What if the model is a kNN?

`If an ANN wanted to recognize those 3 types of animals and, if it is not one of them, to say whether it is an animal or not, then 5 classes would be necessary when the softmax function is used: the three types of animals, a fourth class for "another animal", and a fifth class for "not an animal".`

`If the model is a kNN, then the number of classes would depend on the number of classes in the training set. If the training set contains only the three types of animals, then the kNN algorithm would only be able to classify input patterns into one of these three classes. If the input pattern does not belong to any of the three classes, then the kNN algorithm may not be able to assign a class to the input pattern. In this case, the kNN algorithm can simply return a "not classified" or "unknown" label for the input pattern, without the need for an additional "not an animal" class.`

Therefore, the "positive"/"negative" scheme no longer applies if more than two classes are required. The problem in these cases is that many machine learning models are only capable of separating two classes, so in theory they could not be used. An example of such a model is the Support Vector Machine (SVM), which is discussed in more detail in the theory class. Modifications have been made to the formulation of this model to allow multi-class classification; however, in practice they are not commonly used, and instead a strategy that allows binary SVMs to be used to classify into multiple classes is often employed.

There are two main strategies for converting multi-class problems into binary classification problems, called "one-against-one" and "one-against-all". Both are explained in theory class, but since "one-against-all" is much more widely used, it is the one used in the following.

The "one-against-all" strategy is based on generating L binary classifiers for a classification problem of L classes, one per class. In the l-th problem, class l must be separated from the rest, i.e., the patterns belonging to that class will be considered "positive", and those not belonging to it will be considered "negative". Continuing with the previous example of animals, 3 different classification problems would have to be solved: one to classify "dog"/"not dog", one to classify "cat"/"not cat", and one to classify "mouse"/"not mouse". Three classifiers would therefore be trained with the same inputs but with different desired outputs for each problem.

### Question

In the previously described problem, 4 classes were used for these 3 animals, including the class "other". Why not train a classifier for this class in a "one-against-all" scheme?

`In the previously described problem, the "one-against-all" strategy is used to convert the multiclass classification problem into three binary classification problems. In each binary classification problem, one class is considered as "positive" and the rest are considered as "negative". Therefore, in the case of the animals example, three binary classifiers would be trained to classify "dog"/"not dog", "cat"/"not cat", and "mouse"/"not mouse".`

`The "other" class is not included in the "one-against-all" scheme because it is not a predefined class in the problem. The purpose of the "one-against-all" scheme is to classify input patterns into one of the predefined classes, and the "other" class is used to handle input patterns that do not belong to any of the predefined classes. Therefore, the "other" class is not included in the binary classifiers because it is not a predefined class that needs to be classified. Instead, the "other" class is used to handle input patterns that are not classified by any of the binary classifiers.`

`Furthermore, the classifiers in the "one-against-all" scheme are independent, and we know that no classification is made (or "other" is detected) when all the classifiers do not return a positive classification. In addition, the fact that a softmax can be used and that the outputs are dependent on it will facilitate the training and creation of the most appropriate weights. `

`In summary, the "one-against-all" strategy is used to convert the multiclass classification problem into binary classification problems, and the "other" class is not included in the binary classifiers because it is not a predefined class that needs to be classified. The classifiers in the "one-against-all" scheme are independent, and we know that no classification is made (or "other" is detected) when all the classifiers do not return a positive classification. The use of a softmax can facilitate the training and creation of the most appropriate weights.`

Once the binary classifiers are trained, any given pattern is fed into all the classifiers and, depending on the output, a decision is made. If only one of the systems has positive output, or none of the three classifies it as positive, the decision is clear. However, sometimes more than one classifier will give a positive output for the same pattern. Fortunately, many classifiers give information about the level of certainty or confidence they have that the pattern is classified as "positive". If more than one binary model classifies the pattern as positive, it will be assigned to the class corresponding to the classifier that has a higher certainty in its classification.

### Question

Would it be possible to use the outputs of those 3 classifiers as the input of the softmax function? What would be the consequences?

`While it is possible to use the outputs of the three binary classifiers as the input of the softmax function, doing so may not be desirable. The classifiers are independent, so their weights and biases represent different knowledge. In addition, the softmax function forces a classification into one class, even if all the outputs are negative: if all three classifiers return negative values, the softmax function will give the highest probability to the least negative one, and the positive class of that classifier will be returned as the detected class. Doing this will not solve the problem of the fourth class. `

`A more appropriate approach would be to use the output values of the binary classifiers to determine the class of the input pattern based on the classifier with the highest confidence in its classification. If more than one binary classifier classifies the pattern as positive, the class with the highest confidence can be selected as the output class. This approach takes into account the independent knowledge of each classifier and can provide a more accurate classification result.`

### Question

In general, when there are L classes and a pattern may not belong to any of them, what is the impact of using the softmax function on the outputs? In which cases could it be used? Why?

`In a multiclass classification problem where there are L classes and a pattern may not belong to any of them, using the softmax function on the outputs may not be appropriate. The softmax function is typically used to transform the output values of a neural network into probabilities that sum up to 1, which can be interpreted as the probability of the input pattern belonging to each of the predefined classes. However, the softmax function will always select one class, giving the highest probability to the highest output, even if all the outputs are negative. Therefore, the detected class will be wrong if the pattern does not belong to any of the classes considered by the model.`

`If the input pattern may not belong to any of the predefined classes, then an additional "other" class can be added to the classification problem. The output values of the neural network can then be transformed into probabilities that sum up to 1, including the probability of the input pattern belonging to the "other" class. This approach can be used when it is important to distinguish between input patterns that belong to one of the predefined classes and input patterns that do not belong to any of the predefined classes.`

`However, if we are sure that all the patterns passed to the model are represented by some of the neurons, then the softmax function could be used. In this case, there will be a real positive prediction among all of them, and the softmax will assign to it the higher probability correctly. If the domain is not restricted, then independent classifications could be done or an "other" class could be added, instead of the softmax.`

### Question

The softmax function is useful to get a loss value to train the ANN. However, if it were not used in the animal example above, would the fourth class "other" be necessary?

`If the softmax function were not used in the animal example above, the fourth class "other" would not be necessary. Without softmax, the outputs are no longer forced to sum up to 1, so each output neuron can be thresholded independently, and a pattern for which no neuron gives a positive output is simply taken as not belonging to any of the predefined classes. The "other" class is only required when the softmax function is applied, because in that case the total probability of 1 must be assigned to some class even when the pattern belongs to none of the three animals.`

`It is also important to consider a different scenario when assigning patterns to classes, where the classes are not mutually exclusive. In this case, the use of a linear transfer function in the last layer together with the softmax function would not work, since the sum of the probabilities of belonging to the classes may be greater than 1. For these cases, logarithmic sigmoidal transfer functions can be used in the last layer instead of linear, which give an output between 0 and 1, and not to perform transformation using the softmax function. In this way, the final output of each output neuron is independent of the rest of the output neurons, and more than one can take values close to 1. The output of each neuron would again be interpreted as the probability of belonging to that class, but in this case, the sum of the probabilities does not have to be 1 (they are independent). Not applying the softmax function has two advantages: the first is that it allows classification into non-mutually exclusive classes, and the second is that an additional class ("other" in the example above) is no longer needed for cases where a set of inputs may not belong to any of the given classes.`

Finally, it is necessary to consider a different scenario when assigning patterns to classes. So far, and in most situations, the classes considered are mutually exclusive, i.e. in the example above, an animal is either a dog, a cat, a mouse, or none of the 3, but it cannot be of several classes at the same time. This is the most common case, but occasionally a problem will have classes that are not mutually exclusive. For example, when classifying animal sounds according to the animal that makes them, it may happen that several animals are mixed in one sound. In these cases, the use of a linear transfer function in the last layer together with the softmax function would not work, since, naturally, the sum of the probabilities of belonging to the classes may be greater than 1 (it may belong to several classes at the same time). For these cases, the scheme that can be used to train ANNs is to use logarithmic sigmoidal transfer functions in the last layer (instead of linear), which give an output between 0 and 1, and not to perform transformation using the softmax function. In this way, the final output of each output neuron is independent of the rest of the output neurons, and more than one can take values close to 1. The output of each neuron would again be interpreted as the probability of belonging to that class, but in this case the sum of the probabilities does not have to be 1 (they are independent). Not applying the softmax function has two advantages: the first, already mentioned, is that it allows classification into non-mutually exclusive classes; the second is that an additional class ("other" in the example above) is no longer needed for cases where a set of inputs may not belong to any of the given classes.
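
The following lines are a small sketch of this scheme for the animal-sound example, under the assumption that the last layer produces the made-up raw values shown. The names `sigmoid`, `rawOutputs`, `detected` and `noneDetected` are illustrative only.

```Julia
# Logistic sigmoid applied independently to each output neuron
sigmoid(x) = 1 / (1 + exp(-x))

classes = ["dog", "cat", "mouse"]

# Hypothetical raw outputs of the last layer for a sound in which a dog and a cat are mixed
rawOutputs = [3.0, 1.5, -4.0]

# Independent probabilities: they are not forced to add up to 1
probabilities = sigmoid.(rawOutputs)

# Every class whose probability reaches the threshold is assigned to the pattern
threshold = 0.5
detected = classes[probabilities .>= threshold]   # ["dog", "cat"]

# If no output reaches the threshold, the pattern belongs to none of the classes,
# so no explicit "other" output neuron is required
noneDetected = !any(sigmoid.([-3.0, -2.0, -5.0]) .>= threshold)   # true
```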

### Question

Why is this extra class no longer needed?

`The extra class is no longer needed when using logarithmic sigmoidal transfer functions in the last layer instead of linear transfer functions and not performing transformation using the softmax function because the output of each output neuron is independent of the rest of the output neurons, and more than one can take values close to 1. The output of each neuron would again be interpreted as the probability of belonging to that class, but in this case, the sum of the probabilities does not have to be 1 (they are independent). Therefore, an additional class ("other") is no longer needed for cases where a set of inputs may not belong to any of the given classes.`

`Also, we can detect that it does not belong to any class by thresholding the outputs of every neuron and checking that all of them return a negative result.`

Given a set of inputs, as always, it is classified into the class whose output neuron has shown the highest confidence. This scheme of non-mutually exclusive outputs is similar to the "one-against-all" scheme, in which one classifier per class is trained in parallel. The classifiers are independent and the final class is that of the classifier that has the highest certainty of belonging to that class. If all classifiers return "negative" as a classification and there is no possibility of not belonging to any class, the classifier with the lowest certainty of being negative is classified in the corresponding class. If all classifiers return "negative" as a classification and there is a possibility of non-class membership, it is simply classified as "other".

The following table shows a summary of the different scenarios when using an ANN to solve a classification problem. Note that in the case of binary classification, the possibility that a set of entries do not belong to any class is not considered, since in this case we would be in multi-class classification.

In the case of using a "one-against-all" strategy, this would be similar to the last row, except that the interval would not necessarily be `[0, 1]`, but would be conditioned by the model used, and therefore the threshold as well. For example, the outputs of a SVM range from $-\infty$ to $+\infty$, so the typical threshold is set to 0.
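
The decision rule described above for independent classifiers can be sketched as follows. This is only an illustration under stated assumptions: `decideClass`, its keyword arguments `threshold` and `allowOther`, and the example values are hypothetical names and data, not part of the assignment.

```Julia
# Decide the class of one pattern from the outputs of independent per-class classifiers.
# `outputs` holds one confidence per class; `threshold` separates "negative" from "positive"
# (0.5 for outputs in [0, 1], 0 for models with unbounded outputs such as an SVM).
function decideClass(outputs::AbstractVector{<:Real}, classes::AbstractVector;
                     threshold::Real=0.5, allowOther::Bool=true)
    # All classifiers answer "negative" and the pattern may belong to no class
    if allowOther && all(outputs .< threshold)
        return "other"
    end
    # Otherwise take the classifier with the highest certainty; when every output is
    # negative, this is the classifier that is least certain of being negative
    return classes[argmax(outputs)]
end

classes = ["dog", "cat", "mouse"]
a = decideClass([0.9, 0.7, 0.1], classes)                     # "dog": two positives, highest wins
b = decideClass([0.2, 0.4, 0.1], classes)                     # "other": all negative
c = decideClass([0.2, 0.4, 0.1], classes; allowOther=false)   # "cat": least certain of being negative
d = decideClass([-1.3, 0.8, -0.2], classes; threshold=0)      # "cat": SVM-like outputs
```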

Another factor to consider when dealing with multiclass problems is the performance metric. Most of the metrics studied (PPV, sensitivity, etc.) correspond to binary classification problems. When the number of classes is greater than 2, these metrics can still be used; however, their use is slightly different.

When the number of classes is greater than two, the PPV, NPV, sensitivity and specificity metrics can be calculated separately for each class. Thus, from the point of view of a particular class, that class will be referred to as the positive class and the rest of classes will be put together in the negative class. In this way, from the exclusive point of view of that class, TP, TN, FP and FN can be calculated, and from them the sensitivity, specificity, PPV and NPV values for that particular class, and finally the F-score value. This way of treating classes separately is similar to the development of several classifiers in the "one-against-all" strategy (in the case of training binary classifiers that do not allow multi-class classification). Once these values have been calculated, they can be combined into a single value that will be used to evaluate the performance of the classifier. In this regard, there are 3 strategies: macro, weighted, and micro. We will use only the first two:

- **Macro**. In this strategy, those metrics such as the PPV or the F-score are calculated as the arithmetic mean of the metrics of each class. As it is an arithmetic average, it does not consider the possible imbalance between classes.
- **Weighted**. In this strategy, the metrics corresponding to each class are averaged, weighting them with the number of patterns that belong (desired output) to each class. It is therefore suitable when classes are unbalanced.
- **Micro**. TP, FN, and FP are calculated globally. When the classes are mutually exclusive, the micro-PPV or micro-F-score is equal to the accuracy value. Therefore, this metric is useful when the classes are not mutually exclusive. 
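
As a hedged illustration of the macro and weighted strategies, the sketch below computes the PPV of each class from two small, made-up Boolean matrices and then combines the per-class values in both ways. The matrices and the variable names (`ppv`, `macroPPV`, `weightedPPV`, `patternsPerClass`) are assumptions chosen for the example; the other metrics would be combined in the same manner.

```Julia
# Hypothetical one-hot matrices: one row per pattern, one column per class
targets = Bool[1 0 0; 1 0 0; 1 0 0; 1 0 0; 0 1 0; 0 1 0; 0 0 1; 0 0 1]
outputs = Bool[1 0 0; 1 0 0; 1 0 0; 0 1 0; 0 1 0; 0 0 1; 0 0 1; 0 0 1]

numClasses = size(targets, 2)
ppv = zeros(numClasses)

for numClass in 1:numClasses
    # From the point of view of this class: its column is "positive", the rest is "negative"
    classOutputs = outputs[:, numClass]
    classTargets = targets[:, numClass]
    TP = sum(classOutputs .& classTargets)
    FP = sum(classOutputs .& .!classTargets)
    # PPV = TP / (TP + FP); the value stays at 0 if the class is never predicted
    if TP + FP > 0
        ppv[numClass] = TP / (TP + FP)
    end
end

# Macro: plain arithmetic mean of the per-class values
macroPPV = sum(ppv) / numClasses

# Weighted: each class is weighted by the number of patterns that belong to it
patternsPerClass = vec(sum(targets, dims=1))
weightedPPV = sum(ppv .* patternsPerClass) / sum(patternsPerClass)
```

In this example the first class has twice as many patterns as the others and the best PPV, so the weighted value is higher than the macro value.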

In this assignment, you are asked to:

1. Develop the code necessary to perform a "one-against-all" strategy. Although it is not necessary to develop it for multiclass classification with ANNs, it will be used in future assignments. A simple way of doing it is the following:

    - Calculate the number of classes and create a 2-dimensional matrix of real values, with as many rows as patterns and as many columns as classes.
    
    ```Julia
    outputs = Array{Float32,2}(undef, numInstances, numClasses);
    ```
    
    - Make a loop that iterates over each class. Inside this loop, the desired outputs corresponding to that class are created and the corresponding model is trained with those inputs and the new desired outputs corresponding to that class. In other words, a model is created for each class that indicates whether or not the pattern belongs to that class. Subsequently, this model is applied to the inputs (training and/or test) to calculate the outputs, which will be copied into the previously created matrix. The code would be similar to the following, in which a supposed fit function has been used to train a binary classification model:
    
    ```Julia
    for numClass in 1:numClasses
        model = fit(inputs, targets[:,[numClass]]);
        outputs[:,numClass] .= model(inputs);
    end;
    ```
    
    ### Question

    In this code it has also been assumed that `targets` is of type `AbstractArray{Bool,2}`. How could this be done if it were a vector with classes of any type (e.g. containing `["car", 17, "motorbike"]`), i.e. of type `Array{Any,1}`?
    
    `If "targets" were a vector with classes of any type (containing ["car", 17, "motorbike"]), of type "Array{Any,1}", we could first obtain the list of distinct classes (uniqueClasses = unique(targets)) and calculate numClasses as its length. A dictionary that maps each class to a unique integer label could also be used, although it is not required. Then, we could create a 2-dimensional matrix of real values, with as many rows as patterns and as many columns as classes, where each column corresponds to a binary classification problem (whether the pattern belongs to that class or not). Finally, inside the loop that iterates over the classes, targets .== uniqueClasses[numClass] would be used as the desired outputs of each binary problem.`
    <br/>
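
    The idea in this answer can be sketched as follows, assuming a small made-up vector of labels. The names `uniqueClasses` and `boolTargets` are illustrative; each column of the resulting matrix contains the desired outputs of one binary problem.

    ```Julia
    # Hypothetical vector of labels of any type
    targets = Any["car", 17, "motorbike", "car", 17]

    # Distinct classes, in order of first appearance
    uniqueClasses = unique(targets)
    numClasses = length(uniqueClasses)

    # Boolean matrix: one row per pattern, one column per class
    boolTargets = falses(length(targets), numClasses)
    for numClass in 1:numClasses
        # Desired outputs of the binary problem "class numClass against the rest"
        boolTargets[:, numClass] .= (targets .== uniqueClasses[numClass])
    end
    ```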
    
     - Once the outputs are in the `outputs` matrix, the highest value is taken for each row (each pattern), i.e. the class of the model that has the highest certainty that it belongs to "its" class is taken.
    
        - Optionally, the softmax function can be applied. The end result is the same: the class with the highest value will be taken. However, the softmax function allows you to interpret the outputs as the probability of belonging to each class. One problem in using softmax is that it is prepared for use with ANNs, so it expects each pattern to be in a column. Since the `outputs` matrix has each pattern in a row, you would have to transpose it and transpose the result back, as follows:
        
        ```Julia
        outputs = softmax(outputs')';
        ```
        
     - To take the highest output for each pattern, one can proceed as in practice 2, where the accuracy was calculated for more than 2 classes with each pattern arranged in a row. The code could be similar to the following:
     
     ```Julia
     vmax = maximum(outputs, dims=2);
     outputs = (outputs .== vmax);
     ```
     In this way, a matrix of Boolean outputs is generated with the class to which each pattern belongs, which can be used to compare with the target matrix to calculate the different performance metrics.
     
     ### Question
     
     The last piece of code may present problems in case several models generate the same output. Where would the problem be, and how could it be solved?

`The last piece of code may present problems in case several models generate the same output for a pattern and that output is the maximum of the row. The comparison outputs .== vmax is then true in more than one column of that row, so the pattern is assigned to several classes at the same time. The resulting Boolean matrix no longer has exactly one class per pattern, which distorts the performance metrics. `

`To solve this problem, we could modify the code to take into account ties in the maximum output values. One way to do this is to use the "argmax" function instead of the "maximum" function to obtain the index of the maximum value for each row, and then use this index to assign the corresponding class to each pattern.`
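
The tie problem and the fix based on `argmax` can be reproduced with a tiny, made-up matrix. This is a sketch under the assumption that ties should go to the first tied class; the names `tied`, `indices` and `classOutputs` are illustrative.

```Julia
# Hypothetical outputs of 3 models for 2 patterns; in the first row two models tie
outputs = Float32[0.8 0.8 0.1; 0.2 0.3 0.9]

# Comparison against the row maximum: the tied row gets two classes
vmax = maximum(outputs, dims=2)
tied = (outputs .== vmax)            # first row: true true false

# Taking the index of the maximum instead keeps exactly one class per row
# (`argmax` returns the first maximum, so ties go to the lowest class index)
indices = argmax(outputs, dims=2)
classOutputs = falses(size(outputs))
classOutputs[indices] .= true        # first row: true false false
```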

```Julia
function oneVSall(inputs::AbstractArray{<:Real,2}, targets::AbstractArray{Bool,2})
    numInstances, numClasses = size(targets)

    # Matrix of real-valued outputs: one row per pattern, one column per class
    outputs = Array{Float32,2}(undef, numInstances, numClasses)

    # Train one binary model per class ("one-against-all")
    for numClass in 1:numClasses
        # Desired outputs for this class: the corresponding one-hot column
        desiredOutputs = targets[:, [numClass]]

        # `fit` is the supposed binary training function used in the text above
        model = fit(inputs, desiredOutputs)

        # Apply the model to the inputs and copy its outputs into the column of this class
        outputs[:, numClass] .= model(inputs)
    end

    # Index of the first maximum of each row; ties go to the first tied class
    maxIndex = argmax(outputs, dims=2)

    # Boolean matrix with exactly one `true` per row
    classOutputs = falses(numInstances, numClasses)
    classOutputs[maxIndex] .= true

    return classOutputs
end
```

oneVSall (generic function with 1 method)

2. Develop a function called `confusionMatrix` (same name as in the previous assignment) that returns the values of the metrics adapted to the condition of having more than two classes. To do so, include an additional parameter that allows them to be calculated in the *macro* and *weighted* forms.

    This function should receive two matrices: model outputs (`outputs`) and desired outputs (`targets`), both of Boolean elements and dimension 2, with each pattern in a row and each class in a column. The first thing this function should do is to check that the number of columns of both matrices is equal and is different from 2. In case they have only one column, these columns are taken as vectors and the confusionMatrix function developed in the previous assignment is called.
    
    ### Question
    
    Why are two-column matrices invalid?
    
    `Two-column matrices are invalid because a problem with two classes is a binary classification problem, which is encoded with a single column: a second column would simply be the negation of the first and would carry no additional information. Valid matrices therefore have either one column (binary case) or more than two columns, and only in the latter case do the metrics need to be adapted to take into account the multiple classes.`
    
    If both matrices have more than 2 columns, the following steps can be followed:
    
    - Reserve memory for the sensitivity, specificity, PPV, NPV and F-score vectors, with one value per class, initially equal to 0. To do this, the `zeros` function can be used.
    
    - Iterate for each class, and, if there are patterns in that class, make a call to the `confusionMatrix` function of the previous assignment passing as vectors the columns corresponding to the class of that iteration of the outputs and targets matrices. Assign the result to the corresponding element of the sensitivity, specificity, PPV, NPV and F1 vectors.
    - Reserve memory for the confusion matrix.
    - Perform a double loop in which both loops iterate over the classes, to fill all the confusion matrix elements.
    - Aggregate the values of sensitivity, specificity, PPV, NPV, and F-score for each class into a single value according to the *macro* or *weighted* strategy, as specified in the input argument.
    - Finally, calculate the accuracy value with the `accuracy` function developed in a previous assignment, and calculate the error rate from this value.
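
As a rough illustration of the double loop and of the difference between the two aggregation strategies, the following sketch uses made-up `targets` and `outputs` matrices and computes only the sensitivity. It assumes that rows of the confusion matrix are the real classes and columns are the predicted ones:

```julia
using Statistics: mean

# Hypothetical one-hot matrices for 6 patterns and 3 classes (rows: patterns, columns: classes)
targets = Bool[1 0 0; 1 0 0; 1 0 0; 0 1 0; 0 1 0; 0 0 1]
outputs = Bool[1 0 0; 1 0 0; 0 1 0; 0 1 0; 0 1 0; 1 0 0]

numClasses = size(targets, 2)

# Double loop over the classes: rows are real classes, columns are predicted classes
matrix = zeros(Int, numClasses, numClasses)
for actual in 1:numClasses, predicted in 1:numClasses
    matrix[actual, predicted] = sum(targets[:, actual] .& outputs[:, predicted])
end

# Per-class sensitivity: diagonal element divided by the number of patterns of each class
patternsPerClass = vec(sum(targets, dims=1))
sensitivity = [matrix[c, c] / patternsPerClass[c] for c in 1:numClasses]

# Macro: plain average over the classes
macroSensitivity = mean(sensitivity)

# Weighted: average weighted by the number of patterns of each class
weightedSensitivity = sum(sensitivity .* patternsPerClass) / sum(patternsPerClass)

println(matrix)
println((macroSensitivity, weightedSensitivity))
```

In this example the third class has a single pattern, which is misclassified. It drags the macro value down to 5/9, while the weighted value is 2/3 because that class counts for only one sixth of the patterns.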

In [2]:
# Functions from previous assignments (binary confusionMatrix, accuracy, ...)
include("functions.jl")
using Statistics: mean

"""
    confusionMatrix(outputs, targets; weighted=true)

Calculate the confusion matrix and performance metrics for a multiclass classification problem.

# Arguments
- `outputs::AbstractArray{Bool,2}`: binary matrix of predicted outputs, where each row represents a sample and each column represents a class.
- `targets::AbstractArray{Bool,2}`: binary matrix of target outputs, where each row represents a sample and each column represents a class.
- `weighted::Bool=true`: whether to weight the metrics by the number of samples in each class (*weighted*) or to average them over the classes (*macro*).

# Returns
- `acc`: accuracy
- `errorRate`: error rate
- `sensitivity`: sensitivity aggregated over the classes
- `specificity`: specificity aggregated over the classes
- `ppv`: positive predictive value aggregated over the classes
- `npv`: negative predictive value aggregated over the classes
- `f1Score`: F1 score aggregated over the classes
- `matrix`: confusion matrix
"""
function confusionMatrix(outputs::AbstractArray{Bool,2}, targets::AbstractArray{Bool,2}; weighted::Bool=true)
    
    # Check that the inputs are of the correct type and size
    @assert (typeof(outputs) <: AbstractArray{Bool,2}) "outputs must be a binary matrix"
    @assert (typeof(targets) <: AbstractArray{Bool,2}) "targets must be a binary matrix"
    @assert (size(outputs) == size(targets)) "outputs and targets must have the same size"
    @assert (size(outputs, 2) != 2) "outputs and targets cannot have 2 columns"
    
    numClasses = size(outputs, 2)
    
    if (numClasses == 1)
        return confusionMatrix(outputs[:,1], targets[:,1])
    end
    
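    # Calculate the per-class counts (`.!` negates element-wise; `!` is not defined for arrays)
    tp = [sum(targets[:,i] .& outputs[:,i]) for i in 1:numClasses]
    fp = [sum(.!targets[:,i] .& outputs[:,i]) for i in 1:numClasses]
    tn = [sum(.!targets[:,i] .& .!outputs[:,i]) for i in 1:numClasses]
    fn = [sum(targets[:,i] .& .!outputs[:,i]) for i in 1:numClasses]
    
    # Element-wise division that leaves a metric at 0 when its denominator is 0 (instead of NaN)
    ratio(num, den) = [d == 0 ? 0.0 : n / d for (n, d) in zip(num, den)]
    
    sensitivity = ratio(tp, tp .+ fn)
    specificity = ratio(tn, tn .+ fp)
    ppv = ratio(tp, tp .+ fp)
    npv = ratio(tn, tn .+ fn)
    f1Score = ratio(2 .* tp, 2 .* tp .+ fp .+ fn)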
    
    # Fill confusion matrix
    matrix = [sum(targets[:, i] .& outputs[:, j]) for i in 1:numClasses, j in 1:numClasses]
    
    # Aggregate metrics according to the strategy specified
    if (weighted)
        weightClasses = vec(sum(targets, dims=1) ./ sum(targets))
        sensitivity = sum(sensitivity .* weightClasses)
        specificity = sum(specificity .* weightClasses)
        ppv = sum(ppv .* weightClasses)
        npv = sum(npv .* weightClasses)
        f1Score = sum(f1Score .* weightClasses)
    else
        sensitivity = mean(sensitivity)
        specificity = mean(specificity)
        ppv = mean(ppv)
        npv = mean(npv)
        f1Score = mean(f1Score)
    end
    
    acc = accuracy(outputs, targets)
    errorRate = 1 - acc
    
    return acc, errorRate, sensitivity, specificity, ppv, npv, f1Score, matrix
end

confusionMatrix (generic function with 1 method)

3. Develop another function called `confusionMatrix` in which the first parameter `outputs` is of type `AbstractArray{<:Real,2}`, and `targets` is of type `AbstractArray{Bool,2}` (the same as before). What this function should do is to convert the first parameter to an array of boolean values (using the function `classifyOutputs`) and call the previous function.

In [15]:
function confusionMatrix(outputs::AbstractArray{<:Real,2}, targets::AbstractArray{Bool,2}; weighted::Bool=true)
    boolOutputs = classifyOutputs(outputs)
    return confusionMatrix(boolOutputs, targets, weighted=weighted)
end

confusionMatrix (generic function with 3 methods)

4. Overload this function once again by developing another function of the same name that performs the same task, but this time taking as inputs two vectors (`targets` and `outputs`) of the same length, whose elements are of any type (i.e., they are of type `AbstractArray{<:Any,1}`), plus the additional parameter that allows the metrics to be aggregated through the *macro* and *weighted* strategies. The elements of these vectors represent the classes, which can be represented in different ways. For example, the classes can be `["dog", "cat", 3]`.

    Obviously, it is necessary that all the output classes (vector `outputs`) are included in the desired output classes (vector `targets`). Include, therefore, a defensive programming line to ensure this.
    
      - Write this line without any loop. To do this, it may be useful to refer to the functions `all`, `in` and `unique`. At the end of this assignment, the solution of how to do this line is given.
        
      - As you will see in the following assignment, this line should not really be there, since it is possible that some produced output is not among the desired outputs. This line is another example of a small exercise to practice vector programming, but once the practice is done it should be temporarily removed. The following assignment excludes the possibility of this happening by splitting the patterns in a stratified way, so this line can be added again.
        
        ### Question
        
        How is it possible that an output is not among the desired outputs? In which cases can this occur?
        
        `An output may not be among the desired outputs whenever the set of patterns being evaluated does not contain every class that the model has learned to produce. The model is trained with patterns of all the classes, so it can emit any of them, but a small or non-representative validation or test subset may lack patterns of some class.`
        `This can occur mainly during the validation or test of the network when the patterns are split without stratification, for example in cross-validation with few patterns in some class. If the model then predicts the missing class for some pattern, that output is not among the desired outputs of the subset.`
        
    To develop this function, it is necessary to first take the possible classes of both `outputs` and `targets` by means of the `unique` function. Once this is done, both vectors, `outputs` and `targets`, will be encoded through the function `oneHotEncoding`, passing as argument the vector of classes just calculated. With the result of these two encodings, the `confusionMatrix` function can be called.
    
    ### Question
    
    It is important that the class vector is calculated first and passed in both calls. What could happen if this is not done in this way?
    
    `It is important to calculate the class vector first and pass it to both calls to the "oneHotEncoding" function because this ensures that the same encoding is used for both matrices, "outputs" and "targets". If different encodings are used, the resulting one-hot encoded matrices may have different dimensions or different values, which would lead to incorrect results when calculating the confusion matrix. Also, the order of the classes in the vectors could change, so the columns after the one-hot-encoding would belong to different classes in the outputs and in the targets. This would lead to incorrect results when calculating the confusion matrix, as the columns would not correspond to the same classes in both matrices.`
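
The column mismatch described in this answer can be reproduced with a small sketch. The `encode` helper below is a hypothetical stand-in for the course's `oneHotEncoding` (assumed to create one column per class, in the order of the class vector), and the data is made up:

```julia
# Hypothetical stand-in for `oneHotEncoding`: one column per class, in the order given
encode(feature::AbstractVector, classes::AbstractVector) = permutedims(classes) .== feature

targets = ["dog", "cat", "dog", "mouse"]
outputs = ["cat", "cat", "dog", "mouse"]

# Shared class vector: column k means the same class in both matrices
classes = unique(targets)                  # dog, cat, mouse
sharedTargets = encode(targets, classes)
sharedOutputs = encode(outputs, classes)

# Separate class vector for the outputs: "cat" now occupies the first column
separateOutputs = encode(outputs, unique(outputs))   # cat, dog, mouse

# Fraction of patterns whose encoded rows coincide
accShared   = count(all(sharedTargets .== sharedOutputs, dims=2)) / length(targets)
accSeparate = count(all(sharedTargets .== separateOutputs, dims=2)) / length(targets)

println((accShared, accSeparate))
```

Three of the four patterns are classified correctly, which is what the shared encoding reports. With separate class vectors the first pattern (a "dog" predicted as "cat") appears as a hit and two real hits appear as misses, so the computed value no longer reflects the real predictions.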

In [3]:
"""
    confusionMatrix(outputs, targets; weighted=true)

Calculate the confusion matrix and performance metrics for a multiclass classification problem.

# Arguments
- `outputs::AbstractArray{<:Any,1}`: vector of predicted outputs.
- `targets::AbstractArray{<:Any,1}`: vector of target outputs.
- `weighted::Bool=true`: whether to weight the metrics by the number of samples in each class.

# Returns
- `acc`: accuracy
- `errorRate`: error rate
- `sensitivity`: sensitivity aggregated over the classes
- `specificity`: specificity aggregated over the classes
- `ppv`: positive predictive value aggregated over the classes
- `npv`: negative predictive value aggregated over the classes
- `f1Score`: F1 score aggregated over the classes
- `matrix`: confusion matrix
"""
function confusionMatrix(outputs::AbstractArray{<:Any,1}, targets::AbstractArray{<:Any,1}; weighted::Bool=true)
    
    # Check that the inputs are of the correct type and size
    @assert (typeof(outputs) <: AbstractArray{<:Any,1}) "outputs must be a vector"
    @assert (typeof(targets) <: AbstractArray{<:Any,1}) "targets must be a vector"
    @assert (length(outputs) == length(targets)) "outputs and targets must have the same length"
    
    # Check that targets contains all the classes in outputs
    @assert all([in(output, unique(targets)) for output in outputs]) "targets does not contain all the classes in outputs"
    
    classes = unique(targets)
    return confusionMatrix(oneHotEncoding(outputs, classes), oneHotEncoding(targets, classes); weighted=weighted)
end

confusionMatrix

### Learn Julia

The defensive programming line to ensure that all classes of the `outputs` vector are included in the desired output vector is as follows:

```Julia
@assert(all([in(output, unique(targets)) for output in outputs]))
```
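
As a side note, the same check can probably be written in other loop-free ways. The following sketch compares the comprehension above with a broadcast version and with `issubset`, using made-up vectors:

```julia
# Hypothetical class vectors with mixed element types
targets = ["dog", "cat", 3, "dog"]
outputs = ["cat", "cat", 3, "dog"]

# Comprehension-based check, as in the line above
checkComprehension = all([in(output, unique(targets)) for output in outputs])

# Broadcast version: `Ref` makes the vector of classes behave as a single value
checkBroadcast = all(in.(unique(outputs), Ref(unique(targets))))

# Set-based version
checkSubset = issubset(outputs, targets)

# An output class that never appears in targets makes the check fail
badOutputs = ["cat", "horse", 3, "dog"]
badCheck = issubset(badOutputs, targets)

println((checkComprehension, checkBroadcast, checkSubset, badCheck))
```

The three variants express the same condition, so the choice between them is mostly a matter of readability.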