# ROC and AUC in Multiclass-Classifiers
When there is more than one outcome that we would like to predict, we require more than one output neuron, with each neuron quantifying the confidence in that outcome. Let us build a model on a sample data set using a multiclass neural network (NN).

In [1]:
import pandas as pd
from scipy.stats import zscore

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?']
)

df

| | id | job | area | income | aspect | subscriptions | dist_healthy | save_rate | dist_unhealthy | age | pop_dense | retail_dense | crime | product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | vv | c | 50876.0 | 13.100000 | 1 | 9.017895 | 35 | 11.738935 | 49 | 0.885827 | 0.492126 | 0.071100 | b |
| 1 | 2 | kd | c | 60369.0 | 18.625000 | 2 | 7.766643 | 59 | 6.805396 | 51 | 0.874016 | 0.342520 | 0.400809 | c |
| 2 | 3 | pe | c | 55126.0 | 34.766667 | 1 | 3.632069 | 6 | 13.671772 | 44 | 0.944882 | 0.724409 | 0.207723 | b |
| 3 | 4 | 11 | c | 51690.0 | 15.808333 | 1 | 5.372942 | 16 | 4.333286 | 50 | 0.889764 | 0.444882 | 0.361216 | b |
| 4 | 5 | kl | d | 28347.0 | 40.941667 | 3 | 3.822477 | 20 | 5.967121 | 38 | 0.744094 | 0.661417 | 0.068033 | a |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1996 | vv | c | 51017.0 | 38.233333 | 1 | 5.454545 | 34 | 14.013489 | 41 | 0.881890 | 0.744094 | 0.104838 | b |
| 1996 | 1997 | kl | d | 26576.0 | 33.358333 | 2 | 3.632069 | 20 | 8.380497 | 38 | 0.944882 | 0.877953 | 0.063851 | a |
| 1997 | 1998 | kl | d | 28595.0 | 39.425000 | 3 | 7.168218 | 99 | 4.626950 | 36 | 0.759843 | 0.744094 | 0.098703 | f |
| 1998 | 1999 | qp | c | 67949.0 | 5.733333 | 0 | 8.936292 | 26 | 3.281439 | 46 | 0.909449 | 0.598425 | 0.117803 | c |


## Preparing the data

We generate dummies, fill in the missing values, and standardize our data ranges with a Z-score.

This last point is non-trivial and deserves elaboration: if one input ranges from 0 to 1 while another ranges from -1 million to 1 million, the NN may take longer to balance the weights, and so require more training and be more error-prone than a network whose inputs all share a standardized range.

The consequence of using something such as a Z-score is that the extensibility of the NN is limited. Consider a test point that you wish to predict with the NN without knowing the mean or standard deviation of the training set; your prediction will be skewed, since the value cannot be properly normalized. The training statistics therefore have to be kept alongside the model.

There is much to consider when preparing the data for the NN, and a choice of how to normalize data should be made that is sensitive to how the model will be used.
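To make the point about normalizing new data concrete, here is a minimal sketch using only the standard library. The feature values, the new point `45000.0`, and the helper `standardize` are illustrative assumptions, not part of the notebook's pipeline. The idea is that the mean and standard deviation are computed once from the training data and then reused for any later point.

```python
import statistics

# Hypothetical training values for a single feature (e.g. income).
train_income = [50876.0, 60369.0, 55126.0, 51690.0, 28347.0]

# Fit the scaling parameters on the training data only.
# pstdev is the population standard deviation (ddof=0), which is
# also the default used by scipy.stats.zscore.
mu = statistics.mean(train_income)
sigma = statistics.pstdev(train_income)

def standardize(value, mean=mu, std=sigma):
    """Apply the training-set Z-score transform to a single value."""
    return (value - mean) / std

train_scaled = [standardize(v) for v in train_income]

# A new, unseen point is scaled with the *stored* training statistics,
# not with statistics recomputed from the new data.
new_point_scaled = standardize(45000.0)
print(f"mean={mu:.1f}, std={sigma:.1f}, new point z={new_point_scaled:.3f}")
```

If `mu` and `sigma` are lost, the new point cannot be placed on the same scale the network was trained on, which is the limitation described above.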

In [2]:
from sklearn.model_selection import train_test_split

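# dummies (dtype=float keeps the feature matrix numeric; newer pandas
# versions default to bool columns, which would turn `x` into an object array)
df = pd.concat(
    [
        df,
        pd.get_dummies(df['job'], prefix='job', dtype=float),
        pd.get_dummies(df['area'], prefix='area', dtype=float)
    ],
    axis=1
)
df.drop(['job', 'area'], axis=1, inplace=True)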

# fill missing values
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# standardize
for col_name in ['income', 'aspect', 'save_rate', 'age', 'subscriptions']:
    df[col_name] = zscore(df[col_name])

# assemble data vectors
x_cols = df.columns.drop('product').drop('id')
x = df[x_cols].values

dummies = pd.get_dummies(df['product'])
products = dummies.columns # index names for dummies
y = dummies.values

# split
x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.25,
    random_state=414141
)

## Building the NN model
We construct a dense NN with 3 hidden layers and an output layer whose size matches the dimension of our `y`-vector. The output layer uses softmax activation, so that its outputs are non-negative, sum to 1, and can be read as class probabilities.
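As a rough illustration of what softmax does to the raw outputs of the last layer, the following sketch computes it by hand with the standard library. The input scores are made up for the example, and the function is a plain reference version, not the implementation TensorFlow uses internally.

```python
import math

def softmax(logits):
    """Map raw scores to non-negative values that sum to 1."""
    # Subtracting the maximum improves numerical stability and does
    # not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the last layer for one row, 3 classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

The largest raw score keeps the largest probability, so taking the index of the maximum output identifies the predicted class.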

In [3]:
import numpy as np
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense( # hidden 1
        100,
        input_dim=x.shape[1],
        activation='relu',
    ),
    tf.keras.layers.Dense( # hidden 2 
        50,
        activation='relu',
    ),
    tf.keras.layers.Dense( # hidden 3
        25,
        activation='relu',
    ),
    tf.keras.layers.Dense( # output
        y.shape[1],
        activation='softmax',
    )
])

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam',
    metrics=['accuracy'] # more on this below
)

monitor = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=1e-3,
    patience=5,
    mode='min',
    restore_best_weights=True
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 100)               4800      
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_2 (Dense)              (None, 25)                1275      
_________________________________________________________________
dense_3 (Dense)              (None, 7)                 182       
Total params: 11,307
Trainable params: 11,307
Non-trainable params: 0
_________________________________________________________________


## Training the model
We use the early stopping callback we defined during the build step, and train the model over a maximum of 1000 epochs.
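The stopping rule itself can be sketched in a few lines. This is a simplified model of the patience logic under the same `patience=5` and `min_delta=1e-3` settings, not Keras's actual implementation. The function name `early_stopping_epoch` and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Simplified sketch: an epoch counts as an improvement only if the loss
    drops by more than min_delta below the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training would stop here
    return len(val_losses) - 1, best_epoch  # ran out of epochs

# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.80, 0.75, 0.70, 0.7005, 0.71, 0.70, 0.72, 0.705, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With `restore_best_weights=True`, the weights from the best epoch are the ones kept after training stops, so 1000 acts only as an upper bound on the number of epochs.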

In [4]:
model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    callbacks=[monitor],
    verbose=0,
    epochs=1000
)

<tensorflow.python.keras.callbacks.History at 0x14a557e50>

## Evaluating the model
We will examine the *accuracy* of the NN, a standard metric for classification problems, which we define as the fraction of rows where the NN correctly predicted the target class:
$$
a = \frac{c}{N}
$$
with $c$ denoting the number of correct predictions and $N$ the total number of predictions. Our model already tracks accuracy as its evaluation metric, since we passed `metrics=['accuracy']` to `compile()`.

We can extract the model's predicted class with `argmax()`: the softmax output layer yields one probability per class, and the class with the highest probability is the prediction:

In [5]:
pred = model.predict(x_test)
pred_ind = np.argmax(pred, axis=1)

We can use a scikit-learn function to calculate the accuracy:

In [6]:
from sklearn import metrics

y_compare = np.argmax(y_test, axis=1)
score = metrics.accuracy_score(y_compare, pred_ind)

print(f"Accuracy score: {score}")

Accuracy score: 0.712
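To see how the accuracy formula $a = c/N$ relates to the `argmax` step, here is a small worked example on made-up data. The probabilities and targets below are hypothetical and unrelated to the model's real output. The helper `argmax` is a plain-Python stand-in for `np.argmax` on a single row.

```python
# Hypothetical softmax outputs for 4 rows and 3 classes.
pred = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]
# One-hot encoded targets for the same rows.
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)

pred_ind = [argmax(r) for r in pred]    # predicted class per row
true_ind = [argmax(r) for r in y_true]  # target class per row

c = sum(p == t for p, t in zip(pred_ind, true_ind))  # correct predictions
N = len(true_ind)                                     # total predictions
accuracy = c / N
print(f"c={c}, N={N}, accuracy={accuracy}")
```

Only the third row is misclassified here, so three of four predictions are correct. Note that accuracy ignores how confident each prediction was.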


### Log loss
We can further evaluate our model by examining the probability predictions of each class. [Log loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) is a metric that penalizes high probabilities assigned to wrong answers; lower log loss values are desired.

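For a binary target, log loss is calculated with
$$
L_\text{log} = \frac{-1}{N} \sum_{i=1}^N \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1-\hat{y}_i) \right)
$$
with $y_i \in \{0, 1\}$ the target outcome of row $i$ and $\hat{y}_i$ the model's predicted probability that $y_i = 1$. With $K$ classes and one-hot targets, as in our model, this generalizes to
$$
L_\text{log} = \frac{-1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
where $y_{ik}$ is 1 if row $i$ belongs to class $k$ and 0 otherwise, and $\hat{y}_{ik}$ is the predicted probability of class $k$ for row $i$. The metric therefore applies directly to classifications with more than two outcomes.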

In [7]:
score = metrics.log_loss(y_test, pred)
print(f"Log loss score: {score}")

Log loss score: 0.650408753653057
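For one-hot targets, log loss reduces to the mean negative log of the probability assigned to the true class. The sketch below computes this by hand on made-up data to show how confident wrong answers are penalized. The function `multiclass_log_loss`, the clipping constant `eps`, and the example arrays are assumptions for illustration, not scikit-learn's implementation.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for target_row, prob_row in zip(y_true, y_prob):
        for y, p in zip(target_row, prob_row):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total += y * math.log(p)
    return -total / len(y_true)

# Hypothetical one-hot targets for 3 rows and 3 classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Every row puts probability 0.9 on the correct class.
confident_right = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
# The first two rows put probability 0.9 on a wrong class.
confident_wrong = [[0.05, 0.9, 0.05], [0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]

loss_right = multiclass_log_loss(y_true, confident_right)
loss_wrong = multiclass_log_loss(y_true, confident_wrong)
print(f"confident and right: {loss_right:.4f}")
print(f"confident and wrong: {loss_wrong:.4f}")
```

The second set of predictions yields a much larger loss, because a true class that receives a probability of only 0.05 contributes far more to the sum than one that receives 0.9.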
