# CONDOR ordinal classification/regression in TensorFlow Keras

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GarrettJenkinson/condor_tensorflow/blob/main/docs/CONDOR_TensorFlow_demo.ipynb)


This notebook demonstrates ordinal classification with the condor_tensorflow package for TensorFlow Keras, using MNIST handwritten digits and Amazon reviews as examples.


**Acknowledgments**: This notebook is based in part on PyTorch source code written by Sebastian Raschka [in this notebook](https://github.com/Raschka-research-group/coral-cnn/blob/master/coral-implementation-recipe.ipynb) and the CORAL ordinal notebook written by [Chris Kennedy and Stephen Matthews](https://github.com/ck37/coral-ordinal).

## Installation for Google Colab

With pip you can install either the latest source code from GitHub or the stable version of the module from pypi.org. The cells below install from GitHub.

In [None]:
# Upgrade scikit-learn; only needed for advanced OrdinalEncoder behaviours
if 'google.colab' in str(get_ipython()):
    !pip install scikit-learn==0.24.2


In [None]:
if 'google.colab' in str(get_ipython()):
    GITHUB_AUTH = "GarrettJenkinson:<APIaccessTOKEN>"
    !git clone https://$GITHUB_AUTH@github.com/GarrettJenkinson/condor_tensorflow.git

In [None]:
# Install source package from GitHub
if 'google.colab' in str(get_ipython()):
    !pip install --force-reinstall --no-deps --use-feature=in-tree-build condor_tensorflow/

## Import statements

In [1]:
import numpy as np
import sklearn
from sklearn import model_selection
from sklearn.model_selection import train_test_split
import pandas as pd
from scipy import special
import tensorflow_hub as hub
import os
import json
import gzip
from urllib.request import urlopen

import tensorflow as tf
print("Tensorflow version", tf.__version__)

import condor_tensorflow as condor
print("CONDOR version:", condor.__version__)

Tensorflow version 2.6.0
CONDOR version: 0.1.0-dev


## MNIST toy example

MNIST is a database of handwritten digits widely utilized in image classification tasks.

The dataset was originally intended for categorical prediction (digit recognition), with no ordinal component. However, since the labels are numeric, one can imagine scenarios where a wrong prediction that lands close to the true value is preferable to one that lands far away, e.g. when reading map coordinates. We therefore treat the MNIST labels as ordinal to demonstrate CONDOR's performance on an ordinal problem, while acknowledging that MNIST is usually better suited to categorical prediction.

We begin by setting some core variables required for model building.

In [2]:
##########################
### SETTINGS
##########################

# Hyperparameters
random_seed = 1 # Not yet used
learning_rate = 0.05
batch_size = 128
num_epochs = 2

# Architecture
NUM_CLASSES = 10

Next we load the MNIST data, split off a validation set, and check the shapes of the resulting arrays. We then wrap the training, test and validation splits in tf.data datasets with pixel values rescaled to the 0-1 range.

In [3]:
# Fetch and format the mnist data
(mnist_images, mnist_labels), (mnist_images_test, mnist_labels_test) = tf.keras.datasets.mnist.load_data()

# Split off a validation dataset for early stopping
mnist_images, mnist_images_val, mnist_labels, mnist_labels_val = \
  model_selection.train_test_split(mnist_images, mnist_labels, test_size = 5000, random_state = 1)

print("Shape of training images:", mnist_images.shape)
print("Shape of training labels:", mnist_labels.shape)

print("Shape of test images:", mnist_images_test.shape)
print("Shape of test labels:", mnist_labels_test.shape)

print("Shape of validation images:", mnist_images_val.shape)
print("Shape of validation labels:", mnist_labels_val.shape)

# Build tf.data datasets, adding a channel axis and rescaling pixels to the 0-1 range.
dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images[..., tf.newaxis] / 255, tf.float32),
   tf.cast(mnist_labels, tf.int64)))
dataset = dataset.shuffle(1000).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images_test[..., tf.newaxis] / 255, tf.float32),
   tf.cast(mnist_labels_test, tf.int64)))
# The test dataset is not shuffled, so predictions stay aligned with mnist_labels_test.
test_dataset = test_dataset.batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images_val[..., tf.newaxis] / 255, tf.float32),
   tf.cast(mnist_labels_val, tf.int64)))
val_dataset = val_dataset.shuffle(1000).batch(batch_size)

Shape of training images: (55000, 28, 28)
Shape of training labels: (55000,)
Shape of test images: (10000, 28, 28)
Shape of test labels: (10000,)
Shape of validation images: (5000, 28, 28)
Shape of validation labels: (5000,)


### Simple MLP model



Now we create a simple multi-layer perceptron (MLP) and give it the ordinal output layer required by CONDOR: a dense layer with one fewer unit than the number of output classes. Although we use an MLP here, any neural network architecture suitable for categorical prediction could be used. The version below builds the model with the Sequential API.

In [4]:
def create_model(num_classes):
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Flatten(input_shape = (28, 28, )))
  model.add(tf.keras.layers.Dense(128, activation = "relu"))
  model.add(tf.keras.layers.Dropout(0.2))
  model.add(tf.keras.layers.Dense(32, activation = "relu"))
  model.add(tf.keras.layers.Dropout(0.1))
  # No activation function specified so this will output cumulative logits.
  model.add(tf.keras.layers.Dense(num_classes-1))
  return model

model = create_model(NUM_CLASSES)

# Note that the model generates 1 fewer outputs than the number of classes. 
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                4128      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 9)                 297       
Total params: 104,905
Trainable params: 104,905
Non-trainable params: 0
__________________________________________________

Alternatively we could build the model using the Functional API as demonstrated below.

In [5]:
# Or a functional API version
def create_model2(num_classes):
  inputs = tf.keras.Input(shape = (28, 28, ))

  x = tf.keras.layers.Flatten()(inputs)
  x = tf.keras.layers.Dense(128, activation = "relu")(x)
  x = tf.keras.layers.Dropout(0.2)(x)
  x = tf.keras.layers.Dense(32, activation = "relu")(x)
  x = tf.keras.layers.Dropout(0.1)(x)
  # No activation function specified so this will output cumulative logits.
  outputs = tf.keras.layers.Dense(num_classes-1)(x)

  model = tf.keras.Model(inputs = inputs, outputs = outputs)

  return model

model = create_model2(NUM_CLASSES)

# Note that the model generates 1 fewer outputs than the number of classes. 
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 28, 28)]          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               100480    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 32)                4128      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 9)                 297   

We compile the model using CONDOR's SparseCondorOrdinalCrossEntropy as the loss function.  This is the key component of the CONDOR method, which enables ordinal prediction with rank consistency.  The other metrics provided by CONDOR enable assessment of CONDOR's performance on the ordinal prediction problem.

In [6]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate),
              loss = condor.SparseCondorOrdinalCrossEntropy(),
              metrics = [condor.SparseOrdinalEarthMoversDistance(),
                         condor.SparseOrdinalMeanAbsoluteError()])

Now we train the model.

In [7]:
%%time

# Runtime depends on hardware; the run recorded below took about 21 seconds.
history = model.fit(dataset, epochs = 5, validation_data = val_dataset,
                    callbacks = [tf.keras.callbacks.EarlyStopping(patience = 3, restore_best_weights = True)])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Wall time: 21.1 s


### Test set evaluation
Now we can evaluate performance on the MNIST test dataset we created previously.

In [8]:
# Evaluate on test dataset.
model.evaluate(test_dataset)



[2.0194995403289795, 1.3870970010757446, 0.7875000238418579]

### Cumulative logits to probabilities

Note that the output layer produces cumulative logit values. These can subsequently be converted to probability estimates for each ordinal label with the condor.ordinal_softmax() function. These probabilities can then be used to calculate other metrics such as accuracy or mean absolute error.

In [9]:
print("Predict on test dataset")

# Note that these are ordinal (cumulative) logits, not probabilities or regular logits.
ordinal_logits = model.predict(test_dataset)

# Convert from logits to label probabilities. This is initially a tensorflow tensor.
tensor_probs = condor.ordinal_softmax(ordinal_logits)

# Convert the tensor into a pandas dataframe.
probs_df = pd.DataFrame(tensor_probs.numpy())

probs_df.head(10)

Predict on test dataset


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,4.673004e-05,0.0,0.000985,0.018463,0.00052,0.004716,0.001831,0.5641987,0.04133856,0.3679013
1,8.34465e-07,0.0,0.620271,0.378648,2.1e-05,0.000364,0.000365,5.652808e-06,0.0003239568,8.39228e-07
2,0.2898683,0.478036,0.039226,0.034971,0.012516,0.049234,0.032079,0.005306434,0.04843163,0.01033208
3,0.2898683,0.478036,0.039226,0.034971,0.012516,0.049234,0.032079,0.005306434,0.04843163,0.01033208
4,0.001172602,0.0,0.0,0.012016,0.918288,0.000208,0.005483,5.215406e-08,7.450581e-09,0.06283215
5,0.2898683,0.478036,0.039226,0.034971,0.012516,0.049234,0.032079,0.005306434,0.04843163,0.01033208
6,3.0756e-05,0.0,0.0,0.003818,0.970375,6e-06,0.000702,0.0,0.0,0.02506757
7,0.0006659031,0.0,0.001104,0.043011,0.011639,0.078128,0.051977,0.02138722,0.2827402,0.5093479
8,0.05411088,2.8e-05,0.021621,0.10786,0.054946,0.203791,0.148065,0.01030344,0.2698844,0.1293912
9,9.894371e-06,0.0,4.9e-05,0.009641,0.002023,0.001249,0.003876,0.05451417,0.00302285,0.9256147


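For an ordinal outcome variable we would like the probability distribution for each observation to be unimodal. Several rows above have one dominant peak, but strict unimodality is not guaranteed: row 0, for example, has a large peak at label 7 and a second one at label 9.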
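Whether a given row of probabilities is unimodal can be checked programmatically. The following is a minimal sketch, not part of condor_tensorflow; the helper name `is_unimodal` and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np

def is_unimodal(probs, tol=1e-6):
    """Return True if the values rise to a single peak and then fall (within tol)."""
    diffs = np.diff(np.asarray(probs, dtype=float))
    # Ignore steps smaller than the tolerance, then keep only the sign of each step.
    signs = np.sign(diffs[np.abs(diffs) > tol])
    # A unimodal sequence never goes up again once it has started going down.
    went_down = False
    for s in signs:
        if s < 0:
            went_down = True
        elif went_down:
            return False
    return True

single_peak = [0.05, 0.15, 0.50, 0.20, 0.10]
two_peaks = [0.05, 0.40, 0.10, 0.35, 0.10]
print(is_unimodal(single_peak), is_unimodal(two_peaks))
```

Applied row-wise, e.g. `probs_df.apply(is_unimodal, axis=1).mean()`, this would give the fraction of predictions that are unimodal at the chosen tolerance.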

We can also confirm that our probabilities sum to 1 as expected:

In [10]:
# Check that probabilities all sum to 1 - looks good!
probs_df.sum(axis = 1)

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
9995    1.0
9996    1.0
9997    1.0
9998    1.0
9999    1.0
Length: 10000, dtype: float32
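The conversion from cumulative logits to label probabilities can be sketched in plain NumPy. This is an illustration that follows the sigmoid and cumulative-product steps used elsewhere in this notebook; it is not the source of condor.ordinal_softmax(), and the names `sigmoid`, `logits_to_label_probs` and `toy_logits` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logits_to_label_probs(logits):
    """Turn (n, K-1) conditional logits into (n, K) label probabilities."""
    logits = np.asarray(logits, dtype=float)
    # Pr(Y > k) is the running product of the conditional probabilities.
    exceed = np.cumprod(sigmoid(logits), axis=1)
    n = exceed.shape[0]
    # Pad with Pr(Y > -1) = 1 on the left and Pr(Y > K-1) = 0 on the right.
    padded = np.hstack([np.ones((n, 1)), exceed, np.zeros((n, 1))])
    # Pr(Y = k) = Pr(Y > k-1) - Pr(Y > k).
    return padded[:, :-1] - padded[:, 1:]

toy_logits = np.array([[4.0, 2.0, -1.0, -3.0],
                       [-2.0, 0.5, 0.5, 0.5]])
toy_probs = logits_to_label_probs(toy_logits)
print(toy_probs.round(3))
print(toy_probs.sum(axis=1))
```

Because the differences telescope, each row sums to 1 by construction, which is the property checked in the cell above.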

### Label prediction

Using the probabilities generated, we can predict labels for the MNIST images. Here we demonstrate two ways of calculating predicted labels. First, we can simply select the label with the highest probability:

In [11]:
# Probs to labels
labels = probs_df.idxmax(axis = 1)
labels.values

array([7, 2, 1, ..., 4, 8, 6], dtype=int64)

We can now use these labels to calculate the accuracy of our predictions:

In [12]:
np.mean(labels == mnist_labels_test)

0.5851

In [13]:
# Compare to logit-based cumulative probs
cum_probs = pd.DataFrame(ordinal_logits).apply(special.expit).cumprod(axis=1)
cum_probs.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.999953,0.999953,0.998968,0.980505,0.979985,0.975269,0.973439,0.40924,0.3679013
1,0.999999,0.999999,0.379728,0.00108,0.001059,0.000695,0.00033,0.000325,8.392232e-07
2,0.710132,0.232096,0.19287,0.157899,0.145383,0.096149,0.06407,0.058764,0.01033208
3,0.710132,0.232096,0.19287,0.157899,0.145383,0.096149,0.06407,0.058764,0.01033208
4,0.998827,0.998827,0.998827,0.986811,0.068523,0.068315,0.062832,0.062832,0.06283212
5,0.710132,0.232096,0.19287,0.157899,0.145383,0.096149,0.06407,0.058764,0.01033208
6,0.999969,0.999969,0.999969,0.996151,0.025776,0.02577,0.025068,0.025068,0.02506755
7,0.999334,0.999334,0.99823,0.955219,0.94358,0.865452,0.813475,0.792088,0.5093479
8,0.945889,0.945861,0.924241,0.816381,0.761435,0.557644,0.409579,0.399276,0.1293911
9,0.99999,0.99999,0.999941,0.9903,0.988277,0.987028,0.983152,0.928638,0.9256148


Secondly, we use the label prediction rule proposed by Cao et al. for their CORAL method. Here the predicted label is the number of ranks k for which Pr(Y > k) exceeds 50%.

In [14]:
# Calculate the labels using the style of Cao et al.
labels2 = cum_probs.apply(lambda x: x > 0.5).sum(axis = 1)
labels2.head()

0    7
1    2
2    1
3    1
4    4
dtype: int64

In [15]:
# What is the accuracy of these labels? 
np.mean(labels2 == mnist_labels_test)

0.5767

In [16]:
# The two methods usually agree, but they still differ on a substantial share of examples.
np.mean(labels == labels2)

0.8274
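To see why the two rules can disagree, the sketch below applies both to a single observation. The cumulative probabilities are copied from row 8 of the cum_probs table above; the variable names are chosen for illustration only.

```python
import numpy as np

# Pr(Y > k) for k = 0..8, copied from row 8 of the cum_probs table.
exceed = np.array([0.945889, 0.945861, 0.924241, 0.816381, 0.761435,
                   0.557644, 0.409579, 0.399276, 0.1293911])

# Label probabilities are successive differences of [1, exceed, 0].
padded = np.concatenate([[1.0], exceed, [0.0]])
probs = padded[:-1] - padded[1:]

label_argmax = int(np.argmax(probs))         # most probable single label
label_threshold = int(np.sum(exceed > 0.5))  # number of thresholds passed

print(label_argmax, label_threshold)
```

For this row the highest-probability rule picks label 8, while the threshold rule picks label 6, because the probability mass is spread over several labels and Pr(Y > k) drops below 50% well before the single most likely label.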

In [17]:
print("Mean absolute label error version 1:", np.mean(np.abs(labels - mnist_labels_test)))
print("Mean absolute label error version 2:", np.mean(np.abs(labels2 - mnist_labels_test)))

Mean absolute label error version 1: 0.902
Mean absolute label error version 2: 0.7875


In [None]:
mnist_labels_test[:5]

### Importance weights customization

The following quick example shows how the importance weights can be customized.

In [None]:
model = create_model(num_classes = NUM_CLASSES)
model.summary()

# We have num_classes - 1 outputs (cumulative logits), so there are 9 elements
# in the importance vector to customize.
importance_weights = [1., 1., 0.5, 0.5, 0.5, 1., 1., 0.1, 0.1]
loss_fn = condor.SparseCondorOrdinalCrossEntropy(importance_weights = importance_weights)

model.compile(tf.keras.optimizers.Adam(learning_rate = learning_rate), loss = loss_fn)
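As a rough intuition for what the importance weights do, the sketch below assumes that each weight simply scales one of the K-1 binary cross-entropy terms, and that a term only counts when the label exceeded the previous level. This is a simplified NumPy illustration under those assumptions, not the condor_tensorflow implementation; `weighted_conditional_bce` is a hypothetical name.

```python
import numpy as np

def weighted_conditional_bce(logits, levels, importance_weights):
    """Illustrative per-example loss; NOT the condor_tensorflow implementation."""
    logits = np.asarray(logits, dtype=float)
    levels = np.asarray(levels, dtype=float)
    weights = np.asarray(importance_weights, dtype=float)
    # Numerically stable log(sigmoid(z)) and log(1 - sigmoid(z)).
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    bce = -(levels * log_p + (1.0 - levels) * log_not_p)
    # Assumed conditioning: task k only counts if the label exceeded level k-1.
    mask = np.hstack([np.ones((levels.shape[0], 1)), levels[:, :-1]])
    return np.sum(weights * mask * bce, axis=1)

# Two examples with K = 4 classes, so 3 binary tasks each.
logits = np.array([[2.0, 0.5, -1.0], [-1.0, -2.0, -3.0]])
levels = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])  # labels 2 and 0

uniform = weighted_conditional_bce(logits, levels, [1.0, 1.0, 1.0])
down_weighted = weighted_conditional_bce(logits, levels, [1.0, 1.0, 0.1])
print(uniform, down_weighted)
```

Under these assumptions, lowering the last weight reduces the loss of the first example, whose label reaches the third task, and leaves the second example unchanged, since only its first task is active.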

In [None]:
%%time

history = model.fit(dataset, epochs = num_epochs)

## Amazon reviews and 5-star ratings

This section uses the Prime Pantry subset of the Amazon review data, available via https://nijianmo.github.io/amazon/index.html#subsets


In [None]:
!curl -o Prime_Pantry_5.json.gz http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Prime_Pantry_5.json.gz 

In [None]:
data = []
with gzip.open('Prime_Pantry_5.json.gz') as f:
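    for line in f:
        data.append(json.loads(line.strip()))

df = pd.DataFrame.from_dict(data)
# Copy the two columns we need so the in-place edits below act on an independent frame.
df = df[['overall', 'reviewText']].copy()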

# There is a large amount of duplicate text in here, possibly due to paid/fraudulent reviews.
df.drop_duplicates("reviewText", inplace = True)

# Some of the text is blank, which causes an obscure error about floating point conversion.
df.dropna(inplace = True)

print(len(df))
print(df.head())

outcome_col = "overall"
text_col = "reviewText"

# We subtract the minimum value from the outcomes so that they start at 0.
df[outcome_col] = df[outcome_col].values - df[outcome_col].min()

print("\n", df.overall.value_counts())

# TODO: define automatically based on the number of unique values in the outcome variable.
num_classes = 5
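One way the TODO above might be addressed is to derive the number of classes from the shifted outcome values. The sketch below uses a small hypothetical array in place of df[outcome_col].values and shows two options, which agree only when every rank is present in the data.

```python
import numpy as np

# Hypothetical star ratings standing in for df[outcome_col].values.
ratings = np.array([5.0, 4.0, 1.0, 5.0, 3.0, 2.0, 5.0])

# Shift so the smallest observed rating becomes 0, as in the cell above.
shifted = ratings - ratings.min()

# Option 1: count the distinct observed values.
num_classes_observed = len(np.unique(shifted))
# Option 2: assume consecutive integer ranks and use the largest one.
num_classes_range = int(shifted.max()) + 1

print(num_classes_observed, num_classes_range)
```

If some rating never appears in the data, option 1 undercounts, so option 2 (or a hard-coded value, as used here) may be the safer choice for sizing the output layer.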

In [None]:
# Train/Test split
text_train, text_test, labels_train, labels_test = \
  train_test_split(df[text_col].values, df[outcome_col].values, test_size = 10000, random_state = 1)

print("Training text shape:", text_train.shape)
print("Training labels shape:", labels_train.shape)
print("Testing text shape:", text_test.shape)
print("Testing labels shape:", labels_test.shape)

### Universal Sentence Encoder model (minimal code changes)

In [None]:
%%time
# This takes 20 - 30 seconds.

# Clear our GPU memory to stay efficient.
tf.keras.backend.clear_session()

input_text = tf.keras.layers.Input(shape = [], dtype = tf.string, name = 'input_text')

# hub.load returns the loaded encoder module (not a URL); it is downloaded on first use.
use_module = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

base_model = hub.KerasLayer(use_module, input_shape = [],
                            dtype = tf.string,
                            trainable = False)
                            
embedded = base_model(input_text)

x = tf.keras.layers.Dense(64, activation = 'relu')(embedded)
x = tf.keras.layers.Dropout(0.1)(x)
# No activation function specified so this will output cumulative logits.
output = tf.keras.layers.Dense(num_classes-1)(x)

model = tf.keras.Model(inputs = input_text, outputs = output)

model.summary()

In [None]:
model.compile(loss = condor.SparseCondorOrdinalCrossEntropy(),
              metrics = [condor.SparseOrdinalEarthMoversDistance(),
                         condor.SparseOrdinalMeanAbsoluteError()],
              optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001))

In [None]:
# Encode a test string and take a look at the first ten dimensions.
base_model(np.array(["test_string"])).numpy()[0, :10]

In [None]:
%%time

history = model.fit(x = text_train,
                    y = labels_train,
                    epochs = 5,
                    batch_size = 32, 
                    validation_split = 0.2,
                    callbacks = [tf.keras.callbacks.EarlyStopping(patience = 2,
                                                                  min_delta = 0.001,
                                                                  restore_best_weights = True)])

#### Evaluate

In [None]:
# For comparison, CORAL achieves loss 0.7962, MAE 0.3195
model.evaluate(text_test, labels_test) 

In [None]:
# Generate predictions - initially these are cumulative logits.
preds = model.predict(text_test)
print(preds)
# Convert cumulative logits to probabilities for each class aka rank or label.
probs = pd.DataFrame(condor.ordinal_softmax(preds).numpy())

In [None]:
print(probs.head(10))
print(labels_test[:10])

#### Evaluate accuracy

In [None]:
# Evaluate accuracy and mean absolute error
labels_v1 = probs.idxmax(axis = 1)
print("Accuracy of label version 1:", np.mean(labels_v1 == labels_test))

# Compare to logit-based cumulative probs
cum_probs = pd.DataFrame(preds).apply(special.expit).cumprod(axis=1)
# Calculate the labels using the style of Cao et al.
labels_v2 = cum_probs.apply(lambda x: x > 0.5).sum(axis = 1)
print("Accuracy of label version 2:", np.mean(labels_v2 == labels_test))

#### Evaluate mean absolute label error

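This is effectively an ordinal version of 1 - accuracy: instead of counting every wrong prediction as an error of 1, it counts how many ranks the prediction is away from the true label.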

In [None]:
# These do not correspond with what we get from the model evaluation. Something must be off in one of these.
print("Mean absolute label error version 1:", np.mean(np.abs(labels_v1 - labels_test)))
print("Mean absolute label error version 2:", np.mean(np.abs(labels_v2 - labels_test)))

print("Root mean squared label error version 1:", np.sqrt(np.mean(np.square(labels_v1 - labels_test))))
print("Root mean squared label error version 2:", np.sqrt(np.mean(np.square(labels_v2 - labels_test))))
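The relationship between accuracy, mean absolute label error and root mean squared label error can be seen on a small made-up example. The label arrays below are hypothetical and are not taken from the Amazon data.

```python
import numpy as np

true_labels = np.array([0, 1, 2, 3, 4, 4])
pred_labels = np.array([0, 2, 2, 1, 4, 3])

errors = pred_labels - true_labels

accuracy = np.mean(errors == 0)             # share of exact matches
mae = np.mean(np.abs(errors))               # average distance in ranks
rmse = np.sqrt(np.mean(np.square(errors)))  # penalizes large misses more

print(accuracy, mae, rmse)
```

Every wrong prediction is off by at least one rank, so the mean absolute label error is never smaller than 1 - accuracy; it equals 1 - accuracy only when all misses are off by exactly one.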

In [None]:
# Review how absolute error is calculated for ordinal labels:
pd.DataFrame({"true": labels_test, "pred_v2": labels_v2, "abs": np.abs(labels_v2 - labels_test)}).head()

### Universal Sentence Encoder model (speed up using encodings)

The "Sparse" versions of the CONDOR API are convenient and require minimal code changes. However, they carry a performance overhead compared with pre-encoding the labels using CONDOR's ordinal encoder, because the sparse API is essentially encoding the labels on the fly inside the training loop.

Also as we will see later, the labels do not always come encoded as 0,1,...,K-1. In this case, using the CondorOrdinalEncoder will help transform labels into ordinal-ready values.
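To give an idea of what pre-encoding might produce, the sketch below maps integer ranks to K-1 binary indicators of the form "is the label greater than k". It assumes this is the layout the non-sparse loss expects, which is not confirmed here; `encode_levels` is a hypothetical helper, not the CondorOrdinalEncoder itself.

```python
import numpy as np

def encode_levels(labels, num_classes):
    """Map integer ranks 0..K-1 to K-1 binary indicators [y > 0, ..., y > K-2]."""
    labels = np.asarray(labels).astype(int)
    thresholds = np.arange(num_classes - 1)
    # Broadcasting compares every label against every threshold.
    return (labels[:, None] > thresholds[None, :]).astype(np.float32)

levels = encode_levels([0, 2, 4], num_classes=5)
print(levels)
```

In this layout the row sum recovers the original rank, and each column corresponds to one of the K-1 outputs of the model's final dense layer.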

In [None]:
%%time
# pre-encoding runs very fast so the savings later are worth it
enc = condor.CondorOrdinalEncoder(nclasses=num_classes)
enc_labs_train = enc.fit_transform(labels_train)
enc_labs_test = enc.transform(labels_test)

In [None]:
# Note the lack of "Sparse" in the condor functions here
model.compile(loss = condor.CondorOrdinalCrossEntropy(),
              metrics = [condor.OrdinalEarthMoversDistance(),
                         condor.OrdinalMeanAbsoluteError()],
              optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001))

In [None]:
%%time
# note the encoded labels are passed to the fit now
history = model.fit(x = text_train,
                    y = enc_labs_train,
                    epochs = 5,
                    batch_size = 32, 
                    validation_split = 0.2,
                    callbacks = [tf.keras.callbacks.EarlyStopping(patience = 2,
                                                                  min_delta = 0.001,
                                                                  restore_best_weights = True)])

In [None]:
model.evaluate(text_test, enc_labs_test) 

#### More examples of label encoding capabilities
Here we demo the features of the ordinal encoder.


In [None]:
# Here the ordinal encoder figures out how many classes there are automatically
# and orders them in the default sklearn OrdinalEncoder fashion 
# (i.e., alphabetically here)
labels = np.array(['a','b','c','d','e'])
enc_labs = condor.CondorOrdinalEncoder().fit_transform(labels)
print(enc_labs)

In [None]:
# Here the ordinal encoder figures out how many classes there are automatically
# and orders them in the default sklearn OrdinalEncoder fashion 
# (i.e., alphabetically here). This time it is dealing with a basic list.
labels = ['a','b','c','d','e']
enc_labs = condor.CondorOrdinalEncoder().fit_transform(labels)

print(enc_labs)

In [None]:
# Here we wish to specify the order to be different from alphabetical. Note
# this would also allow "missing" categories to be included in proper order.
labels = ['low','med','high']
enc = condor.CondorOrdinalEncoder(categories=[['low', 'med', 'high']])
enc_labs = enc.fit_transform(labels)

print(enc_labs)
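Why the explicit category order matters can be shown without the encoder at all. The sketch below ranks string labels once by a given order and once alphabetically; `rank_labels` is a hypothetical helper used only for illustration.

```python
# Hypothetical helper: rank labels by an explicit category order.
def rank_labels(labels, categories):
    position = {category: rank for rank, category in enumerate(categories)}
    return [position[label] for label in labels]

labels = ['low', 'high', 'med', 'low']

# Ranks under the intended order low < med < high.
explicit = rank_labels(labels, ['low', 'med', 'high'])
# Ranks under alphabetical order, which sorts 'high' before 'low'.
alphabetical = rank_labels(labels, sorted(set(labels)))

print(explicit)
print(alphabetical)
```

With alphabetical ordering 'high' receives the lowest rank, so an ordinal model would be trained on a scrambled scale; passing the categories explicitly avoids this.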