# Fun with ML Production

This is a collection of fun tricks that one may try in an ML production environment.

For quick demonstration purposes, we will be using the MNIST dataset.


In [None]:
from tensorflow.keras.datasets import mnist
import numpy as np
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = (x_train / 255.0).astype(np.float32)
x_test  = (x_test / 255.0).astype(np.float32)

## Model Aggregation (like Ensemble)

### Intro

Model aggregation is an alternative way to train multiple versions of a model in parallel, where each version has a different weight initialization.

The technique is similar to 'ensemble' training, except that in an ensemble each model is trained independently. In aggregation, the models are conditionally dependent on each other during training.

### Method

Consider a model that is composed of a stem (entry) group, a learner group (where the features are learned) and a classifier (exit) group, as in the Idiomatic macro-architecture design pattern for models.

                                stem => learner => classifier
                                
In aggregation, we start with the stem and then add multiple branches, where each branch has a separate copy of the learner+classifier group of the model, as depicted below:

                                       inputs
                                        stem

                    learner_1        learner_2        learner_3
                    classifier_1     classifier_2     classifier_3
                    output_1         output_2         output_3
                
When instantiating the model, we will use a single input with multiple outputs:

                    model = Model(inputs, [output_1, output_2, output_3])

In [None]:
from tensorflow.keras import Input, Model, Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

# Stem
inputs = Input((28, 28))
x = Flatten()(inputs)

# Model 1 branch
x1 = Dense(64, activation='relu', name="64_1")(x)  # input is output from the stem
x1 = Dense(128, activation='relu', name="128_1")(x1)
o1 = Dense(10, activation='softmax', name="output1")(x1)

# Model 2 branch
x2 = Dense(64, activation='relu', name="64_2")(x)
x2 = Dense(128, activation='relu', name="128_2")(x2)
o2 = Dense(10, activation='softmax', name="output2")(x2)

# Model 3 branch
x3 = Dense(64, activation='relu', name="64_3")(x)
x3 = Dense(128, activation='relu', name="128_3")(x3)
o3 = Dense(10, activation='softmax', name="output3")(x3)

model = Model(inputs, [o1, o2, o3])
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model.summary()

### Train

We now train all three models (branches) in parallel. Note that for the label data we pass three copies of the labels, i.e., we need to specify a corresponding label set per classifier output.

Let's point out two things:

The layers of each branch are trained separately from each other.

But the training of the stem is an aggregation of the training of the branches. Since each branch has a different weight initialization, the updates to the stem weights act like a push-and-pull effect, thus providing regularization (a tad bit of noise). Note that this requires a stem with trainable weights; in the toy model above the stem is only a Flatten layer, so the effect would show up only once trainable layers (e.g., a Dense layer) are added to the stem.

Likewise, that noise is reflected back down the branches on the next forward feed batch, providing regularization through all layers.
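The aggregation in the stem can be checked with a small NumPy sketch. It assumes a hypothetical linear stem `W` shared by three single-weight-vector branches `V[k]` with a squared-error loss, none of which come from the model above; it only illustrates that the gradient of the summed loss with respect to the shared weights is the sum of the per-branch gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)        # one input sample
W = rng.normal(size=(3, 4))   # shared stem weights (hypothetical linear stem)
V = rng.normal(size=(3, 3))   # row k holds the differently initialized weights of branch k
t = 1.0                       # the same target is used for every branch

def branch_losses(W):
    h = W @ x                  # stem output, fed to every branch
    y = V @ h                  # one scalar prediction per branch
    return 0.5 * (y - t) ** 2  # one squared-error loss per branch

def stem_grad_for_branch(W, k):
    # Analytic gradient of branch k's loss with respect to the stem weights
    err = V[k] @ (W @ x) - t
    return err * np.outer(V[k], x)

def numeric_grad_of_total(W, eps=1e-6):
    # Central finite differences on the summed loss of all branches
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (branch_losses(Wp).sum() - branch_losses(Wm).sum()) / (2 * eps)
    return g

summed = sum(stem_grad_for_branch(W, k) for k in range(3))
total = numeric_grad_of_total(W)
print(np.allclose(summed, total, atol=1e-5))
```

Each branch pulls the shared weights in its own direction, and the stem update is the net of those pulls.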

In [None]:
model.fit(x_train, [y_train, y_train, y_train], epochs=10, batch_size=32, validation_split=0.1, verbose=1)
model.evaluate(x_test, [y_test, y_test, y_test])

### Cutout the Best Model

Next, we use the evaluation results to pick which branch got the highest accuracy (1, 2 or 3). Then we construct a new model using the trained layers of the corresponding branch.
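Picking the branch can be automated. The sketch below assumes the evaluation results are available as a dictionary, as returned by `model.evaluate(..., return_dict=True)`, and that the accuracy keys follow the pattern `<output name>_acc`; the exact key names may differ between Keras versions. The helper `best_branch` and the numbers in `example` are hypothetical placeholders, not measured results.

```python
def best_branch(results, suffix="_acc"):
    """Return the output name with the highest accuracy in an evaluate() dict."""
    accs = {key[:-len(suffix)]: value
            for key, value in results.items()
            if key.endswith(suffix)}
    return max(accs, key=accs.get)

# Placeholder values for illustration only, not measured results
example = {
    "loss": 0.30,
    "output1_loss": 0.10, "output2_loss": 0.09, "output3_loss": 0.11,
    "output1_acc": 0.970, "output2_acc": 0.975, "output3_acc": 0.968,
}
print(best_branch(example))
```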

In [None]:
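# Layer names for the three branches (set when the model was built)
models = { '1' : ["64_1", "128_1", "output1"],
           '2' : ["64_2", "128_2", "output2"],
           '3' : ["64_3", "128_3", "output3"] }


# common stem: the Flatten layer is the only layer shared by all branches
x = model.input
x = model.layers[1](x)

# branch layers (branch 3 is used here; change the key to the branch that scored best)
for name in models['3']:
    x = model.get_layer(name)(x)

new_model = Model(model.input, x)
new_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
new_model.summary()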

Let's now feed the test data again to the new 'cutout' model:

In [None]:
new_model.evaluate(x_test, y_test)

## Add another layer to trained model

Let's train an aggregation of models again. This time we make the branched models one layer deeper, while reusing all of the trained layers except the classifiers.

In [None]:
# Stem: reuse the Flatten layer on a new input
inputs = Input((28, 28))
x = model.layers[1](inputs)

# Branch 1: reuse its trained layers, then add a new layer and a new classifier
x1 = model.get_layer("64_1")(x)
x1 = model.get_layer("128_1")(x1)
x1 = Dense(128, activation='relu')(x1)
o1 = Dense(10, activation='softmax')(x1)

# Branch 2
x2 = model.get_layer("64_2")(x)
x2 = model.get_layer("128_2")(x2)
x2 = Dense(128, activation='relu')(x2)
o2 = Dense(10, activation='softmax')(x2)

# Branch 3
x3 = model.get_layer("64_3")(x)
x3 = model.get_layer("128_3")(x3)
x3 = Dense(128, activation='relu')(x3)
o3 = Dense(10, activation='softmax')(x3)

model2 = Model(inputs, [o1, o2, o3])
model2.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model2.summary()

Let's train it.

In [None]:
model2.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)
model2.evaluate(x_test, [y_test, y_test, y_test])

## (Why Not to Use) Auxiliary Classifier

Inception (2014) introduced the concept of auxiliary classifiers.

The idea is that the further away the classifier is from the entry (bottom), the less it contributes to updating the entry weights on each pass. The concept was to add separate output classifiers at earlier points in the model (auxiliary classifiers), which contribute to the updates of the entry layers from a less distant point.
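The effect of distance can be illustrated with a toy calculation. The sketch below is not the Inception architecture: it assumes a chain of scalar sigmoid layers that all use the same hypothetical weight `w`, and measures how strongly an output placed after `depth` layers responds to the first layer's weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entry_gradient(depth, w=1.0, x=0.5):
    """|d output / d first-layer weight| for a chain of `depth` scalar sigmoid layers."""
    a = x
    grad = 1.0
    for layer in range(depth):
        s = sigmoid(w * a)
        local = s * (1.0 - s)   # derivative of the sigmoid, at most 0.25
        if layer == 0:
            grad *= local * x   # derivative of the first activation w.r.t. the first weight
        else:
            grad *= local * w   # derivative of this activation w.r.t. the previous one
        a = s
    return abs(grad)

for depth in (1, 3, 6):
    print(depth, entry_gradient(depth))
```

Under these assumptions every extra layer multiplies the gradient by a factor of at most 0.25, so a classifier attached closer to the entry sends a larger signal to the first layer.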

In theory, the idea 'intuitively' made sense. But we rarely see auxiliary classifiers in models that came after Inception, and there is a reason.

### Method

So, let's try this technique and see what happens. Our model will have 3 dense layers, and after each dense layer we will add an auxiliary classifier. That is, each level will have its own classifier.

What we 'intuitively' expect is that each deeper level is more accurate than the last. Given that, we can pick a level of accuracy and cut off the remaining layers for a more compact model.
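The selection step could look like the sketch below. The helper `shallowest_level` is hypothetical, and the accuracies are placeholder values that reflect the intuitive expectation only, not measured results. A truncated model would then be built from the input up to the chosen level's output.

```python
def shallowest_level(accuracies, target):
    """Return the first (shallowest) level whose accuracy meets the target, or None."""
    for name, acc in accuracies:
        if acc >= target:
            return name
    return None

# Placeholder accuracies for illustration only, ordered from shallowest to deepest
levels = [("level1", 0.95), ("level2", 0.97), ("level3", 0.98)]
print(shallowest_level(levels, 0.96))
```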

In [None]:
from tensorflow.keras import Input, Model, Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

# Stem
inputs = Input((28, 28))
x = Flatten()(inputs)

# Level 1 and Auxiliary Classifier
x = Dense(128, activation='relu')(x)
o1 = Dense(10, activation='softmax', name="level1")(x)

# Level 2 and Auxiliary Classifier
x = Dense(128, activation='relu')(x)
o2 = Dense(10, activation='softmax', name="level2")(x)

# Level 3 and Final Classifier
x = Dense(128, activation='relu')(x)
o3 = Dense(10, activation='softmax', name="level3")(x)

model3 = Model(inputs, [o1, o2, o3])
model3.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model3.summary()

Let's train it.

In [None]:
model3.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)
model3.evaluate(x_test, [y_test, y_test, y_test])

### Results

Note that as we went deeper, the accuracy did not go up; it actually went down!

Why? Each level is a separate 'solver', and in effect they are fighting each other. The deeper we go, the more solvers are fighting each other, and the further the results degrade!

### Next Try
This seems like it may be a case of covariate shift. Perhaps we can solve this by adding BatchNormalization() as a pre-activation before each auxiliary/final classifier.
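As a reminder of what the normalization does to the activations it receives, here is a minimal NumPy sketch of the training-time computation. It assumes the learned scale and offset are left at 1 and 0 and ignores the moving averages used at inference; `batch_norm_train` is a hypothetical helper, and the random input stands in for a batch of 128-unit activations.

```python
import numpy as np

def batch_norm_train(x, eps=1e-3):
    """Normalize each feature over the batch axis (scale 1, offset 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
acts = rng.normal(loc=5.0, scale=3.0, size=(32, 128))  # stand-in for a batch of activations
normed = batch_norm_train(acts)

# Per-feature mean is close to 0 and standard deviation close to 1
print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])
```

Whatever the shift or scale of the incoming activations, each classifier then sees inputs with roughly the same per-feature statistics on every batch.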

In [None]:
from tensorflow.keras import Input, Model, Sequential
from tensorflow.keras.layers import Dense, Flatten, BatchNormalization
from tensorflow.keras.optimizers import Adam

# Stem
inputs = Input((28, 28))
x = Flatten()(inputs)

# Level 1 with pre-activation normalization
x = Dense(128, activation='relu')(x)
o1 = BatchNormalization()(x)
o1 = Dense(10, activation='softmax', name="level1")(o1)

# Level 2 with pre-activation normalization
x = Dense(128, activation='relu')(x)
o2 = BatchNormalization()(x)
o2 = Dense(10, activation='softmax', name="level2")(o2)

# Level 3 with pre-activation normalization
x = Dense(128, activation='relu')(x)
o3 = BatchNormalization()(x)
o3 = Dense(10, activation='softmax', name="level3")(o3)

model4 = Model(inputs, [o1, o2, o3])
model4.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model4.summary()

In [None]:
model4.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)
model4.evaluate(x_test, [y_test, y_test, y_test])

### Result

Nope, it doesn't help!

In [None]:
model5.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=1)

## End