techniques to get the most out of models:
1. Hyperparameter optimization
Based on intuition, an ML engineer decides which architecture to use, how many layers to stack, what the dropout rate should be, whether to use batchnorm, etc.

The initial architecture will be sub-optimal; with some foresight and trial and error, a better model is created. The trial and error is better left to a machine than to a human engineer: the code searches through the architecture space and comes up with a good set of hyperparameters.
Many optimization schemes are possible: Bayesian optimization, genetic search algorithms, simple random search, grid search, etc.

There are a few caveats:
1. The hyperparameter space is discrete, not continuous and differentiable, so gradient descent cannot be used to drive the search. We have to rely on gradient-free search strategies.
2. It is very expensive and time consuming: pick a set of hyperparameters, train the model, evaluate it on validation data, then repeat. That adds up to a huge number of GPU hours.
3. It is hard to distinguish noise from signal: if a configuration gives a 0.1 percent improvement, is that due to the model configuration or to a luckier set of initial weights?
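
To make the search loop and the caveats concrete, here is a minimal sketch of random search over a discrete space using only the standard library. The search space, the scoring formula and the noise level are all made up for illustration: `noisy_validation_score` stands in for "train the model and evaluate it on validation data", which in reality is the expensive part. Repeating each trial and averaging is one simple way to reduce the noise described in caveat 3.

```python
import random
import statistics

# Hypothetical discrete search space (names and values are illustrative).
SEARCH_SPACE = {
    "units": [16, 32, 48, 64],
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["adam", "rmsprop"],
}


def noisy_validation_score(config, rng):
    """Stand-in for training a model and scoring it on validation data.

    The formula is invented; the Gaussian noise mimics the effect of
    different random weight initializations.
    """
    signal = 0.90 + 0.0005 * config["units"] - 0.02 * config["dropout"]
    return signal + rng.gauss(0.0, 0.002)


def random_search(num_trials, executions_per_trial, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        # Sample one value per hyperparameter; no gradients are involved.
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        # Run each configuration several times and average the scores.
        scores = [noisy_validation_score(config, rng) for _ in range(executions_per_trial)]
        results.append((statistics.mean(scores), config))
    # Return the (score, config) pair with the highest averaged score.
    return max(results, key=lambda item: item[0])


best_score, best_config = random_search(num_trials=20, executions_per_trial=3)
print(best_score, best_config)
```

Every trial here costs microseconds; with a real model each call would be a full training run, which is where caveat 2 comes from.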

KerasTuner (the keras_tuner package) is a tool that helps with this for Keras-based models.

Instead of updating hyperparameters manually and re-evaluating each time, it lets you declare a range of values for each hyperparameter and then crawls through that search space for you.

In [1]:
from tensorflow import keras
from tensorflow.keras import layers


In [2]:
def build_model(hp):
  units = hp.Int(name="units",min_value = 16, max_value = 64, step = 16)
  model = keras.Sequential([
    layers.Dense(units = units, activation="relu"),
    layers.Dense(units = 10,activation="softmax")

  ])
  optimizer = hp.Choice(name="optimizer", values=["adam", "rmsprop"])
  model.compile(optimizer = optimizer, loss = "sparse_categorical_crossentropy", metrics = ["accuracy"])
  return model

The above is the simplest way to set up a hyperparameter search with KerasTuner: a function that takes an hp argument and returns a compiled model.
If a more modular approach is required, we can subclass kt.HyperModel instead.

In [3]:
import keras_tuner as kt
class SimpleMLP(kt.HyperModel):
    def __init__(self,num_classes):
        super().__init__()  # let HyperModel initialize its own state
        self.num_classes = num_classes
    def build(self,hp):
        units = hp.Int(name="units",min_value = 16, max_value = 64, step = 16)
        model = keras.Sequential([
            layers.Dense(units = units, activation="relu"),
            layers.Dense(units = self.num_classes,activation="softmax")

        ])
        optimizer = hp.Choice(name="optimizer", values=["adam", "rmsprop"])
        model.compile(optimizer = optimizer, loss = "sparse_categorical_crossentropy", metrics = ["accuracy"])
        return model
hypermodel = SimpleMLP(10)


The advantage of the object-oriented approach above is that configuration constants (here num_classes) can be passed as constructor arguments instead of being hardcoded in the model-building function.

Once the model-building function is defined, the tuner has to be set up. The tuner is like a for loop that repeatedly picks a set of hyperparameter values, trains the model and records the metrics.
Available tuners: RandomSearch, BayesianOptimization, Hyperband
BayesianOptimization is a smarter tuner: it uses the results of previous trials to guess which hyperparameter values are likely to give better results next.

In [5]:
tuner = kt.BayesianOptimization(build_model, objective="val_accuracy", max_trials = 20, executions_per_trial = 2, directory="/Users/divyeshkanagavel/Desktop/DeepLearning/DL_nbs/mnist_kt",overwrite=True)
# build_model is the model-building function passed to the tuner.
# The objective is to maximize validation accuracy.
# executions_per_trial=2 trains each configuration twice and averages the results to reduce metric variance.



In [6]:
tuner.search_space_summary()

Search space summary
Default search space size: 2
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': 'linear'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam', 'rmsprop'], 'ordered': False}


For built-in metrics like accuracy, KerasTuner infers whether the metric should be maximized or minimized (loss, for example, is minimized). For custom metrics, the direction has to be passed explicitly:
objective = kt.Objective(
    name="val_accuracy",
    direction="max")

In [7]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# pre-processing
x_train = x_train.reshape((-1,28*28)).astype("float32")/255.
x_test = x_test.reshape((-1,28*28)).astype("float32")/255.
x_train_full = x_train[:]
y_train_full = y_train[:]
num_val_samples = 10000
x_train, x_val = x_train[:-num_val_samples], x_train[-num_val_samples:] 
y_train, y_val = y_train[:-num_val_samples], y_train[-num_val_samples:]




In [8]:
callbacks = [keras.callbacks.EarlyStopping(monitor="val_loss", patience = 3)]


In [9]:
#tuner.search is similar to model.fit -> same set of arguments
tuner.search(x_train, y_train, epochs = 40, validation_data = (x_val, y_val), callbacks = callbacks, batch_size = 128, verbose = 2)

Trial 20 Complete [00h 00m 46s]
val_accuracy: 0.9300999939441681

Best val_accuracy So Far: 0.9308499991893768
Total elapsed time: 00h 15m 00s


In [10]:
#once the hyperparameter search is done, we can select the best set of hps returned by the tool and use it to train on more data and for more epochs
top_n = 4
best_hps = tuner.get_best_hyperparameters(top_n)
best_hps

[<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters at 0x33c05ac10>,
 <keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters at 0x351adfd50>,
 <keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters at 0x32569cf50>,
 <keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters at 0x2e3739590>]

There is one more hyperparameter to find: the number of epochs.
During the hyperparameter search we used aggressive settings such as early stopping with a patience of just 3.
For the final models we may also want to train on the validation data and for more epochs, to avoid underfitting.


In [11]:
def get_best_epoch(hp):
    model = build_model(hp)
    callbacks = [keras.callbacks.EarlyStopping(monitor="val_loss",mode="min", patience = 10)] # high patience value to prevent underfitting
    history = model.fit(x_train, y_train, validation_data = (x_val, y_val), epochs = 50,batch_size=128, verbose=0, callbacks = callbacks )
    val_loss_per_epoch = history.history["val_loss"]
    best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch))+1
    return best_epoch

In [14]:
def get_best_trained_model(hp): 
    best_epoch = get_best_epoch(hp) 
    model = build_model(hp)
    model.fit(
        x_train_full, y_train_full,
        batch_size=128, epochs=int(best_epoch * 1.2)) # train about 20% longer, since the full set (train + validation) is about 20% larger
    return model
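
The epoch arithmetic used in the two functions above can be checked on its own. This is a small sketch with made-up validation losses; `epochs_for_full_training` is a hypothetical helper name, not part of Keras or KerasTuner.

```python
def epochs_for_full_training(val_loss_per_epoch, extra=1.2):
    """Return (best_epoch, epochs to use when retraining on the full data)."""
    # list.index is 0-based, so add 1 to get the epoch number.
    best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
    # int() truncates, so 7.2 becomes 7 and 8.4 becomes 8.
    return best_epoch, int(best_epoch * extra)


# Hypothetical validation losses, one value per epoch (epoch 1 first).
losses = [0.41, 0.33, 0.29, 0.27, 0.26, 0.28, 0.30]
print(epochs_for_full_training(losses))
```

The lowest loss in this toy list is at the fifth epoch, so the final run would use six epochs.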

In [15]:
best_models = []
for hp in best_hps:
    model = get_best_trained_model(hp)
    model.evaluate(x_test, y_test)
    best_models.append(model)



Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


If we don't mind slightly underperforming, we can skip the retraining and call tuner.get_best_models(top_n), which returns the best models found during the search with their saved weights.

One important thing to watch for in automatic hyperparameter tuning is overfitting to the validation data. We are selecting hyperparameters based on how well they do on the validation data, so information about the validation set leaks into the model. Always keep a held-out test set to gauge the true performance of the model.
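
A small simulation shows why the winner's validation score is optimistic. This is only a sketch under strong assumptions: every configuration is given the same true accuracy, and validation and test scores differ from it only by independent Gaussian noise (all numbers are made up).

```python
import random


def simulate_selection(num_configs=50, true_accuracy=0.93, noise=0.003, seed=0):
    """Pick the config with the best validation score; report its test score too."""
    rng = random.Random(seed)
    val_scores = [true_accuracy + rng.gauss(0.0, noise) for _ in range(num_configs)]
    test_scores = [true_accuracy + rng.gauss(0.0, noise) for _ in range(num_configs)]
    # Selection only ever looks at the validation scores.
    winner = max(range(num_configs), key=lambda i: val_scores[i])
    return val_scores[winner], test_scores[winner]


best_val, held_out = simulate_selection()
print(f"validation score of the winner: {best_val:.4f}")
print(f"test score of the same config:  {held_out:.4f}")
```

Because the winner is chosen for having the highest validation score, that score is biased upward by the selection itself; the test score was never used for selection, so it stays an honest estimate.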

Hyperparameter tuning does not do away with major architectural decisions; the search space for those is still huge.
Its main benefit is that it frees the engineer to focus on those major decisions (should I use a residual network, batchnorm, etc.) while the automation takes care of num_units, optimizer, learning rate and so on.


Another powerful technique: ensemble learning
Ensembling: pooling together the predictions of different models to produce a better prediction.
A single model may be very good at the task at hand, but every model, even a weaker one, learns its own view of the patterns in the data. Pooling the predictions combines these different perspectives, and the pooled prediction usually beats the best individual model.

For example, for a classification problem we could have predictions from four models: model_a, model_b, model_c, model_d.
We could average them to produce a single output: 0.25 * (model_a + model_b + model_c + model_d)
This assumes that all four models are equally good.
A smarter way to ensemble them is a weighted average, with weights chosen by how well each model does on the validation data. Random search or a simple optimization algorithm can be used to find the weights.
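
Both the plain average and the weight search can be sketched in a few lines of plain Python. The predictions and labels below are a made-up toy validation set, and the function names are hypothetical; a real setup would use the models' predicted probabilities on the actual validation data.

```python
import random


def ensemble(predictions, weights):
    """Weighted average of per-model probability vectors for one sample."""
    num_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(num_classes)
    ]


def accuracy(all_model_preds, labels, weights):
    """all_model_preds[m][i] is model m's probability vector for sample i."""
    correct = 0
    for i, label in enumerate(labels):
        probs = ensemble([preds[i] for preds in all_model_preds], weights)
        correct += probs.index(max(probs)) == label
    return correct / len(labels)


def search_weights(all_model_preds, labels, num_trials=200, seed=0):
    """Random search over normalized weights, scored on validation data."""
    rng = random.Random(seed)
    num_models = len(all_model_preds)
    best_w = [1.0 / num_models] * num_models  # start from the plain average
    best_acc = accuracy(all_model_preds, labels, best_w)
    for _ in range(num_trials):
        raw = [rng.random() for _ in range(num_models)]
        w = [r / sum(raw) for r in raw]
        acc = accuracy(all_model_preds, labels, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy validation set: 3 samples, 2 classes, 2 models (made-up numbers).
val_labels = [0, 1, 1]
model_a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]    # wrong on sample 2
model_b = [[0.45, 0.55], [0.1, 0.9], [0.4, 0.6]]  # wrong on sample 1
weights, acc = search_weights([model_a, model_b], val_labels)
print(weights, acc)
```

In this toy case each model gets one sample wrong on its own, but they are wrong on different samples, so the averaged prediction gets all three right. Note that weights tuned on validation data can themselves overfit to it.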


Diversity of models is the key to a good ensemble. If n models share the same bias, the ensemble retains that bias. If instead the models are biased in different ways, the biases tend to cancel each other out and the ensemble is more robust and accurate.
Tree-based methods like gradient-boosted trees or random forests, ensembled with neural networks, usually perform well.


After some time, coming up with ideas and implementing models cease to be the bottleneck; training time becomes the bottleneck instead. To iterate quickly on the baseline and improve performance, the training infrastructure should be fast and robust.
There are several ways to speed up training:
mixed precision : works on a single GPU, gives faster training and inference
multi-GPU training
TPU training

There are infinitely many real numbers in mathematics, but a computer can store only finitely many of them, at a limited resolution.
The resolution depends on the number of bits used to store each number.
Floating-point formats:
half precision : float16
single precision : float32
double precision : float64

float32, for example: 1 bit -> sign, 8 bits -> exponent, 23 bits -> mantissa [the significant digits of the value]

Floating-point numbers are very precise for small numbers, but for larger numbers the rounding errors become worse,
because the same number of representable values fits between 2^0 and 2^1 as between 2^N and 2^(N+1).
So the larger the number, the larger the rounding error.
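
The growing gap between neighbouring values can be checked with the standard library alone. This sketch rounds Python floats (which are float64) to float32 via the struct module; `to_float32` and `float32_spacing` are helper names introduced here, and the spacing formula assumes a positive, normal (non-subnormal) number.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float32_spacing(x):
    """Gap between neighbouring float32 values around a positive normal x."""
    _, exponent = math.frexp(x)    # x = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)  # 23 stored mantissa bits + 1 implicit bit


for n in (0, 10, 24):
    print(f"spacing near 2^{n}: {float32_spacing(2.0 ** n)}")

# Near 2^24 the gap between float32 values is 2, so an odd integer
# such as 2^24 + 1 cannot be stored exactly.
print(to_float32(2.0 ** 24 + 1) == 2.0 ** 24 + 1)
```

The gap doubles every time the number crosses a power of two, which is the "same number of points in every interval" statement above in numeric form.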




float64 is too expensive for matrix operations like multiplication and addition, so float32 is the usual default: it is faster and still precise enough for gradient updates. float16 is faster still, but using it everywhere can lose a lot of information during gradient updates, leading to unstable convergence or wrong learning.

Mixed precision: use float32 where precision could be an issue and float16 where it is not.
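
Why weight updates are one of the places where precision matters can be shown without a GPU. The sketch below emulates float16 and float32 storage with the struct module (format codes "e" and "f") and repeatedly adds a small update to a weight of 1.0; the weight, update size and step count are arbitrary illustrative values.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_updates(cast, weight=1.0, update=1e-4, steps=1000):
    """Repeatedly add a small update, rounding the stored weight each step."""
    for _ in range(steps):
        weight = cast(weight + update)
    return weight


w16 = apply_updates(to_float16)
w32 = apply_updates(to_float32)
print("float16 weight:", w16)
print("float32 weight:", w32)
```

Near 1.0 the gap between float16 values is about 0.001, so an update of 0.0001 is rounded away at every step and the float16 weight never moves, while the float32 weight ends up close to 1.1.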

On modern NVIDIA GPUs, mixed precision can speed up training by up to 3x.


In [None]:
#Mixed precision in practice
keras.mixed_precision.set_global_policy("mixed_float16")
# Most operations then run in float16, except for numerically unstable ones such as softmax, which are kept in float32.
# Keras layers have a variable dtype and a compute dtype, both float32 by default. With mixed precision the compute dtype becomes float16 while the variables stay in float32, so the optimizer's weight updates are applied in float32.


There are two types of parallelism: data parallelism and model parallelism.
Data parallelism: the same model is replicated on several GPUs, and each replica processes a different batch of data; the results are then merged.
Model parallelism: different parts of a single model run on different GPUs and process the same batch of data together.
Model parallelism works best with models that have a naturally parallel structure (several branches, for example). Data parallelism requires the replicas to be kept in sync.


For now, Google Colab provides access to a single GPU only. To get multiple GPUs, we can either install 2-4 GPUs in a local machine, or rent a multi-GPU VM on Google Cloud, Azure or AWS with drivers and software pre-installed.
With TensorFlow Cloud, multi-GPU infrastructure can be used with a single line of code on top of the single-GPU Colab notebook.


In [None]:
#MultiGPU training
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy()
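print(f"Number of devices: {strategy.num_replicas_in_sync}")
with strategy.scope():
    # Create and compile the model inside the scope so that its variables are
    # mirrored on every GPU. get_compiled_model() is a placeholder.
    # model = get_compiled_model()
    pass  # keeps the block valid Python until a model is defined

# train_dataset and val_dataset are placeholders for tf.data.Dataset objects.
#model.fit(train_dataset, epochs = 10, validation_data = val_dataset, callbacks = callbacks)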


MirroredStrategy: single host, meaning the computation runs on one machine with multiple GPUs and not on a GPU cluster.
Synchronous training: the state of the per-GPU model replicas stays the same at all times.
Steps:
1. A batch of data (the global batch) is drawn from the dataset.
2. It is split into one sub-batch (local batch) per GPU, for example four local batches on four GPUs. The local batch size has to be large enough to keep each GPU busy, so a large global batch size is required.
3. Each local batch is passed through the model replica on its GPU: a forward pass, then a backward pass.
4. Each replica computes a weight delta from its local batch.
5. The deltas from all GPUs are merged into a global delta, which is applied to the weights of all replicas. The next batch is taken only after this global update, which keeps the model weights in sync across all replicas.
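
The synchronous update can be simulated on a CPU with a one-parameter model. This is a sketch, not how MirroredStrategy is implemented: the "replicas" are just slices of a Python list, the model is y = weight * x with a mean-squared-error loss, and the merge is a plain average of per-replica gradients.

```python
def gradient(weight, batch):
    """Gradient of the mean squared error for the 1-D model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def synchronous_step(weight, global_batch, num_replicas, learning_rate=0.01):
    # Split the global batch into equally sized local batches.
    local_size = len(global_batch) // num_replicas
    local_batches = [
        global_batch[i * local_size:(i + 1) * local_size]
        for i in range(num_replicas)
    ]
    # Each (simulated) replica computes a gradient on its own local batch.
    local_grads = [gradient(weight, batch) for batch in local_batches]
    # Merge the per-replica results; here a plain average.
    global_grad = sum(local_grads) / num_replicas
    # Every replica applies the same update, so all copies stay identical.
    return weight - learning_rate * global_grad


# Toy data following y = 3x; 8 samples split over 4 simulated replicas.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = synchronous_step(w, data, num_replicas=4)
print(w)
```

With equally sized local batches, the averaged gradient equals the gradient of the whole global batch, so four replicas take the same step a single device would take on the full batch.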


Always provide the data as a tf.data.Dataset object for better performance. Also make sure to cache and prefetch the data so that the GPUs do not sit idle waiting on I/O: use dataset.prefetch(buffer_size). If a good buffer_size is not known, use dataset.prefetch(tf.data.AUTOTUNE).

Ideally, using n GPUs would reduce the runtime by a factor of n. In practice, adding GPUs brings some overhead from merging the weight deltas and synchronizing the model replicas. Typical speedups:
two GPUs - close to 2x
four GPUs - about 3.8x
eight GPUs - about 7.3x
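
The overhead is easier to see as scaling efficiency, i.e. the measured speedup divided by the ideal speedup n. The sketch below just does that division on the figures quoted above, treating the two-GPU case as exactly 2.0.

```python
# Speedups quoted in the notes above, keyed by number of GPUs.
speedups = {2: 2.0, 4: 3.8, 8: 7.3}

# Efficiency = achieved speedup / ideal speedup (the GPU count).
efficiency = {n: s / n for n, s in speedups.items()}

for num_gpus, value in efficiency.items():
    print(f"{num_gpus} GPUs: {speedups[num_gpus]}x speedup, {value:.1%} of ideal")
```

The efficiency drops as GPUs are added, which is the synchronization overhead growing with the number of replicas.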

Beyond GPUs, companies are investing in specialized chips for AI workloads based on ASICs [application-specific integrated circuits]. Google's efforts in this direction led to TPUs (Tensor Processing Units).
Training on TPUs takes some extra work, but on the upside it can be about 15x faster than an NVIDIA P100 GPU and is cost-effective.


For GPUs, just changing the runtime type is sufficient to connect to a GPU in Colab.
For TPUs, an extra piece of code is required:
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
print("Device:", tpu.master())
Using multiple TPU cores is similar to multi-GPU training: a distribution strategy is required. For TPUs this is TPUStrategy, which follows the same template as MirroredStrategy.

In [None]:
#Example code: cannot be run on MAC - use google colab to experiment with TPUs when required
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect() # to connect to TPU runtime cluster
strategy = tf.distribute.TPUStrategy(tpu) # distribution strategy
print(f"Number of replicas: {strategy.num_replicas_in_sync}") # number of model replicas placed in tpus in sync

In [None]:
#a sample model for vision applications
def build_model(input_size):
    inputs = keras.Input((input_size, input_size, 3))
    x = keras.applications.resnet.preprocess_input(inputs) 
    x = keras.applications.resnet.ResNet50(
        weights=None, include_top=False, pooling="max")(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

In [None]:
with strategy.scope():
    model = build_model(input_size=32)

With this piece of code, the model will use the TPU for training.
There is one caveat: unlike with GPUs, the Colab TPU setup involves two VMs. The VM that hosts the notebook runtime is not the machine the TPU lives in, so the model cannot read files from the notebook's local disk during training. There are two options:
1. Keep the data in the memory (RAM) of the VM during training [okay for small datasets].
2. Store the data in a Google Cloud Storage bucket and read it from there instead of from local disk [the only option for large datasets which cannot be held in RAM].

In [None]:
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
model.fit(x_train, y_train, batch_size=1024) # large batch size to make sure all the tpus are well utilized and not idling

I/O bottleneck
A TPU processes data extremely fast, so I/O operations can easily become the bottleneck.
If the dataset is small enough to fit in the VM's memory, call dataset.cache() so that the data is read only once, and prefetch batches so that the next one is ready when the TPU needs it.
If the dataset is large, store it as TFRecord files, an efficient binary storage format that can be loaded quickly.

With TPUs we use very large batch sizes, so each weight update is computed from many data points: there are fewer weight updates per epoch, but each one is more accurate. A larger learning rate is therefore appropriate, so that each update moves the weights by a useful amount.
A TPU can also sit idle between the individual training steps issued by the host. To reduce this, pass the steps_per_execution argument (for example 4 or 8) to compile() so that several training steps run in each execution on the TPU. This can lead to a dramatic speedup, especially for small models.
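
The batch-size point above can be made concrete with a little arithmetic. The update counts follow directly from the 50,000 training images in CIFAR-10. The learning-rate function uses the linear scaling heuristic (scale the rate in proportion to the batch size), which is an assumption added here as one common starting point, not something these notes prescribe; the base rate of 1e-3 at batch size 128 is also made up.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)


def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic (an assumption, to be validated per model)."""
    return base_lr * batch_size / base_batch_size


num_samples = 50_000  # size of the CIFAR-10 training set
for batch_size in (128, 1024):
    print(
        f"batch_size={batch_size}: "
        f"{updates_per_epoch(num_samples, batch_size)} updates per epoch, "
        f"learning rate {scaled_learning_rate(1e-3, 128, batch_size)}"
    )
```

Going from a batch size of 128 to 1024 cuts the number of updates per epoch by roughly a factor of eight, which is why the same learning rate would make much slower progress per epoch.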