# Intro to TensorFlow: MNIST Logistic Regression Classifier

Checked 25.02.24 GPaaß

A detailed introduction to TensorFlow is available [here](https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/).

First we have to load a number of libraries.

In [None]:
import os, sys, math
import numpy as np                             # library for vector, matrix, tensor operations
%matplotlib inline
import matplotlib.pyplot as plt                # plotting library
import pandas as pd                            # data handling library
from IPython.display import display, Markdown  # for formatting answers to questions

In [None]:
import tensorflow as tf                        # main tensorflow library
from tensorflow import keras                   # keras on top of tensorflow
from keras.models import Sequential, clone_model
from keras.layers import Flatten, Dense
from keras.losses import SparseCategoricalCrossentropy
print("python version =",sys.version_info)      # check the version of python
print("tensorflow version:", tf.__version__)    # check the version of tensorflow
print("Tensorflow compute devices (CPU, GPU): ")
for dv in tf.config.list_physical_devices():
    print("\t",dv)

`print_mat`: pretty-print a matrix or dataframe

In [None]:
#@title
def print_mat(x, title="", prtDim=True, max_rows=10, max_columns=10, precision=3, doRound=True, index=None, rowNames=None, colNames=None):
    """ use pandas display to print a matrix, tensor or dataframe
        title: to be printed
        prtDim: True  print the shape after the title
        max_rows: number or None
        max_columns: number or None
        precision: number of decimals
        doRound: True  perform rounding (avoid E notation)
        index: None  unused, kept so that existing calls do not break
        rowNames: None  row names
        colNames: None  column names
    """
    import pandas as pd
    import tensorflow as tf
    import numpy as np
    with pd.option_context('display.max_rows', max_rows, 'display.max_columns', max_columns, 'display.precision', precision):
        if tf.is_tensor(x):
            x = x.numpy()                        # convert a tensor to a numpy array
        if doRound:
            x = np.round(x, decimals=precision)
        if title != "":
            if prtDim:
                print(title, x.shape)            # title followed by the shape
            else:
                print(title)                     # title only
        display(pd.DataFrame(x, index=rowNames, columns=colNames))


This function resets all random generators to a given state  $\longrightarrow$ an **identical** stream of random numbers is generated.

In [None]:
import random as python_random
def reset_seeds(num):
  """ reset random number generators """
  np.random.seed(num)
  python_random.seed(num)
  tf.random.set_seed(num+1)
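
The effect of reseeding can be checked without TensorFlow. The following is a minimal sketch, assuming a reduced helper `reset_numpy_and_python_seeds` (a hypothetical name, not part of the notebook) that omits the `tf.random.set_seed` call: after the same seed, both generators produce the same numbers again.

```python
import random as python_random
import numpy as np

def reset_numpy_and_python_seeds(num):
    """Hypothetical reduced variant of reset_seeds without the TensorFlow call."""
    np.random.seed(num)
    python_random.seed(num)

# first stream of random numbers
reset_numpy_and_python_seeds(42)
first_draw = (np.random.rand(3), python_random.random())

# reseed with the same value and draw again
reset_numpy_and_python_seeds(42)
second_draw = (np.random.rand(3), python_random.random())

same_stream = (np.array_equal(first_draw[0], second_draw[0])
               and first_draw[1] == second_draw[1])
print("identical streams:", same_stream)
```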

## Logistic Regression for MNIST Data

### Read the MNIST Data

**Task**: assign each $28\times28$ pixel image to one of the digits $0,\ldots,9$.

The data has the following form:
* input array $x$ of shape $60000\times28\times28$. Each of the 60000 entries is the image of a digit with $28\cdot28=784$ pixels.
* output vector $y$ of length 60000. Each entry is the class index of the corresponding input.

In addition there are 10000 test examples.


In [None]:
from tensorflow import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print("x_train.shape",x_train.shape,"\ty_train.shape",y_train.shape)
print("x_test.shape",x_test.shape,"\ty_test.shape",y_test.shape)
xx = x_train                                        # keep the raw pixel values (0..255) for visualization
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to the range [0, 1]

Print input and output of one example.

In [None]:
### just for visualization ###
itm = 7   # example to print
print("y_train["+str(itm)+"] =",y_train[itm])
#print("x_train[itm,]=",x_train[itm,])
df = pd.DataFrame(xx[itm,])
pd.options.display.max_columns = None  # no column break
print_mat(df,"x_train["+str(itm)+"] =", max_columns=None, max_rows=None)

In [None]:
### just for visualization ###
def showDigit(itm):
    x1=xx[itm]
    xx1 = np.array(x1, dtype='uint8') # array of 8-bits pixels
    xx1 = xx1.reshape((28, 28))        # 28 x 28 array (2-dimensional array)
    print("y_train["+str(itm)+"]=",y_train[itm])
    print("x_train["+str(itm)+"]=")

    plt.imshow(xx1, cmap='gray')
    plt.show()
showDigit(7)
showDigit(1)
showDigit(2)

### Logistic Regression Classifier with MNIST Data

We again use Keras to define the model. The input has a dimension of $60000\times28\times28$.
The model has two layers:
* First layer creates a vector of length $784$ from the $28\times28$ pixel input matrix. <br> The output has dimension $60000\times784$.
* The second layer computes the probabilities of classes by $prb=\text{softmax}(x*A+b)$.
  <br> The output has a dimension of $60000\times10$. Each row is a probability vector.
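
The two layers can be imitated in plain NumPy to make the shapes concrete. This is only a sketch under assumptions: it uses 5 random stand-in images instead of the 60000 training images, and the names `A`, `b` and `prb_demo` as well as the random initial values are chosen for illustration, not taken from Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 28, 28))                  # 5 stand-in "images" with values in [0, 1)
A = rng.normal(scale=0.01, size=(784, 10))   # weight matrix (random, for illustration)
b = np.zeros(10)                             # bias vector

# layer 1: flatten each 28x28 matrix into a vector of length 784
flat = x.reshape(len(x), -1)                 # shape (5, 784)

# layer 2: softmax(x*A + b), computed row by row
logits = flat @ A + b                        # shape (5, 10)
logits = logits - logits.max(axis=1, keepdims=True)   # subtract row maximum for numerical stability
expl = np.exp(logits)
prb_demo = expl / expl.sum(axis=1, keepdims=True)     # each row sums to 1

print(flat.shape, prb_demo.shape)
print(prb_demo.sum(axis=1))
```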

In [None]:
reset_seeds(42)   ## reproducible random parameters

In [None]:
model0 = Sequential([
  Flatten(input_shape=(28, 28)),  # function: convert 28x28 matrix to vector x
  Dense(10,activation='softmax')  # function: out = softmax(x*A + b)
])

This model is a function, which can be applied to the training data.

The result is a probability vector of length 10 for each input.

In [None]:
prb = model0(x_train)   # application to data requires random generation of parameters
print_mat(prb)

#### Loss function
As before define a **loss function**: <br>
The negative log-probability of the whole training set $(x_1,y_1),\ldots,(x_n,y_n)$.
$$ loss(w) =  -\log p(y_1|x_1,w)-\ldots- \log p(y_n|x_n,w)$$
where $p(y_i|x_i,w)$ is the probability of class $y_i$ computed for input $x_i$ with the current parameters $w$. Keras reports the average of these terms, i.e. $loss(w)/n$.
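
A small worked example may help to read the loss values. The sketch below assumes a model that assigns the same probability 0.1 to each of the 10 classes (the names `p_demo` and `y_demo` are made up for illustration). Every example then contributes $-\log 0.1 \approx 2.303$, which is roughly the average loss one would expect from an untrained 10-class classifier.

```python
import numpy as np

p_demo = np.full((4, 10), 0.1)          # 4 examples, uniform class probabilities
y_demo = np.array([3, 0, 7, 9])         # observed class indices

# pick the probability of the observed class for each example
p_observed = p_demo[np.arange(len(y_demo)), y_demo]

loss_sum = -np.log(p_observed).sum()    # loss(w) as defined in the formula
loss_mean = -np.log(p_observed).mean()  # average per example

print("sum :", loss_sum)
print("mean:", loss_mean)
```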

In [None]:
loss_fn = SparseCategoricalCrossentropy(from_logits=False)  # mean of -log p(y_i|x_i,w) for the observed digits; inputs are probabilities, not logits
loss_fn(y_train, prb).numpy()

#### Optimizer: Adam (a Variant of Stochastic Gradient Descent)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)  # use Adam optimizer
model0.compile(optimizer = optimizer, # optimization method as string or optimizer object. alternative is sgd
              loss = loss_fn,         # loss function
              metrics = ['accuracy']) # accuracy is reported for training and validation data

Four common loss functions are:

-    `binary_crossentropy` for binary classification.
-    `categorical_crossentropy` for multi-class classification with one-hot encoded labels.
-    `sparse_categorical_crossentropy` for multi-class classification with integer class labels, as used here.
-    `mse` (mean squared error) for regression to predict a continuous variable.

A list of loss functions is given [here](https://www.tensorflow.org/api_docs/python/tf/keras/losses). A list of optimizers is given [here](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). A list of metrics to measure prediction quality is [here](https://www.tensorflow.org/api_docs/python/tf/keras/metrics).
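
The categorical and the sparse categorical cross-entropy expect different label encodings (one-hot vectors versus integer class indices). The following NumPy sketch, with made-up probabilities and hypothetical variable names, computes the cross-entropy by hand for both encodings and arrives at the same value.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],      # predicted class probabilities, 2 examples
                  [0.1, 0.8, 0.1]])
y_sparse = np.array([0, 1])             # integer class labels
y_onehot = np.eye(3)[y_sparse]          # the same labels, one-hot encoded

# sparse form: look up the probability of the observed class
ce_sparse = -np.log(probs[np.arange(len(y_sparse)), y_sparse]).mean()

# one-hot form: sum over classes, only the observed class contributes
ce_onehot = -(y_onehot * np.log(probs)).sum(axis=1).mean()

print(ce_sparse, ce_onehot)
```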

### Training
Fitting the model requires that you first select the training configuration, such as
* the number of **epochs** (loops through the training dataset) and
* the **batch size** (number of samples used to estimate the gradient for one parameter update).

Training applies the chosen optimization algorithm to minimize the chosen loss function and updates the model parameters using the backpropagation of error algorithm.
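
What epochs and batches mean can be sketched with a hand-written training loop. The code below is not what `fit` does internally: it assumes plain gradient descent instead of Adam, a small synthetic data set, and toy values for `epochs`, `batch_size` and `lr`. It only illustrates that every epoch passes once through the shuffled data and performs one parameter update per batch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 3                          # toy sizes: examples, features, classes
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, k)), axis=1)   # synthetic labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(W, b):
    p = softmax(X @ W + b)
    return -np.log(p[np.arange(n), y]).mean()

W, b = np.zeros((d, k)), np.zeros(k)
epochs, batch_size, lr = 20, 32, 0.5
loss_before = mean_loss(W, b)
n_updates = 0

for epoch in range(epochs):
    order = rng.permutation(n)               # reshuffle the data in each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        p = softmax(X[idx] @ W + b)
        p[np.arange(len(idx)), y[idx]] -= 1.0    # softmax minus one-hot label: gradient w.r.t. the logits
        W -= lr * (X[idx].T @ p) / len(idx)      # average gradient over the batch
        b -= lr * p.mean(axis=0)
        n_updates += 1

loss_after = mean_loss(W, b)
print("updates:", n_updates, " loss before/after:", loss_before, loss_after)
```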

In [None]:
import time
t0 = time.time()
model0.fit(x_train,                             # training set input
           y_train,                             # training set output
           batch_size=60000,                    # number of training instances for optimization: all
           validation_data=(x_test, y_test),    # validation set (optional)
           epochs=150,                          # number of passes through data
           verbose=2)                           # amount of output: 0-2
print("used {0:.1f} sec".format(time.time()-t0))

While fitting the model with `verbose=1`, a progress bar summarizes the status of each epoch and the overall training process. Setting `verbose` to 2, as above, reduces this to a one-line report of model performance per epoch. All output can be turned off during training by setting `verbose` to 0.

### Plot the learning curve

Function to plot loss and accuracy.

In [None]:
#@title
def plot_hist(hist):
    """ plot loss and accuracy curves from a Keras history dictionary """
    fig, ax = plt.subplots(1, 2, figsize=(8,3.3))
    # left panel: loss on training and validation data
    ax[0].title.set_text('Loss')
    ax[0].plot(hist['loss'], label='train loss')
    ax[0].plot(hist['val_loss'], label='validation loss')
    ax[0].set_ylim([0, max(max(hist['loss']), max(hist['val_loss']))])
    ax[0].legend()
    ax[0].set_xlabel('epoch')
    # right panel: accuracy on training and validation data
    ax[1].title.set_text('Accuracy')
    ax[1].plot(hist['accuracy'], label='train accuracy')
    ax[1].plot(hist['val_accuracy'], label='validation accuracy')
    ax[1].set_ylim([min(min(hist['accuracy']), min(hist['val_accuracy'])),1.0])
    ax[1].legend()
    ax[1].set_xlabel('epoch')


In [None]:
plot_hist(model0.history.history)

## Model with smaller batch_size

Compute the gradient not for the whole training set but only for 64 randomly selected elements.

 <font color='red'>**Task 1:**</font>   
 We have a training set of $n=60000$ elements
* The average gradient of the whole training set is $$avg= \frac1n \sum_{i=1}^n -\frac{\partial \log p(y_i|x_i,w)}{\partial w} $$
* Assume $S$ is a random subset of $1,\ldots,n$ containing $|S|$ elements. Then we define
$$avg_S= \frac1{|S|} \sum_{i\in S} -\frac{\partial \log p(y_i|x_i,w)}{\partial w} $$

What is the expected value (mean) of $avg_S$?

What is the consequence for the gradient steps?

In [None]:
ans=""" """

Run next cell to get an answer.

In [None]:
#@title
display(Markdown(
  rf"""
Question 1:
  * $S$ is chosen uniformly at random, so every training example has the same
    chance of being in $S$. By linearity of expectation the expected value of $avg_S$
    is therefore **equal** to $avg$, i.e. $avg_S$ is an unbiased estimate of $avg$.

  * For fixed $w$, $avg$ is a fixed quantity, while $avg_S$ varies with the choice of $S$.
    Its variance (mean square distance from $avg$) grows as $|S|$ gets smaller.

Question 2:
  * Therefore the gradients for a step with $avg$ and $avg_S$ will point **on average** in the same direction.
  * But the gradients of $avg_S$ will have a larger fluctuation around the global average.
"""))
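
The statement can be checked numerically. The sketch below makes strong simplifying assumptions: the per-example gradients are replaced by 6000 random scalars `g`, and the helper `batch_means` (a hypothetical name) draws many random subsets without replacement. The mean of the subset averages should be close to the global average, and the spread should be larger for the smaller subset size.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=1.0, scale=2.0, size=6000)   # stand-in for per-example gradients
avg = g.mean()                                  # average over the whole "training set"

def batch_means(batch_size, trials=2000):
    """Average of g over `trials` random subsets of the given size."""
    return np.array([
        g[rng.choice(len(g), size=batch_size, replace=False)].mean()
        for _ in range(trials)
    ])

means_64 = batch_means(64)
means_1024 = batch_means(1024)

print("global average        :", avg)
print("mean of avg_S, |S|=64  :", means_64.mean(), " std:", means_64.std())
print("mean of avg_S, |S|=1024:", means_1024.mean(), " std:", means_1024.std())
```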

In [None]:
model1 = Sequential([
  Flatten(input_shape=(28, 28)),  # function: convert 28x28 matrix to vector x
  Dense(10,activation='softmax')  # function: out = softmax(x*A + b)
])
opt = tf.keras.optimizers.Adam(learning_rate=0.005)  # alternative optimizer object (not used below)
model1.compile(optimizer = 'adam',     # the string 'adam' uses the default learning rate 0.001; pass opt to use 0.005
              loss = loss_fn,         # loss function
              metrics = ['accuracy']) # for accuracy computation
model1.summary()

Memory footprint (number of values per batch, inputs plus labels):
* batch size 60000: $60000\cdot784 + 60000\cdot1 = 47100000$
* batch size 64: $64\cdot784 + 64\cdot1 = 50240$
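
The same arithmetic as a short sketch. The conversion to megabytes assumes 4 bytes per value (float32), which is an assumption and not stated in the notebook. The helper name `batch_values` is made up for illustration.

```python
def batch_values(batch_size, n_pixels=784):
    """Number of values in one batch: pixel inputs plus one label per example."""
    return batch_size * n_pixels + batch_size

for bs in (60000, 64):
    values = batch_values(bs)
    megabytes = values * 4 / 1e6        # assumption: 4 bytes per value (float32)
    print(f"batch size {bs}: {values} values, about {megabytes:.1f} MB")
```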

In [None]:
reset_seeds(42)   ## reproducible random parameters

In [None]:
%%time
batch_size = 64
epochs=10
updates_per_epoch = math.ceil(x_train.shape[0]/batch_size)   # the last, smaller batch also triggers an update
print("gradient updates_per_epoch",updates_per_epoch,"\n")

model1.fit(x_train,                             # training set input
           y_train,                             # training set output
           batch_size=batch_size,               # number of training instances for each gradient computation: 64
           validation_data=(x_test, y_test),    # validation set (optional)
           epochs=epochs,                       # number of passes through data
           verbose=2)                           # amount of output: 0-2


### Plot

In [None]:
plot_hist(model1.history.history)

In [None]:
x_train.shape[0]

<font color='red'>**Task 2:**</font>   
 How many gradient updates are performed over all epochs?

Run next cell to get an answer.

In [None]:
#@title
display(Markdown(
  rf"""
  * The number of gradient updates per epoch is {x_train.shape[0]}/{batch_size}, rounded up = {math.ceil(x_train.shape[0]/batch_size)} (the last batch is smaller).
  * As we have {epochs} epochs there are **{epochs*math.ceil(x_train.shape[0]/batch_size)} gradient updates**.
"""))

## Evaluation of the Model Performance
The `model.evaluate` method checks the model's performance, usually on a validation set.

The time needed for model evaluation is proportional to the amount of data used for the evaluation. Evaluation is much faster than training, as no gradients are computed and the model is not changed.

In [None]:
performance=model1.evaluate(x_test,  y_test, verbose=2)

Apply the model to a few examples. The predicted class probabilities are usually near 1.0 or 0.0.


In [None]:
n=20
yhat = model1.predict(x_test[:n])   # need a matrix with rows
print_mat(yhat)
print("arg.max = ",np.argmax(yhat,axis=1))
print("y_test  = ",y_test[:n])
print(np.argmax(yhat,axis=1) == y_test[:n])
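
How `argmax` turns probability rows into predicted classes, and how the accuracy follows from the comparison with the true labels, can be seen on a tiny hand-made example. The arrays `yhat_demo` and `y_true_demo` are invented for illustration.

```python
import numpy as np

yhat_demo = np.array([[0.05, 0.90, 0.05],    # probability vectors for 4 examples, 3 classes
                      [0.80, 0.15, 0.05],
                      [0.10, 0.30, 0.60],
                      [0.40, 0.35, 0.25]])
y_true_demo = np.array([1, 0, 2, 1])         # true class indices

pred = np.argmax(yhat_demo, axis=1)          # index of the largest probability per row
correct = pred == y_true_demo                # True where the prediction matches the label
accuracy_demo = correct.mean()               # fraction of correct predictions

print("predicted:", pred)
print("correct  :", correct)
print("accuracy :", accuracy_demo)
```

The last example is misclassified: class 0 has the largest probability, but the true class is 1.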

## Alternative ways to specify a model




### By iteratively adding layers
All layers form an execution sequence.

Very similar to specifying layers as inputs to `Sequential`.


```
model2 = Sequential()
model2.add(layer1)     # add a layer
model2.add(layer2)     # add another layer
...
model2.compile(optimizer= ..., loss= ...)
```

In [None]:
reset_seeds(42)   ## reproducible random parameters
model2 = Sequential()
model2.add(Flatten(input_shape=(28, 28)))           # convert 28x28 matrix to vector x
model2.add(Dense(10, activation='softmax'))         # out = softmax(x*A + b)

model2.compile(optimizer='adam',                    # string or optimizer. alternative is sgd
              loss=SparseCategoricalCrossentropy(), # loss function
              metrics=['accuracy'])                 # for accuracy computation
model2.summary()
model2.fit(x_train,                         # training set input
           y_train,                          # training set output
           batch_size=100,                   # number of training instances for each gradient computation: 100
           validation_data=(x_test, y_test), # validation set (optional)
           epochs=10,                        # number of passes through data
           verbose=1)                        # amount of output: 0-2

In [None]:
performance=model2.evaluate(x_test,  y_test, verbose=2)

### By using variables to communicate results

* A variable can be input to several layers (operators).
* An operator can produce several output variables.
* This makes it possible to specify **complex connection graphs**.


```
inp = tf.keras.Input(shape=inputDim) # input dim without first component
layer1 = Layer1(hyperparams1)  # define a layer
hid1 = layer1(inp)             # specify input / output variables

layer2 = Layer2(hyperparams2)  # define another layer
hid2 = layer2(hid1)            # specify inputs/ output variables

model = tf.keras.Model(inputs=inp, outputs=hid2)  # define inputs, outputs
model.compile(optimizer= ..., loss= ...)
```

This can be shortened to
```
inp = tf.keras.Input(shape=inputDim) # input dim without first component
hid1 = Layer1(hyperparams1)(inp)   # define a layer & input, output

hid2 = Layer2(hyperparams2)(hid1)  # define a layer & input, output

model = tf.keras.Model(inputs=inp, outputs=hid2)  # define inputs, outputs
model.compile(optimizer= ..., loss= ...)
```

In [None]:
inp = tf.keras.Input(shape= x_train.shape[1:])  # input dim (without first dim)
layer1 = Flatten()                        # define a layer
hid1 = layer1(inp)                        # specify inputs / outputs

layer3 = Dense(10, activation='softmax')     # define another layer
hid2 = layer3(hid1)                       # specify inputs / outputs

model3 = tf.keras.Model(inputs=inp, outputs=hid2)


model3.compile(optimizer='adam',                    # string or optimizer. alternative is sgd
              loss=SparseCategoricalCrossentropy(), # loss function
              metrics=['accuracy'])
model3.summary()

In [None]:
model3.fit(x_train,                         # training set input
           y_train,                          # training set output
           batch_size=100,                   # number of training instances for each gradient computation: 100
           validation_data=(x_test, y_test), # validation set (optional)
           epochs=10,                        # number of passes through data
           verbose=1)                        # amount of output: 0-2

In [None]:
performance=model3.evaluate(x_test,  y_test, verbose=2)