# Intro to TensorFlow: Logistic Regression Classifier

Checked 25.02.24 GPaaß

A detailed introduction to TensorFlow is available [here](https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/).

First we have to load a number of libraries.

In [None]:
import os, sys, math
import numpy as np                             # library for vector, matrix, tensor operations
%matplotlib inline
import matplotlib.pyplot as plt                # plotting library
import pandas as pd                            # data handling library
from IPython.display import display, Markdown  # for formatting answers to questions

In [None]:
import tensorflow as tf                        # main tensorflow library
from tensorflow import keras                   # keras on top of tensorflow
from keras.models import Sequential, clone_model
from keras.layers import Flatten, Dense
from keras.losses import SparseCategoricalCrossentropy
print("python version =",sys.version_info)      # check the version of python
print("tensorflow version:", tf.__version__)    # check the version of tensorflow
print("Tensorflow compute devices (CPU, GPU): ")
for dv in tf.config.list_physical_devices():
    print("\t",dv)

## Numpy and TensorFlow: Basic Concepts


Both NumPy and TensorFlow are libraries for processing n-dimensional arrays.
- The values of an array are stored in a contiguous vector.
- The dimension **information** (the shape) is kept separately.
- Both provide a large library of high-level mathematical functions that operate on these arrays.
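
The first two points can be illustrated with a minimal NumPy sketch (the variable names are chosen only for this example): reshaping changes the shape information, while the values stay in the same buffer.

```python
import numpy as np

v = np.arange(6)                  # values 0..5 stored in one contiguous buffer
M = v.reshape(2, 3)               # new shape information, same buffer

print(M.shape)                    # shape of the reshaped array
print(M.flags['C_CONTIGUOUS'])    # values are stored contiguously
print(np.shares_memory(v, M))     # reshape did not copy the values

M[0, 0] = 99                      # writing through the reshaped array ...
print(v[0])                       # ... also changes the original vector
```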



### Matrix Multiplication in Numpy
Matrix multiplication $C = B \cdot A$ <br>
Requirement: number of columns of $B$ == number of rows of $A$

Every command is **immediately executed**!

In [None]:
v = np.array([1, 2, 3, 4])
print("v=" + str(v))
B = v.reshape([2, 2])  # reshape as 2x2 - matrix.

print("B=\n" + str(B))

In [None]:
A = np.matrix([[1.0, 1.0, -1.0], [0.0, 2.0, 3.0]])  # 2x3 matrix
print("A=\n" + str(A))
C = np.dot(B, A)  # matrix product: 2x3 matrix
print("C=\n" +
      str(C))  # The values are available as soon as operation is executed.
print("type(A)=",type(A))
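
The shape requirement can be checked directly. The following sketch uses plain `np.array` objects with the same values as above and shows that the product in the reverse order is rejected because the inner dimensions do not match.

```python
import numpy as np

B = np.array([[1, 2], [3, 4]])                       # shape (2, 2)
A = np.array([[1.0, 1.0, -1.0], [0.0, 2.0, 3.0]])    # shape (2, 3)

C = B @ A                  # inner dimensions match: (2, 2) @ (2, 3) -> (2, 3)
print(C.shape)

try:
    A @ B                  # (2, 3) @ (2, 2): 3 columns vs. 2 rows
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("shape mismatch:", err)
```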

### Matrix Multiplication and Other Computations in Tensorflow
Tensorflow can do arithmetic with **tensors**: vectors, matrices and higher-dimensional arrays.

- Tensors are similar to NumPy ndarray objects.
- tf.Tensor objects have a data type and a shape.
- Additionally, tf.Tensors can reside in accelerator memory (like a GPU).
- TensorFlow offers a rich library of operations (tf.add, tf.matmul, tf.linalg.inv etc.) that consume and produce tf.Tensors.

These operations automatically convert native Python types, for example:

In [None]:
# vector and matrix computations
At = tf.constant([[1.0, 1.0, -1.0], [0.0, 2.0, 3.0]]) ;    print("At=\n",At)
vt = tf.constant([1.0, 2.0, 3.0, 4.0], name="this_is_vt"); print("vt=",vt)
Bt = tf.reshape(vt, [2, 2], name="this_is_Bt");            print("Bt=\n",Bt)
Ct = tf.matmul(Bt, At);                                    print("Ct=\n",Ct)

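The results are the same as with numpy.
- `vt`, `At`, `Bt`, and `Ct` are tensors.
- Each tensor has a data type, a shape, and values. The optional `name` argument is mainly useful when a graph is built.
- TensorFlow 2 executes these operations eagerly by default, so the values are available immediately. A computation graph is only built when code is wrapped in `tf.function`.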


In [None]:
print(tf.add(1.0, 2))                   # add two scalars
print(tf.square(5))                     # compute square
print(tf.reduce_sum([1, 2, 3]))         # compute a sum. Axis= gives dimension to sum over
print(tf.add([1, 2], [3, 4]))           # add two vectors
print(tf.add([1, 2], [3, 4]).numpy())   # add two vectors. .numpy() returns the numpy array.


In [None]:
tf.matmul?

Getting information:

* get completions:  ` tf.m<TAB> `
* get documentation: ` tf.matmul? `
* show python code of the function: ` tf.matmul??`


`print_mat`: pretty-print a matrix or dataframe

In [None]:
#@title
def print_mat(x, title="", prtDim=True, max_rows=10, max_columns=10, precision=3, doRound=True,index=None, rowNames=None, colNames=None ):
    """ use pandas display to print a matrix or dataframe
        title: to be printed
        prtDim: True  print the shape after the title
        max_rows: number or None
        max_columns: number or None
        precision: number of decimals
        doRound: True  perform rounding (avoid E notation)
        index: unused, kept for compatibility
        rowNames: None  row names
        colNames: None  column names
    """
    import pandas as pd
    import tensorflow as tf
    import numpy as np
    with pd.option_context('display.max_rows', max_rows, 'display.max_columns', max_columns, 'display.precision',precision):
        # pd.options.display.max_columns = None
        if tf.is_tensor(x):
            x = x.numpy()
        if doRound:
            x = np.round(x,decimals=precision)
        if title!="":
            if prtDim:
                print(title,x.shape)   # title with dimensions
            else:
                print(title)           # title only
        display(pd.DataFrame(x,index=rowNames, columns=colNames))


This function resets all random generators to a given state  $\longrightarrow$ an **identical** stream of random numbers is generated.

In [None]:
import random as python_random
def reset_seeds(num):
  """ reset random number generators """
  np.random.seed(num)
  python_random.seed(num)
  tf.random.set_seed(num+1)
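
A minimal sketch of why reseeding matters, restricted to the Python and NumPy generators (TensorFlow is left out to keep the example small). The helper `draw` is hypothetical and exists only for this illustration.

```python
import random as python_random
import numpy as np

def draw(seed):
    """Seed both generators, then draw a few numbers from each."""
    python_random.seed(seed)
    np.random.seed(seed)
    return python_random.random(), np.random.normal(size=3)

r1, a1 = draw(4241)
r2, a2 = draw(4241)              # same seed -> identical stream
print(r1 == r2, np.array_equal(a1, a2))
```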

## Logistic Regression for Simple Data
We generate two datasets with two variables for visualization.
### Generate Data with Two Well-separated Classes
First we define the softmax function and two plot routines.

In [None]:
### define the softmax function
### exp(x_1)/(exp(x_1)+...+exp(x_k)) = exp(x_1-mx)/(exp(x_1-mx)+...+exp(x_k-mx))
def softmax(x):
    """Compute softmax values for each row of scores in x."""
    nc = x.shape[1]
    mx = np.max(x,axis=1)                  # compute max of each row of x
    mx = np.repeat(mx[:,np.newaxis],nc,1)  # expand to the shape of x
    x_shift = x-mx                         # subtract maximum (avoid overflow)
    ex = np.exp(x_shift)                   # compute exponential
    ex_sum = np.sum(ex,axis=1)             # sum of rows
    ex_sum = np.repeat(ex_sum[:,np.newaxis],nc,1) # expand to the shape of x
    return ex/ex_sum                       # [exp(x_1),...,exp(x_k)]/(exp(x_1)+...+exp(x_k))
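
Why the maximum is subtracted: the following sketch compares a naive softmax with a shifted one on large scores. The function name `stable_softmax` and the score values are assumptions made for this illustration.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax with the row maximum subtracted first."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    ex = np.exp(shifted)
    return ex / np.sum(ex, axis=1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 1.0, 2.0]])

# naive version: exp(1000) overflows to inf, and inf/inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

probs = stable_softmax(scores)
print(naive[0])    # first row is nan
print(probs)       # both rows give the same probabilities
```

Both rows of `scores` differ only by a constant shift, so the shifted version returns identical probabilities for them.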

In [None]:
colormap = np.array(['b', 'r', 'g'])
colormap[:2]

`plot_points_prob`: function to plot the points of a training set with their class probabilities

In [None]:
#   @title
def plot_points_prob(model, xx, yy, iclass=1, psize = (6.5,6), useMaxProb=False,
                     colormap = np.array(['b', 'r','g'])):
  """ function to plot the points of a training set with their class """
  ngrid=100
  mv=np.ceil(np.max(np.abs(xx))*10)/10
  X, Y = np.meshgrid(np.linspace(-mv, mv, ngrid), np.linspace(-mv, mv, ngrid))
  #print(X.shape,Y.shape)
  xf = X.flatten()
  yf = Y.flatten()
  X_grid = np.column_stack((xf ,yf))
  #print(X_grid.shape)
  Y_pred = model.predict(X_grid)
  nclass = Y_pred.shape[1]
  #print(xf.shape,yf.shape, Y_pred[:,1].shape)

  plt.figure(figsize=psize)
  if useMaxProb:
    plt.title("Maximum probability of a class")
    Yp = Y_pred.max(axis=1)
  else:
    plt.title("Probability of class "+str(iclass))
    Yp = Y_pred[:,iclass]

  Z = np.reshape(Yp,(ngrid,ngrid))
  #print(Z.shape)
  plt.xlabel("x1")   # call the function instead of overwriting it
  plt.ylabel("x2")
  # RdBu, Spectral, coolwarm
  #plt.contour(X, Y, Z, [0.25,0.75], colors='black',linewidths=0.5);
  plt.contour(X, Y, Z, [0.5], colors='green',linewidths=0.7);



  plt.pcolor(X,Y,Z,vmin = 0.0, vmax = 1.0,cmap='coolwarm')
  plt.colorbar()
  plt.scatter(xx[:,0],xx[:,1], c=colormap[yy.astype(int)])
  #plt.plot(x1g, x2g, 'bo')  # plot x and y using blue circle markers

  #plt.plot(x1l, x2l, 'r+')  # plot x and y using blue circle markers

  #plt.scatter([0.1,0.2],[0.1,0.4])

  plt.show()


In [None]:
def plot_points_maxprob(model, xx, yy, join=[], psize = (6,6),
                       colormap = np.array(['b', 'r', 'g'])):
  """Plot training data xx, yy and the class with maximal probability"""
  ngrid=100
  mv=np.ceil(np.max(np.abs(xx))*10)/10
  X, Y = np.meshgrid(np.linspace(-mv, mv, ngrid), np.linspace(-mv, mv, ngrid))
  #print(X.shape,Y.shape)
  xf = X.flatten()
  yf = Y.flatten()
  X_grid = np.column_stack((xf ,yf))
  #print(X_grid.shape)
  Y_pred = model.predict(X_grid)
  nclass=Y_pred.shape[1]
  y_mx=Y_pred.argmax(axis=1)  # class with maximal probability

  ### plot training set
  plt.figure(figsize=psize)
  plt.title("Training Data and  Areas where a Class has Maximum Probability")
  colormap=colormap[:nclass]
  plt.scatter(xx[:,0],xx[:,1], c=colormap[yy.astype(int)])


  plt.xlabel("x1")   # call the function instead of overwriting it
  plt.ylabel("x2")
  plt.scatter(X.flatten(),Y.flatten(), c=colormap[y_mx.astype(int)],alpha=0.05)

  plt.scatter(xx[:,0],xx[:,1], c=colormap[yy.astype(int)])

  plt.show()

Define a dataset with two blobs corresponding to two classes, or with three blobs for three classes.
* Every **row** is an example in the training set.

### Define Simple Dataset

In [None]:
def data_2blobs(n, mean1= [-1.0,-0.5], mean2=[1.0,0.5], vr = 0.4):
    cov1=[[vr,0.0],[0.0,vr]]
    cov2=[[vr,0.0],[0.0,vr]]
    n2 = int(math.ceil(n/2))
    xx1 = np.random.multivariate_normal(mean1, cov1, size=n2)   # first class
    yy1 = np.zeros(n2)
    xx2 = np.random.multivariate_normal(mean2, cov2, size=n2)   # second class
    yy2 = np.ones(n2)
    #print(yy1,yy2,np.concatenate([yy1,yy2]))
    return np.concatenate([xx1,xx2]), np.concatenate([yy1,yy2])

def data_3blobs(n, mean1= [-1.0,-0.5], mean2=[1.0,0.5], mean3=[1.0,-0.5], vr = 0.4):
    cov1=[[vr,0.0],[0.0,vr]]
    cov2=[[vr,0.0],[0.0,vr]]
    cov3=[[vr,0.0],[0.0,vr]]
    n3 = int(math.ceil(n/3))                                    # examples per class
    xx1 = np.random.multivariate_normal(mean1, cov1, size=n3)   # first class
    yy1 = np.zeros(n3)
    xx2 = np.random.multivariate_normal(mean2, cov2, size=n3)   # second class
    yy2 = np.ones(n3)
    xx3 = np.random.multivariate_normal(mean3, cov3, size=n3)   # third class
    yy3 = np.ones(n3)*2
    #print(yy1,yy2,yy3,np.concatenate([yy1,yy2,yy3]))
    return np.concatenate([xx1,xx2,xx3]), np.concatenate([yy1,yy2,yy3])

def data_xor(n, overlap=0.0):
    n4 = int(math.ceil(n/4))
    xx1 = np.column_stack((np.random.uniform(low=0.0-overlap, high=1.0, size=n4),
                           np.random.uniform(low=0.0-overlap, high=1.0, size=n4)))
    yy1 = np.zeros(n4)
    xx2 = np.column_stack((np.random.uniform(low=-1.0, high=0.0+overlap, size=n4),
                           np.random.uniform(low=-1.0, high=0.0+overlap, size=n4)))
    yy2 = np.zeros(n4)
    xx3 = np.column_stack((np.random.uniform(low=-1.0, high=0.0+overlap, size=n4),
                           np.random.uniform(low=0.0-overlap, high=1.0, size=n4)))
    yy3 = np.ones(n4)
    xx4 = np.column_stack((np.random.uniform(low=0.0-overlap, high=1.0, size=n4),
                           np.random.uniform(low=-1.0, high=0.0+overlap, size=n4)))
    yy4 = np.ones(n4)
    return np.concatenate([xx1,xx2,xx3,xx4]), np.concatenate([yy1,yy2,yy3,yy4])


In [None]:
# always yields the same data and parameters
reset_seeds(4241)   ## reproducible random parameters

In [None]:
use_data="3blobs"  # 2blobs, 3blobs, or xor
n_obs=16
if use_data =="2blobs":
    nclass=2
    xx, yy = data_2blobs(n_obs)    # training data
    xx_val, yy_val = data_2blobs(n_obs)        # validation data
if use_data =="3blobs":
    nclass=3
    xx, yy = data_3blobs(n_obs)    # training data
    xx_val, yy_val = data_3blobs(n_obs)        # validation data
if use_data =="xor":
    nclass=2
    xx, yy = data_xor(n_obs)    # training data
    xx_val, yy_val = data_xor(n_obs)        # validation data
print_mat(xx,"xx")
print_mat(yy,"yy")

Plot the training and validation data.

In [None]:
#@title
# Plot the training and validation data
fig, ax = plt.subplots(1, 2,figsize=(8,3.3))
colormap = np.array(['b', 'r', 'g'])[:nclass]
ax[0].title.set_text('Training Data')
ax[0].scatter(xx[:,0],xx[:,1], c=colormap[yy.astype(int)])
ax[1].title.set_text('Validation Data')
ax[1].scatter(xx_val[:,0],xx_val[:,1], c=colormap[yy_val.astype(int)])

### Define the Logistic Regression Model

In [None]:
reset_seeds(42)   ## reproducible random parameters

`Dense(units, activation = f)` generates a layer $ y = f(x*A+b)$. <br>
We use the `softmax` function as activation function $f$.

In [None]:
# define the model
modela = Sequential([            # list of layers. Here is only one layer.
  Dense(nclass,activation='softmax')  # function: out = softmax(x*A + b)
])

In [None]:
Dense?

`modela` is a function. We can **apply** the model to the input data and get the output probabilities.

When the model receives its first inputs, it reads their dimension and **initializes the parameters**: by default the weight matrix is drawn randomly and the bias vector starts at zero.

In [None]:
# compute the prediction for the input xx by the model
yprob=modela(xx)      # yprob = softmax(xx*A + b)
                      # model initializes its parameters
# generate (n x nclass) probability matrix
print_mat(yprob,"yprob = softmax(xx*A + b)")

We can extract the parameters A and b

In [None]:
A=modela.layers[0].weights[0].numpy()             # extract A-matrix
print("----- PARAMETERS -----")
print_mat(A,"A")
b=modela.layers[0].weights[1].numpy()             # extract b-vector
print_mat(b,"b")

#### **Repeat Computations with numpy**
We repeat the computations with numpy to show what `modela` computes.
* The difference between both results should be close to 0.0 (up to floating point rounding).

In [None]:
print("----- INPUT -----")
print_mat(xx,"xx")
print("----- NUMPY COMPUTATIONS -----")
xx_1 = np.dot(xx, A)                        # xx * A
print_mat(xx_1,"xx*A")
xx_2 = xx_1 + b                             # xx*A + b
print_mat(xx_2,"xx*A + b")
prb = softmax(xx_2)                         # softmax(xx*A + b)
print_mat(prb,"prb = softmax(xx*A + b)")
print("----- DIFFERENCE NUMPY - KERAS  -----")
print("max abs difference prb-yprob =",np.max(np.abs(prb-yprob.numpy())))   # difference to value computed by the model

#### **Plot the Data and the Predicted Probability for the Untrained Model**
We plot the data and the model prediction:
* The dots indicate the observed points. Their color indicates the class (blue, red, or green).
* The model predicts a probability for each class.
* The probability of the selected class `iclass` at each position of the grid is shown by a color between blue (0.0) and red (1.0). The scale on the right shows this probability.
* The green contour line marks the positions where this probability is 0.5.

The untrained model is **not able** to predict the data correctly.

In [None]:
plot_points_prob(modela, xx, yy, iclass=1)

In [None]:
plot_points_prob(modela, xx, yy, useMaxProb=True)

In [None]:

plot_points_maxprob(modela, xx, yy, join=[], psize = (6,6),
                       colormap = np.array(['b', 'r', 'g']))

### Train the Logistic Regression Model

In [None]:
modela.summary()

Define a **loss function** for the training set $(x_1,y_1),\ldots,(x_n,y_n)$
$$ loss(w) =  -\log p(y_1|x_1,w)-\ldots- \log p(y_n|x_n,w)$$
where $p(y_i|x_i,w)$ is the probability of the observed class $y_i$ computed for input $x_i$ with the current parameters $w$.
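
A small worked example of this loss in NumPy. The probability matrix and the labels are hypothetical values chosen for illustration. Keras loss objects average over the examples by default, so the value they report corresponds to `loss_mean` rather than `loss_sum`.

```python
import numpy as np

# hypothetical predicted probabilities for n=3 examples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])                        # observed classes

p_observed = probs[np.arange(len(y)), y]       # p(y_i | x_i, w) for each i
loss_sum = -np.sum(np.log(p_observed))         # loss(w) as defined above
loss_mean = loss_sum / len(y)                  # average per example

print(p_observed, loss_sum, loss_mean)
```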

<font size="10" color='red'>**?**</font> 02-c-01  loss function

In [None]:
#@title
from IPython.display import display, Markdown
display(Markdown(
   rf"""
First question:
* $p(y_i|x_i)$ for the $i$-th pair $(x_i,y_i)$ of the training set \\
  is the probability of the observed class $y_i$  \\
  computed for input $x_i$ with the current parameters $w$.

* $p(y_1|x_1, w) * ... * p(y_n|x_n, w)$ is the probability
  of the whole training set, a very small number. \\
  It measures how well the parameters are compatible with the observed data.
* $loss(w)=-\log p(y_1|x_1, w) - ... - \log p(y_n|x_n, w)$ is the negative log
  of the probability of the whole training set

Second question:
* The parameter $w^*$ that minimizes the loss function $loss(w)$ is identical to the parameter
  that maximizes the probability of the whole training set.
"""))

In [None]:
loss_fn = SparseCategoricalCrossentropy(from_logits=False)  # mean negative log-probability of the observed classes y_i

print("value of the loss",loss_fn(yy, yprob).numpy()) # loss_fn can be applied to observation

optimizer = tf.keras.optimizers.SGD(learning_rate=0.04)
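
One practical reason for summing log-probabilities instead of multiplying probabilities is numerical: the product of many probabilities underflows in floating point. The sketch below uses assumed values (2000 examples, each observed class with probability 0.5).

```python
import math

n = 2000
p = 0.5                           # assumed probability of each observed class

product = p ** n                  # probability of the whole set: underflows
neg_log = -n * math.log(p)        # the same information as negative log

print(product)                    # 0.0 in double precision
print(neg_log)                    # a well-behaved number
```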

 <font color='red'>**Task 1:**</font>   Assume have a training set of 60000 elements with 10 classes
* What will be the initial value of the loss function?
* What would be the initial value of the loss function for 100 classes?
* What is the minimal value of the loss function? Is there a maximal value?


In [None]:
ans=""" """

Run next cell for an answer

In [None]:
#@title
from IPython.display import display, Markdown
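display(Markdown(
   rf"""
$loss(w)$ is the negative logarithm of the probability of the whole training set of $n=60000$ examples:
  $$loss(w) = -\log(p(y_1|x_1, w)) - ... - \log(p(y_n|x_n, w))$$
  Assume the initial model assigns a probability of about $1/k$ to each of the $k$ classes.

  As $\log(0.1) \approx {round(math.log(0.1),3)}$ the initial loss value for 10 classes will be around {round(-60000*math.log(0.1),1)}.

  As $\log(0.01) \approx {round(math.log(0.01),3)}$ the initial loss value for 100 classes will be around {round(-60000*math.log(0.01),1)}.

  As $\log(1.0)={round(math.log(1.0),3)}$ the minimal value of the loss function is 0.0. There is no maximal value: the loss grows without bound if the probability of an observed class approaches 0.

  Note: Keras loss functions such as `SparseCategoricalCrossentropy` return the mean over the examples by default, i.e. the values above divided by $n$.
"""))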

### For Illustration: Apply Gradient Descent Step by Step

This code illustrates the inner working of the gradient descent. Usually these details are hidden from the user.

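* The input has dimensions $dim(xx)=(n,2)$
* The output has dimensions $dim(yy)=(n,)$: one class index per example
* The parameter $w=(A,b)$ has $2 \cdot nclass + nclass$ entries (6 for two classes, 9 for three classes) and was initialized when the model was first applied.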

This code performs a loop:

1. Forward propagation  $\qquad$ $prbs = \text{softmax}(xx*A + b)$
1. Compute loss   $\qquad$  $\qquad$  $loss = \text{loss}\_\text{fn}(yy, prbs) = -\log(prbs_{yy})$
1. Compute the gradient  $\qquad$  $\frac{\partial \text{loss}}{\partial w}$
1. Update parameters  $\qquad$  $w := w - \lambda \frac{\partial \text{loss}}{\partial w}$
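
The four steps can also be sketched in plain NumPy, without automatic differentiation. This is an illustration under assumptions, not the notebook's method: the toy data, the learning rate, and all variable names are chosen for this example, and the gradient of the mean loss with respect to the scores is written out by hand as `(prbs - onehot) / n`.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data (assumption): two Gaussian blobs in 2 dimensions
x = np.concatenate([rng.normal(-1.0, 0.5, size=(20, 2)),
                    rng.normal(+1.0, 0.5, size=(20, 2))])
y = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int)])
n, nclass = len(y), 2

A = rng.normal(0.0, 0.1, size=(2, nclass))   # random initial weights
b = np.zeros(nclass)                         # bias starts at zero
lr = 0.5                                     # learning rate (lambda)
onehot = np.eye(nclass)[y]                   # one-hot encoding of the labels
losses = []

for step in range(100):
    # 1. forward propagation: prbs = softmax(x*A + b)
    z = x @ A + b
    z -= z.max(axis=1, keepdims=True)        # shift for numerical stability
    prbs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # 2. loss: mean negative log-probability of the observed classes
    losses.append(-np.mean(np.log(prbs[np.arange(n), y])))
    # 3. gradient of the mean loss with respect to A and b
    dz = (prbs - onehot) / n
    grad_A, grad_b = x.T @ dz, dz.sum(axis=0)
    # 4. gradient descent update: w := w - lambda * gradient
    A -= lr * grad_A
    b -= lr * grad_b

accuracy = np.mean(prbs.argmax(axis=1) == y)
print(losses[0], losses[-1], accuracy)
```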

In [None]:
import copy
epochs = 250
nprint = 5
models = []
loss_arr = []
plot_int = 15
for epoch in range(epochs):
    print(f"\n------------------ Start of epoch {epoch} ------------------")
    if epoch%plot_int==0:
      models.append(copy.deepcopy(modela))  # deep copy of the model

    with tf.GradientTape() as tape:
        # --- FORWARD PROPAGATION ---
        # The transformation of inputs will be recorded on the GradientTape.
        prbs = modela(xx, training=True)  # = softmax(xx*A+b)
        if epoch<nprint:
            print("xx",xx)
            print("prbs=softmax(xx*A+b)",prbs.numpy())
            print("yy",yy)

        # --- COMPUTE LOSS ---
        loss_value = loss_fn(yy, prbs)
        print("loss_value",loss_value.numpy())
        loss_arr.append(loss_value.numpy())
    # -------------- COMPUTE GRADIENTS ---------------
    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss.
    # --- Compute Gradients ---
    grads = tape.gradient(loss_value, modela.trainable_weights)

    if epoch<nprint:
        for zi in zip(grads, modela.trainable_weights):
            print("grad=\n",zi[0].numpy(),"\nweight=\n", zi[1].numpy())

    # --- UPDATE PARAMETERS ---
    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, modela.trainable_weights))  # same variables as used for the gradients
    if epoch<nprint:
        for ww in modela.weights:
            print("updated weight",ww.numpy())

models.append(copy.deepcopy(modela))  # deep copy of the final model
loss_arr.append(loss_value.numpy())

In [None]:
print("--------- weights at start --------------")
print("models[0].weights",models[0].weights)
print("--------- final weights  --------------")
print("models[-1].weights",models[-1].weights)

In [None]:
colormap = np.array(['r', 'g'])
plt.title('Loss')
plt.ylim(0.0,max(loss_arr))
plt.plot(loss_arr, label='train loss')
plt.legend()
plt.show()

In [None]:
plot_points_prob(modela, xx, yy, useMaxProb=True)

In [None]:
plot_points_maxprob(modela, xx, yy, join=[], psize = (6,6),
                       colormap = np.array(['b', 'r', 'g']))

This series of plots shows how the probabilities are trained.

In [None]:
for i in range(len(models)):
  ep = min(i*plot_int, epochs)      # epoch at which models[i] was stored
  print("Epoch "+str(ep)+" Loss="+str(loss_arr[ep]))
  #plot_points_prob(models[i], xx, yy)
  plot_points_maxprob(models[i], xx, yy, join=[], psize = (3,3),
                       colormap = np.array(['b', 'r', 'g']))
plt.show()

In the end the **separating hyperplanes** between the class regions should separate the training points of the different classes with few or no errors.
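
For two classes the boundary is the set of points where both classes have probability 0.5, which is a straight line. The sketch below derives it from hypothetical parameter values `A` and `b` (assumptions, not the trained values of the notebook).

```python
import numpy as np

# hypothetical trained parameters for 2 inputs and 2 classes
A = np.array([[-1.5, 1.5],
              [-0.7, 0.7]])
b = np.array([0.1, -0.1])

w = A[:, 1] - A[:, 0]          # normal vector of the boundary
c = b[1] - b[0]                # offset

def prob_class1(x):
    """p(class 1 | x) of a 2-class softmax, equal to sigmoid(w.x + c)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + c)))

# every point on the line w.x + c = 0 gets probability 0.5
x1 = 0.4
x2 = -(w[0] * x1 + c) / w[1]
p_boundary = prob_class1(np.array([x1, x2]))
print(p_boundary)
```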

### Train Model with Keras
Reset the model

In [None]:
reset_seeds(42)   ## reproducible random parameters
nclass = int(max(yy))+1   # yy holds floats; the number of units must be an integer
print("nclass=", nclass)

In [None]:
# define the model
modelb = Sequential([            # list of layers. Here is only one layer.
  Dense(nclass, activation='softmax')  # function: out = softmax(x*A + b)
])

In [None]:
optimizer = tf.keras.optimizers.legacy.SGD(learning_rate=0.5)   # compatibility problem
modelb.compile(optimizer = optimizer,     # optimization method as string or optimizer object. alternative is sgd
               loss = loss_fn,         # loss function
               metrics = ['accuracy']) # for accuracy computation

In [None]:
import time
t0 = time.time()
historyb=modelb.fit(xx,                            # training set input
                    yy,                            # training set output
                    batch_size=xx.shape[0],        # number of training instances for optimization: all
                    validation_data=(xx_val, yy_val),    # validation set (optional)
                    epochs=50,                          # number of passes through data
                    verbose=2)                           # amount of output: 0-2
print("used {0:.1f} sec".format(time.time()-t0))


In [None]:
#modelb.history.history
performance=modelb.evaluate(xx_val,  yy_val, verbose=2)
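
The accuracy metric counts how often the class with maximal probability equals the observed class. A minimal NumPy sketch with hypothetical probabilities and labels:

```python
import numpy as np

# hypothetical predicted probabilities for 4 examples and 3 classes
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.4, 0.5, 0.1]])
y_true = np.array([0., 1., 2., 0.])        # labels stored as floats, as in yy

y_pred = probs.argmax(axis=1)              # class with maximal probability
accuracy = np.mean(y_pred == y_true.astype(int))
print(y_pred, accuracy)
```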

In [None]:
historyb.history


In [None]:
#@title Plot loss and accuracy
def plot_hist(hist):
    fig, ax = plt.subplots(1, 2,figsize=(8,3.3))
    colormap = np.array(['r', 'g'])
    ax[0].title.set_text('Loss')
    ax[0].plot(hist['loss'], label='train loss')
    ax[0].plot(hist['val_loss'], label='validation loss')
    ax[0].set_ylim([0, max(max(hist['loss']), max(hist['val_loss']))])
    #ax[0].scatter(xx[:,0],xx[:,1], c=colormap[yy.astype(int)])
    ax[0].legend()
    ax[0].set_xlabel('epoch')
    #ax[0].set_ylabel('loss')
    ax[1].title.set_text('Accuracy')
    ax[1].plot(hist['accuracy'], label='train accuracy')
    ax[1].plot(hist['val_accuracy'], label='validation accuracy')
    ax[1].set_ylim([min(min(hist['accuracy']), min(hist['val_accuracy'])),1.0])
    ax[1].legend()
    ax[1].set_xlabel('epoch')
#ax[1].set_ylabel('accuracy')
plot_hist(historyb.history)
#plot_points_prob(modelb,xx)

In [None]:
plot_points_prob(modelb, xx, yy, useMaxProb=True)
plot_points_maxprob(modelb, xx, yy, join=[], psize = (6,6),
                       colormap = np.array(['b', 'r', 'g']))

In [None]:
#@title
from IPython.display import display, Markdown
display(Markdown(
   rf"""
$loss(w)$ is the negative logarithm of the probability of the whole training set.
  $$loss(w) = -\log(p(y_1|x_1, w)) - ... - \log(p(y_n|x_n, w))$$
  As $\log(0.1) \approx -2.3$ the initial loss value for 10 classes and $n=60000$ examples will be around 138000.

  As $\log(0.01) \approx -4.6$ the initial loss value for 100 classes and $n=60000$ examples will be around 276000.
"""))