# Neural network classification with TensorFlow

As the title states, this notebook is all about training a classification model.

A classification model involves predicting whether something is one thing or the other.

For example, you might want to:
- Predict whether or not someone has heart disease based on their health parameters. This is called binary classification since there are only two options.
- Decide whether a photo is of food, a person or a dog. This is called multi-class classification since there are more than two options.
- Predict what categories should be assigned to a Wikipedia article. This is called multi-label classification since a single article could have more than one category assigned.
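
The three problem types above differ mainly in what the labels look like. The following is a minimal NumPy sketch of the label formats, using made-up class names and values (the variable names are illustrative and not part of this notebook's data):

```python
import numpy as np

# Binary classification: one label per sample, 0 or 1.
binary_labels = np.array([0, 1, 1, 0])

# Multi-class classification: exactly one class per sample.
# Hypothetical classes: 0 = food, 1 = person, 2 = dog.
multiclass_labels = np.array([2, 0, 1, 2])
# One-hot form of the same labels (one column per class).
multiclass_one_hot = np.eye(3, dtype=int)[multiclass_labels]

# Multi-label classification: each sample can have several classes at once.
# Hypothetical categories: [science, history, sport].
multilabel_targets = np.array([
    [1, 1, 0],  # science and history
    [0, 0, 1],  # sport only
    [1, 0, 1],  # science and sport
])

print(multiclass_one_hot)
print(multiclass_one_hot.sum(axis=1))   # every row sums to exactly 1
print(multilabel_targets.sum(axis=1))   # rows may sum to more than 1
```

The row sums show the difference: a multi-class sample belongs to exactly one class, while a multi-label sample can belong to several.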

## Typical architecture of a classification neural network 

The word *typical* is on purpose, because the architecture of a classification neural network can vary widely depending on the problem you're working on.

However, there are some fundamentals all deep neural networks contain:
* An input layer.
* Some hidden layers.
* An output layer.

Much of the rest is up to the data analyst creating the model.

The following are some standard values you'll often use in your classification neural networks.

| **Hyperparameter** | **Binary classification** | **Multiclass classification** |
| --- | --- | --- |
| Input layer shape | Same as number of features (e.g. 5 for age, sex, height, weight, smoking status in heart disease prediction) | Same as binary classification |
| Hidden layer(s) | Problem specific, minimum = 1, maximum = unlimited | Same as binary classification |
| Neurons per hidden layer | Problem specific, generally 10 to 100 | Same as binary classification |
| Output layer shape | 1 (one class or the other) | 1 per class (e.g. 3 for food, person or dog photo) |
| Hidden activation | Usually [ReLU](https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning) (rectified linear unit) | Same as binary classification |
| Output activation | [Sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) | [Softmax](https://en.wikipedia.org/wiki/Softmax_function) |
| Loss function | [Cross entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression) ([`tf.keras.losses.BinaryCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) in TensorFlow) | Cross entropy ([`tf.keras.losses.CategoricalCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) in TensorFlow) |
| Optimizer | [SGD](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) (stochastic gradient descent), [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) | Same as binary classification |
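
To make the output activation rows of the table concrete, here is a small NumPy sketch of sigmoid and softmax applied to made-up raw outputs. The helper functions and the input values are assumptions for illustration; in a Keras model these activations are applied inside the output layer.

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Binary classification: one output neuron, one probability for the positive class.
binary_logit = np.array([0.8])        # hypothetical raw output
print(sigmoid(binary_logit))

# Multiclass classification: one output neuron per class.
multiclass_logits = np.array([2.0, 0.5, -1.0])   # hypothetical raw outputs
probs = softmax(multiclass_logits)
print(probs, probs.sum(), probs.argmax())
```

Sigmoid gives a single probability, which is why the binary output layer needs only one unit, whereas softmax spreads probability across one unit per class.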


## Creating data to view and fit
It's common practice to start with a simple dataset before moving on to the actual data, treating it as a rehearsal.

We can use scikit-learn's `make_circles()` function to create one.

In [1]:
from sklearn.datasets import make_circles
import tensorflow as tf
#make 1000 examples
n_samples = 1000
X,y = make_circles(n_samples,noise =0.03, random_state = 42)

In [2]:
X

In [3]:
y

> One of the very first steps of starting any kind of machine learning project is to *become one with the data*

In [4]:
import pandas as pd
circles  = pd.DataFrame({"X0":X[:,0],"X1":X[:,1],"label":y})
circles.head()

In [5]:
circles.label.value_counts()

We seem to be dealing with a binary classification problem, since the labels only take two values. If there were more label options, we would be dealing with multiclass classification.

In [6]:
import matplotlib.pyplot as plt
plt.scatter(X[:,0],X[:,1],c =y,cmap = plt.cm.RdYlBu)

## Input and output shapes
Let's check out the input and output shapes. Getting these right is one of the most important steps.

In [7]:
X.shape , y.shape

In [8]:
X[0],y[0]

Alright, so we've got two X features which lead to one y value.

This means our neural network's input shape has to accept a tensor with at least one dimension being two, and it has to output a tensor with at least one value.
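
As a rough illustration of how these shapes fit together, the sketch below pushes stand-in data through a single dense layer written by hand in NumPy. The random data and weights are assumptions for illustration only; a Keras `Dense(1)` layer manages its own weights.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the notebook's data: 1000 samples, 2 features each.
X_demo = rng.normal(size=(1000, 2))

# A single dense layer with 1 unit: weights (2, 1) and a bias of shape (1,).
weights = rng.normal(size=(2, 1))
bias = np.zeros(1)

outputs = X_demo @ weights + bias
print(X_demo.shape, "->", outputs.shape)
```

The last dimension of the input (2) has to match the first dimension of the weights, and the number of units (1) sets the last dimension of the output.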

### Steps in modelling
Now we know what data we have as well as the input and output shapes, let's see how we'd build a neural network to model it.

In TensorFlow, there are typically 3 fundamental steps to creating and training a model.

**Creating a model** - piece together the layers of a neural network yourself (using the functional or sequential API) or import a previously built model (known as transfer learning).

**Compiling a model** - defining how a model's performance should be measured (loss/metrics) as well as defining how it should improve (optimizer).

**Fitting a model** - letting the model try to find patterns in the data (how does X get to y).

In [9]:
tf.random.set_seed(42)
model_1 = tf.keras.Sequential([
    tf.keras.layers.Dense(1)
])

model_1.compile(loss = tf.keras.losses.BinaryCrossentropy(),
               optimizer = tf.keras.optimizers.SGD(),
               metrics =['accuracy'])
model_1.fit(X,y, epochs =5)

In [10]:
# since it results in poor accuracy lets train the model longer
model_1.fit(X,y,epochs=199,verbose =0)
model_1.evaluate(X,y)

In [11]:
# we can see that training it longer did not help. Time to improve the model
tf.random.set_seed(42)
model_2 = tf.keras.Sequential([
    tf.keras.layers.Dense(1),
    tf.keras.layers.Dense(1)
])
model_2.compile(loss = tf.keras.losses.BinaryCrossentropy(),
               optimizer = tf.keras.optimizers.SGD(),
               metrics=["accuracy"])
model_2.fit(X,y,epochs =100,verbose =0)

In [12]:
model_2.evaluate(X,y)

Still not a good accuracy. Let's use some other techniques to improve the model.

In [13]:
tf.random.set_seed(42)
model_3 = tf.keras.Sequential()
model_3.add(tf.keras.layers.Dense(100))
model_3.add(tf.keras.layers.Dense(10))
model_3.add(tf.keras.layers.Dense(1))
model_3.compile(loss =tf.keras.losses.BinaryCrossentropy(),
               optimizer = tf.keras.optimizers.Adam(),
               metrics = ['accuracy'])
model_3.fit(X,y,epochs=100,verbose =0)
model_3.evaluate(X,y)

Still no improvement. To get to the root of the problem, let's make some visualizations to see what's happening.
> Whenever a model is performing strangely or isn't yielding results, it's best to visualize, visualize, visualize: the data, the model and the predictions.

To visualize the model's predictions we are going to create a `plot_decision_bound()` function that:
   - takes in a trained model, X and y data as input
   - creates a meshgrid of the different X values
   - makes predictions across the meshgrid
   - plots the predictions and a line between the zones
   

In [14]:
import numpy as np

def plot_decision_bound(model,x,y):
    # define the axis boundaries of the plot and create a meshgrid
    x_min,x_max = x[:,0].min()-0.1, x[:,0].max()+0.1
    y_min,y_max =x[:,1].min()-0.1, x[:,1].max()+0.1
    xx,yy = np.meshgrid(np.linspace(x_min,x_max,100),
                       np.linspace(y_min,y_max,100))
    #create x values
    x_in = np.c_[xx.ravel(),yy.ravel()]# stack 2D arrays together
    
    #make predictions using trained model
    y_pred = model.predict(x_in)
    
    # check if its multiclass
    if len(y_pred[0])>1:
        print("proceeding with multiclass classification")
        y_pred = np.argmax(y_pred,axis=1).reshape(xx.shape)
    else:
        print("doing binary classification")
        y_pred = np.round(y_pred).reshape(xx.shape)
    
    plt.contourf(xx,yy,y_pred,cmap=plt.cm.RdYlBu , alpha =0.7)
    plt.scatter(x[:,0],x[:,1],c =y,s=40,cmap=plt.cm.RdYlBu)
    plt.xlim(xx.min(),xx.max())
    plt.ylim(yy.min(),yy.max())

In [15]:
plot_decision_bound(model_3,X,y)


Looks like our model is trying to draw a straight line through the data.

What's wrong with doing this?

The main issue is our data isn't separable by a straight line.

In a regression problem, our model might work. In fact, let's try it.

In [16]:
tf.random.set_seed(42)
x_reg= np.arange(0,1000,5)
y_reg = np.arange(100,1100,5)
# Split it into training and test sets
X_reg_train = x_reg[:150]
X_reg_test = x_reg[150:]
y_reg_train = y_reg[:150]
y_reg_test = y_reg[150:]

# Recreate the model
model_3 = tf.keras.Sequential([
  tf.keras.layers.Dense(100),
  tf.keras.layers.Dense(10),
  tf.keras.layers.Dense(1)
])

# Change the loss and metrics of our compiled model
model_3.compile(loss=tf.keras.losses.mae, # change the loss function to be regression-specific
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['mae']) # change the metric to be regression-specific

# Fit the recompiled model
model_3.fit(X_reg_train, y_reg_train, epochs=100, verbose =0)
y_pred_reg = model_3.predict(X_reg_test)
plt.figure(figsize=(10,7))
plt.scatter(X_reg_train,y_reg_train,c='b',label ='training data')
plt.scatter(X_reg_test,y_reg_test,c ='g',label ="test data")
plt.scatter(X_reg_test,y_pred_reg.squeeze(),c ='r',label ='predictions')
plt.legend()

Not perfect, but not bad either. The model is learning, so something must be missing for the classification problem.


#### The missing piece: Non-linearity
Okay, so we saw our neural network can model straight lines (with ability a little bit better than guessing).

What about non-straight (non-linear) lines?

If we're going to model our classification data (the red and blue circles), we're going to need some non-linear lines.

In [17]:
tf.random.set_seed(42)
model_4 = tf.keras.Sequential([
    tf.keras.layers.Dense(10,activation='relu'),
    tf.keras.layers.Dense(5,activation = 'relu'),
    tf.keras.layers.Dense(1,activation ="sigmoid")
])

model_4.compile(loss = tf.keras.losses.BinaryCrossentropy(),
               optimizer = tf.keras.optimizers.Adam(),
               metrics = ["accuracy"])
history = model_4.fit(X,y,epochs =100, verbose =0)
model_4.evaluate(X,y)

In [18]:
plot_decision_bound(model_4,X,y)


Nice! It looks like our model is almost perfectly (apart from a few examples) separating the two circles.

> What's wrong with the predictions we've made? Are we really evaluating our model correctly here? Hint: what data did the model learn on and what did we predict on?

Before we answer that, it's important to recognize what we've just covered.

> The combination of linear (straight lines) and non-linear (non-straight lines) functions is one of the key fundamentals of neural networks.

Let's have a look at how activation functions work on simple values.

In [19]:
A = tf.cast(tf.range(-10,10),tf.float32)
A

In [20]:
plt.plot(A)

A straight (linear) line!

Nice, now let's recreate the sigmoid function and see what it does to our data. You can also find a pre-built sigmoid function at `tf.keras.activations.sigmoid`.

In [21]:
def sigmoid(x):
    return 1/(1+tf.exp(-x))

sigmoid(A)

In [22]:
plt.plot(sigmoid(A))
# A non-straight (non-linear) line!
# how about the ReLU function (ReLU turns all negatives to 0 and positive numbers stay the same)?

In [23]:
def relu(x):
    return tf.maximum(0,x)
relu(A)

In [24]:
plt.plot(relu(A))


Another non-straight line!

Well, how about TensorFlow's linear activation function?

In [25]:
tf.keras.activations.linear(A)

In [26]:
A == tf.keras.activations.linear(A)


Okay, so it makes sense now that the model doesn't really learn anything when using only linear activation functions, because the linear activation function doesn't change our input data in any way.

Whereas with our non-linear functions, our data gets manipulated. A neural network uses these kinds of transformations at a large scale to draw patterns between its inputs and outputs.
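
One way to see why stacking layers with only linear activations doesn't help: two linear layers in a row can always be rewritten as a single linear layer. The NumPy sketch below checks this with random, made-up weights (the variable names are illustrative, not taken from the models above).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))                 # 5 samples, 2 features

# Two "layers" with no activation (hypothetical random weights).
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

two_linear_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_linear_layer = x @ W_combined + b_combined

print(np.allclose(two_linear_layers, one_linear_layer))

# With a ReLU between the layers, the two are generally no longer equal.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_linear_layer))
```

However many linear layers are stacked, the result is still a straight-line decision boundary. It is the non-linear activation between layers that lets the network bend it.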


### Evaluating and improving our classification model
If you answered the question above, you might've picked up what we've been doing wrong.

We've been evaluating our model on the same data it was trained on.

A better approach would be to split our data into training, validation (optional) and test sets.

Once we've done that, we'll train our model on the training set (let it find patterns in the data) and then see how well it learned the patterns by using it to predict values on the test set.

Let's do it.

In [27]:
x_train,y_train = X[:800],y[:800]
x_test,y_test = X[800:],y[800:]
x_train.shape,y_train.shape
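
Slicing off the first 800 rows works here because `make_circles()` shuffles its samples by default. For data that arrives in some order, a shuffled split is safer. The sketch below is one hypothetical way to do that in NumPy; `shuffled_split` and the stand-in arrays are assumptions for illustration, and scikit-learn's `train_test_split` offers a ready-made alternative.

```python
import numpy as np

def shuffled_split(features, labels, test_fraction=0.2, seed=42):
    """Shuffle the rows, then split them into train and test portions."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    n_test = int(len(features) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Stand-in arrays with the same shapes as X and y in this notebook.
X_demo = np.arange(2000, dtype=float).reshape(1000, 2)
y_demo = np.arange(1000) % 2

X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Features and labels are indexed with the same shuffled indices, so each sample stays paired with its label.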

In [28]:
tf.random.set_seed(42)
model_5 = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation="relu"), # hidden layer 1, using "relu" for activation (same as tf.keras.activations.relu)
  tf.keras.layers.Dense(5, activation="relu"),
  tf.keras.layers.Dense(1, activation="sigmoid") # output layer, using 'sigmoid' for the output
])
model_5.compile(loss=tf.keras.losses.binary_crossentropy,
                optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), # increase learning rate from 0.001 to 0.01 for faster learning
                metrics=['accuracy'])
history = model_5.fit(x_train, y_train, epochs=25,verbose =0)

In [29]:
loss,acc = model_5.evaluate(x_test,y_test)
print(f"model loss on the test set: {loss}")
print(f"model accuracy: {100*acc:.2f}%")

In [30]:
# Visualize
plt.figure(figsize =(12,6))
plt.subplot(1,2,1)
plt.title("Train")
plot_decision_bound(model_5,x_train,y_train)
plt.subplot(1,2,2)
plt.title("test")
plot_decision_bound(model_5,x_test,y_test)
plt.show()

### Plot the loss curves
Looking at the plots above, we can see the outputs of our model are very good.

But how did our model go whilst it was learning?

As in, how did the performance change every time the model had a chance to look at the data (once every epoch)?

To figure this out, we can check the loss curves (also referred to as the learning curves).

You might've seen we've been using the variable `history` when calling the `fit()` function on a model (`fit()` returns a `History` object).

This is where we'll get the information for how our model is performing as it learns.

In [31]:
pd.DataFrame(history.history)

In [32]:
pd.DataFrame(history.history).plot()
plt.title("model_5 training curves")


This is the ideal plot we'd be looking for when dealing with a classification problem, loss going down, accuracy going up.

### Finding the best learning rate
Aside from the architecture itself (the layers, number of neurons, activations, etc), the most important hyperparameter you can tune for your neural network models is the learning rate.

To do so, we're going to use the following:

- A learning rate callback.
   - You can think of a callback as an extra piece of functionality you can add to your model while it's training.
- Another model (we could use the same ones as above, but we're practicing building models here).
- A modified loss curves plot.


In [33]:
tf.random.set_seed(42)
model_6 = tf.keras.Sequential([
    tf.keras.layers.Dense(4,activation ="relu"),
    tf.keras.layers.Dense(4,activation ="relu"),
    tf.keras.layers.Dense(1,activation ="sigmoid")
])
model_6.compile(loss = tf.keras.losses.BinaryCrossentropy(),
               optimizer = tf.keras.optimizers.Adam(),
               metrics =['accuracy'])
#create a scheduler callback
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-4*10**(epoch/20))
# the learning rate starts at 1e-4 and is multiplied by 10**(1/20) each epoch (so 10x every 20 epochs)
history = model_6.fit(x_train,y_train,epochs=100,callbacks =[lr_scheduler])
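
The schedule passed to the callback can also be inspected on its own, without training anything. The sketch below re-implements the same formula as a plain function; `lr_schedule` and its parameter names are hypothetical, chosen for illustration.

```python
import numpy as np

def lr_schedule(epoch, start_lr=1e-4, epochs_per_decade=20):
    """Learning rate for an epoch: start_lr multiplied by 10 every `epochs_per_decade` epochs."""
    return start_lr * 10 ** (epoch / epochs_per_decade)

epochs = np.arange(100)
lrs = lr_schedule(epochs)

for epoch in (0, 20, 40, 60, 80, 99):
    print(f"epoch {epoch:>2}: learning rate = {lrs[epoch]:.5f}")
```

Over 100 epochs the learning rate sweeps from 1e-4 up to roughly 9, which covers far more than the range that is useful for training and makes the point where the loss blows up easy to spot.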

In [34]:
#checkout the history
pd.DataFrame(history.history).plot(figsize=(10,7),xlabel ="epochs")


As you can see, the learning rate increases exponentially as the number of epochs increases.

And you can see the model's accuracy goes up (and loss goes down) at a specific point while the learning rate slowly increases.

To figure out where this inflection point is, we can plot the loss versus the log-scale learning rate.

In [35]:
# plot the learning rate vs the loss
lrs = 1e-4 * (10 ** (np.arange(100)/20))
plt.figure(figsize=(10,7))
plt.semilogx(lrs,history.history["loss"]) # we want the x-axis (learning rate) to be log scale
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.title("Learning rate vs. loss")

To figure out the ideal value of the learning rate (at least the ideal value to begin training our model), the rule of thumb is to take the learning rate value where the loss is still decreasing but not quite flattened out (usually about 10x smaller than the bottom of the curve).

In this case, our ideal learning rate ends up between 0.01 ($10^{-2}$) and 0.02.
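
The rule of thumb can also be written down as a small calculation. The sketch below uses a made-up loss curve purely for illustration (it is not this notebook's training history), finds the learning rate at the lowest loss, and then steps back by a factor of 10.

```python
import numpy as np

# Learning rates visited by the schedule used in this notebook.
lrs = 1e-4 * 10 ** (np.arange(100) / 20)

# Hypothetical loss curve for illustration only (NOT the notebook's results):
# a smooth bowl in log-space with its minimum near a learning rate of 0.1.
demo_loss = 0.1 + 0.15 * (np.log10(lrs) + 1.0) ** 2

lr_at_min_loss = lrs[np.argmin(demo_loss)]
suggested_lr = lr_at_min_loss / 10   # rule of thumb: roughly 10x smaller

print(f"lowest loss at lr ~ {lr_at_min_loss:.4f}, suggested starting lr ~ {suggested_lr:.4f}")
```

With real training histories the curve is noisy, so this is best treated as a starting point to check against the plot rather than an automatic answer.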

In [36]:
# Example of other typical learning rate values
10**0, 10**-1, 10**-2, 10**-3, 1e-4

In [37]:
# model with 0.02 as the learning rate
tf.random.set_seed(42)

# Create the model
model_7 = tf.keras.Sequential([
  tf.keras.layers.Dense(4, activation="relu"),
  tf.keras.layers.Dense(4, activation="relu"),
  tf.keras.layers.Dense(1, activation="sigmoid")
])

# Compile the model with the ideal learning rate
model_7.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(learning_rate=0.02), # to adjust the learning rate, you need to use tf.keras.optimizers.Adam (not "adam")
                metrics=["accuracy"])

# Fit the model for 20 epochs (5 less than before)
history = model_7.fit(x_train, y_train, epochs=20)

In [38]:
model_7.evaluate(x_test,y_test)

In [39]:
# Plot the decision boundaries for the training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_bound(model_7, x_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_bound(model_7, x_test, y_test)
plt.show()

### More classification evaluation methods

Alongside the visualizations we've been making, there are a number of different evaluation metrics we can use to evaluate our classification models.

| **Metric name/Evaluation method** | **Definition** | **Code** |
| --- | --- | --- |
| Accuracy | Out of 100 predictions, how many does your model get correct? E.g. 95% accuracy means it gets 95/100 predictions correct. | [`sklearn.metrics.accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) or [`tf.keras.metrics.Accuracy()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Accuracy) |
| Precision | Proportion of true positives over the total number of predicted positives (true positives plus false positives). Higher precision leads to fewer false positives (model predicts 1 when it should've been 0). | [`sklearn.metrics.precision_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) or [`tf.keras.metrics.Precision()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Precision) |
| Recall | Proportion of true positives over the total number of true positives and false negatives (model predicts 0 when it should've been 1). Higher recall leads to fewer false negatives. | [`sklearn.metrics.recall_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) or [`tf.keras.metrics.Recall()`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Recall) |
| F1-score | Combines precision and recall into one metric. 1 is best, 0 is worst. | [`sklearn.metrics.f1_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) |
| [Confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) | Compares the predicted values with the true values in a tabular way. If 100% correct, all values in the matrix will lie on the diagonal from top left to bottom right. | Custom function or [`sklearn.metrics.ConfusionMatrixDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html) (`plot_confusion_matrix()` was removed in scikit-learn 1.2) |
| Classification report | Collection of some of the main classification metrics such as precision, recall and f1-score. | [`sklearn.metrics.classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) |
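
To see how these metrics relate to each other, the sketch below computes accuracy, precision, recall and F1-score by hand from the standard true/false positive/negative counts. The labels and predictions are made up for illustration; in practice the scikit-learn functions listed in the table do the same job.

```python
import numpy as np

# Hypothetical ground truth and predictions for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat  = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_hat == 1) & (y_true == 1))   # predicted 1, actually 1
fp = np.sum((y_hat == 1) & (y_true == 0))   # predicted 1, actually 0
fn = np.sum((y_hat == 0) & (y_true == 1))   # predicted 0, actually 1
tn = np.sum((y_hat == 0) & (y_true == 0))   # predicted 0, actually 0

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision only looks at the samples the model predicted as positive, while recall only looks at the samples that really are positive, which is why the two can move in opposite directions.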



In [40]:
loss,acc = model_7.evaluate(x_test,y_test)
print(f"loss on test: {loss}")
print(f"accuracy on test: {acc}")

We can make a confusion matrix using scikit-learn's `confusion_matrix()` function.

In [41]:
from sklearn.metrics import confusion_matrix

y_preds = model_7.predict(x_test)
# this call raises a ValueError: y_preds holds prediction probabilities, not 0/1 labels
confusion_matrix(y_test,y_preds)

In [48]:
y_test[:10],y_preds[:10]

It looks like we need to get our predictions into the binary format (0 or 1).

But you might be wondering, what format are they currently in?

In their current format (9.8526537e-01), they're in a form called **prediction probabilities.**

You'll see this often with the outputs of neural networks. Often they won't be exact values but more a probability of how likely they are to be one value or another.

So one of the steps you'll often see after making predictions with a neural network is converting the prediction probabilities into labels.

In our case, since our ground truth labels (`y_test`) are binary (0 or 1), we can convert the prediction probabilities to their binary form using `tf.round()`.

In [49]:
tf.round(y_preds)[:10]

In [50]:
confusion_matrix(y_test,tf.round(y_preds))
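
Rounding the probabilities amounts to applying a decision threshold of 0.5 (setting aside the exact tie at 0.5). Writing the threshold out explicitly, as in the NumPy sketch below, makes it easy to experiment with other cut-offs. The probabilities here are made up for illustration.

```python
import numpy as np

# Hypothetical prediction probabilities with the same (n, 1) shape that
# a single sigmoid output unit produces.
demo_probs = np.array([[0.985], [0.02], [0.61], [0.47]])

threshold = 0.5
demo_labels = (demo_probs >= threshold).astype(int).squeeze()
print(demo_labels)

# A stricter threshold predicts the positive class less often.
strict_labels = (demo_probs >= 0.7).astype(int).squeeze()
print(strict_labels)
```

Raising the threshold tends to trade recall for precision, which can matter when false positives are costly.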

In [51]:
# lets visualize the same
figs =(10,10)
cm  = confusion_matrix(y_test,tf.round(y_preds))
cm_norm = cm.astype("float")/cm.sum(axis =1)[:,np.newaxis] # normalize it
n_classes = cm.shape[0]

fig,ax = plt.subplots(figsize =figs)
# matrix plot
cax = ax.matshow(cm, cmap =plt.cm.Blues)
fig.colorbar(cax)
# create classes
classes = False
if classes:
    labels = classes
else:
    labels = np.arange(cm.shape[0])
    
ax.set(title ="confusion matrix",
      xlabel ="predicted label",
      ylabel ="true label",
      xticks = np.arange(n_classes),
      yticks = np.arange(n_classes),
      xticklabels = labels,
      yticklabels = labels)

# set x-axis  labels to bottom
ax.xaxis.set_label_position("bottom")
ax.xaxis.tick_bottom()

# adjust label size
ax.xaxis.label.set_size(20)
ax.yaxis.label.set_size(20)
ax.title.set_size(20)

threshold = (cm.max()+cm.min())/2
import itertools
for i, j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
    plt.text(j,i, f"{cm[i,j]} ({cm_norm[i,j]*100:.1f}%)",
            horizontalalignment="center",
            color ="white" if cm[i,j]> threshold else "black", size =15)


### Working with a larger example (multiclass classification)


Say you were a fashion company and you wanted to build a neural network to predict whether a piece of clothing was a shoe, a shirt or a jacket (3 different options).

When you have more than two classes as an option, this is known as multiclass classification.

The good news is, the things we've learned so far (with a few tweaks) can be applied to multiclass classification problems as well.

Let's see it in action.

To start, we'll need some data. The good thing for us is TensorFlow has a multiclass classification dataset known as Fashion MNIST built in, meaning we can get started straight away.

We can import it using the `tf.keras.datasets` module.


In [52]:
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

# The data has already been sorted into training and test sets 
(train_data,train_labels),(test_data,test_labels) = fashion_mnist.load_data()

In [53]:
# Check the shapes of the dataset
train_data.shape, train_labels.shape, test_data.shape, test_labels.shape

So we have 60,000 training examples and 10,000 test examples, each with a shape of (28, 28) and a label.

In [54]:
#lets visualize
import matplotlib.pyplot as plt
plt.imshow(train_data[7])

In [55]:
train_labels[7]

It looks like our labels are in numerical form. And while this is fine for a neural network, you might want to have them in human readable form.

Let's create a small list of the class names 


In [56]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
len(class_names)

In [57]:
# Plot an example image and its label
plt.imshow(train_data[17], cmap=plt.cm.binary) # change the colours to black & white
plt.title(class_names[train_labels[17]]);

In [58]:
# Plot multiple random images of fashion MNIST
import random
plt.figure(figsize=(7, 7))
for i in range(4):
  ax = plt.subplot(2, 2, i + 1)
  rand_index = random.choice(range(len(train_data)))
  plt.imshow(train_data[rand_index], cmap=plt.cm.binary)
  plt.title(class_names[train_labels[rand_index]])
  plt.axis(False)


Alright, let's build a model to figure out the relationship between the pixel values and their labels.

Since this is a multiclass classification problem, we'll need to make a few changes to our architecture (in line with the architecture table above):

- The input shape will have to deal with 28x28 tensors (the height and width of our images).
    - We're actually going to squash the input into a tensor (vector) of shape (784).
- The output shape will have to be 10 because we need our model to predict for 10 different classes.
    - We'll also change the `activation` parameter of our output layer to be `"softmax"` instead of `"sigmoid"`. As we'll see, the `"softmax"` activation function outputs a series of values between 0 & 1 (the same shape as the output shape), which together add up to ~1. The index with the highest value is predicted by the model to be the most likely class.
- We'll need to change our loss function from a binary loss function to a multiclass loss function.
    - More specifically, since our labels are in integer form, we'll use `tf.keras.losses.SparseCategoricalCrossentropy()`. If our labels were one-hot encoded (e.g. they looked something like [0, 0, 1, 0, 0...]), we'd use `tf.keras.losses.CategoricalCrossentropy()`.
- We'll also use the `validation_data` parameter when calling the `fit()` function. This will give us an idea of how the model performs on the test set during training.
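
The difference between the two multiclass losses is only in how the labels are supplied. The NumPy sketch below computes cross-entropy both ways on made-up softmax outputs and shows the two forms agree. This is a hand-written illustration of the formula, not the TensorFlow implementation.

```python
import numpy as np

# Hypothetical softmax outputs for 2 samples over 3 classes (each row sums to 1).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Integer ("sparse") labels and the same labels one-hot encoded.
int_labels = np.array([0, 2])
one_hot_labels = np.eye(3)[int_labels]

# Sparse form: pick the probability of the true class by index.
sparse_ce = -np.log(probs[np.arange(len(int_labels)), int_labels]).mean()

# One-hot form: multiply by the one-hot mask and sum over classes.
categorical_ce = -(one_hot_labels * np.log(probs)).sum(axis=1).mean()

print(sparse_ce, categorical_ce)
print(probs.argmax(axis=1))   # predicted class per sample
```

Since Fashion MNIST labels are integers, the sparse form avoids having to one-hot encode them first.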

In [59]:
tf.random.set_seed(42)
model_8 = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape =(28,28)), # flattening the input layer from 28X28 to 784
    tf.keras.layers.Dense(4,activation ="relu"),
    tf.keras.layers.Dense(4,activation ="relu"),
    tf.keras.layers.Dense(10,activation ="softmax"),
])
model_8.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(),
               optimizer = tf.keras.optimizers.Adam(),
               metrics=['accuracy'])
history = model_8.fit(train_data,train_labels,epochs =10,validation_data =(test_data, test_labels))

In [60]:
model_8.summary()


Alright, our model gets to about 35% accuracy after 10 epochs using a similar style of model to what we used on our binary classification problem.

Which is better than guessing (guessing with 10 classes would result in about 10% accuracy) but we can do better.

Do you remember when we talked about neural networks preferring numbers between 0 and 1? (if not, treat this as a reminder)

Well, right now, the data we have isn't between 0 and 1. In other words, it's not normalized: its pixel values are between 0 and 255.

Let's see.

In [61]:
#lets normalize the data
train_data = train_data/255.0
test_data = test_data/255.0

train_data.min(),train_data.max()

In [62]:
# Set random seed
tf.random.set_seed(42)

# Create the model
model_9 = tf.keras.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)), # input layer (we had to reshape 28x28 to 784)
  tf.keras.layers.Dense(4, activation="relu"),
  tf.keras.layers.Dense(4, activation="relu"),
  tf.keras.layers.Dense(10, activation="softmax") # output shape is 10, activation is softmax
])

# Compile the model
model_9.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                 optimizer=tf.keras.optimizers.Adam(),
                 metrics=["accuracy"])

# Create the learning rate callback
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-3 * 10**(epoch/20))

# Fit the model
find_lr_history = model_9.fit(train_data,
                               train_labels,
                               epochs=40, # model already doing pretty good with current LR, probably don't need 100 epochs
                               validation_data=(test_data, test_labels),
                               callbacks=[lr_scheduler])

In [63]:
lrs = 1e-3*(10**(np.arange(40)/20))
plt.semilogx(lrs, find_lr_history.history["loss"]) # want the x-axis to be log-scale
plt.xlabel("Learning rate")
plt.ylabel("Loss")
plt.title("Finding the ideal learning rate");

In [64]:
# let's make some changes; the optimum learning rate seems to be 0.001
tf.random.set_seed(42)
model_10 = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(100,activation = "relu"),
    tf.keras.layers.Dense(50,activation= "relu"),
    tf.keras.layers.Dense(10,activation ="softmax")
])
model_10.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(),
                optimizer = tf.keras.optimizers.Adam(learning_rate =0.001),
                metrics  =["accuracy"] 
                )
history=model_10.fit(train_data,train_labels,epochs = 50,validation_data=(test_data,test_labels))

In [65]:
pd.DataFrame(history.history).plot()

In [84]:
import itertools
from sklearn.metrics import confusion_matrix

def make_confusion_matrix(y_true,y_pred,classes =None,figsize =(10,10),text_size =15):
    """Makes a labelled confusion matrix comparing predictions and ground truth labels.

  If classes is passed, confusion matrix will be labelled, if not, integer class values
  will be used.

  Args:
    y_true: Array of truth labels (must be same shape as y_pred).
    y_pred: Array of predicted labels (must be same shape as y_true).
    classes: Array of class labels (e.g. string form). If `None`, integer labels are used.
    figsize: Size of output figure (default=(10, 10)).
    text_size: Size of output figure text (default=15).
  
  Returns:
    A labelled confusion matrix plot comparing y_true and y_pred.

  Example usage:
    make_confusion_matrix(y_true=test_labels, # ground truth test labels
                          y_pred=y_preds, # predicted labels
                          classes=class_names, # array of class label names
                          figsize=(15, 15),
                          text_size=10)
    """
    cm = confusion_matrix(y_true,y_pred)
    cm_norm = cm.astype("float")/cm.sum(axis=1)[:,np.newaxis]
    n_classes = cm.shape[0]
    fig,ax = plt.subplots(figsize = figsize)
    cax = ax.matshow(cm,cmap =plt.cm.Blues)# colors will represent how 'correct' a class is, darker == better
    fig.colorbar(cax)
    if classes:
        labels = classes
    else:
        labels = np.arange(cm.shape[0])        
    ax.set(title ="confusion matrix",
              xlabel = "predicted label",
              ylabel = "True label",
              xticks = np.arange(n_classes),
              yticks = np.arange(n_classes),
              xticklabels = labels,
              yticklabels = labels,)
    ax.xaxis.set_label_position("bottom")
    ax.xaxis.tick_bottom()
    threshold = (cm.max()+cm.min())/2.0
    #plot the text on each cell
    for i, j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
        plt.text(j,i,f"{cm[i,j]}({cm_norm[i,j]*100:.1f}%)", horizontalalignment ="center",color ="white" if cm[i,j] > threshold else "black",
                    size = text_size)
        

In [77]:
y_probs = model_10.predict(test_data)
y_probs[:5]

Our model outputs a list of prediction probabilities, meaning, it outputs a number for how likely it thinks a particular class is to be the label.

The higher the number in the prediction probabilities list, the more likely the model believes that is the right class.

To find the highest value we can use the argmax() method.

In [78]:
y_probs[0].argmax()

In [79]:
y_pred = y_probs.argmax(axis =1)

In [80]:
cm = confusion_matrix(y_true = test_labels,y_pred =y_pred)
cm

In [87]:
#lets visualize the above
make_confusion_matrix(test_labels,y_pred,class_names,text_size =8)

In [96]:
import random
def plot_rand_img(model,images,true_labels,classes):
    """Picks a random image, plots it and labels it with a predicted and truth label.

  Args:
    model: a trained model (trained on data similar to what's in images).
    images: a set of random images (in tensor form).
    true_labels: array of ground truth labels for images.
    classes: array of class names for images.
  
  Returns:
    A plot of a random image from `images` with a predicted class label from `model`
    as well as the truth class label from `true_labels`.
    """
    i = random.randint(0,len(images)-1) # randint includes both endpoints, so stop at the last valid index
    target_image = images[i]
    pred_prob = model.predict(target_image.reshape(1,28,28))
    pred_label = classes[pred_prob.argmax()]
    true_label = classes[true_labels[i]]
    
    #plot the target image
    plt.imshow(target_image,cmap = plt.cm.binary)
    
    if pred_label == true_label:
        color = "green"
    else:
        color = "red"
    plt.xlabel("pred: {} {:2.0f}%(True:{})".format(pred_label, 100*tf.reduce_max(pred_prob),true_label),color = color)
    

In [110]:
# Check out a random image as well as its prediction
plot_rand_img(model=model_10, 
                  images=test_data, 
                  true_labels=test_labels, 
                  classes=class_names)

In [111]:
# Find the layers of our most recent model
model_10.layers

In [112]:
#we can access the layers 
model_10.layers[1]

In [113]:
# we can find the pattern learnt at a particular layer
weights , biases = model_10.layers[1].get_weights()
weights, weights.shape

The weights matrix has one row per input value, which in our case is 784 (28x28 pixels), and one column per neuron in the selected layer (our selected layer has 100 neurons), giving it a shape of (784, 100).

Each value in the weights matrix corresponds to how a particular value in the input data influences the network's decisions.

These values start out as random numbers (they're set by the kernel_initializer parameter when creating a layer, the default is "glorot_uniform") and are then updated to better representative values of the data (non-random) by the neural network during training.

In [115]:
biases, biases.shape


Every neuron has a bias value, so each layer has a bias vector with one entry per neuron. Each bias vector is paired with a weights matrix.

The bias values get initialized as zeroes by default (using the `bias_initializer` parameter).

The bias vector dictates how much the patterns within the corresponding weights matrix should influence the next layer.
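
The weight and bias shapes follow directly from the layer sizes, so the number of trainable parameters can be worked out by hand. The sketch below does this for the layer sizes used in `model_10` and also computes the range that Glorot uniform initialization draws from, `sqrt(6 / (fan_in + fan_out))`. It is plain arithmetic in NumPy, not a call into Keras.

```python
import numpy as np

# Layer sizes of model_10 in this notebook: 784 inputs -> 100 -> 50 -> 10.
layer_sizes = [784, 100, 50, 10]

total = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_weights = fan_in * fan_out       # weights matrix has shape (fan_in, fan_out)
    n_biases = fan_out                 # one bias value per neuron
    glorot_limit = np.sqrt(6.0 / (fan_in + fan_out))  # Glorot uniform samples from [-limit, limit]
    total += n_weights + n_biases
    print(f"({fan_in}, {fan_out}): {n_weights} weights + {n_biases} biases, "
          f"initial weights within +/-{glorot_limit:.4f}")

print("total trainable parameters:", total)
```

The total should match the trainable parameter count reported by `model_10.summary()`.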

In [117]:
from tensorflow.keras.utils import plot_model
plot_model(model_10,show_shapes =True)


### How a model learns (in brief)
Alright, we've trained a bunch of models, but we've never really discussed what's going on under the hood. So how exactly does a model learn?

A model learns by updating and improving its weight matrices and bias values every epoch (in our case, when we call the `fit()` function).

It does so by comparing the patterns it's learned between the data and labels to the actual labels.

If the current patterns (weight matrices and bias values) don't result in a desirable decrease in the loss function (higher loss means worse predictions), the optimizer tries to steer the model to update its patterns in the right way (using the real labels as a reference).

This process of using the real labels as a reference to improve the model's predictions is called backpropagation.

In other words, data and labels pass through a model (forward pass) and it attempts to learn the relationship between the data and labels.

And if this learned relationship isn't close to the actual relationship or it could be improved, the model does so by going back through itself (backward pass) and tweaking its weights matrices and bias values to better represent the data.
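
The forward pass, loss, backward pass and update described above can be sketched for the simplest possible model: a single sigmoid neuron trained with plain gradient descent in NumPy. The toy data, learning rate and variable names are assumptions for illustration. Keras computes the gradients automatically and typically updates the weights once per batch rather than once per epoch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy, linearly separable data (hypothetical): label is 1 when x0 + x1 > 0.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

w = np.zeros(2)     # weights
b = 0.0             # bias
learning_rate = 0.5
losses = []

for epoch in range(50):
    # Forward pass: data -> prediction probabilities.
    probs = 1.0 / (1.0 + np.exp(-(X_toy @ w + b)))

    # Loss: binary cross-entropy between predictions and real labels.
    eps = 1e-7
    loss = -np.mean(y_toy * np.log(probs + eps) + (1 - y_toy) * np.log(1 - probs + eps))
    losses.append(loss)

    # Backward pass: gradients of the loss with respect to w and b.
    error = probs - y_toy
    grad_w = X_toy.T @ error / len(X_toy)
    grad_b = error.mean()

    # Optimizer step (plain gradient descent): move against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"loss at epoch 0: {losses[0]:.3f}, loss at epoch 49: {losses[-1]:.3f}")
```

Each pass through the loop nudges the weights in the direction that lowers the loss, which is the same idea the optimizers in this notebook (SGD, Adam) apply to much larger networks.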