## Digit Recognition

In [12]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, LeakyReLU
from tensorflow.keras import Sequential
from tensorflow.keras.activations import sigmoid, relu, linear, softmax
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as GridSpec


In [9]:
tf.random.set_seed(1234) # for consistent results
model = Sequential(
    [               
        tf.keras.Input(shape = (400,)),
        Dense(units = 25, activation = 'relu'),
        Dense(units = 15, activation = 'relu'),
        Dense(units = 10, activation = 'linear')
    ], name = "my_model" 
)

In [10]:
from tensorflow.keras.losses import BinaryCrossentropy

In [None]:
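# Placeholders: load the real data here, X with shape (m, 400) and Y with integer labels 0-9 of shape (m,)
X = np.zeros((1, 400))
Y = np.zeros((1,))
# The 10 linear output units produce logits, so use the sparse categorical loss with from_logits = True
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True))
model.fit(X, Y, epochs=100)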

## Details about training a model

1. Specify how to compute the output given input x and parameters w, b (define the model)
2. Specify the loss and the cost function
3. Train the model to minimize the cost function by changing the parameters

Done by:

1. model = Sequential([])
2. model.compile(loss = BinaryCrossentropy()) (the same loss we used for logistic regression)
3. model.fit(X, y, epochs = 100)


## Alternatives to sigmoid function

ReLU $\longrightarrow$ Rectified Linear Unit
$$g(z) = \max(0, z)$$
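
As a minimal NumPy sketch, the two activations can be compared on a few values of z. The helper names `relu` and `sigmoid` are made up for this illustration and are not library functions.

```python
import numpy as np

def relu(z):
    # g(z) = max(0, z), applied element-wise
    return np.maximum(0, z)

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), shown for comparison
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # negative inputs are clipped to 0, positive inputs pass through
print(sigmoid(z))  # every input is squashed into the interval (0, 1)
```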

## How to choose activation function

#### Output
Choosing the activation function for the output layer is straightforward, because it depends on the type of output that is required. For example:
1) When predicting a probability (binary classification), choose the sigmoid function
2) When predicting a value that can be positive or negative, such as a change in stock price, choose the linear function
3) When predicting a value that can never be negative, such as a house price, choose ReLU

#### Hidden layers
ReLU is the most common choice for hidden layers.

## Why do we need activation functions?

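Because without them every layer computes a linear function of its input, and a composition of linear functions is itself linear. The whole network would then be nothing but a linear regression model, no matter how many layers it has.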
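
The collapse into a single linear model can be checked numerically. The sketch below assumes two layers with linear activation $g(z) = z$ and arbitrary random weights; the names `W1`, `b1`, `W2`, `b2` are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with linear activation g(z) = z (weights are arbitrary, for illustration only)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

x = rng.normal(size=(5, 3))   # 5 examples with 3 features each

# Output of the two-layer network without any non-linear activation
two_layer_out = (x @ W1 + b1) @ W2 + b2

# The same mapping written as one linear model: x @ W + b
W = W1 @ W2
b = b1 @ W2 + b2
one_layer_out = x @ W + b

print(np.allclose(two_layer_out, one_layer_out))
```

Both outputs agree up to round-off, so the extra layer adds no expressive power unless a non-linear activation sits between the layers.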

## Multiclass Classification

Let $a_1, a_2, a_3, a_4$ be the estimated probabilities of the four categories. Then:
$$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$
and similarly for the other estimates.

In general:
$$z_j = \vec{w}_j \cdot \vec{x} + b_j$$
and,
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$$

In [1]:
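def my_softmax(z):
    """Return the softmax of a 1-D array z."""
    m = z.shape[0]
    a = np.zeros(m)
    e_z = np.exp(z)        # e^{z_j} for every j
    total = np.sum(e_z)    # denominator shared by all a_j
    for i in range(m):
        a[i] = e_z[i] / total
    return a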

#### Cost function for softmax regression

Recall the binary cross-entropy loss function. For softmax regression we generalise it to:
$$\text{loss} = 
    \begin{cases}
        -\log(a_1) & \text{if  } y = 1 \\
        -\log(a_2) & \text{if  } y = 2 \\
        & \vdots \\
        -\log(a_N) & \text{if  } y = N \\
    \end{cases}$$
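
A small NumPy sketch of this loss for a single example follows. The helper names `softmax` and `softmax_loss` and the values of `z` are made up for illustration, and the class label is a 0-based index, so class 1 in the formula corresponds to index 0 here.

```python
import numpy as np

def softmax(z):
    # a_j = e^{z_j} / sum_k e^{z_k}
    e_z = np.exp(z)
    return e_z / np.sum(e_z)

def softmax_loss(z, y):
    # y is the true class as a 0-based index; only the true class's probability enters the loss
    a = softmax(z)
    return -np.log(a[y])

z = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax_loss(z, 0))  # smaller loss: the largest z belongs to the true class
print(softmax_loss(z, 3))  # larger loss: the true class received a low probability
```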

## Code

In [14]:
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

In [18]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [19]:
model.compile(loss = SparseCategoricalCrossentropy())

In [None]:
model.fit(X, Y, epochs = 100)

(This code will work, but there is a better version, which we will see later, so hold your horses.)

## Improved Implementation of softmax

Because a computer stores numbers in a limited amount of memory, results carry round-off (floating-point) error, and how large that error is depends on the way you calculate. For example:

In [21]:
x = 2/10000
print(f"{x: .18f}")

 0.000200000000000000


In [22]:
x = (1+ 1/10000) - (1- 1/10000)
print(f"{x: .18f}")

 0.000199999999999978


The way to calculate softmax that reduces these errors is as follows.

__Logistic Regression:__

so instead of: <br>
model.compile(loss = BinaryCrossentropy())

we write: <br>
model.compile(loss = BinaryCrossentropy(from_logits = True))

Instead of first computing the activations such as $a_1 = \frac{1}{1+e^{-z}}$ separately and then plugging them into the loss function, which introduces floating-point error, we put the formula $\frac{1}{1+e^{-z}}$ directly into the logarithmic expression. This lets TensorFlow rearrange the terms and compute the loss with less floating-point error.
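
The effect can be illustrated in plain NumPy. This is only a sketch of one standard algebraic rearrangement of the logistic loss, not a claim about what TensorFlow does internally; the function names `naive_loss` and `loss_from_logits` are invented for the example.

```python
import numpy as np

def naive_loss(z, y):
    # First compute a = sigmoid(z), then plug it into the loss
    a = 1 / (1 + np.exp(-z))
    with np.errstate(divide="ignore"):   # silence the log(0) warning
        return -y * np.log(a) - (1 - y) * np.log(1 - a)

def loss_from_logits(z, y):
    # Algebraically the same loss, rearranged so that log(0) cannot occur
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = 40.0, 0.0
print(naive_loss(z, y))        # inf, because sigmoid(40) rounds to exactly 1.0
print(loss_from_logits(z, y))  # about 40
```

For moderate values of z both versions agree; they differ only where the intermediate value of $a$ can no longer be represented accurately.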

__Softmax Algorithm:__

so instead of: <br>
Dense(units = 10, activation = 'softmax') <br>
model.compile(loss = SparseCategoricalCrossentropy())

we write: <br>
Dense(units = 10, activation = 'linear') <br>
model.compile(loss = SparseCategoricalCrossentropy(from_logits = True))

then we write: <br>
model.fit(X, Y, epochs = 100)

__to predict:__ <br>
logits = model(X) <br>
f_x = tf.nn.softmax(logits)

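Here the logits are the values $z$, not the probabilities, which is why we add the extra line of code that applies softmax.

The same has to be done for logistic regression: the output layer becomes linear, and the sigmoid function is applied to the logits to get the probability.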
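
The step from logits to probabilities to a predicted class can be sketched in NumPy as a stand-in for the softmax call used when predicting. The function `softmax_rows` and the logit values are assumptions made up for this illustration; subtracting the row maximum is a common trick to avoid overflow and is not taken from the notes above.

```python
import numpy as np

def softmax_rows(logits):
    # Subtracting the row maximum leaves the result unchanged but avoids overflow in exp
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    e_z = np.exp(shifted)
    return e_z / np.sum(e_z, axis=1, keepdims=True)

# Hypothetical logits for 2 examples and 3 classes (made-up numbers)
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 999.0]])

f_x = softmax_rows(logits)        # probabilities, each row sums to 1
y_hat = np.argmax(f_x, axis=1)    # predicted class index per example
print(f_x)
print(y_hat)
```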

## Multilabel Classification Problem

Here we can either train three separate multilayer perceptrons, one per task, or train a single neural network that detects all three at once. In the second case the output layer uses a sigmoid activation for each unit, so the output is an n-dimensional vector with one entry per label.
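
A minimal sketch of such an output layer follows, assuming three labels and made-up output values `z`. Unlike softmax, the entries are independent probabilities and do not need to sum to 1.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical output-layer values z for one input and three labels (made-up numbers)
z = np.array([2.5, -1.0, 0.3])

a = sigmoid(z)       # one independent probability per label
labels = a >= 0.5    # each label is decided on its own
print(a)
print(labels)
```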

## Advanced Optimization Algorithms
Better than gradient descent

__Gradient Descent:__
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$$

__Adam Algorithm (Adaptive Moment Estimation):__ <br>
Sometimes a larger learning rate works better and sometimes a smaller one. Adam adjusts this automatically, and it keeps a separate learning rate $\alpha_j$ for every parameter:
$$w_j = w_j - \alpha_j \frac{\partial}{\partial w_j}J(\vec{w}, b)$$
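
How the per-parameter step size arises can be sketched with the standard Adam update rule on a toy cost. The cost function, the starting point and the hyperparameter values below are assumptions chosen for illustration; in practice the Keras optimizer handles all of this.

```python
import numpy as np

def grad(w):
    # Gradient of the toy cost J(w) = (w[0] - 3)^2 + 10 * (w[1] + 1)^2
    return np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])

alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
w = np.zeros(2)
m = np.zeros(2)   # running average of the gradient
v = np.zeros(2)   # running average of the squared gradient

for t in range(1, 2001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)   # bias correction for the zero initialisation
    v_hat = v / (1 - beta2**t)
    # Each parameter w_j gets its own effective step size alpha / (sqrt(v_hat_j) + eps)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # should end up close to the minimum at (3, -1)
```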

In [None]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-3),
              loss = SparseCategoricalCrossentropy(from_logits = True))

### Additional Layer Types
Layer types other than the dense layer.

##### Convolutional layer

Here each neuron of the neural network doesn't look at all the data points, but only at a set region of the input data.
- Faster computation
- Needs less training data and is less prone to overfitting
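
The idea of each unit seeing only a window of the input can be sketched with a tiny one-dimensional example. The function `conv1d_layer`, the window size and the weights are assumptions made up for this illustration; sharing the same weights across all windows is the usual convention for convolutional layers.

```python
import numpy as np

def conv1d_layer(x, w, b):
    # Each output unit sees only a window of len(w) consecutive inputs,
    # and all units share the same weights w and bias b
    k = len(w)
    n_out = len(x) - k + 1
    out = np.zeros(n_out)
    for i in range(n_out):
        out[i] = np.dot(x[i:i + k], w) + b
    return out

x = np.arange(10, dtype=float)     # a toy input signal with 10 values
w = np.array([1.0, 0.0, -1.0])     # made-up weights for a window of size 3
print(conv1d_layer(x, w, b=0.0))   # 8 outputs, one per window position
```

A dense layer with 8 units on the same input would need 80 weights, while this layer needs only 3, which is one reason such layers compute faster and need less training data.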