# Feedforward NN with TF/KS
Always start by importing TensorFlow (tf) and Keras (ks).

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

To do machine learning with TF/KS, you will likely need to define, save, and restore a model.
A model is (in the abstract):
<ol>
    <li> A parameterized function that computes something on tensors assigned as values to the function's variables (a forward pass).</li>
    <li> The function can be updated (trained) when given examples and a loss function, i.e. the cost of not matching the target variable's value in the examples.</li>
    <li> To update the function, some of its variables are updated using the loss on previous training instances:
        <ol>
            <li> some variables are trainable, others are non-trainable.</li>
        </ol>
    </li>
    <li> Updates are a set of functions (one per network weight) that take loss tensors as input.</li>
</ol>
Models are made of layers. 
<ol>
<li>Layers are functions with a known mathematical structure that can be reused and have trainable variables.</li>
<li>In TF, most high-level implementations of layers and models, including Keras, are built on the same foundational class: tf.Module.</li>
<li>A tf.Module is a named container for tf.Variables, other tf.Modules, and functions that apply to user input. In Python terms, tf.Module is an essentially empty container class that is subclassed to create a useful class.</li>
<li>Any class that inherits from tf.Module has internal state, and methods that use that state.</li>
</ol>
So models and layers are classes and objects in Python.
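
The points above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the class name `Scale`, the call counter, the target value 9.0 and the learning rate 0.1 are all assumptions chosen to keep the arithmetic easy to follow. It shows a `tf.Module` with one trainable and one non-trainable variable, a forward pass, and a single gradient update driven by a squared-error loss.

```python
import tensorflow as tf

class Scale(tf.Module):
    """Hypothetical one-weight module: y = w * x."""
    def __init__(self, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(2.0, name="w")                         # trainable
        self.calls = tf.Variable(0, trainable=False, name="calls")  # state only

    def __call__(self, x):
        self.calls.assign_add(1)   # bookkeeping, not touched by training
        return self.w * x          # forward pass

layer = Scale()
with tf.GradientTape() as tape:
    # Squared error between the prediction for x = 3 and the target 9
    loss = (layer(tf.constant(3.0)) - 9.0) ** 2
grads = tape.gradient(loss, layer.trainable_variables)
layer.w.assign_sub(0.1 * grads[0])   # one gradient-descent step

print("updated w:", layer.w.numpy())
print("trainable:", [v.name for v in layer.trainable_variables])
print("all:", [v.name for v in layer.variables])
```

Only `w` is returned by `trainable_variables`, so only `w` is changed by the update; the counter is part of the module's state but is never trained.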

## Layers and Models in TF


In [None]:
class MyModule(tf.Module):
    def __init__(self,name=None):
        super().__init__(name=name)
        self._w_variable  = tf.Variable(1., shape=tf.TensorShape(None))
        self._b_variable = tf.Variable(0.,shape=tf.TensorShape(None))
        self.__first_run=tf.Variable(0, shape=[],trainable=False)
       
    def __call__(self, x, w=None, b=None):
        # On the first call with explicit weights, fix the variables' shapes and values
        if w is not None and self.__first_run == 0:
            if len(w.shape) == 1:
                # Promote a 1-D weight vector to a column matrix
                w = tf.reshape(w, [w.shape[0], 1])
            self.__first_run.assign(1)
            self._w_variable.assign(w)
            if b is not None:
                self._b_variable.assign(b)
        return tf.matmul(self._w_variable, x) + self._b_variable

Notice that, in Python terms, MyModule is a class derived from tf.Module:
<ul>
    <li> It inherits its initialization from tf.Module (which handles the optional name) and adds our own variables on top of it.</li>
    <li> MyModule is a callable class, i.e. it has a __call__ method, so an instantiated object can be called like a function to run the computation, without naming a method.</li>
</ul>
Also, I have not assigned shapes to the variables; that happens on the first run, so an object instantiated from this module can have any number of neurons. In the cell below the input is a 2-dimensional vector and there are 3 neurons with weights (1,2), (2,3) and (3,4) respectively.

Now that I have a module that does the forward computation, I can call it.

In [None]:
my_module = MyModule()
w_0 = tf.constant([[1.0,2.0],[2.0,3.0],[3.0,4.0]])
b_0 = [[1.0],[2.0],[1.0]]
x=[[1.0],[1.0]]
p=my_module(x,w_0,b_0).numpy()
print("forward computation:",p)
print("trainable variables:", my_module.trainable_variables)
print("all variables:", my_module.variables)
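
The forward computation in the cell above can be cross-checked by hand. The following sketch repeats the same arithmetic, W·x + b, in plain NumPy with the same values of `w_0`, `b_0` and `x`; it is only a sanity check and does not use `MyModule`.

```python
import numpy as np

w_0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])  # 3 neurons, 2 inputs each
b_0 = np.array([[1.0], [2.0], [1.0]])                 # one bias per neuron
x = np.array([[1.0], [1.0]])                          # 2-dimensional input column

p_check = w_0 @ x + b_0   # each row: dot(weights, x) + bias
print(p_check)            # rows: 1+2+1, 2+3+2, 3+4+1
```

Each of the three rows of the result is the output of one linear neuron, so the output is a column of 4, 7 and 8.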

Effectively, MyModule defines a class that is a layer of linear neurons. Now I can create a model that consists of two linear layers connected sequentially, so that the output of the first layer is fed into the input of the second. If the second layer is the last one I need, its output gives the model's output variables.

In [None]:
class MyModel(tf.Module):
    def __init__(self, name=None):
        super().__init__(name=name)
        self._layer_1 = MyModule()
        self._layer_2 = MyModule()
    
    def __call__(self, x, w1 = None, b1 = None, w2 = None, b2 = None):
        if w1 is None:
            y_1 = self._layer_1(x)
            return self._layer_2(y_1)
        else:
            y_1 = self._layer_1(x,w1,b1)
            return self._layer_2(y_1,w2,b2)

Now I can instantiate my model 

In [None]:
my_model=MyModel()
w_1 = tf.constant([[1.0,2.0],[2.0,3.0],[3.0,4.0]])
b_1 = [[1.0],[2.0],[1.0]]
w_2 = tf.constant([[1.0,2.0,3.0],[3.0,4.0,5.0]])
b_2 = [[1.0],[1.0]]
x=[[1.0],[1.0]]
y_hat=my_model(x,w_1,b_1,w_2,b_2).numpy()
print("forward computation:",y_hat)
print("trainable variables:", my_model.trainable_variables)
print("all variables:", my_model.variables)
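
One observation about this two-layer model is worth checking numerically: because both layers are purely linear, their composition is itself a single linear layer with weights W2·W1 and bias W2·b1 + b2. The NumPy sketch below uses the same values as the cell above; the variable names `w_combined` and `b_combined` are introduced here for illustration only.

```python
import numpy as np

w_1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
b_1 = np.array([[1.0], [2.0], [1.0]])
w_2 = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
b_2 = np.array([[1.0], [1.0]])
x = np.array([[1.0], [1.0]])

# Layer by layer, as MyModel does it
y_1 = w_1 @ x + b_1
y_hat_check = w_2 @ y_1 + b_2

# The same map collapsed into one linear layer
w_combined = w_2 @ w_1
b_combined = w_2 @ b_1 + b_2
y_hat_single = w_combined @ x + b_combined

print(y_hat_check.ravel(), y_hat_single.ravel())
```

Both computations give the same vector (43, 81), which is why the Keras models later in this notebook put a nonlinear activation between layers.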


## Layers and Models in Keras
So far we have done everything in TF, the 'assembler' of NNs. It is much easier to do it in Keras, an NN library built on top of TF that already has many of the necessary concepts prebuilt. The main concept in KS is the 'layer', which has its own class. A layer encapsulates both a state (the "weights" and "bias") and a transformation of inputs to outputs (a "call", the layer's forward pass).

The example below defines a linear layer in which
<ul>
    <li>the default number of neurons is 32, and the default number of inputs is 32 as well;</li>
    <li>unlike in the previous TF example, the weight tensor is 2D, and its size is fixed at object instantiation by passing the number of neurons and the number of inputs;</li>
    <li>the same holds for the bias: there are as many bias values as units (neurons), initialized with 0's;</li>
    <li>the initialization procedure sets the initial weights w_init to samples from a random normal distribution.</li>
</ul>
The output is the same forward computation as before with TF, except that each input is now a row, so the product is inputs times weights.

In [None]:
class Linear(keras.layers.Layer):
    def __init__(self, units=32, input_dim=32):
        super(Linear, self).__init__()
        w_init = tf.random_normal_initializer()
        self.__w = tf.Variable(
            initial_value=w_init(shape=(input_dim, units), dtype="float32"),
            trainable=True,
        )
        b_init = tf.zeros_initializer()
        self.__b = tf.Variable(
            initial_value=b_init(shape=(units,), dtype="float32"), trainable=True
        )

    def call(self, inputs):
        # Keras' own __call__ dispatches to call(), so we override call() instead
        return tf.matmul(inputs, self.__w) + self.__b

We can now compute with this layer as before. So far there is not much difference from TF, except that we inherited all the machinery of a KS layer.

In [None]:
x = tf.ones((2, 2))
linear_layer = Linear(4, 2)
y = linear_layer(x)
print(y)

We assumed that the weights are matrices, but they could be of any shape. We could use the TF approach of not declaring the shape and then initializing on the first call. However, KS offers a better method: we specify only the number of units in the layer and then use the 'build' method to receive the input shape and create the weights. Remember, shapes are [ ] for a constant (scalar), [k] for a vector of size k, [m,n] for a matrix, etc. Calling add_weight with a standard initializer creates and initializes the weights. In the example I assume that the input is a matrix, so the weights form a matrix as well.
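
The shape conventions above can be tried out in NumPy, which uses the same notation as TF. This is a small sketch with made-up sizes (2 examples, 3 features, 4 units); it also shows why the weight matrix has shape (input_dim, units) and how a bias of shape (units,) is broadcast over the rows of a batch.

```python
import numpy as np

scalar = np.array(3.0)        # shape ()      : a constant
vector = np.zeros(5)          # shape (5,)    : a vector of size 5
matrix = np.zeros((2, 3))     # shape (2, 3)  : a 2 x 3 matrix

# A batch of 2 examples with 3 features each, fed to a layer of 4 units
x = np.ones((2, 3))
w = np.full((3, 4), 0.5)          # (input_dim, units)
b = np.arange(4, dtype=float)     # (units,), added to every row of the batch
y = x @ w + b                     # result has shape (batch, units)

print(scalar.shape, vector.shape, matrix.shape, y.shape)
print(y)
```

The last axis of the input, here 3, is the only part of the input shape the weights depend on, which is what `input_shape[-1]` picks out in a `build` method.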

In [None]:
class Linear1(keras.layers.Layer):
    def __init__(self, units=32):
        super(Linear1, self).__init__()
        self.__units = units

    def build(self, input_shape):
        self.__w = self.add_weight(
            shape=(input_shape[-1],self.__units),
            initializer="random_normal",
            trainable=True,
        )
        self.__b = self.add_weight(
            shape=(self.__units,), initializer="random_normal", trainable=True
        )
        
    def call(self, x):
        # Forward pass: one row of x per example, one output column per unit
        return tf.matmul(x, self.__w) + self.__b

To set the dimensions (i.e. to trigger build) we just call the layer object. Notice that Keras invokes build and call automatically from its own __call__, so it needs access to them under their standard names: unlike the private attributes, these two methods cannot be hidden behind leading underscores.

In [None]:
x = tf.constant([[2.0,1.0],[1.0,2.0]])
#y_t=[[1.0],[1.0]]
linear_layer = Linear1(4)
y = linear_layer(x)
print(y)
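
Built-in Keras layers follow the same deferred-build pattern as Linear1. The sketch below, which assumes the standard `tf.keras.layers.Dense` layer and an arbitrary input of 2 examples with 3 features, checks that no weights exist before the first call and that their shapes are derived from the input afterwards.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(4)        # only the number of units is given
weights_before = len(dense.weights)     # nothing has been created yet

y = dense(tf.ones((2, 3)))              # first call triggers build() with the input shape
shapes_after = [tuple(w.shape) for w in dense.weights]

print("weights before first call:", weights_before)
print("weight shapes after first call:", shapes_after)
print("output shape:", tuple(y.shape))
```

The kernel takes its first dimension from the last axis of the input, exactly as `input_shape[-1]` does in Linear1.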

As in TF, layers are composed into a model.
## What needs to be done to define Keras Model
<ul>
    <li>We need to define how the output of one layer is fed into another.</li>
    <li>The Model class is used to define the object we train, so it must provide a gradient descent method, which we need to configure.</li>
    <li>The Model class has the same interface as Layer, with the following differences:
        <ul> 
            <li>It has built-in training, evaluation, and prediction loops model.fit(), model.evaluate(), model.predict().</li>
            <li>It exposes the list of its inner layers through the model.layers property.</li>
            <li>It allows for saving and serialization using save(), save_weights(), etc. methods</li>
        </ul>
    </li>
 </ul>
 
### How we do it

<ul>
    <li>There is a wide library of predefined layers, so there is usually no need to define your own. Once layers are defined, they are composed into a model directly in Keras, as long as each layer's output is defined. There are two ways to define a model:
 <ul>
    <li> Sequential class. It is for stacks of layers, where the output of the lower layer on the stack is the input of the next layer. This is the most common network architecture by far. It is declared with
        <ul>
            <li>model = models.Sequential()</li>
        </ul>
    </li>
    <li>Functional API. It allows for directed acyclic graphs of layers, which lets you build arbitrary architectures. It is declared with
        <ul>
            <li>model = models.Model(inputs=input_tensor, outputs=output_tensor)</li>
        </ul>  
        along with the declaration of how layers are connected.</li>
    </ul>
        We will get back to the functional API many classes later.</li>
    <li> There is no need to define gradient descent for each layer in the model as long as one of the standard neurons is used in each layer of the model.</li> 
    <li>There is no need to specify loss functions in layer specification as long as only standard neurons are used</li> 
    <li>The learning process is configured in the compilation step, where you specify the optimizer and loss function(s) that the model should use, as well as the metrics you want to monitor during training.</li>
</ul>
Of course, at the very beginning we need to import the models and layers classes from Keras:

In [None]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
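
For comparison, the Sequential model above could also be written with the functional API mentioned earlier. This is a sketch under the assumption that `keras.Input` and `keras.Model` are used as in standard Keras; the names `inputs`, `hidden` and `functional_model` are introduced here for illustration.

```python
from tensorflow import keras

# Declare the input tensor, then connect layers by calling them on tensors
inputs = keras.Input(shape=(10000,))
hidden = keras.layers.Dense(16, activation="relu")(inputs)
hidden = keras.layers.Dense(16, activation="relu")(hidden)
outputs = keras.layers.Dense(1, activation="sigmoid")(hidden)

# The model is defined by its input and output tensors
functional_model = keras.Model(inputs=inputs, outputs=outputs)
functional_model.compile(optimizer="rmsprop",
                         loss="binary_crossentropy",
                         metrics=["accuracy"])

n_params = functional_model.count_params()
print("number of parameters:", n_params)
```

For a plain stack of layers the two styles describe the same network; the functional form only pays off when the graph of layers branches or merges.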

In this example the network consists of 3 layers (beyond the input):
<ol>
<li>The first layer consists of 16 ReLU units; it takes a matrix with one row per example and 10000 input features per row.</li>
<li>The second layer takes its input from the first and contains 16 ReLU units.</li>
<li>The last layer is a single neuron with sigmoid activation.</li>
<li>The loss is binary cross-entropy, and the metric we monitor is accuracy.</li>
</ol>
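
To get a feel for the size of this network, the number of trainable parameters can be counted by hand: a fully connected layer has one weight per (input, unit) pair plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [10000, 16, 16, 1]   # input features, then units per Dense layer
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)   # [160016, 272, 17]
print(total)       # 160305
```

Almost all of the parameters sit in the first layer, because that is where the 10000-dimensional input is reduced to 16 dimensions.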

## Realistic example: imdb classification
The IMDB dataset is built into KS:
<ul>
    <li>It is a set of 50,000 highly polarized reviews from the Internet Movie Database.</li>
    <li>The reviews have already been preprocessed: the sequences of words have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.</li>
    <li>Each record has a target variable, the class. Its values are 1 (positive) and 0 (negative).</li>
    <li>The dataset is already split into a set of training records and a set of testing records. Each set consists of 25,000 reviews, of which 50% are negative and 50% are positive.</li>
    <li>The dataset comes with word_index, a dictionary that maps words to integer indices.</li>
</ul>
Let's see what an original review looks like. Note that the numeric codes in the review encoding and in the dictionary differ by 3 (which is why i - 3 is there). This is done to reserve the values 0, 1 and 2 in the encoding for service purposes (padding, start of sequence, and unknown word).

In [None]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
print(train_data[0],'\n')
decoded_review = ' '.join([reverse_word_index.get(i - 3, '\n') for i in train_data[0]])
print(decoded_review)
print('the label of this review is: ', train_labels[0])
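
The offset of 3 is easier to see on a toy example. The dictionary and the encoded review below are hypothetical and much smaller than the real IMDB data; the sketch only assumes the convention described above, namely that codes 0, 1 and 2 are reserved and every real word index is shifted up by 3.

```python
# Hypothetical miniature word index: word -> rank (1 = most frequent)
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Encoded review: 1 marks the start, 2 marks an unknown word,
# and real words appear as rank + 3
toy_encoded = [1, 4, 5, 2, 6]

toy_decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in toy_encoded)
print(toy_decoded)   # ? the film ? great
```

The reserved codes have no entry in the reverse dictionary, so they fall through to the default value passed to `get`.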

### Learning process
Beyond creating the model, the learning process includes
<ul>
    <li>Encoding the data that we need to learn with (done for IMDB)</li>
    <li>Separating the data into training and testing subsets (done for IMDB)</li>
    <li>Formatting the data in the format appropriate for the model. This step depends somewhat on the model (needed)</li>
    <li>Separating the training data into training and validation data (needed)</li>
    <li>Passing the formatted input data (and the corresponding target data) to the model via the model.fit() method (needed)</li>
    <li>Evaluating the results of training on the testing data using model.evaluate() (needed)</li>
    <li>Predicting results using the predict() method (needed)</li>
</ul>
We begin by creating the model from the necessary layers, after importing the libraries needed for that.

A feedforward fully connected network with a few layers of ReLU units ($y=\max(0,\vec{w}^T\cdot\vec{x}+b)$) and a logistic regression output layer performs well on problems where the input has no specific structure to exploit, as with our encoding of sentences. How many layers and how many units per layer should we choose?
<ul>
    <li>The first layer maps the input vector to an output vector whose dimensionality is the number of units in the layer.</li>
    <li>Every hidden layer may or may not reduce the dimension further.</li>
</ul>
As François Chollet, the author of Keras, puts it: "understand the dimensionality of your representation space as how much freedom you're allowing the network to have when learning internal representations."
<ul>
    <li>Having more hidden units (a higher-dimensional representation space) per layer allows your network to learn more complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns.</li>
    <li>Having more layers allows you to learn more sophisticated patterns, but, as with the number of units, you may learn unwanted, non-characteristic patterns.</li>
</ul>
Unfortunately, there is no exact science to this; the only way to settle it is by experimentation.
Rule of thumb:
<ol>
    <li>Reduce the feature space in the first step to between 10 and 100 dimensions.</li>
    <li>Start with a relatively small number of hidden layers.</li>
</ol>
We choose RMSprop, a modification of stochastic gradient descent that works well on FF (fully connected) networks (the gradients themselves are what tf.GradientTape would compute in plain TF). We need to do binary classification, so the output layer is a sigmoid (logistic regression). Binary cross-entropy is the loss associated with the output of logistic regression. We are going to watch the accuracy of the model to see how it works.
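
The three ingredients named above (ReLU hidden units, a sigmoid output, and binary cross-entropy) can be written out in NumPy for a tiny network. This is a sketch with made-up weights and two made-up examples, not the IMDB model; the function names and the clipping constant `eps` are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip to avoid log(0), then average -[y log p + (1 - y) log(1 - p)]
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

x = np.array([[1.0, -2.0], [0.5, 0.5]])          # 2 examples, 2 features
w_hidden = np.array([[1.0, -1.0], [1.0, 1.0]])   # (2 inputs, 2 ReLU units)
b_hidden = np.zeros(2)
w_out = np.array([[1.0], [-1.0]])                # (2 hidden units, 1 output)
b_out = 0.0

h = relu(x @ w_hidden + b_hidden)                # hidden activations
p = sigmoid(h @ w_out + b_out).ravel()           # predicted probability of class 1
y_true = np.array([0.0, 1.0])
loss = binary_crossentropy(y_true, p)

print("hidden:", h)
print("probabilities:", p)
print("loss:", loss)
```

For the first example both hidden units are switched off by the ReLU, so the output is sigmoid(0) = 0.5; the loss penalizes each prediction by the negative log of the probability assigned to the true class.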

In [None]:
from keras import models
from keras import layers

imdb_model = models.Sequential()
imdb_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
imdb_model.add(layers.Dense(16, activation='relu'))
imdb_model.add(layers.Dense(1, activation='sigmoid'))
imdb_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
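
The optimizer named in the compile call, RMSprop, divides each gradient by a running root-mean-square of recent gradients. The following NumPy sketch shows the commonly stated form of the update on a one-dimensional toy problem (minimizing w²); the function name `rmsprop_step` and the hyperparameter values are assumptions for illustration, and the Keras implementation may differ in details.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSprop update: scale the step by the RMS of recent gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)       # normalized gradient step
    return w, avg_sq

w, avg_sq = 1.0, 0.0
trajectory = [w]
for _ in range(100):
    grad = 2.0 * w                  # gradient of the toy loss w**2
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
    trajectory.append(w)

print("after 1 step:", trajectory[1])
print("after 100 steps:", trajectory[-1])
```

Because the step is normalized by the gradient's recent magnitude, its size depends mostly on the learning rate rather than on the raw scale of the gradient.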

Next, create the function that converts each sequence into a vector of length 10,000. All reviews together then form a matrix of dimension (number of reviews) x 10,000. Why 10,000? Because we kept only the 10,000 most frequent words when loading the data (num_words=10000). We assign to a pair (review number, word number) the value 0 if the word is not in the review and 1 if it is. We initialize the array to all 0's and then, in the loop, fill in each entry i,j according to the occurrence or non-occurrence of the j-th word in the i-th review. The result is a sparse (mostly zero) reviews-by-words matrix called results, in which each row is the vector of occurrences of the 10,000 words in the corresponding review.

In [None]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
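
A toy run makes the encoding concrete. The function is repeated here so that the block runs on its own, and the two "reviews" and the dictionary size of 4 are made up for illustration.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0   # mark every word index that occurs
    return results

toy_reviews = [[0, 2], [1, 1, 3]]    # word 1 occurs twice in the second review
toy_encoded = vectorize_sequences(toy_reviews, dimension=4)
print(toy_encoded)
```

The first row is 1, 0, 1, 0 and the second is 0, 1, 0, 1: a repeated word still produces a single 1, so the encoding records presence, not counts, and loses word order.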

Next we vectorize the training and testing data, and we also convert all labels into NumPy arrays of real values. They are still 0 and 1, but now their type matches the floating-point output of the sigmoid layer in the model, which is what the standard loss expects.

In [None]:
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
print(x_train[0])
# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Finally, we separate the training data into training and validation sets. Why have a validation set? It provides an evaluation of the model fit on data not used for training while we tune the model's hyperparameters (e.g. the number of layers and the layer widths in a neural network). Testing data is used only once and is representative of the general population, while validation data is carved out of the training data, so it may be used many times during design to fine-tune the model. Here we take the first 10,000 records for validation and the rest for training.

In [None]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
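
Slicing off the first records, as in the cell above, works when the records are already in random order. If that could not be assumed, one option is to shuffle the indices first. The sketch below does this on a made-up array of 10 examples; the seed, the sizes and the variable names are assumptions for illustration.

```python
import numpy as np

toy_x = np.arange(20).reshape(10, 2)   # 10 toy examples, 2 features each
toy_y = np.arange(10) % 2              # toy labels

rng = np.random.default_rng(0)         # fixed seed so the split is reproducible
order = rng.permutation(len(toy_x))    # shuffled example indices
n_val = 3

val_idx, train_idx = order[:n_val], order[n_val:]
toy_x_val, toy_y_val = toy_x[val_idx], toy_y[val_idx]
toy_x_part, toy_y_part = toy_x[train_idx], toy_y[train_idx]

print(toy_x_val.shape, toy_x_part.shape)
```

The same index array is applied to the inputs and the labels, so each example keeps its label, and the two subsets never share an example.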

We are now ready to fit the network to the data. We use a batch size of 512 reviews and train for 20 epochs to start with. If the loss stabilizes at the end and accuracy no longer increases, then we had enough epochs for this model.

In [None]:
history = imdb_model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

All the data from training is preserved in the '.history' attribute of the returned object. We monitored accuracy and computed loss, so they must be there. Let's see what is in fact there.

In [None]:
history_dict = history.history
history_dict.keys()

Now let's plot these results. We need new library matplotlib.pyplot for that.

In [None]:
import matplotlib.pyplot as plt

Now plot these values as a function of the epoch number, validation against training. First, let's plot the loss.

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Now let's plot accuracy

In [None]:
plt.clf()
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
# 'ro' is for red dots, 'g^' is for green triangles
plt.plot(epochs, acc_values, 'ro', label='Training acc')
plt.plot(epochs, val_acc_values, 'g^', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

We can see that after the 3rd epoch validation accuracy stays flat and even decreases, while training accuracy keeps increasing. We are overfitting the model. It may be that the data isn't representative, or that our model fits noise in the training set rather than the underlying distribution. This needs to be adjusted. But this is not production, so we stop here, assuming we have done our best.
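
The point where overfitting starts can also be read off the recorded numbers instead of the plot. The sketch below uses a hypothetical history dictionary with made-up values (they are not the output of the run above) and simply looks for the epoch with the lowest validation loss and for epochs where training loss falls while validation loss rises.

```python
# Hypothetical values in the same layout as history.history
toy_history = {
    "loss":     [0.50, 0.32, 0.24, 0.18, 0.14],
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.33],
}

train_loss = toy_history["loss"]
val_loss = toy_history["val_loss"]

# Epochs are numbered from 1, list positions from 0
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
overfit_epochs = [e + 1 for e in range(1, len(val_loss))
                  if val_loss[e] > val_loss[e - 1] and train_loss[e] < train_loss[e - 1]]

print("lowest validation loss at epoch", best_epoch)   # 3
print("epochs showing overfitting:", overfit_epochs)   # [4, 5]
```

With real training output, the same two lines applied to `history.history` would give a candidate number of epochs to retrain with.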

Let's see how it performs on the testing data using .evaluate() and .predict().

In [None]:
# evaluate() returns the loss first, then the metrics named in compile()
test_loss, test_acc = imdb_model.evaluate(x_test, y_test)
print('test loss (binary cross-entropy) is: ', test_loss, '\n', 'accuracy is: ', test_acc)
print(imdb_model.predict(x_test))
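
The values printed by predict() are probabilities of the positive class, one per review. To turn them into class labels they are usually thresholded at 0.5. The sketch below does this on a made-up array shaped like the output of predict(); the probabilities and labels are hypothetical.

```python
import numpy as np

toy_probs = np.array([[0.9], [0.2], [0.6], [0.4]])   # shape (n_reviews, 1), as predict() returns
toy_labels = np.array([1.0, 0.0, 0.0, 0.0])          # true classes

toy_predicted = (toy_probs.ravel() > 0.5).astype("float32")   # 1 = positive, 0 = negative
toy_accuracy = float(np.mean(toy_predicted == toy_labels))

print(toy_predicted)   # [1. 0. 1. 0.]
print(toy_accuracy)    # 0.75
```

This is the same quantity that the 'accuracy' metric reports during evaluation: the fraction of reviews whose thresholded prediction matches the label.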