### Notes on notations

Throughout the introduction chapter, number of data points is used as a number of features (or dimensionality) of dataset.<br>
> **Number of point = number of features = columns number = n**

If we have a single data point (feature) only, the neural network takes one data point at a time and outputs a single prediction.

### Neural network in action

In [1]:
import numpy as np

In [2]:
weight = 0.1 # initialize weight (usually random or zeros)

def neural_network(input_datapoint, weight):
    predictions = input_datapoint * weight
    return predictions

Neural network inputs the input data, multiplies it by *knowledge* (weight) and outputs *predictions*.<br>
Apply this function to an example.

In [3]:
number_of_toes = [8.5, 9.5, 10, 9] # number of toes - feature. len(number_of_toes) - m = dataset size (number of samples)

In [4]:
neural_network(number_of_toes[0], weight)

0.8500000000000001

Now, let's activate all weights and check the dimensions of input, weights and output.

In [5]:
# initialize 4 weighs randomly
weights = 0.1
number_of_toes = np.array([8.5, 9.5, 10, 9]).reshape(1,-1)

In [6]:
print('Shape of the input data', number_of_toes.shape)

Shape of the input data (1, 4)


In [7]:
linear_activ =  weights * number_of_toes

In [8]:
linear_activ

array([[0.85, 0.95, 1.  , 0.9 ]])

In [9]:
print('Shape of the linear activation', linear_activ.shape)

Shape of the linear activation (1, 4)


As we can see, applying knowledge (weights), we activate all samples (m). <br>
Note that in this case the `hidden_dim = 1`.<br>
**Important**: one weight for one feature! **Dimension of weights has nothing with the batch size m**, it has a shape of (*input_dim*, *hidden_dim*).

It's worth mentioning that in aove case we used batch (m) = 4 with a single feature and only 1 hiddden neuron.

> **Important note**: NN does not have an access to information except one instance!<br>
 I.e. when we feed the input data of batch m (for example, a single data sample of m = 1), net does not remember predictions from last timestamps. It does not have access to previous instances.<br>
 Later on, it's solved by RNN and LSTMs in particular.

#### Weight as a measure of sensitivity

We can think of the weights as a measure of sensitivity between the input data and predictions.<br>
If the **weight is very high**, then even the **tiniest input will create a large prediction**.<br>
Thant's why we pay attention to regularisation.

One last note: neural network can both input and output **either negative or positive values**.

### Making predictions with multiple inputs

Neural networks can combine predictions from **multiple inputs**. <br>
Along with the average number of toes, we can provide other features (other input data) - **win loss** and **number of fans**.

In [10]:
weights = [0.1, 0.2, 0.3] # three weights for 3 features
def neural_network(inputs, weights):
    # apply "weighted sum" function that we will define below
    pred = w_sum(inputs, weights)
    return pred

In [11]:
toes = [8.5, 9.5, 9.9, 9.0] # average number of toes
wlrec = [0.65, 0.8, 0.8, 0.9] # win loss
nfans = [1.2, 1.3, 0.5, 1.0] # number of fans

The lenght of dataset is 5 - **5 samples in our dataset**. The shape of the dataset is **(5,3)**. 

Let's take the **first sample** and feed this to our neural network.<br>
Our goal is to calculate the **weighted sum** across all input features.<br>
Formally, we can write: `predictions = weight_1 * input_1 + weight_2 * input_2 + weight_3 * input_3`

In [12]:
inputs = [toes[0], wlrec[0], nfans[0]]

In [13]:
def w_sum(inputs, weights):
    # first, we assert the input size is equal to weights length
    # again, we assume the use of a single hidden dimension, so the hidden_dim = 1
    assert(len(inputs) == len(weights))
    pred = 0
    # if the input size is equal to weght size, we can iterate over their size
    for i in range(len(inputs)):
        pred += (inputs[i]*weights[i])
    return pred

In [14]:
# apply neural net to calculate the total predictions (linear combination of all local pedictions)
predictions =neural_network(inputs, weights)

In [15]:
print('Output of network:', predictions)

Output of network: 1.34


**Main takeaway**:<br>

> 1. Neural networks combine multiple **local** predicctions in one as the linear combination of weights and corresponding input at the single timestamp.<br>
Recall, we treat **weights as knowledge**, **input as information** and **output of weighted sum as predictions**. <br>
So, the calculation of weighted sum once per instance.<br>

> 2. In order to make accurate predictions, we need to combine multiple data inputs (features).

### Vectorization

"Anytime you perform a mathematical operation between two vectors of equal length where
you **pair up values according to their position in the vector** (again: position 0 with 0, 1 with 1,
and so on), it’s called an **elementwise operation**.<br> Thus elementwise addition sums two vectors,
and elementwise multiplication multiplies two vectors."

In [16]:
a = np.array([ 0, 1, 0, 1])
b = np.array([ 1, 0, 1, 0])
c = np.array([ 0, 1, 1, 0])
d = np.array([.5, 0,.5, 0])
e = np.array([ 0, 1,-1, 0])

The intuition behind the dot product is just a weighted sum, when we perform **element-wise multiplication of vectors and sum up the element-wise results**.<br>
As well, dot product gives as a notion of **similarity between two vectors** of equal size.<br>
Let's calculate the dot products of the vectors above. 

In [17]:
print('Dot product of a and b:', a@b)
print('Dot product of b and c:', b@c)
print('Dot product of b and d:', b@d)
print('Dot product of c and c:', c@c)
print('Dot product of d and d:', d@d)

Dot product of a and b: 0
Dot product of b and c: 1
Dot product of b and d: 1.0
Dot product of c and c: 2
Dot product of d and d: 0.5


First observation: vectors a and b have no overlapping weights, that's why their dot product (or similarity) is 0.<br>
Second observation: vectors b and c have 2nd overlapping position, their similarity is 1. <br>
Third observation: vectors c and e have positive similarity on 1st position, but the negative weight cancells it out.

Let's demonstrate this property with logical `AND` operator.<br>
Vectors a and b do not share the similarity between 0th elements. The `AND` operator returns 0 for them.<br>
However, b and c have 2nd position similar and operator returns 1.

In [18]:
(a[0] and b[0])

0

In [19]:
(b[2] and c[2])

1

In [20]:
(a[0] and b[0]) or (a[1] and b[1])

0

Let's perform **vectorized code** for our network.

In [21]:
def neural_network(inputs, weights):
    pred = np.dot(inputs, weights)
    return pred

In [22]:
weights = np.array([0.1, 0.2, 0.3]) # numpy weights for 3 features
inputs = np.array([toes[0], wlrec[0], nfans[0]]) # numpy array of inputs

In [23]:
pred_vectorized = neural_network(inputs, weights)

In [24]:
print('Vectorized predictions:', pred_vectorized)

Vectorized predictions: 1.34


### Making predictions with single input and multiple outputs 

Neural networks can make **multiple predictions** even with a single input.<br>
In this case we talk about **multiple activations** or multiple hidden neurons.<br>
In case of a single input, but multiple hidden neurons, we will have weights of shape `(1, hidden_dim)`.

Suppose, we want to make **3 predictions** or create **3 hidden neurons**:
 - whether the  won or lost;
 - whethwer the players are happy or sad;
 - percentage of team players who are hurt.
 As an input we will use a single feature: `wlrec` (ratio of wins and losses) for one data sample (`m=1`).

In [25]:
def neural_network(inputs, weights):
    # perform element-wise multiplication
    pred = ele_wise(inputs, weights)
    return pred

In [26]:
def ele_wise(inputs, weights):
    # initiate the list of zeros, with lenght = hidden_dim = 3 (number of outputs)
    output = [0,0,0] 
    # we make sure the weights dimension equals hidden_dim
    assert (len(output) == len(weights))
    for i in range(len(weights)):
        # we have a single input, that will be multiplied by weight, that corresponds to each activation (each output)
        output[i] = inputs * weights[i]
    return output

In [27]:
inputs = wlrec[0]
weights = np.array([0.3, 0.2, 0.9])
output = neural_network(inputs, weights)

In [28]:
print(output)

[0.195, 0.13, 0.5850000000000001]


As we can see from the output above, the network performs output of `size = 3`, **multiplying a single input (for a single sample) by weight**, corresponding to each neuron (or output).

### Predicting with multiple inputs and outputs

Now we are ready to make **multiple predictions based on multiple inputs**.<br>

As previously, we *connect each input node to each output node*.

We were given with weights matrix for 3 outputs (hidden_dim): hurt, win/loss, sad*.<br>
The shape of weights matrix is (3x3), so the `(input_size, input_size)`. 

In [29]:
weights = np.array([ [0.1, 0.1, -0.3], # hurt?
            [0.1, 0.2, 0.0], # win?
            [0.0, 1.3, 0.1] ]) # sad? 

In [30]:
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

inputs = np.array([toes[0], wlrec[0], nfans[0]]).reshape(1,-1)

In [31]:
def neural_network(inputs, weights):
    output = vect_matmul(inputs, weights)
    return output

The idea of `vect_matmul()` function is to **multiply each input datapoint from `inputs` list by corresponding weights**.<br>
In fact, we will perform **vector-matrix multiplication** for inputs vector and matrix of weights.

In [32]:
print('Shape of the input:', inputs.shape)
print('Shape of the weights:', weights.shape)

Shape of the input: (1, 3)
Shape of the weights: (3, 3)


Shape of the input is `(m,input_size)`. In outr case `m=1` and `input_size` = 3 <br>
Shape of the weights is `(input_size, hidden_dim)`. <br>

> Rows determine the `input_size`, or in other words **there are 3 weights coming to each output node (or hidden neuron)**. <br>For example, 3 weights [0.1, 0.2, 0.0] go to the "win" output node from 3 features of a single sample.<br>

> In other words, we can think of **3 weights go from each feature (columns) to corresponding ouptut nodes**. <br>
For example we have 3 weights, coming from `toes` feature [0.1, 0.1, 0.] to 3 corresponging output nodes.

Thus, rows of weights matrix represent the number of input features and columns represent the output nodes.

Our goal is to **activate all `m` samples**, so the output will have shape of `(m, hiddem_size)`.

In [33]:
print(inputs) 

[[8.5  0.65 1.2 ]]


In [34]:
print(weights.T)

[[ 0.1  0.1  0. ]
 [ 0.1  0.2  1.3]
 [-0.3  0.   0.1]]


To follow the logic above, we need to transpose the matrix of weights. <br>
**Example**: for the first feature `toes` we have: 0.85 * 0.1, 0.85 * 0.1, 0.85 * 0.0. From a single feature we have 3 weights, corresponding to 3 different hidden neurons. Then, we need to sum 3 independent activations for `toes` to activations of rest of the features.<br>
We can think about it as **3 independent dot products**. 

In [35]:
# how the output is calculated, using vectorization
inputs @ weights.T

array([[0.555, 0.98 , 0.965]])

#### Defining `vect_matmul` function

Suppose, we want to calculate **3 neurons**, hence `hidden_dim` = 3.<br>
Using the weighted sum function `w_sum` we defined above, we need to calculate 3 outputs.

In [250]:
# define the inputs vector without reshaping
inputs = np.array([toes[0], wlrec[0], nfans[0]])

In [313]:
def w_sum(a, b):
    # start calculating predictions for a single neuron (single output)
    # number of features must be equal to weights size
    assert(len(a) == len(b)) 
    output = 0
    for i in range(len(a)):
        output += (a[i] *b [i])
    return output

In [314]:
"""
 - vect - vector of inputs
 - matrix - matrix of weights
"""
def vect_mat_mul(vect,matrix):
    # since we want to get 3 outputs, we need to initialiase 3 neurons of zeros
    assert(len(vect) == len(matrix))
    output = [0,0,0]
    # iterate over the output neurons
    for i in range(len(vect)):
        # perform the weighted sum of input vector and corresponding matrix weights
        # conceptually, we multiply each data input (each input feature) (information) by neurons weights (knowledge)
        output[i] = w_sum(vect,matrix[i])
    return output

In [315]:
output = neural_network(inputs, weights)

In [316]:
print(output)

[0.21350000000000002, 0.08034999999999999, 0.227455]


Note that 3 predictions are **completely separate from each other** (3 seprarate dot products). <br>
Unlike the network with **multiple inputs and the single output**, we multiply the feature inputs with separate set of weights for each of 3 neurons.<br>
We called these 3 outputs as **hurt** prediction, **win/loss** predictions, **sad** predictions.

### Perdicting on predictions: Stacked NNs

Networks' layers can be stacked and we can make predictions based on the **outputs of the hidden layer**.

Suppose, we want to include a hidden layer and make **predictions based on hidden layer outputs**.<br>
Hidden neurons = 3 (`hidden_dim` = 3).<br>
Output neurons = 3 (`output_dim` = 3).<br>

In order to perform calculations and output 3 predictions for a single sample, we need to initialize **2 weights matrices** and perform **2 vector-matrix multiplications** using `vect_matmul`.

In [318]:
inputs = np.array([toes[0], wlrec[0], nfans[0]])

In [319]:
# first, initiate weights, connecting inputs to hidden units (neurons)
ih_wgt = np.array([[0.1, 0.2, -0.1], # hid[0]
                   [-0.1,0.1, 0.9],   # hid[1]
                   [0.1, 0.4, 0.1]]) # hid[2]

In [320]:
# next, we connect the output of hidden layer with output layer, producing 3 outputs
hp_wgt = [ [0.3, 1.1, -0.3], # hurt?
 [0.1, 0.2, 0.0], # win?
 [0.0, 1.3, 0.1] ] # sad?

In [321]:
# concatinate 2 weight matrices
weights = [ih_wgt, hp_wgt]

In [322]:
def neural_network(inputs, weights):
    hidden_out = vect_matmul(inputs, weights[0]) # vector-matrix multiplication for hidden layer
    output = vect_mat_mul(hidden_out, weights[1]) # vector-matrix multiplication for output as 
    return output

In [323]:
pred = neural_network(inputs, weights)

In [324]:
print(pred)

[0.21350000000000002, 0.14500000000000002, 0.5065]


### Vectorized Numpy version for 1-hidden layer

Let's just rewrite the function above in vectorized form.<br>
An important thing to note here is that we need to **transpose the weight matrices** to make dot product follow the logic.

In [328]:
# first, initiate weights, connecting inputs to hidden units (neurons)
ih_wgt = np.array([[0.1, 0.2, -0.1], # hid[0]
                   [-0.1,0.1, 0.9],   # hid[1]
                   [0.1, 0.4, 0.1]]).T # hid[2]

In [329]:
# next, we connect the output of hidden layer with output layer, producing 3 outputs
hp_wgt = np.array([[0.3, 1.1, -0.3], # hurt?
          [0.1, 0.2, 0.0], # win?
          [0.0, 1.3, 0.1]]).T # sad?

In [330]:
# concatinate 2 weight matrices
weights = [ih_wgt, hp_wgt]

In [331]:
def neural_network(inputs, weights):
    hidden_out = inputs @ weights[0]
    output = hidden_out @ weights[1]
    return output

In [332]:
neural_network(inputs, weights)

array([0.2135, 0.145 , 0.5065])

**Main takeaway**: it's always much efficient to use vectorization instead of for loops.

### Canonical shapes in machine learning

The rule of thub is to follow some guidelines for input/output shapes:
 - input data has dimension `(m, input_size)`, where `m` - number of samples, `input_size` - number of features or dimensionality.
 - weight matrix has dimension `(input_size, hidden_dim)`, where `input_size` - dimensionality of input, `hidden_dim` - the size of hidden layer or number of neurons.
 - hidden output, that is the result of multiplication of input data