<a href="https://colab.research.google.com/github/Blackman9t/ML_and_DL_with_tensor_flow/blob/master/deeplearning_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notations:

1. A single training example is represented by a pair <b>$(x,y)$</b>, where $x$ is an $n_x$-dimensional feature vector and the label $y$ is either 0 or 1: <h2>$x\in\mathbb{R}^{n_x}$ and $y\in\{0,1\}$</h2>

2. Our training set comprises <b>$m$</b> training examples, as illustrated below:<h2>$\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), (x^{(3)},y^{(3)}), \ldots, (x^{(m)},y^{(m)})\}$</h2>
Lowercase $m$ denotes the number of training examples. Where a distinction is needed, $m_{train}$ and $m_{test}$ denote the number of training and test examples respectively.

3. Finally, the matrix $X$ is defined by taking the training examples $x^{(1)}$, $x^{(2)}$ and so on and stacking them in columns.<br>This means $x^{(1)}$ becomes the first column of $X$, $x^{(2)}$ the second column, and so on up to $x^{(m)}$.<br>$X$ therefore has $m$ columns, where $m$ is the number of training examples, and its number of rows (its height) is $n_x$. Therefore: <h2>$X\in\mathbb{R}^{n_x \times m}$</h2>

4. To make our implementation of a neural network easier, we also stack the labels in columns, just as we did for the feature matrix $X$: <h2>$Y = [y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(m)}]$</h2> Therefore: <h2>$Y\in\mathbb{R}^{1 \times m}$</h2>
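
The stacking convention above can be sketched in NumPy. This is only an illustration: the toy values and the names `examples` and `labels` are assumptions, not part of the original notes.

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, each with n_x = 2 features.
examples = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
labels = [0, 1, 1]

# Stack the feature vectors as columns, so X has shape (n_x, m).
X = np.column_stack(examples)

# Keep Y as an explicit row vector of shape (1, m).
Y = np.array(labels).reshape(1, -1)

n_x, m = X.shape
print("X shape:", X.shape)
print("Y shape:", Y.shape)
```

With this layout, column `X[:, i]` holds example $x^{(i+1)}$ and `Y[0, i]` holds its label.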

**RELU:**

RELU stands for Rectified Linear Unit. "Rectified" just means taking the maximum of 0 and the input, $\max(0, z)$, which is what gives the RELU function its shape: flat at 0 for negative inputs and linear for positive ones.
<br>Neural Networks (NN) are densely connected because each input feature is connected to every unit of the hidden layer. Thus each hidden unit receives inputs from all the input features.
<br>Given input features $X$ and a target label $y$, neural networks are remarkably good at predicting $y$, even for unseen data. NNs are at their most powerful in supervised learning settings, where labelled pairs $(x, y)$ are available.
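
As a minimal sketch of the RELU function described above (the function name `relu` and the sample inputs are illustrative assumptions):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: element-wise maximum of 0 and z."""
    return np.maximum(0, z)

# Negative inputs are clipped to 0; positive inputs pass through unchanged.
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))
```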

### Basics of Neural Network Programming:

When we program neural networks, we usually keep the parameters $W$ and $b$ separate.

**Logistic Regression as a Neural Network:**

The parameters of a logistic regression are $W$ and $b$.<br>$W$ is an $n_x$-dimensional vector and $b$ is a real number. $W$ holds the weights of the logistic regression and $b$ is the intercept (also called the bias).

Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, <b>y</b>, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function, which is called sigmoid function 𝜎:

$$
h_\theta(x) = \sigma({\theta^TX}) = \frac {e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}{1 + e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}
$$
Or:
$$
\text{Probability of class 1} = P(Y=1|X) = \sigma({\theta^TX}) = \frac{e^{\theta^TX}}{1+e^{\theta^TX}}
$$

In this equation, ${\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), $e^{(\cdot)}$ is the exponential function and $\sigma(\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called the logistic curve. It has a characteristic "S" shape (sigmoid curve).

So, briefly, Logistic Regression passes the input through the logistic/sigmoid function and treats the result as a probability.

In the $W$ and $b$ notation, where $W$ plays the role of the coefficients $\theta_1, \theta_2, \ldots$ and $b$ the role of the intercept $\theta_0$, let $z = W^Tx + b$.
<br>Then the sigmoid function applied to $z$ is <h2>$\sigma(z) = {1\over1+e^{-z}}$</h2> where $e$ is Euler's number, approximately 2.7182818.
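
A minimal NumPy sketch of the sigmoid function applied to $z$ (the function name `sigmoid` and the sample values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# z = 0 sits exactly on the decision boundary, giving a probability of 0.5.
print(sigmoid(0.0))

# Large negative z approaches 0, large positive z approaches 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))
```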

**Logistic Regression Cost Function:**

_Loss Function_: The loss function is defined with respect to a single training example. It measures how well we're doing on that one example.
<br>_Cost Function_: The cost function measures how well we're doing on the entire training set; it is the cost of our parameters. So in training our logistic regression model, we try to find parameters $W$ and $b$ that minimise the cost function. Seen this way, logistic regression is a very small neural network.

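See the notation below for the loss and cost functions of a logistic regression, where $\hat{y}^{(i)} = \sigma(z^{(i)})$ is the predicted probability for example $i$:

**Loss Function:** <h2>$\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right)$</h2>

**Cost Function:** <h2>$J(W,b) = {1\over m} \sum^m_{i=1} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$</h2>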
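
The loss and cost can be sketched in NumPy as follows. The function names and the toy predictions are assumptions for illustration. The negative sign lives inside the per-example loss, so the cost is a plain average of the losses.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss per example (element-wise on arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average of the per-example losses over the m columns of Y."""
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

# Hypothetical labels and predicted probabilities, both of shape (1, m).
Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.6]])

print(cost(Y_hat, Y))
```

Confident correct predictions (such as 0.9 for a label of 1) contribute a small loss, while uncertain ones (such as 0.6) contribute more.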

### Gradient Descent Algorithm

While applying the Gradient Descent algorithm, the parameters $W$ and $b$ are repeatedly updated to reduce the cost function $J(W,b)$:<br>
<h2>$W:= W - \alpha ({\partial J(W,b) \over \partial W})$</h2>
<h2>$b:= b - \alpha ({\partial J(W,b) \over \partial b})$</h2>
where $\alpha$ is the learning rate.

The derivative is the slope of a function at a given point. The slope is the rise divided by the run, or the height divided by the width.
<br>The cost function that Gradient Descent minimises here has a convex shape, as shown below:

<img src= 'https://github.com/Blackman9t/Machine_Learning/blob/master/gradient_descent%20(1).jpg?raw=true' >

For example, suppose the parameter is randomly initialised between points A and B in the chart above. Gradient Descent calculates the derivative (the slope) of the cost function at that point, and this derivative is a negative number. It then plugs the derivative into the formulas above for $W$ and $b$ and updates each accordingly. Note that multiplying $-\alpha$, where $\alpha$ is the learning rate, by a negative derivative gives a positive number. This is added to the parameter, which takes a positive step closer to the global minimum at point B in the chart.
<br>The opposite happens if the parameter is randomly initialised between points B and C in the chart. In that case the derivative at that point is a positive number. Multiplying $-\alpha$ by a positive number gives a negative number, so the parameter takes a negative step, moving from the region near C towards B.
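
The update rule can be sketched for logistic regression as below. This is a minimal illustration, not the author's implementation: the gradient expressions `dW` and `db` are the standard results for the sigmoid with cross-entropy cost (they are not derived in these notes), and the function name, learning rate, iteration count and toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, num_iterations=1000):
    """Fit W and b by repeating W := W - alpha * dJ/dW, b := b - alpha * dJ/db."""
    n_x, m = X.shape
    W = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        Y_hat = sigmoid(W.T @ X + b)   # predictions, shape (1, m)
        dZ = Y_hat - Y                 # standard gradient of the cost w.r.t. z
        dW = (X @ dZ.T) / m            # dJ/dW, shape (n_x, 1)
        db = np.sum(dZ) / m            # dJ/db, a scalar
        W = W - alpha * dW             # step against the slope
        b = b - alpha * db
    return W, b

# Hypothetical toy data: one feature, label 1 when the feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])

W, b = gradient_descent(X, Y)
predictions = (sigmoid(W.T @ X + b) > 0.5).astype(int)
print(predictions)
```

When the slope is negative the subtraction increases the parameter, and when it is positive the subtraction decreases it, matching the two cases described above.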

## Vectorisation

**Vectorisation Demo**

In [1]:
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a,b)
toc = time.time()

print(c)
print('Vectorised version: ' + str(1000 * (toc - tic)), 'ms.')

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print('For-Loop version: ' + str(1000 * (toc - tic)), 'ms.')

250029.76759978366
Vectorised version: 1.055002212524414 ms.
250029.76759978523
For-Loop version: 692.579984664917 ms.


## Broadcasting Example:

Let's get the sum of each column of an array and then compute each element's percentage of its column sum.

In [2]:
A = np.array([[56.0, 0.0, 4.4, 68.0],
             [1.2, 104.0, 52.0, 8.0],
             [1.8, 135.0, 99.0, 0.9]])
A

array([[ 56. ,   0. ,   4.4,  68. ],
       [  1.2, 104. ,  52. ,   8. ],
       [  1.8, 135. ,  99. ,   0.9]])

In [3]:
print('A shape is: ',A.shape)
print("A dimension is: ",A.ndim)
type(A)

A shape is:  (3, 4)
A dimension is:  2


numpy.ndarray

In [4]:
# Calculate the sum of each column of A and store it in cal
cal = A.sum(axis=0)  # equivalent to np.sum(A, axis=0)
cal

array([ 59. , 239. , 155.4,  76.9])

In [5]:
print('Cal shape is: ',cal.shape)
print("Cal dimension is: ",cal.ndim)
type(cal)

Cal shape is:  (4,)
Cal dimension is:  1


numpy.ndarray

In [6]:
# Calculate each element's percentage of its column sum.
# cal has shape (4,), so it is broadcast across the 3 rows of A.
percentage = 100 * (A / cal)
percentage

array([[94.91525424,  0.        ,  2.83140283, 88.42652796],
       [ 2.03389831, 43.51464435, 33.46203346, 10.40312094],
       [ 3.05084746, 56.48535565, 63.70656371,  1.17035111]])
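
As a hedged variation on the cell above, the broadcast can be made more explicit by keeping the column sums as a 2-D row. The use of `keepdims=True` is an addition here, not something the original cell does.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

# keepdims=True keeps the summed axis, giving shape (1, 4) instead of (4,).
cal = A.sum(axis=0, keepdims=True)

# (3, 4) / (1, 4): the single row of cal is stretched across the 3 rows of A.
percentage = 100 * A / cal

print(cal.shape)
print(percentage.sum(axis=0))  # each column of percentages sums to 100
```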

**Avoid Rank 1 Arrays:**

Rank 1 arrays are arrays whose shape has a single dimension, displayed like (5,). They behave as neither row vectors nor column vectors.
<br>See the example below.

In [7]:
a = np.random.randn(5)
print(a.shape)
print(a)
print(type(a))

(5,)
[-1.06084302  0.10358885 -0.14834132  0.86559636 -0.61669658]
<class 'numpy.ndarray'>


In [8]:
# Now let's try to transpose a and save in b.
b = a.T

print(b.shape)
print(b)

(5,)
[-1.06084302  0.10358885 -0.14834132  0.86559636 -0.61669658]


In [9]:
# Let's multiply a by b and see what happens
a * b

array([1.12538791, 0.01073065, 0.02200515, 0.74925706, 0.38031467])

We see that the transpose of a, stored in b, is identical to a: transposing a rank 1 array changes nothing, so a * b is just an element-wise product. This can cause subtle bugs when programming neural networks.<br>A better option is to specify the full shape of a explicitly when creating it.

In [10]:
a = np.random.randn(5,1)
print(a.shape)
print(a)
print(type(a))

(5, 1)
[[-0.38656812]
 [-0.91290287]
 [-0.19181844]
 [-1.32287266]
 [ 1.19450848]]
<class 'numpy.ndarray'>


Now a is a proper column vector. Let's transpose a and store the result in b.

In [11]:
b = a.T
print(b.shape)
print(b)
print(type(b))

(1, 5)
[[-0.38656812 -0.91290287 -0.19181844 -1.32287266  1.19450848]]
<class 'numpy.ndarray'>


In [12]:
# Let's multiply a by b and see what happens:
# shapes (5, 1) * (1, 5) broadcast to a (5, 5) result
a * b

array([[ 0.14943491,  0.35289914,  0.07415089,  0.51138039, -0.46175889],
       [ 0.35289914,  0.83339165,  0.1751116 ,  1.20765424, -1.09047022],
       [ 0.07415089,  0.1751116 ,  0.03679431,  0.25375137, -0.22912875],
       [ 0.51138039,  1.20765424,  0.25375137,  1.74999206, -1.5801826 ],
       [-0.46175889, -1.09047022, -0.22912875, -1.5801826 ,  1.4268505 ]])

Feel free to use assert statements to confirm the shapes of your arrays.

In [0]:
assert a.shape == (5, 1)
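
If a rank 1 array already exists, one way to fix it is to reshape it into an explicit column or row vector. This is a small sketch; the variable names `col` and `row` are illustrative assumptions.

```python
import numpy as np

a = np.random.randn(5)     # rank 1 array, shape (5,)

col = a.reshape(5, 1)      # explicit column vector, shape (5, 1)
row = a.reshape(1, 5)      # explicit row vector, shape (1, 5)

# The transpose now behaves as expected.
assert col.T.shape == (1, 5)

# Matrix products have well-defined shapes.
print(np.dot(row, col).shape)   # inner product as a 1 x 1 matrix
print(np.dot(col, row).shape)   # outer product as a 5 x 5 matrix
```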

## Pros and Cons of Activation Functions

<img src = 'https://github.com/Blackman9t/ML_and_DL_with_tensor_flow/blob/master/activation_funcs.jpg?raw=true' height=500 />