# Binary Classification Problem

The problem is: given an image, indicate whether it contains a cat or not. It is the practical problem used throughout the course to practice building neural networks.

Each image is represented as three separate matrices, one for each of the red, green, and blue color channels. These matrices will be referred to as the set of features, $X$. $Y$, on the other hand, holds one label per image indicating whether the image contains a cat or not.

To turn the image matrices into an input vector, they are unrolled into a single vector of length 64\*64\*3 = 12288, since each image is 64 by 64 pixels with three color channels.
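The unrolling step can be sketched with NumPy as follows. The random array is only a stand-in for a real image, and the variable names are assumptions made for illustration. Any fixed unrolling order works as long as it is applied consistently to every image.

```python
import numpy as np

# Hypothetical stand-in for a real 64x64 RGB image: random intensities in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Unroll the three channel matrices into one column vector of length 64*64*3.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1)
```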

The problem is formulated as $$F(X) \longrightarrow Y$$ where $F$ is the function to be learned by the neural network, $X$ is the matrix of feature vectors, and $Y$ is the vector of outputs, one for each feature vector.

# Notations

- $M$ represents the set of examples, each of the form ($x^{(i)}$,$y^{(i)}$).

- $M_{train}$ represents the training examples. Similarly, $M_{test}$ and $M_{dev}$ represent the testing and development examples respectively. Let $m$ represent the number of training examples in $M$.

- $X$ represents the matrix of $x^{(i)}$ vectors, stacked as columns:

$$\begin{bmatrix} \vdots & \vdots & & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & & \vdots \end{bmatrix}$$

In that sense, $X$ is of size $n_x\times m$ where $n_x$ is the length of any $x^{(i)}$ vector, 12288 in this problem, and $m$ is the size of the example set $M$. More formally: $X \in \mathbb{R}^{(n_x , m)}$

- $Y$ is the set of outputs. Like $X$, its entries are stacked side by side, one column per example:

$$\begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}$$

In this regard, $Y$ will be defined more formally as: $Y \in \mathbb{R}^{(1 , m)}$
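As a rough illustration of these shapes, the sketch below stacks a small batch of placeholder images into $X$ and their labels into $Y$. The batch size, the random pixel data, and the label values are all assumptions made up for the example.

```python
import numpy as np

m = 5                  # assumed number of examples
n_x = 64 * 64 * 3      # length of each unrolled image vector

# Placeholder data: m random "images" and made-up cat / no-cat labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(m, 64, 64, 3), dtype=np.uint8)
labels = np.array([1, 0, 0, 1, 1])

# Each image becomes one column of X; each label becomes one column of Y.
X = images.reshape(m, -1).T   # shape (n_x, m)
Y = labels.reshape(1, m)      # shape (1, m)

print(X.shape, Y.shape)  # (12288, 5) (1, 5)
```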


# Logistic Regression

Given an input feature vector $x$ representing an image, the goal is to predict a probability score $$\hat{y} = P(y=1|x)$$ that the image contains a cat.

The parameters of the model are the weights, $w$, and the bias, $b$.

The formula is obtained by passing the linear regression output through the sigmoid function. Since $w^Tx + b$ alone can be any real number, the sigmoid squashes it into the range between 0 and 1 so that it can be read as a probability: $$\hat{y} = \sigma(w^Tx + b)$$ where $$\sigma(z) = \frac{1}{1+e^{-z}}$$

### Some notes about $\sigma (z)$ function

- If $z$ is very large, then $\sigma(z)$ becomes very close to 1.
- If $z$ is very small, i.e. a large negative number, then $\sigma(z)$ becomes very close to 0.
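The sigmoid and the prediction formula above can be sketched in NumPy as below. The function names `sigmoid` and `predict_proba`, and the tiny three-feature input, are illustrative assumptions rather than anything fixed by the course.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Compute y_hat = sigmoid(w^T x + b) for column vectors w and x."""
    return sigmoid(np.dot(w.T, x) + b)

# Behaviour at the extremes described in the notes above.
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0

# Hypothetical 3-feature example with all-zero weights, so z = 0 and y_hat = 0.5.
w = np.zeros((3, 1))
x = np.array([[1.0], [2.0], [3.0]])
y_hat = predict_proba(w, 0.0, x)
```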

## Logistic Regression Training

### The loss function derivation

The squared-error loss used in linear regression, $$L(\hat{y},y) = \frac{1}{2}(\hat{y}-y)^2$$ is not suitable here because the resulting optimization problem is not convex, i.e. the gradient descent algorithm is not guaranteed to find the global minimum. The non-convexity comes from the sigmoid function $\sigma(z)$ inside $\hat{y}$.

The following loss function is used in logistic regression instead: $$ L(\hat{y},y) = - (y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}) $$

Some intuition for why this loss function works:

- If $y=1$, the second term is zero and the loss reduces to $-\log{\hat{y}}$. To minimize it, $\hat{y}$ should be as large as possible.
- If $y=0$, the first term is zero and the loss reduces to $-\log{(1-\hat{y})}$. To minimize it, $(1-\hat{y})$ should be as large as possible, hence $\hat{y}$ should be as small as possible.

Hence, if the true value $y$ is 1, then $\hat{y}$ is pushed to be as large as possible, i.e. closer to 1. Likewise, if $y = 0$, then $\hat{y}$ is pushed to be as small as possible, i.e. closer to zero.
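A small numerical check of this intuition is sketched below. The function name `loss` and the probe values 0.9 and 0.1 are arbitrary choices for illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Logistic regression loss for one prediction y_hat and true label y."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label 1: a confident correct prediction is cheap, a confident wrong one is costly.
print(loss(0.9, 1))  # about 0.105
print(loss(0.1, 1))  # about 2.303

# True label 0: the situation is mirrored.
print(loss(0.1, 0))  # about 0.105
print(loss(0.9, 0))  # about 2.303
```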

### The Cost function derivation

$$J(w,b) = \frac{1}{m} \sum^m_{i=1}L(\hat{y}^{(i)},y^{(i)})$$ 

where $L$ is the loss function derived in the previous subsection, so the cost is the average loss over the $m$ training examples. Substituting the loss gives the final formula:

$$J(w,b) = -\frac{1}{m}\sum^{m}_{i=1}(y^{(i)}\log(\hat{y}^{(i)})+(1-y^{(i)})\log(1-\hat{y}^{(i)}))$$
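The cost formula translates almost directly into NumPy. In the sketch below, `cost` is a hypothetical helper name, and the predictions and labels are made-up row vectors of shape $(1, m)$.

```python
import numpy as np

def cost(Y_hat, Y):
    """Average logistic loss over the m examples stored as columns."""
    m = Y.shape[1]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m

# Made-up labels with an uninformative prediction of 0.5 for every example.
Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.5, 0.5, 0.5]])

J = cost(Y_hat, Y)  # each example contributes -log(0.5), so J equals log(2)
```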

# Gradient Descent

The key part of the gradient descent algorithm is to iteratively update the weights and bias until some convergence threshold is met or a predefined number of iterations is reached. The update rule is $$w := w- \alpha \frac{\partial J(w,b)}{\partial w} $$ $$ b := b-\alpha \frac{\partial J(w,b)}{\partial b} $$

where:
- $\frac{\partial J(w,b)}{\partial w}$ is the partial derivative of the cost function with respect to the weights parameter, $w$. Similarly, $\frac{\partial J(w,b)}{\partial b}$ is the partial derivative with respect to the bias, $b$.
- $\alpha$ is a parameter that controls the update rate. It is usually called the learning rate.
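To make the update rule concrete, here is a minimal sketch on a toy cost $J(w,b) = (w-3)^2 + (b+1)^2$. This toy cost is an assumption chosen only because its minimum at $w=3$, $b=-1$ is easy to verify; it is not the logistic regression cost. The learning rate and iteration count are also arbitrary.

```python
def gradients(w, b):
    """Partial derivatives of the toy cost J(w, b) = (w - 3)^2 + (b + 1)^2."""
    dw = 2 * (w - 3)
    db = 2 * (b + 1)
    return dw, db

alpha = 0.1            # learning rate (assumed value)
num_iterations = 100   # predefined number of iterations (assumed value)

w, b = 0.0, 0.0        # arbitrary starting point
for _ in range(num_iterations):
    dw, db = gradients(w, b)
    w = w - alpha * dw   # w := w - alpha * dJ/dw
    b = b - alpha * db   # b := b - alpha * dJ/db

print(w, b)  # approaches 3 and -1
```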



# The Computation Graph

The computation graph organizes the computation of a function and its derivatives into two passes: a forward pass that computes the value and a backward pass that computes the derivatives. Consider the following function $$j(a,b,c)=3(a+bc)$$ We can organize it as the following computation graph:

<figure class="image">
    <center><img src="imgs/computation_graph.png"/></center>
    <center><figcaption> Fig. 1: The Computation Graph </figcaption></center>
</figure>

The intermediate variables $u$ and $v$ are used to facilitate the computation: $u = bc$ and $v = a + u$, so that $j = 3v$.

The derivative of $j$ with respect to $v$ is $\frac{dj}{dv} = 3$. The derivative $\frac{dj}{du}$ is computed using the chain rule as $\frac{dj}{dv} \times \frac{dv}{du}$, which is $3 \times 1 = 3$, and so on backwards through the graph until reaching the variables $a$, $b$, and $c$.
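Both passes can be traced by hand in a few lines of Python. The sketch assumes the intermediate nodes are $u = bc$ and $v = a + u$, which is consistent with the derivatives quoted above, and uses arbitrary input values.

```python
# Arbitrary input values for illustration.
a, b, c = 5.0, 3.0, 2.0

# Forward pass: compute each node from left to right.
u = b * c        # u = bc
v = a + u        # v = a + u
j = 3 * v        # j = 3v

# Backward pass: apply the chain rule from right to left.
dj_dv = 3.0
dj_du = dj_dv * 1.0   # dv/du = 1
dj_da = dj_dv * 1.0   # dv/da = 1
dj_db = dj_du * c     # du/db = c
dj_dc = dj_du * b     # du/dc = b

print(j, dj_da, dj_db, dj_dc)  # 33.0 3.0 6.0 9.0
```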

# Logistic Regression Computation Graph

Consider the following computation graph for logistic regression on an example with two features, $x_1$ and $x_2$.

<figure class="image">
    <center><img src="./imgs/logestic_regression_computation_graph.png" /></center>
    <center><figcaption> Fig. 2: The Logistic Regression Computation Graph </figcaption></center>
</figure>


Note that $\hat{y}$ is replaced with $a$ in the computation graph purely for presentational convenience. To compute the derivative of the loss function $l$ with respect to $a$, start from the loss:

$$ l(a,y) = -(y\log{a}+(1-y)\log{(1-a)}) $$

Differentiating with respect to $a$ gives:

$$ \frac{\partial l(a,y)}{\partial a} = - \frac{y}{a} + \frac{1-y}{1-a} $$

Similarly, the derivative of the loss function with respect to $z$ follows from the chain rule, using the sigmoid derivative $\frac{\partial a}{\partial z} = a(1-a)$:

$$ \frac{\partial l(a,y)}{\partial z} = \frac{\partial l(a,y)}{\partial a} \times \frac{\partial a}{\partial z}$$

$$ = \left(-\frac{y}{a}+\frac{1-y}{1-a}\right) \times a(1-a) = a-y$$

Finally, the derivatives with respect to the parameters $w_1$, $w_2$, and $b$ are:

$$\frac{\partial l}{\partial w_1} = x_1 \times(a-y)$$ 

$$\frac{\partial l}{\partial w_2} = x_2 \times (a-y)$$

$$ \frac{\partial l}{\partial b} = (a-y) $$ 
 
where $a-y$ is the value of $\frac{\partial l}{\partial z}$
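The single-example derivatives can be computed step by step as in the sketch below. The feature values, label, and starting parameters are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example with two features.
x1, x2, y = 1.0, 2.0, 1.0
# Hypothetical current parameter values.
w1, w2, b = 0.1, -0.2, 0.0

# Forward pass through the graph.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass using the formulas above.
dz = a - y       # dl/dz
dw1 = x1 * dz    # dl/dw1
dw2 = x2 * dz    # dl/dw2
db = dz          # dl/db
```

A finite-difference check, nudging one parameter slightly and watching how the loss changes, is a common way to confirm such hand-derived gradients.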

Finally, this single-example derivation can be generalized to a dataset of $m$ examples. For $w_1$:

$$ \frac{\partial J(w,b)}{\partial w_1} = \frac{1}{m} \sum^m_{i=1} \frac{\partial l(a^{(i)},y^{(i)})}{\partial w_1} $$

The remaining parameters, $w_2$ and $b$, have similar derivations over the $m$ examples.
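Written as an explicit loop, the averaging over $m$ examples might look like the following sketch. The toy dataset and the zero starting parameters are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy dataset: m = 4 examples with two features each.
x1 = [0.5, 1.5, -1.0, 2.0]
x2 = [1.0, -0.5, 0.3, 0.8]
y = [1, 0, 0, 1]
m = len(y)

w1, w2, b = 0.0, 0.0, 0.0  # assumed starting parameters

# Accumulate the cost and the per-example derivatives, then average.
J = dw1 = dw2 = db = 0.0
for i in range(m):
    z = w1 * x1[i] + w2 * x2[i] + b
    a = sigmoid(z)
    J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
    dz = a - y[i]
    dw1 += x1[i] * dz
    dw2 += x2[i] * dz
    db += dz

J /= m
dw1 /= m
dw2 /= m
db /= m
```

The resulting `dw1`, `dw2`, and `db` are the quantities that a gradient descent step would multiply by the learning rate.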

# Vectorization and Broadcasting

Vectorization refers to expressing computations as whole-array (matrix and vector) operations instead of explicit loops, so that they can take advantage of the parallelism available in the processor architecture. It is usually achieved through specialized packages such as `numpy` for Python. Examples of such operations are the dot product and matrix addition.
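As one possible illustration, the logistic regression gradients over all $m$ examples can be computed with dot products instead of a Python loop. The toy data, the zero starting parameters, and the variable names below are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: n_x = 2 features (rows), m = 4 examples (columns).
X = np.array([[0.5, 1.5, -1.0, 2.0],
              [1.0, -0.5, 0.3, 0.8]])
Y = np.array([[1, 0, 0, 1]])
m = X.shape[1]

w = np.zeros((2, 1))  # assumed starting weights
b = 0.0               # assumed starting bias

# One dot product computes z for every example at once.
Z = np.dot(w.T, X) + b        # shape (1, m)
A = sigmoid(Z)                # shape (1, m)
dZ = A - Y                    # dl/dz for every example

# Averaged gradients, again without an explicit loop.
dw = np.dot(X, dZ.T) / m      # shape (n_x, 1)
db = np.sum(dZ) / m
```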

Broadcasting refers primarily to applying an operation between a scalar and a vector, for example adding, subtracting, or multiplying a vector by a scalar, so that the scalar is applied to every element while exploiting the parallelism features of the processor.
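A short NumPy sketch of scalar-to-vector broadcasting follows. The values are arbitrary, and the last lines show the same idea applied to adding a single bias to a row of pre-activation values.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])

# The scalar is applied to every element of the vector.
added = v + 10       # [11., 12., 13.]
scaled = v * 2       # [2., 4., 6.]
shifted = v - 0.5    # [0.5, 1.5, 2.5]

# Same idea in the logistic regression setting: one scalar bias b
# is added to a whole row of values, one per example.
wx = np.array([[0.2, -1.0, 0.7]])  # made-up values of w^T x per example
b = 0.5
Z = wx + b
```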