# Understanding the behavior of Linear Activation Functions in a Simple Autoencoder
Cameron Farzaneh

The goal of this project is to gain insight into why a Linear Activation Function is not able to successfully reconstruct this specific input data, and why it behaves the way it does when reducing the dimensions from two down to one in latent space. The purpose of this experiment is to gain further insight into Autoencoders and the basic structure of Neural Networks, and to gain a deeper understanding of the Mathematics involved in the entire process.

In this experiment, I was not able to successfully reconstruct the Input Data. My goal is to understand why this is the case.

# The Dataset

The dataset consists of vectors with signed magnitudes between -3 and 3. Each vector is a scalar multiple of one of two unit vectors, which are 45 degrees from the X and Y axes and 90 degrees from each other.

This is what the dataset looks like:
![title](img/dataset.png)

Now, to construct this dataset, we are first creating basis vectors. These are unit vectors so we can easily control the magnitude. Our basis vectors, $U_1$ and $U_2$ are:
$$U_1 = \left\langle \frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \right\rangle$$
$$U_2 = \left\langle -\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}} \right\rangle$$

Now, we can multiply the basis vectors by magnitudes randomly picked between -3 and 3. Doing this, we can construct the dataset above. Our dataset size is 10,000. We can simply store this in a NumPy array.
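The construction described above can be sketched in NumPy. This is a minimal sketch rather than the project's actual code: the function name `make_dataset`, the equal-probability choice between the two lines, the uniform sampling, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np

def make_dataset(n_points=10_000, max_magnitude=3.0, seed=0):
    """Return (points, labels); every point lies on one of the two diagonal lines."""
    rng = np.random.default_rng(seed)
    # Unit basis vectors, 45 degrees from the axes and 90 degrees from each other.
    u1 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # purple line
    u2 = np.array([-1.0, 1.0]) / np.sqrt(2.0)  # yellow line
    basis = np.stack([u1, u2])
    # Assumption: each point picks one of the two lines with equal probability.
    labels = rng.integers(0, 2, size=n_points)
    # Signed magnitudes drawn uniformly between -3 and 3.
    magnitudes = rng.uniform(-max_magnitude, max_magnitude, size=n_points)
    points = magnitudes[:, None] * basis[labels]
    return points, labels

points, labels = make_dataset()
print(points.shape)  # (10000, 2)
```

Scaling unit vectors means the length of each point equals the absolute value of its sampled magnitude, so no point is farther than 3 from the origin.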

# The Autoencoder

This is what the Autoencoder looks like:
![title](img/network.png)

In this diagram, $W_1$ and $W_2$ are both weights. They are initialized randomly. $B_z$, $B_1$, and $B_2$ are our biases. $\tilde{X}_1$ and $\tilde{X}_2$ are our output neurons. The weights $W_1$ and $W_2$ are shared; however, they are transposed between Z and the reconstruction layer. Altogether, there are 3 biases and 2 weights.

This autoencoder has one neuron in the hidden layer and two neurons in each of the input and output layers. The goal of this autoencoder is to reduce the dimensionality from two (the dataset) to one dimension in latent space, and then reconstruct the same vectors.

The autoencoder works by taking in two inputs, $X_1$ and $X_2$. $X_1$ and $X_2$ represent the X and Y components of a single vector (either Purple or Yellow).
So $X_1$ could be the Y component and $X_2$ could be the X component (or vice versa). Because our autoencoder has only one node in the middle, the transformation from the two nodes to Z is simply a dot product plus a bias.

**Note: this is only the case because we are reducing from two neurons to one! Typically, this step would be matrix multiplication.**

Our formula for Z is:
$$Z = \sum\limits_{i=1}^{2}{X_iW_i} + B_z$$

<center>or</center>

$$Z = X_1W_1+X_2W_2+B_z$$
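As a small illustration of the formula above, the encoder step can be written as a dot product in NumPy. The function name `encode` and the parameter values are hypothetical, chosen only so the arithmetic is easy to follow; they are not the trained weights.

```python
import numpy as np

def encode(x, w, b_z):
    """Linear encoder: Z = X_1*W_1 + X_2*W_2 + B_z for each row of x."""
    return x @ w + b_z

# Hypothetical parameter values (not taken from the trained network).
w = np.array([0.5, -0.25])
b_z = 0.1
x = np.array([[2.0, 2.0],    # a point on the purple line
              [-1.0, 1.0]])  # a point on the yellow line

z = encode(x, w, b_z)
# The same computation written out term by term, as in the second formula.
explicit = x[:, 0] * w[0] + x[:, 1] * w[1] + b_z
print(np.allclose(z, explicit))  # True
```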

Now, we must look at our possibilities as inputs for $X_1$ and $X_2$.
If the input is a point on the purple line, then $X_1$ and $X_2$ would either both be positive, or both be negative.
Similarly, if the input is a point on the yellow line, then $X_1$ is either negative and $X_2$ is positive, or $X_1$ is positive and $X_2$ is negative.

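We can write a point on the yellow line as:
$$\frac{a}{\sqrt{2}}\langle -1,1 \rangle$$
and a point on the purple line as:
$$\frac{a}{\sqrt{2}}\langle 1,1 \rangle$$
where $a$ is the signed magnitude, between -3 and 3.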

Because of this, our Z function will look different depending on the input point. 

# The Results

Given the way the autoencoder was built, it was not able to successfully reconstruct both vectors. As you can see in the diagram below, only one line was successfully reconstructed. This must mean that the Autoencoder was only able to learn one of the vectors.

To optimize the cost function, the Adam Optimizer was used, as it was the fastest in comparison to Gradient Descent and Adagrad.

<img src="img/results/result1.png" width="600">

In latent space, it is clear that the input data for the purple line was successfully transformed into one dimension. This is not the case for the yellow line: all of its points appear to be clustered around the point 0.

The distance between the points in latent space should correspond to the distance in the input data. This is why the purple line, in latent space, looks almost identical to the input data and the reconstruction. The purple line appears to keep the same distance between the points.

But why is the yellow line being mapped to only 0? Why isn't the autoencoder able to learn both lines, and maintain the distance between points in latent space for both lines? To answer this question, we must look at the activation function to determine why it is only learning one of the two vector lines.

# The Activation Function

As mentioned earlier, our formula for Z is:
$$Z = \sum\limits_{i=1}^{2}{X_iW_i} + B_z$$

<center>or</center>

$$Z = X_1W_1+X_2W_2+B_z$$

However, this varies with the input vector. Based on our dataset, there are four possibilities for the vectors being passed into the Autoencoder: two purple vectors, where both components are either positive or negative, and two yellow vectors, where one component is positive and one is negative.

Because of this, if the vector passed into the Autoencoder contained two positive components, the vector would be on the purple line, in the first quadrant. Passing this vector into the Neural Network, our activation function Z would now be:

<center>$Z_p = (\frac{a}{\sqrt{2}})W_1+(\frac{a}{\sqrt{2}})W_2+B_z$ &emsp; or &emsp; $Z_p = (-\frac{a}{\sqrt{2}})W_1+(-\frac{a}{\sqrt{2}})W_2+B_z$</center>

Here $a$ is a magnitude between 0 and 3 (positive). The first form applies when both components are positive. If both components passed in are negative, the vector is in the third quadrant and the second form applies, which is the same formula with the sign of each component flipped.

Similarly, if the vector passed into the Autoencoder contained two components, one of them positive, and one of them negative, the vector would exist on the yellow line.

The formula for this function would be: 

<center>$Z_y = (-\frac{a}{\sqrt{2}})W_1+(\frac{a}{\sqrt{2}})W_2+B_z$ &emsp; or &emsp; $Z_y = (\frac{a}{\sqrt{2}})W_1+(-\frac{a}{\sqrt{2}})W_2+B_z$</center>
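Factoring out the common term makes the difference between the two cases easier to see: the purple expressions depend on the weights only through $W_1 + W_2$, and the yellow expressions only through $W_2 - W_1$. The sketch below checks this numerically, using a signed $a$ so that both forms of each formula are covered. The function names and parameter values are hypothetical. In particular, choosing $W_1 = W_2$ is an assumption made to illustrate one way the yellow line can collapse to a single latent value, not a statement about what the trained network actually learned.

```python
import numpy as np

def z_purple(a, w1, w2, b_z):
    # Z_p with both components equal to a / sqrt(2).
    return (a / np.sqrt(2)) * w1 + (a / np.sqrt(2)) * w2 + b_z

def z_yellow(a, w1, w2, b_z):
    # Z_y with components (-a / sqrt(2), a / sqrt(2)).
    return (-a / np.sqrt(2)) * w1 + (a / np.sqrt(2)) * w2 + b_z

a = np.linspace(-3.0, 3.0, 7)   # signed magnitudes along each line
w1, w2, b_z = 0.7, 0.7, 0.0     # hypothetical values with W_1 == W_2

zp = z_purple(a, w1, w2, b_z)
zy = z_yellow(a, w1, w2, b_z)

# Factored forms: the purple line sees W_1 + W_2, the yellow line sees W_2 - W_1.
print(np.allclose(zp, a / np.sqrt(2) * (w1 + w2) + b_z))  # True
print(np.allclose(zy, a / np.sqrt(2) * (w2 - w1) + b_z))  # True

print(zp)  # changes linearly with a
print(zy)  # constant (equal to b_z) because W_2 - W_1 is zero here
```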

Now that we know what the Z function is in relation to the four possible vectors passed into the Autoencoder, we must look at the output nodes, $\tilde{X}_1$ and $\tilde{X}_2$.

# The Output Nodes

The output nodes, $\tilde{X}_1$ and $\tilde{X}_2$, are unique in that they are just scaled representations of the output of the Z function. This is because the output of Z is in one dimension and is being expanded back into two dimensions.

In order to create these output nodes, we must transpose the weights $W_1$ and $W_2$. We then multiply the transposed weights by the output of the Z function, and add the bias corresponding to the output node.

With this being said, the function for $\tilde{X}_1$ and $\tilde{X}_2$ looks like:

<center>$\tilde{X}_1 = ZW_1 + b_1$ &emsp; and &emsp; $\tilde{X}_2 = ZW_2 + b_2$</center>
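The encoder and the output nodes can be combined into a single forward pass. The sketch below assumes NumPy and uses hypothetical names (`forward`, `b_out`) and hypothetical parameter values; it is not the project's TensorFlow code.

```python
import numpy as np

def forward(x, w, b_z, b_out):
    """Tied-weight linear autoencoder: returns (z, x_tilde)."""
    z = x @ w + b_z                    # encoder: one latent value per input row
    x_tilde = z[:, None] * w + b_out   # output nodes: X~_i = Z * W_i + b_i
    return z, x_tilde

# Hypothetical parameters, for illustration only.
w = np.array([0.5, 0.5])
b_z = 0.0
b_out = np.array([0.0, 0.0])
x = np.array([[1.0, 1.0],     # a point on the purple line
              [-1.0, 1.0]])   # a point on the yellow line

z, x_tilde = forward(x, w, b_z, b_out)
print(z)        # one latent value per input point
print(x_tilde)  # each row is Z times (W_1, W_2), plus the output biases
```

Because the same `w` is used in both steps, every reconstruction lies on the line through `b_out` in the direction of `w`.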

The weights $W_1$ and $W_2$ are the same weights used beforehand to create the Z function. But remember, the weights are transposed so that they can be multiplied by Z.

Now that we know how the Z function works, and how the output nodes are reconstructed, we can look into why the Autoencoder is learning only one of the vector lines. To do this, we must look at the cost function. Doing so will show us what the Autoencoder is learning.

# The Cost Function

If you want to know what a Neural Network is doing, you always look at the cost function. 

The cost function is defined as the squared $\ell^{2}$-norm of the input minus the reconstruction, where the reconstruction $f_\theta(x)$ depends on the parameters theta. This is what the cost function looks like:


$$cost = \left \| x-f_{\theta }\left ( x \right ) \right \|_{2}^{2} \hspace{4pt} = \sum_{i=1}^{2}(x_i-\tilde{x}_i)^{2}$$
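For a concrete feel for this quantity, here is a minimal NumPy sketch of the cost for a couple of points. The function name `reconstruction_cost` and the reconstruction values are hypothetical, and averaging over examples at the end is an assumption about how a per-point cost might be aggregated over a dataset.

```python
import numpy as np

def reconstruction_cost(x, x_tilde):
    """Squared L2 norm of (x - x_tilde), computed separately for each row."""
    return np.sum((x - x_tilde) ** 2, axis=-1)

x = np.array([[1.0, 1.0], [-1.0, 1.0]])
x_tilde = np.array([[0.5, 0.5], [0.0, 0.0]])  # hypothetical reconstructions

per_example = reconstruction_cost(x, x_tilde)
print(per_example)         # costs of 0.5 and 2.0
print(per_example.mean())  # 1.25 (assumed aggregation: mean over examples)
```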

Now, let's take the derivative of this cost function and optimize it ourselves. In TensorFlow, we are using the Adam Optimizer to adjust the weights and optimize the function. If we take the derivative and optimize it, we can see why the Adam Optimizer is converging and what the weights are converging to.

The derivative of the cost function looks like:
$$\triangledown_\theta cost = -2\sum_{i=1}^{2}(X_i-\tilde{X}_i)\cdot \triangledown_\theta \tilde{X}_i$$
The minus sign appears because $\tilde{X}_i$ enters the cost as $-\tilde{X}_i$, while the input $X_i$ does not depend on theta.

Here, $\theta = \langle W_1, W_2, b_z, b_1, b_2 \rangle$, so the derivative of $\tilde{X}_i$ with respect to theta is the gradient $\triangledown_\theta \tilde{X}_i$.

Now the gradient is equal to:
$$\triangledown_\theta \tilde{X}_i = \left\langle \frac{\partial \tilde{X}_i}{\partial {W}_1}, \frac{\partial \tilde{X}_i}{\partial {W}_2}, \frac{\partial \tilde{X}_i}{\partial {b}_z}, \frac{\partial \tilde{X}_i}{\partial {b}_1}, \frac{\partial \tilde{X}_i}{\partial {b}_2} \right\rangle$$
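One way to sanity-check the chain-rule structure of this gradient, without working out each partial derivative by hand, is to compare it against finite differences. The sketch below is an illustration under assumptions: the helper names (`reconstruct`, `cost`, `numerical_jacobian`), the parameter values, and the step size are all hypothetical. It approximates $\triangledown_\theta \tilde{X}_i$ numerically, combines the result as $-2\sum_i (X_i-\tilde{X}_i)\,\triangledown_\theta \tilde{X}_i$ (the minus sign comes from differentiating $-\tilde{X}_i$ inside the square), and compares that with a direct numerical gradient of the cost.

```python
import numpy as np

def reconstruct(theta, x):
    """theta = (W_1, W_2, b_z, b_1, b_2); returns (X~_1, X~_2) for one input x."""
    w, b_z, b_out = theta[:2], theta[2], theta[3:]
    z = x @ w + b_z
    return z * w + b_out

def cost(theta, x):
    return np.sum((x - reconstruct(theta, x)) ** 2)

def numerical_jacobian(f, theta, eps=1e-6):
    """Central differences; the last axis indexes the entries of theta."""
    columns = []
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        columns.append((f(theta + step) - f(theta - step)) / (2 * eps))
    return np.stack(columns, axis=-1)

theta = np.array([0.3, -0.8, 0.1, 0.2, -0.4])  # hypothetical parameter values
x = np.array([2.0, 2.0]) / np.sqrt(2.0)        # a point on the purple line

jac = numerical_jacobian(lambda t: reconstruct(t, x), theta)   # shape (2, 5)
residual = x - reconstruct(theta, x)                           # X_i - X~_i
chain_rule_grad = -2.0 * residual @ jac                        # shape (5,)
direct_grad = numerical_jacobian(lambda t: cost(t, x), theta)  # shape (5,)

print(np.allclose(chain_rule_grad, direct_grad, atol=1e-5))  # True
```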