# Artificial Neural Networks

Previously, we looked at the case where $h_{\theta}(x) = \theta^{T}x$, as in linear regression (logistic regression passes the same $\theta^{T}x$ through a sigmoid). This model is linear in $\theta$. Artificial neural networks (ANNs) are models that are non-linear in both the parameters $\theta$ and the inputs $x$.

Suppose $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ are the training examples, with $y^{(i)} \in \mathbb{R}$ and $h_{\theta}(x) \in \mathbb{R}$.

If we define the least squares cost function for the $i$-th example as $J^{(i)}(\theta) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$, then we can also define the mean-square cost function for the dataset as:

$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n}J^{(i)}(\theta) $$

which looks just the same as for linear regression except there is now a constant $\frac{1}{n}$ in front.
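
As a minimal sketch of these two definitions, the snippet below evaluates the per-example cost and the dataset cost. It assumes a linear hypothesis $h_{\theta}(x) = \theta^{T}x$ and uses a made-up three-point dataset and parameter vector; the names `per_example_cost` and `dataset_cost` are illustrative only.

```python
import numpy as np

# Hypothetical toy data: the first column of X is the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([0.0, 2.0])  # assumed parameter values, for illustration only


def per_example_cost(theta, x_i, y_i):
    """J^(i)(theta) = 1/2 * (h_theta(x_i) - y_i)^2, with h_theta(x) = theta^T x."""
    return 0.5 * (x_i @ theta - y_i) ** 2


def dataset_cost(theta, X, y):
    """J(theta) = (1/n) * sum of the per-example costs."""
    n = len(y)
    return sum(per_example_cost(theta, X[i], y[i]) for i in range(n)) / n


costs = [per_example_cost(theta, X[i], y[i]) for i in range(len(y))]
J = dataset_cost(theta, X, y)
print(costs, J)
```

Here every prediction misses its target by exactly 1, so each per-example cost is 0.5 and the dataset cost is their mean, also 0.5.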

## Stochastic Gradient descent

Gradient descent is an iterative technique for finding the minimum of a function. When the function has a closed-form derivative, a minimum can often be found simply by setting the first derivative to zero and solving for the parameters. An alternative is to imagine our cost function as a surface in space, and to place a ball at a random point on it. If we let go of the ball, it will roll in the direction of steepest descent, namely the direction opposite to $\nabla_{\theta}J(\theta)$.

Let's imagine that, given a function $f(x,y)$, we start at a particular point $(x_{0}, y_{0})$. Then we can move a small amount in the direction of steepest descent to find a new point $(x_{1}, y_{1})$, i.e., $\vec{x_{1}} = \vec{x_{0}} - \alpha\nabla{f(\vec{x_{0}})}$, where $\alpha$ controls the amount by which we move down the curve, and is known as the **learning rate**. We can repeat this process using $(x_{1}, y_{1})$ as our input and calculating $(x_{2}, y_{2})$. By repeating this process many times with a suitably small $\alpha$, we converge towards a (local) minimum. We can summarize this process in the following equation:

$$ \theta \coloneqq  \theta - \alpha\nabla_{\theta}J(\theta) $$

**A simple example for demonstration**: Let $f(x,y) = 4x^{2} + y^{2}$. We know that $\nabla{f(x,y)} = (8x, 2y)$, and if we set this to zero and solve for $x$ and $y$ we find the minimum point is $(0,0)$. However, instead of this approach, let's choose a starting position, $(x, y) = (1, 1)$, and run gradient descent. Let's set the learning rate to 0.1 and see how close we get to the minimum point after 10 iterations.

In [8]:
alpha = 0.1 # learning rate
(x, y) = (1, 1) # starting points of the function
n_iter = 10 # number of iterations

for i in range(n_iter):
    partial_x = 8*x
    partial_y = 2*y
    x = x - alpha*partial_x
    y = y - alpha*partial_y
    print(x, y)



0.19999999999999996 0.8
0.03999999999999998 0.64
0.007999999999999993 0.512
0.0015999999999999981 0.4096
0.00031999999999999954 0.32768
6.399999999999988e-05 0.26214400000000004
1.2799999999999972e-05 0.20971520000000005
2.559999999999994e-06 0.16777216000000003
5.119999999999987e-07 0.13421772800000004
1.023999999999997e-07 0.10737418240000003


As the output shows, we get closer and closer to the minimum point, although the y-coordinate converges much more slowly than the x-coordinate: each step multiplies $x$ by $1 - 8\alpha = 0.2$ but $y$ by only $1 - 2\alpha = 0.8$. Playing with the learning rate reveals a few interesting features:
- too high (0.5) and the x-coordinate overshoots: each step multiplies $x$ by $-3$, so it flips sign and grows without bound
- too low (0.01) and convergence will take a very long time, requiring many more iterations.
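
The effect of the learning rate can be checked with a short sketch that reruns the same descent on $f(x,y) = 4x^{2} + y^{2}$ for several values of $\alpha$. The helper name `run_gd` is an illustrative choice; the update rule is the one used in the cell above.

```python
def run_gd(alpha, n_iter=10, start=(1.0, 1.0)):
    """Run gradient descent on f(x, y) = 4x^2 + y^2 and return the final point."""
    x, y = start
    for _ in range(n_iter):
        # The gradient is (8x, 2y); both coordinates are updated simultaneously.
        x, y = x - alpha * 8 * x, y - alpha * 2 * y
    return x, y


for alpha in (0.5, 0.1, 0.01):
    x, y = run_gd(alpha)
    print(f"alpha={alpha}: x={x:.6g}, y={y:.6g}")
```

Because each step multiplies $x$ by $(1 - 8\alpha)$ and $y$ by $(1 - 2\alpha)$, the x-coordinate only shrinks when $|1 - 8\alpha| < 1$, i.e. $\alpha < 0.25$. With $\alpha = 0.5$ it is multiplied by $-3$ every step, while with $\alpha = 0.01$ both coordinates have barely moved after 10 iterations.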

**Using real data**

The above example uses a function whose gradient we can write down directly. With training data, the cost function, and hence $\nabla_{\theta}J(\theta)$, depends on the data, so the gradient has to be computed from the training examples. We are trying to find $\theta$ such that this gradient is zero, and gradient descent lets us do so iteratively.

When studying linear regression, we used matrix algebra to find the normal equation. The normal equation is an analytical solution to the case that $\nabla_{\theta}J(\theta) = 0$. In fact, just before setting this value to zero, we have:

$$\vec\nabla_{\theta}J(\theta) =  X^{T}X\theta - X^{T}\vec{y} $$

This is the same as $X^{T}(X\theta - \vec{y})$: calculate the error = prediction - observation, then multiply the transpose of the data matrix by the error.
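
The equivalence of the two forms can be checked numerically. The sketch below uses small random data (an assumption made purely for illustration) and also compares against a finite-difference estimate of the gradient of the summed cost $\frac{1}{2}\sum_{i}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$. For the mean cost, both forms simply pick up a factor of $\frac{1}{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the check is repeatable
X = np.c_[np.ones(5), rng.normal(size=5)]    # hypothetical design matrix with intercept
y = rng.normal(size=5)
theta = rng.normal(size=2)

# Form 1: X^T X theta - X^T y
grad_normal_form = X.T @ X @ theta - X.T @ y

# Form 2: X^T (prediction - observation)
errors = X @ theta - y
grad_error_form = X.T @ errors


def cost(t):
    """Summed least-squares cost, 1/2 * ||X t - y||^2."""
    return 0.5 * np.sum((X @ t - y) ** 2)


# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(grad_normal_form, grad_error_form))
print(np.allclose(grad_numeric, grad_error_form, atol=1e-5))
```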

This means we can use the data directly in the gradient descent algorithm. The formula changes with the cost function, but the process is essentially the same. As an example, let's repeat the linear regression of sales ~ advertising to see that gradient descent can find the coefficients. The following uses batch gradient descent, so the $(1/n)$ term becomes important: the cost function is essentially the mean squared error (MSE), which is the residual sum of squares (RSS) divided by the number of data points (below, labelled m).

After some tweaking, a learning rate of 0.001 and 10000 iterations gave coefficients close to those obtained from the normal equation for linear regression in chapter 5.

In [98]:
import pandas as pd
import numpy as np

# Compute the cost function to monitor its decrease during gradient descent
def compute_cost(X, y, coefs):
    m = len(y)
    prediction = X @ coefs
    cost = (1 / (2 * m)) * np.sum((prediction - y) ** 2)
    return cost

# Prepare a function for gradient descent
def gradient_descent(n_iter, X, y, coefs, alpha):
    m = len(y) # The number of data points

    for i in range(n_iter):
        # # Randomly sample an index j
        # j = np.random.randint(0, m)

        # # Compute the gradient using the j-th data point - this is stochastic gradient descent
        # X_j = X[j, :].reshape(1, -1) # Shape (1, n_features)
        # y_j = y[j] # Shape (1,)
        # gradient = X_j.T @ (X_j @ coefs - y_j) # Shape (n_features, 1)
        # coefs = coefs - alpha * gradient

        # Compute the prediction
        predictions = X @ coefs # Shape (m,)
        
        # Compute the gradient using all data points
        errors = predictions - y #Shape (m,)
        gradient = (1/m) * (X.T @ errors) # Shape (n_features, )

        # Update coefficients
        coefs = coefs - alpha * gradient

        cf = compute_cost(X, y, coefs)
        # print(f"Iteration {i + 1}: Cost = {cf}, Coefs = {coefs}")
    return coefs



# Load the data in
df = pd.read_csv("Data/Advertising.csv")

# Keep the mean and standard deviations of the data for conversion back to original scale later
meanX = df['TV'].mean()
meanY = df['sales'].mean()
stdX = df['TV'].std()
stdY = df['sales'].std()

# Normalize / Standardize the data
df[['TV', 'sales']] = (df[['TV', 'sales']] - df[['TV', 'sales']].mean()) / df[['TV', 'sales']].std()

# Prepare the data
X = np.c_[np.ones(df.shape[0]), df['TV'].values] # Add intercept terms
y = df['sales'].values

# Initialize coefficients as a float numpy array (intercept, slope)
coefs = np.zeros(2)

# Now run the calculation
new_coefs = gradient_descent(10000, X, y, coefs, 0.001)

# Convert the coefficients back to their original scale.
# The intercept in standardized space is approximately zero, so it is omitted here.
beta1 = new_coefs[1] * stdY / stdX
beta0 = meanY - (beta1*meanX)
print(beta1, beta0)


0.047534382832213604 7.0329255123942325
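
The commented-out lines in `gradient_descent` describe the stochastic variant, in which each update uses a single randomly chosen data point instead of the whole dataset. The following is a minimal sketch of that idea. It runs on synthetic data rather than the Advertising data, and the function name `sgd`, the learning rate, and the iteration count are illustrative assumptions, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): y = 0.8 * x + small noise
m = 200
x = rng.normal(size=m)
y = 0.8 * x + 0.1 * rng.normal(size=m)
X = np.c_[np.ones(m), x]  # add intercept column


def sgd(n_iter, X, y, coefs, alpha, rng):
    """Stochastic gradient descent: each update uses one randomly sampled row."""
    for _ in range(n_iter):
        j = rng.integers(0, len(y))           # random index in [0, m)
        error_j = X[j] @ coefs - y[j]         # scalar: prediction - observation
        coefs = coefs - alpha * error_j * X[j]
    return coefs


sgd_coefs = sgd(20000, X, y, np.zeros(2), 0.01, rng)

# Reference solution from ordinary least squares
exact_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(sgd_coefs, exact_coefs)
```

Because each step sees only one noisy example, the stochastic estimate fluctuates around the least-squares solution rather than settling exactly on it. A smaller learning rate reduces that fluctuation at the cost of slower progress.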


## What is a neural network?
...

## Activation Functions
...

## Vectorization
...

## Backpropagation
...