[[Neural Networks from Scratch]]

##### Initialise the packages

In [None]:
import micropip

await micropip.install("numpy")
await micropip.install("nnfs")
await micropip.install("matplotlib")

import matplotlib.pyplot as plt
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
from nnfs.datasets import vertical_data

##### Running the "vertical dataset"

In [None]:
nnfs.init()

X, y = vertical_data(samples=100, classes=3)

plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap='brg')
plt.show()

##### Running the "spiral dataset"

In [None]:
X, y = spiral_data(samples=100, classes=3)

plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap='brg')
plt.show()


##### Model construction, loss definition, and brute-force search over 100,000 random hidden-layer weight-bias combinations to record minimal loss and corresponding parameters (Requires the classes we created in "Implementing Loss.ipynb")

In [None]:
dense1 = Layer_Dense(2,3)
activation1 = Activation_ReLU()
dense2 = Layer_Dense(3,3)
activation2 = Activation_Softmax()

loss_function = Loss_CategoricalCrossentropy()

lower_loss = 9999999
best_dense1_weights = dense1.weights.copy()
best_dense1_biases = dense1.biases.copy()
best_dense2_weights = dense2.weights.copy()
best_dense2_biases = dense2.biases.copy()

for iteration in range(100000):
	dense1.weights = 0.05 * np.random.randn(2,3)
	dense1.biases = 0.05 * np.random.randn(1,3)
	dense2.weights = 0.05 * np.random.randn(3,3)
	dense2.biases = 0.05 * np.random.randn(1,3)

	dense1.forward(X)
	activation1.forward(dense1.output)
	dense2.forward(activation1.output)
	activation2.forward(dense2.output)

	loss = loss_function.calculate(activation2.output, y)

	predictions = np.argmax(activation2.output, axis=1)
	accuracy = np.mean(predictions==y)

	if loss < lower_loss:
		print('New set of weights found, iteration:', iteration,
		'loss:', loss, 'acc:', accuracy)
		best_dense1_weights = dense1.weights.copy()
		best_dense1_biases = dense1.biases.copy()
		best_dense2_weights = dense2.weights.copy()
		best_dense2_biases = dense2.biases.copy()
		lower_loss = loss  # update the best loss seen so far (same name as the threshold checked above)
	else:
		dense1.weights = best_dense1_weights.copy()
		dense1.biases = best_dense1_biases.copy()
		dense2.weights = best_dense2_weights.copy()
		dense2.biases = best_dense2_biases.copy()



**The random search above is INEFFICIENT: its running time grows linearly, O(n), where n is the number of random trials, and nothing guarantees that it ends anywhere near a minimum of the loss. As an illustration, with 1 million trials at 1 millisecond per loss evaluation, the total time would be 1 million milliseconds, which is 1,000 seconds or 16 minutes and 40 seconds. The 100,000 iterations used above would take about 100 seconds at the same rate.**
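The arithmetic behind that estimate can be reproduced with a few lines. This is only a back-of-the-envelope sketch: the helper `estimated_search_seconds` is a hypothetical name introduced here, and the 1 millisecond per evaluation is the assumed figure from the paragraph above, not a measurement.

```python
def estimated_search_seconds(n_trials, ms_per_evaluation=1.0):
    """Total time in seconds, assuming every trial costs the same."""
    return n_trials * ms_per_evaluation / 1000.0

# 1 million trials at an assumed 1 ms per loss evaluation
total_seconds = estimated_search_seconds(1_000_000)
minutes, seconds = divmod(total_seconds, 60)
print(f"{total_seconds:.0f} s = {minutes:.0f} min {seconds:.0f} s")
```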

##### Calculating the derivative (slope of tangent line) at various points in the graph $f(x)=2x^2$

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Define the function
def f(x):
    return 2 * x**2

# Generate x values
x = np.arange(0, 5, 0.001)
y = f(x)

# Plot the original function
plt.plot(x, y)

# Define colours for tangent lines
colors = ['k', 'g', 'r', 'b', 'c']

# Function to calculate tangent line
def approximate_tangent_line(x, approximate_derivative, b):
    return (approximate_derivative * x) + b

# Draw tangents at integer x-values from 0 to 4
for i in range(5):
    p2_delta = 0.0001
    x1 = i
    x2 = x1 + p2_delta
    y1 = f(x1)
    y2 = f(x2)
    
    print((x1, y1), (x2, y2))
    
    approximate_derivative = (y2 - y1) / (x2 - x1)
    b = y2 - (approximate_derivative * x2)
    
    to_plot = [x1 - 0.9, x1, x1 + 0.9]
    
    plt.scatter(x1, y1, c=colors[i])
    plt.plot(
        [point for point in to_plot],
        [approximate_tangent_line(point, approximate_derivative, b) for point in to_plot],
        c=colors[i]
    )
    
    print(f'Approximate derivative for f(x) where x = {x1} is {approximate_derivative}')

# Display the plot
plt.show()

For the simple function, $f(x) = 2x^2$, we didn't pay a high penalty by approximating the derivative (the slope of the tangent line) like this, and received a value that was close enough for our needs.
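To see how small that penalty is, the approximation can be compared with the exact derivative, which for $f(x) = 2x^2$ is $4x$. The sketch below reuses the step size `p2_delta` from the cell above; `analytic_derivative` is a helper name introduced here for illustration.

```python
def f(x):
    return 2 * x**2

def analytic_derivative(x):
    # d/dx (2x^2) = 4x
    return 4 * x

p2_delta = 0.0001  # same step size as in the plotting cell
errors = []
for x1 in range(5):
    x2 = x1 + p2_delta
    approximate = (f(x2) - f(x1)) / (x2 - x1)
    exact = analytic_derivative(x1)
    errors.append(abs(approximate - exact))
    print(f"x = {x1}: approximate = {approximate:.6f}, exact = {exact}, error = {errors[-1]:.2e}")
```

For this particular function the forward difference works out to $4x + 2\,\Delta$, so the error stays near $2 \times 0.0001 = 0.0002$ at every point, up to floating-point rounding.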

The *actual* function employed in our neural network is not so simple. The loss function contains all of the layers, weights, and biases - it's a massive function operating in multiple dimensions. Calculating derivatives using numerical differentiation requires multiple forward passes for a single parameter update.
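The cost of that approach can be made concrete with a small sketch. It assumes a hypothetical stand-in loss (`toy_loss`, a sum of squares) rather than the real network, and uses 21 parameters because the 2-3-3 network built earlier has 2·3 + 3 + 3·3 + 3 = 21 weights and biases. The helper `numerical_gradient` is introduced here for illustration only.

```python
import numpy as np

def numerical_gradient(loss_fn, params, delta=1e-4):
    """Forward-difference estimate of d(loss)/d(param) for every parameter."""
    base_loss = loss_fn(params)           # one evaluation at the current point
    gradient = np.zeros_like(params)
    evaluations = 1
    for i in range(params.size):
        nudged = params.copy()
        nudged.flat[i] += delta           # nudge a single parameter
        gradient.flat[i] = (loss_fn(nudged) - base_loss) / delta
        evaluations += 1                  # one extra evaluation per parameter
    return gradient, evaluations

# Hypothetical stand-in for a network's loss, not the real forward pass
toy_loss = lambda p: np.sum(p ** 2)

params = np.linspace(-1.0, 1.0, 21)
gradient, evaluations = numerical_gradient(toy_loss, params)
print("loss evaluations for one gradient estimate:", evaluations)
```

Each evaluation here would be a full forward pass over the data in a real network, so the cost of a single update grows with the number of parameters.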

To reiterate, as we quickly covered many terms: the derivative is the slope of the tangent line for a function that takes a single parameter as an input. We’ll use this ability to calculate the slopes of the loss function at each of the weight and bias points. This brings us to the multivariate function, which is a function that takes multiple parameters, and to the topic of the next chapter: the partial derivative.

##### The Partial Derivative
The partial derivative measures how much impact a single input has on a function's output. The method of calculation is the same as for derivatives explained in the previous chapter; we simply repeat this process for each of the independent inputs.

$$
\begin{gathered}
f(x, y, z) \\
\frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z} \\
\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right]
\end{gathered}
$$
In simple terms, a partial derivative tells you how much a function changes if you nudge just one input, while keeping all other inputs frozen.
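The "nudge one input, freeze the rest" idea translates directly into code. The sketch below is an illustration under assumptions: the function `f` is a made-up example, and `partial` is a hypothetical helper that uses a central difference instead of the forward difference from the earlier cell.

```python
def f(x, y, z):
    # Hypothetical example function, chosen only for illustration
    return x * y + z ** 2

def partial(func, args, index, delta=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. args[index]."""
    up = list(args)
    down = list(args)
    up[index] += delta      # nudge only the chosen input upwards...
    down[index] -= delta    # ...and downwards; all other inputs stay frozen
    return (func(*up) - func(*down)) / (2 * delta)

point = (2.0, 3.0, 4.0)
df_dx = partial(f, point, 0)   # analytically y  = 3
df_dy = partial(f, point, 1)   # analytically x  = 2
df_dz = partial(f, point, 2)   # analytically 2z = 8
print(df_dx, df_dy, df_dz)
```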

The **partial derivative of the sum** with respect to any input equals 1:
$$
f(x, y) = x + y
$$
$$
\frac{\partial}{\partial x} f(x, y) = 1
$$
$$
\frac{\partial}{\partial y} f(x, y) = 1
$$
The **partial derivative of the multiplication** operation with 2 inputs, with respect to any input, equals the other input:
$$
f(x, y) = x \cdot y
$$
$$
\frac{\partial}{\partial x} f(x, y) = y
$$
$$
\frac{\partial}{\partial y} f(x, y) = x
$$
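Both rules can be checked numerically. This is a minimal sketch; `numerical_partial`, `add` and `mul` are names introduced here for illustration, and the test point is arbitrary.

```python
def numerical_partial(func, x, y, wrt, delta=1e-6):
    """Central difference w.r.t. 'x' or 'y', keeping the other input fixed."""
    if wrt == "x":
        return (func(x + delta, y) - func(x - delta, y)) / (2 * delta)
    return (func(x, y + delta) - func(x, y - delta)) / (2 * delta)

add = lambda x, y: x + y
mul = lambda x, y: x * y

x, y = 3.0, -2.0
d_add_dx = numerical_partial(add, x, y, "x")   # rule says 1
d_add_dy = numerical_partial(add, x, y, "y")   # rule says 1
d_mul_dx = numerical_partial(mul, x, y, "x")   # rule says y
d_mul_dy = numerical_partial(mul, x, y, "y")   # rule says x
print(d_add_dx, d_add_dy, d_mul_dx, d_mul_dy)
```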
The **partial derivative of the max function of 2 variables** with respect to any of them is 1 if this variable is the biggest and 0 otherwise. An example for x:

$$
f(x, y) = \max(x, y)
$$
$$
\frac{\partial}{\partial x} f(x, y) = 1(x>y)
$$
The **derivative of the max function of a single variable and 0** equals 1 if the variable is greater than 0 and 0 otherwise:
$$
f(x) = \max(x, 0)
$$
$$
\frac{d}{dx} f(x) = 1(x > 0)
$$
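In NumPy, both max rules reduce to a comparison. The sketch below assumes array inputs; `d_max_dx` and `d_relu` are hypothetical helper names. The second one is the derivative of $\max(x, 0)$, which is the operation performed by the ReLU activation used earlier.

```python
import numpy as np

def d_max_dx(x, y):
    """Partial derivative of max(x, y) with respect to x: 1 where x > y, else 0."""
    return np.where(x > y, 1.0, 0.0)

def d_relu(x):
    """Derivative of max(x, 0): 1 where x > 0, else 0."""
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu_slopes = d_relu(x)
max_slopes = d_max_dx(x, -1.0)
print(relu_slopes)
print(max_slopes)
```

At the tie point (here $x = 0$ for the single-variable case) the function has a kink; this sketch follows the formula above and returns 0 there.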
The **derivative of chained functions** equals the product of the partial derivatives of the subsequent functions:
$$
\frac{d}{dx} f(g(x)) = \frac{d}{dg(x)}f(g(x))\times \frac{d}{dx}g(x) = f'(g(x))\times g'(x)
$$
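A quick numerical check of the chain rule, as a sketch under assumptions: the inner function reuses $g(x) = 2x^2$ from earlier, while the outer function $f(u) = 3u^5$ and the evaluation point are arbitrary choices made here for illustration.

```python
def g(x):
    return 2 * x ** 2        # inner function, g'(x) = 4x

def f(u):
    return 3 * u ** 5        # outer function, f'(u) = 15u^4

def chain_rule_derivative(x):
    # f'(g(x)) * g'(x)
    return (15 * g(x) ** 4) * (4 * x)

def numerical_derivative(func, x, delta=1e-6):
    # Central difference on the composed function
    return (func(x + delta) - func(x - delta)) / (2 * delta)

x = 1.5
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(lambda v: f(g(v)), x)
print(analytic, numeric)
```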
The **same applies to the partial derivatives**. For example:

$$
\frac{\partial}{\partial x}f(g(y,h(x,z))) = f'(g(y,h(x,z)))\times g'(y,h(x,z))\times h'(x,z)
$$
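One way to make this concrete is to pick specific functions for the nesting. The choices below are assumptions made for illustration only: $h(x, z) = x \cdot z$, $g(y, h) = y + h$ and $f(u) = \max(u, 0)$, which together resemble a single input passing through a weight, a bias and a ReLU.

```python
def h(x, z):
    return x * z             # dh/dx = z

def g(y, h_value):
    return y + h_value       # dg/dh = 1

def f(u):
    return max(u, 0.0)       # f'(u) = 1 if u > 0 else 0

def d_dx(x, y, z):
    """Chain rule: f'(g) * dg/dh * dh/dx."""
    u = g(y, h(x, z))
    df_du = 1.0 if u > 0 else 0.0
    dg_dh = 1.0
    dh_dx = z
    return df_du * dg_dh * dh_dx

def numerical_d_dx(x, y, z, delta=1e-6):
    up = f(g(y, h(x + delta, z)))
    down = f(g(y, h(x - delta, z)))
    return (up - down) / (2 * delta)

active = (1.0, 2.0, 3.0)      # y + x*z = 5 > 0, so the max passes the slope through
inactive = (-1.0, 2.0, 3.0)   # y + x*z = -1 < 0, so the max blocks it
print(d_dx(*active), numerical_d_dx(*active))
print(d_dx(*inactive), numerical_d_dx(*inactive))
```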
The **gradient is a vector of all possible partial derivatives**. An example of a triple-input function:
$$
\nabla f =
\begin{bmatrix}
\frac{\partial}{\partial x}f(x,y,z) \\
\frac{\partial}{\partial y}f(x,y,z) \\
\frac{\partial}{\partial z}f(x,y,z)
\end{bmatrix} 
=
\begin{bmatrix}
\frac{\partial}{\partial x} \\
\frac{\partial}{\partial y} \\
\frac{\partial}{\partial z}
\end{bmatrix}
f(x,y,z)
$$
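As a sketch of how the rules combine into a gradient, the example below uses a hypothetical three-input function, $f(x, y, z) = x \cdot y + \max(z, 0)$, chosen here only because it exercises the sum, multiplication and max rules at once. The analytic gradient is then compared against central differences.

```python
import numpy as np

def f(x, y, z):
    # Hypothetical function combining the product, sum and max rules
    return x * y + max(z, 0.0)

def gradient_f(x, y, z):
    """Analytic gradient [df/dx, df/dy, df/dz] built from the rules above."""
    return np.array([y, x, 1.0 if z > 0 else 0.0])

point = np.array([2.0, -1.0, 0.5])
analytic = gradient_f(*point)

# Numerical check: nudge one coordinate at a time
delta = 1e-6
basis = np.eye(3)
numeric = np.array([
    (f(*(point + delta * basis[i])) - f(*(point - delta * basis[i]))) / (2 * delta)
    for i in range(3)
])
print(analytic)
print(numeric)
```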

##### What is the purpose of all the functions summarised above?
Now that we have an idea of how to measure the impact of variables on a function's output, we can begin to write the code that calculates these partial derivatives and see their role in minimising the model's loss.