<a href="https://colab.research.google.com/github/MJMortensonWarwick/ADA2425/blob/main/6_X_fundamental_math_for_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamental Mathematics for Deep Learning

Here we will cover (in as gentle a fashion as possible) some of the mathematical concepts underpinning deep learning. (We'll also sneak in a bit of an intro to some concepts in PyTorch, which is the library we will use in the module.) Now I'm sure that has whetted your appetite enough, so let's begin!

## Scalars, Vectors, Matrices and Tensors
Although we've actually seen some of these concepts already in a programming sense, it's worth going over some of them from a more mathematical perspective (and also some of the operations we can apply to them). Let's start with the simplest of these, a scalar:

In [1]:
# import the packages we need in this tutorial
import torch
import numpy as np
import sympy as sym

# Example scalar in PyTorch
my_scalar = torch.tensor(13)
my_scalar

tensor(13)

Really simple, a scalar is just a single numerical value we may want to use in our calculations or reporting. It can be an integer or a float or any other numerical type.

A matrix (plural matrices), is a little more nuanced (but not a lot):

In [2]:
# Example matrix in PyTorch
basic_matrix = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
basic_matrix

tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])

With our programmer's hat on we may say we have built a list of 2x lists (with the square brackets). However, by virtue of the fact each list is of the same length, we have effectively built a two-dimensional table (as we would build a DataFrame) which in this case has two rows and four columns. Often you would see this described as an $M$x$N$ matrix (where $M$ is the number of rows and $N$ the number of columns). Confusingly you'll sometimes see it described as an $N$x$M$ matrix where $N$ is rows and $M$ columns ... TL;DR it's always rows by columns.

Let's look at a special case of a matrix ... a vector:

In [3]:
# Example vector in PyTorch (a vector is a special case of a matrix)
eg_vector = torch.tensor([1, 2, 3, 4])  # or torch.tensor([[1], [2], [3], [4]]) for a column vector
eg_vector

tensor([1, 2, 3, 4])

Effectively a vector is an $M$x$1$ matrix (a single column). Again, this is slightly confusing when programmed, as we code it horizontally although conceptually we would consider it as a vertical slice.

Our final type for this Notebook is the tensor ... which has been popularised (in ML) by deep learning and tools like TensorFlow. Actually everything we have produced so far is a tensor, as we can see in the outputs:

In [4]:
print(my_scalar)
print(basic_matrix)
print(eg_vector)

tensor(13)
tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])
tensor([1, 2, 3, 4])


As we can see a tensor acts as basically a container for our other numeric types. We can see "my_scalar" returns a tensor with an empty shape (basically our single value - 13); "basic_matrix" contains our matrix of shape (2, 4) (2x rows, 4x columns); and "eg_vector" contains the vector of shape (4,), i.e. a single dimension holding four values (which conceptually we treat as 4x rows, 1x column).

We often refer to these different shapes as rank-$n$ tensors, such that:
* A rank-0 tensor stores a scalar (e.g. "my_scalar")
* A rank-1 tensor stores a vector
* A rank-2 tensor stores a 2D matrix (e.g, a standard DataFrame)
* A rank-3 tensor has three dimensions (e.g. a digital image stored in RGB format - rows, columns and a colour dimensions)
* A rank-4 tensor adds a fourth dimension - e.g. a batch of rank-3 images.
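
As a quick check of the ranks listed above, the sketch below inspects the `ndim` (rank) and `shape` of each tensor. The `image_batch` tensor is a made-up example (ten 28x28 RGB images filled with zeros) and is not data used elsewhere in this Notebook.

```python
import torch

my_scalar = torch.tensor(13)
eg_vector = torch.tensor([1, 2, 3, 4])
basic_matrix = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
# Hypothetical rank-4 tensor: a batch of 10 images, each 28 rows x 28 columns x 3 colours
image_batch = torch.zeros(10, 28, 28, 3)

# ndim gives the rank; shape gives the size of each dimension
for name, t in [("my_scalar", my_scalar), ("eg_vector", eg_vector),
                ("basic_matrix", basic_matrix), ("image_batch", image_batch)]:
    print(f"{name}: rank {t.ndim}, shape {tuple(t.shape)}")
```

Note that the vector reports a shape with a single entry, `(4,)`, because a rank-1 tensor has only one dimension.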

So what do we gain from putting our objects in tensors? We could go into a long discussion into what a tensor really is, from a mathematical and/or physics sense, but in practice we just care about two things:
1. Tensors are slightly more efficient, and when we work at scale (and deep learning loves big datasets) small efficiencies can make a big difference. In particular, compared to something like _numpy_ arrays, tensors can be used more easily with GPUs;
2. Tensors can be connected into a larger system and can change their values when other values in the system change. In deep learning, this means keeping track of gradients and computational graphs (which we'll discuss in the module).

## Matrix Algebra
Matrix algebra is a big topic, and we don't need to go too far down the rabbit hole. However, there are some key topics that underpin a lot of deep learning (and ML for that matter). While these are more a backend operation than a frontend one (i.e. you don't typically need to do the calculations yourself), knowing them helps us understand how these algorithms work.

Our first topic will be multiplying matrices. There are two main ways we can do this - _element\-wise_ and _matrix_ multiplication.<br><br>


### Element-wise Multiplication
Element-wise is probably the more obvious. It depends on both matrices being of the same size. Let's look at some examples (using the `*` operator, which is equivalent to _torch.mul_):

In [5]:
# Element-wise multiplication in PyTorch
matrix_one = torch.tensor([1, 2, 3, 4])
matrix_two = torch.tensor([5, 6, 7, 8])
ew_matrix = matrix_one * matrix_two  # Element-wise multiplication in PyTorch
print(f"First matrix: {ew_matrix}")
print("\n") # blank line

matrix_three = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
matrix_four = torch.tensor([[8, 7, 6, 5], [4, 3, 2, 1]])
ew_matrix_two = matrix_three * matrix_four
print(f"Second matrix: {ew_matrix_two}")

First matrix: tensor([ 5, 12, 21, 32])


Second matrix: tensor([[ 8, 14, 18, 20],
        [20, 18, 14,  8]])


As we can see, we effectively loop over each list and multiply each item with the item in the same position in the other list. So the 1st item of the 1st list (1) is multiplied with the 1st item of the 2nd list (5) and this produces the first item of the output (5). In the two row version we effectively multiply the top left item of the first matrix with the top left item of the second matrix, and so on.
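
To make the "loop over each list" idea concrete, here is a minimal sketch in plain Python (no PyTorch). The helper name `elementwise_multiply` is made up for illustration; PyTorch does the same job far more efficiently.

```python
def elementwise_multiply(a, b):
    """Multiply two equal-length lists item by item."""
    if len(a) != len(b):
        raise ValueError("both lists must be the same length")
    return [x * y for x, y in zip(a, b)]

# One-row version
row_result = elementwise_multiply([1, 2, 3, 4], [5, 6, 7, 8])
print(row_result)  # [5, 12, 21, 32]

# Two-row version: pair up the rows, then pair up the items within each row
matrix_three = [[1, 2, 3, 4], [5, 6, 7, 8]]
matrix_four = [[8, 7, 6, 5], [4, 3, 2, 1]]
matrix_result = [elementwise_multiply(r1, r2) for r1, r2 in zip(matrix_three, matrix_four)]
print(matrix_result)  # [[8, 14, 18, 20], [20, 18, 14, 8]]
```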

### Matrix by Vector Multiplication
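Although the two are different sizes, multiplying a matrix by a vector with `*` is still element-wise. PyTorch reuses ("broadcasts") the vector for each row of the matrix, so the length of the vector must match the number of columns in the matrix. The size of the matrix is the size of the output: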

In [6]:
# Matrix by vector multiplication
vector_one = torch.tensor([1, 2, 3, 4])
matrix_four = torch.tensor([[8, 7, 6, 5], [4, 3, 2, 1]])
ew_matrix_three = vector_one * matrix_four # Element-wise multiplication
print(f"Third matrix: {ew_matrix_three}")

Third matrix: tensor([[ 8, 14, 18, 20],
        [ 4,  6,  6,  4]])
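
The sketch below reproduces this result in plain Python, to show that the same vector is simply reused for every row of the matrix. It is an illustration of the idea only, not how PyTorch implements it internally.

```python
vector_one = [1, 2, 3, 4]
matrix_four = [[8, 7, 6, 5], [4, 3, 2, 1]]

# Reuse the same vector for every row of the matrix
broadcast_result = [[v * m for v, m in zip(vector_one, row)] for row in matrix_four]
print(broadcast_result)  # [[8, 14, 18, 20], [4, 6, 6, 4]]
```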


### Matrix by Scalar Multiplication
Similarly, multiplying by a scalar is element-wise:

In [8]:
# Matrix by scalar multiplication
another_scalar = 10
matrix_four = torch.tensor([[8, 7, 6, 5], [4, 3, 2, 1]])
ew_matrix_four = another_scalar * matrix_four
ew_matrix_four

tensor([[80, 70, 60, 50],
        [40, 30, 20, 10]])

### Matrix Multiplication
Matrix multiplication is more flexible than element-wise in that it doesn't require the matrices to be of the same size. It does, however, require the number of columns in the first matrix to equal the number of rows in the second, and it is slightly less obvious how it works. Again, we'll look at a couple of examples (using _torch.matmul_ ... as in **mat**rix **mul**tiplication):

In [9]:
# Matrix Multiplication
matrix_five = torch.tensor([[1, 2], [3, 4], [5, 6]])
matrix_six = torch.tensor([[100], [200]])
matmul_matrix = torch.matmul(matrix_five, matrix_six)  # or matrix_five @ matrix_six
matmul_matrix

tensor([[ 500],
        [1100],
        [1700]])

This may be a little less obvious so let's go through the math. Our output is a vector with 3x rows, so let's see the maths of each row:
* $1 \times 100 + 2 \times 200 = 100 + 400 = 500$
* $3 \times 100 + 4 \times 200 = 300 + 800 = 1100$
* $5 \times 100 + 6 \times 200 = 500 + 1200 = 1700$

In [11]:
matrix_seven = torch.tensor([[1, 2, 3] , [4, 5, 6]])
matrix_eight = torch.tensor([[100, 200], [300, 400], [500, 600]])
matmul_matrix_two = torch.matmul(matrix_seven, matrix_eight)
matmul_matrix_two

tensor([[2200, 2800],
        [4900, 6400]])

Effectively here we have a $2\times3$ matrix and a $3\times2$ matrix. When we multiply the two together we effectively take the first row of the first matrix and multiply it by the first column of the second to form the first value in our output; then the first row of the first matrix by the second column of the second to form the second output, and so on. Again, in the form of our $2\times2$ output, the maths is:<br><br>
$1\times100 + 2\times300 + 3\times500 = 100 + 600 + 1500 = 2200$
<br>
$1\times200 + 2\times400 + 3\times600 = 200 + 800 + 1800 = 2800$
<br>
$4\times100 + 5\times300 + 6\times500 = 400 + 1500 + 3000 = 4900$
<br>
$4\times200 + 5\times400 + 6\times600 = 800 + 2000 + 3600 = 6400$
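
The row-by-column procedure can be written out as a short plain-Python sketch. The function name `matmul` below is our own illustrative helper (not _torch.matmul_), working on lists of rows:

```python
def matmul(a, b):
    """Multiply an (m x n) matrix by an (n x p) matrix, both given as lists of rows."""
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("columns of the first matrix must equal rows of the second")
    columns_of_b = list(zip(*b))  # turn the second matrix into a list of columns
    # Each output value is one row of a multiplied by one column of b, then summed
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b] for row in a]

first_product = matmul([[1, 2], [3, 4], [5, 6]], [[100], [200]])
print(first_product)   # [[500], [1100], [1700]]

second_product = matmul([[1, 2, 3], [4, 5, 6]], [[100, 200], [300, 400], [500, 600]])
print(second_product)  # [[2200, 2800], [4900, 6400]]
```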

### Vector Dot Product
The dot product of two vectors is an equivalent calculation to matrix multiplication via _matmul_. However, given that we are working with vectors we end up with a single number (a scalar). For example:

In [12]:
vector_one = torch.tensor([1, 2, 3, 4])
vector_two = torch.tensor([8, 7, 6, 5])

vector_dot_product = torch.dot(vector_one, vector_two)
vector_dot_product

tensor(60)

Let's check the math again:<br><br>
$ 1 \times 8 + 2 \times 7 + 3 \times 6 + 4 \times 5 = 8 + 14 + 18 + 20 = 60$
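
As a sketch of why the dot product is "equivalent" to matrix multiplication, the plain-Python example below computes it directly and then again as a $1\times4$ row multiplied by a $4\times1$ column. The helper name `dot` is illustrative only.

```python
def dot(u, v):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("both vectors must be the same length")
    return sum(x * y for x, y in zip(u, v))

vector_one = [1, 2, 3, 4]
vector_two = [8, 7, 6, 5]
dot_result = dot(vector_one, vector_two)
print(dot_result)  # 60

# The same calculation as a (1 x 4) row multiplied by a (4 x 1) column
row = [vector_one]
column = [[value] for value in vector_two]
product = [[sum(r[k] * column[k][0] for k in range(len(r)))] for r in row]
print(product)  # [[60]]
```

The only difference is the packaging: the matrix version returns a $1\times1$ matrix, whereas the dot product returns the bare scalar.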

### Matrix Addition and Reduce-Sum
Matrix addition works as you may expect, but does rely on equal size matrices. Let's see an example again:

In [14]:
# Matrix Addition
matrix_nine = torch.tensor([1, 2, 3])
matrix_ten = torch.tensor([6, 5, 4])
matrix_addition = matrix_nine + matrix_ten # Direct addition in PyTorch
matrix_addition

tensor([7, 7, 7])

Ultimately this is just an element-wise addition. E.g.
<br><br>
$ 1 + 6 = 7$
<br>
$ 2 + 5 = 7$
<br>
$ 3 + 4 = 7$

Another related concept is _sum_, which is related to a MapReduce approach (later in the term). Nothing like an example amiright?

In [15]:
matrix_eleven = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_sum = torch.sum(matrix_eleven)
matrix_sum

tensor(45)

Essentially the function has reduced our matrix to a single value - by simply summing up all of the nine values.

We can also sum by columns or rows:

In [16]:
# Matrix Sum
matrix_eleven = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

matrix_sr_cols = torch.sum(matrix_eleven, dim=0) # Sum along columns
print(matrix_sr_cols)

matrix_sr_rows = torch.sum(matrix_eleven, dim=1) # Sum along rows
matrix_sr_rows

tensor([12, 15, 18])


tensor([ 6, 15, 24])

Now we have an output of _shape=(3,)_ in each case. In the first ("matrix_sr_cols") we sum down each column, so the calculation is:
<br><br>
$1+4+7=12 $
<br>
$2+5+8=15 $
<br>
$3+6+9=18 $
<br><br>

In the case of "matrix_sr_rows", we do each row (each list in the list of lists):
<br><br>
$1+2+3=6 $
<br>
$4+5+6=15 $
<br>
$7+8+9=24 $
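
The three kinds of sum can be sketched in plain Python as follows (an illustration only; the variable names are our own):

```python
matrix_eleven = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Reduce everything to a single value
total = sum(sum(row) for row in matrix_eleven)
print(total)        # 45

# zip(*matrix) regroups the values by column, like dim=0 in torch.sum
column_sums = [sum(col) for col in zip(*matrix_eleven)]
print(column_sums)  # [12, 15, 18]

# Summing each inner list gives the row totals, like dim=1 in torch.sum
row_sums = [sum(row) for row in matrix_eleven]
print(row_sums)     # [6, 15, 24]
```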

## Identity Matrices and Diagonal Matrices
An identity matrix is a matrix that, when multiplied by another matrix, returns that matrix unchanged. It's the equivalent of multiplying by 1 if we are dealing with scalars/single values. I.e. $x \times 1 = x$ irrespective of what value $x$ takes. In practice this means a square matrix filled with zeros except for ones on the diagonal. As an example:

In [17]:
# Identity Matrix (using PyTorch)
identity_matrix = torch.eye(3, dtype=torch.int32)
identity_matrix

tensor([[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]], dtype=torch.int32)

Let's confirm this is indeed an identity matrix:

In [21]:
# Identity Matrix
test_matrix = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=torch.int32)
output_matrix = torch.matmul(test_matrix, identity_matrix)
output_matrix

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]], dtype=torch.int32)

An identity matrix is a special case of diagonal matrices, which come up regularly in other settings as well. A diagonal matrix is any matrix that is all zeros except for on its diagonal (in an identity matrix recall the diagonal is filled with ones). As an example:

In [22]:
# Example diagonal matrix in PyTorch
diagonal = torch.tensor([1, 2, 3, 4])
diagonal_matrix = torch.diag(diagonal)
diagonal_matrix

tensor([[1, 0, 0, 0],
        [0, 2, 0, 0],
        [0, 0, 3, 0],
        [0, 0, 0, 4]])

We can calculate the sum of the diagonal using PyTorch's _trace_ function:

In [23]:
# Example diagonal matrix in PyTorch (trace)
diagonal_matrix_trace = torch.trace(diagonal_matrix)
diagonal_matrix_trace

tensor(10)

Quick math check:
<br><br>
$1+2+3+4=10$

## Inverse Matrices
The inverse of a matrix $a$ is a second matrix $b$ such that multiplying $a$ by $b$ (in either order) results in an identity matrix. Only square matrices can have an inverse, and not every square matrix has one. Let's see this in action:

In [24]:
# Example matrices
a_matrix = torch.tensor([[1, 2, 1], [4, 4, 5], [6, 7, 7]], dtype=torch.float32)
b_matrix = torch.tensor([[-7, -7, 6], [2, 1, -1], [4, 5, -4]], dtype=torch.float32)

# Calculate matrix products
inverse_it = torch.matmul(a_matrix, b_matrix)
inverse_it_again = torch.matmul(b_matrix, a_matrix)

print("A matrix")
print(a_matrix)
print("\n")
print("B matrix")
print(b_matrix)
print("\n")
print("Inverse a -> b")
print(inverse_it)
print("\n")
print("Inverse b -> a")
print(inverse_it_again)
print("\n")

# Calculating the inverse of a matrix using torch.inverse()
a_inverse = torch.inverse(a_matrix)
print("Inverse of A:")
print(a_inverse)
print("\n")

# Verify the inverse by multiplying the matrix by its inverse
# (a new variable name avoids overwriting identity_matrix from the earlier cell)
identity_check = torch.matmul(a_matrix, a_inverse)
print("A * A_inverse (should be close to identity matrix):")
identity_check

A matrix
tensor([[1., 2., 1.],
        [4., 4., 5.],
        [6., 7., 7.]])


B matrix
tensor([[-7., -7.,  6.],
        [ 2.,  1., -1.],
        [ 4.,  5., -4.]])


Inverse a -> b
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])


Inverse b -> a
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])


Inverse of A:
tensor([[-7.0000, -7.0000,  6.0000],
        [ 2.0000,  1.0000, -1.0000],
        [ 4.0000,  5.0000, -4.0000]])


A * A_inverse (should be close to identity matrix):


tensor([[ 1.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  1.0000e+00, -1.9073e-06],
        [ 0.0000e+00,  0.0000e+00,  1.0000e+00]])
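
The tiny value in the last output (-1.9073e-06 rather than exactly 0) is floating-point rounding error, not a mistake in the inverse. A minimal sketch of how we might check "close to identity" is to compare within a tolerance using _torch.allclose_. The tolerance `atol=1e-4` is an assumed, illustrative choice, and _torch.linalg.inv_ is used here as the same operation as _torch.inverse_.

```python
import torch

a_matrix = torch.tensor([[1, 2, 1], [4, 4, 5], [6, 7, 7]], dtype=torch.float32)
a_inverse = torch.linalg.inv(a_matrix)
product = a_matrix @ a_inverse

# Exact equality can fail because of rounding, so compare within a tolerance
# (atol=1e-4 is an assumed, illustrative tolerance)
is_identity = torch.allclose(product, torch.eye(3), atol=1e-4)
print(is_identity)
```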

## Transpose and Orthogonal Matrices
We are familiar with the idea of transposing from our work with _pandas_ DataFrames. Basically a transpose flips a matrix so the columns become rows and vice versa. Let's visualise this:

In [25]:
# Example matrix
pre_transpose_matrix = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Original")
print(pre_transpose_matrix)

# Transpose the matrix
transpose_matrix = torch.transpose(pre_transpose_matrix, 0, 1) # or pre_transpose_matrix.T
print("\n")
print("Transposed!")
transpose_matrix

Original
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])


Transposed!


tensor([[1, 4, 7],
        [2, 5, 8],
        [3, 6, 9]])

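A symmetric matrix is any matrix which remains the same when transposed (i.e. $A = A^T$). This should not be confused with an orthogonal matrix, which is a matrix whose transpose is equal to its inverse (i.e. $Q^T \times Q$ gives an identity matrix). As an example of a symmetric matrix: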

In [26]:
# Example matrix
pre_transpose_matrix_two = torch.tensor([[1, 2, 3], [2, 0, 2], [3, 2, 1]])
print("Original")
print(pre_transpose_matrix_two)

# Transpose the matrix
transpose_matrix_two = torch.transpose(pre_transpose_matrix_two, 0, 1) # or pre_transpose_matrix_two.T
print("\n")
print("Transposed!")
transpose_matrix_two

Original
tensor([[1, 2, 3],
        [2, 0, 2],
        [3, 2, 1]])


Transposed!


tensor([[1, 2, 3],
        [2, 0, 2],
        [3, 2, 1]])
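
As a sketch of the two properties side by side: a symmetric matrix equals its own transpose, whereas an orthogonal matrix is one whose transpose acts as its inverse. The plain-Python example below checks both. The 90-degree rotation matrix is our own illustrative choice of orthogonal matrix, and the helper names are made up for this example.

```python
def transpose(m):
    """Flip rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Row-by-column matrix multiplication on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Symmetric: unchanged by transposing
symmetric = [[1, 2, 3], [2, 0, 2], [3, 2, 1]]
print(transpose(symmetric) == symmetric)  # True

# Orthogonal: a 90-degree rotation matrix (illustrative choice)
rotation = [[0, -1], [1, 0]]
print(transpose(rotation) == rotation)    # False, so it is not symmetric
print(matmul(transpose(rotation), rotation))  # [[1, 0], [0, 1]], the identity matrix
```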

## Argmax Operations
Argmax means to find the input (the _argument_) that produces the maximum value in a set of potential values, rather than the maximum value itself. This could be applied to the posterior distribution of a Bayesian function or the output of a function(s) applied to a specific dataset. E.g. we may have a function $y = 10x^2 - 2x^3$ (where $x$ is a positive integer). The value of $y$ increases up to $x = 3$ (where $y = 36$), but decreases from $x = 4$ onwards (check it out in Excel!). Therefore, argmax will tell us that we achieve the maximum value of $y$ when $x=3$.

In terms of vectors and matrices, we can use argmax to find the location of the maximum value:

In [27]:
print("Argmax of a vector")
another_vector = torch.tensor([4, 12, 42, 5])
vector_arg_max = torch.argmax(another_vector)
print(vector_arg_max)
print("\n")

print("Argmax of a matrix")
another_matrix = torch.tensor([[1, 12, 5], [6, 5, 42]])
matrix_arg_max = torch.argmax(another_matrix, dim=0)
print(matrix_arg_max)
print("\n")


Argmax of a vector
tensor(2)


Argmax of a matrix
tensor([1, 0, 1])




In the first case ("vector_arg_max") our algorithm finds the maximum value as 42 and returns the index of this item (remember we count from 0 ... so therefore it is 2).

In the second case we set _dim=0_, so within each column we compare the two rows and return the row index of the higher value (0 if the top row and 1 if the bottom). In other words, our output is ["bottom row", "top row", "bottom row"].
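
The idea of argmax can be sketched in plain Python. The example below evaluates $y = 10x^2 - 2x^3$ over the positive integers 1 to 10 (an assumed range for illustration) and then repeats the vector and matrix examples without PyTorch.

```python
def y(x):
    return 10 * x**2 - 2 * x**3

# Argmax of a function: which x gives the largest y?
candidates = range(1, 11)
best_x = max(candidates, key=y)
print(best_x, y(best_x))  # 3 36

# Argmax of a vector: the index of the largest value
another_vector = [4, 12, 42, 5]
vector_arg_max = max(range(len(another_vector)), key=lambda i: another_vector[i])
print(vector_arg_max)  # 2

# Argmax down each column of a matrix (like dim=0 in torch.argmax)
another_matrix = [[1, 12, 5], [6, 5, 42]]
column_arg_max = [max(range(len(col)), key=lambda i: col[i]) for col in zip(*another_matrix)]
print(column_arg_max)  # [1, 0, 1]
```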

## Derivatives
The essence of a derivative (a key concept of differential calculus) is to evaluate a function at some given point, and calculate the current rate of change. In a purely linear function the rate of change will be the same at any point ... i.e. in the function $y = \beta x$ the rate of change associated with $x$ is $\beta$ at any point. However, in a non-linear function we need to work a bit harder to get this rate of change.

More formally we want to know the change in $y$ (written as $\Delta y$) as a ratio to the change in $x$ (again ... $\Delta x$), as that change in $x$ becomes very small. Fortunately there are lots of rules (it is mathematics after all) to calculate this.


### Power Rule
One key concept you have likely seen in a calculus class somewhere is the _power rule_ for calculating the derivative of a function applied to a single variable (e.g. $x$). The power rule states:<br><br>
$f(x) = x^n \rightarrow f'(x) = nx^{n-1}$
<br>(i.e. we calculate the derivative ($f'$) of the function ($x^n$) by calculating $ nx^{n-1}$).

Let's see an example using the Python package _sympy_ for pretty outputs (inspired by Dario Radečić's [Medium post](https://towardsdatascience.com/taking-derivatives-in-python-d6229ba72c64)):

In [28]:
x = sym.Symbol('x')

# differentiate a function that is x^n and n=4 ... i.e. the function is x^4
sym.diff(x**4)

4*x**3

Python tells us the answer is $4x^3$ but let's be sure by doing the math:
* $ f(x) = x^4 $
* $ f'(x) = 4x^{4-1} = 4x^{3}$

Well done Python. Sorry I doubted you.
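
We can also connect the power rule back to the idea of a ratio of changes. The sketch below estimates $\Delta y / \Delta x$ numerically for a very small $\Delta x$ and compares it with $4x^3$. The evaluation point `x_value = 2.0` and step size `h = 1e-5` are arbitrary illustrative choices.

```python
def f(x):
    return x**4

def numerical_derivative(func, x, h=1e-5):
    """Central-difference estimate of the rate of change of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

x_value = 2.0
estimate = numerical_derivative(f, x_value)  # change in y divided by change in x
power_rule = 4 * x_value**3                  # 4x^3 from the power rule = 32.0
print(estimate, power_rule)
```

The two numbers agree to several decimal places; the small remaining gap comes from using a finite step rather than an infinitely small one.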

### The Product Rule
The product rule applies when we want to calculate the derivative of the product (multiplication) of two functions. For example we may have the following function:<br><br>
$ F(x) = f(x) \times g(x) $
<br><br>
In such cases we can use the product rule defined as:<br><br>
$ F'(x) = f(x) \times g'(x) + f'(x) \times g(x) $<br><br>
I.e. we multiply each function with the derivative of the other and add these together.

We can see it in action. Given:<br><br>
$ F(x) = f(x) \times g(x) $<br>
$ f(x) = x^3 $<br>
$ g(x) = x^5 $
<br><br>
We can calculate the derivative of each as above:<br><br>
$ f'(x) = 3x^{3-1} = 3x^{2}$ <br>
$ g'(x) = 5x^{5-1} = 5x^{4} $
<br><br>
We then need to multiply each together and add them:<br><br>
$ F'(x) = f(x) \times g'(x) + f'(x) \times g(x) $<br>
$ F'(x) = x^3 \times 5x^{4} + 3x^{2} \times x^5  = 5x^7 + 3x^7 = 8x^7$
<br><br>
Let's verify this in sympy:



In [29]:
sym.diff(x**3 * x**5)

8*x**7
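
As a further check, the sketch below builds the product rule explicitly in sympy (rather than differentiating the product in one go) and compares the two results. The variable names are our own.

```python
import sympy as sym

x = sym.Symbol('x')
f = x**3
g = x**5

# Product rule: f(x) * g'(x) + f'(x) * g(x)
by_product_rule = f * sym.diff(g, x) + sym.diff(f, x) * g
# Differentiating the product directly
direct = sym.diff(f * g, x)

print(sym.simplify(by_product_rule))  # 8*x**7
print(direct)                         # 8*x**7
```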

### The Chain Rule
So far we've seen single functions (via the power rule) and multiplicative functions (via the product rule) ... now we will look at functions inside functions (i.e. nested functions) via the _chain rule_. The chain rule gets super-relevant to deep learning as ultimately the very nature of having multiple hidden layers means we have functions inside functions.

Consider the following function:<br><br>
$ F(x) = (x^3 - 2x + 4)^3 $
<br><br>
We have ($x^3 - 2x + 4)$ as an inner function and an outer function that raises the inner function to the power 3. The chain rule says:<br><br>
$ F(x) = f(g(x)) \rightarrow F'(x) = f'(g(x)) \times g'(x)$
<br><br>
In other words we reach the overall derivative by taking the derivative of the outer function, evaluated at the inner function, and multiplying it by the derivative of the inner function. It is almost certainly clearer if we look at the math on our earlier example function.<br><br>
$ F(x) = (x^3 - 2x + 4)^3 $<br><br>
$ F'(x) = 3(x^3 - 2x + 4)^{3-1} \times (3x^2 - 2x^{1-1})$<br>
_A note on the calculation of the inner function here. We can consider $2x$ as effectively $2x^1$ in terms of doing our power rule calculations. This means we end at the power $1-1 = 0$, and since $x^0 = 1$ the $x$ drops out. Constants_ ($+4$) _also disappear when calculating a derivative, as their rate of change is zero. Note over, let's do some simplifications_ <br><br>
$F'(x) = 3(x^3 - 2x + 4)^{2} \times (3x^2 - 2)$
<br><br>
$F'(x) = (9x^2 -6) \times (x^3 - 2x + 4)^{2} $
<br><br>

In [30]:
sym.diff((x**3 - 2 * x + 4)**3)

(9*x**2 - 6)*(x**3 - 2*x + 4)**2
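
To see the two pieces of the chain rule separately, the sketch below differentiates the outer and inner functions on their own in sympy and multiplies them together. The extra symbol `u` is our own placeholder standing in for the inner function.

```python
import sympy as sym

x, u = sym.symbols('x u')
inner = x**3 - 2*x + 4   # g(x)
outer = u**3             # f(u), where u stands in for the inner function

# f'(g(x)): differentiate the outer function, then substitute the inner function back in
outer_derivative = sym.diff(outer, u).subs(u, inner)
# g'(x): differentiate the inner function
inner_derivative = sym.diff(inner, x)
by_chain_rule = outer_derivative * inner_derivative

# Differentiating the nested function directly should give the same result
direct = sym.diff(inner**3, x)
print(sym.simplify(by_chain_rule - direct))  # 0
```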

### Partial Derivatives
The chain rule gets us a big chunk of the way towards what we need in terms of using derivatives to understand parameter optimisation in deep learning (which will be explained in the module don't worry), except for one thing. Everything we've looked at so far has looked at changes in $y$ with respect to a single $x$. We theoretically could have just one feature, but in ML/AI practice that is extremely unlikely. If we have multiple features ($x$'s), which we will, we need _partial derivatives_.

Partial derivatives deal with multi-variable functions by applying the usual rules we've just seen to a single variable in the function and effectively freezing the others (keeping them constant). We can follow this process for each of the variables in the function.

Again let's make up a function to work with:<br><br>
$ f(x_{1}, x_{2}, x_{3}) = {x_{1}}^2 \times x_{2} \times {x_{3}}^4 $
<br><br>
As above, the derivatives we seek will be partial ... i.e. we will find the derivative with respect to $x_1$ independently of the other $x$'s. In our earlier discussion we were finding the ratio between $\Delta y$ and $\Delta x$. Given this is a subset of the overall problem we use the lower case version of delta to show the partial ... so the notation would be $\delta_{x_{1}}$ (double subscripts is a bit yucky in LaTeX ... sorry). You will more commonly see this written with the "curly d" symbol, as $\frac{\partial f}{\partial x_{1}}$. Let's look at the partial derivative with respect to $x_{3}$:
<br><br>
$ \delta_{x_{3}} = 4 \times ({x_{1}}^2 \times x_{2} \times {x_{3}}^3)$
<br><br>
Essentially we are doing normal power rule stuff here. ${x_{3}}^4$ becomes $ 4 \times {x_{3}}^3 $. The only difference is we keep the rest of the formula in and as-is. For completeness, we can write out all three partial derivatives (but without discussion of the calculations - it's the same as we've already seen):<br><br>
$ f(x_{1}, x_{2}, x_{3}) = {x_{1}}^2 \times x_{2} \times {x_{3}}^4 $
<br>
$ \delta_{x_{1}} = 2 \times x_{1} \times x_{2} \times {x_{3}}^4 $
<br>
$ \delta_{x_{2}} = {x_{1}}^2 \times {x_{3}}^4$
<br>
$ \delta_{x_{3}} = 4 \times ({x_{1}}^2 \times x_{2} \times {x_{3}}^3)$
<br><br>
Let's verify in sympy, and to make things easier we'll also write out the function rather than typing it each time:


In [31]:
x1, x2, x3 = sym.symbols('x1 x2 x3')
f = x1**2 * x2 * x3**4

print("Delta for x1")
print(sym.diff(f, x1))
print("\n")

print("Delta for x2")
print(sym.diff(f, x2))
print("\n")

print("Delta for x3")
print(sym.diff(f, x3))

Delta for x1
2*x1*x2*x3**4


Delta for x2
x1**2*x3**4


Delta for x3
4*x1**2*x2*x3**3


Everything checks out! And the good news is now the math is over - we will touch on how these concepts fit into the deep learning process in class ... but these aren't calculations we'll have to make ourselves. However, it is useful to get some understanding of what's happening under the hood of tools like PyTorch and TensorFlow.
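
As a small taste of those tools doing the calculation for us, the sketch below asks PyTorch's automatic differentiation to compute the same three partial derivatives at one specific point. The input values (2, 3 and 2) are arbitrary illustrative choices; the module will cover how this works properly.

```python
import torch

# requires_grad=True asks PyTorch to track gradients for these values
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
x3 = torch.tensor(2.0, requires_grad=True)

f = x1**2 * x2 * x3**4
f.backward()  # compute the partial derivative of f with respect to each input

print(x1.grad)  # 2 * x1 * x2 * x3**4 = 2 * 2 * 3 * 16 = 192
print(x2.grad)  # x1**2 * x3**4       = 4 * 16         = 64
print(x3.grad)  # 4 * x1**2 * x2 * x3**3 = 4 * 4 * 3 * 8 = 384
```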