Resource Link: https://www.learnpytorch.io/00_pytorch_fundamentals/#exercises

In [None]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print(torch.__version__)

2.1.0+cu121


In [None]:
#!nvidia-smi

# Introduction to Tensors




### Creating Tensors




PyTorch tensors are created using `torch.tensor()`. Docs: https://pytorch.org/docs/stable/tensors.html

---

* Scalar (Rank 0 Tensor): A scalar is a single number. In PyTorch, you create a scalar by passing a single value to torch.tensor(). The code creates a scalar with a value of 7. .ndim checks the number of dimensions (which is 0 for a scalar), .shape returns an empty tuple since scalars have no dimensions, and .item() converts the tensor scalar to a standard Python scalar.

* Vector (Rank 1 Tensor): A vector is a 1-dimensional array of numbers. The code creates a vector [7, 7] using torch.tensor(). Here, .ndim returns 1, indicating it's 1-dimensional, and .shape returns torch.Size([2]), meaning the vector has 2 elements.

* Matrix (Rank 2 Tensor): A matrix is a 2-dimensional array of numbers. The example creates a 2x2 matrix with torch.tensor([[7, 8], [9, 10]]). .ndim shows it's 2-dimensional, and .shape returns torch.Size([2, 2]), indicating 2 rows and 2 columns.

* Higher-Dimensional Tensor (Rank 3+ Tensor): Tensors can have three or more dimensions. The provided code creates a tensor of shape (1, 3, 3): the outermost dimension holds 1 matrix, and that matrix has 3 rows and 3 columns. .ndim confirms it's 3-dimensional.

This code is a basic introduction to the concept of tensors in PyTorch, demonstrating how to create them and inspect their properties like dimensionality and shape. Understanding these properties is essential for working with data in machine learning and deep learning applications.

---

**Naming Conventions**
* Scalar (dimension = 0) - lower-case $a$
* Vector (dimension = 1) - lower-case $y$
* Matrix (dimension = 2) - upper-case $Q$
* Tensor (dimension ≥ 3) - upper-case $X$

In [None]:
# Scalar (tensor rank 0)
scalar = torch.tensor(7) # A tensor can be constructed from a Python list or sequence using the torch.tensor() constructor

print(scalar)
print(scalar.ndim) # number of dimensions of scalar
print(scalar.shape)
print(scalar.item()) # Get tensor back as Python int


tensor(7)
0
torch.Size([])
7


> The .item() method is specific to tensors that have a single element. Its purpose is to extract the value of that single element as a standard Python number (int or float). It's particularly useful when you're working with tensors that result from some computations, and you want to convert the result into a Python scalar.

> Attempting to use .item() on a tensor with more than one element will result in an error because it's not clear which element should be converted to a Python scalar. The method is designed to ensure that it's only used when there's exactly one element in the tensor, thereby avoiding ambiguous situations.
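
A minimal sketch of the behaviour described above, assuming a recent PyTorch release. The names `single`, `multi`, and `raised` are illustrative and not part of the original notebook, and the exact exception type is an assumption, so the sketch catches both likely candidates.

```python
import torch

# A tensor with exactly one element converts cleanly to a Python number.
single = torch.tensor([3.5])      # shape [1], still a single element
value = single.item()             # plain Python float

# A tensor with several elements cannot be converted with .item().
multi = torch.tensor([1, 2])
raised = False
try:
    multi.item()
except (RuntimeError, ValueError):  # exact exception type is an assumption
    raised = True

print(value, type(value), raised)
```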

In [None]:
# Vector (tensor rank 1)
vector = torch.tensor([7, 7])
print(vector)
print(vector.ndim) # number of dimensions of vector
print(vector.shape) # outputs the shape of the vector: torch.Size([2]), ie 1 dimension holding 2 elements

tensor([7, 7])
1
torch.Size([2])


In [None]:
# Matrix (tensor rank 2)
MATRIX = torch.tensor([[7, 8], [9, 10]])
print(MATRIX)
print(MATRIX.ndim) # number of dimensions of MATRIX
print(MATRIX.shape) # outputs the shape of the matrix

tensor([[ 7,  8],
        [ 9, 10]])
2
torch.Size([2, 2])


In [None]:
# Tensor (tensor rank 3+)
TENSOR = torch.tensor([[[1, 2, 3],   # bracket nesting: outermost [ is dim=0, next [ is dim=1, innermost [ is dim=2
                        [4, 5, 6],
                        [6, 7, 8]]])
print(TENSOR)
print(TENSOR.ndim) # number of dimensions of TENSOR

print(TENSOR.shape) # outputs the shape of the tensor
                    # torch.Size([1, 3, 3]) means this tensor holds one 3-by-3 matrix.
                    # torch.Size([dim=0, dim=1, dim=2])
                    # Put differently: dim=0 holds 1 matrix, dim=1 holds 3 row vectors, and dim=2 holds 3 scalars per row.


tensor([[[1, 2, 3],
         [4, 5, 6],
         [6, 7, 8]]])
3
torch.Size([1, 3, 3])


### Random Tensors



**Why random tensors?**

Random tensors are important because many neural networks learn by starting with tensors full of random numbers and then adjusting those numbers to better represent the data.

Workflow:

> `Start with random numbers -> Look at data -> Update random numbers -> Look at data -> Update random numbers`

PyTorch Docs (torch.rand()) url - https://pytorch.org/docs/stable/generated/torch.rand.html


In [None]:
# Create a random tensor of size (4, 3, 4)
random_tensor = torch.rand(4, 3, 4) # equivalent: torch.rand(size=(4, 3, 4)) // both are acceptable
print(random_tensor)

tensor([[[0.2227, 0.3177, 0.4999, 0.3556],
         [0.8992, 0.9434, 0.0417, 0.9339],
         [0.3301, 0.1877, 0.0852, 0.7816]],

        [[0.6055, 0.9440, 0.7531, 0.6954],
         [0.5516, 0.0211, 0.6532, 0.5071],
         [0.6850, 0.4792, 0.9160, 0.8725]],

        [[0.6563, 0.2124, 0.5223, 0.3818],
         [0.7180, 0.8954, 0.1314, 0.2712],
         [0.4022, 0.8065, 0.3183, 0.6993]],

        [[0.2008, 0.2597, 0.3368, 0.9321],
         [0.2766, 0.8006, 0.6173, 0.6912],
         [0.1076, 0.0813, 0.6475, 0.2482]]])


In [None]:
print(random_tensor.ndim) # number of dimensions of random_tensor
print(random_tensor.shape) # outputs the shape of random_tensor

3
torch.Size([4, 3, 4])


In [None]:
# Create a random tensor with a similar shape to an image tensor
random_image_size_tensor = torch.rand(size=(224, 224, 3)) # [height, width, colour_channels]; channels-first would be size=(3, 224, 224) // both describe a 224x224 image
print(random_image_size_tensor.ndim) # number of dimensions of random_image_size_tensor
print(random_image_size_tensor.shape) # outputs shape of random_image_size_tensor

# Random tensors are valuable because a neural network starts with random numbers -> looks at data -> updates the numbers -> looks at data -> ...

3
torch.Size([224, 224, 3])


### Zeros and Ones

In [None]:
# Create a tensor of all zeroes
zeros = torch.zeros(size=(3, 4))
print(zeros)
print(zeros * random_tensor) # multiply the zeros tensor with random_tensor; shape (3, 4) broadcasts against (4, 3, 4)

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])


In [None]:
# Create a tensor of all ones
ones = torch.ones(size=(3, 4))
print(ones)
print(ones * random_tensor) # multiply the ones tensor with random_tensor (values are unchanged)

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
tensor([[[0.2227, 0.3177, 0.4999, 0.3556],
         [0.8992, 0.9434, 0.0417, 0.9339],
         [0.3301, 0.1877, 0.0852, 0.7816]],

        [[0.6055, 0.9440, 0.7531, 0.6954],
         [0.5516, 0.0211, 0.6532, 0.5071],
         [0.6850, 0.4792, 0.9160, 0.8725]],

        [[0.6563, 0.2124, 0.5223, 0.3818],
         [0.7180, 0.8954, 0.1314, 0.2712],
         [0.4022, 0.8065, 0.3183, 0.6993]],

        [[0.2008, 0.2597, 0.3368, 0.9321],
         [0.2766, 0.8006, 0.6173, 0.6912],
         [0.1076, 0.0813, 0.6475, 0.2482]]])


In [None]:
# datatype
print(zeros.dtype)
print(ones.dtype)
print(random_tensor.dtype)

torch.float32
torch.float32
torch.float32


### Creating a Range of Tensors and Tensors-like



PyTorch docs for "torch.arange()" - https://pytorch.org/docs/stable/generated/torch.arange.html

In [None]:
#torch.range(0, 10) # deprecated method, use "torch.arange()" instead
one_to_ten = torch.arange(0, 10) # torch.arange() starts at start and stops at end - 1, so this gives 0 to 9
print(one_to_ten)
one_to_k = torch.arange(start=0, end=1000, step=77)
print(one_to_k)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([  0,  77, 154, 231, 308, 385, 462, 539, 616, 693, 770, 847, 924])


In [None]:
# Creating tensors like
ten_zeros = torch.zeros_like(input=one_to_ten) # this will get you a tensor full of zeros, which is the same shape as the tensor input, ie "one_to_ten"
print(ten_zeros)

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


### Tensor Datatypes



**Note:** Datatype mismatches are one of the 3 big errors you'll run into with PyTorch & deep learning:
1. Tensors not the right datatype (eg. when tensors with inconsistent dtypes are combined)
2. Tensors not the right shape (eg. in matrix multiplication)
3. Tensors not on the right device (eg. device="cpu" vs device="cuda", see device-agnostic code)

In [None]:
# Float 32 tensor
float_32_tensor = torch.tensor([1.0, 2.0, 3.0], # with dtype=None, PyTorch infers the dtype from the data (float32 for Python floats)
                               dtype=None, # which datatype the tensor has (e.g. float32 or float16)
                               device=None, # which device the tensor is on
                               requires_grad=False) # whether or not to track gradients for operations on this tensor
print(float_32_tensor)
print(float_32_tensor.dtype)

tensor([1., 2., 3.])
torch.float32


In [None]:
float_16_tensor = float_32_tensor.to(dtype=torch.float16) # convert float_32_tensor to float_16_tensor // alternatively, torch.half would also work
print(float_16_tensor)
print(float_16_tensor.dtype) # float_16_tensor is now float 16

tensor([1., 2., 3.], dtype=torch.float16)
torch.float16


In [None]:
# Surprisingly, no error is raised despite the dtype mismatch: PyTorch promotes the result to the wider dtype (float32 here).
# This is normal behaviour, but it can be an issue in large neural networks, so prefer consistent dtypes.
print(float_16_tensor * float_32_tensor) # also works when multiplied by torch.long or torch.int32, etc

tensor([1., 4., 9.])
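
The silent dtype promotion seen above can be inspected directly. This is a small sketch rather than part of the original notebook: the names `f16`, `f32`, `i64`, and `promoted` are illustrative, and `torch.result_type` is used only to report the promoted dtype without doing the arithmetic.

```python
import torch

f16 = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
f32 = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
i64 = torch.tensor([1, 2, 3])  # Python ints default to int64

# Mixed-dtype arithmetic is promoted rather than rejected.
print((f16 * f32).dtype)   # float16 * float32 -> torch.float32
print((i64 * f32).dtype)   # int64   * float32 -> torch.float32

# torch.result_type reports the promoted dtype without computing the product.
promoted = torch.result_type(f16, f32)
print(promoted)
```

Checking `.dtype` on intermediate results like this is a cheap way to spot an unintended promotion before it reaches a layer that expects a specific dtype.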


### Getting Information from Tensors (Tensor Attributes)



1. Tensors not the right datatype - to get the datatype of a tensor, use `tensor.dtype`
2. Tensors not the right shape - to get the shape of a tensor, use `tensor.shape`
3. Tensors not on the right device - to get the device of a tensor, use `tensor.device`

In [None]:
# Create a test tensor
some_tensor = torch.rand(2, 3, 4)
print(some_tensor)

tensor([[[0.2966, 0.0125, 0.2912, 0.5563],
         [0.0746, 0.3665, 0.7442, 0.1715],
         [0.1300, 0.2580, 0.5232, 0.0725]],

        [[0.6415, 0.7326, 0.1114, 0.6876],
         [0.4124, 0.7489, 0.5754, 0.8741],
         [0.4082, 0.9335, 0.6183, 0.2421]]])


In [None]:
# Find out details about some_tensor
print(f"Datatype of tensor : {some_tensor.dtype }")
print(f"Shape of tensor    : {some_tensor.shape }")  # "some_tensor.shape" and "some_tensor.size()" return the same thing (.shape is an attribute, .size() is a function)
print(f"Device of tensor   : {some_tensor.device}")

Datatype of tensor : torch.float32
Shape of tensor    : torch.Size([2, 3, 4])
Device of tensor   : cpu


### Manipulating Tensors (Tensor Operations)



Tensor Operations include:
* Addition
* Subtraction
* Multiplication (element-wise)
* Division
* Matrix Multiplication

In [None]:
# Create a Tensor
tensor = torch.tensor([1, 2, 3]) # initialised from Python list or array
print(tensor)
print(tensor + 10) # addition
print(tensor - 10) # subtraction
print(tensor * 10) # multiplication
print(tensor / 10) # division

tensor([1, 2, 3])
tensor([11, 12, 13])
tensor([-9, -8, -7])
tensor([10, 20, 30])
tensor([0.1000, 0.2000, 0.3000])


In [None]:
# Try out PyTorch built-in functions
print(torch.add(tensor, 10)) # addition
print(torch.sub(tensor, 10)) # subtraction
print(torch.mul(tensor, 10)) # multiplication
print(torch.div(tensor, 10)) # division

print(torch.abs(tensor)) # absolute value
print(torch.sqrt(tensor)) # square root
print(torch.exp(tensor)) # exponential

tensor([11, 12, 13])
tensor([-9, -8, -7])
tensor([10, 20, 30])
tensor([0.1000, 0.2000, 0.3000])
tensor([1, 2, 3])
tensor([1.0000, 1.4142, 1.7321])
tensor([ 2.7183,  7.3891, 20.0855])


### Matrix Multiplication




Two main ways of performing multiplication in neural networks and deep learning are:
1. Element-wise multiplication
2. Matrix multiplication (dot product) [the terms dot product and matrix multiplication are often used interchangeably]

URL: https://www.mathsisfun.com/algebra/matrix-multiplying.html

Random Note: To multiply a matrix by another matrix, we need to do the "dot product" of rows and columns... (rewatch 3Blue1Brown if recap needed on linear algebra)

---

Regarding the anatomy of a matrix: below is a **4x2 matrix** as an example. Note that matrix dimensions are written as `rows x columns`. Use `tensor.shape` to find the dimensions of a matrix:

`[`

>  `[1, 2],`

>  `[3, 4],`

>  `[5, 6],`

>  `[7, 8]`

`]`

Matrix multiplication needs to satisfy 2 main rules:

1. The **inner dimensions** must match:

* `(3x2) @ (3x2)` <-- this won't work
* `(2x3) @ (3x2)` <-- this will work
* `(3x2) @ (2x2)` <-- this will work

2. The resulting matrix has the shape of the **outer dimensions**:

* `(4x3) @ (3x6) = (4x6)` <-- 4x6 matrix outputted
* `(3x1) @ (1x7) = (3x7)` <-- 3x7 matrix outputted


> Note: regarding matrix multiplication syntax, `tensor @ tensor` is the same as `torch.matmul(tensor, tensor)`, including performance. Python's `@` operator calls the `__matmul__` magic method, which PyTorch's tensor class implements.
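
The two rules can be checked in a few lines. This is a hedged sketch with illustrative names (`a`, `b`, `c`, `mismatch_raised`) that do not appear in the original notebook.

```python
import torch

a = torch.rand(3, 2)
b = torch.rand(3, 2)
c = torch.rand(2, 3)

# Rule 1: inner dimensions must match. (3x2) @ (3x2) fails because 2 != 3.
mismatch_raised = False
try:
    a @ b
except RuntimeError:
    mismatch_raised = True

# Rule 2: the result takes the outer dimensions. (3x2) @ (2x3) -> (3x3).
out = a @ c
print(mismatch_raised, out.shape)

# The @ operator and torch.matmul compute the same product.
print(torch.allclose(a @ c, torch.matmul(a, c)))
```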

In [None]:
# Element-wise Multiplication
print(tensor, "*", tensor)
print(f"Equals: {tensor * tensor}")

print(torch.rand(4, 2).shape) # outputs the shape of a 4x2 matrix
print((torch.rand(4, 3) @ torch.rand(3, 6)).shape) # 4x3 @ 3x6 -> 4x6
print((torch.rand(3, 1) @ torch.rand(1, 7)).shape) # 3x1 @ 1x7 -> 3x7

tensor([1, 2, 3]) * tensor([1, 2, 3])
Equals: tensor([1, 4, 9])
torch.Size([4, 2])
torch.Size([4, 6])
torch.Size([3, 7])


In [None]:
# Matrix Multiplication
torch.matmul(tensor, tensor)

tensor(14)

In [None]:
%%time
value=0
for i in range(len(tensor)):
    value += tensor[i] * tensor[i]
print(value)

tensor(14)
CPU times: user 2.09 ms, sys: 202 µs, total: 2.29 ms
Wall time: 2.52 ms


In [None]:
%%time
torch.matmul(tensor, tensor)

CPU times: user 101 µs, sys: 19 µs, total: 120 µs
Wall time: 127 µs


tensor(14)

### One of the Most Common Errors in Deep Learning: Shape Errors



In [None]:
# Shapes for Matrix Multiplication

tensor_A = torch.tensor([[1, 2],
                         [3, 4],
                         [5, 6]])

tensor_B = torch.tensor([[7, 10, 1],
                         [8, 11, 2]])

#torch.mm(tensor_A, tensor_B)  # torch.mm() gives the same result for 2-D matrices; unlike torch.matmul() it does not broadcast
torch.matmul(tensor_A, tensor_B) # 3x2 @ 2x3: the inner dimensions match (2 and 2), so this works and outputs a 3x3 matrix

tensor([[ 23,  32,   5],
        [ 53,  74,  11],
        [ 83, 116,  17]])

One way to fix tensor shape issues is to manipulate the shape of one of the tensors using a **transpose**.

A **transpose** switches the axes or dimensions of a given tensor.

In [None]:
print("Original Shape:", tensor_A.shape)
print("Transposed Shape:", tensor_A.T.shape)

Original Shape: torch.Size([3, 2])
Transposed Shape: torch.Size([2, 3])
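
To see a transpose actually resolve a shape error, the sketch below pairs `tensor_A` with a hypothetical 3x2 tensor. `tensor_C` is an assumption introduced for illustration; it is not the 2x3 `tensor_B` used earlier.

```python
import torch

tensor_A = torch.tensor([[1, 2],
                         [3, 4],
                         [5, 6]])

# Hypothetical 3x2 partner, chosen so that the shapes clash.
tensor_C = torch.tensor([[7, 10],
                         [8, 11],
                         [9, 12]])

# (3x2) @ (3x2) would raise a shape error; transposing tensor_C gives (3x2) @ (2x3).
result = torch.matmul(tensor_A, tensor_C.T)
print(result)
print(result.shape)
```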


### Finding the min, max, mean, sum, etc. (Tensor Aggregation)



Definition of "Aggregation": The formation of a number of things into a cluster.


In [None]:
# Create a Tensor

x = torch.arange(0, 100, 10)
print(x)

print(torch.min(x), x.min()) # Find the Min
print(torch.max(x), x.max()) # Find the Max
print(torch.mean(x.float()), x.type(torch.float32).mean()) # Find the Mean // When calculating mean, needs to be in float or complex dtype. Long doesn't work. Refer to MISC.
print(torch.sum(x), x.sum()) # Find the Sum


tensor([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
tensor(0) tensor(0)
tensor(90) tensor(90)
tensor(45.) tensor(45.)
tensor(450) tensor(450)
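
The dtype requirement for the mean can be demonstrated directly. This is a sketch under the assumption that a recent PyTorch release raises `RuntimeError` for integer input; the names `mean_failed`, `mean_cast`, and `mean_dtype` are illustrative.

```python
import torch

x = torch.arange(0, 100, 10)   # dtype is int64 (torch.long)

# Calling mean() directly on an integer tensor is rejected.
mean_failed = False
try:
    x.mean()
except RuntimeError:
    mean_failed = True

# Two equivalent fixes: cast first, or ask mean() for a floating-point result.
mean_cast = x.float().mean()
mean_dtype = x.mean(dtype=torch.float32)
print(mean_failed, mean_cast, mean_dtype)
```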


### Finding the Positional min and max

In [None]:
print(x) # outputs the result of "torch.arange(0, 100, 10)"

# Find the position in tensor that has the minimum value with "argmin()" -> returns index position of target tensor where the minimum value occurs
print(x.argmin(), x[x.argmin()])

# Find the position in tensor that has the maximum value with "argmax()" -> returns index position of target tensor where the maximum value occurs
print(x.argmax(), x[x.argmax()])

tensor([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
tensor(0) tensor(0)
tensor(9) tensor(90)


### Reshaping, Stacking, Squeezing and Unsqueezing Tensors



* Reshaping - Reshapes an input tensor to a defined shape, eg a 12-element vector to a 2x6 matrix or a 4x3 matrix
* View - Return a view of an input tensor of certain shape but keep the same memory as the original tensor
* Stacking - combine multiple tensors on top of each other (vstack) or side by side (hstack)
* Squeeze - removes all '1' dimensions from a tensor
* Unsqueeze - add a '1' dimension to a target tensor
* Permute - Return a view of the input with dimensions permuted (swapped) in a certain way
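
The stacking variants named in the list can be compared side by side. A brief sketch with illustrative tensors `a` and `b`, which are not part of the original notebook:

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

stacked = torch.stack((a, b), dim=0)   # new leading dimension -> shape [2, 3]
v = torch.vstack((a, b))               # rows on top of each other -> shape [2, 3]
h = torch.hstack((a, b))               # joined end to end -> shape [6]

print(stacked.shape, v.shape, h.shape)
```

For 1-D inputs, `torch.stack(..., dim=0)` and `torch.vstack` give the same result, while `torch.hstack` concatenates without adding a dimension.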

In [None]:
x = torch.arange(0., 12.)
print(x, x.shape, '\n')

# Add an Extra Dimension -- tensor.reshape()
x_reshaped = x.reshape(2, 6)  # reshapes input 1x12 matrix to 2x6 matrix
print("Reshapes 1x12 matrix to 2x6 matrix")
print(x_reshaped, x_reshaped.shape, '\n')

x_reshaped = x.reshape(4, 3)  # reshapes input 1x12 matrix to 4x3 matrix
print("Reshapes 1x12 matrix to 4x3 matrix")
print(x_reshaped, x_reshaped.shape, '\n')

# Quick note regarding tensor.shape
#   For a scalar, it results in "torch.Size([])"                    <-- 0 dimensions
#   For a vector, it results in "torch.Size([int])"                 <-- 1 dimension
#   For a matrix, it results in "torch.Size([int, int])"            <-- 2 dimensions
#   For a tensor, it results in "torch.Size([int, int, int, ...])"  <-- 3+ dimensions

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.]) torch.Size([12]) 

Reshapes 1x12 matrix to 2x6 matrix
tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]]) torch.Size([2, 6]) 

Reshapes 1x12 matrix to 4x3 matrix
tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.]]) torch.Size([4, 3]) 



In [None]:
newTensor = torch.arange(0., 12.)

# Change the View -- tensor.view()
tempView = newTensor.view(2, 6)
print(tempView, tempView.shape)

# Changing the "tempView" tensor also changes the "newTensor" tensor (because a view of a tensor shares the same memory as the original input)
tempView[:, 3] = 100  # refer to MISC regarding slicing
print("\n", tempView)
print(newTensor)

tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]]) torch.Size([2, 6])

 tensor([[  0.,   1.,   2., 100.,   4.,   5.],
        [  6.,   7.,   8., 100.,  10.,  11.]])
tensor([  0.,   1.,   2., 100.,   4.,   5.,   6.,   7.,   8., 100.,  10.,  11.])
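
One practical difference between `view()` and `reshape()` shows up on non-contiguous tensors. The sketch below is an illustration with hypothetical names (`base`, `flipped`, `flat`), assuming that `view()` raises `RuntimeError` when the memory layout does not fit the requested shape.

```python
import torch

base = torch.arange(6).reshape(2, 3)
flipped = base.T                      # transpose is a non-contiguous view of base

# view() needs a memory layout compatible with the requested shape.
view_failed = False
try:
    flipped.view(6)
except RuntimeError:
    view_failed = True

# reshape() falls back to copying when a view is not possible.
flat = flipped.reshape(6)
flat[0] = 100                         # writes to the copy, not to base
print(view_failed, flat, base)
```

Because `reshape()` may return either a view or a copy, code that relies on shared memory is safer with `view()`, which fails loudly instead of copying.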


In [None]:
# Stack tensors on top of each other
x_stacked = torch.stack((x, x, x, x), dim=0)
print(x_stacked, x_stacked.shape)
x_stacked = torch.stack((x, x, x, x), dim=1)
print(x_stacked, x_stacked.shape)

x_stacked = torch.stack((x, x, x, x), dim=-1) # dim=-1 refers to last dimension of tensor
print(x_stacked, x_stacked.shape)
x_stacked = torch.stack((x, x, x, x), dim=-2) # dim=-2 refers to second to last dimension of tensor
print(x_stacked, x_stacked.shape)

# NOTE: Refer to MISC on Stack in PyTorch

tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
        [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
        [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
        [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.]]) torch.Size([4, 12])
tensor([[ 0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.],
        [ 2.,  2.,  2.,  2.],
        [ 3.,  3.,  3.,  3.],
        [ 4.,  4.,  4.,  4.],
        [ 5.,  5.,  5.,  5.],
        [ 6.,  6.,  6.,  6.],
        [ 7.,  7.,  7.,  7.],
        [ 8.,  8.,  8.,  8.],
        [ 9.,  9.,  9.,  9.],
        [10., 10., 10., 10.],
        [11., 11., 11., 11.]]) torch.Size([12, 4])
tensor([[ 0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.],
        [ 2.,  2.,  2.,  2.],
        [ 3.,  3.,  3.,  3.],
        [ 4.,  4.,  4.,  4.],
        [ 5.,  5.,  5.,  5.],
        [ 6.,  6.,  6.,  6.],
        [ 7.,  7.,  7.,  7.],
        [ 8.,  8.,  8.,  8.],
        [ 9.,  9.,  9.,  9.],
        [10., 10.,

In [None]:
print(x_reshaped)
print(x_reshaped.shape, '\n')

# torch.unsqueeze() - adds a single dimension to a target tensor at a specific dim (dimension)
x_unsqueezed = x_reshaped.unsqueeze(dim=0).unsqueeze(dim=2) # turns torch.Size([4, 3]) to torch.Size([1, 4, 1, 3])    // adding 2 dimensions in this case

print(x_unsqueezed)
print(x_unsqueezed.shape, '\n')

# torch.squeeze() - removes all single dimensions from a target tensor
x_squeezed = x_unsqueezed.squeeze() # turns torch.Size([1, 4, 1, 3]) to torch.Size([4, 3]) // All single-dimensions removed

print(x_squeezed)
print(x_squeezed.shape, '\n')

tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.]])
torch.Size([4, 3]) 

tensor([[[[ 0.,  1.,  2.]],

         [[ 3.,  4.,  5.]],

         [[ 6.,  7.,  8.]],

         [[ 9., 10., 11.]]]])
torch.Size([1, 4, 1, 3]) 

tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.]])
torch.Size([4, 3]) 
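
`squeeze()` also accepts a `dim` argument to remove one specific size-1 dimension instead of all of them. A short sketch using an illustrative tensor `x`:

```python
import torch

x = torch.zeros(1, 4, 1, 3)

print(x.squeeze().shape)        # every size-1 dimension removed -> [4, 3]
print(x.squeeze(dim=0).shape)   # only dim 0 removed -> [4, 1, 3]
print(x.squeeze(dim=1).shape)   # dim 1 has size 4, so nothing changes -> [1, 4, 1, 3]
```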



In [None]:
# torch.permute() - rearranges the dimensions of a target tensor in a specified order
colour = torch.rand(size=(4, 6, 3))
print(colour)                         # [height, width, colour_channels]
print(colour.shape, '\n')

# Permute the original colour tensor to rearrange the axis (or dim) order
colour_permuted = colour.permute(2, 0, 1) # turns torch.Size([4, 6, 3]) to torch.Size([3, 4, 6])

print(colour_permuted)
print(colour_permuted.shape, '\n')    # [colour_channels, height, width]

# Indexing -- Original + Permuted
print(colour[1, 5, 2], colour_permuted[2, 1, 5]) # both this line and below accesses the same value, notice how their ith index is changed based on .permute() ordering
print(colour[2, 1, 2], colour_permuted[2, 2, 1])

tensor([[[0.5645, 0.1165, 0.1019],
         [0.6365, 0.7198, 0.4303],
         [0.3915, 0.4394, 0.0663],
         [0.5587, 0.2058, 0.1887],
         [0.1834, 0.5979, 0.0802],
         [0.1227, 0.6293, 0.7242]],

        [[0.0393, 0.2908, 0.0398],
         [0.3694, 0.6484, 0.2823],
         [0.5443, 0.5146, 0.0168],
         [0.4561, 0.8653, 0.1424],
         [0.1205, 0.0313, 0.7364],
         [0.7920, 0.0541, 0.1390]],

        [[0.5999, 0.4486, 0.7372],
         [0.4481, 0.8352, 0.8899],
         [0.9212, 0.1219, 0.3458],
         [0.2967, 0.7259, 0.9953],
         [0.3462, 0.4986, 0.4077],
         [0.8274, 0.8302, 0.4886]],

        [[0.8864, 0.9980, 0.4603],
         [0.9205, 0.8473, 0.3068],
         [0.0061, 0.0761, 0.8057],
         [0.9956, 0.2334, 0.2998],
         [0.0985, 0.1982, 0.5070],
         [0.9418, 0.0557, 0.1459]]])
torch.Size([4, 6, 3]) 

tensor([[[0.5645, 0.6365, 0.3915, 0.5587, 0.1834, 0.1227],
         [0.0393, 0.3694, 0.5443, 0.4561, 0.1205, 0.7920],
         [
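
Since `permute()` returns a view, the permuted tensor shares memory with the original. A small sketch with illustrative names (`image`, `image_chw`) to make that visible:

```python
import torch

image = torch.zeros(4, 6, 3)            # [height, width, colour_channels]
image_chw = image.permute(2, 0, 1)      # [colour_channels, height, width]

# Writing through the permuted view changes the original tensor too.
image_chw[2, 1, 5] = 1.0
print(image[1, 5, 2])                   # same element, original index order
print(image.data_ptr() == image_chw.data_ptr())
```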

### Indexing (Selecting Data from Tensors)

Indexing with PyTorch is similar to indexing with NumPy.

In [None]:
# Create a Tensor
x = torch.arange(0., 12.)
print(x, '\n')
x = x.reshape(1, 3, 4)
print(x, x.shape)

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.]) 

tensor([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]]) torch.Size([1, 3, 4])


In [None]:
# Let's index on our new tensor
print(x[0])               # accesses dim=0
print(x[0][1])            # accesses dim=1
print(x[0][1][2], '\n')   # accesses dim=2

print(x[0])
print(x[0, 1])
print(x[0, 1, 2]) # [int, int, int] indexing is the same as [int][int][int]

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([4., 5., 6., 7.])
tensor(6.) 

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([4., 5., 6., 7.])
tensor(6.)


In [None]:
# You can also use ':' (slicing) to select "all" of a target dimension
print(x[0, :, 2], '\n') # outputs the 3rd value from all values (arrays) of dim=1
print(x[0, :, :], '\n') # outputs all values from all values (arrays) in dim=1

tensor([ 2.,  6., 10.]) 

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]]) 
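
A few more index patterns on the same `(1, 3, 4)` tensor, as a sketch for practice. The tensor is rebuilt here so the block runs on its own.

```python
import torch

x = torch.arange(0., 12.).reshape(1, 3, 4)

print(x[0, 2, 1])     # single element: row 2, column 1 -> 9.
print(x[0, :, 3])     # last column of every row -> [3., 7., 11.]
print(x[:, 1, 1])     # ':' keeps dim 0, so the result has shape [1] -> [5.]
print(x[0, -1, :])    # a negative index selects the last row -> [8., 9., 10., 11.]
```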



### PyTorch Tensors && NumPy


NumPy is a very popular Python library for scientific and numerical computing.

Because of this, PyTorch has functionality to interact with it.

* Convert data in NumPy, to PyTorch Tensor --> `torch.from_numpy(ndarray)`
* Convert PyTorch Tensor to NumPy data --> `torch.Tensor.numpy()`

In [None]:
# NumPy array to tensor
array = np.arange(0.0, 8.0)
tensor = torch.from_numpy(array) # Warning: when converting from NumPy -> PyTorch, PyTorch reflects NumPy's default dtype of float64, unless specified otherwise
print(tensor)                   # float64 - NumPy default
print(tensor.float())           # converted to float32
print(tensor.double(), '\n')    # converted to float64 (double)

# Tensor to NumPy array
array = tensor.numpy()
print(array, array.dtype, '\n')

tensor([0., 1., 2., 3., 4., 5., 6., 7.], dtype=torch.float64)
tensor([0., 1., 2., 3., 4., 5., 6., 7.])
tensor([0., 1., 2., 3., 4., 5., 6., 7.], dtype=torch.float64) 

[0. 1. 2. 3. 4. 5. 6. 7.] float64 



In [None]:
# Changing the value of NumPy array, using "array = array + 1"
array = tensor.numpy()
array = array + 1                                                               # NOTE: Refer to notes below regarding behaviour
print("array = array + 1: ", tensor, array, '\n')

# Changing the value of NumPy array, using "array += 1"
tensor = tensor.float()
array = tensor.numpy()
array += 1                                                                      # NOTE: Refer to notes below regarding behaviour
print("array += 1: ", tensor, array, '\n')                                      # NOTICE HOW BOTH ARRAY AND TENSOR ARE CHANGED HERE

print("\ndtype check for both Numpy array and PyTorch tensor\n------")
print(array, array.dtype) # dtype of NumPy array is float32, because the dtype of tensor was float32 prior.
print(tensor, tensor.dtype)

array = array + 1:  tensor([0., 1., 2., 3., 4., 5., 6., 7.], dtype=torch.float64) [1. 2. 3. 4. 5. 6. 7. 8.] 

array += 1:  tensor([1., 2., 3., 4., 5., 6., 7., 8.]) [1. 2. 3. 4. 5. 6. 7. 8.] 


dtype check for both Numpy array and PyTorch tensor
------
[1. 2. 3. 4. 5. 6. 7. 8.] float32
tensor([1., 2., 3., 4., 5., 6., 7., 8.]) torch.float32


When you perform `array = array + 1` in NumPy, it creates a new array with the result of `array + 1` and then assigns this new array to the variable array. This operation does not modify the original array in place, but rather rebinds the name array to a new object. Therefore, the PyTorch tensor that was originally linked to the NumPy array will not reflect this change since the underlying data is not modified, just the binding.

On the other hand, when you do `array += 1`, this is an in-place operation which modifies the existing array's data. A NumPy array returned by `tensor.numpy()` shares memory with its tensor, just as a tensor created by `torch.from_numpy()` shares memory with its array, so in-place changes to the NumPy array are also reflected in the PyTorch tensor.

Here's a brief outline of what happens with each operation:

* `array = array + 1`: Creates a new array with incremented values and assigns it to the variable array. The original array remains unmodified.
* `array += 1`: Adds 1 to each element of the original array in place, which also updates the values in the linked PyTorch tensor.

In the PyTorch documentation, this sharing of memory between NumPy arrays and PyTorch tensors is clearly stated, and it's important to be aware of this to avoid unexpected side effects when manipulating either of the objects.
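
The two behaviours can be reproduced in isolation. This is a minimal sketch with a fresh array; it also uses `clone()`, which is not covered in the notes above, as one way to get an independent copy.

```python
import numpy as np
import torch

array = np.ones(3)
tensor = torch.from_numpy(array)      # shares memory with array

array += 1                            # in place: the tensor sees the change
print(tensor)                         # values are now 2.0

array = array + 1                     # rebinding: a new array, the tensor is untouched
print(array, tensor)

independent = tensor.clone()          # clone() copies the data and breaks the link
tensor += 10
print(independent, tensor)
```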

### Reproducibility (Random Seed)



**In Short How a Neural Network Learns:**

---

1. Start with random numbers
2. Tensor Operations
3. Update random numbers to try and make them better representations of data
4. Step 3 repeats, again, again, again.....

---

To reduce the randomness in neural networks and PyTorch, there is the concept of a **random seed**.

Essentially, what the random seed does is "flavour" the randomness.

In [None]:
# Create 2 random tensors
random_tensor_A = torch.rand(3, 4)
random_tensor_B = torch.rand(3, 4)

print(random_tensor_A)
print(random_tensor_B)

print(random_tensor_A == random_tensor_B)

tensor([[0.8853, 0.4271, 0.8289, 0.5907],
        [0.6330, 0.7076, 0.4465, 0.5986],
        [0.6570, 0.0455, 0.5694, 0.1602]])
tensor([[0.5202, 0.7657, 0.4440, 0.3385],
        [0.6085, 0.0650, 0.1643, 0.7576],
        [0.1722, 0.3328, 0.0262, 0.7212]])
tensor([[False, False, False, False],
        [False, False, False, False],
        [False, False, False, False]])


In [None]:
# Let's make some random but reproducible tensors

# Set the random seed
RANDOM_SEED = 42

torch.manual_seed(RANDOM_SEED)
random_tensor_C = torch.rand(3, 4)

torch.manual_seed(RANDOM_SEED) # Need to call manual_seed() again if you want the next torch.rand() call to produce the same numbers.
random_tensor_D = torch.rand(3, 4)

print(random_tensor_C)
print(random_tensor_D)

print(random_tensor_C == random_tensor_D)

tensor([[0.8823, 0.9150, 0.3829, 0.9593],
        [0.3904, 0.6009, 0.2566, 0.7936],
        [0.9408, 0.1332, 0.9346, 0.5936]])
tensor([[0.8823, 0.9150, 0.3829, 0.9593],
        [0.3904, 0.6009, 0.2566, 0.7936],
        [0.9408, 0.1332, 0.9346, 0.5936]])
tensor([[True, True, True, True],
        [True, True, True, True],
        [True, True, True, True]])
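
An alternative to reseeding the global state is a dedicated `torch.Generator`. This sketch is an addition rather than part of the original course material, and the names `gen_a`, `gen_b`, `first`, `second`, and `third` are illustrative.

```python
import torch

# A dedicated generator keeps the seed local instead of resetting the global one.
gen_a = torch.Generator().manual_seed(42)
gen_b = torch.Generator().manual_seed(42)

first = torch.rand(3, 4, generator=gen_a)
second = torch.rand(3, 4, generator=gen_b)
third = torch.rand(3, 4, generator=gen_a)   # gen_a has advanced, so this draw differs

print(torch.equal(first, second))
print(torch.equal(first, third))
```

This mirrors the cell above: the same seed gives the same first draw, and each further draw from the same generator moves on to new numbers.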


Resources
* PyTorch docs on Reproducibility: https://pytorch.org/docs/stable/notes/randomness.html
* Wikipedia on Random Seed: https://www.wikiwand.com/en/Random_seed

### Running Tensors and PyTorch Objects on the GPUs (and making faster computations)


GPUs = faster computation on numbers, thanks to CUDA + Nvidia hardware + PyTorch working behind the scenes to make everything possible

#### 1. Getting a GPU



1. Easiest - Using Google Colab - Cloud Computing (Free or Pro)
2. Use personal GPU (investment required)
3. Use Cloud Computing - GCP, AWS, Azure... these services allow you to rent computers on the cloud and access them.

For Option 2 and Option 3, PyTorch + GPU drivers (CUDA) take a little bit of setting up. To do this, refer to the PyTorch setup documentation:

>https://pytorch.org/get-started/locally/

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


#### 2. Check for GPU access with PyTorch

Since PyTorch can run compute on the GPU or the CPU, it's best practice to set up device-agnostic code:

>https://pytorch.org/docs/stable/notes/cuda.html#best-practices

I.e. run on GPU if available, else default to CPU

In [None]:
# Check for GPU access with PyTorch
print(torch.cuda.is_available())

# Setup device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Count number of devices
torch.cuda.device_count()

False
cpu


0

#### 3. Putting Tensors (and Models) on the GPU

The reason we want our tensors/models on the GPU is that using a GPU results in faster computations.

In [None]:
# Create a tensor (default on the CPU)
tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Print the tensor and the device
print(tensor, tensor.device) # tensor is initially on CPU

# Move tensor to GPU (if available)
tensor_on_gpu = tensor.to(device) # tensor is moved from CPU to GPU (if available)
print(tensor_on_gpu, tensor_on_gpu.device)

# Notes regarding " device='cuda:0' "
# 'cuda:0': This is the index of the GPU that you're referring to. :0 means the first GPU. If your system has multiple GPUs,
# they would be indexed as cuda:0, cuda:1, cuda:2, and so on.

tensor([[1, 2, 3],
        [4, 5, 6]]) cpu
tensor([[1, 2, 3],
        [4, 5, 6]]) cpu
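
The heading above also mentions models, so here is a minimal sketch of the same `.to(device)` pattern applied to a model. The `nn.Linear` layer is only a hypothetical placeholder, not a model used elsewhere in these notes.

```python
import torch
from torch import nn

# Device-agnostic setup, same pattern as in the cell above
device = "cuda" if torch.cuda.is_available() else "cpu"

# A tiny placeholder model (hypothetical, for illustration only)
model = nn.Linear(in_features=3, out_features=1)

# nn.Module.to() moves the model's parameters to the target device
model = model.to(device)

# Inputs must live on the same device as the model's parameters
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], device=device)
y = model(x)

print(next(model.parameters()).device, y.device, y.shape)
```

On a machine without a GPU (like the run above), everything simply stays on the CPU.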


#### 4. Moving Tensors Back to the CPU

In [None]:
# If tensor is on GPU, can't transform it to NumPy
# To fix the tensor on GPU -> Numpy issue; we need to first set the tensor back to CPU before calling .numpy()
tensor_on_cpu = tensor_on_gpu.cpu()
print(tensor_on_cpu.numpy(), tensor_on_cpu.numpy().dtype)
print('\n', tensor_on_gpu)

[[1 2 3]
 [4 5 6]] int64

 tensor([[1, 2, 3],
        [4, 5, 6]])


# Miscellaneous


### Miscellaneous Notes - Image Tensor

**GPT Prompt:** In tensor-speak (the language used to describe tensors), the tensor would have three dimensions, one for colour_channels, height and width.

<br>

**GPT Answer:**

The quote you've provided talks about how to describe the structure of a tensor, particularly in the context of image data, using the specific terminology or "language" associated with tensors.

In the realm of machine learning and computer vision, images are commonly represented as tensors. A tensor, in this context, is essentially a multi-dimensional array. For an image, these dimensions correspond to different aspects of the data it contains:

Color Channels: The number of channels in an image represents the depth of color information. For instance, a standard RGB (Red, Green, Blue) image has three color channels. Each channel stores intensity values for its respective color across the image. There are also grayscale images, which have just one channel, and other formats that may have more channels (like RGBA, where the A stands for Alpha or transparency).

Height: This dimension represents the vertical size of the image in pixels.

Width: This dimension represents the horizontal size of the image in pixels.

So, when the quote mentions a tensor having "three dimensions, one for color_channels, height, and width," it's describing how an image is structured in a way that's understood by tensor operations. For a color image in the RGB format, you could conceptualize it as a 3D tensor:

* The first dimension corresponds to the color channels (3 for RGB).
* The second dimension corresponds to the height of the image in pixels.
* The third dimension corresponds to the width of the image in pixels.

Thus, if you have a 1080x1920 pixel RGB image, in tensor-speak, you would describe it as having dimensions of 3x1080x1920, where 3 represents the RGB channels, 1080 represents the height, and 1920 represents the width.

Understanding this terminology is crucial for working with image data in machine learning frameworks like TensorFlow or PyTorch, as it affects how you process, manipulate, and train models with image data.

<br>
---
<br>

Here's an example of what a tensor array for a 10x10 image with RGBA channels might look like. Each entry in this array is a randomly generated integer between 0 and 255, representing the intensity of the Red, Green, Blue, and Alpha (transparency) channels at each pixel:

---
```
Array shape: (4, 10, 10)

array([[[203, 178,  63, 203,  19,  80,  77,  99, 226, 227], #red
        [148,  31, 141, 150, 187, 184, 214,  10,  40,  95],
        [ 96,  76, 213, 117,  79, 224, 174,  44,   7, 236],
        [182,  25, 242, 120, 207, 194,  52,  38,  42, 222],
        [105, 193, 169, 146, 221,  76, 179,  84, 109, 222],
        [  4, 105,  73,  54, 242, 172, 238,  23,  89, 192],
        [ 79,  26,  75, 133,  92,  10, 115, 114,  87, 227],
        [124, 183,  36, 176, 115, 246, 151, 227,  54,   1],
        [  8, 255, 102, 242, 116,  93,  10, 184, 184,   1],
        [161, 118,  98, 150,  89, 208,  76,  77,  58, 136]],

       [[171, 231, 224, 128, 149, 101,  72, 167,  57, 178], #green
        [170,   3, 175, 194, 243,  77, 188,  97, 197, 206],
        [250, 197,  22, 191, 215, 156,  73,  38, 196,  34],
        [255, 164, 187,  28, 114, 140,  65, 162,  74, 168],
        [212, 127, 120, 162,  86,  22, 141, 126,  81, 167],
        [168, 201,  21, 219, 148,  70, 102, 206, 111, 214],
        [ 23, 204, 152,  56, 164, 110, 197, 213,  28, 206],
        [162, 149, 100, 206,  25,  88, 161,  83, 173, 177],
        [218,  38,  76, 220,  37,  36,  11, 228,   0, 107],
        [ 29,  94, 173, 170,  31, 207, 251,  65, 205,  95]],

       [[ 17, 117, 145,  64, 124,  20, 242, 202, 159,  72], #blue
        [214, 102, 124, 170, 191,  93,  23, 251, 249,  62],
        [ 25, 245, 169, 116, 131,  39,  68, 147,  79,  90],
        [111,  33,  35, 196, 197, 170,  51, 197,  62,  70],
        [ 47,  41, 113, 166, 202,  24, 242, 246,  68, 245],
        [174,   4,  77,  83,  42,  65, 183,  58, 113,  44],
        [221, 117, 252, 153, 119,  45, 137, 104,  58,  57],
        [168, 231, 210, 239,  32,  30,   9, 150,  95, 214],
        [127, 240,  56, 200, 161, 137, 197,   6, 231,  72],
        [  4, 109,  53,  11, 197, 217,  88,   1, 249, 254]],

       [[ 23, 252, 127,   8,  58, 222, 123,   6,  14, 108], #alpha
        [  4, 195,  26, 177,  59, 145, 199, 167, 181,  93],
        [  4, 119, 204, 182, 108,  53, 175, 159,  83,  50],
        [ 47, 233, 103, 198,  84, 134,  42, 145,  93, 235],
        [ 35, 165, 152,  77, 101, 153, 242, 206, 109,   9],
        [ 78, 210, 182, 151,  48, 209, 199,  33, 138,  95],
        [226, 206, 106,  47,  54,  49, 178, 107, 112, 105],
        [195,  17,  24, 144,  91,  14, 102, 123, 246, 231],
        [253, 137,  22, 133,  19, 192, 166,  24, 145,  78],
        [ 69, 165, 135, 255, 122, 125, 135, 160, 174,  54]]], dtype=uint8)
```
---

This representation shows a 3-dimensional tensor of shape (4, 10, 10) where:

* The first dimension corresponds to the color channels (RGBA), so there are 4 layers or slices.
* The second and third dimensions correspond to the height and width of the image, respectively, both set to 10 pixels in this example.

Each "layer" or slice of the tensor represents one of the RGBA channels across the entire image. The values within each layer represent the intensity of that channel for each pixel, with 0 being no intensity (or completely transparent for the alpha channel) and 255 being the maximum intensity.
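
A similar array can be generated directly in PyTorch. This is a minimal sketch, assuming random values stand in for real pixel data; the variable names are made up for illustration.

```python
import torch

torch.manual_seed(0)  # only so the random values are repeatable

# Random stand-in for a 10x10 RGBA image: (colour_channels, height, width)
rgba = torch.randint(low=0, high=256, size=(4, 10, 10), dtype=torch.uint8)

# Three dimensions in total: one for channels, one for height, one for width
print(rgba.shape, rgba.ndim, rgba.dtype)

# One channel is a 2-D slice (here: alpha, the 4th channel)
alpha = rgba[3]
print(alpha.shape)

# All four channel values of the top-left pixel
pixel = rgba[:, 0, 0]
print(pixel.shape)
```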







**Notes Regarding Why you don't need to declare print() to output value in Google Colab**

<br>

In Jupyter notebooks, which include Google Colab notebooks, you do not always need to use the `print()` function to display output. This environment is designed for interactive computing, and it has a feature that automatically displays the result of the last operation in each cell.

When you run a cell in a Jupyter notebook (or Google Colab, which is based on Jupyter), the following rules apply:

1. The last line of code in a cell, when executed, will have its result displayed below the cell, even if it is not printed with the `print()` function. This is known as the cell's output.
2. If the last line of code does not produce any output, nothing is displayed.
3. If you have other lines of code in the cell whose output you wish to display, you need to explicitly use `print()` or another display function for those lines.

This feature is especially useful for quick data exploration and analysis, as it allows you to see the results immediately without needing to use `print()` every time. However, if you want to format the output or display multiple outputs within a cell, you'll need to use `print()` or another suitable display function.

### Miscellaneous Notes - Dimensions

**GPT Prompt**: What even is ndim in torch, as in, what is the dimension of a tensor? Scalars are normally associated with the number line (1st dimension), but here they are associated with the 0th dimension. Why does TensorFlow consider a scalar as 0th, a vector as 1st, a matrix as 2nd, etc.?

<br>

**GPT Answer**:

The concept of dimensionality in tensors (as used in libraries like PyTorch and TensorFlow) is rooted in the mathematical definition of dimensions in linear algebra, but it's also slightly adapted to fit programming and data structure paradigms.

**Understanding Dimensions**

* Dimensionality: In the context of tensors, the dimensionality (or rank) refers to the number of indices required to uniquely select each element of the tensor. This concept is directly tied to how data can be organized and represented in structured forms:
> * A scalar can be thought of as a single number. It doesn't contain any axes (or directions), so it's considered 0-dimensional. You don't need any index to access it because it's just a single value.
> * A vector is a 1-dimensional array. You need one index to access each element. Think of it as a line where each element can be located at a certain position along that line.
> * A matrix is a 2-dimensional array. You need two indices to access each element (one for the row and one for the column). It can be visualized as a plane or grid where positions are defined by two coordinates.
> * Higher-dimensional tensors (3D, 4D, etc.) require more indices to access each element and represent more complex structures (like a cube for a 3D tensor).

**Why Scalars are Considered 0-Dimensional**

The designation of scalars as 0-dimensional in libraries like TensorFlow and PyTorch is consistent with how dimensions are mathematically defined. Since a scalar does not require any indices to access (it's just a single value), it is considered 0-dimensional. This definition helps maintain consistency across mathematical theory and its implementation in programming:

* It aligns with the mathematical concept of a dimension as the number of coordinates needed to specify a point within it.
* It provides a basis for defining operations and transformations on tensors in a way that is consistent regardless of their rank.
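
The "number of indices" idea can be checked directly. A minimal sketch with made-up example tensors:

```python
import torch

scalar = torch.tensor(7)                        # 0 dimensions
vector = torch.tensor([7, 8, 9])                # 1 dimension
matrix = torch.tensor([[1, 2, 3], [4, 5, 6]])   # 2 dimensions
cube = torch.zeros(2, 3, 4)                     # 3 dimensions

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.ndim, tuple(t.shape))

# The number of indices needed to reach a single element equals ndim
print(scalar.item())          # no index needed
print(vector[1].item())       # one index
print(matrix[1, 2].item())    # two indices
print(cube[1, 2, 3].item())   # three indices
```

Note that `ndim` is always the length of `shape`, which is why a scalar's empty shape gives `ndim == 0`.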

**Practical Implications**

This conceptual framework has practical implications in programming and data processing:

* Generalization: By defining a scalar as a 0-dimensional tensor, vectors as 1-dimensional, matrices as 2-dimensional, and so on, libraries can offer a unified and consistent API for tensor operations that works across tensors of any dimensionality.
* Broadcasting and Shape Manipulation: Understanding dimensionality is crucial for operations that involve multiple tensors, especially when their shapes differ. It allows for the implementation of broadcasting rules, where tensors of different shapes are used together in operations.
* Data Representation: This hierarchical structure (scalar < vector < matrix < tensor) provides an intuitive way to represent increasingly complex data. For instance, a 3D tensor could represent a sequence of images (2D tensors), where each image is a matrix of pixel values.

In summary, the classification of scalars as 0-dimensional and so on up the hierarchy to higher-dimensional tensors is a logical extension of mathematical principles into programming constructs, allowing for more versatile and powerful data manipulation capabilities in libraries like PyTorch and TensorFlow.

### Miscellaneous Notes - tensor.mean() Dtype 'Long' issue

**GPT Prompt:** Why does long not work with .mean() in PyTorch

**GPT Answer:** In PyTorch, the .mean() operation computes the arithmetic mean of all the elements in the tensor. For the mean to be meaningful, especially in the context of floating-point arithmetic, the operation is defined only for tensors with floating point or complex numbers. This is because the mean is not always an integer value, and rounding to an integer could lead to loss of precision and incorrect calculations in subsequent operations.

The tensor you are working with has been created with the default integer type (torch.int64 or Long type in PyTorch), as inferred from the torch.arange function. To calculate the mean, you need to first convert it to a floating-point type. You can do this by using the .float() or .to() methods on the tensor.

Here's how you can modify your tensor to compute the mean:

```
x_float = x.float()  # Convert x to a float tensor
print(torch.mean(x_float), x_float.mean())  # Now you can compute the mean
```

This converts the tensor x to a floating point tensor (torch.float32), which allows the .mean() method to work correctly.
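
For a self-contained version, here is a minimal sketch that assumes `x` was created with `torch.arange(0, 100, 10)` (the exact arguments are an assumption; the note above only says `torch.arange` was used). The `try`/`except` is there because the error is what PyTorch 2.1 (the version in this notebook) raises; other versions may behave differently.

```python
import torch

x = torch.arange(0, 100, 10)  # dtype is inferred as torch.int64 (Long)
print(x.dtype)

try:
    x.mean()
except RuntimeError as err:
    print("mean() on an integer tensor failed:", err)

# Three equivalent ways to get a floating-point mean
mean_a = x.float().mean()
mean_b = x.to(torch.float32).mean()
mean_c = x.mean(dtype=torch.float32)  # cast happens inside mean()

print(mean_a, mean_b, mean_c)
```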

### Miscellaneous Notes - Regarding Slicing [:] in Python

**GPT Answer Regarding Slicing [:] in Python**

In Python, slicing is a feature that allows you to access a portion of a sequence type (like lists, strings, tuples) or array-like objects (like NumPy arrays and PyTorch tensors). The slicing operator [:] is used to access elements of a sequence based on their indices. Here are some examples of how slicing works:

---

**Lists**

* Let's start with a simple list:
> `my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`
* Accessing all elements:
> `all_elements = my_list[:]  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`
* Accessing the first 5 elements:
> `first_five = my_list[:5]  # [0, 1, 2, 3, 4]`
* Accessing elements from index 5 to the end:
> `from_five_onwards = my_list[5:]  # [5, 6, 7, 8, 9]`
* Accessing elements from index 3 to 7 (not including 7):
> `subset = my_list[3:7]  # [3, 4, 5, 6]`
* Accessing elements with a step (every second element):
> `every_second = my_list[::2]  # [0, 2, 4, 6, 8]`
* Accessing elements in reverse:
> `reversed_list = my_list[::-1]  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]`

---

**Strings**

* Slicing works similarly with strings:
> `my_string = "Hello, World!"`

* Accessing all characters:
> `all_chars = my_string[:]  # "Hello, World!"`

* Accessing the first 5 characters:
> `first_five = my_string[:5]  # "Hello"`

* Accessing characters from index 7 to the end:
> `from_seven_onwards = my_string[7:]  # "World!"`

---

**NumPy Arrays**

* For NumPy arrays (and similarly for PyTorch tensors), slicing can be multidimensional:

```
import numpy as np
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

* Accessing the entire array:
> `all_elements = my_array[:]  # array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`

* Accessing the first two rows:
> `first_two_rows = my_array[:2]  # array([[1, 2, 3], [4, 5, 6]])`

* Accessing the second column:
> `second_column = my_array[:, 1]  # array([2, 5, 8])`

* Accessing a submatrix (first two rows, last two columns):
> `submatrix = my_array[:2, 1:]  # array([[2, 3], [5, 6]])`

---

Slicing is a powerful tool in Python that lets you retrieve elements efficiently from sequences and array-like objects without the need for explicit loops.
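
The same slices carry over to PyTorch tensors. A minimal sketch mirroring the NumPy examples above, with one difference worth knowing: PyTorch tensors do not accept negative steps such as `[::-1]`, so `torch.flip()` is used for reversing instead.

```python
import torch

my_tensor = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_two_rows = my_tensor[:2]      # rows 0 and 1
second_column = my_tensor[:, 1]     # column 1 of every row
submatrix = my_tensor[:2, 1:]       # first two rows, last two columns
every_other_row = my_tensor[::2]    # rows 0 and 2

# Reversing the rows: use torch.flip instead of a negative step
reversed_rows = torch.flip(my_tensor, dims=[0])

print(second_column)
print(submatrix)
print(reversed_rows)
```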

### Miscellaneous Notes - torch.stack() + torch.squeeze()



#### torch.stack(), torch.vstack(), torch.hstack()

GPT Prompt: Explain to me what does `torch.stack()` do?

GPT Answer:

`torch.stack()` is a function in PyTorch, a popular open-source machine learning library, that is used to concatenate a sequence of tensors along a new dimension. All tensors in the sequence **must have the same shape**. The function is particularly useful when you want to combine tensors in a way that creates an additional dimension, rather than simply joining them end-to-end as in concatenation.

Here's a more detailed breakdown of its functionality:

* Parameters: The `torch.stack(tensors, dim=0, *, out=None) -> Tensor` function takes several parameters, the most important of which are:

  * tensors: a sequence of tensors to concatenate. These tensors must all have the same shape.
  * dim: the dimension along which to concatenate the tensors. This is a new dimension that the tensors will be stacked along.
  * out (Tensor, optional) – the output tensor.

* Behavior: `torch.stack()` joins a sequence of tensors along a new dimension. All tensors in the sequence are expanded to have one additional dimension, and then they are joined along this new dimension. The position of this new dimension in the resultant tensor is specified by the dim parameter.

* Result: The result of `torch.stack()` is a new tensor with one more dimension than the original tensors. For example, if you stack three 2D tensors (each of shape [2, 3]) along a new first dimension (`dim=0`), the resulting tensor will have the shape [3, 2, 3].

Here's an example to illustrate its use:

 ```
import torch

# Creating two 2D tensors of shape [2, 3]
tensor1 = torch.randn(2, 3)
tensor2 = torch.randn(2, 3)

# Stacking the two tensors along a new dimension (creates a new dimension at position 0)
result = torch.stack((tensor1, tensor2), dim=0)

print(result.shape)  # Output will be: torch.Size([2, 2, 3])
```

In this example, tensor1 and tensor2 are stacked along a new first dimension (dim=0), resulting in a tensor of shape [2, 2, 3], indicating that there are 2 tensors, each of shape [2, 3], stacked together.

---
GPT Prompt: Then what does param dim=-1 and dim=-2 do?

GPT Answer:

In PyTorch, `torch.stack()` is a function used to concatenate a sequence of tensors along a new dimension. All tensors in the sequence must have the same shape. The function creates a new tensor by stacking the input tensors, effectively adding one additional dimension to the resultant tensor, where each input tensor becomes a slice along the newly created dimension.

The `dim` parameter in `torch.stack()` specifies the dimension along which the tensors are stacked. The dimensions are zero-indexed, meaning dim=0 would stack the tensors along a new first dimension, and the original dimensions of the tensors would follow after this new dimension.

Regarding `dim=-1` and `dim=-2`:

* `dim=-1` refers to the last dimension of the tensor. When you use dim=-1 in torch.stack(), it means you're stacking the tensors along the last dimension. After stacking, this newly created dimension becomes the new last dimension of the resultant tensor.

* `dim=-2` refers to the second-to-last dimension of the tensor. Using dim=-2 in torch.stack() means the tensors are stacked along the second-to-last dimension, making the newly created dimension the new second-to-last dimension of the resultant tensor.

These negative indices are useful for specifying dimensions relative to the end of the tensor, which can be especially handy when you're dealing with tensors of varying dimensions and you want to stack them along one of the last few dimensions without explicitly calculating or hard-coding the dimension index.
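
A quick shape check makes the negative indices concrete. This sketch stacks four tensors (four is an arbitrary choice, picked so the new dimension is easy to spot among the original sizes 2 and 3):

```python
import torch

# Four tensors of shape [2, 3], filled with 0.0, 1.0, 2.0 and 3.0
tensors = [torch.full((2, 3), float(i)) for i in range(4)]

first = torch.stack(tensors, dim=0)         # new dimension goes first
second_last = torch.stack(tensors, dim=-2)  # new dimension is second-to-last
last = torch.stack(tensors, dim=-1)         # new dimension goes last

print(first.shape)        # torch.Size([4, 2, 3])
print(second_last.shape)  # torch.Size([2, 4, 3])
print(last.shape)         # torch.Size([2, 3, 4])
```

The negative index is counted against the output tensor, which has one more dimension than the inputs, so here `dim=-2` is the same as `dim=1` and `dim=-1` is the same as `dim=2`.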

---

GPT Prompt: Then what is torch.vstack() and torch.hstack()

GPT Answer:

In PyTorch, torch.vstack() and torch.hstack() are functions used to concatenate tensors vertically and horizontally, respectively. These functions provide a convenient way to stack tensors along specific axes and are particularly useful when working with 2D tensors (matrices) but can also be applied to tensors with more dimensions. Here's a brief overview of each:

`torch.vstack()`

* **Vertical stacking (vstack):** This function stacks tensors vertically (row-wise). It is equivalent to concatenating along the first dimension (rows) after any 1D tensors have been reshaped to 2D, as `torch.atleast_2d()` does. It's akin to stacking matrices on top of each other or appending rows to a matrix.
* **Usage:** It's commonly used when you have tensors of the same width (i.e., the same size in all dimensions except for the first) and you want to combine them into a single tensor with greater height (more rows).

`torch.hstack()`

* **Horizontal stacking (hstack):** This function stacks tensors horizontally (column-wise). It is equivalent to concatenating along the first dimension for 1D tensors, and along the second dimension (columns) for all other tensors, which increases the size of that dimension.
* **Usage:** It's used when you have tensors of the same height (i.e., the same size in all dimensions except for the second in the case of 2D tensors) and you want to combine them into a single tensor with greater width (more columns).

These functions are particularly helpful in data manipulation and preprocessing tasks, such as combining data from different sources into a single dataset or reshaping data to fit the input requirements of a model.
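
A small shape comparison, assuming two 1D tensors and two 2D tensors made up for illustration:

```python
import torch

# 1D inputs
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(torch.vstack((a, b)).shape)  # torch.Size([2, 3]) - each vector becomes a row
print(torch.hstack((a, b)).shape)  # torch.Size([6])    - joined end to end

# 2D inputs
m1 = torch.zeros(2, 3)
m2 = torch.ones(2, 3)
v = torch.vstack((m1, m2))  # more rows
h = torch.hstack((m1, m2))  # more columns
print(v.shape)  # torch.Size([4, 3])
print(h.shape)  # torch.Size([2, 6])

# For 2D inputs these match torch.cat along dim 0 and dim 1
print(torch.equal(v, torch.cat((m1, m2), dim=0)))
print(torch.equal(h, torch.cat((m1, m2), dim=1)))
```

Unlike `torch.stack()`, neither function adds a new dimension for 2D inputs; they only grow an existing one.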

**GPT Prompt:** Explain to me regarding torch.squeeze() and torch.unsqueeze()

**GPT Answer:**

In PyTorch, `torch.squeeze()` and `torch.unsqueeze()` are two functions used for modifying the dimensions of tensors, specifically for removing and adding dimensions of size 1, respectively. These functions are very useful for tensor shape manipulation, allowing for more flexible operations and easier integration with APIs expecting certain tensor shapes.

<br>

`torch.squeeze()`

* Purpose: Removes all dimensions of size 1 from the tensor. If a specific dimension is given as an argument, torch.squeeze() removes the given dimension only if it is of size 1; otherwise, the tensor is returned unchanged.
* Usage: This is useful when you want to eliminate redundant dimensions from your tensor, making your data more compact or ensuring it fits certain APIs that do not accept tensors with singleton dimensions. For example, if you have a tensor of shape (1, 3, 1, 5), applying squeeze() would reduce its shape to (3, 5).
* Parameter:
  * input (Tensor) – the input tensor.
  * dim (int or tuple of ints, optional) – if given, the input will be squeezed only in the specified dimensions.
* Syntax Example:
> `squeezed_tensor = torch.squeeze(input_tensor)`
>
> `squeezed_tensor = torch.squeeze(input_tensor, dim=0)`

  Here, input_tensor is the tensor you want to squeeze, and dim is an optional argument specifying the dimension to squeeze. If dim is not provided, all dimensions of size 1 are removed.

`torch.unsqueeze()`

* Purpose: Adds a dimension of size 1 to the tensor at the specified dimension. This is the inverse operation of squeeze(), allowing you to expand the dimensions of your tensor.
* Usage: This is particularly useful when you need to increase the dimensionality of a tensor for it to be compatible with certain operations that expect tensors of a certain shape. For instance, if you have a tensor of shape (3, 5) and you want to add a singleton dimension as the first dimension, you would end up with a shape of (1, 3, 5).
* Syntax Example:
> `unsqueezed_tensor = torch.unsqueeze(input_tensor, dim)`

  Here, input_tensor is the tensor you want to unsqueeze, and dim is the position where the new dimension will be inserted.

Both `squeeze()` and `unsqueeze()` are essential for shaping tensors in preparation for various operations, such as matrix multiplications, broadcasting, and feeding data into models, ensuring that tensors conform to the expected dimensions.
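
The shapes mentioned above can be verified with a short sketch (the tensors are zero-filled placeholders):

```python
import torch

x = torch.zeros(1, 3, 1, 5)

print(x.squeeze().shape)        # torch.Size([3, 5])       - all size-1 dims removed
print(x.squeeze(dim=0).shape)   # torch.Size([3, 1, 5])    - only dim 0 removed
print(x.squeeze(dim=1).shape)   # torch.Size([1, 3, 1, 5]) - dim 1 has size 3, unchanged

y = torch.zeros(3, 5)

print(y.unsqueeze(dim=0).shape)   # torch.Size([1, 3, 5])
print(y.unsqueeze(dim=-1).shape)  # torch.Size([3, 5, 1])

# unsqueeze followed by squeeze on the same dim restores the original shape
print(y.unsqueeze(0).squeeze(0).shape == y.shape)
```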






### Miscellaneous Notes - torch.permute()


In PyTorch, `torch.permute()` is a function used to rearrange the dimensions of a tensor according to a specified order. This allows for complex reshaping and reordering of the data within a tensor, making it a powerful tool for manipulating tensor shapes and dimensions.

---

How `torch.permute()` Works

* Purpose: To reorder the dimensions of a tensor.
* Usage: When you want to change the order of axes/dimensions of a tensor. For example, if you have a tensor of shape (2, 3, 4) and you want to change the order of dimensions to make it (3, 4, 2), you can use torch.permute() to achieve this.

---

Syntax and Example
```
import torch

# Creating a tensor of shape (2, 3, 4)
x = torch.randn(2, 3, 4)

# Permute the dimensions
# Here, we want the dimension order to be (1, 2, 0),
# meaning the new shape will be (3, 4, 2)
y = x.permute(1, 2, 0)

print("Original shape:", x.shape)
print("New shape:", y.shape)
```

In the permute(1, 2, 0) call, the arguments represent the new order of the dimensions:

* The first dimension (0) of x becomes the third dimension of y.
* The second dimension (1) of x becomes the first dimension of y.
* The third dimension (2) of x becomes the second dimension of y.

This results in y having a shape of (3, 4, 2), as per the example.

**Key Points**

* `torch.permute()` does not change the data within the tensor; it returns a view that shares the same underlying memory and only changes the order in which the dimensions are indexed.
* It's useful for operations that require specific dimension orders, such as when preparing data for convolutional layers in neural networks, where you might need to permute dimensions to match the expected input shape.

This flexibility makes `torch.permute()` an essential function for tensor manipulation, especially in complex data processing pipelines and neural network operations.
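
Because a permuted tensor shares memory with the original, writing through one is visible in the other. A minimal sketch of that behaviour, using a small integer tensor so individual elements are easy to follow:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
y = x.permute(1, 2, 0)  # shape (3, 4, 2); y[i, j, k] is x[k, i, j]

print(y.shape)
print(y[1, 2, 0].item(), x[0, 1, 2].item())  # the same element, seen two ways

# Writing through the permuted view also changes the original tensor
y[1, 2, 0] = -1
print(x[0, 1, 2].item())  # -1

# The view is usually not contiguous in memory; .contiguous() makes a
# rearranged copy, which some operations (e.g. .view()) require
print(y.is_contiguous())
z = y.contiguous()
print(z.is_contiguous())
```

A common use is converting an image between (colour_channels, height, width) and (height, width, colour_channels) with `image.permute(1, 2, 0)`.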

### Miscellaneous Notes - cuda-semantics

URL: https://pytorch.org/docs/master/notes/cuda.html#cuda-semantics

#### Tensor Allocation + CUDA Devices


**GPT PROMPT:**

>regarding pytorch.cuda, what does the below mean
>
>"However, once a tensor is allocated, you can do operations on it irrespective of the selected device, and the results will be always placed on the same device as the tensor."
>
>because i thought that to do operations on tensor which is already on the GPU's ram, you have to bring it back to the CPU before conducting operations?

---

**GPT ANSWER:**

The statement you're referring to actually highlights a key feature of PyTorch's handling of tensors on different devices (like CPUs and GPUs). Here's a breakdown of what it means:

- **Tensor Allocation**: When you create a tensor in PyTorch, you can specify the device it resides on. This could be the CPU or a GPU. For example, if you allocate a tensor on a GPU, it's stored in the GPU's memory.

- **Operations on Tensors**: Once a tensor is allocated on a device (let's say a GPU for this discussion), you can perform operations on it directly on that device. There's no need to move it back to the CPU for processing. PyTorch is designed to execute these operations on the device where the tensor resides. This is done for efficiency and performance reasons, as moving data between devices (especially between CPU and GPU) can be costly in terms of time and resources.

- **Result Placement**: After performing an operation on a tensor, the result of that operation will automatically be placed on the same device as the tensor. For example, if you have a tensor `A` on GPU and you add another tensor `B` to it (assuming `B` is on the same GPU and its shape is broadcastable with `A`'s), the result of `A + B` will also be on the GPU.

This means you don't have to manually move tensors between devices for most operations. PyTorch handles these operations efficiently within the device's own memory (RAM in the case of CPUs, VRAM for GPUs), avoiding unnecessary data transfers.

It's a common misconception that tensors must be moved back to the CPU for operations. In reality, PyTorch and other modern deep learning frameworks are designed to perform tensor operations directly on the device where the tensors are located, leveraging the computational power of GPUs for faster processing without the bottleneck of data transfer between CPU and GPU.
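
A minimal sketch of this behaviour, written device-agnostically so it also runs on a CPU-only machine (in which case all three devices print as `cpu`):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Both operands are allocated directly on the chosen device
a = torch.ones(3, device=device)
b = torch.full((3,), 2.0, device=device)

# The operation runs on that device and the result stays there
c = a + b
print(a.device, b.device, c.device)

# Moving back to the CPU is only needed when leaving PyTorch, e.g. for NumPy
c_numpy = c.cpu().numpy()
print(c_numpy)
```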

Quick Note:
* Default CUDA - the currently selected GPU (`cuda:0` unless changed)
* cuda:0 is the first GPU (index 0)
* cuda:1 is the second GPU (index 1)...

Code source: https://pytorch.org/docs/master/notes/cuda.html#cuda-semantics

More Notes;
* `torch.cuda` is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device.
* Understanding as of now: to allocate a tensor on a GPU, write `torch.tensor(arr, device="cuda:i")` (where `i` is the GPU index) or `torch.tensor(arr).cuda()`.
* `with torch.cuda.device(1):` - this line switches the current device to the GPU with index 1, i.e. a second GPU (won't work on Colab, which provides at most one GPU).
* Pay attention how tensor


```
cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)
  ```

  In the PyTorch framework, when you are specifying devices like this, you're telling PyTorch where you want your tensors to be stored and computed. Here's what each of the lines means:

1. `cuda = torch.device('cuda')`:

* This line creates a device object that represents the current CUDA device. Without an index, 'cuda' resolves to whichever GPU is currently selected: index 0 by default, unless changed with `torch.cuda.set_device()` or the `torch.cuda.device` context manager. Environment variables such as `CUDA_VISIBLE_DEVICES` control which physical GPUs PyTorch can see. This is useful when you don't want to hardcode your GPU selection in your script.

2. `cuda0 = torch.device('cuda:0')`:

* This creates a device object pointing specifically to the first GPU of your system. GPUs are 0-indexed, so 'cuda:0' refers to the first GPU. When you use this device object to move tensors or create new tensors, you're explicitly telling PyTorch to use the first GPU.

3. `cuda2 = torch.device('cuda:2')`:

* Similarly, this line is creating a device object for what is labeled as the third GPU in your system (again, because of 0-indexing: 0, 1, 2, ...). This is specifying that you want to use the third GPU for your tensors and computations. This line assumes that you have at least three GPUs in your system. If you don't have a third GPU (i.e., if you have fewer than three GPUs), attempting to use cuda2 for operations will result in an error because the specified device does not exist.

Each of these device objects (cuda, cuda0, cuda2) can be used with tensors and PyTorch operations to control which GPU those operations are executed on. A 'cuda' device always refers to a GPU; without an index it means the currently selected one. This is particularly useful in multi-GPU setups where you might want to distribute your computation across different GPUs or ensure specific operations run on specific GPUs for efficiency or organizational reasons.
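
Since asking for a GPU index that does not exist raises an error, one option is to check `torch.cuda.device_count()` first. The helper below is a hypothetical sketch (`pick_device` is not part of PyTorch) that falls back to the CPU:

```python
import torch


def pick_device(index: int = 0) -> torch.device:
    """Hypothetical helper: return cuda:<index> if that GPU exists, else the CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")


# Ask for the third GPU; on a machine with fewer GPUs this gives the CPU
device = pick_device(2)
t = torch.randn(2, device=device)
print(device, t.device)
```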

### PyTorch Docs -- Cuda Semantics



**MUST READS:**

* Info on RNN, CNN, GAN, AutoEncoder+: https://www.techtarget.com/searchenterpriseai/feature/CNN-vs-RNN-How-they-differ-and-where-they-overlap

* torch.docs, cuda semantics URL: https://pytorch.org/docs/master/notes/cuda.html#cuda-semantics

* Latent Space Introduction: https://samanemami.medium.com/a-comprehensive-guide-to-latent-space-9ae7f72bdb2f

* VAE: https://ai.plainenglish.io/what-is-a-variational-autoencoder-vae-0a8d337810f8

* Deep One-Class Classification: https://medium.com/analytics-vidhya/paper-summary-deep-one-class-classification-doc-adc4368af75c

**torch.rand() vs torch.randn()**

* In summary, `randn` generates numbers from a normal distribution with a mean of 0 and a standard deviation of 1, while `rand` generates numbers from a uniform distribution on the interval [0, 1).
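A quick sanity check of the difference, as a sketch: the sample size and seed are arbitrary choices, and the printed statistics are only expected to be close to the theoretical values.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
n = 100_000

u = torch.rand(n)    # uniform on [0, 1)
g = torch.randn(n)   # standard normal: mean 0, std 1

print(f"rand : min={u.min().item():.3f} max={u.max().item():.3f} mean={u.mean().item():.3f}")
print(f"randn: min={g.min().item():.3f} max={g.max().item():.3f} "
      f"mean={g.mean().item():.3f} std={g.std().item():.3f}")
```

`rand` never leaves [0, 1), while `randn` produces negative values and values larger than 1.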

---

**What is FP16 GEMM**

* FP16 GEMM refers to General Matrix Multiply operations performed using half-precision floating-point arithmetic, which uses 16-bit floating-point numbers (FP16). These operations are common in deep learning and other applications requiring high performance computing. Using FP16 can lead to faster computation and reduced memory usage compared to full precision (FP32) operations, at the potential cost of reduced numerical precision. This can be beneficial on hardware like GPUs that are optimized for lower precision arithmetic, allowing for more operations per second. The benchmarks shown in the image are likely comparing the performance of matrix multiplication operations using FP16 with and without reduced precision optimizations.

---

**Machine learning inference Definition**

* Machine learning inference is the process of running data points into a machine learning model to calculate an output such as a single numerical score. This process is also referred to as "operationalizing a machine learning model" or "putting a machine learning model into production."

---

**Spatial Transformer Network (STN)**

* Spatial transformer networks (STN for short) allow a neural network to learn how to perform spatial transformations on the input image in order to enhance the geometric invariance of the model. For example, it can crop a region of interest, scale and correct the orientation of an image.
* Spatial Transformer Networks (STN) is a deep learning technique that allows for the automatic spatial manipulation and transformation of data within a neural network.

---

**GPT Question:** explain what all of these are: {nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM}

<br>

**GPT Answer:**
Certainly, the terms you've listed are associated with neural networks and their operations, particularly in the context of PyTorch, a popular machine learning library for Python.

1. **nn.Linear**: This refers to a linear transformation to the incoming data, which is also known as a fully connected or dense layer. It applies a linear function `(y = xA^T + b)` that transforms the input `x` using learned weights `A` and biases `b`.

2. **nn.Conv***: This is a prefix for convolutional layers in PyTorch, where the asterisk (*) can stand for different dimensions of convolutional operations like `nn.Conv1d` for 1-dimensional convolutions, `nn.Conv2d` for 2-dimensional convolutions (commonly used for image data), and `nn.Conv3d` for 3-dimensional convolutions (often used for volumetric data).

3. **cdist**: This function computes the pairwise distances between two sets of vectors, often used to measure the similarity or dissimilarity between the sets. In PyTorch, `torch.cdist` computes p-norm distances, selected with the `p` argument (for example `p=2` for Euclidean and `p=1` for Manhattan). Cosine similarity is not one of its options; it is available separately as `torch.nn.functional.cosine_similarity`.

4. **tensordot**: A tensordot operation computes tensor dot products along specified axes. This is a generalized matrix multiplication that allows for higher dimensional tensor products, which is useful in multidimensional array computations.

5. **affine grid and grid sample**: These are functions often used together for spatial transformer networks. `affine_grid` generates a grid of coordinates that correspond to the input data and can be used to apply affine transformations. `grid_sample` then uses this grid to sample the input data, performing tasks like cropping, scaling, rotating, and translation, enabling the network to spatially transform the input image in a differentiable manner.

6. **adaptive log softmax**: This is an approximation of softmax that is more computationally efficient when the number of classes is extremely large. It partitions the classes into clusters by frequency: the most frequent classes sit in a small "head" that is evaluated for every input, while rarer classes are grouped into tail clusters with lower-dimensional projections, trading some precision on rare classes for much less computation.

7. **GRU and LSTM**:
   - **GRU (Gated Recurrent Unit)**: A type of recurrent neural network architecture that is similar to an LSTM but has fewer parameters. It combines the forget and input gates into a single "update gate" and merges the cell state and hidden state.
   - **LSTM (Long Short-Term Memory)**: Another recurrent neural network architecture that is designed to better capture long-term dependencies compared to standard RNNs. LSTMs include mechanisms called gates to control the flow of information: these are the forget gate, input gate, and output gate. They have an internal state that helps them capture long-term dependencies in sequential data.

Both GRUs and LSTMs are commonly used in sequence modeling tasks such as language modeling, time-series prediction, and more. They are designed to address the vanishing gradient problem that can occur with standard RNNs when dealing with long sequences.
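To make three of the items above concrete, here is a small sketch of `nn.Linear`, `torch.cdist` and `torch.tensordot`. The shapes and values are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# nn.Linear applies y = x @ A.T + b, with A stored as `weight` (out_features x in_features).
linear = nn.Linear(in_features=4, out_features=2)
x = torch.randn(3, 4)
y = linear(x)
manual = x @ linear.weight.T + linear.bias  # same computation written out by hand

# torch.cdist returns pairwise p-norm distances between two sets of row vectors.
a = torch.tensor([[0.0, 0.0], [3.0, 4.0]])
b = torch.tensor([[0.0, 0.0]])
euclid = torch.cdist(a, b, p=2.0)      # about [[0.], [5.]]
manhattan = torch.cdist(a, b, p=1.0)   # about [[0.], [7.]]

# torch.tensordot contracts the listed axes; here axis 1 of m1 with axis 0 of m2,
# which is ordinary matrix multiplication.
m1 = torch.randn(2, 3)
m2 = torch.randn(3, 5)
td = torch.tensordot(m1, m2, dims=([1], [0]))

print(y.shape, euclid.flatten(), manhattan.flatten(), td.shape)
```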



**Feature Learning ++ Dimensionality Reduction (Definition)**

* Feature learning, in the context of machine learning, is the automatic process through which a model identifies and optimizes key patterns, structures, or characteristics (called "features") from raw data to enhance its performance in a given task.

* Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

---
**Latent Space + Brief Autoencoder**

* URL: https://samanemami.medium.com/a-comprehensive-guide-to-latent-space-9ae7f72bdb2f
* Definition Quick Peek: Latent space is a lower-dimensional space that captures the essential features of the input data. In simpler terms, it is a compressed representation of the original data where each dimension corresponds to a specific feature or characteristic. This dimensionality reduction is achieved through various techniques, such as autoencoders and variational autoencoders (VAEs), which learn to encode the most important information in the data.

---

**VAE - Variational Autoencoder**

* https://ai.plainenglish.io/what-is-a-variational-autoencoder-vae-0a8d337810f8
* Brief intro: Like GANs, Variational Autoencoders (VAEs) can be used to generate new data. Being an adaptation of classic autoencoders, which are used for dimensionality reduction and input denoising, VAEs are generative. Unlike the classic ones, with VAEs you can use what they’ve learned in order to generate new samples. Blends of images, predictions of the next video frame, synthetic music: the list goes on.

---

**GPT - Autoencoders**

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data, usually for the purpose of dimensionality reduction or feature learning. The main goal of an autoencoder is to compress the input data into a latent-space representation and then reconstruct the output from this representation to match the original input as closely as possible.

#### Structure of an Autoencoder:

An autoencoder typically consists of two main parts:

1. **Encoder**: This part of the network compresses the input into a latent-space representation. It encodes the high-dimensional data into a lower-dimensional code. The encoder layer(s) process the input data to reduce its dimensionality, usually through a series of linear transformations and non-linear activations.

2. **Decoder**: This part attempts to reconstruct the input from the latent space representation. The decoder architecture is typically symmetrical to the encoder, facilitating the learning of the reverse operation from the compressed code back to the original dimension of the data.

#### Working of an Autoencoder:

- **Encoding**: The input data is passed through the encoder, which is a feed-forward neural network, and it gets transformed into the latent space (code).
- **Code**: The code is a compact "summary" or "compression" of the input data, containing the essential information needed to reconstruct it.
- **Decoding**: The decoder takes this code and reconstructs the input data as closely as possible. This reconstruction can be lossy (some original information might be lost).
- **Reconstruction Loss**: The difference between the original input and the reconstructed input is calculated using a loss function (such as mean squared error). The autoencoder learns to minimize this reconstruction loss during training.
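The encode, code, decode, reconstruction-loss loop above can be sketched in a few lines of PyTorch. This is only an illustration: the class name `TinyAutoencoder`, the layer sizes, the random batch and the learning rate are all assumptions, not values from any referenced article.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyAutoencoder(nn.Module):
    """Hypothetical 8 -> 2 -> 8 autoencoder, for illustration only."""
    def __init__(self, in_dim: int = 8, code_dim: int = 2):
        super().__init__()
        # Encoder: compress the input down to the latent code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 4), nn.ReLU(), nn.Linear(4, code_dim))
        # Decoder: mirror of the encoder, back up to the input size.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 4), nn.ReLU(), nn.Linear(4, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = TinyAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()          # reconstruction loss
batch = torch.randn(16, 8)      # stand-in for real data

# One training step: reconstruct, measure the error, update the weights.
recon, code = model(batch)
loss = loss_fn(recon, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(code.shape, recon.shape, loss.item())
```

The target passed to the loss is the input itself, which is what makes this unsupervised.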

#### Variations and Uses:

Autoencoders can be used for various purposes:

- **Dimensionality Reduction**: Similar to PCA, autoencoders can reduce the dimensionality of the data by learning a reduced set of features (encoding), which can then be used for tasks like visualization.

- **Feature Learning**: Autoencoders can learn a representation (encoding) for the data that can be useful for tasks like classification.

- **Denoising**: When trained to ignore "noise," autoencoders can learn to recover the underlying structure of the data and can be used to remove noise from signals.

- **Anomaly Detection**: By learning the normal pattern of input data, autoencoders can detect outliers by evaluating the reconstruction loss.

- **Generative Models**: Variations like Variational Autoencoders (VAEs) can generate new data that's similar to the input data.

In all these cases, the key concept is that the autoencoder learns a representation (encoding) for the set of data, typically for the purpose of dimensionality reduction or pattern recognition.

---

**Non-Linear Activation Functions**

* URL (12 types of Activation Functions): https://www.v7labs.com/blog/neural-networks-activation-functions#:~:text=Non%2Dlinear%20activation%20functions%20solve,can%20provide%20a%20better%20prediction
* Non-linear activation functions solve the following limitations of linear activation functions: They allow backpropagation because now the derivative function would be related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.

---




### Notes for "12 types of Activation Functions"


#### Binary Step Function


**Multi-class Classification**

* Multi-class classification problems are a type of machine learning problem where the goal is to classify instances into one of three or more classes. Unlike binary classification, where there are only two possible outcomes, multi-class classification deals with situations where an instance could belong to any one of multiple categories.

* For example, a multi-class classification problem could involve classifying types of fruit in a basket into categories such as apples, bananas, cherries, and oranges. Each fruit must be classified into one of these categories, and there are more than two categories to choose from.

* The binary step function shown in your image is not suitable for multi-class classification because it is designed to provide only two possible outputs: 0 or 1. This limitation means it can only differentiate between two classes (binary classification) and not more. For multi-class problems, other functions and algorithms are needed that can output a probability distribution over multiple classes, like the softmax function commonly used in neural networks for multi-class classification tasks.
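A small pure-Python sketch of the contrast, using the fruit example: the step function can only say 0 or 1 per score, whereas softmax turns the same scores into a probability distribution over all four classes. The scores are made up.

```python
import math

def binary_step(z: float) -> int:
    """Binary step: 1 for z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["apple", "banana", "cherry", "orange"]
logits = [2.0, 0.5, -1.0, 0.1]   # made-up scores for one fruit

step_outputs = [binary_step(v) for v in logits]
probs = softmax(logits)
predicted = classes[probs.index(max(probs))]

print(step_outputs)                    # several classes "fire" at once, with no ranking
print([round(p, 3) for p in probs])    # probabilities that sum to 1
print(predicted)
```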

<br>

**Gradient 0 -- Binary Step Function -- Backpropagation**

**GPT Prompt:** The gradient of the step function is zero, which causes a hindrance in the backpropagation process." // why is the gradient 0? and how is this a hindrance to backpropagation

**GPT Answer:**

* The gradient of a function at a point measures how much the output of the function changes if the input is changed slightly. For a binary step function, which graphically looks like a step (it is constant at 0 for all values below 0 and constant at 1 for all values equal to or above 0), there is no change in the output value as the input changes, except exactly at the point where the step occurs (x = 0). However, the gradient at a point is defined as the limit of the slope of the function as the change in the input approaches zero, and for a step function, this limit is zero everywhere except at the step itself, where it is undefined.

* Mathematically, the derivative (or gradient) of the constant function is zero because there is no rate of change — the function is flat. Since the binary step function is composed of two constant functions, its gradient is zero everywhere except at the point of discontinuity (x = 0), where it technically does not have a derivative because the function jumps from one value to another without passing through intermediate values.

* In the context of neural networks, the backpropagation process relies on the gradient of the activation function to update the weights. This process is what allows the network to learn from data and improve its predictions over time. The gradient indicates the direction in which the weights need to be adjusted to minimize the error of the network's output.

* However, if the gradient is zero, as is the case with a binary step function, there is no direction for the backpropagation algorithm to follow — it suggests that no change to the weights will decrease the error, which is not helpful for learning. This is why the zero gradient (or flat gradient) is a hindrance to backpropagation: it provides no information on how to adjust the weights to improve the model, thus "hindering" the learning process.

* This is a major reason why smoother activation functions like sigmoid or tanh, which have non-zero gradients across most of their domain, or ReLU and its variants, which have non-zero gradients for positive inputs, are preferred in practice for neural networks. These functions provide gradients that facilitate weight updates and enable effective learning through backpropagation.
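The zero-gradient claim can be checked numerically with a central finite difference. This is a sketch: the sample points and step size are arbitrary, and the points deliberately avoid $x = 0$, where the step function has no derivative.

```python
import math

def binary_step(z: float) -> float:
    return 1.0 if z >= 0 else 0.0

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def numerical_grad(f, z: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

points = [-2.0, -0.5, 0.5, 2.0]
step_grads = [numerical_grad(binary_step, z) for z in points]
sigmoid_grads = [numerical_grad(sigmoid, z) for z in points]

print(step_grads)                            # flat everywhere: no learning signal
print([round(g, 4) for g in sigmoid_grads])  # non-zero: weights can be nudged
```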


#### Linear Functions vs Non-linear Functions (sigmoid, tanh)

* URL of Activation Functions: https://www.techtarget.com/searchenterpriseai/feature/CNN-vs-RNN-How-they-differ-and-where-they-overlap
* URL of GPT Chat History: https://chat.openai.com/c/dd7ad124-4421-4451-ad72-7da4d830d555
* Refer to Desmos graph on Sigmoid, Sigmoid MSE loss function, Tanh: https://www.desmos.com/calculator/qswof9l3y6

**GPT Prompt:** "It’s not possible to use backpropagation as the derivative of the function is a constant and has no relation to the input x. " // why is there no relation // (This is regarding Linear Activation Functions, e.g. $f(x) = x$)

<br>

**GPT Answer:**

The statement "It's not possible to use backpropagation as the derivative of the function is a constant and has no relation to the input x" from the image you uploaded is slightly misleading. Let me clarify.

In the context of neural networks, an activation function's derivative is important because it is used during the backpropagation algorithm to update the weights in proportion to the error gradient. For a linear activation function, like $f(x) = x$, the derivative is a constant 1, because the rate of change of the function with respect to x is constant and does not depend on the value of x.

The derivative $df/dx$ of $f(x) = x$ is $1$ for all $x$. This means that the gradient will not change based on the input value. In the context of backpropagation, this has a couple of implications:

1. **Constant Activation Derivative:** Because $f'(x) = 1$ everywhere, the activation contributes no input-dependent factor to the gradient; it simply passes the upstream error signal through unchanged. Weight updates still scale with the size of the error, since that dependence comes from the loss function (for example, with mean squared error the gradient is proportional to $y_{\text{pred}} - y_{\text{true}}$). So a constant derivative does not by itself prevent learning.

2. **No Non-linearity:** When using a linear activation function in all layers of a neural network, the entire network remains a linear model. This is because a stack of linear layers can only represent linear functions; the composition of two linear functions is still a linear function. One of the strengths of neural networks is their ability to capture non-linear relationships, and with linear activation functions, this capability is lost.

To summarize, it's not that backpropagation can't be used with a linear activation function; rather, the use of a linear activation function throughout a neural network results in a model that fails to capture non-linear relationships and may not benefit from the stacked layers of a deep network. Non-linear activation functions allow the network to learn more complex patterns in the data.
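The collapse of stacked linear layers can be verified directly. In this sketch the weights and biases are arbitrary numbers; two scalar "layers" with the identity activation are compared against a single layer with weight $w_2 w_1$ and bias $w_2 b_1 + b_2$.

```python
def layer(w: float, b: float):
    """A scalar layer with linear (identity) activation."""
    return lambda x: w * x + b

# Arbitrary parameters for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x: float) -> float:
    return layer(w2, b2)(layer(w1, b1)(x))

# The equivalent single layer: w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2).
one_layer = layer(w2 * w1, w2 * b1 + b2)

inputs = [-2.0, 0.0, 1.5, 10.0]
for x in inputs:
    print(x, two_layers(x), one_layer(x))
```

The two columns agree for every input, so the extra layer adds no expressive power.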

---

**GPT Prompt:** regarding the dL/dw and dL/db part, compare for me between linear and non-linear activation function, and utilise example etc to prove for me as to your previous arguments and why linear is unsuitable etc

<br>

**GPT Answer:**
Let's discuss the mathematical underpinnings of backpropagation with linear and non-linear activation functions in a neural network, focusing on the gradients of the loss with respect to weights ($ \frac{\partial L}{\partial w} $) and biases ($ \frac{\partial L}{\partial b} $), which are crucial for learning during the training process.

**Linear Activation Function:**

A linear activation function can be defined as $ f(x) = x $. In this case, the derivative with respect to $ x $ is constant (which is $ 1 $ in this case). This means that the gradient of the activation function does not depend on the input $ x $.

For a single-layer neural network with a linear activation function, the output $ y $ for a given input $ x $ is:

$ y = w \cdot x + b $

Where $ w $ is the weight and $ b $ is the bias. Assuming a simple loss function like mean squared error, the derivative of the loss $ L $ with respect to the weight $ w $ ($ \frac{\partial L}{\partial w} $) is the prediction error multiplied by the input $ x $, and the derivative with respect to the bias ($ \frac{\partial L}{\partial b} $) is the prediction error itself.

If we had multiple layers, all with linear activations, the output of the network would still be a linear function of the input, because the composition of linear functions is itself a linear function. No matter how many layers you stack, the model's capacity is equivalent to that of a single-layer network, and it cannot capture complex patterns in the data that are inherently non-linear.

**Non-Linear Activation Function:**

Now let's consider a non-linear activation function like sigmoid, defined as:

$ f(x) = \frac{1}{1 + e^{-x}} $

The derivative of this function with respect to $ x $ (which is $ f'(x) = f(x) \cdot (1 - f(x)) $) varies with the value of $ x $: it peaks at $ 0.25 $ when $ x = 0 $ and approaches $ 0 $ for large positive or negative $ x $. As a result, the size of each weight and bias update depends on where the neuron currently sits on the sigmoid curve, not only on the error.

For a single-layer neural network with a sigmoid activation function, the output $ y $ for a given input $ x $ is:

$ y = \frac{1}{1 + e^{-(w \cdot x + b)}} $

Here, the gradient of the loss with respect to the weight ($ \frac{\partial L}{\partial w} $) and the gradient with respect to the bias ($ \frac{\partial L}{\partial b} $) are now influenced by the derivative of the sigmoid function. So, the updates to the weights take into account both the error and the sigmoid's gradient at the current output. When the prediction is close to the target, the error term is small and so are the adjustments; when the neuron is saturated (output very near 0 or 1), the adjustments are also small, even if the prediction is badly wrong.

**Gradient Scaling: Linear vs. Non-Linear:**

In the linear case, the derivative with respect to $ w $ is the error multiplied by $ x $; the activation adds no factor of its own, so the update is directly proportional to the error. With non-linear activation functions, the derivative with respect to $ w $ picks up an additional term from the activation function's derivative, which varies with $x$. That extra term lets each neuron respond differently in different input regions, which is what gives a multi-layer network its non-linear modelling power, but it can also shrink the update towards zero when the neuron saturates.

**Example:**

Imagine a scenario where you're training a neural network to predict house prices. Your features might include the number of rooms, location score, and age of the building. A linear model might just learn to assign weights to these features and add them up to get a price. But if house prices increase non-linearly with the number of rooms (say, due to market perceptions of houses with more than 3 rooms), a linear activation cannot model this pattern, and therefore, the network will not learn it regardless of how you adjust the weights. A non-linear activation function, however, can capture this non-linear relationship, allowing for much more nuanced and accurate predictions.

---

**GPT Prompt:** now just compare examples, and show me why can't the loss function of linear capture what the MSE of non-linear can. as in just show me examples of big and small changes for both type of activations, and show me why the delta_b and delta_w is not optimal for linear but optimal for non-linear activation functions

<br>

**GPT Answer:**
Let's compare how the updates to weights $\Delta w$ and biases $\Delta b$ behave when using mean squared error (MSE) as the loss function for both linear and non-linear activation functions.

We will consider a simple scenario: suppose we have a single neuron with a single input. The goal of the neuron is to learn to predict some output given this input.

The mean squared error loss function is given by:
$ L = \frac{1}{2}(y_{\text{true}} - y_{\text{pred}})^2 $
Where $ y_{\text{true}} $ is the true output and $ y_{\text{pred}} $ is the predicted output by the neural network.

**For a Linear Activation Function:**

The output of the neuron is a linear function of the input:
$ y_{\text{pred}} = w \cdot x + b $

The partial derivatives of the loss with respect to weight $w$ and bias $b$ are:
$ \frac{\partial L}{\partial w} = (y_{\text{pred}} - y_{\text{true}}) \cdot x $
$ \frac{\partial L}{\partial b} = (y_{\text{pred}} - y_{\text{true}}) $

**Example with Linear Activation:**

- Case 1: Large error (e.g., $ y_{\text{true}} = 10, y_{\text{pred}} = 0 $)
  - $ \frac{\partial L}{\partial w} = (0 - 10) \cdot x = -10x $
  - $ \frac{\partial L}{\partial b} = (0 - 10) = -10 $

- Case 2: Small error (e.g., $ y_{\text{true}} = 10, y_{\text{pred}} = 9 $)
  - $ \frac{\partial L}{\partial w} = (9 - 10) \cdot x = -x $
  - $ \frac{\partial L}{\partial b} = (9 - 10) = -1 $

In both cases the gradient is simply the error multiplied by the input (for $w$) or the error itself (for $b$), so the update is directly proportional to the error: ten times the error gives ten times the gradient. The activation function adds no factor of its own.
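The two linear cases can be reproduced in plain Python. The input value `x = 2.0` is an assumption chosen only to give concrete numbers; the formulas are the ones stated above.

```python
def linear_grads(y_true: float, y_pred: float, x: float):
    """Gradients of L = 0.5 * (y_true - y_pred)**2 for y_pred = w*x + b."""
    error = y_pred - y_true
    dL_dw = error * x
    dL_db = error
    return dL_dw, dL_db

x = 2.0  # assumed input value

large_error = linear_grads(y_true=10.0, y_pred=0.0, x=x)   # (-10 * x, -10)
small_error = linear_grads(y_true=10.0, y_pred=9.0, x=x)   # (-1 * x, -1)

print(large_error, small_error)
```

The gradients differ by exactly the same factor of ten as the errors do.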

**For a Non-Linear Activation Function:**

Now let's assume the activation function is sigmoid, so the output becomes:
$ y_{\text{pred}} = \frac{1}{1 + e^{-(w \cdot x + b)}} $

The partial derivatives of the loss with respect to w and b become more complex due to the chain rule:
$ \frac{\partial L}{\partial w} = (y_{\text{pred}} - y_{\text{true}}) \cdot y_{\text{pred}} \cdot (1 - y_{\text{pred}}) \cdot x $
$ \frac{\partial L}{\partial b} = (y_{\text{pred}} - y_{\text{true}}) \cdot y_{\text{pred}} \cdot (1 - y_{\text{pred}}) $

**Example with Non-Linear Activation:**

- Case 1: Large error (e.g., $ y_{\text{true}} = 1, y_{\text{pred}} = \text{sigmoid}(-10) \approx 0 $)
  - The derivative $ y_{\text{pred}} \cdot (1 - y_{\text{pred}}) $ is tiny (about $ 4.5 \times 10^{-5} $) because sigmoid(-10) is close to 0, where the sigmoid is saturated; the derivative reaches its maximum of 0.25 only at $ y_{\text{pred}} = 0.5 $.
  - $ \frac{\partial L}{\partial w} $ and $ \frac{\partial L}{\partial b} $ will therefore be very small despite the large error, leading to almost no weight update.

- Case 2: Small error (e.g., $ y_{\text{true}} = 1, y_{\text{pred}} = \text{sigmoid}(0.9) \approx 0.71 $)
  - The derivative $ y_{\text{pred}} \cdot (1 - y_{\text{pred}}) $ is about 0.21, close to its maximum of 0.25, because sigmoid(0.9) is still far from saturation.
  - $ \frac{\partial L}{\partial w} $ and $ \frac{\partial L}{\partial b} $ will be moderate (about $ -0.059 $ for the bias), which is more than a thousand times larger than in Case 1 even though the error is smaller.

In the non-linear case, the gradient scales with the input, with the error, and with the derivative of the activation function, which depends on the current output of the neuron. Updates therefore shrink as the prediction approaches the target, but they also shrink when the neuron saturates, even if the prediction is far off.
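Plugging the two sigmoid cases into the formulas gives concrete numbers. In this sketch the input `x = 1.0` is an assumption, and the pre-activation values $-10$ and $0.9$ are the ones used in the cases above; note that the saturated, large-error case ends up with the far smaller gradient.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grads(y_true: float, z: float, x: float):
    """Gradients of L = 0.5 * (y_true - y_pred)**2 with y_pred = sigmoid(z), z = w*x + b."""
    y_pred = sigmoid(z)
    local = y_pred * (1.0 - y_pred)   # derivative of the sigmoid at z
    error = y_pred - y_true
    return error * local * x, error * local, local

x = 1.0  # assumed input value

case1 = sigmoid_grads(y_true=1.0, z=-10.0, x=x)  # large error, saturated neuron
case2 = sigmoid_grads(y_true=1.0, z=0.9, x=x)    # smaller error, unsaturated neuron

print("case 1 (dL/dw, dL/db, sigmoid'):", case1)
print("case 2 (dL/dw, dL/db, sigmoid'):", case2)
```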



**Why is Linear Activation Unsuitable?**

The linear neuron above does respond to the size of the error, so gradient scaling is not the problem. The real limitation is representational: a stack of layers with linear activations collapses into a single linear map, so the network cannot capture the complex patterns that arise from non-linear relationships, no matter how many layers it has or how well it is trained.

Non-linear activation functions, on the other hand, let stacked layers represent non-linear functions, which is essential for complex tasks. The price is that their derivatives can shrink the gradient when neurons saturate, as the sigmoid does for large positive or negative inputs.




---

**Chain Rule Demonstration**

<br>

**GPT Answer:**
There seems to be a misunderstanding in the way the chain rule needs to be applied within the symbolic differentiation environment. Let me clarify the steps outside the symbolic computation to guide you through the process:

Given the mean squared error loss function:
$ L = \frac{1}{2}(y_{\text{true}} - y_{\text{pred}})^2 $

And $ y_{\text{pred}} $ as the output of the sigmoid function:
$ y_{\text{pred}} = \frac{1}{1 + e^{-z}} $
where $ z = wx + b $.

Here is how you would apply the chain rule:

1. **Find** $ \frac{\partial L}{\partial y_{\text{pred}}} $:

  $ \frac{\partial L}{\partial y_{\text{pred}}} = -(y_{\text{true}} - y_{\text{pred}}) $
    
  This is because the derivative of $ \frac{1}{2}u^2 $ with respect to $ u $ is $ u $, where $ u = y_{\text{true}} - y_{\text{pred}} $, and by the chain rule you then multiply by the derivative of the inside term $ u $ with respect to $ y_{\text{pred}} $, which is $ -1 $.

2. **Find** $ \frac{\partial y_{\text{pred}}}{\partial z} $:

  $ \frac{\partial y_{\text{pred}}}{\partial z} = y_{\text{pred}} \cdot (1 - y_{\text{pred}}) $
  
  This comes from the derivative of the sigmoid function which can be expressed as $ \sigma(z) \cdot (1 - \sigma(z)) $.

3. **Find** $ \frac{\partial z}{\partial w} $:
    
  $ \frac{\partial z}{\partial w} = x$
  
  This is straightforward as $ z = wx + b $, so the derivative of $ z $ with respect to $ w $ is just $ x $, the input.

4. **Combine them using the chain rule** to find $ \frac{\partial L}{\partial w} $:
  
  $ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial z} \cdot \frac{\partial z}{\partial w} $

Putting it all together, we have:

$ \frac{\partial L}{\partial w} = -(y_{\text{true}} - y_{\text{pred}}) \cdot y_{\text{pred}} \cdot (1 - y_{\text{pred}}) \cdot x $

This expression tells us how to update $ w $ during backpropagation to minimize the loss $ L $.

For the exact symbolic computation, we're running into a limitation with the sympy library's ability to handle derivatives with respect to more complex expressions directly. Normally, we would simplify and combine terms within the symbolic computation environment, but in this case, we would need to proceed with the substitutions manually after computing the derivative with respect to the more basic terms.

The steps described are the correct application of the chain rule for this context. However, in practice, the derivation is often handled by software libraries that specialize in automatic differentiation, which can handle these chain rule applications internally.
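One way to gain confidence in the chain-rule result without symbolic tools is to compare it against a finite-difference estimate. This is a sketch with arbitrary values for $w$, $b$, $x$ and $y_{\text{true}}$.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def loss(w: float, b: float, x: float, y_true: float) -> float:
    """L = 0.5 * (y_true - sigmoid(w*x + b))**2"""
    y_pred = sigmoid(w * x + b)
    return 0.5 * (y_true - y_pred) ** 2

def analytic_dL_dw(w: float, b: float, x: float, y_true: float) -> float:
    """Chain rule: dL/dy_pred * dy_pred/dz * dz/dw."""
    y_pred = sigmoid(w * x + b)
    return -(y_true - y_pred) * y_pred * (1.0 - y_pred) * x

# Arbitrary test point.
w, b, x, y_true = 0.3, -0.1, 1.5, 1.0

h = 1e-6
numeric = (loss(w + h, b, x, y_true) - loss(w - h, b, x, y_true)) / (2 * h)
analytic = analytic_dL_dw(w, b, x, y_true)

print(numeric, analytic)
```

The two values should agree to many decimal places; a large gap would point to a sign or factor error in the derivation.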

---

**GPT Prompt:** considering the formula for dL/dw, won't the formula be affected if x is a very very large value like 1 million? // (Essentially, I personally spotted an error with the sigmoid function.)

<br>

**GPT Answer:**
Yes, that's correct. In the formula for $\frac{\partial L}{\partial w}$, if $x$ (the input to the neuron) is very large, it can significantly affect the gradient during backpropagation.

The formula for the derivative of the loss $L$ with respect to the weight $w$ when using mean squared error and a sigmoid activation function is:

$ \frac{\partial L}{\partial w} = -(y_{\text{true}} - y_{\text{pred}}) \cdot y_{\text{pred}} \cdot (1 - y_{\text{pred}}) \cdot x $

Here's what happens if $x$ is very large:

- The term $x$ in the derivative will amplify any changes in the weight $w$.
- A large $x$ value will lead to a large gradient (assuming the other factors like the error $y_{\text{true}} - y_{\text{pred}}$ and the derivative of the sigmoid $y_{\text{pred}} \cdot (1 - y_{\text{pred}})$ are not extremely small).
- This large gradient can cause the weights to update by a large amount, potentially leading to overshooting the minimum of the loss function during optimization.
- It can cause the learning process to be unstable and may result in the model failing to converge or to converge very slowly.

To mitigate this, there are a few techniques commonly used:

1. **Feature Scaling:** Normalizing or standardizing the input features so that they are of a similar scale. This prevents any feature with a naturally larger scale from disproportionately affecting the gradient.

2. **Weight Initialization:** Choosing an appropriate weight initialization strategy can prevent gradients from becoming too large at the start of training.

3. **Gradient Clipping:** This technique involves scaling down gradients when they exceed a certain threshold to prevent drastic updates to the weights.

4. **Learning Rate Scheduling:** Adjusting the learning rate during training (making it smaller over time) can help manage the size of the weight updates.

5. **Use of Regularization:** Regularization methods like L1 or L2 can help control the magnitude of the weights and, by extension, the gradients.

By applying these techniques, we can help ensure that large values of $x$ don't cause instability in the neural network training process.
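A rough illustration of the effect and of two of the mitigations (feature scaling and gradient clipping). Everything here is assumed for the sake of the example: the weight is set to 0 so the sigmoid is not saturated, the three feature values are made up, and `clip` is a hand-written helper rather than a library call.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dL_dw(w: float, b: float, x: float, y_true: float) -> float:
    y_pred = sigmoid(w * x + b)
    return -(y_true - y_pred) * y_pred * (1.0 - y_pred) * x

w, b, y_true = 0.0, 0.0, 1.0   # w = 0 keeps the sigmoid at its steepest point

# Raw input of one million: the gradient is amplified by the same factor.
grad_raw = dL_dw(w, b, 1_000_000.0, y_true)

# Feature scaling: standardize a made-up feature column to mean 0, std 1.
feature = [900_000.0, 1_000_000.0, 1_100_000.0]
mean = sum(feature) / len(feature)
std = math.sqrt(sum((v - mean) ** 2 for v in feature) / len(feature))
scaled = [(v - mean) / std for v in feature]
grads_scaled = [dL_dw(w, b, v, y_true) for v in scaled]

# Gradient clipping: cap the magnitude of a single gradient value.
def clip(g: float, max_abs: float = 1.0) -> float:
    return max(-max_abs, min(max_abs, g))

print(grad_raw, grads_scaled, clip(grad_raw))
```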

---

**GPT Prompt:** further, when x = 0, the derivative is also 0? even for dL/dw the y_true is 1, doesn't this make this method very inaccurate? // but then, if lets say the desired output is 1, if the input is -100, which sigmoid(-100) is approx 0, the derivative would also be close to 0, won't this make sigmoid very ineffective? // (Spotted another error with Sigmoid(x), good way to have a good understanding of things is by graphing and visualising it, then try and plug values into the function left and right.)

<br>

**GPT Answer:**
Yes, you are correct on both accounts. Let's address each of your points:

1. **Input x = 0:**
   If the input $ x $ to a neuron is zero, then for the weight update part $ \frac{\partial L}{\partial w} = -(y_{\text{true}} - y_{\text{pred}}) \cdot y_{\text{pred}} \cdot (1 - y_{\text{pred}}) \cdot x $, the term involving $ x $ will indeed be zero. This means that regardless of the error $ (y_{\text{true}} - y_{\text{pred}}) $ and the value of the output from the sigmoid function $ y_{\text{pred}} $, the gradient $ \frac{\partial L}{\partial w} $ with respect to the weight $ w $ will be zero, so there will be no update to the weight $ w $ during that particular iteration of training. This can be problematic if the data is not preprocessed correctly because it means that a particular feature (input) does not contribute to learning.

2. **Very Negative Input (e.g., -100):**
   For a very negative input to the sigmoid function, like -100, the output of the sigmoid $ y_{\text{pred}} $ will be extremely close to 0. The derivative of the sigmoid function, $ y_{\text{pred}}(1 - y_{\text{pred}}) $, will also be extremely small since it is maximal at 0.5 and decreases towards 0 as $ y_{\text{pred}} $ approaches 0 (or 1). Consequently, the gradient with respect to the weights, $ \frac{\partial L}{\partial w} $, will be small, potentially leading to very small updates to the weights. This is known as the vanishing gradient problem and is a known limitation of the sigmoid activation function.

The vanishing gradient problem can make sigmoid activations ineffective for deep networks or for networks where the activation can saturate at either 0 or 1 (such as with very large positive or negative inputs). This is why alternative activation functions like ReLU (Rectified Linear Unit) and its variants, which do not saturate in the positive direction and have a gradient of 1 for positive inputs, have become more popular in deep learning.

The image you've shown seems to illustrate the function $y_{\text{pred}}$ and its derivative. As you can see, for extreme values of the input, the derivative becomes very small, which is a visualization of the vanishing gradient issue.
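In the spirit of "plug values into the function left and right", the shape of the derivative can also be read off a small table. The sketch below uses only the standard library; the sample inputs are arbitrary, and the choice of 0 for the ReLU derivative at exactly $x = 0$ is a convention assumed here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # same as y_pred * (1 - y_pred)

def relu_derivative(x):
    # Assumed convention: the derivative at x = 0 is taken to be 0
    return 1.0 if x > 0 else 0.0

inputs = [-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]
sigmoid_grads = [sigmoid_derivative(x) for x in inputs]
relu_grads = [relu_derivative(x) for x in inputs]

for x, sg, rg in zip(inputs, sigmoid_grads, relu_grads):
    print(f"x = {x:6.1f}   sigmoid'(x) = {sg:.6f}   relu'(x) = {rg:.1f}")
```

The sigmoid derivative peaks at 0.25 for `x = 0` and shrinks symmetrically on both sides, whereas the ReLU derivative stays at 1 for every positive input.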

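For networks that use sigmoid activations, careful weight initialization and normalization of inputs are essential to keep the pre-activations out of the saturated regions. Gradient clipping, by contrast, addresses exploding gradients rather than vanishing ones, so it does not help with this particular issue.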


**GPT Prompt:** are there any more problems with sigmoid? and in what domains is sigmoid good?

<br>

**GPT Answer:**
Yes, besides the vanishing gradient problem, the sigmoid activation function has other limitations when used in neural networks:

1. **Saturation and Output Values:**
   The sigmoid function squashes its inputs to lie between 0 and 1. While this is useful for probabilities, for example in the output layer of binary classifiers, it can lead to neurons that saturate during training. If a neuron's output is consistently near 0 or 1, its local gradient is close to zero and it may effectively stop learning, as even large changes in the input will not significantly change the output.

2. **Not Zero-Centered:**
   Because the sigmoid's output is not zero-centered (it ranges from 0 to 1 instead of, say, -1 to 1 like the tanh function), this can lead to gradients being all of the same sign during backpropagation. This can result in zig-zagging dynamics during gradient descent, which can slow down the convergence of the optimization algorithm.

3. **Expensive Exponentials:**
   The sigmoid function involves an exponential operation, which is computationally more expensive than the alternatives like the ReLU function.

4. **Susceptibility to Learning Rate and Initialization:**
   Sigmoid units are sensitive to the scale of inputs and the choice of learning rate. Improper initialization or a poorly chosen learning rate can push the majority of the sigmoid units into the saturated regime at the start of training.
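The "not zero-centered" point (item 2) can be made concrete with a small experiment. This is a sketch under assumed values: four hypothetical hidden pre-activations, one linear downstream neuron $z = w \cdot h$ with arbitrary weights, and the loss $\frac{1}{2}(z - \text{target})^2$. Since $\frac{\partial L}{\partial w_i} = (z - \text{target}) \cdot h_i$, the sign of each weight gradient is the sign of the shared error times the sign of $h_i$.

```python
import torch

# Hypothetical pre-activations of four hidden units feeding one downstream neuron
pre_activations = torch.tensor([-2.0, -0.5, 0.5, 2.0])
target = torch.tensor(1.0)

def downstream_weight_grads(activation):
    """Gradient of 0.5 * (z - target)^2 w.r.t. the downstream weights, with z = w . h."""
    h = activation(pre_activations)  # outputs of the previous layer
    w = torch.tensor([0.1, -0.2, 0.3, 0.4], requires_grad=True)  # arbitrary weights
    z = torch.dot(w, h)              # linear downstream neuron
    loss = 0.5 * (z - target) ** 2
    loss.backward()
    return w.grad

sigmoid_grads = downstream_weight_grads(torch.sigmoid)
tanh_grads = downstream_weight_grads(torch.tanh)

print("sigmoid hidden outputs -> weight grads:", sigmoid_grads)
print("tanh hidden outputs    -> weight grads:", tanh_grads)
```

With sigmoid hidden outputs every $h_i$ is positive, so all four weight gradients share one sign and the weights can only move up together or down together in a single step. With tanh the hidden outputs have mixed signs, so the gradients do too.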

Despite these limitations, the sigmoid function is still useful and has its advantages in certain domains:

1. **Output Layer of Binary Classifiers:**
   For binary classification tasks, the sigmoid function is often appropriate for the output layer of a neural network, where it can represent a probability that the input is either in one class or the other.

2. **Probabilistic Interpretation:**
   In models where a probabilistic interpretation of the outputs is needed, the sigmoid function is suitable because it can be seen as the probability of the input being in a particular class.

3. **Small Networks and Shallow Architectures:**
   In smaller networks or problems that are linearly separable, the sigmoid function can perform well without running into issues like vanishing gradients.

4. **Historical Relevance:**
   Sigmoid functions were used in some of the earliest neural networks, and they can still be useful for educational purposes or to reproduce historical models.

5. **Gradient-Based Optimization:**
   In networks with single-layer or shallow architectures, where vanishing gradients are less of an issue, the smooth gradient of the sigmoid can be an advantage in gradient-based optimization tasks.

Today, for hidden layers in deep neural networks, other functions like ReLU and its variants (Leaky ReLU, Parametric ReLU, Exponential Linear Unit, etc.) are typically preferred due to their computational efficiency and their ability to mitigate the vanishing gradient problem. However, sigmoid is still very relevant in specific contexts as mentioned above.
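As an illustration of the output-layer use case, the sketch below turns raw model outputs (logits) into probabilities with `torch.sigmoid` and computes a binary cross-entropy loss in two equivalent ways. The logits and labels are made-up placeholders, not data from these notes.

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(5)                            # hypothetical raw outputs of a final linear layer
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])   # hypothetical binary targets

probabilities = torch.sigmoid(logits)              # each value lies strictly between 0 and 1

# Option 1: apply sigmoid yourself, then binary cross-entropy on the probabilities
loss_from_probs = nn.BCELoss()(probabilities, labels)

# Option 2: pass the raw logits; the sigmoid is applied inside the loss
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)

print("probabilities:", probabilities)
print("BCELoss on sigmoid(logits):", loss_from_probs.item())
print("BCEWithLogitsLoss on logits:", loss_from_logits.item())
```

Both losses agree up to floating-point error. The PyTorch documentation describes `nn.BCEWithLogitsLoss` as the more numerically stable of the two, which matters when logits become very large in magnitude and the sigmoid saturates.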

---

**GPT Prompt:** doesn't this apply to tanh too? its gradient is also the same sign. // "Because the sigmoid's output is not zero-centered (it ranges from 0 to 1 instead of, say, -1 to 1 like the tanh function), this can lead to gradients being all of the same sign during backpropagation. This can result in zig-zagging dynamics during gradient descent, which can slow down the convergence of the optimization algorithm."

<br>

**GPT Answer:**

Yes, the tanh function does share some characteristics with the sigmoid function, including the potential for gradients to be of the same sign. However, the issue of gradients all being of the same sign is more specific to the sigmoid function because its output ranges from 0 to 1. Since it never outputs negative values, every input it passes to the next layer is positive, so the gradients for all incoming weights of a downstream neuron share one sign. Those weights can then only all increase or all decrease in a single update, which leads to the zig-zagging dynamics during optimization.

The tanh function, on the other hand, outputs values in the range of -1 to 1 and is zero-centered, which means that its gradients can be positive or negative depending on the input. This helps mitigate the issue of gradients all being of the same sign to some extent. A zero-centered function means that for inputs that are roughly symmetrically distributed around 0, the gradients are also symmetrically distributed around 0, which can lead to more efficient learning updates.

However, the tanh function does still suffer from vanishing gradients for inputs with large absolute values. For these inputs, the function becomes flat, and the gradients become very small, which can slow down learning or cause it to stall completely. This is why, despite being zero-centered, tanh is also less commonly used in deep networks compared to functions like ReLU, which do not saturate in the positive direction and therefore maintain larger gradients during learning.
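The saturation of tanh can be checked the same way as for sigmoid, by plugging in values. The sketch below uses only the standard library and the identity $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$; the sample inputs are arbitrary.

```python
import math

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

inputs = [-10.0, -3.0, 0.0, 3.0, 10.0]
tanh_grads = [tanh_derivative(x) for x in inputs]
sigmoid_grads = [sigmoid_derivative(x) for x in inputs]

for x, tg, sg in zip(inputs, tanh_grads, sigmoid_grads):
    print(f"x = {x:6.1f}   tanh'(x) = {tg:.2e}   sigmoid'(x) = {sg:.2e}")
```

At `x = 0` the tanh derivative is 1 (four times the sigmoid's peak of 0.25), but for large $|x|$ it collapses towards zero just like the sigmoid's does.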

---

**GPT Prompt:** "its gradients can be positive or negative depending on the input." -- yes but the derivative of tanh x is only positive, how is it any different // (screenshot of desmos graph) // No? as seen in the green graph in my desmos site, the derivative of tanh x is once again positive for both positive and negative input

* This prompt and the one above are essentially about GPT's misphrasing of whether the gradients of the sigmoid and tanh functions are positive or negative.
* Further sidenote: I think it's referring to the gradient of the MSE / loss function instead.

<br>

**GPT Answer:**

You are correct in pointing out that the derivative of the tanh function is always positive for both positive and negative inputs. **When I mentioned that gradients can be positive or negative, I was referring to the overall gradient of the loss with respect to the weights (including the input term $x$ in the gradient calculation), not just the derivative of the activation function.**

For the tanh function, its derivative $\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)$ is indeed always positive. However, the overall gradient of the loss function with respect to the weights in the network takes into account the input $x$ to the neuron as well as the error signal from the loss function. The sign of the gradient with respect to a weight depends on both the input $x$ and the error term. If the input $x$ is negative and the error signal $\frac{\partial L}{\partial y_{\text{pred}}}$ is positive, the gradient is negative, leading to an increase in the weight value after the update (because weight updates typically subtract the gradient). Conversely, if $x$ is positive with the same positive error signal, the gradient is positive, leading to a decrease in the weight after the update.

This is different from the sigmoid function, where the output and its derivative are both always positive, so the values it passes on as inputs to the next layer can never be negative. The tanh function's output being zero-centered means that for symmetrically distributed inputs, the weight updates will not be biased toward increasing or decreasing the weights but will be determined by the error term.

In your Desmos graph, the green graph indeed shows that the derivative of the tanh function is always positive, which reflects the steepness of the tanh curve at any given point $x$. This positive derivative does not mean that the overall weight updates during backpropagation will always be in the same direction. The input $x$ and the loss gradient $\frac{\partial L}{\partial y_{\text{pred}}}$ play crucial roles in determining the sign and magnitude of the weight updates.
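The difference between "the derivative of tanh" and "the gradient of the loss with respect to a weight" can be seen by computing both for the same neuron. This is a sketch under assumptions not in the original chat: a single neuron $y_{\text{pred}} = \tanh(wx)$ without bias, the loss $\frac{1}{2}(y_{\text{true}} - y_{\text{pred}})^2$, a starting weight of 0.5, a target of 1, and a hypothetical helper named `tanh_neuron_grads`.

```python
import torch

def tanh_neuron_grads(x_value, w_init=0.5, y_true=1.0):
    """Return (tanh'(z), dL/dw) for y_pred = tanh(w * x) and L = 0.5 * (y_true - y_pred)^2."""
    x = torch.tensor(x_value)
    w = torch.tensor(w_init, requires_grad=True)
    z = w * x
    y_pred = torch.tanh(z)
    loss = 0.5 * (y_true - y_pred) ** 2
    loss.backward()
    activation_slope = (1 - y_pred ** 2).item()  # derivative of tanh at z
    return activation_slope, w.grad.item()

slope_pos, grad_pos = tanh_neuron_grads(1.5)    # positive input
slope_neg, grad_neg = tanh_neuron_grads(-1.5)   # negative input

print(f"x = +1.5: tanh'(z) = {slope_pos:.4f}, dL/dw = {grad_pos:+.4f}")
print(f"x = -1.5: tanh'(z) = {slope_neg:.4f}, dL/dw = {grad_neg:+.4f}")
```

The activation slope is positive for both inputs (the green Desmos curve), yet `dL/dw` is negative for the positive input and positive for the negative input, because the weight gradient also carries the factor $x$ and the error signal.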

---

**Error Term Definition**

<br>

The "error term" typically refers to the difference between the predicted output of the network and the actual target value or ground truth. In the context of backpropagation and weight updates, this error term is used to calculate the gradients of the loss function with respect to the weights.

Here’s how it works in a typical supervised learning setting:

1. **Prediction:** The neural network makes a prediction ($y_{\text{pred}}$) based on the current weights and the input data.

2. **Error Calculation:** The difference between this prediction and the true target value ($y_{\text{true}}$) is calculated. This difference is often called the "error", and a loss function turns it into a single number to minimize. For regression tasks, the loss is often the Mean Squared Error (MSE), where the contribution of each data point is $(y_{\text{true}} - y_{\text{pred}})^2$; for classification tasks, cross-entropy loss might be used instead.

3. **Gradient of the Loss Function:** The gradient of the loss function with respect to the weights ($\frac{\partial L}{\partial w}$) is computed using this error. For the MSE, the error term in the gradient would be $(y_{\text{true}} - y_{\text{pred}})$, up to sign and a constant factor. This error term is essential for the backpropagation algorithm because it tells the network the direction in which it needs to adjust the weights to reduce the loss in the next iteration.

In the case of using activation functions like sigmoid or tanh in a network layer, the error term is propagated back from the output layer to earlier layers, being adjusted by the derivative of the activation function at each layer. This is due to the chain rule being applied in the computation of the gradients during backpropagation.

The overall sign and magnitude of the weight update depend on the sign and magnitude of this error term, on the derivative of the activation function evaluated at the neuron's pre-activation value, and on the input $x$ that the weight multiplies.
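The chain rule described above can be written out factor by factor and checked against a numerical derivative. This is a sketch under assumptions: a single sigmoid neuron $y_{\text{pred}} = \sigma(wx)$ without bias, the loss $L = \frac{1}{2}(y_{\text{true}} - y_{\text{pred}})^2$, and made-up values for `w`, `x` and `y_true`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y_true):
    y_pred = sigmoid(w * x)
    return 0.5 * (y_true - y_pred) ** 2

# Hypothetical values for one training example
w, x, y_true = 0.8, 2.0, 1.0

# Chain rule, one factor per step: dL/dw = dL/dy_pred * dy_pred/dz * dz/dw
y_pred = sigmoid(w * x)
error_term = -(y_true - y_pred)            # dL/dy_pred
activation_slope = y_pred * (1 - y_pred)   # dy_pred/dz for sigmoid
dz_dw = x                                  # dz/dw, since z = w * x
analytic_grad = error_term * activation_slope * dz_dw

# Central finite difference as an independent check of the derivation
eps = 1e-6
numeric_grad = (loss(w + eps, x, y_true) - loss(w - eps, x, y_true)) / (2 * eps)

print(f"analytic dL/dw = {analytic_grad:.8f}")
print(f"numeric  dL/dw = {numeric_grad:.8f}")
```

Here the prediction is below the target, so the error term is negative; multiplied by a positive slope and a positive input, the gradient is negative and a gradient descent step `w - lr * grad` would increase the weight.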

* Ensemble Modelling Definition URL

---

**GPT Question:**

**GPT Answer:**