# NumPy vs Torch vs MLX

In the world of transformers, one needs a firm understanding of their building blocks, which boil down to matrix maths.

NumPy is the go-to library for numerical computing in Python and is built around `numpy.ndarray` (created with `numpy.array`). PyTorch, from Facebook (now Meta), uses `torch.Tensor`, and Apple has now released its own framework, MLX, with `mlx.core.array`.

Because of the size of the matrices used in deep learning models, we end up performing millions, sometimes billions, of (simple) calculations. It is therefore advisable to accelerate them on the system's GPU. The industry standard is Nvidia CUDA, but it is only available on systems with an Nvidia GPU. Torch lets you hand calculations off to the graphics chip by specifying `cuda` as the device. Apple has moved away from Intel-based hardware to its own Apple Silicon chips, which do not support CUDA. Torch can accelerate on these chips with the `mps` device, and Apple has recently released the MLX library, which is built to make full use of Apple Silicon.
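
As a rough sketch of the device hand-off described above, the helper below picks the best backend that PyTorch can see. `pick_device` is a hypothetical name introduced for illustration, and falling back to `"cpu"` when PyTorch is not installed is an assumption made so that the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps" or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we simply report the CPU.
        return "cpu"
    if torch.cuda.is_available():  # Nvidia GPU with CUDA
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():  # Apple Silicon GPU
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed wherever PyTorch accepts a device, for example `torch.ones(3, device=device)`.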

This repository investigates how beneficial switching to the MLX library would be for the development of smaller, local models.

In [1]:
import numpy as np
import mlx.core as mx
import torch

In [2]:
## How do we define arrays with each library?

a = mx.array([1, 2, 3])
b = torch.tensor([1, 2, 3])
c = np.array([1, 2, 3])
a, b, c

(array([1, 2, 3], dtype=int32), tensor([1, 2, 3]), array([1, 2, 3]))

In [3]:
a = mx.linspace(0, 1, 5)
b = torch.linspace(0, 1, 5)
c = np.linspace(0, 1, 5)
a, b, c

(array([0, 0.25, 0.5, 0.75, 1], dtype=float32),
 tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000]),
 array([0.  , 0.25, 0.5 , 0.75, 1.  ]))

In [4]:
a = mx.arange(48).reshape(3, 4, 4)
b = torch.arange(48).reshape(3, 4, 4)
c = np.arange(48).reshape(3, 4, 4)

# General Broadcasting

So, what is an array, and how does it behave?
When operating on two arrays, NumPy compares their shapes element-wise, starting with the trailing dimension and working its way left. Two dimensions are compatible when:
1. they are equal, or
2. one of them is 1

**Example**  
The following two six-dimensional shapes are compatible  
Shape 1: (1, 6, 4, 1, 7, 2)  
Shape 2: (5, 6, 1, 3, 1, 2)  
because each pair of dimensions follows the rules above.
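
The rule above can be written out in a few lines of plain Python. This is only a sketch for building intuition: `broadcast_shape` is a hypothetical helper, not a library function (NumPy ships its own `np.broadcast_shapes` for real use).

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the trailing dimension leftwards;
    # a missing dimension on the shorter shape is treated as 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible dimensions: {dim_a} vs {dim_b}")
    return tuple(reversed(result))


shape_1 = (1, 6, 4, 1, 7, 2)
shape_2 = (5, 6, 1, 3, 1, 2)
print(broadcast_shape(shape_1, shape_2))  # (5, 6, 4, 3, 7, 2)
```

In every position the result takes whichever dimension is not 1, which is why the two shapes in the example combine into `(5, 6, 4, 3, 7, 2)`.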

In [5]:
a = np.ones((6, 5))
b = np.arange(5).reshape((1, 5))

In [6]:
a + b

array([[1., 2., 3., 4., 5.],
       [1., 2., 3., 4., 5.],
       [1., 2., 3., 4., 5.],
       [1., 2., 3., 4., 5.],
       [1., 2., 3., 4., 5.],
       [1., 2., 3., 4., 5.]])

In [7]:
a = torch.ones((6, 5))
b = torch.arange(5).reshape((1, 5))

In [8]:
a + b

tensor([[1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.]])

In [9]:
a = mx.ones((6, 5))
b = mx.arange(5).reshape((1, 5))

In [10]:
a + b

array([[1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5]], dtype=float32)

### Scaling by different amounts
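The arrays/tensors don't need to have the same number of dimensions. If one of them has fewer dimensions than the other, its shape is padded with 1s on the left until both shapes have the same length, and the usual compatibility rule is then applied.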

**Example**  
Scaling each of the colour channels of an image by a different amount.  

Image  (3D array): 256 x 256 x 3  
scale  (1D array):             3  
Result (3D array): 256 x 256 x 3  
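
To make the implicit step visible, the NumPy sketch below compares the automatic behaviour with adding the leading axes by hand. The all-ones image is an assumption chosen so that the scale factors can be read directly from the result.

```python
import numpy as np

image = np.ones((256, 256, 3))
scale = np.array([0.5, 1.5, 1.0])

# Implicit: the shape (3,) is treated as (1, 1, 3) and stretched over the image.
implicit = image * scale
# Explicit: add the two leading axes by hand before multiplying.
explicit = image * scale[np.newaxis, np.newaxis, :]

print(implicit.shape)                      # (256, 256, 3)
print(np.array_equal(implicit, explicit))  # True
```

Both forms give the same result, so the explicit reshape is only needed when the axis to be scaled is not the trailing one.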

In [11]:
image = torch.rand((256, 256, 3))
scale = torch.tensor([0.5, 1.5, 1])

In [12]:
result = image * scale
result.shape

torch.Size([256, 256, 3])

In [13]:
image = np.random.rand(256, 256, 3)
scale = np.array([0.5, 1.5, 1])

In [14]:
result = image * scale
result.shape

(256, 256, 3)

In [15]:
image = mx.random.normal((256, 256, 3))
scale = mx.array([0.5, 1.5, 1])

In [16]:
result = image * scale
result.shape

[256, 256, 3]

**Example**  
One has an array of 2 images and wants to scale the colour channels of each image by slightly different amounts:  

Image  (4D array): 2 x 256 x 256 x 3  
scale  (4D array): 2 x  1  x  1  x 3  
Result (4D array): 2 x 256 x 256 x 3  


In [17]:
image = mx.random.normal((2, 256, 256, 3))
scales = mx.array([0.5, 1.5, 1, 1.5, 1, 0.5]).reshape((2,1, 1, 3))

In [18]:
result = image * scales
result.shape

[2, 256, 256, 3]
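
The shape alone does not show whether each image received its own factors, so here is a small NumPy check. The tiny all-ones images are an assumption made to keep the output readable.

```python
import numpy as np

# Two tiny all-ones "images", so the result equals the applied scale factors.
images = np.ones((2, 4, 4, 3))
scales = np.array([0.5, 1.5, 1.0, 1.5, 1.0, 0.5]).reshape((2, 1, 1, 3))

result = images * scales
print(result.shape)     # (2, 4, 4, 3)
print(result[0, 0, 0])  # channel factors applied to the first image
print(result[1, 0, 0])  # channel factors applied to the second image
```

The first image is scaled by the first three values and the second image by the last three, because the leading axis of size 2 lines up with the batch axis.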

# Operations Across Dimensions

Fundamental to these libraries is the ability to operate across dimensions.

In [19]:
## in 1 Dimension
t = torch.tensor([0.5, 1, 3, 4])
torch.mean(t), torch.std(t), torch.max(t), torch.min(t)

(tensor(2.1250), tensor(1.6520), tensor(4.), tensor(0.5000))

In [20]:
n = np.array([0.5, 1, 3, 4])
np.mean(n), np.std(n), np.max(n), np.min(n)

(2.125, 1.4306903927824497, 4.0, 0.5)

In [21]:
m = mx.array([0.5, 1, 3, 4])
## std is not available (in the MLX version used here), so we compute the population standard deviation ourselves.
m.mean(), (sum((m.mean()-m)**2)/len(m))**0.5, m.max(), m.min()

(array(2.125, dtype=float32),
 array(1.43069, dtype=float32),
 array(4, dtype=float32),
 array(0.5, dtype=float32))
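
Note that the outputs above disagree on the standard deviation: Torch reports 1.6520 while NumPy and the hand-rolled MLX version report 1.4307. The difference comes from the divisor. `torch.std` defaults to the sample estimate (dividing by N - 1), whereas `np.std` defaults to the population estimate (dividing by N). A short NumPy sketch reproduces both:

```python
import numpy as np

n = np.array([0.5, 1, 3, 4])

population_std = np.std(n)      # divides by N (ddof=0), NumPy's default
sample_std = np.std(n, ddof=1)  # divides by N - 1, as torch.std does by default

print(round(float(population_std), 4))  # 1.4307
print(round(float(sample_std), 4))      # 1.652
```

When comparing results across libraries, it is worth setting the divisor explicitly rather than relying on the defaults.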

### What about operations on a multi-dimensional array/tensor?
Taking the mean of each column (`axis=0`) is referred to as "taking the mean across the rows".

In [22]:
t = torch.arange(20, dtype=float).reshape(5, 4)
t, torch.mean(t, axis=0)

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [12., 13., 14., 15.],
         [16., 17., 18., 19.]], dtype=torch.float64),
 tensor([ 8.,  9., 10., 11.], dtype=torch.float64))

In [23]:
a = np.arange(20, dtype=float).reshape(5, 4)
a, a.mean(axis=0)

(array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]]),
 array([ 8.,  9., 10., 11.]))

In [24]:
m = mx.arange(20).reshape(5, 4)
m, m.mean(axis=0)

(array([[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]], dtype=int32),
 array([8, 9, 10, 11], dtype=float32))
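
For contrast, the NumPy sketch below reduces the same 5 x 4 array along the other axis, and uses `keepdims=True` so that the result can be broadcast back against the original. The centring step is an illustrative use, not something taken from the cells above.

```python
import numpy as np

a = np.arange(20, dtype=float).reshape(5, 4)

col_means = a.mean(axis=0)                      # collapses the 5 rows, shape (4,)
row_means = a.mean(axis=1)                      # collapses the 4 columns, shape (5,)
row_means_kept = a.mean(axis=1, keepdims=True)  # shape (5, 1)

print(col_means)  # [ 8.  9. 10. 11.]
print(row_means)  # [ 1.5  5.5  9.5 13.5 17.5]

# (5, 4) minus (5, 1) broadcasts, so every row is centred on its own mean.
centred = a - row_means_kept
```

The axis that is named is the one that disappears from the result, which is a handy way to remember what `axis=0` and `axis=1` do.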

### This can also be done for higher dimensional arrays/tensors.

In [25]:
t = torch.rand(5, 256, 256, 3)

In [26]:
torch.mean(t, axis=0).shape

torch.Size([256, 256, 3])

In [27]:
torch.mean(t, axis=-1).shape

torch.Size([5, 256, 256])

In [28]:
values, indx = torch.max(t, axis=-1)
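
`torch.max` with an axis returns both the maximum values and the positions where they occur. In NumPy the same information comes from two separate calls. The sketch below uses a smaller random array than the cells above (an assumption, purely to keep it quick) and checks that the indices really point at the maxima.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.random((5, 8, 8, 3))  # smaller than 256 x 256 to keep the example quick

values = t.max(axis=-1)       # largest channel value per pixel
indices = t.argmax(axis=-1)   # which channel held that value

# Use the indices to pick values back out along the last axis.
gathered = np.take_along_axis(t, indices[..., np.newaxis], axis=-1).squeeze(-1)

print(values.shape, indices.shape)
print(np.array_equal(values, gathered))
```

Both results drop the reduced axis, so they have shape `(5, 8, 8)` here.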

## Where are they different?

**PyTorch** begins to differ from **NumPy** when it comes to computing the gradients of operations.

$ y = \sum\limits_{i} x^{3}_{i}$  

has a gradient  

$ \frac{\partial y}{\partial x_i} = 3 x^{2}_{i}$

In [29]:
x = torch.tensor([[5., 8.], [4., 6.]], requires_grad=True)
x

tensor([[5., 8.],
        [4., 6.]], requires_grad=True)

In [30]:
y = x.pow(3).sum()
y

tensor(917., grad_fn=<SumBackward0>)

In [31]:
y.backward() ## compute the gradient
x.grad ## the gradient dy/dx accumulated by backward(), with the same shape as x

tensor([[ 75., 192.],
        [ 48., 108.]])

In [32]:
3*x**2

tensor([[ 75., 192.],
        [ 48., 108.]], grad_fn=<MulBackward0>)

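This kind of automatic differentiation is not available in NumPy. MLX does provide it, but through function transformations such as `mx.grad` rather than a `.backward()` call on a tensor.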
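
NumPy can still be used to sanity-check the analytic gradient with finite differences. The sketch below is an approximation technique, not automatic differentiation, and `numerical_grad` is a hypothetical helper written for this example.

```python
import numpy as np


def f(x):
    """y = sum_i x_i^3, the same function as in the cells above."""
    return np.sum(x ** 3)


def numerical_grad(func, x, eps=1e-5):
    """Approximate d func / d x element by element with central differences."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad


x = np.array([[5.0, 8.0], [4.0, 6.0]])
approx = numerical_grad(f, x)
exact = 3 * x ** 2

print(np.allclose(approx, exact, rtol=1e-6))  # True
```

This costs two function evaluations per element, so it is only practical as a check on small inputs.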



# PyTorch to MLX
PyTorch supports the buffer protocol, but requires an explicit `memoryview`.
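
To show what an explicit `memoryview` over the buffer protocol looks like, here is a sketch that uses NumPy on both sides. NumPy stands in for the frameworks here as an assumption, so that the snippet runs without PyTorch or MLX; the framework-specific conversion calls are not shown.

```python
import numpy as np

source = np.arange(3, dtype=np.float32)

view = memoryview(source)   # explicit buffer-protocol view of the data, no copy
target = np.asarray(view)   # the consumer rebuilds an array from the buffer

source[0] = 10.0            # a write through the original array

print(view.format, view.shape)           # f (3,)
print(target[0])                         # 10.0
print(np.shares_memory(source, target))  # True
```

The `memoryview` carries the element type and shape along with the raw bytes, which is what lets the consumer reconstruct an array without being told the layout separately.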