<a href="https://colab.research.google.com/github/DavideScassola/PML2024/blob/main/Notebooks/02_numpy_pandas_sklearn/021_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2.1: Introduction to `numpy`

Probabilistic Machine Learning -- Spring 2024, UniTS

This notebook is based on [NumPy: the absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html) and [NumPy Quickstart](https://numpy.org/doc/stable/user/quickstart.html#the-basics).

## What is `numpy`?


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/1280px-NumPy_logo_2020.svg.png" width="400">


NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides multidimensional arrays and many mathematical functions that operate on them. These include
- mathematical operations
- linear algebra
- basic statistical operations
- random simulation

and much more.

Moreover, NumPy array operations are fast because they are implemented in compiled C code.

In [1]:
import numpy as np
# This is the standard way to import it

## Arrays
At the core of the NumPy package is the `ndarray` object, which encapsulates n-dimensional arrays of homogeneous data types. There are several important differences between NumPy arrays and the standard Python sequences:

- Fixed size at creation
- All elements have to be of the same type (e.g. float)
- Support mathematical operations (e.g. summing two arrays)

One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data:

In [2]:
a = np.array([1, 2, 3, 4, 5, 6])
a

array([1, 2, 3, 4, 5, 6])

In [3]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

A NumPy array is characterized by a type (`.dtype`) and a shape (`.shape`):

In [4]:
a = np.array([1, 2, 3, 4, 5, 6])

print('a:\n', a)
print('type:', a.dtype)
print('shape:', a.shape) 

a:
 [1 2 3 4 5 6]
type: int64
shape: (6,)


In [5]:
a = np.array([[1, -2, 3.6, 4], [5, 6.0, 7, 8], [9, 10, 1.0, 2.2]])

print('a:\n', a)
print('type:', a.dtype)
print('shape:', a.shape) 

a:
 [[ 1.  -2.   3.6  4. ]
 [ 5.   6.   7.   8. ]
 [ 9.  10.   1.   2.2]]
type: float64
shape: (3, 4)


In NumPy, dimensions are called **axes**. For example, the previous array with shape (3, 4) has two **axes**: the first **axis** has length 3, and the second **axis** has length 4.
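As a small sketch of how these quantities can be read from an array, the attributes `.ndim`, `.shape` and `.size` report the number of axes, their lengths, and the total number of elements:

```python
import numpy as np

a = np.array([[1, -2, 3.6, 4], [5, 6.0, 7, 8], [9, 10, 1.0, 2.2]])

print(a.ndim)      # 2: number of axes
print(a.shape[0])  # 3: length of the first axis
print(a.shape[1])  # 4: length of the second axis
print(a.size)      # 12: total number of elements
```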

## Indexing and slicing

You can index and slice NumPy arrays in the same ways you can slice Python lists:

In [6]:
a = np.array([1, 2, 3, 4, 5, 6])

a[0] # accessing the first element

1

In [7]:
a[-1] # accessing the last element

6

In [8]:
# modifying an element
a[0] = 10
a

array([10,  2,  3,  4,  5,  6])

Slicing means selecting a subset of the elements of an array. The notation for selecting the elements from `start` (included) to `stop` (excluded) is

```array[start : stop]```

or 

```array[start : stop : step]```

if you want a step different from 1.

In [9]:
a[0:2]

array([10,  2])

In [10]:
a[1:-2]

array([2, 3, 4])

In [11]:
a[2:] # implicitly goes to the end

array([3, 4, 5, 6])

In [12]:
a[:4] # implicitly starts from the beginning

array([10,  2,  3,  4])

In [13]:
a[::2] # modify the step size

array([10,  3,  5])

In [14]:
a[::-1] # reverse the array

array([ 6,  5,  4,  3,  2, 10])

<img src="https://numpy.org/doc/stable/_images/np_indexing.png" width="1000">

With multidimensional arrays, you can specify an index/slice for every **axis**:

In [15]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

a[1:3, -2:]

array([[ 7,  8],
       [11, 12]])

In [16]:
# You don't need to specify the slice/index for all axes: the unspecified axes will be considered as complete slices [:]
a[1:3] # equivalent to a[1:3, :]

array([[ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
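A few more indexing patterns on the same array, as a minimal sketch (the variable names are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

element = a[1, 2]       # single element: row 1, column 2 -> 7
first_column = a[:, 0]  # all rows, column 0 -> array([1, 5, 9])
last_row = a[-1]        # last row -> array([ 9, 10, 11, 12])
corners = a[::2, ::3]   # rows 0 and 2, columns 0 and 3
```

Mixing an integer index with a slice, as in `a[:, 0]`, removes the indexed axis, so the result is one-dimensional.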

<img src="https://numpy.org/doc/stable/_images/np_matrix_indexing.png" width="1000">

## Array operations

One of the main features of NumPy is **vectorized** operations, which apply to whole arrays at once:

In [17]:
a = np.array([20, 30, 40, 50])
b = np.array([1, 2, 3, 4])

c = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [18]:
2*a

array([ 40,  60,  80, 100])

In [19]:
a + b # element-wise addition

array([21, 32, 43, 54])

In [20]:
a**2

array([ 400,  900, 1600, 2500])

In [21]:
np.log(a)

array([2.99573227, 3.40119738, 3.68887945, 3.91202301])

In [22]:
c + c**2

array([[  2,   6,  12,  20],
       [ 30,  42,  56,  72],
       [ 90, 110, 132, 156]])
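Since all the arithmetic operators above act element by element, it may help to contrast `*` with the matrix product operator `@`. This is a small sketch with made-up arrays `m` and `v`:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])

elementwise = m * m   # multiplies matching entries: [[1, 4], [9, 16]]
matrix_prod = m @ m   # matrix product: [[7, 10], [15, 22]]
mat_vec = m @ v       # matrix-vector product: [50, 110]
```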

## Statistics

In [23]:
a = np.array([20, 30, 40, 50, 12, 90, 23])
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [24]:
np.sum(a)

265

In [25]:
np.mean(a)

37.857142857142854

In [26]:
np.std(a)

24.321821897385963

In [27]:
np.max(a)

90

In [28]:
np.sum(b)

78

In [29]:
# reduce the array along the specified axis
# (3, 4) -> (4,)
b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

np.sum(b, axis=0)

array([15, 18, 21, 24])

In [30]:
# (3, 4) -> (3,)
np.sum(b, axis=1)

array([10, 26, 42])
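The `axis` argument works the same way for the other reductions. A minimal sketch follows; the optional `keepdims=True` argument keeps the reduced axis with length 1 instead of dropping it:

```python
import numpy as np

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

col_means = np.mean(b, axis=0)  # (3, 4) -> (4,): [5., 6., 7., 8.]
row_max = np.max(b, axis=1)     # (3, 4) -> (3,): [4, 8, 12]

# keepdims=True keeps the reduced axis with length 1
row_sums = np.sum(b, axis=1, keepdims=True)  # shape (3, 1)
```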

<img src="https://numpy.org/doc/stable/_images/np_matrix_aggregation.png" width="1000">

<img src="https://numpy.org/doc/stable/_images/np_matrix_aggregation_row.png" width="1000">

## Reject loops, embrace **vectorization**!

Vectorized operations are orders of magnitude faster than Python loops.
Unless strictly necessary, you should **avoid using loops** altogether!

Here is a demonstration. Let's say we want to find the maximum of a very large array:

In [31]:
a = np.arange(30_000_000) # create an array with elements from 0 to n-1
a

array([       0,        1,        2, ..., 29999997, 29999998, 29999999])

This is the inefficient way to do it:

In [32]:
# inefficient way
max_value = a[0] 
for i in a:
    if i>max_value:
        max_value = i
max_value

29999999

The vectorized operation is way faster:

In [33]:
np.max(a)

29999999

In [34]:
del a
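The speed difference can be measured with `time.perf_counter` from the standard library. This is a rough sketch on a smaller array (an assumption made so the loop finishes quickly); the actual timings depend on the machine.

```python
import time
import numpy as np

a = np.arange(1_000_000)  # smaller than above so the loop finishes quickly

# Time the explicit Python loop
start = time.perf_counter()
max_loop = a[0]
for value in a:
    if value > max_loop:
        max_loop = value
loop_seconds = time.perf_counter() - start

# Time the vectorized call
start = time.perf_counter()
max_vectorized = np.max(a)
vectorized_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
```

In a notebook, the `%timeit` magic is a more reliable way to time a single expression because it repeats the measurement several times.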

## Reshaping

Sometimes it can be useful to reshape arrays:

In [35]:
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12])

In [36]:
a.reshape(3, 4)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [37]:
a.reshape(3, 2, 2)

array([[[ 1,  2],
        [ 3,  4]],

       [[ 5,  6],
        [ 7,  8]],

       [[ 9, 10],
        [11, 12]]])

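Notice that the order of the elements is preserved: they are read and placed in row-major (lexicographical index) order, with the last axis changing fastest.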

In [38]:
a.reshape(-1, 2) # -1 means the value is inferred from the length of the array and remaining dimensions

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12]])

In [39]:
a.reshape(12, 1) # is this useful?

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12]])
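A reshape only succeeds when the new shape holds exactly the same number of elements. The sketch below shows a round trip between a flat array and a column, plus a shape that cannot work:

```python
import numpy as np

a = np.arange(1, 13)       # 12 elements: 1, 2, ..., 12

column = a.reshape(12, 1)  # shape (12, 1)
flat = column.reshape(-1)  # back to shape (12,)

# 12 elements cannot fill a (5, 3) array, so NumPy raises ValueError
try:
    a.reshape(5, 3)
except ValueError as error:
    print("reshape failed:", error)
```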

## Random

NumPy's `random` module contains functions for sampling from many distributions:

In [40]:
np.random.normal(loc=0, scale=1, size=(3, 2))

array([[-0.59833384,  0.72562245],
       [ 0.34488144,  1.30173943],
       [ 0.28970862,  1.74922199]])

In [41]:
np.random.binomial(n=10, p=0.2, size=7)

array([2, 3, 2, 1, 3, 2, 1])
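The samples above change at every run. If reproducible results are needed, one option is a seeded generator created with `np.random.default_rng`. This is a minimal sketch; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same samples
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

samples_a = rng_a.normal(loc=0, scale=1, size=(3, 2))
samples_b = rng_b.normal(loc=0, scale=1, size=(3, 2))

print(np.array_equal(samples_a, samples_b))  # True

# The generator offers the same distributions as methods
counts = rng_a.binomial(n=10, p=0.2, size=7)
```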

## Broadcasting

We saw how to do element-wise binary operations (e.g. the sum) between two or more arrays. Notice that we did it only between arrays with the same shape.

Now consider the following array of shape `(4,2)`:

In [42]:
a = np.arange(8).reshape(4, 2)
a

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

Now let's say we want to add to each row the following array:

In [43]:
b = np.array([-1, -10])
b

array([ -1, -10])

We could write a for loop that iteratively selects the rows of the matrix and adds that array, but remember, we don't do that here!
NumPy offers an alternative to that:

In [44]:
a + b.reshape(1,2)

array([[-1, -9],
       [ 1, -7],
       [ 3, -5],
       [ 5, -3]])

What happened?

Array `a` has shape `(4,2)` while the reshaped array `b` has shape `(1,2)`.
NumPy conceptually **expanded** array `b` to `(4,2)` by repeating it 4 times, then it computed the sum.

Broadcasting is a mechanism that allows NumPy to perform operations on arrays of different shapes. The shapes must be compatible: NumPy compares them axis by axis, starting from the last one, and two axis lengths are compatible when they are equal or when one of them is 1. If the shapes are not compatible, you will get a `ValueError`.

<img src="https://www.w3resource.com/w3r_images/python-numpy-exercise-124.svg" width="700">

In [45]:
# Another example
row_vector = np.array([10, 20, 30]).reshape(1, 3)  # shape: (1,3)
col_vector = np.array([1, 2, 3]).reshape(3, 1)  # shape: (3,1)

print('row_vector:\n', row_vector)
print('col_vector:\n', col_vector)

print('row_vector + col_vector:\n', row_vector + col_vector) # (1,3) + (3,1) -> (3,3)

row_vector:
 [[10 20 30]]
col_vector:
 [[1]
 [2]
 [3]]
row_vector + col_vector:
 [[11 21 31]
 [12 22 32]
 [13 23 33]]
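The compatibility rule can be checked without doing any arithmetic. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available, and also shows an incompatible pair of shapes:

```python
import numpy as np

a = np.arange(8).reshape(4, 2)
b = np.array([-1, -10])          # shape (2,)

# Shapes are compared from the last axis backwards, so (2,) is treated as (1, 2)
same = np.array_equal(a + b, a + b.reshape(1, 2))  # True

print(np.broadcast_shapes((4, 2), (1, 2)))  # (4, 2)
print(np.broadcast_shapes((1, 3), (3, 1)))  # (3, 3)

# (4, 2) and (3,) are incompatible: the last axes have lengths 2 and 3
try:
    a + np.array([1, 2, 3])
except ValueError as error:
    print("broadcast failed:", error)
```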


## Views and copies

When doing array operations, arrays are sometimes copied and sometimes not, depending on the operation.

In [46]:
a = np.zeros(5) # equivalent to np.array([0., 0., 0., 0., 0.])
a

array([0., 0., 0., 0., 0.])

Just using the assignment operator `=` doesn't create a copy!

In [47]:
b = a
b[0] = 77
print('a: ', a)
print('b: ', b)

a:  [77.  0.  0.  0.  0.]
b:  [77.  0.  0.  0.  0.]


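The same happens when slicing and reshaping: slicing returns a view that shares its data with the original array, and reshaping does too whenever possible.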
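A minimal sketch of this behaviour, using `np.shares_memory` to check whether two arrays use the same underlying data:

```python
import numpy as np

a = np.zeros(5)

s = a[1:3]            # slicing returns a view of a
s[0] = 5              # writes through to a[1]

r = a.reshape(5, 1)   # reshaping a contiguous array also returns a view
r[4, 0] = 9           # writes through to a[4]

print(a)                        # [0. 5. 0. 0. 9.]
print(np.shares_memory(a, s))   # True
```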

Other operations instead create a copy automatically:

In [48]:
a = np.zeros(5)
b = a + 1
b[0] = 77
print('a: ', a)
print('b: ', b)

a:  [0. 0. 0. 0. 0.]
b:  [77.  1.  1.  1.  1.]


If you want an independent copy of an array, you can use the `.copy()` method.
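For example, in this short sketch, modifying the copy leaves the original untouched:

```python
import numpy as np

a = np.zeros(5)
b = a.copy()   # b owns its own data
b[0] = 77

print('a: ', a)   # a:  [0. 0. 0. 0. 0.]
print('b: ', b)   # b:  [77.  0.  0.  0.  0.]
print(np.shares_memory(a, b))  # False
```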