### Representing data for Machine Learning in Python: Introduction to Numpy

In the lecture, we saw that the "language" of Machine Learning is vectors: Machine Learning models use vectors to represent their inputs and outputs.

To be more precise, the "language" of Machine Learning is arrays.

An array is a grid of numbers that can have any number of dimensions.

A 0-dimensional array is just a single number:

A = a

A 1-dimensional array is a vector: 

A = $ [a_1, a_2, a_3] $

A 2-dimensional array is a matrix:

$ 
A =  \begin{pmatrix}
a_{1}^1 & a_{2}^1 & \cdots & a_{n}^1 \\
a_{1}^2 & a_{2}^2 & \cdots & a_{n}^2 \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{1}^d & a_{2}^d & \cdots & a_{n}^d 
\end{pmatrix}
$




A 3-dimensional array can be thought of as a "cube" of data:

![Screen Shot 2023-06-30 at 9.16.29.png](attachment:f23c7ffd-fa02-4cd5-9949-df163d8a8919.png)

In Machine Learning, we use these arrays to represent our input and output data.

In Python, we use the module "numpy" to represent arrays.

Let's see how these arrays work.

## Install and imports

First, we need to install the numpy module and import it. (You can execute the code below and ignore it for now.)

In [19]:
import numpy as np
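With numpy imported, the kinds of arrays described in the introduction can be sketched as follows. This is only an illustration: the variable names and values (`scalar`, `vector`, `matrix`, `cube`) are made up for the example. The attribute `ndim` gives the number of dimensions and `shape` gives the size along each dimension.

```python
import numpy as np

scalar = np.array(7)                      # 0-dimensional: a single number
vector = np.array([1, 2, 3])              # 1-dimensional: a vector
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])            # 2-dimensional: a matrix
cube = np.zeros((2, 3, 4))                # 3-dimensional: a "cube" filled with zeros

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("cube", cube)]:
    # ndim is the number of dimensions, shape is the size along each one
    print(name, arr.ndim, arr.shape)
```

Note that the shape of a 0-dimensional array is the empty tuple `()`, since there is no axis to measure.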

## Using array to represent data

Imagine that I want to represent the evolution of the Bitcoin price:

- Monday: 2
- Tuesday: 7
- Wednesday: 5
- Thursday: 3
- Friday: 8


Using the standard data types, I could define a "price" variable that represents this information as a list:

In [20]:
price = [2, 7, 5, 3, 8]
print(price)

[2, 7, 5, 3, 8]


Here my variable price is a list containing the 5 daily Bitcoin prices.

In [21]:
# I can show the type and length of the list using the built-in functions type and len
type(price), len(price)

(list, 5)

I can also represent this information using a numpy array:

In [22]:
price = np.array([2, 7, 5, 3, 8])
price

array([2, 7, 5, 3, 8])

In [23]:
type(price), price.shape

(numpy.ndarray, (5,))

Now, my variable price is a 1-dimensional array.

In mathematical terms, both lists and 1-dimensional arrays represent vectors.

Compared to lists, arrays can conveniently represent higher-dimensional data.

For example, consider the price of Bitcoin across the first three weeks of August 2023.

I want to represent this data as a matrix, with each row representing a different week and each column representing a different day of the week.

I can do that using numpy as follows:

In [29]:
prices = np.array(
    [ [2, 7, 5, 3, 8],
      [9, 5, 8, 2, 3],
      [4, 6, 6, 8, 9]
    ])
print(prices)

[[2 7 5 3 8]
 [9 5 8 2 3]
 [4 6 6 8 9]]


In [30]:
type(prices), prices.shape

(numpy.ndarray, (3, 5))

My variable prices is now a 2-dimensional array: a matrix.

This matrix has 3 rows and 5 columns, as can be seen from the variable's shape above.

The number of dimensions represents the number of "axes" along which the data is defined:

For matrices, the two axes are rows and columns, while vectors have only one axis.
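To make the notion of axes concrete, here is a small sketch that reuses the `prices` matrix from above. Summing along an axis is used purely as an illustration and the names `weekly_total` and `weekday_total` are made up for the example. In numpy, axis 0 indexes the rows and axis 1 indexes the columns.

```python
import numpy as np

prices = np.array([[2, 7, 5, 3, 8],
                   [9, 5, 8, 2, 3],
                   [4, 6, 6, 8, 9]])

print(prices.ndim)         # number of axes of the matrix: 2
print(prices[0].ndim)      # a single row is a vector with 1 axis

# axis=1 sums across the columns of each row: one total per week
weekly_total = prices.sum(axis=1)
# axis=0 sums across the rows of each column: one total per weekday
weekday_total = prices.sum(axis=0)

print(weekly_total)        # [25 27 33]
print(weekday_total)       # [15 18 19 13 20]
```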

Arrays can have any number of dimensions. For example, a three-dimensional array can be thought of as a cube of data:



In [35]:
x = np.ones((3, 4, 5)) # Create a 3-dimensional array containing the value 1 with shape 3, 4, 5
print(x.shape)

(3, 4, 5)


The above array can be thought of as 3 matrices of shape (4, 5) stacked along a third dimension, as can be seen by printing it:

In [36]:
print(x)

[[[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]]
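One way to check the "stack of matrices" picture is to look at the shapes obtained when fixing the leading axes one at a time. The sketch below uses indexing, which the next section introduces in more detail.

```python
import numpy as np

x = np.ones((3, 4, 5))

print(len(x))          # size of the first axis: 3 stacked matrices
print(x[0].shape)      # the first stacked matrix has shape (4, 5)
print(x[0, 1].shape)   # one row of that matrix is a vector of shape (5,)
print(x[0, 1, 2])      # a single value: 1.0
```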


### Indexing

I can access the values of an array by indexing it.

For example, let's select the first value of the price variable.

In Python, the first element has index 0, so I can get the Monday Bitcoin price by writing:

In [38]:
price[0] # Get the first element of the array

2

Similarly, I can index a matrix by selecting its first row as follows:

In [39]:
prices[0]

array([2, 7, 5, 3, 8])

Here I selected the first row of the matrix. A matrix row is a vector, or a 1-dimensional array:

In [42]:
print(prices.shape)
print(prices[0].shape)

(3, 5)
(5,)


I can select the first element of the first row of the matrix as follows:

In [43]:
prices[0,0]

2

This gives me the price of Bitcoin on Monday of the first week.
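The same row-then-column pattern works for any entry of the matrix. As a hedged sketch (the variable names are made up for the example), here are a few other selections on `prices`, including negative indices and the `:` slice, which keeps a whole axis:

```python
import numpy as np

prices = np.array([[2, 7, 5, 3, 8],
                   [9, 5, 8, 2, 3],
                   [4, 6, 6, 8, 9]])

friday_week3 = prices[2, 4]    # row index 2 (third week), column index 4 (Friday)
last_friday = prices[-1, -1]   # negative indices count from the end
mondays = prices[:, 0]         # ":" keeps every row, 0 picks the Monday column

print(friday_week3)   # 9
print(last_friday)    # 9
print(mondays)        # [2 9 4]
```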

### Exercise:

Index the variable prices so as to return the price of Bitcoin on Wednesday of the second week 

## Operations on arrays

Arrays are very convenient for performing operations on data.

For example, I can add up two vectors as follows:

In [44]:
x = np.ones(6)
y = np.ones(6)

In [47]:
print(x)

[1. 1. 1. 1. 1. 1.]


In [48]:
print(y)

[1. 1. 1. 1. 1. 1.]


In [50]:
z = x + y
print(z)

[2. 2. 2. 2. 2. 2.]
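The addition above is performed element by element, which is different from how `+` behaves on plain Python lists. The short comparison below uses made-up values to show the difference:

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]
print(a_list + b_list)      # lists are concatenated: [1, 2, 3, 10, 20, 30]

a = np.array(a_list)
b = np.array(b_list)
print(a + b)                # arrays are added element by element: [11 22 33]
print(a * b)                # multiplication is element-wise too: [10 40 90]
print(2 * a)                # a single number is applied to every element: [2 4 6]
```

This element-wise behaviour is what makes arrays match the usual mathematical definition of vector addition.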


But be careful, some operations do not make sense.

For example, I cannot add a vector of dimension 3 to a vector of dimension 2.

Only vectors of the same dimension can be added.

In [51]:
x = np.ones(3)
y = np.ones(2)

In [52]:
print(x)

[1. 1. 1.]


In [53]:
print(y)

[1. 1.]


In [54]:
z = x + y

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

Above, I got an error because adding vectors of different dimensions (= different array shapes) is not possible.
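In a longer program, an uncaught error like this one stops execution. A minimal sketch of two ways to deal with it is shown below: comparing the shapes before adding, or catching the `ValueError`. Setting `z` to `None` on failure is just a choice made for this example.

```python
import numpy as np

x = np.ones(3)
y = np.ones(2)

# Comparing shapes first tells us whether the addition is well defined
print(x.shape == y.shape)   # False

try:
    z = x + y
except ValueError as err:
    z = None                # placeholder value chosen for this example
    print("Addition failed:", err)
```

Note that numpy does accept some combinations of different shapes through a mechanism called broadcasting (for example, adding a single number to every element of a vector), but shapes (3,) and (2,) are not compatible.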

The usual operations on vectors and matrices can be written in numpy.

For example, here is how to perform a vector-matrix multiplication:

In [56]:
x = np.ones(6)
w = np.ones((6,3))

In [61]:
print(x)

[1. 1. 1. 1. 1. 1.]


In [62]:
print(w)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [63]:
y = x@w

In [64]:
print(y)

[6. 6. 6.]
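To see what `@` computes, the sketch below repeats the product on smaller, made-up values and compares it with the same sum written out by hand. A vector of shape (3,) multiplied by a matrix of shape (3, 2) gives a vector of shape (2,): the length of the vector must match the number of rows of the matrix.

```python
import numpy as np

x = np.array([1., 2., 3.])          # shape (3,)
w = np.array([[1., 0.],
              [0., 1.],
              [2., 2.]])            # shape (3, 2)

y = x @ w                           # shape (2,)

# The same result by hand: entry j is the sum over i of x[i] * w[i, j]
y_manual = np.zeros(w.shape[1])
for j in range(w.shape[1]):
    for i in range(w.shape[0]):
        y_manual[j] += x[i] * w[i, j]

print(y)          # [7. 8.]
print(y_manual)   # [7. 8.]
```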


### Using Numpy for Machine Learning

To solve standard Classification and Regression problems, the first step is to assemble a dataset of inputs $x$ and outputs $y$.

The inputs $x$ are typically high-dimensional vectors, and the outputs $y$ are typically either a number (for regression) or a class (for classification).

Given $x$ and $y$, one can then train a model to predict $y$ from $x$.

In Python, we represent the inputs as a matrix in which each row represents a different input, and each column represents a different variable of the input.

We represent the outputs $y$ as a vector containing the output for each row of the input matrix $x$.

For example, below is the price of Bitcoin on Saturday for the first three weeks of August 2023:

In [67]:
y = np.array([6, 7, 9])
x = prices

In [69]:
print(y)

[6 7 9]


In [70]:
print(x)

[[2 7 5 3 8]
 [9 5 8 2 3]
 [4 6 6 8 9]]


Then I can use the data x and y to train a Machine Learning model $f_w(x)$ to predict the Saturday Bitcoin price from the prices of the preceding weekdays: $y=f_w(x)$
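As a rough illustration of what "training" could look like, the sketch below assumes the simplest possible model, a linear one, $f_w(x) = x \cdot w$, and picks the weights $w$ by least squares with `np.linalg.lstsq`. The choice of a linear model and of least squares is an assumption made for this example, not something prescribed above.

```python
import numpy as np

x = np.array([[2, 7, 5, 3, 8],
              [9, 5, 8, 2, 3],
              [4, 6, 6, 8, 9]], dtype=float)   # one row per week, Monday to Friday
y = np.array([6, 7, 9], dtype=float)            # Saturday price of each week

# There must be exactly one output per row of the input matrix
assert x.shape[0] == y.shape[0]

# Assumed model: f_w(x) = x @ w, with w chosen by least squares
w, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)

predictions = x @ w
print(w.shape)          # one weight per weekday: (5,)
print(predictions)      # predicted Saturday prices for the three weeks
```

With only 3 examples and 5 weights, the model can reproduce the training outputs exactly, so this toy fit says nothing about how well it would predict a new week. A real dataset would need many more rows.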