### Representing data for Machine Learning in Python: Introduction to Numpy

In the lecture, we saw that the "language" of Machine Learning is vectors: Machine Learning models use vectors to represent inputs and outputs.

To be more precise, the "language" of Machine Learning is arrays.

An array is a grid of numbers that can have any number of dimensions.

A 0-dimensional array is just a single number:

$A = a$

A 1-dimensional array is a vector:

$A = [a_1, a_2, a_3]$

A 2-dimensional array is a matrix, which can be thought of as a stack of vectors:

$ 
A =  \begin{pmatrix}
a_{1}^1 & a_{2}^1 & \cdots & a_{n}^1 \\
a_{1}^2 & a_{2}^2 & \cdots & a_{n}^2 \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{1}^d & a_{2}^d & \cdots & a_{n}^d 
\end{pmatrix}
$




A 3-dimensional array is a "cube" of data, which can be thought of as a stack of matrices:

![3D%20matrix.png](https://raw.githubusercontent.com/TristHas/ClassMaterial/main/3D%20matrix.png)

In Machine Learning, we use these arrays to represent our input and output data.

In Python, we use the module `numpy` to represent arrays.

Let's see how these arrays work.

## Install and imports

First, we need to import the module numpy.

In [57]:
import numpy as np

## Using arrays to represent data

Imagine that I want to represent the evolution of the Bitcoin price over one week:

- Monday: 2
- Tuesday: 7
- Wednesday: 5
- Thursday: 3
- Friday: 8


Using the standard data types, I could define a "price" variable that represents this information as a list:

In [58]:
price = [2, 7, 5, 3, 8]
print(price)

[2, 7, 5, 3, 8]


Here, my variable price is a list containing the 5 daily Bitcoin prices.

In [59]:
# I can show the type and the size of the list using the built-in functions type and len
type(price), len(price)

(list, 5)

I can also represent this information using a numpy array:

In [60]:
price = np.array([2, 7, 5, 3, 8]) # Create a numpy array representing the bitcoin price vector
price

array([2, 7, 5, 3, 8])

In [61]:
type(price), price.shape

(numpy.ndarray, (5,))

Now, my variable price is a 1-dimensional array.

In mathematical terms, both lists and 1-dimensional arrays represent vectors.

Compared to lists, arrays can naturally represent higher-dimensional data.

For example, consider the price of Bitcoin across the first three weeks of August 2023.

I want to represent this data as a matrix, with each row representing a different week and each column representing a different day of the week.

I can do that using numpy as follows:

In [62]:
prices = np.array(
    [ [2, 7, 5, 3, 8],
      [9, 5, 8, 2, 3],
      [4, 6, 6, 8, 9]
    ])
print(prices)

[[2 7 5 3 8]
 [9 5 8 2 3]
 [4 6 6 8 9]]


In [63]:
type(prices), prices.shape

(numpy.ndarray, (3, 5))

My variable prices is now a 2-dimensional array: a matrix.

This matrix has 3 rows and 5 columns, which can be seen looking at the variable's shape above.

The number of dimensions represents the number of "axes" along which the data is defined:

For matrices, the two axes are rows and columns, while vectors have only one axis.
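As a small sketch of the relation between axes and shape, the snippet below rebuilds the same toy price data and reads the number of axes from the `ndim` attribute. The variable names `n_weeks` and `n_days` are introduced here only for illustration.

```python
import numpy as np

# Same toy data as above: 3 weeks (rows) x 5 weekdays (columns).
prices = np.array([[2, 7, 5, 3, 8],
                   [9, 5, 8, 2, 3],
                   [4, 6, 6, 8, 9]])
price = prices[0]  # the first week only: a 1-dimensional array

# ndim is the number of axes; it always equals len(shape).
print(price.ndim, price.shape)    # 1 (5,)
print(prices.ndim, prices.shape)  # 2 (3, 5)

# The shape of a matrix unpacks into (number of rows, number of columns).
n_weeks, n_days = prices.shape
print(n_weeks, n_days)            # 3 5
```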

Arrays can have any number of dimensions. For example, a 3-dimensional array can be thought of as a cube of data:



In [10]:
x = np.ones((3, 4, 5)) # Create a 3-dimensional array containing the value 1 with shape 3, 4, 5
print(x.shape)

(3, 4, 5)


The above array can be thought of as 3 matrices of shape (4, 5) stacked along a third dimension, as the printed output below shows:

In [11]:
print(x)

[[[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]
  [1. 1. 1. 1. 1.]]]


### Indexing

I can access the values of an array by an operation called indexing.

For example, let's access the first value of the price variable.

In Python, the first element has index 0, so I can access the Monday Bitcoin price by indexing as follows:

In [12]:
price[0] # Get the first element of the array

2

Similarly, I can index a matrix by selecting its first row as follows:

In [14]:
print(prices)

[[2 7 5 3 8]
 [9 5 8 2 3]
 [4 6 6 8 9]]


In [15]:
print(prices[0])

[2 7 5 3 8]


Here I selected the first row of the matrix. A matrix row is a vector, or a 1-dimensional array:

In [16]:
print(prices.shape)
print(prices[0].shape)

(3, 5)
(5,)


I can select the first element of the first row of the matrix as follows:

In [17]:
prices[0,0]

2

This gives me the price of Bitcoin on Monday of the first week.
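The same `[row, column]` pattern works for any entry of the matrix. As a short sketch on the same toy data, the snippet below reads Friday of the third week; the name `third_week_friday` is only illustrative.

```python
import numpy as np

prices = np.array([[2, 7, 5, 3, 8],
                   [9, 5, 8, 2, 3],
                   [4, 6, 6, 8, 9]])

# prices[row, column]: rows are weeks, columns are weekdays, both counted from 0.
third_week_friday = prices[2, 4]
print(third_week_friday)  # 9

# Indexing in two steps gives the same value: first take the row, then the element.
print(prices[2][4] == prices[2, 4])  # True
```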

### Exercise:

Index the variable prices so as to return the price of Bitcoin on Wednesday of the second week 

## Operations on arrays

Arrays make it very convenient to perform operations on data.

For example, I can add up two vectors as follows:

In [35]:
x = np.ones(6) # Create a vector of six elements with value 1
y = np.ones(6) # Create a vector of six elements with value 1

In [23]:
print(x)

[1. 1. 1. 1. 1. 1.]


In [24]:
print(y)

[1. 1. 1. 1. 1. 1.]


In [25]:
z = x + y # Add the two vectors
print(z)

[2. 2. 2. 2. 2. 2.]
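This element-wise behaviour is specific to arrays. As a brief comparison (the variable names are illustrative), the `+` operator on plain Python lists concatenates them instead of adding their elements:

```python
import numpy as np

a_list = [1, 1, 1]
b_list = [1, 1, 1]

# On lists, + glues the two lists together.
print(a_list + b_list)  # [1, 1, 1, 1, 1, 1]

a = np.array(a_list)
b = np.array(b_list)

# On arrays, + adds the elements position by position.
print(a + b)  # [2 2 2]
```

This is one practical reason to convert data to arrays before doing vector arithmetic.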


But be careful: some operations cannot be performed.

For example, I cannot add a vector with 3 elements to a vector with 2 elements.

Only vectors with the same number of elements can be added to each other.

In [26]:
x = np.ones(3)
y = np.ones(2)

In [27]:
print(x)

[1. 1. 1.]


In [28]:
print(y)

[1. 1.]


In [29]:
z = x + y

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

Above, I got an error because adding vectors of different sizes (that is, different array shapes) is not possible.
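One way to avoid this error is to compare the shapes before adding. The sketch below shows that check, and also how the error can be caught with `try`/`except`; the flag name `caught` is only illustrative.

```python
import numpy as np

x = np.ones(3)
y = np.ones(2)

# Option 1: compare the shapes before adding.
if x.shape == y.shape:
    z = x + y
else:
    z = None
    print("Cannot add: shapes differ:", x.shape, y.shape)

# Option 2: attempt the addition and handle the error NumPy raises.
caught = False
try:
    x + y
except ValueError as err:
    caught = True
    print("ValueError:", err)
```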

Most common operations on vectors and matrices can be written in NumPy.

For example, here is how to perform a vector-matrix multiplication:

In [36]:
x = np.ones(6)     # Create a vector of six elements with value 1
w = np.ones((6,3)) # Create a matrix of six rows and 3 columns with every element of value 1

In [37]:
print(x)

[1. 1. 1. 1. 1. 1.]


In [38]:
print(w)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [39]:
y = x@w

In [40]:
print(y)

[6. 6. 6.]
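Because every entry above is 1, the result mostly shows the shape rule: a vector of shape (6,) times a matrix of shape (6, 3) gives a vector of shape (3,). The sketch below uses small made-up values (not from the lecture data) so that each output can be checked by hand.

```python
import numpy as np

x = np.array([1, 2])          # shape (2,)
w = np.array([[1, 0, 2],
              [0, 1, 3]])     # shape (2, 3)

y = x @ w                     # vector-matrix product
print(y)        # [1 2 8]
print(y.shape)  # (3,)

# The same result written out: output j is the sum over i of x[i] * w[i, j].
manual = np.array([sum(x[i] * w[i, j] for i in range(2)) for j in range(3)])
print(manual)   # [1 2 8]
```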


It is also possible to add matrices:

In [42]:
y = w + w # Add w with itself
print(y)

[[2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]]


Or to multiply a matrix by a number:

In [43]:
z = 5 * w
print(z)

[[5. 5. 5.]
 [5. 5. 5.]
 [5. 5. 5.]
 [5. 5. 5.]
 [5. 5. 5.]
 [5. 5. 5.]]


### Using Numpy for Machine Learning

To solve standard Classification and Regression problems, the first step is to create a dataset of inputs $x$ and outputs $y$ as NumPy arrays.

The $x$ inputs are typically high-dimensional vectors, and the $y$ outputs are typically either a number (for regression) or a class (for classification).

Given $x$ and $y$, one can then train a model to output $y$ from $x$.

In Python, we represent the inputs as a matrix X in which each row represents a different input, and each column represents a different variable of the input.

We represent the outputs as a vector Y containing one output for each row of the input matrix X.

For example, below is the price of Bitcoin on Saturday for the first three weeks of August 2023:

In [52]:
Y = np.array([6, 7, 9])
X = prices

In [53]:
print(Y)

[6 7 9]


In [54]:
print(X)

[[2 7 5 3 8]
 [9 5 8 2 3]
 [4 6 6 8 9]]


Then I can use the data $X$ and $Y$ to train a Machine Learning model $f_w(x)$ to predict the Saturday Bitcoin price from the prices of the preceding weekdays: $y=f_w(x)$.

Each row of the matrix $X$ represents one input vector $x$ containing the daily Bitcoin prices, and each element of the vector $Y$ contains the corresponding Saturday price.

For example, the data $(x,y)$ for the first week of August is as follows:

In [56]:
print("Input x:")
print(X[0])
print("Output y:")
print(Y[0])

Input x:
[2 7 5 3 8]
Output y:
6
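To make the notation $f_w(x)$ more concrete, here is a minimal sketch under a strong assumption: the model is a simple weighted sum of the five weekday prices, and the weights `w` are picked by hand (a plain average), not trained. It only illustrates how $X$, $Y$ and a vector-matrix product fit together.

```python
import numpy as np

X = np.array([[2, 7, 5, 3, 8],
              [9, 5, 8, 2, 3],
              [4, 6, 6, 8, 9]])
Y = np.array([6, 7, 9])

# There must be exactly one output per input row.
assert X.shape[0] == Y.shape[0]

# Hypothetical, untrained weights: each weekday counts for 1/5 (an average).
w = np.full(5, 1 / 5)

# Apply the model to every row of X at once.
predictions = X @ w
print(predictions)      # approximately 5.0, 5.4 and 6.6

# Compare the predictions with the true Saturday prices.
print(Y - predictions)  # prediction errors, one per week
```

Training a model amounts to choosing `w` so that these errors become small.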


In the following lecture, we will see how to collect data as input matrices $X$ and output vectors $Y$, and how to train a model on this data.