## ML1 Notebook

Each topic in this part of the course will have an accompanying Python notebook, where we can implement the concepts introduced using code. 

The key points from the video are:
- We can represent data points using vectors
- These can be stacked into a matrix to represent a whole dataset
- We can standardise our data to prevent large measurements from dominating 

We will examine these concepts by exploring the Iris flower dataset.

### Representing a datapoint

Many of you will be familiar with the Iris flower dataset from PSE2. It was introduced by Ronald Fisher in 1936. It consists of 150 data points, each corresponding to a particular iris flower.

Each data point consists of 4 measurements (in cm): the sepal length, sepal width, petal length, and petal width of an iris.

<img src="./iris.png" title="iris"/>

We can represent this using a 4D vector. In Python, we represent vectors using NumPy arrays.

In [1]:
import numpy as np # Import the numpy package

x = np.array([3.6, 4.1, 1.5, 0.7]) # Create a numpy array to represent a vector.
print(f'Our data point x is represented by {x}') # F-strings are a nice way to print!

Our data point x is represented by [3.6 4.1 1.5 0.7]
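
As a small sketch of how such a vector can be inspected, the snippet below rebuilds the same `x` and reads off its shape and individual components. The ordering of the components (sepal length first, petal width last) follows the list of measurements above.

```python
import numpy as np

# Same illustrative data point as above: four measurements in cm
x = np.array([3.6, 4.1, 1.5, 0.7])

print(x.shape)  # (4,): a 1-D array with four components
print(x[0])     # 3.6, the first component (sepal length)
print(x[-1])    # 0.7, the last component (petal width)
```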


### Representing a dataset

We now have `x` which contains the measurements for the flower above. This is a **single data point** represented by a vector. We can represent an entire dataset simply by stacking the vectors for all the different data points to form a matrix.

<img src="./irisdataset.png" title="irisdataset"/>



This has already been done for us in the `sklearn` package for the Iris dataset.

In [2]:
import sklearn.datasets # If you are running this locally, then `pip install scikit-learn` in your Python environment.
iris = sklearn.datasets.load_iris() # Load the iris dataset
X = iris.data # Assign the dataset matrix to X
print(f'Our dataset X is represented by \n {X}')

Our dataset X is represented by 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 ...

We now have a matrix with 150 rows, where each **row** corresponds to a single data point. Each **column** corresponds to one of the four measurements for each data point; these are consistent (i.e. each column is measuring the same thing).
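
To make the stacking step concrete, here is a minimal sketch that builds a tiny dataset matrix by hand from three of the rows printed above. The names `x1`, `x2`, `x3` and `X_small` are made up for this illustration; `np.stack` is one of several NumPy functions that could be used here.

```python
import numpy as np

# Three data points copied from the first rows of the Iris output above
x1 = np.array([5.1, 3.5, 1.4, 0.2])
x2 = np.array([4.9, 3.0, 1.4, 0.2])
x3 = np.array([4.7, 3.2, 1.3, 0.2])

X_small = np.stack([x1, x2, x3])  # one row per data point

print(X_small.shape)  # (3, 4): 3 data points, 4 measurements each
print(X_small[1])     # second row: the data point x2
print(X_small[:, 0])  # first column: the sepal length of every data point
```

Indexing with `X_small[i]` picks out a data point, while `X_small[:, j]` picks out one measurement across all data points.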

### Standardising your data

In the video we introduced the concept of **standardising data**. This can prevent measurements with large values from dominating the others. A quick look at the dataset above shows that the first column (sepal length) has the largest values.

Standardising is done by computing the mean and standard deviation of each measurement (column) across data points, then subtracting each column's mean from that column and dividing by its standard deviation. We can do this concisely in Python.

In [3]:
column_means = X.mean(0) # axis 0: average over the rows (data points), giving one mean per column
print(f'The column means are {np.round(column_means,2)}') # Rounding to 2 DP

column_stds = X.std(0) # axis 0: one standard deviation per column (population std, ddof=0)
print(f'The column stds are {np.round(column_stds,2)}') # Rounding to 2 DP



The column means are [5.84 3.06 3.76 1.2 ]
The column stds are [0.83 0.43 1.76 0.76]
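
The `0` passed to `mean` and `std` is the axis argument, which is easy to mix up. The sketch below uses a tiny made-up matrix `M` (not the Iris data) to show the difference between `axis=0` and `axis=1`, and between NumPy's default population standard deviation and the sample version obtained with `ddof=1`.

```python
import numpy as np

# A tiny made-up matrix: 3 data points (rows), 2 measurements (columns)
M = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = M.mean(axis=0)          # one mean per column: 2 and 20
row_means = M.mean(axis=1)          # one mean per row: 5.5, 11 and 16.5
pop_stds = M.std(axis=0)            # population std (ddof=0), the NumPy default
sample_stds = M.std(axis=0, ddof=1) # sample std: 1 and 10

print(col_means, row_means)
print(pop_stds, sample_stds)
```

For standardising a dataset we want one statistic per measurement, which is why the notebook uses axis 0.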


Following the video, we can now subtract the mean and divide by the standard deviation in each column. Remember that when we divide, it is safer to add a small constant to the denominator to prevent division by zero!

<img src="./standardise.png" title="standardise"/>



In [4]:
eps = 1e-8 # A small constant to prevent division by zero
X_s = (X - column_means)/(column_stds + eps) # Standardise
X_s = np.round(X_s, 2) # Round to 2 d.p.
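
The Iris columns all have non-zero standard deviation, so the small constant makes no visible difference here. The sketch below uses a made-up matrix `B` with one constant column to illustrate the case it guards against. It also relies on NumPy broadcasting: the length-2 vectors of column statistics are applied to every row of the `(3, 2)` matrix.

```python
import numpy as np

# Made-up data whose second column is constant, so its standard deviation is 0
B = np.array([[1.0, 7.0],
              [2.0, 7.0],
              [3.0, 7.0]])

eps = 1e-8  # small constant, as in the cell above
B_s = (B - B.mean(axis=0)) / (B.std(axis=0) + eps)

print(B.std(axis=0))  # the second entry is 0
print(B_s)            # the constant column becomes all zeros, with no NaNs
```

Without `eps`, the second column would be computed as 0/0 and come out as NaN.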

Has this done the job? Let's look at the new column means and stds.

In [5]:
column_means = X_s.mean(0) # axis 0: one mean per column
print(f'The column means are {np.round(column_means,2)}')

column_stds = X_s.std(0) # axis 0: one standard deviation per column
print(f'The column stds are {np.round(column_stds,2)}')

The column means are [ 0.  0. -0. -0.]
The column stds are [1. 1. 1. 1.]


Our dataset is now standardised!
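
To see what "dominating" can mean in practice, here is a small sketch on made-up numbers (not the Iris data). It assumes we care about squared differences between data points, as in a Euclidean distance; the matrix `A` and the variable names are invented for the illustration.

```python
import numpy as np

# Made-up data: column 0 is on a scale of thousands, column 1 on a scale of units
A = np.array([[1000.0, 1.0],
              [1100.0, 5.0],
              [1200.0, 1.0],
              [1300.0, 5.0]])

# Per-column squared difference between the first two data points
raw_sq_diff = (A[0] - A[1]) ** 2
print(raw_sq_diff)  # column 0 contributes 10000, column 1 only 16

# Standardise each column, then repeat the comparison
A_s = (A - A.mean(axis=0)) / A.std(axis=0)
std_sq_diff = (A_s[0] - A_s[1]) ** 2
print(std_sq_diff)  # roughly 0.8 and 4: column 1 is no longer swamped
```

Before standardising, almost all of the distance comes from the column with the larger numbers; afterwards, both columns are measured relative to their own spread.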

In [6]:
print(X_s)

[[-0.9   1.02 -1.34 -1.32]
 [-1.14 -0.13 -1.34 -1.32]
 [-1.39  0.33 -1.4  -1.32]
 [-1.51  0.1  -1.28 -1.32]
 [-1.02  1.25 -1.34 -1.32]
 [-0.54  1.94 -1.17 -1.05]
 [-1.51  0.79 -1.34 -1.18]
 [-1.02  0.79 -1.28 -1.32]
 [-1.75 -0.36 -1.34 -1.32]
 [-1.14  0.1  -1.28 -1.45]
 [-0.54  1.48 -1.28 -1.32]
 [-1.26  0.79 -1.23 -1.32]
 [-1.26 -0.13 -1.34 -1.45]
 [-1.87 -0.13 -1.51 -1.45]
 [-0.05  2.17 -1.45 -1.32]
 [-0.17  3.09 -1.28 -1.05]
 [-0.54  1.94 -1.4  -1.05]
 [-0.9   1.02 -1.34 -1.18]
 [-0.17  1.71 -1.17 -1.18]
 [-0.9   1.71 -1.28 -1.18]
 [-0.54  0.79 -1.17 -1.32]
 [-0.9   1.48 -1.28 -1.05]
 [-1.51  1.25 -1.57 -1.32]
 [-0.9   0.56 -1.17 -0.92]
 [-1.26  0.79 -1.06 -1.32]
 [-1.02 -0.13 -1.23 -1.32]
 [-1.02  0.79 -1.23 -1.05]
 [-0.78  1.02 -1.28 -1.32]
 [-0.78  0.79 -1.34 -1.32]
 [-1.39  0.33 -1.23 -1.32]
 [-1.26  0.1  -1.23 -1.32]
 [-0.54  0.79 -1.28 -1.05]
 [-0.78  2.4  -1.28 -1.45]
 [-0.42  2.63 -1.34 -1.32]
 [-1.14  0.1  -1.28 -1.32]
 [-1.02  0.33 -1.45 -1.32]
 [-0.42  1.02 -1.4  -1.32]
 ...

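The measurements aren't in cm any more (some of the values are negative!) but **this is not important**. Each column has simply been shifted and rescaled, so its values now count standard deviations from that column's mean. Positive values are larger than average and negative values are smaller than average.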
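
Because standardising is only a shift and a rescale, it can be undone as long as the column means and standard deviations are kept. The sketch below checks this on three rows copied from the Iris output earlier in the notebook; the names `X_small`, `means`, `stds` and `X_back` are made up for the illustration, and no rounding is applied.

```python
import numpy as np

# Three rows copied from the Iris output earlier in the notebook
X_small = np.array([[5.1, 3.5, 1.4, 0.2],
                    [4.9, 3.0, 1.4, 0.2],
                    [5.4, 3.9, 1.7, 0.4]])

means = X_small.mean(axis=0)
stds = X_small.std(axis=0)
eps = 1e-8

X_small_s = (X_small - means) / (stds + eps)  # standardise
X_back = X_small_s * (stds + eps) + means     # undo the standardisation

print(np.allclose(X_back, X_small))  # True: the original cm values are recovered
```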

In [7]:
np.save('iris_standardised', X_s) # Save the standardised dataset for later use (written to iris_standardised.npy)
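
When the saved file is needed again, it can be read back with `np.load`. The sketch below shows the round trip on a small stand-in array written to a temporary directory, so it does not touch the real file. Note that `np.save` appends the `.npy` extension, which must be included when loading.

```python
import os
import tempfile
import numpy as np

arr = np.array([[-0.9, 1.02], [-1.14, -0.13]])  # stand-in for X_s

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'iris_standardised')
    np.save(path, arr)               # written to 'iris_standardised.npy'
    loaded = np.load(path + '.npy')  # the extension is needed when loading

print(np.array_equal(arr, loaded))  # True
```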
