# Learning outcomes

When you've worked through the tasks and exercises in this notebook, you will have:

* built a machine learning model using a standard software library;
* run experiments to explore effects of regularization and data augmentation.

# Objectives




* To introduce you to 2 of the main (Python-based) software libraries we'll be using throughout the module:
> 1. scikit-learn (https://scikit-learn.org/stable/) - one of the most widely used machine learning libraries.
> 2. numpy (https://numpy.org/) - a very common library for mathematical functions.

>**Note**: It is your responsibility as a machine learning scientist to read the documentation for any library function you use and to thoroughly understand what it does, whether it validly serves your purpose, and which of its parameters you need to consider.

* To see some of the basic components of machine learning first hand - training data, test (i.e. unseen) data, and machine learning model (with weights and biases being the primary parameters that specify a model for most types of models):
>1. Data - today, we'll use the Iris dataset. You can read more about it here: https://scikit-learn.org/stable/datasets/toy_dataset.html.
>2. Model - we'll explore linear regression, regularization, and augmentation, which we covered in the Week 1 mini-videos.


# Section 1 - Set up imports and random number generator

In [None]:
%matplotlib inline

import numpy
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import copy


# Set up the random number generator
rng =  numpy.random.default_rng()

# Section 2 - Load the Iris dataset

In [None]:
from sklearn import datasets

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

print("The dimensions of the Iris feature matrix", iris_data.shape)

The dimensions of the Iris feature matrix (150, 4)


# Section 3 - Explore the Iris dataset

* Read about the Iris dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html
* What type of labels does it have (real continuous or categorical)? What kind of machine learning task is this type of label suited to, i.e. classification or regression?
* What is the feature dimensionality of the dataset, i.e. the number of features?
* How many data instances are there? What is the distribution of instances across classes?



---


* Select one of the features. What association does the selected feature have with the iris classes, with respect to differentiating between them (Hint - use a search engine to read about the Iris Setosa, Iris Versicolour, and Iris Virginica plants)?
* What factors do you think limited the number of data instances per class?
* How do you think the data was collected? What implication would this have for real world deployment of a model for automatic detection of iris classes based on this dataset?
* How do you think it was labelled? What kind of challenge might this pose for collection of more training data (and labels) for automatic detection of iris classes?

**Solution**

* The dataset has categorical labels, and so, classification is more appropriate for it, rather than regression.
* The Iris dataset has 4 features.
* There are 150 data instances, and an equal number of instances per class.
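
These answers can be checked directly from the arrays loaded in Section 2. The snippet below is a minimal sketch of one way to do so; the names `classes` and `counts` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
import numpy
from sklearn import datasets

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Integer labels taking a small set of distinct values indicate categorical classes
classes, counts = numpy.unique(iris_labels, return_counts=True)

print("Label dtype:", iris_labels.dtype)
print("Distinct label values:", classes)
print("Instances per class:", counts)
print("Number of features:", iris_data.shape[1])
print("Number of instances:", iris_data.shape[0])
```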

The last 4 sets of questions are open questions. You can discuss your thoughts with your peers or with a TA.


> **Note** - In good practice, machine learning (ML) development does not start with thinking about the learning algorithm but must begin with thinking about the real world problem that ML aims to address, as well as the dataset associated with the problem. You can listen here to an overview of the data-centric AI movement that encourages the approach of exploring and 'working on' the data (rather than only focusing on crafting the ML architecture) at the centre of developing a model for a given problem: https://datacentricai.org/blog/opening-remarks/.

# Section 4 - Split into training and test sets

In [None]:

# Randomly split the data into 50:50 training:test sets
rand_inds = numpy.arange(iris_labels.shape[0])
rng.shuffle(rand_inds)
split_point = int(0.5*iris_labels.shape[0])

training_data_x = iris_data[rand_inds[0:split_point], :]
training_labels_y = iris_labels[rand_inds[0:split_point]]
test_data_x = iris_data[rand_inds[split_point:iris_labels.shape[0]], :]
test_labels_y = iris_labels[rand_inds[split_point:iris_labels.shape[0]]]

print("Size of the training data:", training_data_x.shape)
print("Size of the test data:", test_data_x.shape)

Size of the training data: (75, 4)
Size of the test data: (75, 4)
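
scikit-learn also provides `train_test_split` in `sklearn.model_selection`, which performs a comparable shuffle-and-split in one call. The sketch below is an alternative to the manual indexing above, not the method this notebook uses; `random_state=1` is an arbitrary choice and the `alt_` variable names are illustrative only.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# 50:50 split with shuffling; random_state fixes the shuffle so it can be repeated
alt_train_x, alt_test_x, alt_train_y, alt_test_y = train_test_split(
    iris_data, iris_labels, test_size=0.5, shuffle=True, random_state=1
)

print("Size of the training data:", alt_train_x.shape)
print("Size of the test data:", alt_test_x.shape)
```

Because the two approaches shuffle differently, the instances that land in each half will generally not match those from the manual split, even with the same seed value.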


# Section 5 - Train linear regression model

In [None]:

print(training_data_x)

# Train a linear regression model
lr_model = linear_model.LinearRegression()
lr_model.fit(training_data_x, training_labels_y)
print("\nThe weight (w):",  lr_model.coef_)
print("The bias (b):",  lr_model.intercept_)

# Check the performance of the model on the data used to train it
training_pred_y = lr_model.predict(training_data_x)
print("\nMean squared error (error on training data): %.2f " % mean_squared_error(training_labels_y, training_pred_y))




[[5.7 3.  4.2 1.2]
 [7.9 3.8 6.4 2. ]
 [5.5 2.4 3.8 1.1]
 [4.4 3.  1.3 0.2]
 [4.8 3.  1.4 0.3]
 [6.7 3.  5.2 2.3]
 [5.7 2.9 4.2 1.3]
 [5.5 2.5 4.  1.3]
 [7.7 2.6 6.9 2.3]
 [4.9 3.  1.4 0.2]
 [5.1 3.5 1.4 0.2]
 [5.5 2.4 3.7 1. ]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.4 3.4 1.5 0.4]
 [4.8 3.  1.4 0.1]
 [4.8 3.4 1.9 0.2]
 [7.7 3.8 6.7 2.2]
 [6.4 3.2 5.3 2.3]
 [7.6 3.  6.6 2.1]
 [6.5 3.2 5.1 2. ]
 [5.2 3.4 1.4 0.2]
 [5.9 3.  5.1 1.8]
 [6.6 3.  4.4 1.4]
 [4.6 3.4 1.4 0.3]
 [4.4 2.9 1.4 0.2]
 [6.9 3.1 5.1 2.3]
 [4.7 3.2 1.3 0.2]
 [5.8 2.7 5.1 1.9]
 [4.9 2.4 3.3 1. ]
 [6.5 3.  5.2 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.5]
 [5.6 2.5 3.9 1.1]
 [6.1 3.  4.9 1.8]
 [4.9 3.6 1.4 0.1]
 [4.6 3.2 1.4 0.2]
 [7.7 3.  6.1 2.3]
 [5.5 4.2 1.4 0.2]
 [5.6 3.  4.1 1.3]
 [6.7 3.1 5.6 2.4]
 [5.7 2.8 4.5 1.3]
 [5.5 2.3 4.  1.3]
 [6.4 2.7 5.3 1.9]
 [4.4 3.2 1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.1 2.5 3.  1.1]
 [5.4 3.9 1.3 0.4]
 [6.8 2.8 4.8 1.4]
 [6.4 2.8 5.6 2.1]
 [6.1 2.9 4.7 1.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.7 1.

# Section 6 - Explore reproducibility


* Run Sections 4 and 5 code multiple times (e.g. 3 times) - each time, copy and paste your outputs (training data, weight, bias, mean squared error) somewhere so that you can compare outputs across the multiple runs. What do you notice? What is the implication, and how could you address it?

**Solution**

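* As no random seed is set in Section 1, the random split produced by the Section 4 code is different in each run, and this changes everything that depends on it in Section 5 (the training data, weights, bias, and mean squared error).
* To address this, you need to replace

>`rng = numpy.random.default_rng()`

  in Section 1 with

>```
> random_seed = 1
> rng = numpy.random.default_rng(random_seed)
>```

* You can set *random_seed* to any integer. Rerun the Section 1 cell before rerunning Sections 4 and 5, since the generator only restarts from the seed when it is re-created.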

* There is random number generation at several different levels within available software libraries for building and evaluating machine learning models. It is your responsibility as a machine learning scientist to take note of the points in the library you are using (or in your code) where randomness is introduced and address it so as to have reproducible models and model evaluation results.

* Addressing it is usually as simple as applying a random seed (e.g. for both Python and the specific machine learning library functions being used). However, note that some machine learning libraries, e.g. TensorFlow 2.11 GPU, have in them randomness that cannot be completely eliminated by the library user. You should look for the section on reproducibility in the documentation for the machine learning library that you want to use to be sure how randomness may (or may not) be addressed for the given library.
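
The effect of seeding can be seen in isolation with a small sketch. The helper `shuffled_indices` is hypothetical (it is not part of the notebook) and simply repeats the shuffle from Section 4 with a freshly created generator each time.

```python
import numpy


def shuffled_indices(n, seed=None):
    """Shuffle the indices 0..n-1 with a newly created generator."""
    rng = numpy.random.default_rng(seed)
    inds = numpy.arange(n)
    rng.shuffle(inds)
    return inds


# Same seed, fresh generator: the two shuffles are identical
first_seeded = shuffled_indices(150, seed=1)
second_seeded = shuffled_indices(150, seed=1)
print("Seeded runs match:", numpy.array_equal(first_seeded, second_seeded))

# No seed: each generator is initialised from fresh entropy, so the shuffles almost surely differ
first_unseeded = shuffled_indices(150)
second_unseeded = shuffled_indices(150)
print("Unseeded runs match:", numpy.array_equal(first_unseeded, second_unseeded))
```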

# Section 7 - Explore the linear regression model



*   Why are there 4 weights for the model?
*   Why is there one bias for the model?



**Solution**

The number of weights matches the number of features, whereas there is only one bias per basic linear model.
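
To make this concrete, a linear model predicts each label as the weighted sum of the 4 feature values plus the single bias. The sketch below reproduces `predict` by hand under that reading; it fits on the full dataset purely for brevity (an assumption that differs from the training split used in Section 5), and the name `manual_pred_y` is illustrative.

```python
import numpy
from sklearn import datasets, linear_model

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

lr_model = linear_model.LinearRegression()
lr_model.fit(iris_data, iris_labels)

# One weight per feature, and one scalar bias
print("Shape of the weights:", lr_model.coef_.shape)
print("Number of dimensions of the bias:", numpy.ndim(lr_model.intercept_))

# Prediction by hand: weighted sum of the features plus the bias
manual_pred_y = iris_data @ lr_model.coef_ + lr_model.intercept_
print("Matches predict():", numpy.allclose(manual_pred_y, lr_model.predict(iris_data)))
```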

# Section 8 - Explore the effects of L1 and L2 regularization

*   Train and evaluate a linear regression model
  * See above examples; also see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   Train and evaluate a linear regression model with L2 regularization
  * Set alpha to 0.5
  * See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
*   Train and evaluate a linear regression model with L1 regularization
  * Set alpha to 0.5
  * See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
*   What are the effects of regularization that you notice?
  * See Week 1 mini-videos
  * Hint - Compare the weights (and bias) and the errors.




**Solution**

In [None]:

# Get the weights and bias of trained linear regression model from Section 5
print("\nThe weights (w):",  lr_model.coef_)
print("The bias (b):",  lr_model.intercept_)

alpha = 0.5

# Train a new linear regression model with L2 regularization
# Get its weights and bias
lr_model_L2 = linear_model.Ridge(alpha=alpha)
lr_model_L2.fit(training_data_x, training_labels_y)
print("\nThe weights (w) - L2 reg:",  lr_model_L2.coef_)
print("The bias (b) - L2 reg:",  lr_model_L2.intercept_)

# Train a separate linear regression model with L1 regularization
# Get its weights and bias
lr_model_L1 = linear_model.Lasso(alpha=alpha)
lr_model_L1.fit(training_data_x, training_labels_y)
print("\nThe weights (w) - L1 reg:",  lr_model_L1.coef_)
print("The bias (b) - L1 reg:",  lr_model_L1.intercept_)
print()


# Check the performance of the models on the data used to train them
training_pred_y = lr_model.predict(training_data_x)
training_pred_y_L2 = lr_model_L2.predict(training_data_x)
training_pred_y_L1 = lr_model_L1.predict(training_data_x)
print("\nMean squared error (training error): %.2f " % mean_squared_error(training_labels_y, training_pred_y))
print("Mean squared error (training error) - L2: %.2f " % mean_squared_error(training_labels_y, training_pred_y_L2))
print("Mean squared error (training error) - L1: %.2f " % mean_squared_error(training_labels_y, training_pred_y_L1))

# Check the performance of the models on test data not seen by the models in training
test_pred_y = lr_model.predict(test_data_x)
test_pred_y_L2 = lr_model_L2.predict(test_data_x)
test_pred_y_L1 = lr_model_L1.predict(test_data_x)
print("\nMean squared error (error on test data): %.2f " % mean_squared_error(test_labels_y, test_pred_y))
print("Mean squared error (error on test data) - L2: %.2f " % mean_squared_error(test_labels_y, test_pred_y_L2))
print("Mean squared error (error on test data) - L1: %.2f " % mean_squared_error(test_labels_y, test_pred_y_L1))
print()



The weights (w): [-0.09279415 -0.07641343  0.17415337  0.71879044]
The bias (b): 0.2774695677305642

The weights (w) - L2 reg: [-0.09448983 -0.07012257  0.20941161  0.63727928]
The bias (b) - L2 reg: 0.23588894051204656

The weights (w) - L1 reg: [ 0.         -0.          0.28847935  0.        ]
The bias (b) - L1 reg: -0.032247321880808366


Mean squared error (training error): 0.04 
Mean squared error (training error) - L2: 0.04 
Mean squared error (training error) - L1: 0.15 

Mean squared error (error on test data): 0.05 
Mean squared error (error on test data) - L2: 0.05 
Mean squared error (error on test data) - L1: 0.13 



* With L2 regularization, you should see weights that are smaller in overall magnitude (L2 norm) compared to no regularization, although an individual weight may grow slightly.
* With L1 regularization, you should see more weights that are exactly zero, compared to L2 and no regularization.
* Remember from the Week 1 lecture that regularization is a strategy for reducing overfitting to training data and making the model more generalizable to unseen data (represented here by the test data). See the Week 1 lecture and Week 1 suggested readings for more on overfitting and regularization. In the outputs above, the unregularized model does not overfit noticeably (training error 0.04, test error 0.05), so L2 regularization leaves both errors unchanged to two decimal places, while L1 regularization with alpha = 0.5 sets three of the four weights to zero and raises both errors (0.15 and 0.13).
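
The first two observations can be quantified rather than judged by eye. The sketch below reports the L2 norm of the weights and the number of weights that are exactly zero for each model; it fits on the full dataset for brevity (an assumption, unlike the training split used above), and the `summary` dictionary is an illustrative name.

```python
import numpy
from sklearn import datasets, linear_model

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

models = {
    "no reg": linear_model.LinearRegression(),
    "L2 reg": linear_model.Ridge(alpha=0.5),
    "L1 reg": linear_model.Lasso(alpha=0.5),
}

summary = {}
for name, model in models.items():
    model.fit(iris_data, iris_labels)
    # Overall size of the weight vector
    l2_norm = float(numpy.linalg.norm(model.coef_))
    # Number of features the model ignores entirely
    zero_count = int(numpy.count_nonzero(model.coef_ == 0))
    summary[name] = (l2_norm, zero_count)
    print("%s: L2 norm of weights = %.4f, zero weights = %d" % (name, l2_norm, zero_count))
```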

# Section 9 - Explore the effect of alpha on L1 and L2 regularization

* Using your code in Section 8, compare the effect of multiple alpha values, e.g. alpha = 0.000000001, 0.0001, 0.1, on regularization.

**Solution**

You will find that higher values of alpha increase the effect of regularization, and an alpha value of zero corresponds to no regularization.
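
One way to run this comparison is to loop over the alpha values, as in the sketch below. It assumes a seeded 50:50 split (seed 1 is arbitrary) rather than reusing the notebook's variables, and raises `max_iter` for `Lasso` because very small alpha values can need more iterations to converge; a convergence warning may still appear for the smallest alpha.

```python
import numpy
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Seeded 50:50 split (the seed value is an arbitrary assumption)
rng = numpy.random.default_rng(1)
rand_inds = rng.permutation(iris_labels.shape[0])
split_point = iris_labels.shape[0] // 2
train_x, train_y = iris_data[rand_inds[:split_point]], iris_labels[rand_inds[:split_point]]
test_x, test_y = iris_data[rand_inds[split_point:]], iris_labels[rand_inds[split_point:]]

alphas = [0.000000001, 0.0001, 0.1]
ridge_norms = []
for alpha in alphas:
    ridge = linear_model.Ridge(alpha=alpha).fit(train_x, train_y)
    lasso = linear_model.Lasso(alpha=alpha, max_iter=100000).fit(train_x, train_y)
    ridge_norms.append(float(numpy.linalg.norm(ridge.coef_)))
    print("\nalpha =", alpha)
    print("The weights (w) - L2 reg:", ridge.coef_)
    print("The weights (w) - L1 reg:", lasso.coef_)
    print("Test MSE - L2: %.3f" % mean_squared_error(test_y, ridge.predict(test_x)))
    print("Test MSE - L1: %.3f" % mean_squared_error(test_y, lasso.predict(test_x)))
```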

# Section 10 - Explore the effect of data augmentation

*   Train and evaluate a linear regression model
  * See above examples; also see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   Train and evaluate a linear regression model with data augmentation applied to the training data
  * See Week 1 mini-videos
  * You could try randomly setting feature values to '-1'
  * You could try multiple augmentation intensities, e.g. probability of 0.01, 0.1, and 0.5 of setting to '-1'
*   Train and evaluate a linear regression model with another data augmentation applied to the training data
  * See Week 1 mini-videos
  * You could try randomly adding Gaussian noise to the data
  * You could try multiple augmentation intensities, e.g. Gaussian noise of standard deviation = 0.01, 0.1, and 0.5
*   What are the effects of augmentation that you notice?


**Solution**

The solution below tests two augmentations: setting feature values to -1 with a probability of 0.01, and adding Gaussian noise with a standard deviation of 0.01.

With real world data, depending on the data and the specific data augmentation method and settings appropriate for it, you would expect a model trained with augmented data to be more generalizable (see Week 1 mini-videos for what this means).
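
To try the other suggested intensities (0.01, 0.1, and 0.5), the two augmentations can be wrapped as functions and applied in a loop. This is a minimal sketch under stated assumptions: the helpers `drop_to_minus_one` and `add_gaussian_noise` are hypothetical names, and the split is a seeded 50:50 split (seed 1 is arbitrary) rather than the notebook's own variables.

```python
import numpy
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error


def drop_to_minus_one(x, prob, rng):
    """Return a copy of x with each value set to -1 with probability prob."""
    mask = rng.random(size=x.shape) <= prob
    return numpy.where(mask, -1.0, x)


def add_gaussian_noise(x, std, rng):
    """Return a copy of x with zero-mean Gaussian noise of standard deviation std added."""
    return x + rng.normal(loc=0.0, scale=std, size=x.shape)


iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Seeded 50:50 split (the seed value is an arbitrary assumption)
rng = numpy.random.default_rng(1)
rand_inds = rng.permutation(iris_labels.shape[0])
split_point = iris_labels.shape[0] // 2
train_x, train_y = iris_data[rand_inds[:split_point]], iris_labels[rand_inds[:split_point]]
test_x, test_y = iris_data[rand_inds[split_point:]], iris_labels[rand_inds[split_point:]]

augmentations = [("dropped out", drop_to_minus_one), ("noised", add_gaussian_noise)]
for intensity in [0.01, 0.1, 0.5]:
    for name, augment in augmentations:
        # Augment only the training features; the test data stays untouched
        augmented_x = augment(train_x, intensity, rng)
        model = linear_model.LinearRegression().fit(augmented_x, train_y)
        test_mse = mean_squared_error(test_y, model.predict(test_x))
        print("%s, intensity %.2f: test MSE %.3f" % (name, intensity, test_mse))
```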

In [None]:

# Randomly select training data instances and dimensions to 'drop out' by setting to -1
# -- Generate random numbers between 0 and 1 for each dimension of each data instance
# -- Select dimensions (within data instances) with generated random number <= thresh
# -- And set their values to -1
thresh = 0.01
dropout_ind = rng.random(size=training_data_x.shape)
training_data_x_droppedout = copy.deepcopy(training_data_x)
numpy.place(training_data_x_droppedout, dropout_ind<=thresh, -1)


print(training_data_x_droppedout)

# Train a new linear regression model with the data with the 'drop outs'
lr_model_droppedout = linear_model.LinearRegression()
lr_model_droppedout.fit(training_data_x_droppedout, training_labels_y)
print("\nThe weight (w):",  lr_model_droppedout.coef_)
print("The bias (b):",  lr_model_droppedout.intercept_)




# Generate Gaussian noise of standard deviation std
# and of the same size as the training data
# then add the noise to the features for the training data
std = 0.01
noise = rng.normal(loc=0.0, scale=std, size=training_data_x.shape)
training_data_x_noised = training_data_x + noise


print(training_data_x_noised)

# Train a new linear regression model with the noised data
lr_model_noised = linear_model.LinearRegression()
lr_model_noised.fit(training_data_x_noised, training_labels_y)
print("\nThe weight (w):",  lr_model_noised.coef_)
print("The bias (b):",  lr_model_noised.intercept_)



# Check the performance of the models on the data used to train them
training_pred_y_droppedout = lr_model_droppedout.predict(training_data_x_droppedout)
training_pred_y_noised = lr_model_noised.predict(training_data_x_noised)
print("\nMean squared error (error on training data): %.2f " % mean_squared_error(training_labels_y, training_pred_y))
print("Mean squared error (error on training data) - droppedout: %.2f " % mean_squared_error(training_labels_y, training_pred_y_droppedout))
print("Mean squared error (error on training data) - noised: %.2f " % mean_squared_error(training_labels_y, training_pred_y_noised))

# Check the performance of the models on test data not seen by the models in training
test_pred_y_droppedout = lr_model_droppedout.predict(test_data_x)
test_pred_y_noised = lr_model_noised.predict(test_data_x)
print("\nMean squared error (error on test data): %.2f " % mean_squared_error(test_labels_y, test_pred_y))
print("Mean squared error (error on test data) - droppedout: %.2f " % mean_squared_error(test_labels_y, test_pred_y_droppedout))
print("Mean squared error (error on test data) - noised: %.2f " % mean_squared_error(test_labels_y, test_pred_y_noised))
print()

[[ 5.7  3.   4.2  1.2]
 [ 7.9  3.8  6.4  2. ]
 [ 5.5  2.4  3.8  1.1]
 [ 4.4  3.   1.3  0.2]
 [-1.   3.   1.4  0.3]
 [ 6.7  3.   5.2  2.3]
 [ 5.7  2.9  4.2  1.3]
 [ 5.5  2.5  4.   1.3]
 [ 7.7  2.6  6.9  2.3]
 [ 4.9  3.   1.4  0.2]
 [ 5.1  3.5  1.4  0.2]
 [ 5.5  2.4  3.7  1. ]
 [ 6.7  3.1  4.4  1.4]
 [ 5.6  3.   4.5  1.5]
 [ 5.4  3.4  1.5  0.4]
 [ 4.8  3.   1.4  0.1]
 [ 4.8  3.4  1.9  0.2]
 [ 7.7  3.8  6.7 -1. ]
 [ 6.4  3.2  5.3  2.3]
 [ 7.6  3.   6.6  2.1]
 [ 6.5  3.2  5.1  2. ]
 [ 5.2  3.4  1.4  0.2]
 [-1.   3.   5.1  1.8]
 [ 6.6  3.   4.4  1.4]
 [ 4.6  3.4  1.4  0.3]
 [ 4.4  2.9  1.4  0.2]
 [ 6.9  3.1  5.1  2.3]
 [ 4.7  3.2  1.3  0.2]
 [ 5.8  2.7  5.1  1.9]
 [ 4.9  2.4  3.3  1. ]
 [ 6.5  3.   5.2  2. ]
 [ 6.3  2.7  4.9  1.8]
 [ 6.7  3.3  5.7  2.5]
 [ 5.6  2.5  3.9  1.1]
 [ 6.1  3.   4.9  1.8]
 [ 4.9  3.6  1.4  0.1]
 [ 4.6  3.2  1.4  0.2]
 [ 7.7  3.   6.1  2.3]
 [ 5.5  4.2  1.4  0.2]
 [ 5.6  3.   4.1  1.3]
 [ 6.7  3.1  5.6  2.4]
 [ 5.7  2.8  4.5  1.3]
 [ 5.5  2.3  4.   1.3]
 [ 6.4  2.7