# Learning outcomes

When you've worked through the tasks and exercises in this notebook, you will have:

* built a machine learning model using a standard software library;
* run experiments to explore effects of regularization and data augmentation.

# Objectives




* To introduce you to 2 of the main (Python-based) software libraries we'll be using throughout the module:
> 1. scikit-learn (https://scikit-learn.org/stable/) - one of the most widely used machine learning libraries.
> 2. numpy (https://numpy.org/) - a very common library for numerical arrays and mathematical functions.

>**Note**: It is your responsibility as a machine learning scientist to read the documentation for any library function you use and to thoroughly understand what it does, whether it validly serves your purpose, and which of its parameters you need to consider.

* To see some of the basic components of machine learning first-hand: training data, test (i.e. unseen) data, and a machine learning model (for most types of model, the weights and biases are the primary parameters that specify it):
>1. Data - today, we'll use the Iris dataset. You can read more about it here: https://scikit-learn.org/stable/datasets/toy_dataset.html.
>2. Model - we'll explore linear regression, regularization, and augmentation, which we covered in the Week 1 mini-videos.


# Section 1 - Set up imports and random number generator

In [None]:
%matplotlib inline

import numpy
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import copy


# Set up the random number generator
rng = numpy.random.default_rng()

# Section 2 - Load the Iris dataset

In [None]:
from sklearn import datasets

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

print("The dimensions of the Iris feature matrix", iris_data.shape)

# Section 3 - Explore the Iris dataset

* Read about the Iris dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html
* What type of labels does it have (real continuous or categorical)? What kind of machine learning task is this type of label suited to, i.e. classification or regression?
* What is the feature dimensionality of the dataset, i.e. the number of features?
* How many data instances are there? What is the distribution of instances across classes?



---


* Select one of the features. How is the selected feature associated with the iris classes, i.e. how well does it help to differentiate between them? (Hint - use a search engine to read about the Iris Setosa, Iris Versicolour, and Iris Virginica plants.)
* What factors do you think limited the number of data instances per class?
* How do you think the data was collected? What implication would this have for real world deployment of a model for automatic detection of iris classes based on this dataset?
* How do you think it was labelled? What kind of challenge might this pose for collection of more training data (and labels) for automatic detection of iris classes?
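As a starting point for the questions above, here is a minimal sketch of how you might inspect the labels and features with numpy. The names `classes`, `counts`, and `class_means` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy
from sklearn import datasets

# Load the features and the integer class labels as plain numpy arrays
iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Count how many instances carry each distinct label value
classes, counts = numpy.unique(iris_labels, return_counts=True)
for cls, count in zip(classes, counts):
    print("Class", cls, "has", count, "instances")

# Mean of every feature within each class (one row per class, one column per feature)
class_means = numpy.array([iris_data[iris_labels == c].mean(axis=0) for c in classes])
print(class_means)
```

Comparing the rows of `class_means` column by column is one simple way to see which features separate the classes and which do not.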

# Section 4 - Split into training and test sets

In [None]:

# Randomly split the data into 50:50 training:test sets
rand_inds = numpy.arange(iris_labels.shape[0])
rng.shuffle(rand_inds)
split_point = int(0.5*iris_labels.shape[0])

training_data_x = iris_data[rand_inds[0:split_point], :]
training_labels_y = iris_labels[rand_inds[0:split_point]]
test_data_x = iris_data[rand_inds[split_point:iris_labels.shape[0]], :]
test_labels_y = iris_labels[rand_inds[split_point:iris_labels.shape[0]]]

print("Size of the training data:", training_data_x.shape)
print("Size of the test data:", test_data_x.shape)

# Section 5 - Train linear regression model

In [None]:

print(training_data_x)

# Train a linear regression model
lr_model = linear_model.LinearRegression()
lr_model.fit(training_data_x, training_labels_y)
print("\nThe weights (w):", lr_model.coef_)
print("The bias (b):", lr_model.intercept_)

# Check the performance of the model on the data used to train it
training_pred_y = lr_model.predict(training_data_x)
print("\nMean squared error (error on training data): %.2f " % mean_squared_error(training_labels_y, training_pred_y))




# Section 6 - Explore reproducibility


* Run Sections 4 and 5 code multiple times (e.g. 3 times) - each time, copy and paste your outputs (training data, weight, bias, mean squared error) somewhere so that you can compare outputs across the multiple runs. What do you notice? What is the implication, and how could you address it?
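Once you have made your own observations, one possible way to explore this further is to look at how a random number generator behaves when it is given a seed. The sketch below is an illustration on a toy index array, not a prescribed solution; the seed value `42` and the names `rng_a` and `rng_b` are arbitrary choices.

```python
import numpy

# Two generators built from the same (arbitrary, illustrative) seed
seed = 42
rng_a = numpy.random.default_rng(seed)
rng_b = numpy.random.default_rng(seed)

# Shuffle two identical index arrays, one with each generator
inds_a = numpy.arange(10)
inds_b = numpy.arange(10)
rng_a.shuffle(inds_a)
rng_b.shuffle(inds_b)

print(inds_a)
print(inds_b)
print("Same order:", numpy.array_equal(inds_a, inds_b))
```

Consider what this would mean for the training:test split in Section 4, and whether a single fixed split is enough to judge a model.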

# Section 7 - Explore the linear regression model



*   Why are there 4 weights for the model?
*   Why is there one bias for the model?
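To help you reason about these questions, the sketch below recomputes the model's predictions by hand from `coef_` and `intercept_` and checks them against `predict`. For brevity it fits on the whole dataset rather than on a training split; `manual_pred` and `library_pred` are illustrative names.

```python
import numpy
from sklearn import datasets, linear_model

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Fit on the full dataset purely for illustration
lr_model = linear_model.LinearRegression()
lr_model.fit(iris_data, iris_labels)

# Recompute the predictions by hand: weighted sum of the features plus the bias
manual_pred = iris_data @ lr_model.coef_ + lr_model.intercept_
library_pred = lr_model.predict(iris_data)

print("Shape of the feature matrix:", iris_data.shape)
print("Shape of the weights:", lr_model.coef_.shape)
print("Predictions match:", numpy.allclose(manual_pred, library_pred))
```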



# Section 8 - Explore the effects of L1 and L2 regularization

*   Train and evaluate a linear regression model
  * See above examples; also see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   Train and evaluate a linear regression model with L2 regularization
  * Set alpha to 0.5
  * See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
*   Train and evaluate a linear regression model with L1 regularization
  * Set alpha to 0.5
  * See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
*   What are the effects of regularization that you notice?
  * See Week 1 mini-videos
  * Hint - Compare the weights (and bias) and the errors.
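One way the three models could be set up side by side is sketched below. It is a starting point rather than a model answer: the fixed seed `0`, the names `train_x`, `train_y`, `test_x`, `test_y`, and the `results` dictionary are assumptions made for this sketch, and you may prefer to reuse the variables from Section 4 instead.

```python
import numpy
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Fixed seed (an assumption for this sketch) so that all models see the same split
rng = numpy.random.default_rng(0)
inds = rng.permutation(iris_labels.shape[0])
split_point = iris_labels.shape[0] // 2
train_x, train_y = iris_data[inds[:split_point]], iris_labels[inds[:split_point]]
test_x, test_y = iris_data[inds[split_point:]], iris_labels[inds[split_point:]]

# The same linear model with no, L2, and L1 regularization
models = {
    "No regularization": linear_model.LinearRegression(),
    "L2 (Ridge, alpha=0.5)": linear_model.Ridge(alpha=0.5),
    "L1 (Lasso, alpha=0.5)": linear_model.Lasso(alpha=0.5),
}

results = {}
for name, model in models.items():
    model.fit(train_x, train_y)
    train_mse = mean_squared_error(train_y, model.predict(train_x))
    test_mse = mean_squared_error(test_y, model.predict(test_x))
    results[name] = (model.coef_, model.intercept_, train_mse, test_mse)
    print(name)
    print("  weights:", model.coef_, "bias:", model.intercept_)
    print("  training MSE: %.3f, test MSE: %.3f" % (train_mse, test_mse))
```

Evaluating on both the training and the test data lets you compare how each penalty changes the fit to seen versus unseen data.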




# Section 9 - Explore the effect of alpha on L1 and L2 regularization

* Using your code in Section 8, compare the effect of multiple alpha values, e.g. alpha = 0.000000001, 0.0001, 0.1, on regularization.
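A minimal sketch of such a comparison is a loop over the alpha values, as below. The split, the seed `0`, the `sweep` list, and the use of the sum of absolute weights as a summary of weight size are all assumptions of this sketch. `max_iter` is raised because Lasso's iterative solver may need more iterations (or may still warn about convergence) when alpha is very small.

```python
import numpy
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

# Fixed seed (an assumption for this sketch) so every alpha is tested on the same split
rng = numpy.random.default_rng(0)
inds = rng.permutation(iris_labels.shape[0])
split_point = iris_labels.shape[0] // 2
train_x, train_y = iris_data[inds[:split_point]], iris_labels[inds[:split_point]]
test_x, test_y = iris_data[inds[split_point:]], iris_labels[inds[split_point:]]

alphas = [0.000000001, 0.0001, 0.1]
model_classes = [("Ridge", linear_model.Ridge), ("Lasso", linear_model.Lasso)]

sweep = []
for alpha in alphas:
    for name, model_class in model_classes:
        model = model_class(alpha=alpha, max_iter=100000)
        model.fit(train_x, train_y)
        test_mse = mean_squared_error(test_y, model.predict(test_x))
        # Sum of absolute weights as a rough summary of how large the weights are
        weight_size = numpy.abs(model.coef_).sum()
        sweep.append((name, alpha, weight_size, test_mse))
        print("%s alpha=%g: sum|w|=%.4f, test MSE=%.4f" % (name, alpha, weight_size, test_mse))
```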

# Section 10 - Explore the effect of data augmentation

*   Train and evaluate a linear regression model
  * See above examples; also see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   Train and evaluate a linear regression model with data augmentation applied to the training data
  * See Week 1 mini-videos
  * You could try randomly setting feature values to '-1'
  * You could try multiple augmentation intensities, e.g. probability of 0.01, 0.1, and 0.5 of setting to '-1'
*   Train and evaluate a linear regression model with another data augmentation applied to the training data
  * See Week 1 mini-videos
  * You could try randomly adding Gaussian noise to the data
  * You could try multiple augmentation intensities, e.g. Gaussian noise of standard deviation = 0.01, 0.1, and 0.5
*   What are the effects of augmentation that you notice?
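The two augmentations suggested above could be sketched as small helper functions like the ones below. `mask_features` and `add_gaussian_noise` are hypothetical helpers written for this illustration (they are not library functions), and `toy_x` is a made-up stand-in for your training features.

```python
import numpy

rng = numpy.random.default_rng(0)  # seed chosen arbitrarily for this sketch


def mask_features(data, probability, rng, fill_value=-1.0):
    """Return a copy of data in which each value is replaced by fill_value with the given probability."""
    augmented = data.copy()
    mask = rng.random(data.shape) < probability
    augmented[mask] = fill_value
    return augmented


def add_gaussian_noise(data, std, rng):
    """Return a copy of data with zero-mean Gaussian noise of standard deviation std added to every value."""
    return data + rng.normal(loc=0.0, scale=std, size=data.shape)


# Toy stand-in for the training features (the values are made up)
toy_x = numpy.array([[5.1, 3.5, 1.4, 0.2],
                     [6.2, 2.9, 4.3, 1.3],
                     [7.7, 3.0, 6.1, 2.3]])

print(mask_features(toy_x, probability=0.5, rng=rng))
print(add_gaussian_noise(toy_x, std=0.1, rng=rng))
```

Both helpers return new arrays and leave the input unchanged, so the test data is never altered by accident. Whether you train on the augmented copy alone or stack it with the original training data (repeating the labels to match) is a design choice worth comparing.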
