**<p style='text-align: right;'>Ver. 1.0.3</p>**

# Introductory Applied Machine Learning (IAML) Coursework - Semester 2, 2022-23

### Authors: Hiroshi Shimodaira and Rohan Gorantla

## Important Instructions

#### It is important that you follow the instructions below carefully for things to work properly.

You need to set up and activate your environment as you would for your labs; see the Learn section on Labs.  **You will need to use Noteable to create the files you will submit (the Jupyter (IPython) Notebook and the PDF)**.  Do **NOT** create the PDF in any other way, as we will not be able to mark it.  If you want to develop your answers in your own environment, make sure you are using the same packages we are using by running the imports cell below.

Read the instructions in this notebook carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of answer we are expecting (code/text), and also provide the required code/markdown cell.

- We will use the IAML Learn page for any announcements, updates, and FAQs on this assignment. Please ***visit the page frequently*** to find the latest information/changes.
- Data files that you will be using are included in the coursework zip file that you have downloaded from the Learn assignment page for this coursework.
- There is a helper file 'iaml23cw_helpers.py' in the zip file, which you should upload to your environment.
- Some of the topics in this coursework are covered in weeks 7 and 8 of the course. Focus first on questions on topics that you have covered already, and come back to the other questions as the lectures progress.
- Keep your answers brief and concise.
- Make sure to show all your code/working.
- All the figures you present should have axis labels, titles, and grid lines unless specified explicitly. If you think grid lines spoil readability, you can adjust the line width and/or line style. Figures should not be too small to read.
- Write readable code. While we do not expect you to follow PEP8 to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. Do use inline comments when doing something non-standard.
- When asked to present numerical values, make sure to represent real numbers in the appropriate precision corresponding to your answer. 
- When you use libraries specified in this coursework, you should use the default parameters unless specified explicitly.
- The criteria on which you will be judged include the quality of the textual answers and/or any plots asked for. For higher marks, when asked, you need to give good, concise discussions based on experiments and theory, in your own words.

- You will see <html>\\pagebreak</html> at the start of each subquestion.  ***Do not remove these; if you do, we will not be able to mark your coursework.***

#### Good Scholarly Practice
Please remember the University requirement regarding all assessed work for credit. Details about this can be found at:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct

Specifically, this assignment should be your own individual work. We will employ tools for detecting misconduct.

Moreover, please note that Piazza is NOT a forum for discussing the solutions of the assignment. You may ask private questions. You can use the office hours to ask questions.

### SUBMISSION Mechanics
This assignment will account for 30% of your final mark. We ask you to submit answers to all questions.

You will submit (1) a PDF of your Notebook and (2) the Notebook itself via Gradescope.  Your grade will be based on the PDF; we will only use the Notebook if we need to see details.  **You must use the following procedure to create the materials to submit**.

1. Make sure your Notebook, the helper file, and the datasets are in Noteable and will run.  If you developed your answers in Noteable, this is already done.

2. Select **Kernel->Restart & Run All** to create a clean copy of your submission; this will run the cells in order from top to bottom.  This may take a while (a few hours) to complete, so ensure that all the output and plots have completed before you proceed.

3. Select **File->Download as->PDF via LaTeX (.pdf)** and wait for the PDF to be created and downloaded.

4. Select **File->Download as->Notebook (.ipynb)**

5. You now should have in your download folder the pdf and the notebook.  Rename them sNNNNNNN.pdf and sNNNNNNN.ipynb, where sNNNNNNN is your matriculation number (student number).

**Details on submission instructions will be announced and documented on Learn before the deadline**. 

The submission deadline for this assignment is **28th March 2023 at 12:00 (midday) UK time (UTC)**.  Don't leave it to the last minute!


#### IMPORTS
Execute the cell below to import all packages you will be using for this assignment.  If you are not using Noteable, make sure the Python and package version numbers in your environment match those used on Noteable, which you can check by running the following cell. The Python version does not need to be identical, but it should be $3.9.p$, where $p \ge 12$.

In [None]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

from iaml23cw_helpers import *
print_versions();

# You may add other libraries here or in your other cells as needed.



\pagebreak

# Question 1: Experiments with a stock price data set

#### 65 marks out of 130 for this coursework

The stock price data set we use in this coursework is a stock market index (composite stock price index of common stocks) in a country for the period between 2000 and 2022, consisting of four historical prices ('Open', 'High', 'Low', 'Close', which denote the opening, highest, lowest, and closing prices on the trading day, respectively) and trading volume. For the convenience of the coursework, we have added some features to the data set. They are four [technical indicators](https://www.fidelity.com/bin-public/060_www_fidelity_com/documents/learning-center/Understanding-Indicators-TA.pdf) (RSI, SMA, BBP, ADX), 'Tomorrow', and 'Target'. 'Tomorrow' holds the closing price of the next trading day, which we will use for price prediction. 'Target' is a binary indicator (label) that takes 1 if 'Tomorrow' is higher than 'Close' and 0 otherwise, which we will use for the prediction of movement direction.
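
As a rough illustration of how the two added columns relate to 'Close', the sketch below rebuilds them on a toy series. The prices are made up, and the use of `shift(-1)` and the dropping of the last row are assumptions about how such columns could be derived, not necessarily how the coursework files were produced.

```python
import pandas as pd

# Toy closing prices (made-up values, not taken from the coursework data)
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 101.0, 105.0]})

# 'Tomorrow' is the closing price of the next row (next trading day)
toy["Tomorrow"] = toy["Close"].shift(-1)

# The last row has no next day, so it is dropped here (an assumption)
toy = toy.dropna(subset=["Tomorrow"])

# 'Target' is 1 if tomorrow's close is strictly higher than today's, else 0
toy["Target"] = (toy["Tomorrow"] > toy["Close"]).astype(int)

print(toy)
```

Note that an unchanged price (101.0 followed by 101.0) gives 'Target' = 0, matching the "0 otherwise" rule.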

***Loading data:***
Make sure that you have the data set files "dset_q1a.csv" and "dset_q1b.csv" in your environment. We will use the first file in all of the following subquestions except the last one, subquestion 1.8. Run the following cell to load the first file.

In [None]:
# Load the data set "dset_q1a.csv"
df = pd.read_csv("dset_q1a.csv", index_col="Date", parse_dates=True)

# ========== Question 1.1 --- [5 marks] ==========
###  Describe the main properties of the data:
1. [Code] Display the shape of the data
2. [Code] Display the range of the dataframe index
3. [Code] What data are present and what types of data are they? Display the information using **pandas.DataFrame.info**.
4. [Code] Display the highest price, the lowest price, and the mean of the closing price ('Close') for each year in the data. (Hint: the highest price for each year is sought from the price 'High'.)

\pagebreak
## Your answers for Question 1.1

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code goes here

In [None]:
#(4) Your code goes here

\pagebreak

# ========== Question 1.2 --- [8 marks] ==========
Perform an *exploratory data analysis* on the dataset by studying the following:
1. [Code and text] Plot the stock market closing price ('Close') and comment on it.
2. [Code] For the period from the beginning of year 2007 until the end of 2008, plot the closing price ('Close') and volumes ('Volume') respectively, where you show months on the x-axis and indicate the positions of the highest and lowest values for the period.
3. [Code and text] Plot a pairplot for the dataset features using the seaborn **pairplot** and report the patterns in the given dataset.
4. [Code] Plot the correlation matrix for the dataset features.
5. [Text] Based on the results you obtained in 3 and 4 above, comment on the relationships among the features present in the dataset.

\pagebreak
## Your answers for Question 1.2

In [None]:
#(1) Your code and text goes here

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code and text goes here

In [None]:
#(4) Your code goes here

#(5) Your text goes here

\pagebreak

# ========== Question 1.3 --- [9 marks] ==========

Here we apply linear regression to predict 'Tomorrow' from 'SMA'.
For this question, you should use the sklearn implementation of Linear Regression. Use the first 80% of the data for training and the remaining 20% for testing ***without shuffling***.
1. [Code] Fit a linear regression model to the training data so that we can predict 'Tomorrow' from 'SMA'. Report the estimated model parameters w and the coefficient of determination $R^2$.
2. [Text] Describe what the parameters represent for the fitted dataset with the linear regression model.
3. [Code] Report the root mean-square error (RMSE) for the training set and test set, respectively.
4. [Code] Plot predicted values versus actual values for the test set, where the x-axis corresponds to actual values and the y-axis to predicted values. Draw a line of $y=x$ on the plot.
5. [Code] Plot 'Tomorrow' versus 'SMA' for the training set and display the regression line on the same graph. The x-axis corresponds to 'SMA' and the y-axis to 'Tomorrow'.
6. [Text] Examining the results (e.g. $R^2$ and RMSE), comment on the predictability of the model.
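
The chronological 80/20 split and the RMSE mentioned above can be sketched as follows. This is a minimal illustration on synthetic data; the array names and the helper `rmse` are hypothetical and not part of the coursework helpers.

```python
import numpy as np

# Synthetic, time-ordered data (made-up; stands in for one feature and one target)
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Chronological split: first 80% for training, remaining 20% for testing
n_train = int(0.8 * len(x))
x_trn, x_tst = x[:n_train], x[n_train:]
y_trn, y_tst = y[:n_train], y[n_train:]

def rmse(y_true, y_pred):
    """Root mean-square error between two 1-D arrays (hypothetical helper)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

print(len(x_trn), len(x_tst))
print(rmse(y_tst, 2.0 * x_tst + 1.0))
```

Plain slicing keeps the time order intact. sklearn's `train_test_split` can produce the same kind of split when called with `shuffle=False`.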

\pagebreak
## Your answers for Question 1.3

In [None]:
#(1) Your code goes here

#(2) Your text goes here

In [None]:
#(3) Your code goes here

In [None]:
#(4) Your code goes here

In [None]:
#(5) Your code goes here

#(6) Your text goes here

\pagebreak

# ========== Question 1.4 --- [5 marks] ==========

1. [Code] Instead of using libraries for linear regression, write your own code for finding the regression coefficients of the regression model that predicts 'Tomorrow' from 'SMA'. Run your code and show the coefficients, using the same training data as in Question 1.3.
2. [Text] One of the common metrics used for evaluating the performance of regression models is Mean Squared Error (MSE). Write out the expression for MSE, list one of its limitations, and explain how it can be addressed with alternative metrics.

\pagebreak
## Your answers for Question 1.4

In [None]:
#(1) Your code goes here

#(2) Your text goes here


\pagebreak

# ========== Question 1.5 --- [6 marks] ==========
#### Multiple linear regression and polynomial regression

We here consider multiple linear regression that employs four variables ('RSI', 'SMA', 'BBP', 'ADX') to predict 'Tomorrow'. We use the same training data and test data as Question 1.3.
1. [Code] Train the multiple linear regression model on the training set and show the model parameters and the coefficient of determination $R^2$. Also show the RMSE for the training set and test set respectively.
2. [Code] We now extend the model to the polynomial regression model, in which we use all polynomial combinations of the variables up to the specified degree $p$. Using $p=2$, run an experiment in the same manner as 1 above and report the model parameters and $R^2$. Also report the RMSE for the training and test sets respectively. You should use the sklearn implementation of Linear Regression and Polynomial Features.
3. [Text] Comparing the results you obtained here with those in Question 1.3, report your findings and discuss them briefly.
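
To make "all polynomial combinations of the variables up to degree $p$" concrete, the following sketch expands a toy four-variable input with sklearn's `PolynomialFeatures`. The input values are made up; the point is only the shape of the expanded feature matrix.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy input with four variables (made-up values)
X_toy = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.5, 1.5, 2.5, 3.5]])

poly = PolynomialFeatures(degree=2)   # include_bias=True by default
X_poly = poly.fit_transform(X_toy)

# 1 bias term + 4 linear terms + 10 degree-2 terms = 15 columns
print(X_poly.shape)
# Each row of powers_ gives the exponent of every input in one output feature
print(poly.powers_)
```

With four inputs and $p=2$, the expansion has 15 columns, so the fitted model has many more parameters than the four-variable linear model.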

\pagebreak
## Your answers for Question 1.5

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 1.6 --- [12 marks] ==========
#### Classification

We now consider the prediction of stock price movement as a binary classification problem - class 1 for upward movement and class 0 otherwise. We use the four technical indicators, 'RSI', 'SMA', 'BBP', 'ADX', as input features to a classifier to predict 'Target'.

1. [Code] Using 10-fold cross validation with ***no shuffling*** on ***the whole data***, train four classifiers: Logistic Regression, SVM, Decision Trees, and Random Forests. Display, in a single graph, the validation accuracy as a boxplot for each model. For each model, also report the mean accuracy and mean F-score for the training set and validation set, respectively.
(NB: You should obtain the accuracy and F-score for each trial of k-fold cross validation, which will be used for plotting a boxplot. A mean value/score denotes the average value over the $k$ trials, where $k=10$).
<br> ***Note***: you should use sklearn's KFold, SVC, DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression. For each classification model, use default parameters except that "***random_state=0***" should be specified.
2. [Code] Further to the above, for each model, display the confusion matrix for the validation sets, where rows correspond to true class labels and columns to predicted ones, and each element of the matrix shows the number of corresponding instances.
3. [Text] Comment on which model is best with respect to false positives and false negatives. 
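
As a small illustration of what k-fold cross validation with no shuffling means for time-ordered data, the sketch below lists the validation indices that sklearn's `KFold` produces on 20 toy samples. The data are made up; only the fold structure matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)   # 20 time-ordered toy samples

kf = KFold(n_splits=10)                # shuffle=False by default
folds = list(kf.split(X_toy))          # list of (train_idx, val_idx) pairs

for k, (trn_idx, val_idx) in enumerate(folds):
    # Each validation fold is a contiguous block of the original order
    print(k, val_idx)
```

Per-fold scores computed inside such a loop can be collected in a list, giving the $k$ values needed for a boxplot and for the mean.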


\pagebreak
## Your answers for Question 1.6

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 1.7 --- [5 marks] ==========
#### Dimensionality Reduction 
Here we will apply dimensionality reduction with PCA to the data and run classification experiments on the dimensionality-reduced data.

1. [Code] Using the four technical features ('RSI', 'SMA', 'BBP', 'ADX') as input data, apply PCA to ***the whole data*** and find the minimum set of principal components that explains at least 95% of the variance of the data. Report the number of principal components in the set you found.
2. [Code] Using the set of principal components you found above, reduce the dimensionality of the data and run classification experiments for the four classifiers in the same manner as we did in Question 1.6, but we now use the dimensionality-reduced data instead. You should plot boxplots and report accuracy and F-score in the same manner as Question 1.6. (Note that this experiment is not a formal one, as we apply PCA to the whole data, whose subset is used for testing.)
3. [Text] Comparing the results with those you obtained in Question 1.6, report your findings and give brief discussions.


\pagebreak
## Your answers for Question 1.7

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 1.8 --- [15 marks] ==========


So far we have considered only four technical features, and found that movement classification with the four classifiers is challenging.
We would like to know whether we could improve the performance by using more features, applying preprocessing to the data, and tuning model parameters.
To find an answer to this question, carry out a mini project under the following conditions:
* We use another data set file ("dset_q1b.csv") for this project, which is an extended version of the original one and contains 15 technical indicators. Load the dataset in the following manner:
>   df1b = pd.read_csv("dset_q1b.csv", index_col="Date", parse_dates=True)
* We consider SVM (SVC) only.
* We split the data into two subsets without shuffling - the first 80% of data should be used for training and validation, and the remaining 20% for testing. 
* We will limit the duration of the project to a few hours only.
* The outcome of the project need not be positive. It would not be surprising if you find little improvement.

1. [Text] Describe your ideas for improving the classification performance. Your ideas should be concrete and feasible - the project should be achievable within the specified time.
2. [Code and text] Implement your ideas, run experiments, and report the results including accuracy and F-score for the training set and test set respectively.
3. [Code and text] Examine whether your improvement or deterioration is statistically significant.
4. [Text] Summarise your findings and show your answer to the question. In case of negative results, explain the reasons for the negative outcomes.

\pagebreak
## Your answers for Question 1.8

#(1) Your text goes here

In [None]:
#(2) Your code and text goes here

In [None]:
#(3) Your code and text goes here

#(4) Your text goes here

\pagebreak

# Question 2: Experiments with image data

#### 65 marks out of 130 for this coursework

Image data are made up of $H \times W \times C$ pixels, where $H, W, C$ denote the height, width, and the number of channels, respectively. For simplicity, we assume a grayscale image (i.e. $C=1$). Let $p_{ij}$ denote the pixel value at a grid point $(i,j), 1 \le i \le H, 1 \le j \le W$, where $p_{11}$ corresponds to the pixel at the top-left corner and $p_{HW}$ to the one at the bottom-right corner. We assume that $p_{ij}$ takes an integer value between 0 and 255 (i.e. 8-bit coding). In computers, we can store a grayscale image of $\{p_{ij}\}$ in a $D$-dimensional vector, $x = (x_1,x_2,...,x_D)$, where $D = H \times W$, and $x_1$ corresponds to $p_{11}$ and $x_D$ to $p_{HW}$.
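
The mapping between a pixel grid and its vector form can be sketched on a tiny toy image. The sketch assumes row-major ordering (NumPy's default for `reshape`), which is an assumption here, although it is consistent with the `reshape(28,28)` call used in this notebook to display an image.

```python
import numpy as np

H, W = 3, 4
# Toy 8-bit grayscale image with distinct pixel values (made-up)
img = np.arange(H * W, dtype=np.uint8).reshape(H, W)

# Flatten to a D = H * W vector, row by row (row-major order assumed)
x = img.reshape(-1)

# One-based grid point (i, j) maps to one-based vector index k = (i - 1) * W + j
i, j = 2, 3
k = (i - 1) * W + j
print(k, x[k - 1], img[i - 1, j - 1])

# Reshaping back recovers the original grid
restored = x.reshape(H, W)
```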

In this question, we use a subset of the [Fashion MNIST Dataset](https://github.com/zalandoresearch/fashion-mnist), which contains images of fashion products from ten categories (e.g. T-shirt and trousers). The ten categories are represented as integer numbers ($0,\ldots,9$) and they are referred to as classes. There are 1000 training instances and 200 test instances per class. Each instance is a 28-by-28 grayscale image. Note that you will find some errors (e.g. incorrect labels) in the data set, but we use the data set as it is.
Load the data and apply some pre-processing in the following manner in your code.

***Loading data:***
Make sure that you have the data set file "dset_q2.mat" in your environment and run the following cell to load the data set.

In [None]:
# Load the data set and apply some preprocessing

Xtrn_org, Ytrn_org, Xtst_org, Ytst_org = load_q2_dataset()

Xtrn = np.copy(Xtrn_org) / 255.0   # Training data : (10000, 784)
Xtst = np.copy(Xtst_org) / 255.0   # Testing data : (2000, 784)
Ytrn = np.copy(Ytrn_org)           # Labels for Xtrn : (10000,)
Ytst = np.copy(Ytst_org)           # Labels for Xtst : (2000,)
Xmean = np.mean(Xtrn, axis=0)
Xtrn_mn = Xtrn - Xmean; Xtst_mn = Xtst - Xmean  # Mean-normalised versions of data


You can display the image of the fourth instance in **Xtrn** in the following manner, for example. Run the following cell.

In [None]:
plt.figure(figsize=(1.0,1.0)) # You could try a much larger fig size
plt.imshow(Xtrn[3,:].reshape(28,28), cmap=plt.cm.gray_r);
# plt.grid(lw=1, ls=':')
# plt.axis('off')

# ========== Question 2.1 --- [5 marks] ==========
[Code] For each class, display the grayscale images of the first five instances of the class in the training set **(Xtrn,Ytrn)**, where you should follow the specifications shown below.
- You will display a total of 50 images, which should be displayed in a 10-by-5 grid,  where a grid point $(i,j), i=0,\ldots,9, j=0,\ldots,4$,  displays the image of the $j$-th instance of class $i$.  Note that we use zero-based numbering here.
- Use plt.imshow to display an image.
- Specify the figure size by plt.figure(figsize=(10, 20)).
- The image of each instance should be displayed properly in the right orientation.
- For each image, you should display the class number and the instance number in **Xtrn**, for which you could use **pyplot.title**. For example, if the first instance of class 0 is held in **Xtrn[21,:]**, the instance number is 21, so that "C0: 21" (or "0: 21") may be the information you should display.

\pagebreak
## Your answers for Question 2.1

In [None]:
# Your code goes here

\pagebreak

# ========== Question 2.2 --- [11 marks] ==========

You may have noticed that there is a wide variety of images in each class. We now would like to display the images of representative instances for each class in the training data set **(Xtrn, Ytrn)**. To that end, we apply k-means clustering with $k = 6$ to each class. Instead of displaying the image of each cluster centre, which would look blurred due to averaging, we display the image of the instance that is closest to the centroid (i.e. cluster centre) as the representative of the cluster. We also display the mean image (i.e. the image of the mean vector) of each class.

[Code] Following the specifications shown below, display the result.
- For clustering, use sklearn's **KMeans** with the default parameters except that you specify **n_clusters=6** and **random_state=0**. Note that the two parameters should be specified explicitly when you run clustering for each class.
- You will display a total of 60+10=70 images, which should be displayed in a 10-by-7 grid. Each row corresponds to a class. The grid point $(i, 0)$ displays the mean image of class $i$ data, and the grid point $(i, j), j=1,\ldots,6$  displays the image of the representative of cluster $j$-1 for class $i$. Clusters should be sorted in increasing order in terms of the Euclidean distance to the centre of the class (i.e. the mean of the instances in the class), so that the column $j$=1 corresponds to the cluster that is closest to the class centre, whereas column $j$=6 to the one that is farthest from the class centre. Note that we use zero-based numbering.
- For each image of an instance, display the class number ($c$), the number of instances ($m$) in the cluster, and the instance number ($\ell$) in the training data set, in the format of "C{$c$} [{$m$}] {$\ell$}". For example, "C2 [165] 9734" represents $c$=2, $m$=165, and $\ell$=9734.
- Use a large figure size for plotting, e.g. plt.figure(figsize=(16,20)).
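
The two geometric steps described above (ordering clusters by the distance of their centre to the class mean, and picking the instance closest to each centre) can be sketched with NumPy alone. The data, the cluster centres, and the labels below are made-up stand-ins for what a clustering step would return; they are not real k-means output.

```python
import numpy as np

# Toy 2-D data for one class (made-up values)
X_cls = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.5, 4.8], [9.0, 0.0]])
# Hypothetical cluster centres and cluster labels (assumed, not computed here)
centres = np.array([[0.05, 0.0], [9.0, 0.0], [5.1, 5.0]])
labels = np.array([0, 0, 2, 2, 1])

class_mean = X_cls.mean(axis=0)

# Sort clusters by Euclidean distance of their centre to the class mean
order = np.argsort(np.linalg.norm(centres - class_mean, axis=1))

reps = []
for c in order:
    members = np.flatnonzero(labels == c)                 # indices of cluster members
    d = np.linalg.norm(X_cls[members] - centres[c], axis=1)
    rep = int(members[np.argmin(d)])                      # member closest to the centre
    reps.append(rep)
    print(f"cluster {c}: size {members.size}, representative index {rep}")
```

Restricting the search to the members of each cluster keeps the representative inside its own cluster; when the data passed in are a per-class subset, the index found still has to be mapped back to the instance number in the full training set.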

\pagebreak
## Your answers for Question 2.2

In [None]:
# Your code goes here

\pagebreak

# ========== Question 2.3 --- [7 marks] ==========
1. [Code] Apply Principal Component Analysis (PCA) to the data of **Xtrn_mn** using sklearn's **PCA** and show the variances of projected data for the first five principal components. 
2. [Code] Plot a graph of the cumulative explained variance ratio $r_i$ as a function of the number of principal components, $i$, where $1 \le i \le D$, $r_i$ is defined as follows, $\lambda_j$ is the variance of the data projected on the $j$-th principal component, and $D$ is the number of dimensions of the data.<br>
> $$ r_i = \frac{\sum_{j=1}^i \lambda_j}{\sum_{j=1}^D \lambda_j}$$
3. [Code] Find the minimum number of principal components required to explain 50%, 60%, 70%, 80%, 90%, and 95% of the total variance, respectively.
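
The definition of $r_i$ can be checked on a handful of toy eigenvalues. The values of `lam` below are made up, and `min_components` is a hypothetical helper that assumes the threshold does not exceed 1.

```python
import numpy as np

# Toy eigenvalues (variances along principal components), in decreasing order
lam = np.array([4.0, 3.0, 2.0, 1.0])

# Cumulative explained variance ratio r_i for i = 1, ..., D
r = np.cumsum(lam) / np.sum(lam)

def min_components(r, threshold):
    """Smallest one-based i such that r_i >= threshold (assumes threshold <= 1)."""
    return int(np.argmax(r >= threshold)) + 1

for t in (0.5, 0.8, 0.95):
    print(t, min_components(r, t))
```

`np.argmax` on a boolean array returns the position of the first `True`, which is why adding 1 gives the smallest number of components that reaches the threshold.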


\pagebreak
## Your answers for Question 2.3

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

In [None]:
#(3) Your code goes here

\pagebreak

# ========== Question 2.4 --- [10 marks] ==========
We now consider a simple application of PCA, in which we (as sender A) apply dimensionality reduction to image samples and send them to someone (as receiver B) who tries to reconstruct the samples from the dimensionality-reduced samples. The underlying assumption is that both parties, A and B, share the same set of principal components (i.e. eigenvectors) and the mean vector (**Xmean**) in advance.
You will expect some degradation in the reconstructed images.
1. [Code] Follow the instructions shown below.
- Apply PCA to the whole **Xtrn_mn** at first to find all principal components. 
- For each class and for each number of principal components $K = 5,20,50,200,400$, apply the dimensionality reduction to the first instance in the class.
- Display the reconstructed images and the original images in a 10-by-6 grid, where each row corresponds to a class (in increasing order), the first five columns show the reconstructed images for the five values of $K$ (in increasing order), and the last column shows the original image.
- Note that you should add **Xmean** to each reconstructed data to display the corresponding image.
2. [Text] Explain your findings briefly.
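
The sender/receiver idea can be sketched on random toy data using NumPy's SVD in place of a PCA class. Everything here (the data, the dimensions, and the helper `reconstruct`) is an illustrative assumption rather than the coursework setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))        # toy data: 50 samples, 6 dimensions
mean = X.mean(axis=0)
Xc = X - mean                       # mean-normalised data

# Rows of Vt are the principal directions, ordered by decreasing variance
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruct(x_centred, K):
    """Project onto the first K principal directions, then map back and add the mean."""
    W = Vt[:K]                      # (K, D) matrix shared by sender and receiver
    z = x_centred @ W.T             # K-dimensional code sent by A
    return z @ W + mean             # reconstruction computed by B

# Reconstruction error of the first sample for increasing K
errors = [float(np.linalg.norm(X[0] - reconstruct(Xc[0], K))) for K in (1, 3, 6)]
print(errors)
```

The error cannot increase as $K$ grows, and it vanishes (up to rounding) when all components are kept.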

\pagebreak
## Your answers for Question 2.4

In [None]:
#(1) Your code goes here

#(2) ***Your text goes here***

\pagebreak

# ========== Question 2.5 --- [6 marks] ==========

We now would like to know how the training data **Xtrn_mn** are distributed in a vector space. To visualise the distribution, we reduce the dimensionality of the data to two dimensions using PCA and plot the dimensionality-reduced data on the two-dimensional plane spanned by the first two principal components. Note that each instance in the data set is now displayed as a single point on the plane.
1. [Code] Plot all the training instances (**Xtrn_mn**) on the two-dimensional PCA plane, where each instance is displayed as a small point with a colour specific to the class of the instance. Use the 'tab10' colormap for plotting (i.e. cmap="tab10"), and adjust the marker size so that points do not overlap each other very much.
2. [Text] Give comments on the separation of the classes, and explain your findings briefly.


\pagebreak
## Your answers for Question 2.5

In [None]:
#(1) Your code goes here

#(2) ***Your text goes here***

\pagebreak

# ========== Question 2.6 --- [8 marks] ==========

We consider applying multiclass classification to the data set. Make sure that you use **Xtrn_mn** for training and **Xtst_mn** for testing. 
1. [Code] Carry out a classification experiment using sklearn's **LogisticRegression** with "random_state=0", and report the classification accuracy and confusion matrix for the training set and test set respectively. Use sklearn's **ConfusionMatrixDisplay** to display the confusion matrix. Note that you may ignore a warning message in the training.
2. [Code] Run a classification experiment with SVM and report the classification accuracy and confusion matrix for the training set and test set respectively. Use sklearn's **SVC** with "random_state=0".
3. [Text] Based on the results obtained in 1 and 2, explain your findings and give brief discussions.

\pagebreak
## Your answers for Question 2.6

In [None]:
#(1) Your code goes here

In [None]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 2.7 --- [18 marks] ==========

This is a mini project, in which you are asked to improve the classification accuracy for the logistic regression model as much as possible from the one obtained in Question 2.6. 
1. [Text] Discuss possible approaches, and decide which one(s) to implement. Note that you should stick to the multinomial logistic regression model and should not use other classification models.
2. [Code and Text] Implement the approach you have chosen, carry out a classification experiment and report accuracy for the training set and test set respectively. Note that training and parameter tuning should be done on the training set and not on the test set. If you run parameter tuning, show and explain the result clearly.
3. [Text] After a quick investigation of the results, report your findings and give brief discussions.

\pagebreak
## Your answers for Question 2.7

#(1) Your text goes here

In [None]:
#(2) Your code and text goes here

#(3) Your text goes here