# MA5634/5663 - Fundamentals of Machine Learning

## Assignment 2022/23 (First Sitting)

This assignment carries 40% of the marks, with the remaining 60% coming from
the unseen exam.

You should refer to the project brief for further details relating to this
assignment. 

The key instructions **YOU MUST** adhere to are as follows:

- Enter your 7-digit student ID as the value of `ID` in the next cell. Ignore
the backslash (if it is present) and any numbers that follow it.

- All other prepopulated cells in this notebook should be left untouched.

- It will be clear below which parts of this notebook contain code that
produces data that you should use for your submission.

- It will also be clear in which cells you should enter your submitted work.

- Feel free to create more cells.

>**REQUIREMENT:** This notebook will be assessed by executing it sequentially 
from the top down and in one session. It must run to completion and without
error.

>**NOTE: If you change a variable's value in a cell lower down the notebook and
then execute a cell near the top that uses a variable with the same name,
the new and unwanted value will be used. This can cause bugs.**

>**REMEDY:** always execute your Jupyter notebook from the top down. An easy 
way to do this is to select _Run All Above_ from the *Cell* menu. This will 
ensure that code further down does not affect the present cell.

>**NOTE:** you will be asked to discuss results in your report. Note that due
to the randomization in the `sklearn` routines you may not always get the 
same results. For this reason it is acceptable to quote the results of a 
specific run in your report. However, make sure that these results are truly
representative of a typical run and not just an outlier.
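
One way to make a quoted run repeatable is to fix the seed of the randomized routine. The sketch below assumes that `train_test_split` from `sklearn` is the source of the randomness; the toy arrays and the seed value `42` are illustrative only and are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 2 features, alternating labels
X = np.arange(40).reshape(20, 2)
y = np.tile([0, 1], 10)

# Fixing random_state makes the shuffle, and hence the split, repeatable
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('Identical splits with the same seed:', same_split)
```

Repeating the run with a few different seeds is a simple way to judge whether a quoted result is typical.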


## ENTER YOUR 7-digit STUDENT ID HERE ...

In [1]:
ID = 2309765  # replace this number with your 7-digit ID

In [2]:
# standard imports
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# TASK 1

You will create a $k$-NN binary classifier. 

You will be given a subset of the feature data from $569$ breast cancer test results. This subset is generated, along with personalized values of $k$ and $p$ (for the $p$-norm) for the $k$-NN method, by the *untouchable* code below.
Why *untouchable*? Because **that code should not be altered in any way**.
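
For reference, the $p$-norm is assumed here to be the usual Minkowski norm, which is what the `p` parameter of `KNeighborsClassifier` in `sklearn` refers to when the default `metric='minkowski'` is used. Under that assumption the distance between two feature vectors $x, y \in \mathbb{R}^n$ is

$$
\|x - y\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p},
$$

so that $p=1$ gives the Manhattan distance and $p=2$ the Euclidean distance.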

In this notebook for Task 1

- Extract your data and check for invalid entries.
- Select a suitable train/test split fraction and give the sizes of the resulting data sets.
- Use the $k$-Nearest Neighbours method from `sklearn` to classify a breast cancer
testing result as *benign* or *malignant*. 
- Plot the confusion matrix.
- Give the accuracy score
- Estimate the probability that the test is positive (malignant) given that the classifier predicts that it is negative (benign). Denote this as $\mathrm{Prob}(P\mid-)$.
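
The steps listed above can be sketched as follows. This is a minimal illustration on synthetic data, not a model answer: the data set from `make_classification`, the 25% test fraction and the seeds are all assumptions, and `k` and `p` merely stand in for the personalized `kn` and `pn`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the feature data (hypothetical, not the assignment data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold back 25% of the samples as unseen test data (illustrative fraction)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
print('Train/test sizes:', X_train.shape[0], X_test.shape[0])

# k and p play the roles of the personalized kn and pn values
k, p = 3, 3
knn = KNeighborsClassifier(n_neighbors=k, p=p)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print('Accuracy:', acc)
print('Confusion matrix:\n', cm)
```

The matrix `cm` can then be plotted, for example with `sns.heatmap(cm, annot=True)`.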

In your report for Task 1

- Give a short overview of the $k$-NN method and explain its main features and hyperparameters.
- Explain your choice of **train/test** split.
- Explain how you calculated $\mathrm{Prob}(P\mid-)$.
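
As an illustration of the last point, $\mathrm{Prob}(P\mid-)$ can be estimated from the entries of a confusion matrix. The counts below are made up for the example, and the layout assumes the `sklearn` convention (rows are true classes, columns are predicted classes) with the negative class listed first.

```python
import numpy as np

# Hypothetical confusion matrix counts, laid out as sklearn does:
# rows are the true class, columns the predicted class,
# ordered (negative/benign, positive/malignant)
cm = np.array([[85, 5],
               [4, 49]])
tn, fp, fn, tp = cm.ravel()

# Of all samples predicted negative (tn + fn), the fraction that were
# actually positive estimates Prob(P | -)
prob_pos_given_neg = fn / (tn + fn)
print(f'Estimated Prob(P | -) = {prob_pos_given_neg:.4f}')
```

When the targets are strings, `confusion_matrix` orders the classes alphabetically unless its `labels` argument is given, so it is worth checking which row and column correspond to which class before reading off the entries.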

**- - DO NOT ALTER THE CONTENTS OF THE NEXT CELL(S) IN ANY WAY - -**

In [4]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import random

# Load the breast cancer dataset
data = load_breast_cancer()

# Create a data frame, using the feature data as column headings
dfbigbc = pd.DataFrame(data.data, columns=data.feature_names)

# Add a target column at the end, and fill it with the target data
dfbigbc['target'] = data.target   # target 0/1 means malignant/benign

# Make the 1/0 more user friendly: taken from (24 feb 2023)
# https://www.datacamp.com/tutorial/principal-component-analysis-in-python
dfbigbc['target'].replace(0, 'Benign', inplace=True)
dfbigbc['target'].replace(1, 'Malignant', inplace=True)

# This dataset has a lot of features - we'll work with a subset
print('Number of Original Features: ', len(data.feature_names))
print('Original data Frame shape: dfbigbc.shape = ', dfbigbc.shape)
# set a random seed dependent on the student ID
random.seed(ID+30)
# get a list of integers indexing feature columns 0,1,2,...,29
nums = list(range(0,30))
# shuffle them randomly and add the target column index on at the end
random.shuffle(nums)
newnums = nums[0:5]
newnums.append(30)
print(f'We will work only with the features in columns: {newnums}')
dfbc = dfbigbc.iloc[:,newnums]
print('These features are ...')
print(list(dfbc))
# get personalized algorithm parameters
kn = random.randint(3, 8)
pn = random.randint(1, 10)
print('Specific Personal Values for Task 1')
print(f' - Number, k, to use in k-NN:      {kn}')
print(f' - Value of p for the norm ||.||p: {pn}')
print('Items in the target columns: ', dfbc.target.unique())
dfbc.head()

Number of Original Features:  30
Original data Frame shape: dfbigbc.shape =  (569, 31)
We will work only with the features in columns: [0, 15, 27, 11, 29, 30]
These features are ...
['mean radius', 'compactness error', 'worst concave points', 'texture error', 'worst fractal dimension', 'target']
Specific Personal Values for Task 1
 - Number, k, to use in k-NN:      3
 - Value of p for the norm ||.||p: 3
Items in the target columns:  ['Benign' 'Malignant']


| | mean radius | compactness error | worst concave points | texture error | worst fractal dimension | target |
|---|---|---|---|---|---|---|
| 0 | 17.99 | 0.04904 | 0.2654 | 0.9053 | 0.1189 | Benign |
| 1 | 20.57 | 0.01308 | 0.186 | 0.7339 | 0.08902 | Benign |
| 2 | 19.69 | 0.04006 | 0.243 | 0.7869 | 0.08758 | Benign |
| 3 | 11.42 | 0.07458 | 0.2575 | 1.156 | 0.173 | Benign |
| 4 | 20.29 | 0.02461 | 0.1625 | 0.7813 | 0.07678 | Benign |


You now have access to a data frame `dfbc`, and above you will find
your values of $k$ and $p$.

**- - SUBMIT YOUR WORK FOR TASK 1 IN THE CELL(S) BELOW - -**

**- - CREATE MORE CELLS AS NEEDED - -**

# TASK 2

This task is a continuation of Task 1 and involves PCA (*Principal Component Analysis*).
You should use the data subset and personalized values from above. Also, after executing
the *untouchable* code below you will get a personalized value of `nc`. This is the number
of principal components you should use.

In this notebook for Task 2

- Use PCA to analyze the variance in your training data. You may use
`sklearn` for this or work from basic principles.
- How many principal components are there?
- Produce a plot or bar graph of the explained variance percentages for all components.
- Perform PCA to compress your training data using `nc` components. You may use
`sklearn` for this or work from basic principles.
- Produce a plot or bar graph of the explained variance percentages for the `nc` component(s).
- Re-run the $k$-NN method using the data compression resulting from choosing
just `nc` principal components.
- Obtain the accuracy score and a confusion matrix as above.
- *You must use the same training and test data as above.*
- *You must adhere to the principle that the __test data is regarded as unseen__.*
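
A minimal sketch of PCA "from basic principles" is given below, assuming an eigen-decomposition of the sample covariance matrix is an acceptable route. The data are synthetic, and `nc = 1` simply stands in for the personalized value. The point to note is that the mean and the principal directions come from the training data only and are then reused on the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and test features (hypothetical data)
mix = rng.normal(size=(5, 5))
X_train = rng.normal(size=(100, 5)) @ mix
X_test = rng.normal(size=(30, 5)) @ mix

# Centre using the TRAINING mean only, so the test data stays unseen
mu = X_train.mean(axis=0)
A = X_train - mu

# Eigen-decomposition of the sample covariance matrix
cov = A.T @ A / (A.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # re-order to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Percentage of the total variance explained by each component
explained_pct = 100 * eigvals / eigvals.sum()
print('Explained variance (%):', np.round(explained_pct, 2))

# Compress to nc components; the test data reuses mu and the training directions
nc = 1  # stands in for the personalized value
W = eigvecs[:, :nc]
Z_train = A @ W
Z_test = (X_test - mu) @ W
print('Compressed shapes:', Z_train.shape, Z_test.shape)
```

Whether to standardize the features before PCA is a separate modelling choice; if a scaler is used, it should likewise be fitted on the training data only.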

In your report for Task 2

- Give a short overview of PCA. Include main concepts and formulae as necessary but
do not give proofs or derivations.
- Explain how much variance is captured by your value of `nc`.
- Discuss the results in terms of accuracies and confusion matrices. Are they comparable?
Do you recommend the use of just `nc` principal components for this model? Feel free to use probabilistic
arguments to elicit the advantages and disadvantages.
- Don't spell *principal* as *principle*. This will be an unconditional fail and you
will be asked to leave the Earth for ever. (Just Kidding!)


**- - DO NOT ALTER THE CONTENTS OF THE NEXT CELL(S) IN ANY WAY - -**

In [11]:
nc = random.randint(1, 4)
print(f'You should use {nc} principal components for you data compression')
print('The variable nc should be used for this')

You should use 1 principal components for you data compression
The variable nc should be used for this


After executing the untouchable cell above you will see how many 
principal components - nc - you should use in your analysis below.

**- - SUBMIT YOUR WORK FOR TASK 2 IN THE CELL(S) BELOW - -**

**- - CREATE MORE CELLS AS NEEDED - -**

# TASK 3

You will compress daily stock data by performing a
*Singular Value Decomposition* (SVD). 
You will use the SVD transformation to add
additional data and illustrate the augmented data set graphically.

The untouchable code below will set up the dataframes for you but you will need to
obtain the CSV files from Brightspace. They are called `TSLAhistory.csv`
and `TSLAupdate.csv`.

Once you have executed the code below you will have access to two dataframes.
**This code should not be altered**.

The data frame in `dfth` will contain historical data for the TESLA share price. 
A set of more recently acquired data is in `dftu`. The real-world situation
we are simulating here is that you have an initial download of data, and you have
performed an SVD on it so that you can select the dominant transformed components
and use those as a **training set** for your machine learning tools. An updated set
of data arrives. These data points are **unseen** as far as your analysis tools are
concerned and so can be designated as a test set. However, your codes have been
trained on SVD-transformed data and so the test set needs also to be transformed
to be compatible.
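
The idea in the paragraph above can be sketched as follows. The precise definition of the transformed sets is not fixed here, so this sketch makes an assumption: rows are samples, and the transformed data are the coordinates in the first `c` right singular vectors of the training data. The synthetic arrays are also illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-ins (hypothetical): rows are samples, columns are features
X_train = rng.normal(size=(40, 5))
X_test = rng.normal(size=(10, 5))

# The SVD is computed from the training data only
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)

c = 2
Vc = Vt[:c, :].T            # first c right singular vectors, shape (5, c)

# Training data in the transformed coordinates: X_train @ Vc = U_c @ diag(s_c)
Kc = X_train @ Vc
# The unseen test data is mapped with the SAME Vc, never with its own SVD
Qc = X_test @ Vc

print('Kc shape:', Kc.shape, ' Qc shape:', Qc.shape)
```

Other conventions are possible (for example, storing samples as columns), so the shapes and formulae should be adapted to whichever definition is used in the course.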


In this notebook for Task 3

- Run the untouchable code. Check the data is *clean*. If it isn't then clean it up.
- Use *seaborn* and `sns.pairplot` to create a pair plot for `dfth`. 
- Produce a combined scatter plot of *Volume* vertically against *Open* horizontally with both data sets but distinguished by colour.
- Select training data, `X_train`, from `dfth` using all columns except *Date* and *Adj Close*. Select the test data, `X_test`, from `dftu` in the same way.
- Perform an SVD of this training data and determine the rank of the data set.
- Create a (logarithmic) scree plot from the singular values. 
- Create `Xc_train`, an SVD-compressed version of the training data formed by taking just the first `c` dominant singular components.
- Use `linalg.norm(X_train - Xc_train)` from `numpy` to calculate the error in the SVD approximation of `X_train` by `Xc_train`. Plot a graph, or bar chart, of this error against all appropriate values of $c$. 
- Create a compressed training data set using $c=1$ by SVD transformation of `X_train` to a transformed training set, called, for example, `Kc`.
- Create a scatter plot of *Volume* against *Open* with `X_train` and `X_test` on the same set of axes, but in different colours. Make sure that your axes are labelled correctly. You can use, for example,

```
# put both of these in the same cell 
plt.scatter(X_train[:,0], X_train[:,4], color='red')
plt.scatter(X_test[:,0],  X_test[:,4],  color='blue')
plt.xlabel('Open'); plt.ylabel('Volume')
```

- Now create a similar scatter plot but with `Kc` and `X_test`. How does this plot differ from the last? Explain this difference.
- Transform the test data `X_test` to, say, `Qc`.
- Create yet another similar scatter plot but with `Kc` and `Qc`.
- Repeat the construction of these three scatter plots but with $c=2$. Comment on the results. In particular, compare and contrast these plots with the $c=1$ plots.
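
The rank, compression and error steps in the list above can be sketched with `numpy` as below. The random matrix is a hypothetical stand-in for `X_train`, and the rank-`c` approximation shown is the standard truncated SVD, offered as one reasonable reading of "SVD-compressed".

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for X_train (hypothetical): 50 rows, 5 columns
X_train = rng.normal(size=(50, 5))

# Thin SVD: X_train = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
print('Rank:', np.linalg.matrix_rank(X_train))

errors = []
for c in range(1, len(s) + 1):
    # Keep only the first c singular triplets
    Xc_train = U[:, :c] @ np.diag(s[:c]) @ Vt[:c, :]
    errors.append(np.linalg.norm(X_train - Xc_train))  # Frobenius norm by default

print('Approximation error for c = 1..5:', np.round(errors, 4))
```

In the Frobenius norm the error for a given `c` equals the square root of the sum of the squared discarded singular values, which gives a useful check on the plot. A logarithmic scree plot can be drawn with, for example, `plt.semilogy(s, 'o-')`.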


In your report for Task 3

- Give a short overview of the *Singular Value Decomposition* (SVD).
- Refer to your pairplot - discuss its features. Thinking ahead, how many dominant independent components would you expect to lie in these data?
- Give the rank of the data set.
- Give an outline of the mathematical details of your SVD-transformation of `X_train` to `Kc`.
- Give an outline of the mathematical details and a justification for your method of transformation of `X_test` to `Qc`.
- For $c=1$, how does the second scatter plot differ from the first? Explain this difference. 
- For $c=1$, how does the third scatter plot differ from the first two? Explain this difference. 
- For $c=2$, how do these plots change? How do you interpret this change?


**- - DO NOT ALTER THE CONTENTS OF THE NEXT CELL IN ANY WAY - -**

In [21]:
dfth = pd.read_csv("TSLAhistory.csv")
dftu = pd.read_csv("TSLAupdate.csv")

After executing the untouchable cell above you will have access to the
data frames `dfth` and `dftu` for use in your analysis below.

**- - SUBMIT YOUR WORK FOR TASK 3 IN THE CELL(S) BELOW - -**

**- - CREATE MORE CELLS AS NEEDED - -**

In [46]:
print('End of Notebook')

End of Notebook
