<a href="https://colab.research.google.com/github/K-Martin-RGULec/CM1101_Week_2/blob/main/W2_Lab_Instance_Based_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 - Instance-Based Learning

##Lab Exercises

---

<br>

###Task 1: Pseudo-Code for the kNN Algorithm

As your first task for this week, consider what you have learned from the lecture regarding the kNN algorithm. Use that thinking to write simple pseudo-code for the kNN algorithm for the following problems.

<br>

####Task 1.1

Imagine you are creating an instance-based learner to estimate the prices of new products in a production line. The features describing these products include a list of their components, the cost of these components, and the total cost-to-manufacture. The label is the Recommended Retail Price (RRP) of the product. The owner of the production line has recently revealed a new product, but is not sure what price to recommend. Using your knowledge of kNN, write pseudo-code to describe the process by which the price could be decided based on the **single most similar product** that the production line currently manufactures.

<br>

####Task 1.2
Adapt the pseudo-code that you have written above to demonstrate how the pricing of the new product could be based on **an average of the RRP of the three most similar products** the company currently makes.



####Answers for Task 1.1

<br>

*Please use this cell to enter your answers for the exercise.*

####Answers for Task 1.2

<br>

*Please use this cell to enter your answers for the exercise.*


---

###Task 2: Implementing the kNN Algorithm

<br>

The second task for this week is to implement the kNN algorithm. The next few tasks will talk you through the process of building an implementation using the scikit-learn framework.

<br>

####Task 2.1: Loading a Dataset

The first step will be to load a dataset. Try downloading the iris.data file from the Module Resources link on CampusMoodle, or the following URL: https://archive.ics.uci.edu/ml/datasets/iris.

To load the dataset, try the following code. Please note that when you run the code, you will be prompted to select a file from your computer to upload, so make sure that you have fully downloaded the iris dataset before starting.

In [2]:
import pandas as pd #we will use pandas to view data in dataframes
import io #we use io to read the data after uploading (as colab stores uploaded files in a dictionary)
from google.colab import files #finally, we import the files package from google.colab framework to be able to upload files

uploaded = files.upload() #we then call the upload() function to give us the opportunity to load a file into our script

Saving iris.data to iris.data


In [20]:
iris_file = io.BytesIO(uploaded['iris.data']) #the file is uploaded as a value in a dictionary, with the filename as the key
iris_df = pd.read_csv(iris_file, header=None) #after getting the file to a format we can work with, we use pandas to read the data as a dataframe
print(iris_df) #calling print allows us to view the dataframe

       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]


The dataframe tells us some useful information about the iris dataset we have just downloaded. For example, we can see that there are 5 columns (4 features and 1 label) and 150 rows (as there is no header, this means there are 150 examples in our dataset).
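If you would rather check these facts in code than read them off the printout, the sketch below shows one way. It is only an illustration: `sample_csv`, `sample_df` and `class_counts` are made-up names, and the four rows are copied from the printed dataframe above so that the cell runs without uploading anything.

```python
import io
import pandas as pd

# Four rows copied from the printout above, laid out like the lines of iris.data
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
    "5.9,3.0,5.1,1.8,Iris-virginica\n"
)

# header=None tells pandas the first line is data, so columns are numbered 0 to 4
sample_df = pd.read_csv(io.StringIO(sample_csv), header=None)

n_rows, n_cols = sample_df.shape             # number of examples, number of columns
class_counts = sample_df[4].value_counts()   # how many examples carry each label

print(n_rows, n_cols)
print(class_counts)
```

Running the same two calls (`.shape` and `[4].value_counts()`) on `iris_df` should tell you how many examples there are in total and how many belong to each class.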

Storing data in dataframes is really useful, as dataframes have some inbuilt functionality to help you deal with missing values. For this exercise we will convert the dataframe to NumPy arrays to keep things simple.

In [24]:
import numpy as np #import numpy to enable array functionality

x = iris_df[[0,1,2,3]].to_numpy() #next we convert the feature columns into a 2D numpy array (one row per example)
y = iris_df[4].to_numpy() #swiftly followed by the label column as a 1D array (single brackets), which is the shape scikit-learn expects for labels
#print(x) #feel free to uncomment and check the output
#print(y)
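A quick note on array shapes, since they matter when fitting a model: selecting a column with double brackets gives a two-dimensional array, while single brackets give a one-dimensional one, and scikit-learn expects labels to be one-dimensional. The sketch below uses a tiny made-up dataframe (`toy_df` is a hypothetical name, not part of the lab data) to show the difference.

```python
import pandas as pd

# A tiny stand-in dataframe: two feature columns (0 and 1) and a label column (4)
toy_df = pd.DataFrame({
    0: [5.1, 6.7],
    1: [3.5, 3.0],
    4: ["Iris-setosa", "Iris-virginica"],
})

labels_2d = toy_df[[4]].to_numpy()  # a list of column names keeps a column axis
labels_1d = toy_df[4].to_numpy()    # a single column name gives a flat array
flattened = labels_2d.ravel()       # ravel() flattens a column array after the fact

print(labels_2d.shape, labels_1d.shape, flattened.shape)  # (2, 1) (2,) (2,)
```

If you ever see a warning about a column-vector `y` when fitting, checking `y.shape` is a good first step.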

####Task 2.2 - Implementing kNN in scikit-learn

Now that we have our dataset, we can implement a kNN classifier using the scikit-learn library: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


In [31]:
from sklearn.neighbors import KNeighborsClassifier #import the relevant tools from the library

model = KNeighborsClassifier(n_neighbors = 1) #then we instantiate a k nearest neighbour model, with a k value of 1 using the n_neighbors parameter
model.fit(x,y) #we then fit our data and labels to the model (remember, there is no 'training' for kNN - the model simply stores the examples)
prediction = model.predict([[0.3,0.4,0.5,0.6]]) #our model is now ready to make predictions. predict expects a 2D list (one inner list per example) and returns an array with one prediction per example
print(prediction) #voila!

['Iris-setosa']
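To see roughly what happens inside a 1-nearest-neighbour prediction, here is a minimal sketch written with NumPy only. It is not how scikit-learn implements the search internally. It assumes Euclidean distance, `nearest_neighbour_label` is a made-up helper name, and the four stored examples are copied from the printed dataframe above.

```python
import numpy as np

def nearest_neighbour_label(train_x, train_y, query):
    """Return the label of the stored example closest to the query (Euclidean distance)."""
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))  # one distance per stored example
    return train_y[np.argmin(distances)]                       # label of the smallest distance

# Four stored examples copied from the printed dataframe
toy_x = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.7, 3.0, 5.2, 2.3],
    [5.9, 3.0, 5.1, 1.8],
])
toy_y = np.array(["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"])

query = np.array([0.3, 0.4, 0.5, 0.6])  # the same made-up flower as in the cell above
result = nearest_neighbour_label(toy_x, toy_y, query)
print(result)  # Iris-setosa
```

The second stored example is the closest to the query here, so its label is returned.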




So we can make predictions with our model - great! Only one problem - how do we know if the prediction is accurate? In the previous example, I just made up the numbers. That is not a robust way to decide if the model is good or not.

To decide that, we need to use an evaluation methodology. An evaluation methodology will ensure that the model can be evaluated with scientific rigour, giving us the capability to describe how well our model performs.

In this lab, we will use the **cross-fold validation** evaluation methodology (also known as k-fold cross-validation). Cross-fold validation effectively divides your dataset into non-overlapping train and test sets. These train and test sets are completely separate from one another, allowing the train set to be used for model learning and the test set for model evaluation.
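A small sketch can make the "non-overlapping" idea concrete. The toy data below (`toy_x`, `toy_y`) is made up purely for illustration: six examples, three from each of two classes, split into three folds with scikit-learn's `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

toy_x = np.arange(12).reshape(6, 2)       # six made-up examples with two features each
toy_y = np.array([0, 0, 0, 1, 1, 1])      # three examples per class

folds = list(StratifiedKFold(n_splits=3).split(toy_x, toy_y))

for train_idx, test_idx in folds:
    # no index should ever appear in both the train set and the test set of a fold
    overlap = set(train_idx) & set(test_idx)
    print("train:", train_idx, "test:", test_idx, "overlap:", len(overlap))
```

Each example lands in a test set exactly once across the three folds, and each test set contains one example of each class.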

The next question may then be, well how do we know our model did not just get really lucky and pick an easy test set? We don't know the answer to that - if we only perform the evaluation once. In cross-fold validation, the process of splitting into train and test is repeated a set number of times (i.e. the number of 'folds'), with a different portion of the data held out as the test set each time. This ensures that the testing is rigorous - let's try it out!

In [45]:
from sklearn.model_selection import StratifiedKFold #cross fold is sometimes called k-fold. The stratified version keeps the class proportions in each fold roughly the same as in the full dataset
from sklearn.metrics import accuracy_score #import an accuracy metric to tell us how well the model is doing

acc_score = [] #create a list to store the accuracy values
model2 = KNeighborsClassifier(n_neighbors=1) #instantiate the model

kf = StratifiedKFold(n_splits=5) #we instantiate the kfold instance, and set the number of folds to 5
for train, test in kf.split(x,y): #we use a for loop to iterate through each fold using the train and test indexes from the dataset

  x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test] #things can get a bit weird when inputting indexes to functions, so let's save them as variables
  #print(train)
  #print(test) #this will print the train and test indexes respectively, if you want to be sure they do not overlap

  model2.fit(x_train, y_train) #we then only fit the training data
  predictions = model2.predict(x_test) #and predict on the test data. This must be model2 - the earlier model was fitted on every example, so it has already seen the test data
  acc = accuracy_score(y_test, predictions) #accuracy_score takes the true labels first, then the predictions
  acc_score.append(acc) #we can append it to our list

print(acc_score) #one accuracy value per fold - a perfect 1.0 on every fold would be a hint that the model has seen the test data


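Writing the loop by hand is a good way to understand what is going on, but scikit-learn also offers `cross_val_score`, which wraps the same split, fit, predict and score steps in one call. The sketch below is only an illustration and makes one assumption: it loads the copy of iris that ships with scikit-learn (`load_iris`) instead of the uploaded file, so its labels are the integers 0, 1 and 2 and not the species names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_x, iris_y = load_iris(return_X_y=True)   # bundled copy of iris: 150 examples, 4 features

cv = StratifiedKFold(n_splits=5)              # the same 5-fold stratified split as in the loop above
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=1),      # a fresh model is fitted for every fold
    iris_x,
    iris_y,
    cv=cv,
    scoring="accuracy",
)

print(scores)         # one accuracy value per fold
print(scores.mean())  # a single summary number for the whole evaluation
```

Reporting the mean (and ideally the spread) across folds gives a fairer picture than any single fold on its own.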




---


###Task 3 - Try it Out

<br>

Now that you have had an opportunity to see kNN at work, the time has come to try it out yourself!

<br> 

####Task 3.1 - Implement on Wine Dataset
Start off by trying out what you have learned on the wine dataset. The dataset is available at: https://archive.ics.uci.edu/ml/datasets/wine

In [None]:
#Type your code for Task 3.1 here

####Task 3.2 - Experimenting with Wine

Now that you have implemented the original kNN code on the wine dataset, why not try out some changes? Specifically:

*   What happens if you increase the number of neighbours?

*   What happens if you change the distance metric?

*   What happens if you increase the number of neighbours and add distance weighting into the equation?


Feel free to modify the above code, or add another cell below.
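If you are not sure where to start, the sketch below shows which `KNeighborsClassifier` parameters control each of the changes listed above: `n_neighbors`, `metric` and `weights`. Treat it as a starting point and not a model answer. The values chosen are arbitrary, the `settings` and `results` names are made up, and it assumes the copy of the wine data bundled with scikit-learn (`load_wine`) is an acceptable stand-in for the file you downloaded.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

wine_x, wine_y = load_wine(return_X_y=True)   # bundled copy of the wine data
cv = StratifiedKFold(n_splits=5)

# Each entry changes one thing relative to the 1-nearest-neighbour baseline
settings = {
    "baseline (k=1)": dict(n_neighbors=1),
    "more neighbours (k=5)": dict(n_neighbors=5),
    "manhattan distance (k=5)": dict(n_neighbors=5, metric="manhattan"),
    "distance weighting (k=5)": dict(n_neighbors=5, weights="distance"),
}

results = {}
for name, params in settings.items():
    knn = KNeighborsClassifier(**params)
    results[name] = cross_val_score(knn, wine_x, wine_y, cv=cv).mean()  # mean accuracy over 5 folds
    print(name, round(results[name], 3))
```

Try a wider range of values for `n_neighbors` and compare the mean accuracies before drawing any conclusions.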
