# Introductory applied machine learning
# Assignment 3 (Part B): Mini-Challenge [25%]

## Important Instructions

**It is important that you follow the instructions below to the letter - we will not be responsible for incorrect marking due to non-standard practices.**

1. <font color='red'>We have split Assignment 3 into two parts to make it easier for you to work on them separately and for the markers to give you feedback. This is part B of Assignment 3 - Part A is an introduction to Object Recognition. Both Assignments together are still worth 50% of CourseWork 2. **Remember to submit both notebooks (you can submit them separately).**</font>

1. You *MUST* have your environment set up as in the [README](https://github.com/michael-camilleri/IAML2018) and you *must activate this environment before running this notebook*:
```
source activate py3iaml
cd [DIRECTORY CONTAINING GIT REPOSITORY]
jupyter notebook
# Navigate to this file
```

1. Read the instructions carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of the answer we are expecting (code/text), and also provide the code/markdown cell in which to put it.

1. This part of the Assignment is the same for all students i.e. irrespective of whether you are taking the Level 10 version (INFR10069) or the Level-11 version of the course (INFR11182 and INFR11152).

1. The .csv files that you will be using are located at `./datasets` (i.e. use the `datasets` directory **adjacent** to this file).

1. In the textual answer, you are given a word-count limit of 600 words: exceeding this will lead to penalisation.

1. Make sure to distinguish between **attributes** (columns of the data) and **features** (which typically refers only to the independent variables, i.e. excluding the target variables).

1. Make sure to show **all** your code/working. 

1. Write readable code. While we do not expect you to follow [PEP8](https://www.python.org/dev/peps/pep-0008/) to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. **Do** use inline comments when doing something non-standard. When asked to present numerical values, make sure to represent real numbers in the appropriate precision to exemplify your answer. Marks *WILL* be deducted if the marker cannot understand your logic/results.

1. **Collaboration:** You may discuss the assignment with your colleagues, provided that the writing that you submit is entirely your own. That is, you must NOT borrow actual text or code from others. We ask that you provide a list of the people who you've had discussions with (if any). Please refer to the [Academic Misconduct](http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct) page for what constitutes a breach of the above.


### SUBMISSION Mechanics

**IMPORTANT:** You must submit this assignment by **Thursday 22/11/2018 at 16:00**. 

**Late submissions:** The policy stated in the School of Informatics is that normally you will not be allowed to submit coursework late. See the [ITO webpage](http://web.inf.ed.ac.uk/infweb/student-services/ito/admin/coursework-projects/late-coursework-extension-requests) for exceptions to this, e.g. in case of serious medical illness or serious personal problems.

**Resubmission:** If you submit your file(s) again, the previous submission is **overwritten**. We will mark the version that is in the submission folder at the deadline.

**N.B.**: This Assignment requires submitting **two files (electronically as described below)**:
 1. This Jupyter Notebook (Part B), *and*
 1. The Jupyter Notebook for Part A
 
All submissions happen electronically. To submit:

1. Fill out this notebook (as well as Part A), making sure to:
   1. save it with **all code/text and visualisations**: markers are NOT expected to run any cells,
   1. keep the name of the file **UNCHANGED**, *and*
   1. **keep the same structure**: retain the questions, **DO NOT** delete any cells and **avoid** adding unnecessary cells unless absolutely necessary, as this makes the job harder for the markers. This is especially important for the textual description and probability output (below).

1. Submit it using the `submit` functionality. To do this, you must be on a DICE environment. Open a Terminal, and:
   1. **On-Campus Students**: navigate to the location of this notebook and execute the following command:
   
      ```submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb```
      
   1. **Distance Learners:** These instructions also apply to those students who work on their own computer. First you need to copy your work onto DICE (so that you can use the `submit` command). For this, you can use `scp` or `rsync` (you may need to install these yourself). You can copy files to `student.ssh.inf.ed.ac.uk`, then ssh into it in order to submit. The following is an example. Replace entries in `[square brackets]` with your specific details: i.e. if your student number is for example s1234567, then `[YOUR USERNAME]` becomes `s1234567`.
   
    ```
    scp -r [FULL PATH TO 03_A_ObjectRecognition.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_A_ObjectRecognition.ipynb
    scp -r [FULL PATH TO 03_B_MiniChallenge.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_B_MiniChallenge.ipynb
    ssh [YOUR USERNAME]@student.ssh.inf.ed.ac.uk
    ssh student.login
    submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb
    ```
    
   What actually happens in the background is that your file is placed in a folder available to markers. If you submit a file with the same name into the same location, **it will *overwrite* your previous submission**. You should receive an automatic email confirmation after submission.
  


### Marking Breakdown

The Level 10 and Level 11 points are marked out of different totals, however these are all normalised to 100%. Note that Part A (Object Recognition) is worth 75% of the total Mark for Assignment 3, while Part B (this notebook) is worth 25%. Keep this in mind when allocating time for this assignment.

**70-100%** results/answer correct plus extra achievement at understanding or analysis of results. Clear explanations, evidence of creative or deeper thought will contribute to a higher grade.

**60-69%** results/answer correct or nearly correct and well explained.

**50-59%** results/answer in right direction but significant errors.

**40-49%** some evidence that the student has gained some understanding, but not answered the questions properly.

**0-39%** serious error or slack work.

Note that while this is not a programming assignment, in questions which involve visualisation of results and/or long code snippets, some marks may be deducted if the code is not adequately readable.

## Imports

Use the cell below to include any imports you deem necessary.

In [116]:
# Nice Formatting within Jupyter Notebook
%matplotlib inline
from IPython.display import display # Allows multiple displays from a single code-cell

# System functionality
import sys
sys.path.append('..')

# Import Here any Additional modules you use. To import utilities we provide, use something like:
#   from utils.plotter import plot_hinton

# Your Code goes here:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, metrics and model selection
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Mini challenge

In this second part of the assignment we will have a mini object-recognition challenge. Using the same type of data as in Part A, you are asked to find the best classifier for the person/no person classification task. You can apply any preprocessing steps to the data that you think fit and employ any classifier you like (with the provision that you can explain what the classifier is/preprocessing steps are doing). You can also employ any lessons learnt during the course, either from previous Assignments, the Labs or the lecture material to try and squeeze out as much performance as you possibly can. The only restriction is that all steps must be performed in `Python` by using the `numpy`, `pandas` and `sklearn` packages. You can also make use of `matplotlib` and `seaborn` for visualisation.

### DataSet Description

The datasets we use here are similar in composition but not the same as the ones used in Part A: *it will be useful to revise the description in that notebook*. Specifically, you have access to three new datasets: a training set (`Images_C_Train.csv`), a validation set (`Images_C_Validate.csv`), and a test set (`Images_C_Test.csv`). You must use the former two for training and evaluating your models (as you see fit). As before, the full data-set has 520 attributes (dimensions). Of these you only have access to the 500 features (`dim1` through `dim500`) to test your model on: i.e. the test set does not have any of the class labels.

### Model Evaluation

Your results will be evaluated in terms of the logarithmic loss metric, specifically the [logloss](http://scikit-learn.org/0.19/modules/model_evaluation.html#log-loss) function from SKLearn. You should familiarise yourself with this. To estimate this metric you will need to provide probability outputs, as opposed to discrete predictions which we have used so far to compute classification accuracies. Most models in `sklearn` implement a `predict_proba()` method which returns the probabilities for each class. For instance, if your test set consists of `N` datapoints and there are `K` class-labels, the method will return an `N` x `K` matrix (with rows summing to 1).
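
As a quick illustration of the metric, the sketch below computes the log loss by hand from an `N` x `K` probability matrix and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made-up toy values, not taken from the assignment data.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example (hypothetical values): N = 4 datapoints, K = 2 classes.
y_true = np.array([0, 1, 1, 0])
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.7, 0.3]])

# Log loss is the mean negative log-probability assigned to the true class.
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
library = log_loss(y_true, proba)

print('manual: {:.4f}, sklearn: {:.4f}'.format(manual, library))
```

A confident wrong prediction (a small probability on the true class) dominates the average, which is why well-calibrated probabilities matter more here than raw accuracy.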

### Submission and Scoring

This part of Assignment 3 carries 25% of the total marks. Within this, you will be scored on two criteria:
 1. 80% of the mark will depend on the thoroughness of the exploration of various approaches. This will be assessed through your code, as well as a brief description (<600 words) justifying the approaches you considered, your exploration pattern and your suggested final approach (and why you chose it).
 1. 20% of the mark will depend on the quality of your predictions: this will be evaluated based on the logarithmic loss metric.
Note here that just getting exceptional performance is not enough: in fact, you should focus more on analysing your results than just getting the best score!

You have to submit the following:
 1. **All Code-Cells** which show your **working** with necessary output/plots already generated.
 1. In **TEXT** cell `#ANSWER_TEXT#` you are to write your explanation (<600 words) as described above. Keep this brief and to the point. **Make sure** to keep the token `#ANSWER_TEXT#` as the first line of the cell!
 1. In **CODE** cell `#ANSWER_PROB#` you are to submit your predictions. To do this:
    1. Once you have chosen your favourite model (and pre-processing steps) apply it to the test-set and estimate the posterior probabilities for the data points in the test set.
    1. Store these probabilities in a 2D numpy array named `pred_probabilities`, with predictions along the rows i.e. each row should be a complete probability distribution over whether the image contains a person or not. Note that due to the encoding of the `is_person` class, the negative case (i.e. there is no person) comes first.
    1. Execute the `#ANSWER_PROB#` code cell, making sure to not change anything. This cell will do some checks to ensure that you are submitting the right shape of array.

You may create as many code cells as you need (within reason) for training your models, evaluating the data etc: however, the text cell `#ANSWER_TEXT#` and code-cell `#ANSWER_PROB#` showing your answers must be the last two cells in the notebook.
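
Before running the `#ANSWER_PROB#` cell it can help to check the array yourself. The sketch below is only an illustration: `check_probabilities` is a hypothetical helper (not part of the notebook or of `sklearn`) and the array is a random stand-in. It compares row sums with a tolerance, since floating-point sums are rarely exactly 1.

```python
import numpy as np

def check_probabilities(probs, n_rows, n_classes=2):
    """Return True if probs looks like a valid (n_rows, n_classes) probability array."""
    probs = np.asarray(probs)
    if probs.shape != (n_rows, n_classes):
        return False
    if (probs < 0).any() or (probs > 1).any():
        return False
    # Each row should be a complete distribution, up to rounding error.
    return bool(np.allclose(probs.sum(axis=1), 1.0))

# Random stand-in: column 0 = no person, column 1 = person.
rng = np.random.RandomState(0)
p_person = rng.uniform(size=5)
demo = np.column_stack([1 - p_person, p_person])
print(check_probabilities(demo, n_rows=5))
```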

In [155]:
# This is where your working code should start. Feel free to add as many code-cells as necessary.
#  Make sure however that all working code cells come BEFORE the #ANSWER_TEXT# and #ANSWER_PROB#
#  cells below.

# Your Code goes here:

# LOADING AND SPLITTING DATA

# Load the dataset
data_path = os.path.join(os.getcwd(), 'datasets', 'Images_C_Train.csv')
train = pd.read_csv(data_path, delimiter = ',')
data_path = os.path.join(os.getcwd(), 'datasets', 'Images_C_Validate.csv')
val = pd.read_csv(data_path, delimiter = ',')
data_path = os.path.join(os.getcwd(), 'datasets', 'Images_C_Test.csv')
test = pd.read_csv(data_path, delimiter = ',')
# Drop undesired labels, keeping only the Visual Features and the 'is_person' column
undesired_labels = ['imgId','is_aeroplane','is_bicycle','is_bird','is_boat','is_bottle','is_bus',
                    'is_car','is_cat','is_chair','is_cow','is_diningtable','is_dog','is_horse',
                    'is_motorbike','is_pottedplant','is_sheep','is_sofa','is_tvmonitor']

# Create train splits
train.drop(columns = undesired_labels, inplace=True, axis=1)
X_train = train.drop(columns=["is_person"], axis = 1)
y_train = train["is_person"]

# Create Validation splits
val.drop(columns = undesired_labels, inplace=True, axis=1)
X_val = val.drop(columns=["is_person"], axis = 1)
y_val = val["is_person"]

# Join Train and Validation datasets so we can use GridSearchCV
train_val = pd.concat([train,val])
X_train_val = train_val.drop(columns=["is_person"], axis = 1)
y_train_val = train_val["is_person"]
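# PredefinedSplit convention: -1 = row is always used for training, 0 = row belongs to validation fold 0.
# Sizes are taken from the data rather than hard-coded.
train_idx = np.full(len(train), -1)
val_idx = np.full(len(val), 0)
train_val_fold = np.concatenate([train_idx, val_idx])
ps = PredefinedSplit(test_fold=train_val_fold)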
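
The fold labels used above can be checked on a toy example. This is a minimal sketch with made-up sizes (6 training rows, 3 validation rows), showing that `PredefinedSplit` yields exactly one split in which the rows marked `-1` are used for training and the rows marked `0` for validation.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 6, 3  # toy sizes, not the assignment's
# -1 marks rows that are never used for validation; 0 marks the single validation fold.
test_fold = np.concatenate([np.full(n_train, -1), np.full(n_val, 0)])
ps_demo = PredefinedSplit(test_fold=test_fold)

splits = list(ps_demo.split())
train_rows, val_rows = splits[0]
print(ps_demo.get_n_splits(), train_rows, val_rows)
```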

In [156]:
# PREPROCESSING (the scaler is fitted on the training set only, then applied to every split)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_prep = scaler.transform(X_train)
X_val_prep = scaler.transform(X_val)
X_train_val_prep = scaler.transform(X_train_val)
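
To see what this scaling does, the sketch below repeats the same fit-on-train, transform-everything pattern on synthetic data (random numbers, not the image features). The training split ends up with zero mean and unit variance per column; the validation split is only approximately standardised because it reuses the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic "training" data
X_va = rng.normal(loc=5.0, scale=2.0, size=(50, 3))   # synthetic "validation" data

demo_scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
X_tr_s = demo_scaler.transform(X_tr)
X_va_s = demo_scaler.transform(X_va)

print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
print(X_va_s.mean(axis=0).round(3), X_va_s.std(axis=0).round(3))
```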

In [157]:
# K NEAREST NEIGHBORS
clf = KNeighborsClassifier()
# Select possible hyper-parameters
parameters = {'n_neighbors':[1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,40,50,100]}
# Set and perform GridSearch 
optimal = GridSearchCV(estimator=clf, param_grid=parameters, cv=ps, scoring='neg_log_loss', n_jobs =-1)
optimal.fit(X_train_val_prep,y_train_val)
# Given the best parameters train the classifier (parameters are passed by name, not by position)
best_clf = KNeighborsClassifier(**optimal.best_params_)
best_clf.fit(X_train_prep,y_train)
# Calculate the log_loss and accuracy
y_pred_prob = best_clf.predict_proba(X_val_prep)
y_pred = best_clf.predict(X_val_prep)
# Report the classifier performance
print('K Nearest Neighbors with {} gave the lowest log-loss of {:.3f} and an accuracy of {:.3f}'
      .format(optimal.best_params_,log_loss(y_val, y_pred_prob),accuracy_score(y_val,y_pred)))

K Nearest Neighbors with {'n_neighbors': 25} gave the lowest log-loss of 0.619 and an accuracy of 0.638


In [162]:
# RANDOM FOREST
clf = RandomForestClassifier()
# Select possible hyper-parameters
parameters = {'n_estimators':[150,200,250,300,350], 
              'max_depth':[40,50,60], 
              'min_samples_split':[2,3,4,5]}
# Set and perform GridSearch 
optimal = GridSearchCV(estimator=clf, param_grid=parameters, cv=ps, scoring='neg_log_loss', n_jobs =-1)
optimal.fit(X_train_val_prep,y_train_val)
# Given the best parameters train the classifier (passed by name, so dictionary ordering does not matter)
best_clf = RandomForestClassifier(**optimal.best_params_)
best_clf.fit(X_train_prep,y_train)
# Calculate the log_loss and accuracy
y_pred_prob = best_clf.predict_proba(X_val_prep)
y_pred = best_clf.predict(X_val_prep)
# Report the classifier performance
print('Random Forest Classifier with {} gave the lowest log-loss of {:.3f} and an accuracy of {:.3f}'
      .format(optimal.best_params_,log_loss(y_val, y_pred_prob),accuracy_score(y_val,y_pred)))

Random Forest Classifier with {'max_depth': 50, 'min_samples_split': 3, 'n_estimators': 150} gave the lowest log-loss of 0.587 and an accuracy of 0.701


In [163]:
# LOGISTIC REGRESSION
clf = LogisticRegression(solver='lbfgs')
# Select possible hyper-parameters
parameters = {'C':[0.0001,0.0005,0.001,0.005,0.01,0.1,1,10,25,50,75,100,150]}
# Set and perform GridSearch 
optimal = GridSearchCV(estimator=clf, param_grid=parameters, cv=ps, scoring='neg_log_loss', n_jobs =-1)
optimal.fit(X_train_val_prep,y_train_val)
# Given the best parameters train the classifier
best_clf = LogisticRegression(solver='lbfgs', **optimal.best_params_)
best_clf.fit(X_train_prep,y_train)
# Calculate the log_loss and accuracy
y_pred_prob = best_clf.predict_proba(X_val_prep)
y_pred = best_clf.predict(X_val_prep)
# Report the classifier performance
print('Logistic Regression with {} gave the lowest log-loss of {:.3f} and an accuracy of {:.3f}'
      .format(optimal.best_params_,log_loss(y_val, y_pred_prob),accuracy_score(y_val,y_pred)))

Logistic Regression with {'C': 75} gave the lowest log-loss of 0.590 and an accuracy of 0.677


In [160]:
# SUPPORT VECTOR MACHINE
clf = SVC(probability=True)
# Select possible hyper-parameters
parameters = {'C':[0.001,0.01,0.1,10,50,75],
              'kernel':['linear','rbf']}
# Set and perform GridSearch 
optimal = GridSearchCV(estimator=clf, param_grid=parameters, cv=ps, scoring='neg_log_loss', n_jobs =-1)
optimal.fit(X_train_val_prep,y_train_val)
# Given the best parameters train the classifier (uses the selected kernel instead of hard-coding 'rbf')
best_clf = SVC(probability=True, **optimal.best_params_)
best_clf.fit(X_train_prep,y_train)
# Calculate the log_loss and accuracy
y_pred_prob = best_clf.predict_proba(X_val_prep)
y_pred = best_clf.predict(X_val_prep)
# Report the classifier performance
print('Support Vector Machine with {} gave the lowest log-loss of {:.3f} and an accuracy of {:.3f}'
      .format(optimal.best_params_,log_loss(y_val, y_pred_prob),accuracy_score(y_val,y_pred)))

Support Vector Machine with {'C': 75, 'kernel': 'rbf'} gave the lowest log-loss of 0.627 and an accuracy of 0.527


#ANSWER_TEXT#

***Your answer goes here:***

As my first step, I separate the Training/Validation/Testing datasets into feature sets (X) and target sets (y, except for the test set). I then apply the same preprocessing technique we used in Part A, fitting the scaler on the training set, so that the data has **zero mean and unit variance**.

Since I want to test several classifiers and see which performs best and why, I use a "template" block of code that works with any classifier implementing the `predict_proba` method. To tune hyper-parameters I use **GridSearchCV** to speed up the search. I begin by creating the classifier object and selecting the hyper-parameters I want to optimise. I then create the GridSearchCV object, which searches for the best set of parameters using `neg_log_loss` as the scoring function (the `log_loss` scorer name is being deprecated). To make GridSearchCV's cross-validation work with the predefined splits we have been given (Train/Val/Test), I join the Training and Validation sets into one set (`train_val`), mark which samples are for training and which for validation (`train_val_fold`), and pass this to GridSearchCV through a `PredefinedSplit` object (this is done in the "Loading and Splitting Data" section). Next, I fit the GridSearchCV object on the joined dataset and wait for the best parameters to be found. I then fit the classifier, now with the best parameters, on the training set and predict classes and class probabilities for the validation set. Finally, I report the best parameter settings together with the log-loss and accuracy on the validation set.
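
The template described above could be wrapped in a single function, roughly as sketched here. This is an illustration rather than the code actually run in this notebook: `tune_and_evaluate` is a hypothetical helper name, and the data comes from `make_classification` instead of the image features.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV, PredefinedSplit

def tune_and_evaluate(estimator, param_grid, X_tr, y_tr, X_va, y_va):
    """Grid-search on a fixed train/validation split, then score on validation."""
    X_all = np.vstack([X_tr, X_va])
    y_all = np.concatenate([y_tr, y_va])
    # -1 = training row, 0 = validation fold.
    fold = np.concatenate([np.full(len(X_tr), -1), np.full(len(X_va), 0)])
    search = GridSearchCV(estimator, param_grid, cv=PredefinedSplit(fold),
                          scoring='neg_log_loss')
    search.fit(X_all, y_all)
    # Refit a fresh copy on the training rows only, passing parameters by name.
    best = clone(estimator).set_params(**search.best_params_).fit(X_tr, y_tr)
    return (search.best_params_,
            log_loss(y_va, best.predict_proba(X_va)),
            accuracy_score(y_va, best.predict(X_va)))

# Synthetic stand-in data (not the assignment's images).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
params, ll, acc = tune_and_evaluate(
    LogisticRegression(solver='lbfgs', max_iter=1000), {'C': [0.01, 1, 100]},
    X[:200], y[:200], X[200:], y[200:])
print(params, round(ll, 3), round(acc, 3))
```

Passing `best_params_` by keyword avoids relying on the ordering of the parameter dictionary.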

I start with a **KNN** classifier, as it is one of the simpler models. It has only one parameter that I optimise: the number of nearest neighbours. The best result was obtained with K = 25, which I think is fairly high but understandable given that my scoring function is log-loss (a larger K gives smoother probability estimates, which log-loss rewards, possibly at the cost of some accuracy). Next I consider a **Random Forest** classifier, where I optimise three parameters: the number of estimators (trees used), the maximum depth (of a tree) and the minimum number of samples needed to split (a node). After lengthy running and rerunning I arrived at 150 estimators, a maximum depth of 50 and a minimum of 3 samples to split on. This gives an accuracy of 70% and a log-loss of around 0.587. I tried a **Logistic Regression** classifier again, after using it in the earlier part of the assignment, but this time I ended up with different results. Optimising C gave a value of 75, which is at the other end of the spectrum compared to the 0.001 we picked last time; however, we are now optimising a different score function, so different results are expected. Finally I looked at **Support Vector Machines**, setting their probability flag to True so that we get probabilities with our classifications. Unsurprisingly, even after optimising the C value and the kernel, this gave the worst performance of all the classifiers, which can be attributed to SVMs not being naturally probabilistic classifiers.
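
The effect of K on the probability estimates can be seen in a small sketch on synthetic data (generated with `make_classification`, not the image features). With uniform weights, `predict_proba` returns the fraction of neighbours in each class, so K = 1 can only output 0 or 1, while K = 25 outputs multiples of 1/25.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data used purely for illustration.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, y_tr, X_va = X[:300], y[:300], X[300:]

proba_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).predict_proba(X_va)
proba_k25 = KNeighborsClassifier(n_neighbors=25).fit(X_tr, y_tr).predict_proba(X_va)

# Distinct probability values assigned to class 1 for each K.
print(np.unique(proba_k1[:, 1]))
print(np.unique(proba_k25[:, 1]).round(2))
```

Because log-loss penalises confident mistakes very heavily, the hard 0/1 outputs of a small K tend to score poorly on this metric even when the accuracy is reasonable.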

Ultimately I decided to go with the ***Random Forest*** classifier. Not only did it perform best on both metrics we measured (log-loss 0.587, accuracy 0.701), but it also has the most parameters that can be optimised and offers the greatest versatility of the classifiers considered (random forests also give state-of-the-art performance in many domains). The preprocessing described above is applied as well.

In [165]:
#ANSWER_PROB#
# Run this cell when you are ready to submit your test-set probabilities. This cell will generate some
# warning messages if something is not right: make sure to address them!

# Preprocess data
test_prep = scaler.transform(test.drop(columns=['is_person']))
# Train chosen classifier
best_clf = RandomForestClassifier(n_estimators=150, max_depth=50, min_samples_split=3)
best_clf.fit(X_train_prep,y_train)
pred_probabilities = best_clf.predict_proba(test_prep)

if pred_probabilities.shape != (1114, 2):
    print('Array is of incorrect shape. Rectify this before submitting.')
elif (pred_probabilities.sum(axis=1) != 1.0).all():
    print('Submitted values are not correct probabilities. Rectify this before submitting.')
else:
    for _prob in pred_probabilities:
        print('{:.8f}, {:.8f}'.format(_prob[0], _prob[1]))

0.77333333, 0.22666667
0.75666667, 0.24333333
0.51333333, 0.48666667
0.42888889, 0.57111111
0.47000000, 0.53000000
0.31555556, 0.68444444
0.39000000, 0.61000000
0.24666667, 0.75333333
0.87833333, 0.12166667
0.68666667, 0.31333333
0.63111111, 0.36888889
0.70444444, 0.29555556
0.60444444, 0.39555556
0.61000000, 0.39000000
0.14266667, 0.85733333
0.64222222, 0.35777778
0.24666667, 0.75333333
0.50888889, 0.49111111
0.70666667, 0.29333333
0.42000000, 0.58000000
0.69777778, 0.30222222
0.53888889, 0.46111111
0.40500000, 0.59500000
0.52000000, 0.48000000
0.75111111, 0.24888889
0.27555556, 0.72444444
0.36066667, 0.63933333
0.70000000, 0.30000000
0.40000000, 0.60000000
0.50000000, 0.50000000
0.81666667, 0.18333333
0.49444444, 0.50555556
0.78444444, 0.21555556
0.59888889, 0.40111111
0.57388889, 0.42611111
0.36555556, 0.63444444
0.21666667, 0.78333333
0.54444444, 0.45555556
0.82333333, 0.17666667
0.67111111, 0.32888889
0.77111111, 0.22888889
0.51333333, 0.48666667
0.54000000, 0.46000000
0.79333333,