# Introductory applied machine learning (INFR10069)
# Assignment 3 (Part B): Mini-Challenge [25%]

## Important Instructions

**It is important that you follow the instructions below to the letter - we will not be responsible for incorrect marking due to non-standard practices.**

1. <font color='red'>We have split Assignment 3 into two parts to make it easier for you to work on them separately and for the markers to give you feedback. This is part B of Assignment 3 - Part A is an introduction to Object Recognition. Both Assignments together are still worth 50% of CourseWork 2. **Remember to submit both notebooks (you can submit them separately).**</font>

1. You *MUST* have your environment set up as in the [README](https://github.com/michael-camilleri/IAML2018) and you *must activate this environment before running this notebook*:
```
source activate py3iaml
cd [DIRECTORY CONTAINING GIT REPOSITORY]
jupyter notebook
# Navigate to this file
```

1. Read the instructions carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of the answer we are expecting (code/text), and also provide the code/markdown cell in which to put it.

1. This part of the Assignment is the same for all students i.e. irrespective of whether you are taking the Level 10 version (INFR10069) or the Level-11 version of the course (INFR11182 and INFR11152).

1. The .csv files that you will be using are located at `./datasets` (i.e. use the `datasets` directory **adjacent** to this file).

1. In the textual answer, you are given a word-count limit of 600 words: exceeding this will lead to penalisation.

1. Make sure to distinguish between **attributes** (columns of the data) and **features** (which typically refers only to the independent variables, i.e. excluding the target variables).

1. Make sure to show **all** your code/working. 

1. Write readable code. While we do not expect you to follow [PEP8](https://www.python.org/dev/peps/pep-0008/) to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. **Do** use inline comments when doing something non-standard. When asked to present numerical values, make sure to represent real numbers in the appropriate precision to exemplify your answer. Marks *WILL* be deducted if the marker cannot understand your logic/results.

1. **Collaboration:** You may discuss the assignment with your colleagues, provided that the writing that you submit is entirely your own. That is, you must NOT borrow actual text or code from others. We ask that you provide a list of the people who you've had discussions with (if any). Please refer to the [Academic Misconduct](http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct) page for what constitutes a breach of the above.


### SUBMISSION Mechanics

**IMPORTANT:** You must submit this assignment by **Thursday 15/11/2018 at 16:00**. 

**Late submissions:** The policy stated in the School of Informatics is that normally you will not be allowed to submit coursework late. See the [ITO webpage](http://web.inf.ed.ac.uk/infweb/student-services/ito/admin/coursework-projects/late-coursework-extension-requests) for exceptions to this, e.g. in case of serious medical illness or serious personal problems.

**Resubmission:** If you submit your file(s) again, the previous submission is **overwritten**. We will mark the version that is in the submission folder at the deadline.

**N.B.**: This Assignment requires submitting **two files (electronically as described below)**:
 1. This Jupyter Notebook (Part B), *and*
 1. The Jupyter Notebook for Part A
 
All submissions happen electronically. To submit:

1. Fill out this notebook (as well as Part A), making sure to:
   1. save it with **all code/text and visualisations**: markers are NOT expected to run any cells,
   1. keep the name of the file **UNCHANGED**, *and*
   1. **keep the same structure**: retain the questions, **DO NOT** delete any cells and **avoid** adding unnecessary cells unless absolutely necessary, as this makes the job harder for the markers. This is especially important for the textual description and probability output (below).

1. Submit it using the `submit` functionality. To do this, you must be on a DICE environment. Open a Terminal, and:
   1. **On-Campus Students**: navigate to the location of this notebook and execute the following command:
   
      ```submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb```
      
   1. **Distance Learners:** These instructions also apply to those students who work on their own computer. First you need to copy your work onto DICE (so that you can use the `submit` command). For this, you can use `scp` or `rsync` (you may need to install these yourself). You can copy files to `student.ssh.inf.ed.ac.uk`, then ssh into it in order to submit. The following is an example. Replace entries in `[square brackets]` with your specific details: i.e. if your student number is for example s1234567, then `[YOUR USERNAME]` becomes `s1234567`.
   
    ```
    scp -r [FULL PATH TO 03_A_ObjectRecognition.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_A_ObjectRecognition.ipynb
    scp -r [FULL PATH TO 03_B_MiniChallenge.ipynb] [YOUR USERNAME]@student.ssh.inf.ed.ac.uk:03_B_MiniChallenge.ipynb
    ssh [YOUR USERNAME]@student.ssh.inf.ed.ac.uk
    ssh student.login
    submit iaml cw2 03_A_ObjectRecognition.ipynb 03_B_MiniChallenge.ipynb
    ```
    
   What actually happens in the background is that your file is placed in a folder available to markers. If you submit a file with the same name into the same location, **it will *overwrite* your previous submission**. You should receive an automatic email confirmation after submission.
  


### Marking Breakdown

The Level 10 and Level 11 points are marked out of different totals; however, these are all normalised to 100%. Note that Part A (Object Recognition) is worth 75% of the total Mark for Assignment 3, while Part B (this notebook) is worth 25%. Keep this in mind when allocating time for this assignment.

**70-100%** results/answer correct plus extra achievement at understanding or analysis of results. Clear explanations, evidence of creative or deeper thought will contribute to a higher grade.

**60-69%** results/answer correct or nearly correct and well explained.

**50-59%** results/answer in right direction but significant errors.

**40-49%** some evidence that the student has gained some understanding, but not answered the questions properly.

**0-39%** serious error or slack work.

Note that while this is not a programming assignment, in questions which involve visualisation of results and/or long code snippets, some marks may be deducted if the code is not adequately readable.

## Imports

Use the cell below to include any imports you deem necessary.

In [2]:
# Nice Formatting within Jupyter Notebook
%matplotlib inline
from IPython.display import display # Allows multiple displays from a single code-cell

# System functionality
import sys
sys.path.append('..')

# Import Here any Additional modules you use. To import utilities we provide, use something like:
#   from utils.plotter import plot_hinton

# Your Code goes here:
import os
import pandas as pd
import sklearn
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.svm import LinearSVC, SVC
from sklearn.cross_validation import KFold,cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

# Mini challenge

In this second part of the assignment we will have a mini object-recognition challenge. Using the same type of data as in Part A, you are asked to find the best classifier for the person/no person classification task. You can apply any preprocessing steps to the data that you think fit and employ any classifier you like (with the provision that you can explain what the classifier is/preprocessing steps are doing). You can also employ any lessons learnt during the course, either from previous Assignments, the Labs or the lecture material to try and squeeze out as much performance as you possibly can. The only restriction is that all steps must be performed in `Python` by using the `numpy`, `pandas` and `sklearn` packages. You can also make use of `matplotlib` and `seaborn` for visualisation.

### DataSet Description

The datasets we use here are similar in composition but not the same as the ones used in Part A: *it will be useful to revise the description in that notebook*. Specifically, you have access to three new datasets: a training set (`Images_C_Train.csv`), a validation set (`Images_C_Validate.csv`), and a test set (`Images_C_Test.csv`). You must use the former two for training and evaluating your models (as you see fit). As before, the full data-set has 520 attributes (dimensions). Of these you only have access to the 500 features (`dim1` through `dim500`) to test your model on: i.e. the test set does not have any of the class labels.

### Model Evaluation

Your results will be evaluated in terms of the logarithmic loss metric, specifically the [logloss](http://scikit-learn.org/0.19/modules/model_evaluation.html#log-loss) function from SKLearn. You should familiarise yourself with this. To estimate this metric you will need to provide probability outputs, as opposed to discrete predictions which we have used so far to compute classification accuracies. Most models in `sklearn` implement a `predict_proba()` method which returns the probabilities for each class. For instance, if your test set consists of `N` datapoints and there are `K` class-labels, the method will return an `N` x `K` matrix (with rows summing to 1).
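
As a minimal sketch of how the metric is computed, the following fits a logistic regression on synthetic data, obtains the `N` x `K` probability matrix from `predict_proba()`, and passes it to `log_loss`. The synthetic data from `make_classification` and the variable names are assumptions for illustration only; they are not the assignment data. The manual computation assumes the labels are `0` and `1`, so that a label can be used directly as a column index.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic binary problem (illustrative only; not the assignment data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LogisticRegression(solver="lbfgs").fit(X, y)
proba = clf.predict_proba(X)  # one row per data point, one column per class

# Manual log loss: mean negative log-probability assigned to the true class.
eps = 1e-15
p_true = np.clip(proba[np.arange(len(y)), y], eps, 1 - eps)
manual = -np.mean(np.log(p_true))

print(proba.shape, np.allclose(proba.sum(axis=1), 1.0))
print(log_loss(y, proba), manual)
```

Because the metric averages the negative log-probability of the true class, a confident but wrong prediction is penalised far more heavily than an uncertain one, which is why accuracy and log loss can rank models differently.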

### Submission and Scoring

This part of Assignment 3 carries 25% of the total marks. Within this, you will be scored on two criteria:
 1. 80% of the mark will depend on the thoroughness of the exploration of various approaches. This will be assessed through your code, as well as a brief description (<600 words) justifying the approaches you considered, your exploration pattern and your suggested final approach (and why you chose it).
 1. 20% of the mark will depend on the quality of your predictions: this will be evaluated based on the logarithmic loss metric.
Note here that just getting exceptional performance is not enough: in fact, you should focus more on analysing your results than just getting the best score!

You have to submit the following:
 1. **All Code-Cells** which show your **working** with necessary output/plots already generated.
 1. In **TEXT** cell `#ANSWER_TEXT#` you are to write your explanation (<600 words) as described above. Keep this brief and to the point. **Make sure** to keep the token `#ANSWER_TEXT#` as the first line of the cell!
 1. In **CODE** cell `#ANSWER_PROB#` you are to submit your predictions. To do this:
    1. Once you have chosen your favourite model (and pre-processing steps) apply it to the test-set and estimate the posterior probabilities for the data points in the test set.
    1. Store these probabilities in a 2D numpy array named `pred_probabilities`, with predictions along the rows i.e. each row should be a complete probability distribution over whether the image contains a person or not. Note that due to the encoding of the `is_person` class, the negative case (i.e. there is no person) comes first.
    1. Execute the `#ANSWER_PROB#` code cell, making sure to not change anything. This cell will do some checks to ensure that you are submitting the right shape of array.
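
Before running the `#ANSWER_PROB#` cell, it can help to sanity-check the array yourself. The sketch below is one possible way to do so; `check_probabilities` is a hypothetical helper written for illustration (it is not part of the provided notebook and does not replace the `#ANSWER_PROB#` cell), and the example array is made up.

```python
import numpy as np

def check_probabilities(pred, n_rows, n_classes=2, tol=1e-8):
    """Return a list of problems found in a probability matrix (empty if none)."""
    problems = []
    if pred.shape != (n_rows, n_classes):
        problems.append("wrong shape: {}".format(pred.shape))
        return problems
    if np.any(pred < 0) or np.any(pred > 1):
        problems.append("values outside [0, 1]")
    # Compare with a tolerance: exact equality with 1.0 is fragile for floats.
    if not np.allclose(pred.sum(axis=1), 1.0, atol=tol):
        problems.append("rows do not sum to 1")
    return problems

# Hypothetical example: 4 data points, column 0 = no person, column 1 = person.
example = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]])
print(check_probabilities(example, n_rows=4))
```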

You may create as many code cells as you need (within reason) for training your models, evaluating the data etc: however, the text cell `#ANSWER_TEXT#` and code-cell `#ANSWER_PROB#` showing your answers must be the last two cells in the notebook.

In [3]:
# This is where your working code should start. Feel free to add as many code-cells as necessary.
#  Make sure however that all working code cells come BEFORE the #ANSWER_TEXT# and #ANSWER_PROB#
#  cells below.

# Your Code goes here:
# Loading the data
train_C_data = os.path.join(os.getcwd(), 'datasets','Images_C_Train.csv')
train_C = pd.read_csv(train_C_data, delimiter = ',')
valid_C_data = os.path.join(os.getcwd(), 'datasets','Images_C_Validate.csv')
valid_C = pd.read_csv(valid_C_data, delimiter = ',')
test_C_data = os.path.join(os.getcwd(), 'datasets','Images_C_Test.csv')
test_C = pd.read_csv(test_C_data, delimiter = ',')

# Feature column names: dim1 through dim500
attributes_name_C = ["dim" + str(i) for i in range(1, 501)]

# Storing the data
X_tr_C = train_C[attributes_name_C]
y_tr_C = train_C["is_person"]
X_val_C = valid_C[attributes_name_C]
y_val_C = valid_C["is_person"]
X_test_C = test_C[attributes_name_C]

# Standardize the data
ss_C = StandardScaler()
ss_C.fit(X_tr_C)
X_tr_C_ss = ss_C.transform(X_tr_C)
X_val_C_ss = ss_C.transform(X_val_C)

In [4]:
## Feature Selection
rfc_ss_C = RandomForestClassifier(n_estimators=500,random_state=31,criterion='entropy').fit(X_tr_C_ss,y_tr_C)
feature_importance_C = rfc_ss_C.feature_importances_
names = X_tr_C.columns.values

keep_feature_50 = []
keep_feature_100 = []
keep_feature_200 = []
keep_feature_300 = []
keep_feature_400 = []

for i in range(0,50):
    max_feature_C = np.argmax(feature_importance_C)
    feature_importance_C[max_feature_C] = 0
    keep_feature_50.append(names[max_feature_C])
    keep_feature_100.append(names[max_feature_C])
    keep_feature_200.append(names[max_feature_C])
    keep_feature_300.append(names[max_feature_C])
    keep_feature_400.append(names[max_feature_C])
for i in range(50,100):
    max_feature_C = np.argmax(feature_importance_C)
    feature_importance_C[max_feature_C] = 0
    keep_feature_100.append(names[max_feature_C])
    keep_feature_200.append(names[max_feature_C])
    keep_feature_300.append(names[max_feature_C])
    keep_feature_400.append(names[max_feature_C])
for i in range(100,200):
    max_feature_C = np.argmax(feature_importance_C)
    feature_importance_C[max_feature_C] = 0
    keep_feature_200.append(names[max_feature_C])
    keep_feature_300.append(names[max_feature_C])
    keep_feature_400.append(names[max_feature_C])
for i in range(200,300):
    max_feature_C = np.argmax(feature_importance_C)
    feature_importance_C[max_feature_C] = 0
    keep_feature_300.append(names[max_feature_C])
    keep_feature_400.append(names[max_feature_C])
for i in range(300,400):
    max_feature_C = np.argmax(feature_importance_C)
    feature_importance_C[max_feature_C] = 0
    keep_feature_400.append(names[max_feature_C])

X_tr_C_50 = X_tr_C[keep_feature_50]
X_tr_C_100 = X_tr_C[keep_feature_100]
X_tr_C_200 = X_tr_C[keep_feature_200]
X_tr_C_300 = X_tr_C[keep_feature_300]
X_tr_C_400 = X_tr_C[keep_feature_400]
X_tr_C_all = X_tr_C.copy()

X_val_C_50 = X_val_C[keep_feature_50]
X_val_C_100 = X_val_C[keep_feature_100]
X_val_C_200 = X_val_C[keep_feature_200]
X_val_C_300 = X_val_C[keep_feature_300]
X_val_C_400 = X_val_C[keep_feature_400]
X_val_C_all = X_val_C.copy()

## Standardise Data
ss_C = StandardScaler()
ss_C.fit(X_tr_C_50)
X_tr_C_50_trans = ss_C.transform(X_tr_C_50)
X_val_C_50_trans = ss_C.transform(X_val_C_50)

ss_C.fit(X_tr_C_100)
X_tr_C_100_trans = ss_C.transform(X_tr_C_100)
X_val_C_100_trans = ss_C.transform(X_val_C_100)

ss_C.fit(X_tr_C_200)
X_tr_C_200_trans = ss_C.transform(X_tr_C_200)
X_val_C_200_trans = ss_C.transform(X_val_C_200)

ss_C.fit(X_tr_C_300)
X_tr_C_300_trans = ss_C.transform(X_tr_C_300)
X_val_C_300_trans = ss_C.transform(X_val_C_300)

ss_C.fit(X_tr_C_400)
X_tr_C_400_trans = ss_C.transform(X_tr_C_400)
X_val_C_400_trans = ss_C.transform(X_val_C_400)

ss_C.fit(X_tr_C_all)
X_tr_C_all_trans = ss_C.transform(X_tr_C_all)
X_val_C_all_trans = ss_C.transform(X_val_C_all)

lr_feature = LogisticRegression(solver='lbfgs')
lr_feature.fit(X_tr_C_50_trans,y_tr_C)
print("50 Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_50_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_50_trans))   )
print("50 Feature Accuracy:",lr_feature.score(X_tr_C_50_trans,y_tr_C),",",lr_feature.score(X_val_C_50_trans,y_val_C))
print()
lr_feature.fit(X_tr_C_100_trans,y_tr_C)
print("100 Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_100_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_100_trans)))
print("100 Feature Accuracy:",lr_feature.score(X_tr_C_100_trans,y_tr_C),",",lr_feature.score(X_val_C_100_trans,y_val_C))
print()
lr_feature.fit(X_tr_C_200_trans,y_tr_C)
print("200 Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_200_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_200_trans)))
print("200 Feature Accuracy:",lr_feature.score(X_tr_C_200_trans,y_tr_C),",",lr_feature.score(X_val_C_200_trans,y_val_C))
print()
lr_feature.fit(X_tr_C_300_trans,y_tr_C)
print("300 Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_300_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_300_trans)))
print("300 Feature Accuracy:",lr_feature.score(X_tr_C_300_trans,y_tr_C),",",lr_feature.score(X_val_C_300_trans,y_val_C))
print()
lr_feature.fit(X_tr_C_400_trans,y_tr_C)
print("400 Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_400_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_400_trans)))
print("400 Feature Accuracy:",lr_feature.score(X_tr_C_400_trans,y_tr_C),",",lr_feature.score(X_val_C_400_trans,y_val_C))
print()
lr_feature.fit(X_tr_C_all_trans,y_tr_C)
print("all Feature log loss:",log_loss(  y_tr_C,lr_feature.predict_proba(X_tr_C_all_trans)), log_loss(  y_val_C,lr_feature.predict_proba(X_val_C_all_trans)))
print("All Feature Accuracy:",lr_feature.score(X_tr_C_all_trans,y_tr_C),",",lr_feature.score(X_val_C_all_trans,y_val_C))

50 Feature log loss: 0.6758113685542032 0.6877746623690721
50 Feature Accuracy: 0.5551348793185045 , 0.5265049415992812

100 Feature log loss: 0.6706015489462164 0.6822589069767703
100 Feature Accuracy: 0.5551348793185045 , 0.5265049415992812

200 Feature log loss: 0.6659583127996654 0.6786300473704638
200 Feature Accuracy: 0.5603407477520114 , 0.527403414195867

300 Feature log loss: 0.661021088128191 0.6734924388879356
300 Feature Accuracy: 0.5806909607193563 , 0.5462713387241689

400 Feature log loss: 0.6565865203451939 0.6684388756749751
400 Feature Accuracy: 0.5948887837198297 , 0.5741239892183289

all Feature log loss: 0.6518814179821367 0.6634424903958274
All Feature Accuracy: 0.6218646474207288 , 0.6082659478885895
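
The nested top-k feature subsets built in the cell above can also be expressed more compactly by sorting the importances once. The following is a minimal sketch of that idea on synthetic data; the data, the subset sizes and the variable names are assumptions for illustration, and ties between equal importances may be ordered differently from the repeated `argmax` approach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the image features (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Feature indices sorted from most to least important.
order = np.argsort(rf.feature_importances_)[::-1]

# Nested subsets: the top-5 features are always contained in the top-10, etc.
subsets = {k: order[:k] for k in (5, 10, 15)}
for k, idx in subsets.items():
    print(k, X[:, idx].shape)
```

With a pandas `DataFrame`, the same index arrays can be used to pick column names, for example via `X.columns[idx]`.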


In [5]:
# Outlier Removing
q1 = np.quantile(X_tr_C, 0.25)
q3 = np.quantile(X_tr_C, 0.75)
IQR = q3 - q1
# The usual IQR rule multiplies the IQR by 1.5, but this data is unusual and quite noisy, so we
# increase the IQR multiplier to ensure that only the most extreme points of the original data are removed
out = np.asarray(X_tr_C < (q1 - 3000 * IQR)) + np.asarray(X_tr_C > (q3 + 3000 * IQR))
X_tr_C_out = X_tr_C[np.any(out, axis=1)==False]
y_tr_C_out = y_tr_C[np.any(out, axis=1)==False]
print(len(X_tr_C))
print(len(X_tr_C_out))
X_tr_C_out_ss = ss_C.fit(X_tr_C_out).transform(X_tr_C_out)
X_val_C_ss = ss_C.transform(X_val_C)

lr_feature.fit(X_tr_C_out_ss,y_tr_C_out)
print("All Feature",log_loss(y_tr_C_out,lr_feature.predict_proba(X_tr_C_out_ss)), log_loss(y_val_C,lr_feature.predict_proba(X_val_C_ss)))
print("All Feature Accuracy:",lr_feature.score(X_tr_C_out_ss,y_tr_C_out),",",lr_feature.score(X_val_C_ss,y_val_C))

2113
2093
All Feature 0.39074645905683414 0.8353514279819095
All Feature Accuracy: 0.8107978977544195 , 0.6433063791554358
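
Note that `np.quantile` called without an `axis` argument, as in the cell above, computes a single quantile over all values of all features. A common alternative is to compute the quartiles per feature. The sketch below shows that variant on toy data; `iqr_inlier_mask`, the toy array and the multiplier `k=3.0` are assumptions for illustration, not the procedure used to produce the results above.

```python
import numpy as np

def iqr_inlier_mask(X, k=1.5):
    """Boolean mask of rows whose every feature lies within [q1 - k*IQR, q3 + k*IQR]."""
    X = np.asarray(X, dtype=float)
    q1 = np.percentile(X, 25, axis=0)  # per-feature first quartile
    q3 = np.percentile(X, 75, axis=0)  # per-feature third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.all((X >= lower) & (X <= upper), axis=1)

# Hypothetical toy data: 100 rows, 3 features, one row made extreme on purpose.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
X_toy[0, 1] = 50.0

mask = iqr_inlier_mask(X_toy, k=3.0)
print(mask[0], mask.sum())
```

The same mask must be applied to both the features and the labels (for example `X[mask]` and `y[mask]`) so that they stay aligned.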


In [6]:
X_tr_C_trans = X_tr_C_out_ss.copy()
X_val_C_trans = X_val_C_ss.copy()

## Selecting Model
lgr = LogisticRegression(solver="lbfgs").fit(X_tr_C_trans,y_tr_C_out)
y_tr_lgr_C = lgr.score(X_tr_C_trans,y_tr_C_out)
y_val_lgr_C = lgr.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,lgr.predict_proba(X_tr_C_trans)), log_loss(y_val_C,lgr.predict_proba(X_val_C_trans)))
print("Logistic:",y_tr_lgr_C,",",y_val_lgr_C)

log loss 0.3907464590568342 0.8353514279819094
Logistic: 0.8107978977544195 , 0.6433063791554358


In [7]:
svc_linear_C = SVC(kernel="linear",probability=True).fit(X_tr_C_trans,y_tr_C_out)
y_tr_linear_C = svc_linear_C.score(X_tr_C_trans,y_tr_C_out)
y_val_linear_C = svc_linear_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_linear_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_linear_C.predict_proba(X_val_C_trans)))
print("SVC Linear:",y_tr_linear_C,",",y_val_linear_C)

log loss 0.5675673166446202 0.6320625135213229
SVC Linear: 0.8423315814620163 , 0.6388140161725068


In [8]:
svc_rbf_C = SVC(kernel="rbf",probability=True).fit(X_tr_C_trans,y_tr_C_out)
y_tr_rbf_C = svc_rbf_C.score(X_tr_C_trans,y_tr_C_out)
y_val_rbf_C = svc_rbf_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_rbf_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_rbf_C.predict_proba(X_val_C_trans)))
print("SVC RBF:",y_tr_rbf_C,",",y_val_rbf_C)

log loss 0.29193157871657843 0.5419569518019958
SVC RBF: 0.9106545628284759 , 0.7295597484276729


In [9]:
svc_poly_C = SVC(kernel="poly",probability=True).fit(X_tr_C_trans,y_tr_C_out)
y_tr_poly_C = svc_poly_C.score(X_tr_C_trans,y_tr_C_out)
y_val_poly_C = svc_poly_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_poly_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_poly_C.predict_proba(X_val_C_trans)))
print("SVC Poly:",y_tr_poly_C,",",y_val_poly_C)

log loss 0.26217385994657544 0.6001573677778198
SVC Poly: 0.9909221213569039 , 0.701707097933513


In [10]:
kf_C = KFold(len(X_tr_C_trans),shuffle=True,random_state=0,n_folds=3)
c_C = np.logspace(-10,10,50)
ga_C = np.logspace(-10,10,50)
lgr_cv = LogisticRegressionCV(Cs=c_C,cv=kf_C,solver="lbfgs").fit(X_tr_C_trans,y_tr_C_out)
y_tr_lgr_cv_C = lgr_cv.score(X_tr_C_trans,y_tr_C_out)
y_val_lgr_cv_C = lgr_cv.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,lgr_cv.predict_proba(X_tr_C_trans)), log_loss(y_val_C,lgr_cv.predict_proba(X_val_C_trans)))
print("Logistic CV:",y_tr_lgr_cv_C,",",y_val_lgr_cv_C)

log loss 0.5006982302024461 0.579557926878059
Logistic CV: 0.7601528905876732 , 0.6999101527403414


In [11]:
cross_score_c_C = np.empty(len(c_C))
for i in range(0,len(c_C)):
    svc_rbf_c_C = SVC(kernel="rbf",probability=True,gamma="auto",C=c_C[i]).fit(X_tr_C_trans,y_tr_C_out)
    cross_score_c_C[i] = cross_val_score(svc_rbf_c_C,X_tr_C_trans,y_tr_C_out,cv=kf_C).mean()
max_score_c_C = np.argmax(cross_score_c_C)
svc_rbf_c_C = SVC(kernel="rbf",probability=True,gamma="auto",C=c_C[max_score_c_C]).fit(X_tr_C_trans,y_tr_C_out)
y_tr_svc_c_C = svc_rbf_c_C.score(X_tr_C_trans,y_tr_C_out)
y_val_svc_c_C = svc_rbf_c_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_rbf_c_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_rbf_c_C.predict_proba(X_val_C_trans)))
print("SVC RBF with C:",y_tr_svc_c_C,",",y_val_svc_c_C)

log loss 0.36639869600809755 0.5449595749768159
SVC RBF with C: 0.8523650262780698 , 0.7151841868823001


In [12]:
cross_score_ga_C = np.empty(len(ga_C))
for i in range(0,len(ga_C)):
    svc_rbf_ga_C = SVC(kernel="rbf",probability=True,gamma=ga_C[i],C=1).fit(X_tr_C_trans,y_tr_C_out)
    cross_score_ga_C[i] = cross_val_score(svc_rbf_ga_C,X_tr_C_trans,y_tr_C_out,cv=kf_C).mean()
max_score_ga_C = np.argmax(cross_score_ga_C)
svc_rbf_ga_C = SVC(kernel="rbf",probability=True,gamma=ga_C[max_score_ga_C],C=1).fit(X_tr_C_trans,y_tr_C_out)
y_tr_svc_ga_C = svc_rbf_ga_C.score(X_tr_C_trans,y_tr_C_out)
y_val_svc_ga_C = svc_rbf_ga_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_rbf_ga_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_rbf_ga_C.predict_proba(X_val_C_trans)))
print("SVC RBF with gama:",y_tr_svc_ga_C,",",y_val_svc_ga_C)

log loss 0.27949362888681206 0.5407400726482782
SVC RBF with gama: 0.924988055422838 , 0.7268643306379156


In [13]:
svc_rbf_ga_c_C = SVC(kernel="rbf",probability=True,gamma=ga_C[max_score_ga_C],C=c_C[max_score_c_C]).fit(X_tr_C_trans,y_tr_C_out)
y_tr_svc_ga_c_C = svc_rbf_ga_c_C.score(X_tr_C_trans,y_tr_C_out)
y_val_svc_ga_c_C = svc_rbf_ga_c_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_rbf_ga_c_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_rbf_ga_c_C.predict_proba(X_val_C_trans)))
print("SVC RBF with C and gama:",y_tr_svc_ga_c_C,",",y_val_svc_ga_c_C)

log loss 0.3585621248138951 0.5445026838701567
SVC RBF with C and gama: 0.8595317725752508 , 0.7115902964959568
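
Tuning `C` and `gamma` one at a time and then combining the two winners does not necessarily give the best joint pair. In addition, `cross_val_score` uses accuracy by default for classifiers, whereas the challenge is scored on log loss. The sketch below shows one possible alternative: a joint grid search scored with `neg_log_loss`. It assumes a scikit-learn version that provides `sklearn.model_selection`; the synthetic data and the small grid are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only; not the assignment data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling inside the pipeline is re-fit on the training part of each fold.
pipe = make_pipeline(StandardScaler(),
                     SVC(kernel="rbf", probability=True, random_state=0))

# Small illustrative grid; the ranges are assumptions, not tuned values.
param_grid = {
    "svc__C": np.logspace(-1, 1, 3),
    "svc__gamma": np.logspace(-3, -1, 3),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated log loss of the best pair
```

Since scikit-learn scorers follow a "greater is better" convention, the log loss is reported negated, hence the sign flip when printing.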


In [14]:
degree_C = np.arange(1,8)
degree_cross_score_C = np.empty([7,3])
for i in range(0,len(degree_C)):
    svc_poly_deg_C = SVC(kernel="poly",probability=True,degree=degree_C[i]).fit(X_tr_C_trans,y_tr_C_out)
    degree_cross_score_C[i,:] = cross_val_score(svc_poly_deg_C,X_tr_C_trans,y_tr_C_out,cv=kf_C)
mean_cross_score_de_C = degree_cross_score_C.mean(axis=1)
max_score_de_C = np.argmax(mean_cross_score_de_C)
svc_poly_de_C = SVC(kernel="poly",probability=True,degree=degree_C[max_score_de_C]).fit(X_tr_C_trans,y_tr_C_out)
y_tr_poly_de_C = svc_poly_de_C.score(X_tr_C_trans,y_tr_C_out)
y_val_poly_de_C = svc_poly_de_C.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,svc_poly_de_C.predict_proba(X_tr_C_trans)), log_loss(y_val_C,svc_poly_de_C.predict_proba(X_val_C_trans)))
print("SVC Poly with Degree",y_tr_poly_de_C,y_val_poly_de_C)

log loss 0.48688492699669317 0.5876468311534229
SVC Poly with Degree 0.7835642618251314 0.6891284815813118


In [15]:
classifier = MLPClassifier(hidden_layer_sizes=(2000,2000,2000), max_iter=10000, early_stopping=True)
classifier.fit(X_tr_C_trans,y_tr_C_out)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(2000, 2000, 2000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=10000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [16]:
y_tr_mlp = classifier.score(X_tr_C_trans,y_tr_C_out)
y_val_mlp = classifier.score(X_val_C_trans,y_val_C)
print("log loss",log_loss(y_tr_C_out,classifier.predict_proba(X_tr_C_trans)), log_loss(y_val_C,classifier.predict_proba(X_val_C_trans)))
print("mlp",y_tr_mlp,y_val_mlp)

log loss 0.4207422567213147 0.5937083733090047
mlp 0.8036311514572384 0.7070979335130279


In [19]:
X_test_C_ss = ss_C.transform(X_test_C)
svc_rbf_C = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr_C_trans,y_tr_C_out)
pred_probabilities = svc_rbf_C.predict_proba(X_test_C_ss)

#ANSWER_TEXT#

***Your answer goes here:***
**Data Pre-Processing:** I use `StandardScaler` to remove the mean and scale each feature to unit variance. This prevents the models from behaving badly when individual features do not look more or less like standard normally distributed data.

**Feature Selection:** By removing irrelevant features (that is, by setting the corresponding coefficient estimates to zero) we can obtain a model that is more easily interpreted. I chose a random forest (RF) for feature selection. RFs are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy. I use mean decrease impurity, which is the feature importance reported by the model. I trained a logistic regression on the top 50, 100, 200, 300 and 400 features and on all features, and compared the training and validation results of each. As the printed results show, the model using all the features performs best.

As feature selection did not help, I tried outlier removal to clean the data further. I use the interquartile range (IQR), a measure of how spread out the values are, to identify outliers. Because the data is noisier than I expected, the usual multiplier of 1.5 times the IQR would remove all the data. I therefore increased the multiplier until only a reasonable amount of data was removed (20 of 2113 training points). As shown above, with the cleaned data the validation accuracy improved (from 0.608 to 0.643), although the validation log loss increased.

**Model Selection:** Model selection methods are an essential tool for data analysis, especially for big datasets involving many features. After feature selection, I have to choose the best model for the dataset. I first compared logistic regression, SVC linear, SVC RBF and SVC Poly. Of these four models, SVC RBF performs best. I then investigated further by optimising the models. I searched for the best C for the logistic regression and SVC RBF models. I also searched for the best gamma for SVC RBF, and tried the combination of the best C and the best gamma for SVC RBF. I tested different degrees for SVC Poly as well. The results show that, with the tuned degree, SVC Poly achieves a slightly better validation log loss than before. I also tried a deep neural network (DNN), a state-of-the-art model that has shown great performance in many fields in recent years. However, the DNN did not meet my expectations: although I tried different numbers of hidden layers and hidden nodes, it did not outperform the tuned logistic regression or SVC RBF models in validation log loss. The code above shows only the best DNN I found. The DNN may perform badly because the data is noisy and because a simple feed-forward neural network is not well suited to images, where many papers have shown that convolutional neural networks give good results. Comparing the remaining models, we can see that SVC RBF is among the best.

In [20]:
#ANSWER_PROB#
# Run this cell when you are ready to submit your test-set probabilities. This cell will generate some
# warning messages if something is not right: make sure to address them!
if pred_probabilities.shape != (1114, 2):
    print('Array is of incorrect shape. Rectify this before submitting.')
elif (pred_probabilities.sum(axis=1) != 1.0).all():
    print('Submitted values are not correct probabilities. Rectify this before submitting.')
else:
    for _prob in pred_probabilities:
        print('{:.8f}, {:.8f}'.format(_prob[0], _prob[1]))

0.78048000, 0.21952000
0.86793488, 0.13206512
0.40015268, 0.59984732
0.38379435, 0.61620565
0.79148536, 0.20851464
0.10613728, 0.89386272
0.24466341, 0.75533659
0.09001504, 0.90998496
0.86953651, 0.13046349
0.74871666, 0.25128334
0.52888704, 0.47111296
0.83147867, 0.16852133
0.60005572, 0.39994428
0.64607477, 0.35392523
0.02263243, 0.97736757
0.58388629, 0.41611371
0.05716327, 0.94283673
0.53427537, 0.46572463
0.87184875, 0.12815125
0.44437230, 0.55562770
0.86944105, 0.13055895
0.42077032, 0.57922968
0.71826849, 0.28173151
0.46097158, 0.53902842
0.75528611, 0.24471389
0.07587773, 0.92412227
0.22602223, 0.77397777
0.82415898, 0.17584102
0.32911327, 0.67088673
0.48567839, 0.51432161
0.78385854, 0.21614146
0.39925341, 0.60074659
0.88588183, 0.11411817
0.36523905, 0.63476095
0.74037965, 0.25962035
0.48807484, 0.51192516
0.19790424, 0.80209576
0.72820727, 0.27179273
0.84289017, 0.15710983
0.51032898, 0.48967102
0.77235548, 0.22764452
0.63445689, 0.36554311
0.40893529, 0.59106471
0.90217693,