# IML - Task 3
The goal of this task is to predict if mutations are active or not.

Mutations are encoded as sequences of 4 ordered letters (for example DHGE), and their activity is encoded as 0 (inactive) or 1 (active).

A quick analysis of the training data (see below) shows that approx. $3.8\%$ of the mutations are active.

The data (see below) contain 20 different amino acids, so there are $20^4 = 160'000$ possible mutations. We are given $112'000$ mutations to train on and $48'000$ to test on (the test set determines the score), for a total of $160'000$ mutations. Assuming there are no duplicates, the training and test sets together cover every possible mutation.
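
The counting argument above can be checked in a few lines. This is a minimal sketch: the alphabet is the set of 20 letters printed in the analysis section of this notebook, and the train/test sizes are the ones quoted above.

```python
from itertools import product

# The 20 one-letter amino-acid codes found in the data.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered sequence of 4 letters, e.g. "DHGE".
all_mutations = ["".join(p) for p in product(AMINO_ACIDS, repeat=4)]

n_train, n_test = 112_000, 48_000
print(len(all_mutations))                      # number of possible mutations, 20**4
print(len(all_mutations) == n_train + n_test)  # do train + test sizes add up to it?
```

With the real data loaded, checking that `set(train_X) | set(test_X)` has 160'000 elements would confirm that there are no duplicates.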

## Loading & analyzing the data
First importing the usual libraries:

In [1]:
import pandas as pd
import numpy as np
from sklearn import neural_network, preprocessing

Loading the data with pandas and taking a quick look:

In [2]:
train_data = pd.read_csv("train.csv")
train_data

Unnamed: 0,Sequence,Active
0,DKWL,0
1,FCHN,0
2,KDQP,0
3,FNWI,0
4,NKRM,0
...,...,...
111995,GSME,0
111996,DLPT,0
111997,SGHC,0
111998,KIGT,0


Splitting them into X and Y components:

In [3]:
train_X = np.array(train_data.loc[: , "Sequence"])
train_Y = np.array(train_data.loc[: , "Active"])

Computing the percentage of active mutations:

In [4]:
train_Y.mean()

0.03761607142857143

So $\sim 3.8\%$ of the mutations are active.

This means that we have a very imbalanced classification task!
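
To see why the imbalance matters, here is a small illustration on synthetic labels (not the task data): a classifier that always predicts "inactive" reaches a high accuracy while finding no active mutation at all. The 3.8% positive rate mirrors the value computed above. Using the F1 score as a complementary metric is a suggestion here, not something stated by the task.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly the same positive rate as the training set.
y_true = (rng.random(10_000) < 0.038).astype(int)

# Trivial baseline: always predict the majority class (inactive).
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # high, close to 1 - positive rate
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0: no active mutation is found
```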

Let's now load and preview the test data:

In [5]:
test_data = pd.read_csv("test.csv")
test_X = np.array(test_data.loc[: , "Sequence"])
test_data

Unnamed: 0,Sequence
0,HWFK
1,MWPW
2,ALDV
3,NTLG
4,LHYY
...,...
47995,NRWM
47996,MMMK
47997,AFNM
47998,CRYI


Let's see how many amino acids, and which ones, there are (in the combination of train and test data):

In [6]:
amino_acids = set(''.join(np.concatenate((train_X, test_X))))
print(amino_acids)
print(len(amino_acids))

{'T', 'I', 'Y', 'G', 'L', 'C', 'Q', 'W', 'K', 'P', 'E', 'N', 'R', 'M', 'D', 'A', 'H', 'V', 'S', 'F'}
20


## Model considerations
How to represent the data?
* The data consist of sequences of 4 characters, but the letters have no numeric meaning (A is not more similar to C than to V)
* The order in which they appear might be more important (for example \_HQI and HQI_ might be very similar because of the shared HQI sequence, independently of what comes before or after)

This suggests one-hot encoding, for example AAMK could be encoded as
[(1, 0, ..., 0), (1, 0, ..., 0), (0, 0, ... , 1 , ... 0), (0, 0, ... , 1 , ... 0) ] 
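
As an illustration of the encoding described above, here is a hand-written sketch for a single sequence. The alphabetical column order and the helper name `one_hot` are assumptions made for this example; the notebook itself relies on sklearn's encoder.

```python
import numpy as np

# Assumed column order: the 20 amino-acid letters sorted alphabetically.
AMINO_ACIDS = sorted("ACDEFGHIKLMNPQRSTVWY")
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Return a (len(sequence), 20) matrix with a single 1 per row."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=int)
    for position, aa in enumerate(sequence):
        encoded[position, INDEX[aa]] = 1
    return encoded

encoded = one_hot("AAMK")
print(encoded.shape)              # (4, 20)
print(encoded.argmax(axis=1))     # column of the 1 in each row, for A, A, M, K
print(encoded.reshape(-1).shape)  # (80,) once flattened into one feature vector
```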

Then we can either:
* use some supervised dimensionality reduction technique
* feed it to a neural network designed to do some dimensionality reduction (bottleneck?)
    


### One-hot encoding
Let's use the OneHotEncoder from sklearn to do so.

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [7]:
# Dense output; scikit-learn versions before 1.2 call this parameter `sparse`.
hot = preprocessing.OneHotEncoder(sparse_output=False)

For the encoder to work, we need to split each string into a list of 4 characters:

In [8]:
X_matrix = [list(mut) for mut in train_X]
X_matrix[1:5]  # rows 1 to 4

[['F', 'C', 'H', 'N'],
 ['K', 'D', 'Q', 'P'],
 ['F', 'N', 'W', 'I'],
 ['N', 'K', 'R', 'M']]

In [9]:
hot_train_X = hot.fit_transform(X_matrix)  # the labels are not needed to fit the encoder

In [10]:
hot_train_X.shape

(112000, 80)
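
The 80 columns are 4 positions times 20 letters. The toy example below (four made-up sequences, not the task data) sketches how the encoder lays out its columns: one block per position, each block containing only the letters seen at that position during fitting. To stay independent of the sklearn version, the encoder is left at its default sparse output and converted with `.toarray()`.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy input: each row is one sequence split into its 4 positions.
toy = [list(seq) for seq in ["AAMK", "DHGE", "AHGK", "DAME"]]

encoder = OneHotEncoder()
toy_encoded = encoder.fit_transform(toy).toarray()

# One list of categories per position; only letters seen at that position appear.
print([list(c) for c in encoder.categories_])
print(toy_encoded.shape)  # (4, 8): 2 distinct letters at each of the 4 positions

# The encoding is lossless: the original letters can be recovered.
print(encoder.inverse_transform(toy_encoded[:1]))
```

On the real data, every letter occurs at every position, which gives the 4 × 20 = 80 columns seen above.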

Use the same encoder to encode the test data to be used later :

In [11]:
X_test_matrix = [list(mut) for mut in test_X]
hot_test_X = hot.transform(X_test_matrix)
hot_test_X.shape

(48000, 80)

### Neural Network

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier

In [12]:
NNclf = neural_network.MLPClassifier(verbose=True)

In [13]:
NNclf.fit(hot_train_X, train_Y)

Iteration 1, loss = 0.14409176
Iteration 2, loss = 0.06110186
Iteration 3, loss = 0.05133877
Iteration 4, loss = 0.04376542
Iteration 5, loss = 0.03792876
Iteration 6, loss = 0.03341289
Iteration 7, loss = 0.02993459
Iteration 8, loss = 0.02725370
Iteration 9, loss = 0.02531310
Iteration 10, loss = 0.02366806
Iteration 11, loss = 0.02224966
Iteration 12, loss = 0.02101115
Iteration 13, loss = 0.01993232
Iteration 14, loss = 0.01902639
Iteration 15, loss = 0.01817944
Iteration 16, loss = 0.01742867
Iteration 17, loss = 0.01666030
Iteration 18, loss = 0.01604607
Iteration 19, loss = 0.01532609
Iteration 20, loss = 0.01498834
Iteration 21, loss = 0.01441349
Iteration 22, loss = 0.01376433
Iteration 23, loss = 0.01343032
Iteration 24, loss = 0.01294980
Iteration 25, loss = 0.01258534
Iteration 26, loss = 0.01206816
Iteration 27, loss = 0.01179570
Iteration 28, loss = 0.01138903
Iteration 29, loss = 0.01111106
Iteration 30, loss = 0.01066836
Iteration 31, loss = 0.01036612
Iteration 32, los

In [14]:
test_Y = NNclf.predict(hot_test_X)

In [15]:
test_Y

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [16]:
test_Y.mean()

0.03816666666666667

Sanity check : the predicted share of active mutations in the test data (about 3.8%) is close to the share observed in the training data.
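
Matching positive rates do not say whether the right mutations are predicted as active. A stronger check, not performed in this notebook, would be to hold out part of the labelled data and score the model on it. The sketch below shows the idea on synthetic sequences with a made-up activity rule. The rule, the sample size, and the choice of the F1 score are all assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
letters = np.array(list("ACDEFGHIKLMNPQRSTVWY"))

# Synthetic stand-in for the training data: random 4-letter sequences and a
# made-up activity rule (NOT the real one) giving a positive rate of a few percent.
X = rng.choice(letters, size=(4000, 4))
y = (np.isin(X[:, 1], list("HKR")) & np.isin(X[:, 2], list("GQWYF"))).astype(int)

X_hot = OneHotEncoder().fit_transform(X).toarray()

# Stratify so both splits keep the same share of active sequences.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_hot, y, test_size=0.2, stratify=y, random_state=0
)

clf = MLPClassifier(max_iter=200, random_state=0)
clf.fit(X_fit, y_fit)

val_f1 = f1_score(y_val, clf.predict(X_val), zero_division=0)
print(f"positive rate: {y.mean():.3f}, validation F1: {val_f1:.3f}")
```

With the notebook's own variables, the same split could be applied to `hot_train_X` and `train_Y` before fitting `NNclf`.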

## Writing the output

In [17]:
pd.DataFrame(test_Y).to_csv("output.csv", header=False, index=False)