# TKO_2096 Applications of Data Analysis 2021
## Exercise 4

Complete the tasks given to you in the letter below. The cells at the end of this notebook are where you are expected to write your code. Insert markdown cells as needed to describe your solution.

The deadline of this exercise is **28.2.2021, 23:59**.


---

Student name: 

Student number: 

Student email: 

---


Dear Data Scientist,

I have a task for you that concerns drug molecules and their targets. I have spent a lot of time in a laboratory to measure how strongly potential drug molecules bind to putative target molecules. I do not have enough resources to measure all possible drug-target pairs, so I would like to first predict their affinities and then measure only the most promising ones. I have already managed to create a model which I believe is good for this purpose. Its details are below.

- algorithm: K-nearest neighbours regressor
- parameters: K=20
- training data: full data set

The full data set is available as the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

I am not able to evaluate how well my model will perform when I use it to predict the affinities of new drug-target pairs. I need you to evaluate the model for me. There are three distinct situations in which I want to use this model in the future.

1. I did not have the resources to measure the affinities of all the known drug-target pairs in the laboratory, so I want to use the model to predict the affinities of the remaining pairs.
2. I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.
3. Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

I need to get evaluation results from leave-one-out cross-validation with C-index. Please evaluate the generalisation performance of my model in the three situations and explain why your cross-validation methods are suitable for them.


Yours sincerely, \
Bio Scientist


PS. Follow all the general exercise guidelines stated in Moodle.

---

#### Import libraries

In [108]:
# Import the libraries you need.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from IPython.display import display

#### Load datasets

In [109]:
# Read the data files (input.data, output.data, pairs.data).
features = np.genfromtxt('./data/input.data', dtype=float)
labels = np.genfromtxt('./data/output.data')
pairs = np.genfromtxt('./data/pairs.data', dtype=str)
# Visualize data for better understanding
display(features[:5])
display(features.shape)
display(pairs[:5])
display(pairs.shape)
display(labels[:5])
display(labels.shape)

array([[6.53771, 7.04273, 7.30593, ..., 8.13992, 7.36155, 7.9893 ],
       [4.26878, 4.05945, 4.40541, ..., 8.38097, 6.80756, 7.12181],
       [7.24802, 5.96468, 7.02855, ..., 6.75104, 5.72958, 6.73456],
       [3.00092, 3.33087, 3.57794, ..., 2.74684, 2.93389, 2.76753],
       [4.34096, 3.79832, 5.67286, ..., 2.70133, 2.87879, 2.64117]])

(1500, 2500)

array([['"D23"', '"T194"'],
       ['"D9"', '"T270"'],
       ['"D3"', '"T47"'],
       ['"D49"', '"T222"'],
       ['"D37"', '"T28"']], dtype='<U6')

(1500, 2)

array([10000., 10000., 10000., 10000.,   270.])

(1500,)

In [110]:
# Standardize features
sc = StandardScaler()
features_sc = np.asarray(sc.fit_transform(features))
# Visualize to check for changes
display(features_sc[:5])

array([[-0.07254638, -0.0444449 ,  0.01692479, ...,  0.30551428,
         0.12096093,  0.25245676],
       [-0.45635292, -0.52673984, -0.4564305 , ...,  0.34221223,
         0.04160485,  0.12508838],
       [ 0.04760788, -0.21872893, -0.02834271, ...,  0.09406829,
        -0.11280997,  0.06823077],
       [-0.67082095, -0.64452645, -0.59147088, ..., -0.51553947,
        -0.51327748, -0.51422458],
       [-0.44414313, -0.56895568, -0.24958685, ..., -0.52246801,
        -0.52117026, -0.53277727]])

#### Write functions

In [111]:
# C-index score function from previous exercises
def cindex(true_labels, pred_labels):
    n = 0      # number of comparable pairs (pairs whose true labels differ)
    h_num = 0  # concordant pairs, with tied predictions counted as 0.5
    for i in range(0, len(true_labels)):
        t = true_labels[i]
        p = pred_labels[i]
        for j in range(i+1, len(true_labels)):
            nt = true_labels[j]
            # Prediction for the second element of the pair
            p_next = pred_labels[j]
            if (t != nt):
                n = n + 1
                if (p < p_next and t < nt) or (p > p_next and t > nt):
                    h_num += 1
                elif (p == p_next):
                    h_num += 0.5
    cindx = h_num / n
    return cindx
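
As a sanity check of the definition above, the same score can be computed without the double loop. The following is a minimal sketch, not part of the original solution: the name `cindex_vectorized` and the toy values are made up for illustration. It counts the pairs with different true labels, gives 1 to each pair ordered the same way by the predictions, and 0.5 to each pair with tied predictions.

```python
import numpy as np

def cindex_vectorized(true_labels, pred_labels):
    """C-index over all pairs with different true labels (hypothetical helper)."""
    y = np.asarray(true_labels, dtype=float).ravel()
    p = np.asarray(pred_labels, dtype=float).ravel()
    # Indices of all unordered pairs (i < j)
    i, j = np.triu_indices(len(y), k=1)
    comparable = y[i] != y[j]
    if not comparable.any():
        raise ValueError("C-index is undefined when all true labels are equal")
    # Sign of the difference in true labels and in predictions for each comparable pair
    dy = np.sign(y[i] - y[j])[comparable]
    dp = np.sign(p[i] - p[j])[comparable]
    concordant = np.sum(dy == dp)  # same ordering in labels and predictions
    tied = np.sum(dp == 0)         # tied predictions count as half
    return (concordant + 0.5 * tied) / comparable.sum()

# Toy values (made up): 5 comparable pairs, 3 concordant, 1 discordant, 1 tied prediction
toy_true = [1.0, 2.0, 3.0, 3.0]
toy_pred = [0.1, 0.4, 0.3, 0.4]
print(cindex_vectorized(toy_true, toy_pred))  # 0.7
```

The pair of rows with equal true labels (3.0 and 3.0) is skipped, which is why the denominator is 5 rather than 6.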

In [112]:
# Function that calculates C-index score for plain leave-one-out CV
# Expects: X = data; y = labels; n = neighbors for kNN
def loo_cv_cindex(X, y, n):
    # List for predicted values
    preds = []
    # For all rows in data
    for i in range(len(X)):
        # Progress printer used in testing and for large datasets
        # if (i % 100 == 0):
            # print(i)
        # Choose the ith row as test set and exclude ith row from train sets
        X_test = X[i].reshape(1, -1)
        X_train = np.delete(X, i, axis = 0)
        y_train = np.delete(y, i, axis = 0)
        # Fit kNN on the remaining rows and store the single predicted value as a scalar
        knn = KNeighborsRegressor(n_neighbors = n, metric = 'euclidean')
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        preds.append(pred[0])
    # Return C-index score
    return cindex(y, preds)

In [113]:
# Function that calculates C-index score for leave-one-out CV in which every row sharing the
# test row's identifier is also removed from the training set
# Expects: X = data; y = labels; targets = identifier column to hold out on; n = neighbors for kNN
def lpo_cv_index_target(X, y, targets, n):
    # List for predicted values
    preds = []
    # For all rows in data
    for i in range(len(X)):
        # Progress printer used in testing and for large datasets
        # if (i % 100 == 0):
            # print(i)
        # Save the ith identifier for train set selection
        target = targets[i]
        # Choose the ith row as test set
        X_test = X[i].reshape(1, -1)
        # Keep only rows whose identifier differs from the ith one (this also drops row i itself)
        train = (target != targets)
        X_train = X[train]
        y_train = y[train]
        # Fit kNN on the remaining rows and store the single predicted value as a scalar
        knn = KNeighborsRegressor(n_neighbors = n, metric='euclidean')
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        preds.append(pred[0])
    # Return C-index score
    return cindex(y, preds)
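
The loop above refits the model once per row, although all rows that share the held-out identifier have exactly the same training set. A possible shortcut, sketched here on synthetic stand-in data, is scikit-learn's `LeaveOneGroupOut` together with `cross_val_predict`, which fits once per held-out group. The names `demo_groups`, `X_demo` and `y_demo` and the choice of 3 neighbours are assumptions for illustration only and are not taken from the exercise data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): 6 groups with 5 pairs each, 4 features per pair
demo_groups = np.repeat(np.arange(6), 5)
X_demo = rng.normal(size=(30, 4))
y_demo = rng.normal(size=30)

model = KNeighborsRegressor(n_neighbors=3, metric="euclidean")

# Row-by-row version, mirroring the loop above: one fit per test row
preds_loop = np.empty(len(X_demo))
for i in range(len(X_demo)):
    train = demo_groups != demo_groups[i]
    model.fit(X_demo[train], y_demo[train])
    preds_loop[i] = model.predict(X_demo[i].reshape(1, -1))[0]

# Grouped version: one fit per held-out group, predictions returned in row order
preds_grouped = cross_val_predict(
    model, X_demo, y_demo, groups=demo_groups, cv=LeaveOneGroupOut()
)

# Both versions use the same training rows for every test row
print(np.allclose(preds_loop, preds_grouped))
```

With the exercise data, the group array would be the identifier column that is being held out, and the resulting predictions could be passed to `cindex` as before.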

In [114]:
# Function that calculates C-index score for leave-one-out CV in which neither the drug nor the
# target of the test pair appears in the training set
# Expects: X = data; y = labels; pairs = drug and target identifiers as pairs; n = neighbors for kNN
def lpo_cv_cindex_identify(X, y, pairs, n):
    # List for predicted values
    preds = []
    # Extract the drug column (a) and the target column (b) from pairs
    a = pairs[:, 0]
    b = pairs[:, 1]
    # For all rows in data
    for i in range(len(X)):
        # Progress printer used in testing and for large datasets
        # if (i % 100 == 0):
            # print(i)
        # Save the ith pair of elements for train set selection
        pair = pairs[i]
        # Choose the ith row as test set
        X_test = X[i].reshape(1, -1)
        # Keep only rows that share neither element of the ith pair (this also drops row i itself)
        train = (pair[0] != a) & (pair[0] != b) & (pair[1] != a) & (pair[1] != b)
        X_train = X[train]
        y_train = y[train]
        # Fit kNN on the remaining rows and store the single predicted value as a scalar
        knn = KNeighborsRegressor(n_neighbors = n, metric='euclidean')
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        preds.append(pred[0])
    # Return C-index score
    return cindex(y, preds)
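
To make the difference between the training sets concrete, the sketch below builds the boolean training mask for one test row under each hold-out rule. The identifiers in `toy_pairs`, the helper `training_mask` and the setting names are hypothetical and exist only for this illustration; they are not read from `pairs.data`.

```python
import numpy as np

# Hypothetical toy identifiers: column 0 = drug, column 1 = target
toy_pairs = np.array([
    ["D1", "T1"],
    ["D1", "T2"],
    ["D2", "T1"],
    ["D2", "T2"],
    ["D3", "T3"],
])
drugs, targets = toy_pairs[:, 0], toy_pairs[:, 1]

def training_mask(i, setting):
    """Rows allowed in the training set when row i is the test pair."""
    if setting == "new_pair":
        # Known drug and known target: drop only the test row itself
        return np.arange(len(toy_pairs)) != i
    if setting == "new_drug":
        # Unseen drug: drop every row that shares the test pair's drug
        return drugs != drugs[i]
    if setting == "new_target":
        # Unseen target: drop every row that shares the test pair's target
        return targets != targets[i]
    if setting == "new_drug_and_target":
        # Unseen drug and unseen target: drop rows sharing either one
        return (drugs != drugs[i]) & (targets != targets[i])
    raise ValueError(f"unknown setting: {setting}")

# Training rows for test row 0, i.e. the pair (D1, T1)
for setting in ("new_pair", "new_drug", "new_target", "new_drug_and_target"):
    print(setting, np.flatnonzero(training_mask(0, setting)))
# new_pair [1 2 3 4]
# new_drug [2 3 4]
# new_target [1 3 4]
# new_drug_and_target [3 4]
```

The stricter the rule, the fewer rows remain for training, which is one reason the scores tend to drop as the settings get harder.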

#### Run cross-validations

In [118]:
# Run the requested cross-validations and print the results.
k = 20
loocv_cindex_score = loo_cv_cindex(features_sc, labels, k)
print("C-index score for predicting affinities of remaining pairs", loocv_cindex_score)
lpocv_target_cindex_score = lpo_cv_index_target(features_sc, labels, pairs[:, 1], k)
print("C-index score for predicting affinities to known targets", lpocv_target_cindex_score)
lpocv_identify_cindex_score = lpo_cv_cindex_identify(features_sc, labels, pairs, k)
print("C-index score for predicting affinities to unknown targets", lpocv_identify_cindex_score)

C-index score for predicting affinities of remaining pairs 0.7761225361589472
C-index score for predicting affinities to known targets 0.7663808418539928
C-index score for predicting affinities to unknown targets 0.6771307148920411


#### Interpret results

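- Situation 1 (new pairs of known drugs and known targets): plain leave-one-out gives a C-index of about 0.78. The C-index is a ranking measure, not an accuracy: the model orders roughly 78% of the comparable pairs correctly, where 0.5 corresponds to random ordering and 1.0 to a perfect one.
- Situation 2 (new drugs, known targets): the second run gives a C-index of about 0.77. This run was called with the target column (`pairs[:, 1]`), so it removes the rows that share the test pair's target from the training set. Removing the rows that share the test pair's drug (`pairs[:, 0]`) is what simulates a drug the model has never seen.
- Situation 3 (new drugs and new targets): removing every row that shares either the drug or the target of the test pair gives a C-index of about 0.68, which is still clearly better than random but noticeably weaker than in the other two runs.

- K could be tuned separately for each situation, although the letter fixes K=20.

- The cross-validation methods are suitable because each one makes the training set look like what the model will actually have in that situation. Pairs that share a drug or a target are not independent, so leaving them in the training set would give the model information about the test molecule that it will not have for a genuinely new molecule, and the estimate would be too optimistic.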