# TKO_7092 Evaluation of Machine Learning Methods 2024

---

Student name: Lauri Reima

Student number: 2109673

Student email: loreim@utu.fi

---

## Exercise 4

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, how cross-validation should be performed in the given scenario and why  your cross-validation will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 21 February 2024 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. Currently I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have a list of over 100.000 potential drug molecules, but their affinities still need to be verified in the lab. Obviously I do not have the resources to measure all the possible drug-target pairs, so I need to prioritise. I have decided to do this with the use of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had made in the lab, which comprise of all the 77 target proteins of interest but only 59 different drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of the remaining drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether I am wasting my lab resources by using my model.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [1]:
# Why did the estimation described in the letter fail?
# How should leave-one-out cross-validation be performed in the given scenario and why?
# Remember to provide comprehensive and precise arguments.

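First of all, I tried to reproduce your result and did not get a C-index anywhere near 0.9. The real mistake, dear Bio Scientist, is that the training data in your cross-validation is not independent of the test sample. In ordinary leave-one-out, one pair is used as test data and all the remaining pairs as training data. With pair-input data, the remaining pairs include other measurements of the same drug and of the same protein, so the model has already seen the test drug during training. The drugs you actually want to predict have never been measured, so your estimate is too optimistic for your use case. There are essentially 4 ways to do a leave-one-out study on pair data like this: 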

A: ordinary leave-one-out, as you did: only the test pair itself is left out of the training data

B: leave out of the training data every pair that has the same compound (drug) as the test pair

C: leave out of the training data every pair that has the same protein as the test pair

D: leave out of the training data every pair that has the same compound or the same protein as the test pair

Below I have written a function that calculates the C-index for the chosen setting, given as the parameter A, B, C or D. 

But which one to use in your case? 

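You have a lot of potential drugs that have never been measured, while all 77 target proteins are already in your data. The model will therefore be used on pairs whose drug is new but whose protein is known. To mimic this, the training data must not contain the affinities of the test pair's drug towards other proteins. You really want a 'new' drug on every iteration, so you use option B. This way we avoid the optimistic bias caused by shared drugs and get an estimate of the performance on unseen drugs.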
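
The four settings can be illustrated on a handful of made-up pairs. The following is only a minimal sketch: the helper `training_rows` and the identifiers `D1`, `T1` and so on are hypothetical and are not taken from `pairs.data`.

```python
# Toy (drug, target) pairs; the identifiers are made up for illustration.
pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D2", "T2"), ("D3", "T1")]

def training_rows(pairs, test_row, setting):
    """Return the row indices allowed in the training set for one test pair."""
    test_drug, test_target = pairs[test_row]
    keep = []
    for row, (drug, target) in enumerate(pairs):
        if row == test_row:
            continue  # the test pair itself is never used for training
        if setting in ("B", "D") and drug == test_drug:
            continue  # drop pairs that share the test pair's drug
        if setting in ("C", "D") and target == test_target:
            continue  # drop pairs that share the test pair's target
        keep.append(row)
    return keep

# Test pair is row 0, i.e. ("D1", "T1").
for setting in "ABCD":
    print(setting, training_rows(pairs, 0, setting))
# A [1, 2, 3, 4]
# B [2, 3, 4]
# C [1, 3]
# D [3]
```

In setting A the training set still contains row 1, another measurement of drug `D1`, which is exactly the kind of information that is not available for an unmeasured drug. Setting B removes it.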

#### Import libraries

In [2]:
# Import the libraries you need.
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler


#### Write utility functions

In [3]:
# Write the utility functions you need in your analysis.
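def cindex(y, yp):
    # concordance index: the fraction of comparable pairs that are ranked correctly
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            # named p_next instead of np so that numpy is not shadowed
            p_next = yp[j]
            # only pairs with different true values are comparable
            if (t != nt): 
                n = n + 1
                if (p < p_next and t < nt) or (p > p_next and t > nt): 
                    h_num += 1
                elif (p == p_next):
                    # tied predictions count as half correct
                    h_num += 0.5
    # no comparable pairs at all
    if n == 0:
        return 1
    return h_num/n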
    
# a function to calculate the indices needed in training
def training_indices(data, ind):
    # pairs with the same drug as the test pair (column 0)
    cond_1 = data[0] == data.values[ind][0]
    # pairs with the same target as the test pair (column 1)
    cond_2 = data[1] == data.values[ind][1]
    same_drug = data[cond_1]
    same_target = data[cond_2]
    # the indices of the pairs sharing the drug, the target, or either of them
    # (row labels equal row positions because the data is read with the default index)
    drug = np.array(same_drug.index)
    target = np.array(same_target.index)
    both = np.unique(np.concatenate((drug, target)))
    # training indices: all rows except the excluded ones (the test pair is always excluded)
    train_ind_all = np.delete(np.arange(len(data)), ind)
    train_ind_drug = np.delete(np.arange(len(data)), drug)
    train_ind_target = np.delete(np.arange(len(data)), target)
    train_ind_both = np.delete(np.arange(len(data)), both)
    # return a dictionary so we can pick the setting we want
    return {'A': train_ind_all, 'B':train_ind_drug, 'C':train_ind_target, 'D': train_ind_both}

# basically a leave-one-out cross-validation function with a little twist: we can choose the setting as well
# observation_type is by default 'A'. Other choices are 'B', 'C' and 'D'
def type_choice(input_data: np.array, output_data: np.array, pairs: pd.DataFrame, observation_type='A'): 
    knn = KNeighborsRegressor(n_neighbors=10)
    predictions = []
    true_values = []
    # loop through input_data, using each pair once as the test sample
    for i in range(len(input_data)):
        # use the function above to choose the training indices, this might be a little expensive 
        train_ind = training_indices(pairs, i)[observation_type] 
        # make the training data
        X_train, y_train = input_data[train_ind], output_data[train_ind]        
        # fit the model
        knn.fit(X_train, y_train)
        # prediction made for the test pair
        y_pred = knn.predict(input_data[i].reshape(1, -1))
        # add the prediction and the true value to separate lists as plain scalars
        predictions.extend(np.ravel(y_pred)) 
        true_values.extend(np.ravel(output_data[i])) 
    # and calculate the c-index value from the lists    
    c_index = cindex(true_values,predictions)

    return c_index
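
For reference, the quantity computed by `cindex` above can be written out as a formula. This is a restatement of the code, with $y_i$ the measured affinities, $\hat{y}_i$ the predictions and $h$ a helper symbol introduced here only for illustration:

$$
C = \frac{1}{\left|\{(i,j) : i<j,\ y_i \neq y_j\}\right|} \sum_{\substack{i<j \\ y_i \neq y_j}} h(i,j),
\qquad
h(i,j) =
\begin{cases}
1 & \text{if } (\hat{y}_i - \hat{y}_j)(y_i - y_j) > 0 \\
0.5 & \text{if } \hat{y}_i = \hat{y}_j \\
0 & \text{otherwise}
\end{cases}
$$

Pairs with equal measured affinities are skipped, and the function returns 1 in the degenerate case where no comparable pair exists.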

#### Load datasets

In [4]:
# Read the data files (input.data, output.data, pairs.data).
input = pd.read_csv('input.data', sep=' ', header=None)
output = pd.read_csv('output.data', header=None)
pairs = pd.read_csv('pairs.data', sep=' ', header=None)


#### Implement and run cross-validation

In [5]:
# Implement and run the requested cross-validation. Report and interpret its results.
input_data = input.values # numpy
output_data = output.values # numpy
                            # pairs is a dataframe
#print(f'Type A c-index: {type_choice(input_data, output_data, pairs)}')
print(f'Type B c-index: {type_choice(input_data, output_data, pairs, "B")}')
#print(f'Type C c-index: {type_choice(input_data, output_data, pairs, "C")}')
#print(f'Type D c-index: {type_choice(input_data, output_data, pairs, "D")}')


Type B c-index: 0.51968671679198


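A C-index of about 0.52 indicates that the model is not good at ranking the affinities of drugs it has not seen before. 
It is barely better than random guessing, which would give about 0.5. The model might improve with more data, but as it stands it does not generalise to new drugs, so using it to prioritise your lab measurements is most likely a waste of resources.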
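
To put a C-index near 0.5 into perspective, the sketch below computes the index for three artificial prediction vectors. The function `c_index_sketch` is a compact illustrative helper that assumes at least one comparable pair exists, and the numbers are made up; they are not the data from the letter.

```python
def c_index_sketch(y_true, y_pred):
    """Fraction of comparable pairs ranked in the right order (ties count 0.5)."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true values are not comparable
            comparable += 1
            if y_pred[i] == y_pred[j]:
                concordant += 0.5  # tied predictions count as half correct
            elif (y_pred[i] < y_pred[j]) == (y_true[i] < y_true[j]):
                concordant += 1.0  # predicted order agrees with the true order
    return concordant / comparable

affinities = [1.0, 2.0, 3.0, 4.0]

perfect = c_index_sketch(affinities, [10, 20, 30, 40])   # same ordering
inverted = c_index_sketch(affinities, [40, 30, 20, 10])  # opposite ordering
constant = c_index_sketch(affinities, [5, 5, 5, 5])      # no information at all

print(perfect, inverted, constant)
# 1.0 0.0 0.5
```

A model that always predicts the same value already reaches 0.5, so a score of 0.52 carries almost no ranking information.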