# TKO_7092 Evaluation of Machine Learning Methods 2025

---

Student name: Saara Mäkelä

Student number: 2203834

Student email: sahanm@utu.fi

---

## Exercise 3

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, what is the correct way to perform cross-validation in the given scenario, and why the correct cross-validation method will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 19 February 2025 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have already made laboratory experiments to measure the affinities between some proteins and drug molecules.

My colleague is working on another set of proteins, and the objectives of his project are similar to mine. He has recently discovered thousands of new potential drug molecules. He asked me if I could find the pairs that have the strongest affinities among his proteins and drug molecules. Obviously I do not have the resources to measure all the possible pairs in my laboratory, so I need to prioritise. I decided to do this with the help of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had already made in the laboratory with my proteins and drug molecules. They comprise 77 target proteins and 59 drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of my colleague's proteins and drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether it would be a waste of my resources if I were to use my model any further.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [77]:
# Why did the estimation described in the letter fail?
# How should leave-one-out cross-validation be performed in the given scenario and why?
# Remember to provide comprehensive and precise arguments.


__Why did the estimation described in the letter fail?__

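The estimation failed because it ignored the pair structure of the data. The scientist ran an ordinary leave-one-out cross-validation, which assumes that each test observation is independent of the training set. With pair-input data this does not hold: when one drug-target pair is held out, other pairs containing the same drug or the same target remain in the training set, so the model is tested on a pair whose members it has already seen. In the real use case the colleague's proteins and drug molecules are all new, meaning that neither member of a predicted pair occurs in the training data. Ordinary cross-validation does not distinguish between these out-of-sample types, so its estimate describes an easier task than the one the model was used for and is therefore too optimistic. To be reliable, the cross-validation must be modified so that the dependencies between the test and training observations reflect those between the out-of-sample observations and the sample: depending on the out-of-sample type, observations that share the first or the second pair member with the test observation must not be used for training.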
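The dependency problem can be illustrated with a minimal sketch. The pair identifiers below are made up for illustration and are not taken from `pairs.data`; the helper names `shares_member` and `leaky_test_pairs` are likewise hypothetical. The function lists the pairs that, under ordinary leave-one-out, would be tested while a related pair (same drug or same target) is still in the training set.

```python
# Toy (drug, target) identifiers; hypothetical, not from pairs.data.
pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D3", "T3")]


def shares_member(pair_a, pair_b):
    """Return True if the two pairs have the same drug or the same target."""
    return pair_a[0] == pair_b[0] or pair_a[1] == pair_b[1]


def leaky_test_pairs(pairs):
    """Indices of pairs that ordinary leave-one-out would test while a
    related pair (same drug or same target) remains in the training set."""
    leaky = []
    for i, test_pair in enumerate(pairs):
        others = (p for j, p in enumerate(pairs) if j != i)
        if any(shares_member(test_pair, p) for p in others):
            leaky.append(i)
    return leaky


print(leaky_test_pairs(pairs))  # [0, 1, 2]
```

Only the last pair has no relative in the training set; the other three would be evaluated with information about their drug or target already available to the model.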

__How should leave-one-out cross-validation be performed in the given scenario and why?__

The leave-one-out cross-validation must respect the dependencies between the pairs. If the out-of-sample observations are of type B, the observations that share the first pair member with the test observation must not be used for training. If they are of type C, the observations that share the second pair member must be excluded, and so on. In this scenario both the drug and the target of every pair to be predicted are new, so for each test pair every observation that shares either the drug or the target with it must be removed from the training set. The model is then trained on the remaining observations and used to predict the held-out pair, and the C-index is computed over all the predictions. Done this way, the cross-validation no longer produces falsely high results, because the model cannot exploit information about the test pair's drug or target that would not be available for genuinely new pairs.
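The exclusion rules described above can be sketched in plain Python as follows. This is only an illustration: the function name `training_indices`, the toy pair list, and the labels "A" (ordinary leave-one-out) and "D" (both members new) are assumptions added here; only types B and C are named in the answer itself.

```python
def training_indices(pairs, test_idx, setting):
    """Return the indices usable for training when pairs[test_idx] is tested.

    setting "A": only the test pair itself is removed (ordinary leave-one-out).
    setting "B": also remove pairs sharing the first member with the test pair.
    setting "C": also remove pairs sharing the second member with the test pair.
    setting "D": remove pairs sharing either member with the test pair.
    """
    if setting not in ("A", "B", "C", "D"):
        raise ValueError(f"unknown setting: {setting!r}")
    first, second = pairs[test_idx]
    keep = []
    for i, (a, b) in enumerate(pairs):
        if i == test_idx:
            continue  # the test pair is never used for training
        if setting in ("B", "D") and a == first:
            continue  # shares the first pair member
        if setting in ("C", "D") and b == second:
            continue  # shares the second pair member
        keep.append(i)
    return keep


# Toy (drug, target) identifiers; hypothetical, not from pairs.data.
toy_pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D2", "T2"), ("D3", "T3")]
for setting in "ABCD":
    print(setting, training_indices(toy_pairs, 0, setting))
```

For the test pair `("D1", "T1")`, setting "D" keeps only the pairs that contain neither `D1` nor `T1`, which matches the situation where both the drugs and the proteins are new.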

#### Import libraries

In [7]:
# Import the libraries you need.
import pandas as pd
from lifelines.utils import concordance_index
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut

#### Write utility functions

In [8]:
# Write the utility functions you need in your analysis.
def cindex(true, pred):
    c_index = concordance_index(true, pred)
    return c_index

def knn_regression(X_train, y_train, X_test, k):
    knn = KNeighborsRegressor(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    prediction = knn.predict(X_test)
    return prediction

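def LOOCV(X, y, pairs, k):
    # Leave-one-out cross-validation for pair-input data: for each test pair,
    # train only on pairs that share neither the drug nor the target with it.
    drugs, targets = pairs.iloc[:, 0], pairs.iloc[:, 1]
    y_true, y_pred = [], []
    for test_idx in range(len(X)):
        test_drug, test_target = pairs.iloc[test_idx]
        # Boolean mask of the pairs that are independent of the test pair
        # (the test pair itself is excluded because it matches on both members).
        train_mask = (drugs != test_drug) & (targets != test_target)
        X_train, X_test = X.loc[train_mask], X.iloc[[test_idx]]
        y_train = y.loc[train_mask].values.ravel()
        y_pred.append(knn_regression(X_train, y_train, X_test, k)[0])
        y_true.append(y.iloc[test_idx, 0])
    return y_true, y_pred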


#### Load datasets

In [9]:
# Read the data files (input.data, output.data, pairs.data).
input_data = pd.read_csv('input.data', delimiter=" ", header=None)
output_data = pd.read_csv('output.data', delimiter=" ", header=None)
pairs_data = pd.read_csv('pairs.data', delimiter=" ", header=None)
display(input_data)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.759222 | 0.709585 | 0.253151 | 0.421082 | 0.727780 | 0.404487 | 0.709027 | 0.242963 | 0.407292 | 0.379971 | ... | 0.838616 | 0.165050 | 0.515334 | 0.332678 | 0.577533 | 0.678125 | 0.463608 | 0.538938 | 0.460883 | 0.345251 |
| 1 | 0.034584 | 0.304720 | 0.688257 | 0.296396 | 0.151878 | 0.830755 | 0.270656 | 0.705392 | 0.186120 | 0.085594 | ... | 0.472762 | 0.730013 | 0.639373 | 0.445218 | 0.455680 | 0.090737 | 0.308432 | 0.079023 | 0.603089 | 0.197008 |
| 2 | 0.737867 | 0.236079 | 0.905987 | 0.163612 | 0.801455 | 0.789823 | 0.393999 | 0.522067 | 0.411352 | 0.781861 | ... | 0.595468 | 0.582292 | 0.836193 | 0.281514 | 0.791790 | 0.081695 | 0.583450 | 0.422539 | 0.076437 | 0.299662 |
| 3 | 0.406913 | 0.607740 | 0.235365 | 0.888679 | 0.150347 | 0.598991 | 0.130108 | 0.465818 | 0.799953 | 0.906878 | ... | 0.453880 | 0.311799 | 0.534668 | 0.563793 | 0.727767 | 0.172686 | 0.908368 | 0.786892 | 0.790459 | 0.666388 |
| 4 | 0.697707 | 0.432565 | 0.650329 | 0.886065 | 0.328660 | 0.576926 | 0.523100 | 0.080463 | 0.131349 | 0.913496 | ... | 0.583892 | 0.444141 | 0.249423 | 0.110690 | 0.420770 | 0.250148 | 0.196350 | 0.427255 | 0.166715 | 0.919720 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 0.496498 | 0.454389 | 0.353502 | 0.696922 | 0.876419 | 0.379429 | 0.733514 | 0.839360 | 0.212366 | 0.530528 | ... | 0.054742 | 0.212866 | 0.657035 | 0.483128 | 0.807080 | 0.566457 | 0.379042 | 0.566572 | 0.512170 | 0.421929 |
| 396 | 0.188616 | 0.554824 | 0.609247 | 0.371482 | 0.588356 | 0.667919 | 0.297278 | 0.269298 | 0.856952 | 0.697523 | ... | 0.549288 | 0.843150 | 0.739872 | 0.481870 | 0.359285 | 0.593446 | 0.788714 | 0.142521 | 0.819989 | 0.637718 |
| 397 | 0.095054 | 0.452918 | 0.942931 | 0.576332 | 0.411317 | 0.561792 | 0.837251 | 0.806083 | 0.581221 | 0.829656 | ... | 0.900376 | 0.378294 | 0.360243 | 0.259965 | 0.716623 | 0.817797 | 0.792314 | 0.736228 | 0.233031 | 0.597934 |
| 398 | 0.166109 | 0.471535 | 0.509825 | 0.415422 | 0.620681 | 0.786712 | 0.150722 | 0.282159 | 0.809963 | 0.809090 | ... | 0.587865 | 0.955456 | 0.714566 | 0.520387 | 0.397173 | 0.575056 | 0.822135 | 0.096667 | 0.946201 | 0.604271 |


#### Implement and run cross-validation

In [11]:
# Implement and run the requested cross-validation. Report and interpret its results.
k = 10
y_true, y_pred = LOOCV(input_data, output_data, pairs_data, k)
result = cindex(y_true, y_pred)
print(f"The estimate of how well the model worked: {result:.3f}")

The estimate of how well the model worked: 0.522


_Dear Bio Scientist,_

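A C-index of 0.5 is what random predictions would achieve, so it marks the level at which a model carries no useful information. With the correct leave-one-out cross-validation the C-index is 0.522, which means that the model ranks two affinities in the right order only about 52% of the time, barely better than guessing. The score above 90% that you obtained was too optimistic because the training data still contained measurements of the test pair's drug and target. I would say it would be a waste of your resources to use the model any further for your colleague's proteins and drug molecules.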
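To make the interpretation of the C-index concrete, here is a minimal plain-Python sketch of the measure. It is not the `lifelines` implementation used above, and the name `c_index` and the toy values are assumptions for illustration. It counts, over all pairs of observations with different true values, how often the predictions are in the same order, giving half credit when two predictions are tied.

```python
def c_index(y_true, y_pred):
    """Share of comparable observation pairs whose predictions are ordered
    the same way as their true values (tied predictions count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true values cannot be ranked
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


affinities = [1.0, 2.0, 3.0, 4.0]          # toy values, not from output.data
print(c_index(affinities, [1, 2, 3, 4]))   # perfect ranking
print(c_index(affinities, [7, 7, 7, 7]))   # uninformative constant prediction
print(c_index(affinities, [4, 3, 2, 1]))   # completely reversed ranking
```

A perfect ranking gives 1.0, a constant prediction gives 0.5, and a reversed ranking gives 0.0, so a value of 0.522 sits almost exactly at the uninformative level.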
