# TKO_7092 Evaluation of Machine Learning Methods 2025

---

Student name: Emil Hellberg

Student number: 1901299

Student email: ephell@utu.fi

---

## Exercise 3

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, what is the correct way to perform cross-validation in the given scenario, and why the correct cross-validation method will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 19 February 2025 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


# Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have already made laboratory experiments to measure the affinities between some proteins and drug molecules.

My colleague is working on another set of proteins, and the objectives of his project are similar to mine. He has recently discovered thousands of new potential drug molecules. He asked me if I could find the pairs that have the strongest affinities among his proteins and drug molecules. Obviously I do not have the resources to measure all the possible pairs in my laboratory, so I need to prioritise. I decided to do this with the help of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had already made in the laboratory with my proteins and drug molecules. They comprise 77 target proteins and 59 drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of my colleague's proteins and drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether it would be a waste of my resources if I were to use my model any further.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data
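
The letter describes predicting affinities for the colleague's proteins and drug molecules, so neither member of a predicted pair appears in the training data. A leave-one-out round that mimics this situation would have to drop every training pair that shares the drug or the target with the held-out pair. The sketch below illustrates that filtering on made-up toy data using plain NumPy. The function names `unseen_pair_train_mask`, `knn_predict` and `loo_unseen_pairs` are hypothetical, and the small nearest-neighbour helper only stands in for `KNeighborsRegressor`.

```python
import numpy as np

def unseen_pair_train_mask(drugs, targets, n):
    """Mask of pairs that share neither the drug nor the target with pair n."""
    drugs = np.asarray(drugs)
    targets = np.asarray(targets)
    return (drugs != drugs[n]) & (targets != targets[n])

def knn_predict(train_x, train_y, test_x, k):
    """Mean label of the k training rows closest to test_x (Euclidean distance)."""
    dist = np.linalg.norm(train_x - test_x, axis=1)
    nearest = np.argsort(dist)[:k]
    return float(train_y[nearest].mean())

def loo_unseen_pairs(x, y, drugs, targets, k):
    """Leave-one-out predictions; NaN where fewer than k training pairs remain."""
    preds = np.full(len(y), np.nan)
    for n in range(len(y)):
        mask = unseen_pair_train_mask(drugs, targets, n)
        if mask.sum() >= k:
            preds[n] = knn_predict(x[mask], y[mask], x[n], k)
    return preds

# Toy data (made up for illustration): 3 drugs x 3 targets, all 9 pairs measured.
drugs = np.repeat(["d1", "d2", "d3"], 3)
targets = np.tile(["t1", "t2", "t3"], 3)
rng = np.random.default_rng(0)
x = rng.random((9, 4))
y = rng.random(9)

mask = unseen_pair_train_mask(drugs, targets, 0)  # hold out the pair (d1, t1)
print(int(mask.sum()))  # 4 of the 9 pairs remain for training
preds = loo_unseen_pairs(x, y, drugs, targets, k=2)
```

In this toy grid, holding out (d1, t1) removes the four other pairs that involve d1 or t1, which is exactly the information an ordinary leave-one-out would leave in the training set.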

#### Import libraries

In [64]:
# Import the libraries you need.

import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsRegressor

#### Write utility functions

In [68]:
# Write the utility functions you need in your analysis.
# Input data seems to already be normalized and pre-processed

# Returns data points that share the second member but not the first

def B(inputs, n, pairs):
    first = pairs.iloc[:, 0].to_numpy()
    second = pairs.iloc[:, 1].to_numpy()
    # Boolean row mask; no placeholder row is added to the result
    mask = (first != first[n]) & (second == second[n])
    return inputs[mask].reset_index(drop = True)

# Returns data points that share the first member but not the second

def C(inputs, n, pairs):
    first = pairs.iloc[:, 0].to_numpy()
    second = pairs.iloc[:, 1].to_numpy()
    mask = (first == first[n]) & (second != second[n])
    return inputs[mask].reset_index(drop = True)

# Returns data points that share neither member
# (the training set that matches predicting for unseen drugs and unseen targets)

def D(inputs, n, pairs):
    first = pairs.iloc[:, 0].to_numpy()
    second = pairs.iloc[:, 1].to_numpy()
    mask = (first != first[n]) & (second != second[n])
    return inputs[mask].reset_index(drop = True)

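# C-index: share of pairs with different true values whose predictions are
# ordered the same way; tied predictions count as 0.5.
# y holds the true values, yp the predictions.

def cindex(y, yp):
    n = 0
    h_num = 0

    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i + 1, len(y)):
            nt = y[j]
            p_next = yp[j]  # not named np, which would shadow numpy
            if t != nt:
                n = n + 1
                if (p < p_next and t < nt) or (p > p_next and t > nt):
                    h_num += 1
                elif p == p_next:
                    h_num += 0.5
    return h_num / n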

def scross_validate(x, y, pairs):
    # Leave-one-out over all pairs with four choices of training set:
    # A = every other pair, B/C/D = the subsets returned by B(), C() and D().
    selectors = {
        'A': lambda data, n: data.drop(data.index[n]),
        'B': lambda data, n: B(data, n, pairs),
        'C': lambda data, n: C(data, n, pairs),
        'D': lambda data, n: D(data, n, pairs),
    }
    y_true = []
    predictions = {name: [] for name in selectors}

    for n in range(len(x)):
        test_x = x.iloc[[n]].to_numpy(dtype = float)
        y_true.append(float(y.iloc[n, 0]))

        for name, select in selectors.items():
            train_x = select(x, n).to_numpy(dtype = float)
            train_y = select(y, n).to_numpy(dtype = float).ravel()

            if len(train_x) >= 10:
                neigh = KNeighborsRegressor(n_neighbors = 10)
                neigh.fit(train_x, train_y)
                predictions[name].append(float(neigh.predict(test_x)[0]))
            else:
                # Too few training pairs for K=10: use a constant placeholder prediction
                predictions[name].append(0.5)

    # cindex expects the true values first and the predictions second
    return pd.DataFrame({name: [cindex(y_true, preds)] for name, preds in predictions.items()})
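
A quick sanity check of the C-index on hand-made vectors can catch mistakes such as swapped arguments, because the measure is not symmetric: only pairs whose true values differ are counted, while ties are handled on the prediction side. The following is a minimal vectorised sketch under the same definition as `cindex(y, yp)` above; the name `cindex_vectorised` is hypothetical and the input vectors are made up.

```python
import numpy as np

def cindex_vectorised(y_true, y_pred):
    """C-index over all pairs with different true values; prediction ties count 0.5."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    i, j = np.triu_indices(len(y_true), k=1)   # every unordered pair once
    dt = y_true[i] - y_true[j]
    dp = y_pred[i] - y_pred[j]
    comparable = dt != 0
    if not comparable.any():
        return float("nan")                    # undefined when all true values are equal
    concordant = (dt * dp > 0)[comparable].sum()
    tied = (dp == 0)[comparable].sum()
    return float((concordant + 0.5 * tied) / comparable.sum())

perfect = cindex_vectorised([1, 2, 3], [1, 2, 3])     # same ordering
reversed_ = cindex_vectorised([1, 2, 3], [3, 2, 1])   # opposite ordering
constant = cindex_vectorised([1, 2, 3], [5, 5, 5])    # uninformative predictions
truth_first = cindex_vectorised([1, 1, 2], [1, 2, 3])
truth_second = cindex_vectorised([1, 2, 3], [1, 1, 2])
print(perfect, reversed_, constant, truth_first, truth_second)
```

The last two calls use the same two vectors in opposite roles and give different scores, so the true values must always be passed first.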

#### Load datasets

In [14]:
# Read the data files (input.data, output.data, pairs.data).
x = pd.read_csv('./input.data', header = None)
y = pd.read_csv('./output.data', header = None)
pairs = pd.read_csv('./pairs.data', header = None)

x = x.iloc[:, 0].str.split(expand = True).astype(float)
pairs = pairs[0].str.split(expand = True)
pairs[1] = pairs[1].str.replace('"', '', regex = False)

print(x)
print(y)
print(pairs)
print(pairs.value_counts())

           0         1         2         3         4         5         6   \
0    0.759222  0.709585  0.253151  0.421082  0.727780  0.404487  0.709027   
1    0.034584  0.304720  0.688257  0.296396  0.151878  0.830755  0.270656   
2    0.737867  0.236079  0.905987  0.163612  0.801455  0.789823  0.393999   
3    0.406913  0.607740  0.235365  0.888679  0.150347  0.598991  0.130108   
4    0.697707  0.432565  0.650329  0.886065  0.328660  0.576926  0.523100   
..        ...       ...       ...       ...       ...       ...       ...   
395  0.496498  0.454389  0.353502  0.696922  0.876419  0.379429  0.733514   
396  0.188616  0.554824  0.609247  0.371482  0.588356  0.667919  0.297278   
397  0.095054  0.452918  0.942931  0.576332  0.411317  0.561792  0.837251   
398  0.166109  0.471535  0.509825  0.415422  0.620681  0.786712  0.150722   
399  0.817295  0.326707  0.500573  0.022480  0.266418  0.463136  0.720725   

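
The loading code above reads each line as a single string and splits it afterwards, which suggests that the files are whitespace-separated and that the identifiers in `pairs.data` are wrapped in double quotes. Under that assumption, a possible alternative is to let `pd.read_csv` split on whitespace directly. The sketch below uses made-up in-memory text in place of the real files, and the names `x_demo` and `pairs_demo` are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-ins for the contents of input.data and pairs.data.
input_text = "0.1 0.2 0.3\n0.4 0.5 0.6\n"
pairs_text = '"D1" "T2"\n"D1" "T3"\n'

# sep=r"\s+" splits on runs of whitespace; quoted fields are unquoted by the parser.
x_demo = pd.read_csv(io.StringIO(input_text), sep=r"\s+", header=None)
pairs_demo = pd.read_csv(io.StringIO(pairs_text), sep=r"\s+", header=None)

print(x_demo.shape)
print(pairs_demo.iloc[0].tolist())
```

If the real files follow the assumed layout, passing the file paths instead of the `io.StringIO` objects should give numeric feature columns and unquoted identifiers without the extra `str.split` and `str.replace` steps.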

#### Implement and run cross-validation

In [69]:
# Implement and run the requested cross-validation. Report and interpret its results.
c_index = scross_validate(x, y, pairs)
print(c_index)

