# TKO_7092 Evaluation of Machine Learning Methods 2025

## Exercise 3

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, what is the correct way to perform cross-validation in the given scenario, and why the correct cross-validation method will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have already made laboratory experiments to measure the affinities between some proteins and drug molecules.

My colleague is working on another set of proteins, and the objectives of his project are similar to mine. He has recently discovered thousands of new potential drug molecules. He asked me if I could find the pairs that have the strongest affinities among his proteins and drug molecules. Obviously I do not have the resources to measure all the possible pairs in my laboratory, so I need to prioritise. I decided to do this with the help of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had already made in the laboratory with my proteins and drug molecules. They comprise 77 target proteins and 59 drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of my colleague's proteins and drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether it would be a waste of my resources if I were to use my model any further.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

## Why did the estimation described in the letter fail?

The cross-validation allowed the model to “peek” at proteins and drugs that appear in both the training and test sets. This data leakage led to an overestimation of the model's performance.

The original approach used standard leave‐one‐out cross‐validation (LOOCV) where one pair (i.e. one specific protein–drug combination) was held out as a test sample while the model was trained on the remaining 399 pairs. This procedure failed for two important reasons:

Data Leakage:

In the dataset, each sample (row) represents a pair formed by one protein and one drug. Because many proteins and drugs appear in multiple pairs, leaving one pair out does not remove its constituents: its protein or its drug is almost certainly still present in other training examples. When the model is tested on the held-out pair, it has therefore already “seen” its protein or drug in other pairs. The test sample is not independent of the training data, which is why the performance metric gives an overly optimistic view of how well the model generalizes.

Mismatch:

In practice, the aim was to predict affinities for pairs involving entirely new proteins and drugs (from the colleague's dataset) that had never been measured before. Standard LOOCV on pairs does not simulate this situation, because it does not ensure that the test pairs are composed of molecules absent from the training set. Thus, LOOCV on pairs can show excellent performance without reflecting the difficulty of predicting for unseen molecules, and a model evaluated this way may perform poorly in real-world use.
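The leakage described above can be made concrete by counting, for each held-out pair, how many of the remaining pairs share its drug or its protein. The following is a minimal sketch on hypothetical toy identifiers (not the assignment data); the column order `[drug_id, protein_id]` follows the convention used for `pairs.data` in this notebook, and the function name `count_overlaps` is made up for illustration.

```python
import numpy as np

# Hypothetical toy pairs; each row is [drug_id, protein_id].
pairs = np.array([
    ["D1", "T1"],
    ["D1", "T2"],
    ["D2", "T1"],
    ["D2", "T3"],
    ["D3", "T3"],
])

def count_overlaps(pairs):
    """For each held-out pair, count the other pairs sharing its drug or protein."""
    counts = []
    for i in range(len(pairs)):
        others = np.arange(len(pairs)) != i
        shares_drug = pairs[:, 0] == pairs[i, 0]
        shares_protein = pairs[:, 1] == pairs[i, 1]
        counts.append(int(np.sum(others & (shares_drug | shares_protein))))
    return counts

overlaps = count_overlaps(pairs)
print(overlaps)
```

Any non-zero count means that standard LOOCV trains on at least one pair that is dependent on the test pair. Running the same function on the real `pairs` array would show how widespread this is in the 400 measurements.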

## How should leave-one-out cross-validation be performed in the given scenario and why?

When working with pair‐input data, each fold of cross‐validation must replicate the dependency structure that we will face when applying the model in practice. The key is to exclude from the training set those in‐sample observations that share a pair member with the test observation if that sharing does not occur in the real prediction scenario.

The lecture example distinguished four types of out-of-sample pair-input observations:

Type A:
The out-of-sample observation shares both pair members with the sample. (Standard LOOCV is acceptable here, because leaving out a single pair reproduces exactly this dependency: both of its members still occur in the training set.)

Type B:
The out-of-sample observation has a novel first pair member but shares the second pair member with the sample. (For a proper evaluation, the training set must exclude all observations that share the first pair member with the test observation.)

Type C:
The out-of-sample observation has a novel second pair member but shares the first pair member with the sample. (The training set must exclude all observations that share the second pair member with the test observation.)

Type D:
The out-of-sample observation shares neither pair member with the sample (completely unseen). (To mimic this scenario, the training set must exclude all observations that share either pair member with the test observation.)
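The four exclusion rules can be written as boolean training masks. The sketch below is illustrative only: the helper name `training_mask` is hypothetical, and treating column 0 as the "first" pair member and column 1 as the "second" is an assumption, since the lecture's ordering is not stated here.

```python
import numpy as np

def training_mask(pairs, test_idx, scenario):
    """Boolean mask of the rows allowed in training for one held-out pair.

    pairs: array of shape (n, 2); scenario: "A", "B", "C" or "D".
    Assumption: column 0 is the first pair member, column 1 the second.
    """
    same_first = pairs[:, 0] == pairs[test_idx, 0]
    same_second = pairs[:, 1] == pairs[test_idx, 1]
    mask = np.ones(len(pairs), dtype=bool)
    mask[test_idx] = False          # Type A: drop only the test pair itself
    if scenario in ("B", "D"):
        mask &= ~same_first         # the first member must be unseen
    if scenario in ("C", "D"):
        mask &= ~same_second        # the second member must be unseen
    return mask

# Hypothetical toy identifiers, not the assignment data.
toy_pairs = np.array([["D1", "T1"], ["D1", "T2"], ["D2", "T1"], ["D2", "T2"]])
for scenario in "ABCD":
    print(scenario, training_mask(toy_pairs, 0, scenario))
```

For the held-out pair `["D1", "T1"]`, Type D leaves only `["D2", "T2"]` in training, which shows how much smaller the training set becomes as the scenario gets stricter.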

Because the colleague's proteins and drugs are both entirely new, none of the pairs to be predicted shares a member with the training data, which is the Type D scenario. The evaluation should therefore exclude every training observation that shares either the protein or the drug with the test pair.

Why this way?

By excluding all training pairs that share either component with the test pair, we simulate the real‐world situation where neither the protein nor the drug has been seen before. This eliminates any dependency that might artificially inflate the performance estimate. With the dependent observations removed, the evaluation reflects the true generalization ability of the model on entirely new pairs. 

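A simpler alternative is grouped cross-validation, where each fold leaves out all pairs that contain a specific protein (leave-one-protein-out) or a specific drug (leave-one-drug-out). The implementation below uses these two grouped variants. Each of them guarantees that one pair member is unseen, but the other member may still occur in the training set, so they correspond to Types B and C rather than Type D. In the actual application the model predicts for new proteins and new drug molecules at the same time, so these grouped estimates are probably still optimistic for that task.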
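The Type D procedure described above could be sketched as follows. This is not the implementation used for the results reported in this notebook. The function name `leave_pair_out_type_d` is hypothetical, skipping folds with fewer than K usable training pairs is an assumption of this sketch, and the data in the usage example is synthetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def leave_pair_out_type_d(X, y, pairs, n_neighbors=10):
    """Leave-one-out CV where the test pair's drug and protein are both unseen.

    Returns (y_true, y_pred) for the pairs that could be evaluated.
    """
    y_true, y_pred = [], []
    for i in range(len(y)):
        # Keep only pairs that share neither the drug nor the protein with
        # pair i (this also removes pair i itself).
        train = (pairs[:, 0] != pairs[i, 0]) & (pairs[:, 1] != pairs[i, 1])
        if train.sum() < n_neighbors:
            continue  # assumption: skip folds with too few training pairs
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        model.fit(X[train], y[train])
        y_pred.append(model.predict(X[i:i + 1])[0])
        y_true.append(y[i])
    return np.array(y_true), np.array(y_pred)

# Synthetic stand-in data (NOT the assignment data), only to show the call.
rng = np.random.default_rng(0)
demo_pairs = np.column_stack([
    rng.integers(0, 6, size=60).astype(str),   # hypothetical drug ids
    rng.integers(0, 6, size=60).astype(str),   # hypothetical protein ids
])
demo_X = rng.random((60, 4))
demo_y = rng.random(60)
demo_true, demo_pred = leave_pair_out_type_d(demo_X, demo_y, demo_pairs, n_neighbors=3)
print(len(demo_true), "pairs evaluated")
```

On the real data, the returned arrays could be passed to the `cindex` utility defined in this notebook to obtain a single pooled score, in the same way as the grouped cross-validation pools its predictions.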

#### Import libraries

In [1]:
# Import the libraries you need.

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import somersd
from sklearn.preprocessing import StandardScaler

#### Write utility functions

In [2]:
# Write the utility functions you need in your analysis.

def cindex(true, pred):
    s_d = somersd(true, y=pred, alternative='two-sided')
    c_index = (s_d.statistic + 1.0) / 2.0
    return c_index

def leave_one_group_out_cv(input_features, output_affinities, pairs, group_index, n_neighbors=10):
    """
    Performs leave-one-group-out cross-validation.
    
    Parameters:
    - input_features: numpy array of shape (n_samples, n_features)
    - output_affinities: numpy array of shape (n_samples,)
    - pairs: numpy array of shape (n_samples, 2), where each row is [drug_id, protein_id]
    - group_index: int, 0 for leave-one-drug-out, 1 for leave-one-protein-out
    - n_neighbors: int, number of neighbors for KNN
    
    Returns:
    - c_index: float, concordance index for the CV
    """
    unique_groups = np.unique(pairs[:, group_index])
    all_true = []
    all_pred = []

    for group in unique_groups:
        test_mask = (pairs[:, group_index] == group)
        train_mask = ~test_mask

        if np.sum(test_mask) == 0:
            continue  # Skip if no test samples (shouldn't happen)

        X_train = input_features[train_mask]
        y_train = output_affinities[train_mask]
        X_test = input_features[test_mask]
        y_test = output_affinities[test_mask]

        # Train model
        model = KNeighborsRegressor(n_neighbors=n_neighbors)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        all_true.append(y_test)
        all_pred.append(pred)

    all_true = np.concatenate(all_true)
    all_pred = np.concatenate(all_pred)
    return cindex(all_true, all_pred)


#### Load datasets

In [3]:
# Read the data files (input.data, output.data, pairs.data).

input_features = np.loadtxt("input.data")
output_affinities = np.loadtxt("output.data")
pairs = np.genfromtxt("pairs.data", dtype=str)  # Each row: [drug_id, protein_id]

assert len(input_features) == len(output_affinities) == len(pairs), "Error: the rows don't match"

print("First 5 rows in input.data:")
print(input_features[:5])
print("\nFirst 5 rows in output.data:")
print(output_affinities[:5])
print("\nFirst 5 rows in pairs.data:")
print(pairs[:5])

First 5 rows in input.data:
[[0.759222  0.709585  0.253151  0.421082  0.72778   0.404487  0.709027
  0.242963  0.407292  0.379971  0.412465  0.284844  0.425915  0.747606
  0.222227  0.445811  0.667796  0.684103  0.787706  0.336596  0.824543
  0.672308  0.310471  0.56949   0.797567  0.313177  0.311688  0.452033
  0.624945  0.581985  0.676889  0.813303  0.813624  0.113869  0.191247
  0.457698  0.197378  0.278575  0.72427   0.626547  0.292438  0.484609
  0.605063  0.868699  0.916641  0.938985  0.264985  0.463426  0.754346
  0.128309  0.473564  0.450811  0.0700555 0.851803  0.70897   0.210471
  0.225433  0.838616  0.16505   0.515334  0.332678  0.577533  0.678125
  0.463608  0.538938  0.460883  0.345251 ]
 [0.0345836 0.30472   0.688257  0.296396  0.151878  0.830755  0.270656
  0.705392  0.18612   0.0855935 0.285097  0.43646   0.372679  0.342203
  0.619907  0.402184  0.802182  0.252357  0.102975  0.361315  0.832553
  0.377971  0.520338  0.952467  0.950084  0.274851  0.510368  0.241743
  0.47

#### Implement and run cross-validation

In [4]:
# Implement and run the requested cross-validation. Report and interpret its results.

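# Optional feature standardisation, left disabled. It was unclear whether the inputs needed
# normalising, and enabling it barely changed the results (see the comments below).
# Note: if enabled, the scaler would ideally be fitted on the training folds only.
#scaler = StandardScaler()
#input_features = scaler.fit_transform(input_features)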

protein_based_cindex = leave_one_group_out_cv(input_features, output_affinities, pairs, group_index=1)
print("Leave-one-protein-out C-index: {:.3f}".format(protein_based_cindex)) #if input_features are normalized C-index is 0.829

drug_based_cindex = leave_one_group_out_cv(input_features, output_affinities, pairs, group_index=0)
print("Leave-one-drug-out C-index: {:.3f}".format(drug_based_cindex)) #if input_features are normalized C-index is 0.513

Leave-one-protein-out C-index: 0.830
Leave-one-drug-out C-index: 0.520


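The C-index (concordance index) measures ranking quality: it is the proportion of pairs of observations with different true affinities that the model orders correctly. It ranges from 0 to 1, where 1.0 is a perfect ranking, 0.5 corresponds to random predictions, and values below 0.5 indicate a ranking worse than random. A higher C-index indicates that the model ranks pairs by affinity more reliably.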
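To make the definition concrete, the C-index can be computed by directly counting correctly ordered pairs. The sketch below is illustrative: the function name `cindex_by_counting` is hypothetical, tied predictions are counted as half-correct, and the input is assumed to contain at least two different true values. Under those assumptions it should agree with the `somersd`-based `cindex` utility above, although it is much slower on large inputs.

```python
from itertools import combinations

def cindex_by_counting(true, pred):
    """Share of correctly ordered pairs among pairs with different true values."""
    score, comparable = 0.0, 0
    for (t1, p1), (t2, p2) in combinations(zip(true, pred), 2):
        if t1 == t2:
            continue              # equal true values: the pair is not comparable
        comparable += 1
        if p1 == p2:
            score += 0.5          # tied predictions count as half-correct
        elif (t1 < t2) == (p1 < p2):
            score += 1.0          # predicted order matches the true order
    return score / comparable

# Toy values: five of the six comparable pairs are ordered correctly.
example = cindex_by_counting([1, 2, 3, 4], [0.1, 0.4, 0.3, 0.9])
print(example)
```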

Leave-one-protein-out (C-index = 0.830)

This is a reasonably high score, suggesting that the model generalizes fairly well to new proteins as long as the drugs have already been seen in training. The model can still rank affinities with reasonable accuracy when an entirely new protein is introduced.

Leave-one-drug-out (C-index = 0.520)

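This is only slightly better than random (0.5), meaning the model struggles to predict affinities for entirely new drugs.
This suggests that the model depends heavily on having seen the same drug in training. With a K-nearest neighbours model, the nearest training pairs are likely to be those that involve the same drug, and once they are removed the predictions lose almost all of their ranking ability. The drug features on their own may therefore be less informative than the protein features. Since the colleague's pairs involve new drugs as well as new proteins (Type D), the model's performance there is unlikely to be better than the leave-one-drug-out estimate, which is consistent with the poor laboratory results described in the letter.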

