# TKO_7092 Evaluation of Machine Learning Methods 2026

## Exercise 3

---

Student name: **Sergei Fedorov**

Student number: **2511405**

Student email: sefedo@utu.fi

---

## IMPORTANT

This exercise involves using AI. Use only the *Study Chat* service provided by the University of Turku at [https://ai.utu.fi/en](https://ai.utu.fi/en). **Do not use any other AI service!** Before starting, remember to carefully read the guidelines for using AI ([https://intranet.utu.fi/en/sites/ai/guidelines/Pages/default.aspx](https://intranet.utu.fi/en/sites/ai/guidelines/Pages/default.aspx)) and the terms of use. **Do not share any personal information or copyrighted material with AI.**

Save all your discussions (including your prompts and AI's output) as well as the name of the model you used.

## Instructions

The deadline of this exercise is **Wednesday 18 February 2026 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about the exercise. Remember to follow all the general exercise guidelines that are stated in Moodle.

The exercise has several parts, all of which concern the letter below. You will take the role of a data scientist who has been assigned to solve the problem described in the letter. You have an AI tool to assist you, but you alone are responsible for the quality of the solution.

#### 1

Ask AI to write code to solve the task. Analyse which parts of the AI-generated code are correct and which are incorrect. Pay particular attention to the key parts of the cross-validation. You may ask AI to improve the code as many times as you like, as long as you keep analysing its output.

#### 2

Implement the required leave-one-out cross-validations and run your code to get the estimates you were asked to obtain. Here it is okay to use any amount of the AI-generated code you produced above. You can use a complete, fully correct AI-written solution, you can write the implementation from scratch by yourself, or you can take some AI-generated code and complete the implementation manually.

#### 3

Write a report in which you discuss the following:

- Why did the cross-validation described in the letter fail? What is the correct way to do cross-validation here?

- Which parts of the task was AI able to code correctly and which not? Focus particularly on the core of the cross-validation.

- Which parts of the AI-generated code did you use in your implementation? Why? Explain why the selected pieces of code work correctly in your implementation.

- What results did you get with your implementation? Report the estimates and interpret the results in terms of how well the model will work in the situations described in the letter. Explain in detail why your cross-validation is the correct way to estimate the generalisation performance.

Write the report in your own words and explain everything clearly, precisely, and comprehensively. **You are not allowed to use AI to write the report for you** because this is where you show that you have understood the theory and are able to apply it. If you use AI as a teacher (i.e. to explain things to you for learning purposes), you must attach the discussions and clearly state what and how you learnt from the AI. **If there is uncertainty about how the text was produced, you may be required to explain the content of your report in a face-to-face meeting.**

#### 4

Submit the following documents to Moodle:

- The discussions with AI (including your prompts and AI's output), as PDF.

- The implementation of your cross-validation, as PDF and as a Jupyter notebook.

- The report, as PDF. (It is okay to integrate the report to the Jupyter notebook.)

Note that it is not enough to just implement the cross-validation correctly to pass this exercise. You must also explain in plain words what you have done and demonstrate that you understand how cross-validation should be performed on pair-input data. Small errors are acceptable, but you will fail this exercise if there are significant error(s) or omission(s) in the report or in the implementation.

## Letter from your client

Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have already made laboratory experiments to measure the affinities between some proteins and drug molecules.

My colleague is working on another set of proteins, and the objectives of his project are similar to mine. He has recently discovered thousands of new potential drug molecules. He asked me if I could find the pairs that have the strongest affinities among his proteins and drug molecules. Obviously I do not have the resources to measure all the possible pairs in my laboratory, so I need to prioritise. I decided to do this with the help of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had already made in the laboratory with my proteins and drug molecules. They comprise of 77 target proteins and 59 drug molecules. To estimate the generalisation performance of the model, I then performed a leave-one-out cross-validation. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of my colleague's proteins and drug molecules. The problem is that when I selected the highest predicted affinities and my colleague tried to verify them in the lab, we found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score. We also tested the model with my proteins against my colleague's drugs (which is another task I would like to use my model for), but the model did not work there either.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get reliable estimates. Also, please implement the leave-one-out cross-validation correctly and report the numbers I need. I want to know whether it would be a waste of my colleague's and my resources if we were to use my model any further.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist


# Mistake in the client's approach

The data in question are pair-input data: each observation (measurement) corresponds to two objects, a drug molecule and a protein. The bioscientist who ordered the work did not take this into account when implementing the cross-validation. As a result, the performance estimate was too optimistic: in every split, the training set contained observations that shared a drug or a protein with the test observation, so information about the test data point leaked into the training set. When the model was applied to the colleague's data, it faced entirely new objects unrelated to the data it had been trained on, and its predictive ability turned out to be much worse than expected.

The correct approach is to exclude from each training set every observation that contains the same drug or the same protein as the test observation (the test observation is the only data point in the test set because we use Leave-_**One**_-Out CV).
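The exclusion rule can be illustrated on a toy example. The following is only a minimal sketch: the pair identifiers are made up, the helper `training_indices` and its `setting` argument are hypothetical names, and the `"new_drug"` variant (for the client's own proteins tested against new drugs) is an assumption about how that second situation from the letter could be handled, not something evaluated in this notebook.

```python
# Toy pair identifiers (hypothetical data, not taken from pairs.data)
pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D2", "T2"), ("D3", "T3")]


def training_indices(pairs, test_idx, setting="new_drug_new_target"):
    """Return the indices allowed in the training fold for one test pair.

    setting="new_drug_new_target": drop every pair sharing the drug OR the target.
    setting="new_drug": drop only pairs sharing the drug (assumed variant for
    known proteins tested against new drugs).
    """
    test_drug, test_target = pairs[test_idx]
    keep = []
    for i, (drug, target) in enumerate(pairs):
        if i == test_idx:
            continue  # the test observation itself is never used for training
        same_drug = drug == test_drug
        same_target = target == test_target
        if setting == "new_drug_new_target" and (same_drug or same_target):
            continue
        if setting == "new_drug" and same_drug:
            continue
        keep.append(i)
    return keep


print(training_indices(pairs, 0))              # [3, 4]
print(training_indices(pairs, 0, "new_drug"))  # [2, 3, 4]
```

For the test pair `("D1", "T1")`, plain leave-one-out would keep all four remaining rows, while the pair-aware rule keeps only the rows that share neither `D1` nor `T1`.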

# Initial AI-generated code

Here is the code that the AI agent (Mistral Large) wrote for the task. This is its second attempt, after a minor correction (the very first version did not work because I had not explained to the agent what the data in the files look like).

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

def calculate_c_index(y_true, y_pred):
    """
    Calculate the Concordance Index (C-index) for regression problems.
    This implementation counts concordant, discordant, and tied pairs.
    """
    n = len(y_true)
    concordant = 0
    discordant = 0
    tied_y = 0
    tied_y_pred = 0

    for i in range(n):
        for j in range(i+1, n):
            if y_true[i] == y_true[j]:
                tied_y += 1
            else:
                if (y_true[i] > y_true[j] and y_pred[i] > y_pred[j]) or (y_true[i] < y_true[j] and y_pred[i] < y_pred[j]):
                    concordant += 1
                elif (y_true[i] > y_true[j] and y_pred[i] < y_pred[j]) or (y_true[i] < y_true[j] and y_pred[i] > y_pred[j]):
                    discordant += 1
                else:
                    tied_y_pred += 1

    total = n * (n - 1) / 2
    if total == 0:
        return 0.5  # default value if no pairs can be compared

    c_index = (concordant + 0.5 * tied_y_pred) / (concordant + discordant + tied_y_pred + tied_y)
    return c_index

# Load the data with correct formats
# Features (space-separated numbers, no header)
X = np.loadtxt('input.data')

# Target values (one number per line, no header)
y = np.loadtxt('output.data')

# Pairs (two strings per line separated by space)
pairs = np.loadtxt('pairs.data', dtype=str)

# Data preparation
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize the KNN regressor with K=10
knn = KNeighborsRegressor(n_neighbors=10)

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Store predictions for C-index calculation
all_predictions = np.zeros(len(y))

# Perform LOOCV
for train_index, test_index in loo.split(X_scaled):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train = y[train_index]

    # Fit the model
    knn.fit(X_train, y_train)

    # Make prediction
    y_pred = knn.predict(X_test)[0]
    all_predictions[test_index] = y_pred

# Calculate C-index
c_index = calculate_c_index(y, all_predictions)
print(f"C-index: {c_index:.4f}")

C-index: 0.8293


In [2]:
# Definition of C-index function from Exercise 2

### Function for calculating C-index ###
# yt: array containing true label values.
# yp: array containing the predicted label values.
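def cindex(yt, yp):
    n = 0        # number of comparable pairs (different true labels)
    h_num = 0    # concordant pairs count 1, tied predictions count 0.5
    for i in range(0, len(yt)):
        t = yt[i]
        p = yp[i]
        for j in range(i+1, len(yt)):
            nt = yt[j]
            n_p = yp[j]   # prediction for observation j
            if (t != nt):
                n = n + 1
                if (p < n_p and t < nt) or (p > n_p and t > nt):
                    h_num += 1
                elif (p == n_p):
                    h_num += 0.5
    return h_num/n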

c_ind = cindex(y, all_predictions)
print(f"C-index in terms of Exercise 2: {c_ind:.4f}")

C-index in terms of Exercise 2: 0.8293


As we can see, the code ignores the pair-input nature of the data entirely, just as the bioscientist did: each cross-validation split yields one test observation and a training set that includes all the other observations without exception.

However, the agent's code yields a C-index of 0.83, whereas the bioscientist reported a score above 0.9. I am not sure what causes this difference.

Another remark is that, as far as I know, the data should not be standardized before they are split into folds. The proper procedure is to fit the scaler on the training data only and then use it to transform both the training and the test sets.
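The difference between the two scaling orders can be seen on a small made-up matrix. This is only a sketch under the assumption that standardization means subtracting the mean and dividing by the population standard deviation, which is what `StandardScaler` does by default; the numbers and index lists are hypothetical.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 training rows and 1 held-out row
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [100.0, 1000.0]])
train_idx, test_idx = [0, 1, 2, 3], [4]

# Statistics come from the training rows only
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)

X_train_scaled = (X[train_idx] - mean) / std
X_test_scaled = (X[test_idx] - mean) / std  # the same statistics are reused

# For comparison: a mean computed on all rows, which lets the test row leak in
leaky_mean = X.mean(axis=0)

print(mean)        # training-only mean: 2.5 and 25
print(leaky_mean)  # mean shifted by the held-out row: 22 and 220
```

With only one observation held out of 400, the shift is much smaller than in this exaggerated example, which is presumably why the scaling order has little effect on the result here.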

The other parts of the code are appropriate. In particular, the C-index is calculated correctly, although the implementation is overly complicated (it has redundant variables).
The calculation differs slightly from the function used in Exercise 2: there, pairs with `y_true[i] == y_true[j]` are not counted at all, whereas the AI's version adds them to the denominator. However, both variants yield the same result on this dataset.
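The effect of the different denominators only shows up when the true labels contain ties. Below is a minimal sketch that computes both variants side by side; the function name `c_index_variants` and the toy labels are hypothetical and serve only as an illustration.

```python
def c_index_variants(y_true, y_pred):
    """Return (c_excluding_label_ties, c_including_label_ties)."""
    score = 0.0      # concordant pairs count 1, tied predictions count 0.5
    comparable = 0   # pairs with different true labels
    label_ties = 0   # pairs with equal true labels
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                label_ties += 1
                continue
            comparable += 1
            if y_pred[i] == y_pred[j]:
                score += 0.5
            elif (y_pred[i] < y_pred[j]) == (y_true[i] < y_true[j]):
                score += 1.0
    return score / comparable, score / (comparable + label_ties)


# Hypothetical labels with one tie (the two 1.0 values)
print(c_index_variants([1.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.3, 0.4]))
# Without label ties both variants agree
print(c_index_variants([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4]))
```

In the first call the ranking is perfect, yet the variant that keeps label ties in the denominator reports 5/6 instead of 1. Identical scores on the real data therefore suggest that ties among the measured affinities are absent or negligible.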

# The AI-generated code revisited

Below is the code that the agent suggested after I pointed out the mistakes.

In [3]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def calculate_c_index(y_true, y_pred):
    """
    Calculate the Concordance Index (C-index) for regression problems.
    """
    n = len(y_true)
    concordant = 0
    discordant = 0
    tied_y = 0
    tied_y_pred = 0

    for i in range(n):
        for j in range(i+1, n):
            if y_true[i] == y_true[j]:
                tied_y += 1
            else:
                if (y_true[i] > y_true[j] and y_pred[i] > y_pred[j]) or (y_true[i] < y_true[j] and y_pred[i] < y_pred[j]):
                    concordant += 1
                elif (y_true[i] > y_true[j] and y_pred[i] < y_pred[j]) or (y_true[i] < y_true[j] and y_pred[i] > y_pred[j]):
                    discordant += 1
                else:
                    tied_y_pred += 1

    total = n * (n - 1) / 2
    if total == 0:
        return 0.5  # default value if no pairs can be compared

    c_index = (concordant + 0.5 * tied_y_pred) / (concordant + discordant + tied_y_pred + tied_y)
    return c_index

# Load the data
X = np.loadtxt('input.data')  # Features
y = np.loadtxt('output.data')  # Target values
pairs = np.loadtxt('pairs.data', dtype=str)  # Drug-target pairs

# Create a dictionary to map drugs and targets to their indices
drug_indices = {}
target_indices = {}

for idx, pair in enumerate(pairs):
    drug, target = pair
    if drug not in drug_indices:
        drug_indices[drug] = []
    if target not in target_indices:
        target_indices[target] = []

    drug_indices[drug].append(idx)
    target_indices[target].append(idx)

# Initialize the KNN regressor with K=10
knn = KNeighborsRegressor(n_neighbors=10)

# Store predictions for C-index calculation
all_predictions = np.zeros(len(y))

# Perform LOOCV with proper exclusion of dependent observations
for test_idx in range(len(y)):
    # Get the current test pair
    test_drug, test_target = pairs[test_idx]

    # Find all indices that share either the drug or target with the test pair
    dependent_indices = set(drug_indices[test_drug] + target_indices[test_target])

    # Create training indices by excluding the test index and all dependent indices
    train_indices = [i for i in range(len(y)) if i not in dependent_indices]

    # Get the training and test data
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_idx].reshape(1, -1)

    # Scale the data (fit on training only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Fit the model
    knn.fit(X_train_scaled, y_train)

    # Make prediction
    y_pred = knn.predict(X_test_scaled)[0]
    all_predictions[test_idx] = y_pred

# Calculate C-index
c_index = calculate_c_index(y, all_predictions)
print(f"C-index: {c_index:.4f}")

C-index: 0.5130


Now the resulting C-index is 0.51, which confirms that the initial estimate was too optimistic.

# Final version of the code

I take the AI-generated code as a starting point and modify it to make the steps clearer (at least to me). I also add comments for the same purpose.

_The code below is independent of all the previous code and can be run separately._

In [4]:
# Importing necessary libraries and components
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneOut

I use the C-index function from Exercise 2. It is simpler and clearer than the AI's version.

In [5]:
# Definition of C-index function from Exercise 2

# yt: array containing true label values.
# yp: array containing the predicted label values.
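def c_index(yt, yp):
    n = 0        # number of comparable pairs (different true labels)
    h_num = 0    # concordant pairs count 1, tied predictions count 0.5
    for i in range(0, len(yt)):
        t = yt[i]
        p = yp[i]
        for j in range(i+1, len(yt)):
            nt = yt[j]
            n_p = yp[j]   # prediction for observation j
            if (t != nt):
                n = n + 1
                if (p < n_p and t < nt) or (p > n_p and t > nt):
                    h_num += 1
                elif (p == n_p):
                    h_num += 0.5
    return h_num/n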

To arrange the data, I use the AI's idea of creating two dictionaries that collect the observation indices for each drug and each target.

In [6]:
# Loading the data to DataFrames

X = pd.read_csv('input.data', sep=' ', header=None, index_col=False)
y = pd.read_csv('output.data', sep=' ', header=None, index_col=False)
y_true = y.to_numpy().ravel()   # 1-D array of true labels for later calculations
pairs = pd.read_csv('pairs.data', sep=' ', names=['drug', 'target'], index_col=False)

# Create two dictionaries, one for drugs and one for targets, so that
# each drug/target is a key and its value is an array of indices
# of those observations that concern this drug/target

drugs_in_observ = {}
targets_in_observ = {}

for idx in range(pairs.shape[0]):
    drug = pairs['drug'].iloc[idx]   # names of the drug and target in each row
    target = pairs['target'].iloc[idx]
    if drug not in drugs_in_observ:  # initialize the index array if not yet
        drugs_in_observ[drug] = []
    if target not in targets_in_observ:
        targets_in_observ[target] = []
    drugs_in_observ[drug].append(idx)  # add the index to the object's array
    targets_in_observ[target].append(idx)

# Check the data
print(f"Input sizes: {X.shape}")
display(X.head())
print(f"\nOutput size: {y.shape}")
display(y.head())
print(f"\nPairs' table sizes: {pairs.shape}")
display(pairs.head())
print("\nDrugs in observations: ")
i = 0
for k in drugs_in_observ.keys():
    print(f"{k}: {drugs_in_observ[k]}")
    i += 1
    if i > 5:
        break
print(f"\nTargets in observations: ")
i = 0
for k in targets_in_observ.keys():
    print(f"{k}: {targets_in_observ[k]}")
    i += 1
    if i > 5:
        break


Input sizes: (400, 67)


|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.759222 | 0.709585 | 0.253151 | 0.421082 | 0.72778 | 0.404487 | 0.709027 | 0.242963 | 0.407292 | 0.379971 | ... | 0.838616 | 0.16505 | 0.515334 | 0.332678 | 0.577533 | 0.678125 | 0.463608 | 0.538938 | 0.460883 | 0.345251 |
| 1 | 0.034584 | 0.30472 | 0.688257 | 0.296396 | 0.151878 | 0.830755 | 0.270656 | 0.705392 | 0.18612 | 0.085594 | ... | 0.472762 | 0.730013 | 0.639373 | 0.445218 | 0.45568 | 0.090737 | 0.308432 | 0.079023 | 0.603089 | 0.197008 |
| 2 | 0.737867 | 0.236079 | 0.905987 | 0.163612 | 0.801455 | 0.789823 | 0.393999 | 0.522067 | 0.411352 | 0.781861 | ... | 0.595468 | 0.582292 | 0.836193 | 0.281514 | 0.79179 | 0.081695 | 0.58345 | 0.422539 | 0.076437 | 0.299662 |
| 3 | 0.406913 | 0.60774 | 0.235365 | 0.888679 | 0.150347 | 0.598991 | 0.130108 | 0.465818 | 0.799953 | 0.906878 | ... | 0.45388 | 0.311799 | 0.534668 | 0.563793 | 0.727767 | 0.172686 | 0.908368 | 0.786892 | 0.790459 | 0.666388 |
| 4 | 0.697707 | 0.432565 | 0.650329 | 0.886065 | 0.32866 | 0.576926 | 0.5231 | 0.080463 | 0.131349 | 0.913496 | ... | 0.583892 | 0.444141 | 0.249423 | 0.11069 | 0.42077 | 0.250148 | 0.19635 | 0.427255 | 0.166715 | 0.91972 |



Output size: (400, 1)


|  | 0 |
| --- | --- |
| 0 | 0.733933 |
| 1 | 0.569419 |
| 2 | 0.832588 |
| 3 | 0.389664 |
| 4 | 0.725953 |



Pairs' table sizes: (400, 2)


|  | drug | target |
| --- | --- | --- |
| 0 | D40 | T2 |
| 1 | D31 | T64 |
| 2 | D6 | T58 |
| 3 | D56 | T49 |
| 4 | D20 | T28 |



Drugs in observations: 
D40: [0, 8, 60, 65, 89, 97, 266, 334, 379]
D31: [1, 26, 64, 218, 252, 279, 343, 384]
D6: [2, 11, 24, 37, 40, 302, 327]
D56: [3, 95, 248, 391]
D20: [4, 137, 170, 180, 196, 319, 374]
D46: [5, 141, 149, 202, 215, 351, 390]

Targets in observations: 
T2: [0, 30, 133, 183]
T64: [1, 36, 108, 110, 215, 388]
T58: [2, 53, 191, 253]
T49: [3, 85, 104]
T28: [4, 48, 70, 97, 234, 254, 281, 323]
T33: [5, 230, 355, 372, 391]


In [7]:
# Define the model
knn = KNeighborsRegressor(n_neighbors=10)

# Initialize the array of the model's predictions for all observations
y_pred = np.zeros(len(y_true))

The LOOCV split that drops the objects involved in the test observation is implemented slightly differently from what the agent proposed, although the logic remains the same. The following approach seems more straightforward: it starts with the regular LOO split and then removes the specified observations from the training fold.

In [8]:
# LOOCV with excluding the test objects from the training fold
loo = LeaveOneOut()

for train_idx, test_idx_arr in loo.split(X):
    test_idx = test_idx_arr.item()    # just one point in the test set
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx_arr]
    y_train = y.iloc[train_idx]       # training labels of the regular LOO split
    drug = pairs['drug'].iloc[test_idx]      # which drug is in the test observ.
    target = pairs['target'].iloc[test_idx]  # which target is there
    # Observations that share the drug or the target with the test observation
    to_remove = np.union1d(drugs_in_observ[drug], targets_in_observ[target])
    to_remove = np.setdiff1d(to_remove, [test_idx])  # exclude the test index

    X_train = X_train.drop(to_remove)     # cleaned-up version of the training set
    y_train = y_train.drop(to_remove).to_numpy().ravel()

    # Standardize the data (the scaler is fitted on the training fold only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Fit the model
    knn.fit(X_train_scaled, y_train)

    # Make prediction
    y_pred[test_idx] = knn.predict(X_test_scaled).item()

print(f"C-index: {c_index(y_true, y_pred):.4f}")


C-index: 0.5130


Here, we exclude from each training fold the observations that share a drug or a protein with the test observation. This avoids data leakage in each CV iteration, since the training fold and the test fold then have no objects in common.
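Whether a fold really has no objects in common with the test observation can be checked explicitly. The following is a minimal sketch of such a sanity check; the helper name `assert_no_shared_objects` and the toy identifiers are hypothetical and are not part of the notebook's code.

```python
def assert_no_shared_objects(pairs, train_idx, test_idx):
    """Raise AssertionError if the training fold shares a drug or target with the test pair."""
    test_drug, test_target = pairs[test_idx]
    for i in train_idx:
        drug, target = pairs[i]
        assert drug != test_drug, f"drug {drug} leaks from row {i}"
        assert target != test_target, f"target {target} leaks from row {i}"


# Hypothetical identifiers for illustration only
pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D2", "T2")]

# Pair-aware fold for test row 0: only row 3 shares neither D1 nor T1
assert_no_shared_objects(pairs, train_idx=[3], test_idx=0)  # passes silently

# Plain LOO fold for test row 0: rows 1 and 2 share an object with it
try:
    assert_no_shared_objects(pairs, train_idx=[1, 2, 3], test_idx=0)
except AssertionError as err:
    print(f"Leak detected: {err}")
```

A check of this kind could be called inside the cross-validation loop with the indices that remain after the removal step, at the cost of some extra run time.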

We get the same C-index of 0.51 as the AI's corrected code. No surprises, fortunately.

This means that the 10NN regressor is only marginally better than random guessing, whose score would be 0.5. Therefore, there is no reason to use it on data in which the drugs and the proteins are new to the model.
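The reference value of 0.5 for random guessing can be checked with a small simulation. This is only a sketch: the labels and the "predictions" are independent random numbers generated with a fixed seed, and the compact `c_index` below is a re-implementation written for this illustration rather than the function used above.

```python
import random


def c_index(y_true, y_pred):
    """C-index over pairs with different true labels; tied predictions count 0.5."""
    comparable = 0
    score = 0.0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            if y_pred[i] == y_pred[j]:
                score += 0.5
            elif (y_pred[i] < y_pred[j]) == (y_true[i] < y_true[j]):
                score += 1.0
    return score / comparable


rng = random.Random(0)                        # fixed seed for repeatability
y_true = [rng.random() for _ in range(400)]   # same sample size as the dataset
y_rand = [rng.random() for _ in range(400)]   # predictions unrelated to the labels

print(round(c_index(y_true, y_rand), 3))      # close to 0.5
print(c_index(y_true, y_true))                # perfect ranking: 1.0
print(c_index(y_true, [-v for v in y_true]))  # fully reversed ranking: 0.0
```

Predictions that carry no information about the labels land near 0.5, so a score of 0.51 is hard to distinguish from guessing.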

# Summary

Here is an outline of the report; more detailed argumentation can be found in the text and code above.

* The task was to evaluate an ML model's performance on pair-input data. The data for training and evaluation comprise 400 observations of pairwise affinities between 77 target proteins and 59 drug molecules. The model was a $K$-Nearest Neighbours regressor with $K = 10$. The evaluation had to be implemented with Leave-One-Out Cross-Validation (LOOCV), and the performance was to be expressed as the C-index.

* The initial approach to estimating the model's performance was faulty because it did not take the pair-input nature of the data into account. In every CV split, the objects of the test observation (its drug or its protein) could also appear, and most probably did appear, in the training data. This caused data leakage and an overestimated performance: when predicting the affinity for a test observation, the model already had information about the objects involved in that observation.

* The initial code was generated by the Mistral Large AI agent (link to the discussion: https://studychat.ai.utu.fi/share/cs8KdQzNF-97pXGajRDOP). The prompt contained the task description only. The agent implemented the same faulty approach with generic LOOCV.
Another, smaller issue in the code was when and how the data were scaled: the scaler was fitted on the whole dataset before splitting, instead of on the training part only after splitting. Most probably, this would not affect the result in this specific task, but it still deviates from the proper procedure.

* That code yielded a C-index of 0.83, which is, again, too optimistic for the reason mentioned above.
After I pointed out the mistakes, the AI proposed a variant that appears to work correctly.

* The final code (see [here](#scrollTo=Final_version_of_the_code)) is a somewhat "rephrased" version of the AI's suggestion. Honestly, the rewrite was not worth the effort, since the new version seems to be no better than the code corrected by the AI.

* Lastly, the estimated C-index of the model's predictions turned out to be about 0.51, which is an extremely low value (random guessing gives a C-index, like a ROC AUC, of 0.50).
The results indicate that the model is very poor at predicting drug-protein affinity for previously unseen drugs and proteins. Hence, I cannot recommend that the colleague use it for the similar problem on their data.