# Exercise 4 | TKO_7092 Evaluation of Machine Learning Methods
---

Anton Teerioja<br>
2214231<br>
asteer@utu.fi<br>

---

The deadline for returning this exercise is **25.2.2026**.

If you have any questions about this exercise, please contact Riikka Numminen (rimanu@utu.fi) in good time before the deadline.

## Nested cross-validation for feature selection
In this exercise, the task is to use leave-one-out cross-validation for model selection to understand the effect of the winner's curse. 
This is demonstrated by using greedy forward selection and a random binary data set.
The data set is a balanced sample of size 60 (i.e. 30 positives and 30 negatives) with a hundred features. The data are i.i.d., and every feature follows a Bernoulli distribution with $p=0.5$. Thus, there is no signal in the data. 
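To make the setup concrete, a data set with the same properties can be simulated as below. This is only a sketch: the function name `make_null_data` and the seed are assumptions for illustration, and it does not reproduce the actual contents of the provided CSV files.

```python
import numpy as np

def make_null_data(n_per_class=30, n_features=100, seed=0):
    """Simulate balanced binary data in which no feature carries signal."""
    rng = np.random.default_rng(seed)
    # Each entry is 0 or 1 with equal probability, i.e. Bernoulli(p=0.5),
    # drawn independently of the labels.
    X = rng.integers(0, 2, size=(2 * n_per_class, n_features))
    # Balanced labels: n_per_class negatives followed by n_per_class positives.
    y = np.repeat([0, 1], n_per_class)
    return X, y

X_sim, y_sim = make_null_data()
print(X_sim.shape, y_sim.sum())
```

Because the features are generated without looking at the labels, any apparent predictive power found in such data is due to chance.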

The model to be used is 1-nearest neighbour with 10 features, and the greedy forward selection is used to select the best 10 features among all the features.
Leave-one-out cross-validation is used for performance evaluation, and the prediction performance is measured as accuracy. 
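As a minimal sketch of this evaluation step, the snippet below computes the leave-one-out accuracy of a 1-nearest-neighbour classifier with scikit-learn. The small random arrays `X_demo` and `y_demo` are stand-ins invented for the example, not the exercise data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small random stand-in data: 20 samples, 5 binary features, balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 5))
y_demo = np.repeat([0, 1], 10)

# Passing cv=LeaveOneOut() is what makes this leave-one-out;
# without a cv argument, cross_val_score uses 5-fold CV instead.
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=1),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),
    scoring="accuracy",
)

# Each fold holds out one sample, so every fold score is either 0 or 1,
# and the mean is the fraction of held-out samples classified correctly.
loo_accuracy = scores.mean()
print(loo_accuracy)
```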

### Greedy forward feature selection
Greedy forward feature selection is an iterative feature selection process in which the features are selected one by one, avoiding the need to iterate through every possible combination of features. The features are selected as follows:
- First, every feature is tested on its own and the best one is selected.
- Then the selected feature is tested together with each of the remaining features in turn, and the best set of two features is selected.
- Then that set of two features is tested together with each of the remaining features in turn, and the best set of three features is selected.
- The process continues in the same way until the desired number of features has been selected.
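The steps above can be sketched in plain NumPy as follows. The helper names `loo_accuracy_1nn` and `greedy_forward_selection` are hypothetical, and the sketch assumes binary features, Hamming distance, and ties broken by the lowest index, so it may not pick exactly the same features as scikit-learn's `SequentialFeatureSelector`.

```python
import numpy as np

def loo_accuracy_1nn(X, y):
    """Leave-one-out accuracy of 1-NN using Hamming distance."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Pairwise number of differing features between all samples.
    dist = (X[:, None, :] != X[None, :, :]).sum(axis=2).astype(float)
    # A held-out sample must not be its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return float((y[nearest] == y).mean())

def greedy_forward_selection(X, y, n_select):
    """Add one feature at a time, always keeping the best-scoring addition."""
    X = np.asarray(X)
    selected = []
    remaining = list(range(X.shape[1]))
    best_score = None
    for _ in range(n_select):
        # Score the current set extended by each remaining candidate feature.
        scores = [loo_accuracy_1nn(X[:, selected + [j]], y) for j in remaining]
        best_pos = int(np.argmax(scores))
        best_score = scores[best_pos]
        selected.append(remaining.pop(best_pos))
    return selected, best_score

# Usage on small random stand-in data (not the exercise data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 8))
y_demo = np.repeat([0, 1], 10)
features, score = greedy_forward_selection(X_demo, y_demo, n_select=3)
print(features, score)
```

Selecting 10 of 100 features this way evaluates 100 + 99 + ... + 91 = 955 candidate sets, rather than every possible 10-feature subset.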

### Implement the following tasks to complete this exercise:
1. Use leave-one-out cross-validation to select the best 10 features. Report the optimal set of features and the corresponding accuracy.
2. Use nested leave-one-out cross-validation (leave-one-out on both layers of cross-validation) to obtain an estimate of the prediction accuracy on unseen data, when the final hypotheses are obtained according to the procedure in the first step.
3. Explain the difference in the obtained accuracies.

### Import libraries

In [6]:
# In this cell, import all the libraries that you will use in this notebook. For example:
import numpy as np 
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

### Load the data
The labels are saved in the file *y_generated.csv* and the features in the file *X_generated.csv*.

In [7]:
# Read the data files. Verify that the data dimensions are as expected.
X = pd.read_csv('X_generated.csv', header=None)
y = pd.read_csv('y_generated.csv', header=None)

# Check that X and y both have 60 rows and that X has 100 feature columns
print(X.shape, y.shape)

(60, 100) (60, 1)


### Leave-one-out cross-validation

In [None]:
# Write your implementation for the first part of the exercise here.
knn = KNeighborsClassifier(n_neighbors=1, n_jobs=-1)

sfs = SequentialFeatureSelector(knn, n_features_to_select=10, direction='forward', cv=LeaveOneOut(), n_jobs=-1)

sfs.fit(X, y.values.ravel())

# Select the features
selected_features_1 = sfs.get_support(indices=True)
X_selected = X.iloc[:, selected_features_1]

# Evaluate the selected features. Note: no cv argument is passed, so
# cross_val_score uses its default 5-fold CV here, not LOOCV.
accuracy_1 = cross_val_score(knn, X_selected, y.values.ravel(), scoring='accuracy', n_jobs=-1).mean()

print(f"Selected features: {selected_features_1}")
print(f"Accuracy: {accuracy_1}")

Selected features: [ 0  1  6 24 35 55 64 65 73 94]
Accuracy: 0.7666666666666666


### Nested leave-one-out cross-validation

In [None]:

# Write your implementation for the second part of the exercise here.
knn = KNeighborsClassifier(n_neighbors=1, n_jobs=-1)

sfs = SequentialFeatureSelector(knn, n_features_to_select=10, direction='forward', cv=LeaveOneOut(), n_jobs=-1)

sfs.fit(X, y.values.ravel())

# Select the features
selected_features_2 = sfs.get_support(indices=True)
X_selected = X.iloc[:, selected_features_2]

# Evaluate the selected features with LOOCV. Note: the features above were
# selected using all 60 samples, so this evaluation is not nested.
accuracy_2 = cross_val_score(knn, X_selected, y.values.ravel(), cv=LeaveOneOut(), scoring='accuracy', n_jobs=-1).mean()

print(f"Selected features: {selected_features_2}")
print(f"Accuracy: {accuracy_2}")


Selected features: [ 0  1  6 24 35 55 64 65 73 94]
Accuracy: 0.8333333333333334


### Analyse the results

Why are the results as they are? Why is the nested cross-validation needed?

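Both cells selected the same ten features, because both run the same greedy forward selection on the full data set. The accuracies differ only because the selected features were evaluated differently: the first cell calls `cross_val_score` without a `cv` argument, so it uses the default 5-fold cross-validation (accuracy 0.77), whereas the second cell evaluates the same features with LOOCV (accuracy 0.83).

Neither number is a trustworthy estimate of the accuracy on unseen data. The data contain no signal, so a classifier is expected to reach only about 50% accuracy on new samples. The high scores are an instance of the winner's curse: greedy selection compares hundreds of candidate feature sets by their LOOCV accuracy and keeps the best one, so the winning score is inflated by chance. Evaluating the winner again on the same samples that were used to pick it carries that optimism over.

Nested cross-validation is needed because feature selection is part of fitting the model. In a properly nested LOOCV, each sample is held out in turn, the whole greedy selection (with its own inner LOOCV) is run on the remaining 59 samples, and the held-out sample is predicted using only the features chosen without it. The second cell does not yet do this, since `sfs.fit` sees all 60 samples before the evaluation. A truly nested estimate would be expected to lie close to 0.5 on these data.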
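The nested procedure requested in the second task can be sketched as below. This is a hedged illustration, not the notebook's implementation: the helper names are hypothetical, it uses a plain NumPy 1-NN with Hamming distance, and it runs on a small random stand-in data set so that it finishes quickly.

```python
import numpy as np

def loo_acc(X, y):
    """Leave-one-out accuracy of 1-NN with Hamming distance."""
    d = (X[:, None, :] != X[None, :, :]).sum(axis=2).astype(float)
    np.fill_diagonal(d, np.inf)  # exclude each sample from its own neighbours
    return float((y[d.argmin(axis=1)] == y).mean())

def select_features(X, y, k):
    """Greedy forward selection of k features, scored by inner LOOCV."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        scores = [loo_acc(X[:, selected + [j]], y) for j in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

def nested_loo_accuracy(X, y, k):
    """Outer LOOCV in which feature selection is redone for every fold."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        # The selection only ever sees the training rows of this outer fold.
        feats = select_features(X[train], y[train], k)
        # Predict the held-out sample from its nearest training neighbour.
        d = (X[train][:, feats] != X[i, feats]).sum(axis=1)
        hits += int(y[train][d.argmin()] == y[i])
    return hits / n

# Small random stand-in data with no signal (not the exercise data).
rng = np.random.default_rng(0)
X_null = rng.integers(0, 2, size=(20, 15))
y_null = np.repeat([0, 1], 10)

feats_all = select_features(X_null, y_null, k=3)
naive = loo_acc(X_null[:, feats_all], y_null)       # selection saw every sample
nested = nested_loo_accuracy(X_null, y_null, k=3)   # selection never saw the test sample
print(f"naive: {naive:.2f}, nested: {nested:.2f}")
```

The naive score reuses the samples that drove the selection, while the nested score evaluates the whole procedure, selection included, on samples it has not seen. On the full 60 x 100 data with 10 features the same loop performs far more model evaluations and takes correspondingly longer.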

### AI usage

In [11]:
# In case AI was used when solving the exercise, please explain how and in which parts it was used.