
# Exercise 4 | TKO_7092 Evaluation of Machine Learning Methods
---

Student name: **Sergei Fedorov**<br>
Student number: **2511405**<br>
Student email: sefedo@utu.fi<br>

---

The deadline for returning this exercise is **25.2.2026**.

If you have any questions about this exercise, please contact Riikka Numminen (rimanu@utu.fi) in good time before the deadline.

## Nested cross-validation for feature selection
In this exercise, the task is to use **leave-one-out cross-validation** for model selection to understand the effect of the winner's curse.
This is demonstrated by using **greedy forward selection** and a random binary data set.
The data set is a balanced sample of size 60 (i.e. 30 positives and 30 negatives) with a hundred features. The data are i.i.d., and every feature follows a Bernoulli distribution with $p=0.5$. Thus, there is no signal in the data.
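The files used in this exercise are not reproduced here, but a data set with the same properties could be generated roughly as follows. This is only a sketch: the seed, the 0/1 label encoding, and the names `X_sim` and `y_sim` are assumptions made for illustration, not part of the exercise material.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily (assumption)

n_samples, n_features = 60, 100

# Every feature is an independent Bernoulli(0.5) draw, so it carries no signal
X_sim = rng.integers(0, 2, size=(n_samples, n_features))

# Balanced labels: 30 negatives (0) and 30 positives (1) in random order
y_sim = rng.permutation(np.repeat([0, 1], n_samples // 2))

print(X_sim.shape, y_sim.shape, int(y_sim.sum()))
```

Because the labels are generated independently of the features, any classifier is expected to reach about 50% accuracy on new data drawn in the same way.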

The model to be used is **1-nearest neighbour** with **10 features**, and the greedy forward selection is used to select the best 10 features among all the features.
Leave-one-out cross-validation is used for performance evaluation, and the prediction performance is measured as **accuracy**.

### Greedy forward feature selection
Greedy forward feature selection is an iterative process in which features are selected one by one, which avoids the need to evaluate every possible combination of features. The features are selected as follows:
- First, every feature is tested on its own, and the best one is selected.
- Then the selected feature is tested together with each of the remaining features, and the best set of two features is selected.
- Then that set of two features is tested together with each of the remaining features, and the best set of three features is selected.
- The process continues in the same way until the desired number of features has been selected.
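The steps above can be sketched independently of any particular model. In the minimal example below, `greedy_forward_selection`, `toy_weights`, and `toy_score` are hypothetical names introduced for illustration; in this exercise the scoring function would be the leave-one-out accuracy of a 1-nearest-neighbour classifier rather than a sum of weights.

```python
def greedy_forward_selection(n_features, k, score):
    """Select k feature indices greedily; `score(subset)` returns a number to maximise."""
    selected = []
    best_score = None
    for _ in range(k):
        # Only features that have not been chosen yet are candidates
        candidates = [f for f in range(n_features) if f not in selected]
        # Score every one-feature extension of the current subset and keep the best
        best_score, best_feature = max(
            ((score(selected + [f]), f) for f in candidates),
            key=lambda pair: pair[0],
        )
        selected.append(best_feature)
    return selected, best_score


# Toy scoring function (assumption): the score of a subset is the sum of fixed weights
toy_weights = [0.1, 0.9, 0.5, 0.7]


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


chosen, final_score = greedy_forward_selection(len(toy_weights), 2, toy_score)
print(chosen, final_score)
```

With 100 features and 10 selection steps, this scheme scores 100 + 99 + ... + 91 = 955 candidate subsets, whereas an exhaustive search over all 10-feature subsets would need on the order of $10^{13}$ evaluations.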

### Implement the following tasks to complete this exercise:
1. Use leave-one-out cross-validation to select the best 10 features. Report the optimal set of features and the corresponding accuracy.
2. Use nested leave-one-out cross-validation (leave-one-out on both layers of cross-validation) to obtain an estimate of the prediction accuracy on unseen data, when the final hypotheses are obtained according to the procedure in the first step.
3. Explain the difference in the obtained accuracies.

### Import libraries

In [1]:
# Import the libraries needed
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


### Load the data
The labels are stored in the file *y_generated.csv*, and the features in the file *X_generated.csv*.

In [8]:
# Read the data files. Verify the data dimensions
X = pd.read_csv('X_generated.csv', header = None, index_col = False)
y = pd.read_csv('y_generated.csv', header = None, index_col = False)
print(X.shape)
print(y.shape)

(60, 100)
(60, 1)


The loaded dataset has 60 rows (data points) and 100 columns (features).

### Leave-one-out cross-validation

In [6]:
# Function for the greedy forward feature selection

def greedy_forward(X_input, y_input, selected):
  '''
  Return the index of the next best feature to add to `selected`
  and the accuracy of the 1-NN model on the extended feature subset.
  Each candidate feature is scored by leave-one-out cross-validation
  on the input data.
  '''
  n_features = X_input.shape[1]      # total number of features (100)
  acc = np.full(n_features, -1.0)    # LOOCV accuracy obtained when each feature
                                     # is added to the given subset `selected`;
                                     # -1 marks features that are already selected
                                     # so that argmax can never pick them again
  for f in range(n_features):
    if f not in selected:             # choose one new feature,
      subset = selected + [f]         # add it to the subset,
      X_current = X_input[:, subset]  # and crop the data accordingly
      y_pred = np.zeros(len(y_input))

      loo = LeaveOneOut()             # LOOCV splits for the input data
      for train_idx, test_idx in loo.split(X_current):
        X_train, X_test = X_current[train_idx], X_current[test_idx]
        y_train, y_test = y_input[train_idx], y_input[test_idx]

        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X_train, y_train)
        y_pred[test_idx] = knn.predict(X_test)

      acc[f] = accuracy_score(y_input, y_pred)

  #with np.printoptions(precision=2):
  #  print(acc)

  best = int(np.argmax(acc)) # index of the feature that yielded the best score
  # print(f"Best accuracy when adding {len(selected) + 1} feature (namely {best}): {acc[best]:.2f}")
  return best, acc[best]

In [9]:
features = X.columns  # integer column labels (0..99)
X = X.values          # store the data as numpy arrays
y = y.values.ravel()

In [10]:
# Greedy forward feature selection via regular LOOCV

selected = []       # features (indexes) selected by the method
acc_10 = 0          # resulting accuracy for the selected 10-feature subset
for i in range(10):
  best = greedy_forward(X, y, selected)
  selected.append(best[0])
  if i == 9:
    acc_10 = best[1]

print(f"\nSelected subset of the features: {selected}")
print(f"Accuracy for this subset: {acc_10:.4f}")



Selected subset of the features: [0, 1, 94, 65, 6, 64, 73, 55, 24, 35]
Accuracy for this subset: 0.8333


### Nested leave-one-out cross-validation

In [12]:
# Nested LOOCV for feature selection

resulting_accuracies = []  # accuracy values for the models over outer splits
split_count = 0

cv = LeaveOneOut()
for train_idx, test_idx in cv.split(X):       # outer split
  X_train, X_test = X[train_idx], X[test_idx]
  y_train, y_test = y[train_idx], y[test_idx]

  fold_selected = []    # the current feature subset selected on the outer fold
  for i in range(10):
    best = greedy_forward(X_train, y_train, fold_selected)  # inner CV inside
    fold_selected.append(best[0])

  model = KNeighborsClassifier(n_neighbors=1)
  model.fit(X_train[:, fold_selected], y_train)    # fitting the model with
                               # the selected features on the training data
  y_pred = model.predict(X_test[:, fold_selected]) # predicting on unseen data
  fold_acc = accuracy_score(y_test, y_pred)        # 0 / 1 (on one test point)
  resulting_accuracies.append(fold_acc)

  split_count += 1
  print(f"\nOuter split number: {split_count}")
  print(f"Selected subset of the features: {fold_selected}")
  print(f"Accuracy for this subset: {fold_acc:.4f}")

print(f"\nAverage accuracy over outer splits: {np.mean(resulting_accuracies):.4f}")



Outer split number: 1
Selected subset of the features: [0, 1, 94, 65, 6, 64, 73, 55, 24, 35]
Accuracy for this subset: 1.0000

Outer split number: 2
Selected subset of the features: [0, 1, 94, 65, 6, 64, 73, 55, 9, 3]
Accuracy for this subset: 0.0000

Outer split number: 3
Selected subset of the features: [0, 1, 94, 65, 6, 55, 36, 26, 16, 17]
Accuracy for this subset: 1.0000

Outer split number: 4
Selected subset of the features: [0, 1, 94, 65, 6, 64, 76, 51, 55, 83]
Accuracy for this subset: 1.0000

Outer split number: 5
Selected subset of the features: [0, 1, 57, 33, 18, 72, 65, 70, 4, 61]
Accuracy for this subset: 0.0000

Outer split number: 6
Selected subset of the features: [0, 1, 94, 26, 75, 35, 6, 11, 64, 28]
Accuracy for this subset: 1.0000

Outer split number: 7
Selected subset of the features: [0, 1, 94, 65, 6, 64, 73, 24, 10, 72]
Accuracy for this subset: 1.0000

Outer split number: 8
Selected subset of the features: [0, 1, 94, 65, 6, 64, 73, 24, 31, 16]
Accuracy for this s

_Note:_ In the output above, the accuracy for each outer split is computed on a single test data point, so it is 0 if the prediction was wrong and 1 if it was correct.
The resulting accuracy is the average of these values over all outer splits.
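The nested procedure refits a 1-NN classifier a very large number of times: 60 outer splits × 955 candidate subsets × 59 inner splits, which is roughly 3.4 million fits. As a possible speed-up, the leave-one-out accuracy of 1-NN can be read off a pairwise distance matrix in one pass. The function below is a sketch under assumptions, not the method used in this notebook: `loo_1nn_accuracy` is a hypothetical helper, it uses the Hamming distance (which gives the same neighbour ranking as Euclidean distance on binary features), and it breaks distance ties by taking the lowest index, which may differ from how `KNeighborsClassifier` resolves ties.

```python
import numpy as np


def loo_1nn_accuracy(X_sub, y):
    """Leave-one-out accuracy of 1-NN, computed from pairwise Hamming distances."""
    X_sub = np.asarray(X_sub)
    y = np.asarray(y)
    # dist[i, j] = number of features in which rows i and j differ
    dist = (X_sub[:, None, :] != X_sub[None, :, :]).sum(axis=2).astype(float)
    # A left-out point must not be matched with itself
    np.fill_diagonal(dist, np.inf)
    # Index of the nearest other point for every row (ties: lowest index wins)
    nearest = dist.argmin(axis=1)
    return float(np.mean(y[nearest] == y))


# Small hand-made example (assumption): two well-separated groups of points
X_demo = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0]])
y_demo = np.array([0, 0, 1, 1])
print(loo_1nn_accuracy(X_demo, y_demo))
```

Since random binary data contain many equidistant neighbours, the selected features could differ from those reported above if this helper replaced the inner loop.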

### Analysis of the results

With plain LOOCV, we select the optimal feature subset and estimate the model's performance on the same data. The model has therefore been fitted to the dataset before it is tested. More precisely, out of many candidate models we chose the one that happened to score well on this particular dataset, not because this model (or such models on average) is actually good. The resulting estimate is therefore too optimistic.

In contrast, nested LOOCV keeps the test data point out of the feature selection and thus makes it possible to evaluate models on truly unseen data. The estimate is not built on a single (possibly lucky) model but on multiple models constructed with the same method, so it describes how such models behave in general. As a result, we get a fairer assessment of the feature selection method for the 1-NN classifier.

The resulting accuracy of 0.5 from nested CV reflects the real situation on this random dataset: no model can predict the labels better than chance when the data contain no signal. The initial estimate (0.83) obtained from plain LOOCV is an artefact of the selection procedure and says nothing about performance on unseen data.
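The optimism of the selected score can be illustrated with a small simulation. It is a deliberately simplified model, not a re-run of the experiment: each of 100 candidate models is assumed to guess every one of 60 labels by an independent fair coin flip, so each has a true accuracy of 0.5. The seed and the number of repetitions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily (assumption)

n_samples = 60       # data points, as in the exercise
n_candidates = 100   # competing models, e.g. one per candidate feature
n_repeats = 2000     # independent repetitions of the whole experiment

# Observed accuracy of every candidate: correct coin-flip guesses / n_samples
acc = rng.binomial(n_samples, 0.5, size=(n_repeats, n_candidates)) / n_samples

mean_single = acc[:, 0].mean()      # a candidate fixed before seeing any scores
mean_best = acc.max(axis=1).mean()  # the winner picked after seeing all scores

print(f"fixed candidate: {mean_single:.3f}")
print(f"selected winner: {mean_best:.3f}")
```

The candidate fixed in advance averages close to 0.5, while the winner's observed accuracy is clearly higher although no candidate is better than chance. Greedy selection repeats this kind of choice ten times in a row, which is consistent with the even larger gap seen in the plain LOOCV estimate.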

### AI usage

AI was used to explain to me:

- the key idea of nested CV: what exactly its results tell us (that they describe the method of building the model rather than one particular model),

- some technical details of NumPy, in particular computational efficiency,

- the causes of errors in the code.