# Exercise 2 | TKO_7092 Evaluation of Machine Learning Methods 2025
## deadline: 12.2.2025 - 23:59

Regarding any questions about this exercise, please contact course assistant Jonne Pohjankukka (jjepoh@utu.fi)

********************************************

Student name: Emil Hellberg

Student number: 1901299

Student email: ephell@utu.fi

********************************************

## Water permeability prediction in forestry <br>

In this task, the client wants you to estimate the spatial prediction performance of a K-nearest neighbor regression model with K=7 (7NN), using spatial leave-one-out cross-validation (i.e. SKCV, with number of folds == number of data points). The client wants you to use the C-index as the performance measure.  

In other words, the client wants you to answer the question: "What happens to the prediction performance of water permeability using a 7-nearest neighbor regression model when the geographical distance between known data and unknown data increases?".

In this task, you have three data files available (with 1691 data points): 

- input.csv, contains the 75 predictor features. 
- output.csv, contains the water permeability values. 
- coordinates.csv, contains the corresponding geographical coordinate locations of the data points. The unit of the coordinates is metre, and you can use Euclidean distance to calculate distances between the coordinate points. 

Implement the following tasks to complete this exercise:

********************************************

#### 1. Z-score standardize the predictor features (input.csv). 

#### 2. Perform spatial leave-one-out cross-validation with the 7NN model for the provided data set (refer to the lectures 3.1.3 and 3.1.4 in 'Evaluating spatial models with spatial cross-validation' for help). Estimate the water permeability prediction performance (using the 7NN model and C-index) with the following distance parameter values: d = 0, 20, 40, ..., 300 (that is, 20-metre intervals from 0 m to 300 m). 

#### 3. When you have calculated the C-index performance measure for each value of d, visualize the results with the C-index (y-axis) as a function of d (x-axis).

********************************************

Your .ipynb-file must include the following: 

- Your own implementation of the spatial leave-one-out cross-validation for the current task. You can use third-party libraries (e.g. Scikit-learn) for parts such as the 7-nearest neighbor model. Also, try to follow good programming practices and add comments to relevant parts of your code explaining what you are doing and why.


- Plot of the graph C-index vs. distance parameter value. 


<br><br><br>
-- START IMPLEMENTING YOUR EXERCISE AFTER THIS LINE --
<br><br><br>

### Import necessary libraries

In [1]:
# In this cell, import all the libraries that you need. For example: 
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

### Read in the datasets

In [2]:
# In this cell, read the files input.csv, output.csv and coordinates.csv.
# Print out the dataset dimensions (i.e. number of rows and columns).

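# Note: pd.read_csv treats the first row as a header by default. The shapes
# printed below show 1690 rows although the task states 1691 data points, so if
# the files have no header row, pass header=None to keep the first data point.
x = pd.read_csv('./input.csv')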
y = pd.read_csv('./output.csv')
coordinates = pd.read_csv('./coordinates.csv')

print('x:', x.shape)
print('y:', y.shape)
print('coordinates', coordinates.shape)

x: (1690, 75)
y: (1690, 1)
coordinates (1690, 2)


### Standardization of the predictor features (input.csv)

In [3]:
# Standardize the predictor features (input.csv) by removing the mean and scaling to unit variance. 
# In other words, z-score the predictor features. You are allowed to use third-party libraries for doing this.

scaler = StandardScaler()
standardized_x = scaler.fit_transform(x)
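
As a sanity check on the standardization step, the sketch below z-scores a small made-up matrix by hand with NumPy. The helper `zscore` and the array `demo` are hypothetical names used only for illustration, not part of the exercise data. To my knowledge `StandardScaler` with default settings also divides by the population standard deviation (`ddof=0`), so the two approaches should agree.

```python
import numpy as np

def zscore(features):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0, ddof=0)
    # A constant column has std 0; divide by 1 instead to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (features - mean) / std

# Hypothetical toy matrix: 4 samples, 3 features (the last one is constant).
demo = np.array([[1.0, 10.0, 5.0],
                 [2.0, 20.0, 5.0],
                 [3.0, 30.0, 5.0],
                 [4.0, 40.0, 5.0]])

z = zscore(demo)
print(z.mean(axis=0))  # column means after scaling
print(z.std(axis=0))   # column standard deviations after scaling
```

After scaling, every non-constant column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that the 7NN model relies on.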

### Functions and analysis code

In [4]:
# Include here all the functions and other relevant code that you need in order to implement the task.

# Note! Utilize the following two functions in your implementation:

### Function for calculating C-index ###
# y: array containing true label values.
# yp: array containing the predicted label values.
def cindex(y, yp):
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            pn = yp[j]  # named pn (not np) to avoid shadowing the NumPy alias
            if (t != nt): 
                n = n + 1
                if (p < pn and t < nt) or (p > pn and t > nt): 
                    h_num += 1
                elif (p == pn):
                    h_num += 0.5
    return h_num/n


### Function for calculating the pairwise spatial distances between the data points ###
# The function will return a n-by-n matrix of Euclidean distances. For example, the
# distance_matrix element at indices i,j will contain the spatial distance between 
# data point i and j. Note that the element value is 0 always when i==j.
# coordinate_array: n-by-2 array containing the coordinates of the exercise data points.
def cdists(coordinate_array):
    number_of_observations = coordinate_array.shape[0]
    distance_matrix = np.zeros((number_of_observations, number_of_observations))
    for i in range(0, number_of_observations):
        distance_matrix[i, :] = np.sqrt(np.sum((coordinate_array - coordinate_array[i])**2, axis=1))
    return distance_matrix
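
The double loop in `cindex` visits every pair of data points in pure Python, which gets slow when it is called repeatedly. The sketch below is one possible vectorized variant that follows the same pair-counting rules (pairs with equal true values are skipped, tied predictions count 0.5). The name `cindex_vectorized` is hypothetical, and the function is meant as a cross-check rather than a replacement for the provided one.

```python
import numpy as np

def cindex_vectorized(y_true, y_pred):
    """C-index over all pairs whose true values differ; tied predictions count 0.5."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # Index arrays for all unordered pairs (i < j).
    i, j = np.triu_indices(len(y_true), k=1)
    dt = y_true[i] - y_true[j]
    dp = y_pred[i] - y_pred[j]
    # Only pairs with different true values are comparable.
    comparable = dt != 0
    if not comparable.any():
        raise ValueError("C-index is undefined when all true values are equal")
    # Concordant: prediction difference has the same sign as the true difference.
    concordant = np.sum(comparable & (np.sign(dt) == np.sign(dp)))
    # Tied predictions on comparable pairs count as half.
    tied = np.sum(comparable & (dp == 0))
    return (concordant + 0.5 * tied) / comparable.sum()

# Three pairs: (1,2) and (1,3) are ordered correctly, (2,3) is not.
print(cindex_vectorized([1, 2, 3], [1, 3, 2]))
# One tied prediction (0.5) plus two concordant pairs out of three.
print(cindex_vectorized([1, 2, 3], [1, 1, 2]))
```

For the 1690 points used here there are about 1.4 million pairs, which fits comfortably in memory as index arrays.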

### Results for spatial leave-one-out cross-validation with 7-nearest neighbor regression model

In [5]:
# In this cell, run your script for the Spatial leave-One-Out cross-validation 
# with 7-nearest neighbor regression model and visualize the results as 
# requested in the task assignment.

# Pairwise geographical distances between all data points (n-by-n matrix).
distances = cdists(coordinates.to_numpy())

# Plain NumPy arrays make row selection with boolean masks straightforward.
x_all = np.asarray(standardized_x)
y_all = y.to_numpy().ravel()

rows = []  # one {'d', 'c-index'} record per distance value

# Distance parameter values d = 0, 20, ..., 300 metres.
for d in range(0, 301, 20):
    y_true = []
    y_pred = []

    # Spatial leave-one-out: every data point is the test point exactly once.
    for n in range(len(x_all)):
        # Keep only training points at least d metres from the test point.
        # Note: with d = 0 nothing is removed, so the test point itself stays in
        # the training set; a strict comparison (distances[n] > d) would exclude it.
        keep = distances[n] >= d

        model = KNeighborsRegressor(n_neighbors=7)
        model.fit(x_all[keep], y_all[keep])

        prediction = model.predict(x_all[n].reshape(1, -1))[0]
        y_true.append(y_all[n])
        y_pred.append(prediction)

    c_index = cindex(y_true, y_pred)
    print(c_index)
    rows.append({'d': d, 'c-index': c_index})

# Build the results table once, after all distance values have been evaluated.
results = pd.DataFrame(rows)
print(results)

0.7667673696957465
0.7074146021494525
0.7030574616950065
0.69518657741449
0.686781693773091
0.6819616837938902
0.6155008381407292
0.5994945226522248
0.5949225911162689
     d   c-index
0    0  0.766767
1   20  0.707415
2   40  0.703057
3   60  0.695187
4   80  0.686782
5  100  0.681962
6  120  0.615501
7  140  0.599495
8  160  0.594923
...
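
The dead-zone idea behind spatial leave-one-out can also be sketched on a tiny made-up data set where the predictions can be checked by hand. Two things in this sketch are assumptions rather than the method used in the cell above: it removes training points at distance `<= d` and always removes the test point itself, and it replaces `KNeighborsRegressor` with a plain NumPy average of the k nearest neighbours in feature space. The function name `spatial_loo_predictions` and the toy arrays are hypothetical.

```python
import numpy as np

def spatial_loo_predictions(X, y, coords, d, k=7):
    """Leave-one-out predictions with a dead zone of radius d around each test point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    coords = np.asarray(coords, dtype=float)
    preds = np.empty(len(y))
    for i in range(len(y)):
        # Geographical distance from test point i to every data point.
        geo = np.sqrt(np.sum((coords - coords[i]) ** 2, axis=1))
        # Assumption: drop points within distance <= d, and always the test point.
        keep = geo > d
        keep[i] = False
        if keep.sum() < k:
            raise ValueError(f"fewer than {k} training points left for d={d}")
        # Euclidean distance in feature space to the remaining training points.
        feat = np.sqrt(np.sum((X[keep] - X[i]) ** 2, axis=1))
        # Indices of the k smallest feature distances (order among them is irrelevant).
        nearest = np.argpartition(feat, k - 1)[:k]
        # Uniformly weighted mean of the k nearest targets.
        preds[i] = y[keep][nearest].mean()
    return preds

# Toy data: 10 points on a line, 10 m apart; feature and target equal the position.
line = np.arange(0, 100, 10, dtype=float)
coords = np.column_stack([line, np.zeros_like(line)])
X = line.reshape(-1, 1)
y = line.copy()

preds_d0 = spatial_loo_predictions(X, y, coords, d=0, k=2)
preds_d10 = spatial_loo_predictions(X, y, coords, d=10, k=2)
print(preds_d0[0], preds_d10[0])  # 15.0 25.0
```

For the first point, the two nearest remaining neighbours are at 10 m and 20 m when d = 0, and at 20 m and 30 m when d = 10, so the prediction moves further from the true value 0 as the dead zone grows.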

In [8]:
# C-index values copied from the spatial leave-one-out run, so the plot can be
# redrawn without repeating the long cross-validation.
d = [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300]
c_index = [0.766767, 0.707415, 0.703057, 0.695187, 0.686782, 0.681962, 0.615501, 0.599495, 0.594923, 0.593153, 0.589681, 0.586355, 0.584087, 0.584437, 0.584774, 0.584025]

result_df = pd.DataFrame({'d': d, 'C-index': c_index})

# C-index (y-axis) as a function of the distance parameter d (x-axis).
plt.scatter(result_df['d'], result_df['C-index'], color='black')

plt.xlabel('Distance parameter d (m)')
plt.ylabel('C-index')
plt.title('7NN C-index as a function of the distance parameter d')

plt.show()


## Analysis of the results

### In this cell, you need to answer the client's questions:


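1. What happens to the 7NN performance as the prediction distance increases?

   The C-index decreases as d grows. It falls from 0.767 at d = 0 to 0.707 at d = 20, declines slowly to 0.682 at d = 100, drops sharply to 0.616 at d = 120, and then levels off at roughly 0.58 to 0.60 between d = 160 and d = 300.


2. Do you think the results behave as was somewhat expected? Do they make sense, why?

   Yes, the trend is in line with what spatial autocorrelation would suggest. Nearby locations tend to have similar water permeability and similar predictor values, so removing the training points closest to the test point leaves the model with less informative neighbours. Once all nearby points are excluded, the remaining performance reflects only what the features carry over longer distances, which would explain the plateau.


3. If we require that the 7NN must have at least C-index performance of 0.68, then up to what distance should we trust the 7NN predictions, based on the results?

   Up to d = 100 m, where the C-index is 0.682. At d = 120 m the C-index is 0.616, which is below the requirement.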
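
The third question can be checked directly from the reported numbers. The sketch below walks through the distance values in increasing order and stops at the first one whose C-index falls below the required level. The helper name `max_trusted_distance` is hypothetical, and the values are the ones listed in the plotting cell.

```python
# Distance values (metres) and C-index values as listed in the plotting cell.
d_values = [0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300]
c_values = [0.766767, 0.707415, 0.703057, 0.695187, 0.686782, 0.681962,
            0.615501, 0.599495, 0.594923, 0.593153, 0.589681, 0.586355,
            0.584087, 0.584437, 0.584774, 0.584025]

def max_trusted_distance(d_values, c_values, threshold):
    """Largest d for which the C-index at d and at all smaller d meets the threshold.

    Returns None if even the smallest distance fails the requirement.
    """
    trusted = None
    for d, c in zip(d_values, c_values):
        if c < threshold:
            break  # stop at the first distance that violates the requirement
        trusted = d
    return trusted

print(max_trusted_distance(d_values, c_values, threshold=0.68))  # prints 100
```

Stopping at the first violation, rather than taking the largest d anywhere above the threshold, guards against a non-monotone curve that dips below the requirement and later recovers.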