# 3.4 - How to choose the best model?

## Selecting the best hyperparameters
How do we decide on the best hyperparameters while ensuring that the model generalizes to new data?


### Train/Test Split
A first naive approach would be to train the model on the whole dataset, compute a performance measure on that same dataset (e.g. the MSE), and then choose the model's hyperparameters so that performance is best (for the MSE, as low as possible). 
However, the hyperparameters are then likely to be tuned to that specific dataset, leading to *overfitting*. If the model is applied to another dataset, there is no guarantee it will perform well.

An improvement is to split the dataset into two parts: one used for training, the other for testing the model's performance. But overfitting can still occur here, because we would choose the hyperparameters based on performance on that one test set, so the test score is no longer an unbiased estimate of performance on new data.

To reduce overfitting and *ensure better generalizability* of the model, we can split the dataset into three disjoint parts: 
- training part: a part for training the model (e.g., estimating the coefficients for regression),
- validation part: for evaluating the model's performance and tuning hyperparameters,
- test part: after deciding on the hyperparameters, used to estimate the final accuracy of the model.

![Alt text](images/validation.png)
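
The three-way split above can be sketched as follows. This is only a minimal illustration under assumptions that are not in the text: the data are synthetic, the 60/20/20 proportions are arbitrary, and the candidate values for `n_neighbors` are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# Hold out 20% as the test part, then split the remainder 75/25
# into training and validation parts (60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Tune the hyperparameter using the validation part only
candidates = [1, 3, 5, 10, 20]  # assumed candidate values
val_mse = {}
for n_neighbors in candidates:
    model = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    val_mse[n_neighbors] = mean_squared_error(y_val, model.predict(X_val))

best_n_neighbors = min(val_mse, key=val_mse.get)

# The test part is used once, after the hyperparameter has been chosen
final_model = KNeighborsRegressor(n_neighbors=best_n_neighbors).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("best n_neighbors:", best_n_neighbors)
print("test MSE:", test_mse)
```

The test part is never consulted while choosing `n_neighbors`, which is what keeps the final estimate honest.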


These parts are usually chosen randomly from the dataset. This can lead to biased results if, by bad luck, one of them does not represent the full dataset. Further, it is, in a way, a shame to set the validation part aside, because it could otherwise be used in training to provide a more accurate model. A way to address both issues is k-fold cross-validation.

### K-fold Cross-validation

Now we first split the dataset into two disjoint parts:
- training+validation
- test

Then, we subdivide the training+validation part into $k$ equally sized chunks (called folds). Of these $k$ folds, $k-1$ are used for training and 1 is used for validation (playing the role of the validation part in the three-way split above). The idea is to repeat this procedure $k$ times, each time using a different fold as the validation set. The final performance estimate is then the average over the scores of the $k$ iterations. 

![Alt text](images/validation-folds.png)

See https://scikit-learn.org/stable/modules/cross_validation.html for examples!
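
As a hedged sketch of the procedure described above, the snippet below uses scikit-learn's `KFold` and `cross_val_score` to pick `n_neighbors` for a `KNeighborsRegressor`. The synthetic data, the choice of 5 folds, and the candidate values are assumptions made for illustration, not part of the original material.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# Step 1: set the test part aside
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: 5-fold cross-validation on the training+validation part
cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = [1, 3, 5, 10, 20]  # assumed candidate values
cv_mse = {}
for n_neighbors in candidates:
    # scikit-learn scorers follow "higher is better", so the MSE is negated
    scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=n_neighbors),
        X_trainval,
        y_trainval,
        cv=cv,
        scoring="neg_mean_squared_error",
    )
    cv_mse[n_neighbors] = -scores.mean()  # average MSE over the 5 folds

best_n_neighbors = min(cv_mse, key=cv_mse.get)

# Step 3: refit on all training+validation data, then evaluate once on the test part
final_model = KNeighborsRegressor(n_neighbors=best_n_neighbors)
final_model.fit(X_trainval, y_trainval)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("mean CV MSE per candidate:", cv_mse)
print("best n_neighbors:", best_n_neighbors)
print("test MSE:", test_mse)
```

Note that the `k` in k-fold (here `n_splits=5`) is unrelated to the number of neighbours in the k-nearest-neighbours model; the variable is named `n_neighbors` to keep the two apart.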


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Toy dataset (replace this with your actual data)
X = np.array([[1, 2], [3, 5], [5, 4], [8, 2]])  # Features
y = np.array([3, 8, 6, 10])  # Target values

# Hold out part of the data for testing.
# With only 4 samples, test_size=0.2 leaves 3 for training and 1 for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k-nearest-neighbours regressor; n_neighbors is the hyperparameter to tune.
# Here n_neighbors equals the training set size, so every prediction is
# simply the mean of the three training targets.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict the target for new, unseen points
new_points = np.array([[2, 3], [6, 3]])
predictions = knn.predict(new_points)
print(predictions)

In [None]:
# Example with cross-validation