In [45]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load the training data into a DataFrame
data = pd.read_csv('train.csv')

# Separate the feature columns from the target column
X = data.drop("stroke", axis=1)
y = data["stroke"]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize kNN and use a grid search to find the k-value with the best F1-score
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 15)}
stratified_kfold = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(knn, param_grid, cv=stratified_kfold, scoring='f1')
grid_search.fit(X_train, y_train)

# Print parameters and best score found by GridSearchCV
print("Best k-value:", grid_search.best_params_)
print("Best cross-validation f1-score:", grid_search.best_score_)

# Use the best model to make predictions on the test set
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)

# Show the model performance
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Best k-value: {'n_neighbors': 1}
Best cross-validation f1-score: 0.05698086468104273
Test Accuracy: 0.9703427719821163

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.98      6600
           1       0.03      0.03      0.03       110

    accuracy                           0.97      6710
   macro avg       0.51      0.51      0.51      6710
weighted avg       0.97      0.97      0.97      6710



The k-nearest neighbors (kNN) model makes predictions based on the distances between data points. These distances can be calculated in several ways. Two common choices are the Euclidean distance and the Manhattan distance.


### Euclidean Distance:

$$d(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

The Euclidean distance is the length of the straight line between two data points, here $(x_1, y_1)$ and $(x_2, y_2)$.


### Manhattan Distance:

$$d(x, y) = |x_1 - x_2| + |y_1 - y_2|$$

This is the sum of the absolute differences between the coordinates of two points.
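To make the two formulas concrete, here is a minimal sketch that computes both distances with NumPy. The function names and the two example points are assumptions chosen for illustration; they do not come from the stroke dataset.

```python
import numpy as np


def euclidean_distance(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))


def manhattan_distance(p, q):
    """Sum of the absolute differences between the coordinates."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.abs(p - q)))


# Hypothetical example points
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)

print(euclidean_distance(point_a, point_b))  # sqrt(3^2 + 4^2) = 5.0
print(manhattan_distance(point_a, point_b))  # 3 + 4 = 7.0
```

Both functions also work for points with more than two coordinates, which is the situation when a dataset has many features.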

Once the distances are calculated, the closest points can be determined for each point. To classify a certain point, let's call it point 1, the model looks at the k closest points and determines the mode (the most frequent value) of their categories. That mode becomes the predicted category of point 1. This is repeated for every point that has to be classified.
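The classification step can be sketched in a few lines. This is a simplified illustration, not how scikit-learn implements `KNeighborsClassifier` internally; the function `knn_predict` and the toy data are assumptions made up for this example, and the Euclidean distance is used.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the category of x_new as the most frequent category among its k nearest points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    labels = [y_train[i] for i in nearest]
    # The most frequent category among the neighbors
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [0.5, 0.5], k=3))  # 0
print(knn_predict(X_toy, y_toy, [5.5, 5.5], k=3))  # 1
```

Ties between categories are broken here by whichever category `Counter` encounters first, which is one reason an odd k is often chosen for binary problems.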


### Standardization for kNN
Standardization is of great importance for this model because kNN depends on the distances between data points. Distance metrics such as the Euclidean distance and the Manhattan distance are highly influenced by the scale of each feature. For example, if one feature is on a much larger scale than the others, it will have a greater influence on the model than the other features. Standardization prevents this by putting all the features on the same scale.
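The effect of scale on the distance can be shown with a small sketch. The data below is hypothetical: the first column spans a few units and the second spans thousands. The helper `squared_distance_share` is an assumption introduced for this example; it reports how much of the squared Euclidean distance between two points comes from each feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 spans a few units, column 1 spans thousands
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 4000.0],
])


def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared_diff = (a - b) ** 2
    return squared_diff / squared_diff.sum()


# Before scaling, the large-scale feature accounts for almost all of the distance
share_raw = squared_distance_share(X[0], X[1])

# After scaling, each column has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
share_scaled = squared_distance_share(X_scaled[0], X_scaled[1])

print("Share per feature before scaling:", share_raw)
print("Share per feature after scaling: ", share_scaled)
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling: ", X_scaled.std(axis=0))
```

In this toy case the first feature is practically invisible to the distance before scaling and contributes a meaningful share afterwards.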

### Regularization for kNN
For kNN, the choice of the value of k can be seen as a form of regularization. Choosing a smaller k makes the model more sensitive to noise. Noise is random or unwanted variation in the data that does not help the model classify. With a small value of k, the model reacts strongly to individual points, even though they might be outliers, which can lead to overfitting.
Choosing a higher value for k can help reduce the effect of the noise. On the other hand, if k is too large, the model averages over too many points, which can lead to underfitting.
By selecting the optimal value of k, it is possible to find the right balance between overfitting and underfitting, and in that way to regularize the model.
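The trade-off can be explored with a short experiment. The sketch below uses synthetic data from `make_classification` with some deliberately flipped labels as noise, so it is an assumption-laden illustration and not a result for the stroke dataset. The values of k are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; flip_y randomly reassigns about 10% of the labels to act as noise
X, y = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    flip_y=0.1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for k in (1, 15, 151):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>3}  train accuracy={scores[k][0]:.3f}  test accuracy={scores[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the labels contain noise. A gap between training and test accuracy at small k is a typical sign of overfitting, while a very large k pushes predictions toward the majority class.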