In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

from sklearn.preprocessing import StandardScaler

## Fetching a dataset from OpenML

`fetch_openml` can download any dataset available on OpenML, so you can freely use it to try these methods on different datasets.

This example is based on the Spambase dataset to detect spam in emails.

Other datasets to try (suggestions):

- **phoneme**
- **australian**
- **diabetes**
- **wdbc**
- **letter**

The full list of available datasets, along with their descriptions, can be found on the OpenML website: www.openml.org

In [None]:
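# Download Spambase from OpenML (needs network access the first time; later calls use the local cache)
X, y = fetch_openml('spambase', return_X_y=True)
# Hold out 33% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)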

## Preparing Techniques

Note the hyperparameter settings, such as `n_neighbors` (the k value) for the KNeighborsClassifier and `max_depth` for the DecisionTreeClassifier. Changing these settings can significantly change the performance of your algorithm.

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
tree = DecisionTreeClassifier(max_depth=5, criterion='entropy')
bayes = GaussianNB()

classifiers = [knn, tree, bayes]
names = ['3-NN', 'Decision Tree', 'Gaussian NB']

In [None]:
for clf, name in zip(classifiers, names):
    clf.fit(X_train, y_train)
    print(f"Score {name}: {clf.score(X_test, y_test)}")
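
To see how much the hyperparameters noted above can matter, the sketch below sweeps a few values of `n_neighbors` and `max_depth`. It is only an illustration under some assumptions: it uses scikit-learn's bundled breast-cancer data (the same data as the **wdbc** suggestion) so that it runs without a download, the value grids are arbitrary, and the `_demo` variable names are hypothetical and chosen so they do not overwrite the notebook's own `X_train` and `X_test`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled breast-cancer data stands in for an OpenML download.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Sweep the number of neighbours (arbitrary illustrative values).
knn_scores = {}
for k in (1, 3, 5, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr_demo, ytr_demo)
    knn_scores[k] = model.score(Xte_demo, yte_demo)
    print(f"Score {k}-NN: {knn_scores[k]:.3f}")

# Sweep the maximum tree depth (None lets the tree grow until its leaves are pure).
tree_scores = {}
for depth in (1, 3, 5, None):
    model = DecisionTreeClassifier(
        max_depth=depth, criterion='entropy', random_state=0
    ).fit(Xtr_demo, ytr_demo)
    tree_scores[depth] = model.score(Xte_demo, yte_demo)
    print(f"Score tree (max_depth={depth}): {tree_scores[depth]:.3f}")
```

Keep in mind that picking the best setting by looking at test scores makes the reported score optimistic. For a serious experiment, choose hyperparameters on a separate validation split or with cross-validation on the training data.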

## Scaling the data

It is very important to normalize your input data (in particular before using a KNN classifier), since each feature can be on a different scale. Without normalization, the Euclidean distance is dominated by the features with larger ranges (e.g., a feature in [10, 1000] compared to others in [0, 1]).
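
A tiny numeric sketch can make this concrete. The three points and the feature ranges below are made up for illustration; they are not taken from Spambase.

```python
import numpy as np

def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Hypothetical points: feature 0 lies in [0, 1], feature 1 in [10, 1000].
a = np.array([0.1, 200.0])
b = np.array([0.9, 210.0])  # far from a on feature 0, close on feature 1
c = np.array([0.1, 300.0])  # identical to a on feature 0, farther on feature 1

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")

# Min-max rescaling to [0, 1], using the assumed feature ranges above.
lo, hi = np.array([0.0, 10.0]), np.array([1.0, 1000.0])
a_s, b_s, c_s = [(p - lo) / (hi - lo) for p in (a, b, c)]

scaled_ab, scaled_ac = euclidean(a_s, b_s), euclidean(a_s, c_s)
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw values, `b` is the nearer neighbor of `a` because the gap on feature 1 swamps the gap on feature 0. After rescaling, both features contribute comparably and `c` becomes the nearer neighbor, so a KNN classifier could return a different answer for the same point.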

Scaling is a transformation that also "learns from the data" (for instance, StandardScaler learns the mean and standard deviation of each feature, while min-max scaling learns the minimum and maximum values). For a serious experiment, to avoid any sort of bias, it is essential to learn this transformation using the training data only. It can then be applied to the test data. Here is an example:

In [None]:
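scaler = StandardScaler()
scaler.fit(X_train)                  # learn each feature's mean and standard deviation from the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)    # reuse the training statistics on the test data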
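
The same "fit on training data only" rule can also be enforced automatically by putting the scaler and the classifier in one scikit-learn pipeline. The sketch below is one possible way to do it, not part of the original experiment: it again assumes the bundled breast-cancer data as a stand-in that needs no download, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast-cancer data stands in for an OpenML download.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler inside every fold, using that fold's training part only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

cv_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)
print(f"Accuracy per fold: {cv_scores.round(3)}")
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Because the scaler is a step of the estimator, calling `fit` on the pipeline never sees the held-out samples, which removes the risk of accidentally fitting the scaler on the full dataset.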

## Running after scaling

In [None]:
for clf, name in zip(classifiers, names):
    clf.fit(X_train, y_train)
    print(f"Score {name}: {clf.score(X_test, y_test)}")

Notice how much better the KNN classifier performs now!