## Evaluating Model Performance

In the previous lesson, we implemented the K-Nearest Neighbors algorithm and trained a classifier to predict if a bank client would subscribe to the bank's product.

In this lesson, we'll build and train the classifier using scikit-learn and try to improve upon our previous model's performance.

We'll use the [same dataset](../data/subscription_prediction.csv) as last time.

Let's start by loading the dataset.

In [19]:
# Import the libraries
import pandas as pd


In [20]:
# Load the dataset
banking_df = pd.read_csv("../data/subscription_prediction.csv")

# Display the first 5 rows of the dataframe
banking_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [21]:
# Convert the target column 'y': "yes" becomes 1 and "no" becomes 0
banking_df["y"] = banking_df["y"].apply(lambda x: 1 if x == "yes" else 0)

# Display the first 5 rows of the dataframe
banking_df.head()


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


## 2. Validation Set

We learned in the previous lesson that our data set doesn't contain any missing values, so we don't need to wrangle the data any further. We can move on to preparing it for training by splitting it into training and test sets. The test set tells us how well the model performs on unseen data. However, if we repeatedly evaluate the model on the test set and re-train it, we introduce bias.

Our model will start to indirectly learn from our test set. In this situation, we won't be able to effectively judge how well our model performs on data it hasn't seen before.

That's why we need a buffer between our training and test sets. We want to evaluate our model and improve upon it without having to use the test set. We'll create a validation set, sometimes referred to as a development set or dev set. We can then train our model and evaluate it on the validation set. Depending on its performance on the validation set, we can re-train it with some tweaks and evaluate it again.

Once we're satisfied with the model's performance on the validation set, we can evaluate it one last time on the test set.

We'll split the data set into three parts:

1. Training Set (60%)
2. Validation Set (20%)
3. Test Set (20%)

We'll use scikit-learn's train_test_split() to split the data set twice.

In [22]:
# Import the train_test_split function for splitting the data set
from sklearn.model_selection import train_test_split

# Split the data set into features and target
# features are all columns except the target column ['y']
X = banking_df.drop(["y"], axis=1)

# target is the 'y' column
y = banking_df["y"]

# Split the data set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state = 417)

# Split the remaining 80% into training and test sets.
# test_size works out to 0.25 here, so the test set is 20% of the full data set.
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20*X.shape[0]/X_train.shape[0], random_state = 417)
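As a quick sanity check on the arithmetic, the sketch below applies the same two-stage split to a small synthetic frame and prints each split's share of the data. The 1,000-row `demo_df` and all variable names here are made up for illustration and are separate from the banking variables above. The second call uses test_size = 0.20 * N / (0.80 * N) = 0.25, so a quarter of the remaining 80% becomes the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the banking data: 1,000 rows, one feature, one target
demo_df = pd.DataFrame({"feature": np.arange(1000), "y": np.arange(1000) % 2})
X_demo = demo_df.drop(["y"], axis=1)
y_demo = demo_df["y"]

# First split: hold out 20% of all rows as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.20, random_state=417)

# Second split: 0.20 * 1000 / 800 = 0.25 of the remaining rows become the test set
second_test_size = 0.20 * X_demo.shape[0] / X_tr.shape[0]
X_tr, X_te, y_tr, y_te = train_test_split(X_tr, y_tr, test_size=second_test_size, random_state=417)

# Report the size of each split as a share of the full data set
split_sizes = {"train": len(X_tr), "validation": len(X_va), "test": len(X_te)}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / len(demo_df):.0%})")
```

On this synthetic frame the three splits come out at 60%, 20%, and 20% of the rows.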

## 3. Building and training a KNN

Now that we have our training set, we can build a classifier and fit the model to the data.

We learned in the first lesson that fitting a model is the same as training the model. The model learns from the data. However, when we implemented our model from scratch in the previous lesson, we learned that k-NNs don't really have a training phase. 

So how would we use scikit-learn to "fit" our model?

When we implemented a k-NN from scratch, we calculated the distance between observations. Across a large number of features and observations, this can be a computationally expensive task. Instead of a brute force approach, we can use different data structures to help speed that up.

scikit-learn uses the training phase to set up such a data structure. What fit() actually does differs from algorithm to algorithm. This again shows the risk of experimenting with an algorithm without understanding its inner workings. We wouldn't have learned about this distinction if we hadn't implemented a k-NN from scratch!
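To make this concrete, KNeighborsClassifier accepts an `algorithm` parameter that selects the neighbor search strategy: "brute" compares against every training point, while "kd_tree" and "ball_tree" build a tree structure during fit(). The sketch below uses synthetic data and hypothetical variable names to fit one model per strategy. Only the search method changes, so the predictions should agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
X_fit, y_fit = X_demo[:400], y_demo[:400]   # points the model is fitted on
X_query = X_demo[400:]                      # points we ask predictions for

predictions = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    # fit() builds whatever structure the chosen search strategy needs
    knn_demo = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    knn_demo.fit(X_fit, y_fit)
    predictions[algorithm] = knn_demo.predict(X_query)

# Compare the tree-based strategies against the brute-force search
for algorithm in ["kd_tree", "ball_tree"]:
    same = np.array_equal(predictions["brute"], predictions[algorithm])
    print(f"{algorithm} matches brute force: {same}")
```

Leaving `algorithm` at its default lets scikit-learn pick a strategy based on the data passed to fit().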

Let's build our model!

In [23]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)

ValueError: could not convert string to float: 'admin.'

## 4. Feature Engineering

Calling fit() on our training data produced an error on the previous screen.

This is because we have categorical columns in our dataset that haven't yet been converted into dummy variables. Since a k-NN uses a distance metric, it can't work with string data.

We will next:

1. Convert all our categorical columns into dummy variables.
2. Normalize the features by scaling their values to the range [0, 1].
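As a small illustration of the first step, the sketch below runs pd.get_dummies() on a hypothetical three-row frame (`toy_df` is made up, and only borrows column names from the banking data). With drop_first = True, the first category of each column in sorted order is dropped, because it is implied when all the other dummy columns are 0.

```python
import pandas as pd

# Hypothetical mini-frame using two of the categorical columns from the banking data
toy_df = pd.DataFrame({
    "age": [40, 56, 39],
    "marital": ["married", "divorced", "single"],
    "default": ["no", "unknown", "no"],
})

# drop_first=True drops the first category of each column in sorted order,
# here "divorced" for marital and "no" for default
toy_dummies = pd.get_dummies(data=toy_df, columns=["marital", "default"], drop_first=True)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

The untouched `age` column stays in place, and each remaining category gets its own 0/1 (or True/False, depending on the pandas version) column.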

In the previous lesson, we implemented the normalization from scratch. Now we'll use scikit-learn's MinMaxScaler class.

From a functional perspective, it works similarly to the way we'd instantiate a model in scikit-learn and then call fit() on the data.

However, fitting to the data is just the first step. sklearn.preprocessing.MinMaxScaler.fit() calculates the minimum and maximum values for each feature we input. We then need to transform those features, using sklearn.preprocessing.MinMaxScaler.transform() to normalize those features.

scikit-learn provides us with a single method, sklearn.preprocessing.MinMaxScaler.fit_transform(), that allows us to carry out both operations.
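The relationship between the three methods can be checked on a toy array. In this sketch, `toy_features` and the other names are hypothetical. It compares the two-step fit() then transform() route, the one-step fit_transform() route, and the formula (x - min) / (max - min) computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: column 0 is an age-like feature, column 1 a duration-like feature
toy_features = np.array([[20.0, 100.0], [40.0, 300.0], [60.0, 500.0]])

# Two-step version: fit() learns the per-column min and max, transform() rescales
two_step_scaler = MinMaxScaler()
two_step_scaler.fit(toy_features)
print("Learned minimums:", two_step_scaler.data_min_)
print("Learned maximums:", two_step_scaler.data_max_)
scaled_two_step = two_step_scaler.transform(toy_features)

# One-step version: fit_transform() does both on the same data
scaled_one_step = MinMaxScaler().fit_transform(toy_features)

# Manual version of the same formula, column by column
col_min = toy_features.min(axis=0)
col_max = toy_features.max(axis=0)
scaled_manual = (toy_features - col_min) / (col_max - col_min)

print(np.allclose(scaled_two_step, scaled_one_step))
print(np.allclose(scaled_two_step, scaled_manual))
```

All three routes give the same scaled values on this array, with every column mapped onto [0, 1].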

In [24]:
from sklearn.preprocessing import MinMaxScaler

X_train = pd.get_dummies(data = X_train, columns = ["marital", "default"], drop_first = True)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

## 5. Evaluating the model on the validation set
We can now build and train our model again. On this screen, we'll also evaluate our model on our validation set.

Since we transformed some of the features in our training data on the previous screen, we need to make sure we transform those same features in our validation set.

We don't need to use sklearn.preprocessing.MinMaxScaler.fit() again. Our scaler has already "learned" how to scale the training data and we can directly transform our validation (or test) data set using the already-defined scaler.
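One consequence is worth seeing on a toy example: because the minimum and maximum come from the training data only, a validation value outside the training range lands outside [0, 1]. The ages below are hypothetical, and the sketch assumes MinMaxScaler's default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages: the training range is 20-60, the validation data contains a 70-year-old
train_ages = np.array([[20.0], [40.0], [60.0]])
val_ages = np.array([[30.0], [70.0]])

age_scaler = MinMaxScaler()
age_scaler.fit(train_ages)  # min and max are learned from the training data only

# transform() reuses the training min and max; it does not re-fit on the validation data
val_ages_scaled = age_scaler.transform(val_ages)
print(val_ages_scaled.ravel())
```

Here 30 maps to (30 - 20) / (60 - 20) = 0.25, while 70 maps to 1.25. That is expected: re-fitting the scaler on the validation data would use information the model shouldn't have and would put the two sets on different scales.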

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

banking_df = pd.read_csv("../data/subscription_prediction.csv")
X = banking_df.drop(["y"], axis=1)
y = banking_df["y"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state = 417)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20*X.shape[0]/X_train.shape[0], random_state = 417)

X_train = pd.get_dummies(data = X_train, columns = ["marital", "default"], drop_first = True)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])


knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_train_scaled, y_train)

X_val = pd.get_dummies(data = X_val, columns = ["marital", "default"], drop_first = True)

X_val_scaled = scaler.transform(X_val[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

val_accuracy = knn.score(X_val_scaled, y_val)
print(f"Accuracy of model evaluated on validation set with K = 1: {val_accuracy*100:.2f}%")

knn = KNeighborsClassifier(n_neighbors = 2000)
knn.fit(X_train_scaled, y_train)

val_accuracy = knn.score(X_val_scaled, y_val)
print(f"Accuracy of model evaluated on validation set with K = 2000: {val_accuracy*100:.2f}%")

Accuracy of model evaluated on validation set with K = 1: 69.19%
Accuracy of model evaluated on validation set with K = 2000: 59.46%
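One caveat with the code above: pd.get_dummies() is called separately on each split. If a category never appears in one split, that split gets no column for it, and drop_first = True may then drop a different category than it did for the training set. A possible safeguard, sketched here on hypothetical mini-frames rather than the banking data, is to encode the validation set without drop_first and then reindex it to the training columns.

```python
import pandas as pd

# Hypothetical splits: the validation part has no "divorced" or "unknown" clients
train_part = pd.DataFrame({
    "age": [40, 56, 39, 41],
    "marital": ["divorced", "married", "single", "unknown"],
})
val_part = pd.DataFrame({"age": [35, 50], "marital": ["married", "single"]})

# Training set: same call as in the lesson
train_dummies = pd.get_dummies(data=train_part, columns=["marital"], drop_first=True)

# Validation set: keep every category, then align to the training columns.
# Columns missing from the validation data are added and filled with 0;
# columns the training set doesn't have are discarded.
val_dummies = pd.get_dummies(data=val_part, columns=["marital"])
val_aligned = val_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())
print(val_aligned)
```

After the reindex, both frames have the same columns in the same order, so the same column selection and the fitted scaler can be applied to either one.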


## 6. Overfitting and Underfitting
In the previous output, selecting two drastically different values for K results in different model accuracies. One performs much better than the other.

![overfitting](../images/6.1-m738.svg)


This difference in performance highlights a crucial concept in machine learning: the balance between underfitting and overfitting. In machine learning, our goal is to create models that generalize well to new, unseen data. However, finding the right balance can be challenging, and two common issues that can arise are underfitting and overfitting.

Let's explore these concepts using the K-Nearest Neighbors (KNN) algorithm, focusing on how different values of $K$ can lead to these issues:

1. When $K$ is too small (e.g., $K = 1$), the model may overfit the data.
2. When $K$ is too large (e.g., $K = 100$), the model may underfit the data.
3. An appropriate middle value of $K$ can lead to a good fit that generalizes well.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to evaluate model
def evaluate_model(k):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy = knn.score(X_train, y_train)
    test_accuracy = knn.score(X_test, y_test)
    return train_accuracy, test_accuracy

# Evaluate models with different k values
k_1 = evaluate_model(1)
k_5 = evaluate_model(5)
k_100 = evaluate_model(100)

print(f"K=1: Train Accuracy: {k_1[0]:.4f}, Test Accuracy: {k_1[1]:.4f}")
print(f"K=5: Train Accuracy: {k_5[0]:.4f}, Test Accuracy: {k_5[1]:.4f}")
print(f"K=100: Train Accuracy: {k_100[0]:.4f}, Test Accuracy: {k_100[1]:.4f}")
```

```
K=1: Train Accuracy: 1.0000, Test Accuracy: 0.9000
K=5: Train Accuracy: 0.9313, Test Accuracy: 0.9350
K=100: Train Accuracy: 0.8750, Test Accuracy: 0.9100
```

This code demonstrates how different values of K affect the model's performance:

- Overfitting (K=1): Train Accuracy: 1.0000, Test Accuracy: 0.9000
  - The model has a perfect score on the training data (1.0000) but a lower score on the test data (0.9000). With K=1, each training point is its own nearest neighbor, so a perfect training score is expected.
  - This indicates overfitting: the model has memorized the training data perfectly but doesn't generalize as well to new data.
- Good Fit (K=5): Train Accuracy: 0.9313, Test Accuracy: 0.9350
  - The model performs similarly on both training and test data, with the test accuracy even slightly higher.
  - This suggests a good balance: the model generalizes well to new data.
- Slight Underfitting (K=100): Train Accuracy: 0.8750, Test Accuracy: 0.9100
  - Both accuracies are lower than with K=5, indicating the model might be too simple.
  - Interestingly, the test accuracy is higher than the training accuracy, which can happen with high K values as the model becomes very generalized.

Visual representation of decision boundaries:

![decision_boundaries](../images/6.2-m738.svg)

Good fit: Balanced decision boundaries

![overfitting](../images/6.3-m738.svg)

Overfitting: Overly complex decision boundaries

![underfitting](../images/6.4-m738.svg)

Underfitting: Simplistic decision boundaries

In practice, we aim to find the optimal K value that balances between overfitting and underfitting, resulting in a model that generalizes well to new data. This process often involves techniques like cross-validation, which we'll explore in future lessons.
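Before reaching for cross-validation, a simple version of this search is to try several values of K and keep the one that scores best on the validation set. The sketch below does this on the same kind of synthetic data as above. The candidate list and the variable names are arbitrary choices for illustration, and the result says nothing about the best K for the banking data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, split into a training part and a validation part
X_syn, y_syn = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=42)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Arbitrary list of candidate K values, from very small to fairly large
candidate_ks = [1, 3, 5, 11, 25, 51, 101]

val_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_syn_train, y_syn_train)
    # Score on the validation part only; the test set is not touched during the search
    val_scores[k] = model.score(X_syn_val, y_syn_val)
    print(f"K={k}: Validation Accuracy: {val_scores[k]:.4f}")

# Keep the K with the highest validation accuracy (the smallest such K on ties)
best_k = max(val_scores, key=val_scores.get)
print(f"Best K on the validation set: {best_k}")
```

Only once a value of K has been chosen this way would the model be evaluated a single time on the test set.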


## 7. Evaluating the model on the test set

Let's evaluate our model on the test set.


In [29]:
# Import the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
banking_df = pd.read_csv("../data/subscription_prediction.csv")

# Split the data set into features and target
X = banking_df.drop(["y"], axis=1)
y = banking_df["y"]

# Split the data set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state = 417)

# Split the training set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20*X.shape[0]/X_train.shape[0], random_state = 417)

# Convert the categorical columns into dummy variables
X_train = pd.get_dummies(data = X_train, columns = ["marital", "default"], drop_first = True)
X_val = pd.get_dummies(data = X_val, columns = ["marital", "default"], drop_first = True)

# Normalize the features
scaler = MinMaxScaler()

# Transform the training features
X_train_scaled = scaler.fit_transform(X_train[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

# Transform the validation features
X_val_scaled = scaler.transform(X_val[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

# Train the model
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train_scaled, y_train)

# Evaluate the model on the validation set
val_accuracy = knn.score(X_val_scaled, y_val)
print(f"Accuracy of model evaluated on validation set with K = 5: {val_accuracy*100:.2f}%")

# Convert the categorical columns of the test set into dummy variables
X_test = pd.get_dummies(data = X_test, columns = ["marital", "default"], drop_first = True)

# Normalize the test features with the scaler fitted on the training set
X_test_scaled = scaler.transform(X_test[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

# Evaluate the model on the test set
test_accuracy = knn.score(X_test_scaled, y_test)

# Print the accuracy of the model evaluated on the test set
print(f"Accuracy of model evaluated on test set with K = 5: {test_accuracy*100:.2f}%")

Accuracy of model evaluated on validation set with K = 5: 74.81%
Accuracy of model evaluated on test set with K = 5: 75.51%


In this lesson, we learned:

- How to build and train the K-Nearest Neighbors algorithm using scikit-learn.
- What a validation set is used for.
- What overfitting and underfitting are.

In the next lesson, we will learn more about improving our model's performance.