# Implementing a Classifier

## Reading

[Chapter 17: 17.4, 17.5](https://inferentialthinking.com/chapters/17/4/Implementing_the_Classifier.html)

In this notebook we learn to build a classifier that uses k nearest neighbors to classify data.

In the previous notebook we saw that the groups in a dataset sometimes overlap in the scatterplot. The overlap makes it difficult to determine the class of a new data point from its single closest neighbor, so we look at the k nearest neighbors instead and choose the class that the majority of them belong to.

## The Case for k Nearest Neighbors

We work with the same dataset as in the textbook, which contains characteristics of different banknotes, such as a $10 bill. Using these characteristics we can determine whether a banknote is real or fake (counterfeit).

We read in data to inspect them first.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/banknote.csv'
banknotes = pd.read_csv(url)
print("First 5 rows:")
banknotes.head()

The banknote attributes or features are in the first 4 columns. The last column is the `Class` column, which contains 0 for counterfeit and 1 for real money. Since there are 2 outcomes for `Class` this is a binary classification problem.

In [None]:
banknotes.Class.value_counts()

We take a look at the scatterplot for `WaveletVar` and `WaveletCurt`.

In [None]:
plt.figure(figsize=(5,4))

groups = banknotes.groupby('Class')
for name, group in groups:
    plt.scatter(group['WaveletVar'], group['WaveletCurt'], label=name, alpha=0.3)
plt.xlabel('WaveletVar')
plt.ylabel('WaveletCurt')
plt.legend(title='Class')   # show which color belongs to which class
plt.grid()
plt.show()

We can see that there's a region of overlap between the 2 classes, which means that it would be difficult to accurately predict the class of a new data point that falls in the overlapping region. This is where using k neighbors could help: we determine the class of the new data point by taking the class of the majority of its k nearest neighbors.
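
As a rough illustration of why the majority helps, the sketch below uses a small made-up set of 2-feature points (not the banknote data) in which one point of class 1 sits inside a cluster of class 0. The names `points`, `labels`, and `predict` are hypothetical and chosen only for this example.

```python
import numpy as np

# Made-up toy data: four class-0 points, one far class-1 point,
# and one stray class-1 point sitting inside the class-0 cluster.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [1.1, 1.3], [2.0, 2.0], [1.05, 1.0]])
labels = np.array([0, 0, 0, 0, 1, 1])
new_point = np.array([1.05, 1.02])

# distance from new_point to every toy point
distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
order = np.argsort(distances)          # indices from nearest to farthest

def predict(k):
    # majority class among the k nearest toy points
    nearest = labels[order[:k]]
    return int(nearest.sum() > k / 2)

pred_1 = predict(1)   # decided by the stray neighbor alone
pred_5 = predict(5)   # decided by the majority of 5 neighbors
print("k=1:", pred_1, " k=5:", pred_5)
```

With one neighbor the stray point decides the prediction; with five neighbors the surrounding class-0 points outvote it.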

---

## The Case for Multiple Attributes

In the initial investigation of classification we've only looked at 2 features of the data at a time, but typically each data record has many features. We want to see if using more than 2 relevant features can help the classifier to be more accurate.

Using the same `banknotes` dataset, we want to include all 3 attributes `WaveletVar`, `WaveletSkew`, and `WaveletCurt`.

We use a 3D scatterplot to show the 3 attributes. Just like when we work with 2 attributes and plot the data on the 2D x-y plane, we plot the 3 attributes in 3D space with x, y, and z coordinates.

In [None]:
# You don't need to write this code.
# The code is used to demo the difference between using 2 attributes
# and using 3 attributes of the banknotes data.

from mpl_toolkits.mplot3d import Axes3D

plt.figure(figsize=(6,7))
# add 3D axes
ax = plt.axes(projection='3d')

# group data by 'Class' and plot each group as before
groups = banknotes.groupby('Class')
for name, group in groups:
    ax.scatter3D(group['WaveletSkew'], group['WaveletVar'], group['WaveletCurt'], label=name, alpha=0.4)

# label each axis to match the order of the columns passed to scatter3D
ax.set_xlabel('WaveletSkew')
ax.set_ylabel('WaveletVar')
ax.set_zlabel('WaveletCurt')
ax.legend()
ax.view_init(elev=25, azim=290)
plt.show()

There is almost no overlap in the 3D plot. When we used 2 features before, there was a region of overlap between the two clusters, which means the classifier could make a mistake when determining the group of the new data. But when we use 3 features, the two clusters have almost no overlap, making it more clear-cut which class the new data belongs to. The third feature's contribution is to separate the 2 clusters.

In this case we see that a classifier that uses the 3 data features will be more accurate than a classifier that only uses 2 features.

What if the data has 4, 5 or more relevant features? In general it's better to use all the relevant features because:
- Each feature can potentially help to define the clusters better, just like in the example above.
- Each feature describes the data in some way, and we don't want to miss important information about the data. This is similar to when we want to see if someone walking ahead of us is a friend. We don't want to look only at the person's hair and clothes from the back while ignoring their facial features.

While using 4, 5 or more relevant features could mean better accuracy for our classifier, there is a downside:
- We can no longer plot the features to see the clusters, since we can only plot up to 3D space.<br> However, even though we can no longer visualize the clusters, the mathematics and algorithm have no problem with multidimensional space. What works in 2D space can be extended to 5D or 12D space.
- The computation cost is higher with more features, since a distance calculation with 5 features takes more work than one with 2 features. This is one reason why data scientists and machine learning / AI engineers need to work with powerful computers.

Last but not least, it's also important to note that we only want to include _relevant_ features in the classification calculations. Adding unnecessary features will, at best, cause the computation cost to be higher, and at worst, confuse the classifier.
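
To make the "confuse the classifier" point concrete, here is a small sketch with made-up numbers (not from any of the notebook's datasets). The helper `nearest_label` and both toy arrays are hypothetical. The second feature is unrelated to the class, yet because its values are large it dominates the distance and changes which neighbor is nearest.

```python
import numpy as np

def nearest_label(new_point, points, labels):
    # label of the single closest point (Euclidean distance)
    distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
    return int(labels[np.argmin(distances)])

labels = np.array([0, 1])

# Only the relevant feature: the new point (1.5) is close to the class-0 point (1.0).
relevant_only = np.array([[1.0], [5.0]])
nearest_relevant = nearest_label(np.array([1.5]), relevant_only, labels)

# Add a made-up irrelevant feature with large values.
with_noise = np.array([[1.0, 90.0], [5.0, 10.0]])
nearest_with_noise = nearest_label(np.array([1.5, 12.0]), with_noise, labels)

print(nearest_relevant, nearest_with_noise)   # prints: 0 1
```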

---

## Implementing a Classifier

We now have the background and tools necessary to create a simple classifier. We work with the dataset of wine features that is also used in the textbook.

### 1. Prepare Data

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/wine.csv"
wine = pd.read_csv(url)
rows, columns = wine.shape
print("Number of rows:", rows)
print("Number of columns:", columns)
print("First 5 rows:")
wine.head()

The dataset has 14 columns: the `Class` column plus 13 features, and we'll use all of the features. Our goal is to predict the type of wine, which is shown in the `Class` column. There are 3 `Class` values.

In [None]:
wine.Class.value_counts()

Since we are doing binary classification, we will focus on predicting whether or not a wine is in class 1.

We'll change the `Class` column so that rows with a `Class` value of 2 or 3 get the value 0.

In [None]:
# define a function to change 2 or 3 into 0
def change(x):
    if x == 2 or x == 3:
        return 0
    else:
        return x

# use apply with our change function to change the Class column
wine['Class'] = wine['Class'].apply(change)
wine.Class.value_counts()

We see that the count for 1 is still 59, and the count for 0 is the sum of the counts of 2 and 3.
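
The same relabeling can also be written without a helper function. The sketch below is one possible alternative, shown on a small made-up Series named `classes` rather than on the wine data: comparing with 1 gives True/False values, and `astype(int)` turns them into 1 and 0.

```python
import pandas as pd

# Made-up class labels standing in for the wine 'Class' column
classes = pd.Series([1, 2, 3, 1, 2])

# True where the class is 1, False otherwise, then convert to 1 / 0
binary = (classes == 1).astype(int)
print(binary.tolist())   # prints: [1, 0, 0, 1, 0]
```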

As discussed in the previous notebook, we'll divide the dataset into training data and testing data. We shuffle the rows (here with `sample(frac=1)`) and then split the shuffled data in half.

In [None]:
shuffled_data = wine.sample(frac=1)
training_set = shuffled_data.iloc[:len(shuffled_data)//2].copy()
testing_set = shuffled_data.iloc[len(shuffled_data)//2:].copy()

We take a look at the first 5 rows of the `training_set`.

In [None]:
training_set.head()

We see that the rows have been shuffled randomly.
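
Because the shuffle is random, each run produces a different split and slightly different results. If a repeatable split is wanted, one option is the `random_state` parameter of `sample`. The sketch below uses a made-up 10-row DataFrame named `toy`, and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

# Made-up data standing in for the wine DataFrame
toy = pd.DataFrame({'Feature': range(10), 'Class': [0, 1] * 5})

# A fixed random_state gives the same shuffle on every run
shuffled = toy.sample(frac=1, random_state=0)
half = len(shuffled) // 2
train = shuffled.iloc[:half].copy()
test = shuffled.iloc[half:].copy()

# Shuffling again with the same seed gives the same row order
same_again = toy.sample(frac=1, random_state=0)
print(shuffled.index.equals(same_again.index))   # prints: True
```

The two halves never share a row, since they are taken from non-overlapping slices of the same shuffled DataFrame.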

We recall from Module 10 Classification class notes that the `training_set` will be used by the classification algorithm to find the association (the scatterplot equivalent) of all the features of the data.

The `testing_set` will be used as the "Alice" data points for prediction.

---

### 2. Classification Algorithm

To write the code for the classification algorithm, we follow the steps that we've used in Module 10 Classification class notes:
1. Find the distance between the new data point and all the existing data points in the dataset.
2. Use these distances to find the k points that are closest (the shortest distance) to the new data point.
3. Find the class that occurs the most in the k nearest data points. This will be our predicted class for the new data point.




#### 1. Find Distances

<u>1a. Distance between the new data point and _one_ existing data point</u>

We recall that the distance formula is:
$$
D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
$$
for 2 points in the 2D x-y plane.

If the 2 points have more than 2 features, the distance formula still works in 4D, 5D, 6D or higher dimensional space. We simply extend the formula by adding the squared difference of the corresponding values along each additional dimension.

For example, if the 2 points have 4 attributes, then the distance formula will be:
$$
D = \sqrt{(x_1 - x_2)^2  + (y_1 - y_2)^2 + (z_1 - z_2)^2 + (w_1 - w_2)^2}
$$

Numpy arrays come in handy in this situation. If the 2 points with 4 attributes are represented as 2 numpy arrays with 4 values each, numpy automatically subtracts the corresponding values of the 2 arrays. This makes our code quite a bit shorter than if we had to write out each subtraction.
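
As a quick worked example of the 4-attribute formula, the sketch below uses two made-up points, `a` and `b`. The differences are -1, -2, -3, -4, their squares are 1, 4, 9, 16, and the squares sum to 30, so the distance is the square root of 30. The call to `np.linalg.norm` is only a cross-check and is not used elsewhere in this notebook.

```python
import numpy as np

# Two made-up points with 4 attributes each
a = np.array([1, 2, 3, 4])
b = np.array([2, 4, 6, 8])

difference = a - b                 # element-wise: [-1, -2, -3, -4]
squared = difference ** 2          # [1, 4, 9, 16]
D = np.sqrt(np.sum(squared))       # square root of 30

# numpy's built-in vector length gives the same distance
D_check = np.linalg.norm(a - b)
print(D, D_check)
```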

In [None]:
# new_point is the new data point, with multiple features
# row is one row of the DataFrame, which is one existing data point
def distance_from_point(new_point, row):
    # convert each data point into an array
    new_point = np.array(new_point)
    row = np.array(row)
    # subtract the 2 arrays
    # this step subtracts corresponding x, y, z, w values if there are 4 features
    difference = new_point - row
    # square each difference
    squared_difference = difference ** 2
    # add the squared differences and take the square root
    D = np.sqrt(np.sum(squared_difference))
    return D

<u>1b. Distance between the new data point and all existing data points</u>

We now use the `distance_from_point` function above and the `apply` function to find the distance between the new data point and all existing data points or all rows of the DataFrame.



In [None]:
# new_point is the new data point
# dataset is the DataFrame of all data
def all_distances(new_point, dataset):
    # remove the 'Class' column so only the features remain
    features = dataset.drop(columns='Class')
    # use 'apply' to run 'distance_from_point' with each row of the DataFrame
    # 'apply' passes each row as the first argument, and args=(new_point,)
    # sends new_point as the second argument; the distance is the same
    # in either order
    distances = features.apply(distance_from_point, axis=1, args=(new_point,))
    return distances
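
For larger datasets, calling a Python function once per row can be slow. A possible alternative, sketched below under the same assumption that the DataFrame has a `Class` column and numeric features, is to let numpy subtract the new point from every row at once. The function name `all_distances_vectorized` and the DataFrame `toy` are made up for this illustration.

```python
import numpy as np
import pandas as pd

def all_distances_vectorized(new_point, dataset):
    # keep only the feature columns, as a 2D numpy array (one row per data point)
    features = dataset.drop(columns='Class').to_numpy(dtype=float)
    # numpy subtracts new_point from every row in one step
    difference = features - np.asarray(new_point, dtype=float)
    # sum the squared differences across each row, then take the square root
    distances = np.sqrt((difference ** 2).sum(axis=1))
    # keep the original row labels so the result lines up with the dataset
    return pd.Series(distances, index=dataset.index, name='Distance')

toy = pd.DataFrame({'A': [0.0, 3.0, 6.0],
                    'B': [0.0, 4.0, 8.0],
                    'Class': [0, 1, 1]})
toy_distances = all_distances_vectorized([0.0, 0.0], toy)
print(toy_distances.tolist())   # prints: [0.0, 5.0, 10.0]
```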

---

#### 2. Find k Nearest Neighbors

After we find all the distances, we store them as a column in the DataFrame. Then we sort the DataFrame by the distances, and choose the top k rows, which are the data with the smallest distances.

In [None]:
def find_k_nearest_neighbors(new_point, dataset, k):
    # find all distances
    distances = all_distances(new_point, dataset)
    # create a DataFrame which is the dataset and the distances
    results = dataset.join(pd.DataFrame({'Distance':distances}))
    # sort the dataset by 'Distance', then take the top k rows
    k_nearest_neighbors = results.sort_values(by='Distance').head(k)
    return k_nearest_neighbors
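
Sorting and then taking the top k rows can also be done in one step with the pandas method `nsmallest`. The sketch below shows it on a made-up DataFrame named `toy` that already has a `Distance` column.

```python
import pandas as pd

# Made-up results: class and distance for 4 existing data points
toy = pd.DataFrame({'Class': [0, 1, 1, 0],
                    'Distance': [2.5, 0.5, 1.0, 3.0]},
                   index=['a', 'b', 'c', 'd'])

# the 2 rows with the smallest distances, nearest first
k_nearest = toy.nsmallest(2, 'Distance')
print(k_nearest.index.tolist())   # prints: ['b', 'c']
```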

---

#### 3. Make Predictions

From the k nearest neighbors, we write a function to find the class that occurs the most, and return the class as our prediction.

In [None]:
def majority(k_nearest):
    # find the count of 1's
    ones = np.sum(k_nearest.Class == 1)
    # find the count of 0's
    zeros = np.sum(k_nearest.Class == 0)
    # return the class with the larger count (a tie returns 0)
    if ones > zeros:
        return 1
    else:
        return 0

# main function of our classifier
# accepts as input:
#   the DataFrame of existing data,
#   the new data point,
#   the number of nearest neighbors
def classify(new_point, dataset, k):
    k_nearest = find_k_nearest_neighbors(new_point, dataset, k)
    return majority(k_nearest)
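
The `majority` function above assumes the classes are 0 and 1. If the classes had other labels, or there were more than 2 of them, one possible variation is to ask pandas for the most common value with `mode`. The name `majority_any` and the two toy DataFrames below are hypothetical. When there is a tie, `mode` returns the tied values in sorted order, so this sketch picks the smallest one.

```python
import pandas as pd

def majority_any(k_nearest):
    # mode() returns the most frequent value(s), sorted; take the first
    return k_nearest['Class'].mode().iloc[0]

# Made-up neighbors with 2 classes, then with 3 classes
two_classes = pd.DataFrame({'Class': [1, 0, 1, 1, 0]})
three_classes = pd.DataFrame({'Class': [2, 3, 3, 1, 3]})

print(majority_any(two_classes), majority_any(three_classes))   # prints: 1 3
```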

We now test the classifier with some known data in the dataset.
For each test data point we remove the `Class` column so the classifier doesn't know the class of the data, then we pass it to the `classify` function to see what the predicted class is.

In [None]:
# use the 1st row as test data, removing the 'Class' or group data
new_wine = training_set.drop(columns='Class').iloc[0]
print("Predicted group:", classify(new_wine, training_set, 5))
print("Actual group:", training_set.Class.iloc[0])

In [None]:
# use the 3rd row as test data
new_wine = training_set.drop(columns='Class').iloc[2]
print(new_wine.shape)
print("Predicted group:", classify(new_wine, training_set, 5))
print("Actual group:", training_set.Class.iloc[2])

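So far so good. Across repeated runs (the shuffle is random), the classifier determines the group of these 2 sample rows correctly most of the time. Keep in mind that this is an easy check: both rows come from `training_set`, so each row is its own nearest neighbor, at distance 0.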

We now test the classifier with our `testing_set` data which we reserved from the original dataset.

---

## Accuracy

First we save the `Class` or group data of `testing_set`, then we drop the `Class` data from the `testing_set`.



In [None]:
actual = testing_set.Class
test_set = testing_set.drop(columns='Class')

Then we use `apply` to run the classifier on each row of `test_set`. The `args` value for `apply` holds the remaining inputs for `classify`: the `training_set` and 5, the number of nearest neighbors.

In [None]:
predicted = test_set.apply(classify, axis=1, args=(training_set, 5))

We store the `actual` group and the `predicted` group in a DataFrame and print the first 5 rows.

In [None]:
results = pd.DataFrame({'Actual': actual, 'Predicted': predicted})
results.head(5)

We see that most of the predictions in the first 5 rows are correct.

We calculate the percent accuracy by finding the ratio of correct predictions over the total predictions.

In [None]:
# count the rows where Actual is the same as Predicted
correct = (results.Actual == results.Predicted).sum()
# find the ratio of correct predictions over total predictions as a percentage
print("Accuracy:", round(correct / len(results) * 100))
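
A single accuracy number does not show which kind of mistake the classifier makes. One way to see that, sketched below with made-up actual and predicted values, is a table of counts from `pd.crosstab`. The sketch also shows that the mean of the True/False comparison gives the accuracy directly.

```python
import pandas as pd

# Made-up results for 5 test points
results = pd.DataFrame({'Actual':    [1, 0, 0, 1, 0],
                        'Predicted': [1, 0, 1, 1, 0]})

# True counts as 1 and False as 0, so the mean is the fraction correct
accuracy = (results['Actual'] == results['Predicted']).mean()
print("Accuracy:", round(accuracy * 100))   # prints: Accuracy: 80

# rows are the actual classes, columns are the predicted classes
table = pd.crosstab(results['Actual'], results['Predicted'])
print(table)
```

In this toy table the one mistake is an actual 0 that was predicted as 1.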

The accuracy is not bad for a relatively simple algorithm.

---

We now test the classifier with a completely new set of data.

Using the textbook dataset, we read in data that can be used for breast cancer detection. The data are measurements of biopsy images.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/breast-cancer.csv"
patients = pd.read_csv(url)
rows, columns = patients.shape
print("Number of rows:", rows)
print("Number of columns:", columns)
print("First 5 rows:")
patients.head()

The ID column is not relevant to the cancer data so we remove it.

In [None]:
patients = patients.drop(columns='ID')
print("First 5 rows:")
patients.head()

Since the data in the columns are in the same range, we don't need to use standard units.
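
If the columns had been on very different scales, the column with the largest values would dominate the distance. For that situation, here is a minimal sketch of converting feature columns to standard units. It runs on a made-up DataFrame named `toy`, and it assumes the population standard deviation (`ddof=0`), which is a choice made for this example.

```python
import pandas as pd

def standard_units(features):
    # subtract each column's mean and divide by its standard deviation
    return (features - features.mean()) / features.std(ddof=0)

# Made-up features on very different scales
toy = pd.DataFrame({'Small': [0.1, 0.2, 0.3, 0.4],
                    'Large': [100.0, 300.0, 500.0, 700.0]})
toy_su = standard_units(toy)
print(toy_su)
```

After the conversion every column has mean 0 and standard deviation 1, so each feature contributes to the distance on a comparable scale. The `Class` column should be left out of this conversion.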

We divide the data into the training and testing set.

In [None]:
shuffled_data = patients.sample(frac=1)
training_set = shuffled_data.iloc[:len(shuffled_data)//2].copy()
testing_set = shuffled_data.iloc[len(shuffled_data)//2:].copy()

We run `classify` with the 2 datasets.

In [None]:
# save the actual group, and drop the group from the testing_set
actual = testing_set.Class
test_set = testing_set.drop(columns='Class')

# run the classifier to get the predicted group
predicted = test_set.apply(classify, axis=1, args=(training_set, 5))

# compare the actual and predicted groups, count the correct predictions,
# and print the percent accuracy
correct = (actual == predicted).sum()
print("Accuracy:", round(correct / len(actual) * 100))

We see that the classifier has high accuracy on a new dataset which is completely different from the wine dataset that we used to develop the algorithm.
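
Both runs above used k = 5. A natural follow-up is to see how the accuracy changes with k. The sketch below does this on synthetic data generated with numpy, not on the wine or patient data, and uses a compact numpy version of the classifier named `knn_predict`. The cluster centers, the seed, and the list of k values are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 2 clusters of 50 points with 3 features each
class0 = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
class1 = rng.normal(loc=2.0, scale=1.0, size=(50, 3))
features = np.vstack([class0, class1])
labels = np.array([0] * 50 + [1] * 50)

# shuffle the row numbers and split them in half
order = rng.permutation(len(labels))
half = len(labels) // 2
train_idx, test_idx = order[:half], order[half:]

def knn_predict(new_point, train_features, train_labels, k):
    # distances to all training points, then majority of the k nearest
    distances = np.sqrt(((train_features - new_point) ** 2).sum(axis=1))
    nearest = train_labels[np.argsort(distances)[:k]]
    return int(nearest.sum() > k / 2)

accuracy_by_k = {}
for k in [1, 3, 5, 7]:
    predictions = np.array([
        knn_predict(point, features[train_idx], labels[train_idx], k)
        for point in features[test_idx]
    ])
    # fraction of test points predicted correctly
    accuracy_by_k[k] = (predictions == labels[test_idx]).mean()

print(accuracy_by_k)
```

Odd values of k are used so that a vote between 2 classes cannot end in a tie.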

---

In this notebook we use our understanding of how k nearest neighbors works to write an algorithm, or classifier, for binary classification. Then we test the algorithm using training and testing sets that come from the original dataset. The data points in the training set serve as the potential neighbors of each data point in the testing set. The classifier predicts the group of each data point in the testing set, and by comparing all the predicted groups against the actual groups, we can measure the accuracy of the classifier.