# Classification

## Reading

[Chapter 17: 17.1 - 17.3](https://inferentialthinking.com/chapters/17/Classification.html)

In the previous two notebooks we studied a machine learning technique called linear regression, in which we write code to create a regression line for the data. This regression line gives us the predicted values of a specific characteristic of the data.

In this notebook and the next notebook we investigate another machine learning technique called _classification_. With classification we work with data that are made up of two or more distinct *groups* or *classes*. An example would be data for patients who have cancer or don't have cancer, where we use a patient's data to _classify_, or predict, whether the patient has cancer or not.

When the predicted data are in distinct groups or classes, they are _categorical_ data. We recall that we've worked with categorical data before, when we used a bar chart to plot data in Module 3 Plots class notes.

For this introduction to classification, we will work with datasets where the predicted data are in 2 distinct classes. This is called _binary classification_.

---

## Nearest Neighbor

We look at one of the most common methods for classification called the _nearest neighbor_ method.

But before we dive into creating a classification algorithm, we'll first investigate the common concepts in classification: nearest neighbor, decision boundary, and k nearest neighbors.


We start with the same dataset as the textbook, which works with data for chronic kidney disease or CKD.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/ckd.csv'
ckd = pd.read_csv(url)
print("First 5 rows:")
ckd.head()

Each row has data for one patient, and the patient's data (the columns) are measurements of different blood test markers.

For our study of classification, we'll take a look at these columns: `Hemoglobin`, `White Blood Cell Count`, `Blood Glucose Random`, and `Class`.

The `Class` variable has 2 values: 0 and 1, which represent the patient status of no CKD (0) or CKD (1).

In [None]:
ckd.Class.value_counts()

We create a new DataFrame from the 4 selected columns, and shorten the name `Blood Glucose Random` to `Glucose`.

In [None]:
data = ckd[['Hemoglobin', 'White Blood Cell Count', 'Blood Glucose Random', 'Class']]
data = data.rename(columns={'Blood Glucose Random': 'Glucose'})
print("First 5 rows:")
data.head()

Since the blood markers are measured in different units, we convert all 3 markers to standard units so that they are all on the same scale.

We bring in the `standard_units` function from the Module 8 notebook, then we use the function to convert the data.

In [None]:
def standard_units(array):
    return (array - np.mean(array))/np.std(array)

data['Hemoglobin'] = standard_units(data['Hemoglobin'])
data['White Blood Cell Count'] = standard_units(data['White Blood Cell Count'])
data['Glucose'] = standard_units(data['Glucose'])
print("First 5 rows in standard units:")
data.head()
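
As a quick check of what standard units do, the sketch below applies the same `standard_units` function to a small array of made-up values (they are for illustration only and are not taken from the CKD dataset). After the conversion, the values should have a mean of 0 and a standard deviation of 1.

```python
import numpy as np

def standard_units(array):
    return (array - np.mean(array)) / np.std(array)

# made-up measurements, for illustration only
values = np.array([11.2, 9.5, 15.4, 13.0, 16.1])
z = standard_units(values)

# np.isclose allows for tiny floating point rounding differences
print("Mean is 0:", np.isclose(np.mean(z), 0))
print("Standard deviation is 1:", np.isclose(np.std(z), 1))
```

Any column converted this way ends up on the same scale, which is what lets us compare markers that were originally measured in different units.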

First we look at the relationship between `Hemoglobin` and `Glucose` by plotting them in a scatterplot.

In [None]:
plt.figure(figsize=(4,4))
plt.scatter(data['Hemoglobin'], data['Glucose'])
plt.xlabel('Hemoglobin')
plt.ylabel('Glucose')
plt.grid()
plt.show()

There appears to be no correlation between these 2 markers, and especially not linear correlation.

But we also have one more patient feature that we can use: we can group the data by `Class`. This allows us to see the `Glucose` and `Hemoglobin` for CKD and non-CKD patients.

In [None]:
plt.figure(figsize=(4,4))

# group by CKD or non-CKD
groups = data.groupby('Class')

# scatterplot for each group
for name, group in groups:
    plt.scatter(group['Hemoglobin'], group['Glucose'], label=name)

plt.xlabel('Hemoglobin')
plt.ylabel('Glucose')
plt.legend()
plt.grid()
plt.show()

Now we clearly see a pattern with the 2 groups or classes:
- Patients in class 0 or non-CKD have a low `Glucose` level between -1 and 0 standard units, and a high `Hemoglobin` level between -0.5 and 1.5 standard units. Their data are clustered together.
- Patients in class 1 or CKD have a wide range of `Glucose` levels, and a `Hemoglobin` level that's mostly less than 0.

If we have a patient named Alice, to use the same name as in the textbook, and we know her `Glucose` and `Hemoglobin` levels, then there's a good chance we can plot her data in the scatterplot and determine whether Alice has CKD or not.

- If Alice's data placed her near or in the blue cluster of data points above, then we would conclude that she's in the non-CKD group.
- If her data placed her near some of the orange data points and far away from the blue cluster, then we would conclude that she's in the CKD class.

The reasoning above is an example of the nearest neighbor classification method. We predict the class for the new data based on the group of the _closest_ existing data. The idea is that when 2 data values are close together, they are likely to share the same characteristics.

---

#### Decision Boundary

If Alice's data placed her very close to the blue cluster or very far from the blue cluster, then the CKD or non-CKD decision is straightforward.

But what if Alice is placed somewhere close to both the orange and blue data points?

We now give Alice a `Glucose` and a `Hemoglobin` level as shown below.

In [None]:
plt.figure(figsize=(4,4))

# group by CKD or non-CKD
groups = data.groupby('Class')

# scatterplot for each group
for name, group in groups:
    plt.scatter(group['Hemoglobin'], group['Glucose'], label=name)

# add Alice
plt.scatter(0, 1, color='purple', label='Alice')
# 0 is the value for x or Hemoglobin
# 1 is the value for y or Glucose

plt.xlabel('Hemoglobin')
plt.ylabel('Glucose')
plt.legend()
plt.grid()
plt.show()

Now it's not so easy to see which class Alice belongs to. Her data point looks close to both the orange and the blue data points.

Fortunately we can calculate the distance between the purple data point (Alice) and the orange and blue data points (existing patients in the dataset).


In [None]:
# You don't have to write the code below at this time.
# It's used to demo the nearest neighbor concept.

# The code finds the minimum distance between Alice's data point and
# other patients' data points, then it prints the data of the patient
# closest to Alice.

# Alice's data point
alice = np.array([0, 1])

# function to calculate the distance between 2 points
def find_distance(point1, point2):
    return np.sqrt((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2)

# find distance from Alice to each data point in the DataFrame
data['Distance_to_Alice'] = data.apply(lambda row: find_distance(alice,
                                  [row['Hemoglobin'], row['Glucose']]), axis=1)

# find the data point with the minimum distance
closest_point = data.loc[data['Distance_to_Alice'].idxmin()]

print("Closest data point to Alice:")
print("(x,y) =", round(closest_point['Hemoglobin'],2), ",",
      round(closest_point['Glucose'], 2))
print("Distance to Alice:", round(closest_point['Distance_to_Alice'], 2))
print("Class:", closest_point['Class'])

The result shows that the closest patient to Alice is the orange data point to the right of Alice's data point. Since this data point has `Class` 1, we conclude that Alice is also in class 1, which means Alice likely has CKD.



In the previous example, Alice's data point is at (0, 1), which results in the conclusion that Alice is in the CKD class. What if Alice's data point is at (0, 0.5) instead?

We can use the same code as above to find the class in which she belongs.

In [None]:
# Alice's new data point
alice = [0, 0.5]

data['Distance_to_Alice'] = data.apply(lambda row: find_distance(alice, [row['Hemoglobin'], row['Glucose']]), axis=1)
closest_point = data.loc[data['Distance_to_Alice'].idxmin()]

print("Closest data point to Alice:")
print("(x,y) =", round(closest_point['Hemoglobin'],2), ",", round(closest_point['Glucose'], 2))
print("Distance to Alice:", round(closest_point['Distance_to_Alice'], 2))
print("Class:", closest_point['Class'])

Now the closest data point is a blue data point that appears below Alice's data point in the plot. Since that blue data point has `Class` 0, Alice likely belongs in the non-CKD class.

We can see that Alice's location on the plot determines her outcome of CKD or non-CKD. The location (x, y) where Alice switches from being in the CKD class to being in the non-CKD class is part of the _decision boundary_.

The decision boundary is the dividing line between 2 classes of data: each side of the decision boundary holds the data for one of the 2 classes. For example, one side of the decision boundary is for the CKD class, and any data point that is placed on that side will be classified as CKD. Conversely, any data point that is placed on the other side of the decision boundary will be classified as non-CKD. The [textbook](https://inferentialthinking.com/chapters/17/1/Nearest_Neighbors.html#decision-boundary) shows the plot of the decision boundary for the `Hemoglobin` vs `Glucose` scatterplot.
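
To see a decision boundary appear, here is a minimal sketch that uses 4 made-up training points instead of the CKD dataset. The names `points`, `classes`, and `nearest_class` are hypothetical and only serve this illustration. We keep a new patient's `Hemoglobin` at 0, slide the `Glucose` value upward, and watch for the place where the nearest neighbor's class changes.

```python
import numpy as np

# hypothetical training points in standard units: (Hemoglobin, Glucose)
points = np.array([[0.0, 0.0],     # class 0
                   [0.5, -0.5],    # class 0
                   [0.0, 2.0],     # class 1
                   [-1.0, 1.5]])   # class 1
classes = np.array([0, 0, 1, 1])

def nearest_class(new_point):
    # distance from new_point to every training point
    distances = np.sqrt(np.sum((points - new_point) ** 2, axis=1))
    # class of the training point with the smallest distance
    return classes[np.argmin(distances)]

# Hemoglobin stays at 0 while Glucose increases
for glucose in [0.0, 0.5, 0.9, 1.1, 1.5, 2.0]:
    new_point = np.array([0.0, glucose])
    print("Glucose", glucose, "-> class", nearest_class(new_point))
```

With these made-up points, the predicted class switches from 0 to 1 somewhere between a `Glucose` of 0.9 and 1.1, so the decision boundary crosses the line `Hemoglobin` = 0 in that interval.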

---

#### k-Nearest Neighbors

When the scatterplot shows 2 distinct classes, such as the `Hemoglobin` vs `Glucose` scatterplot, then it is more straightforward to find the nearest neighbor so that we can use its `Class` to determine our data's `Class`.

But sometimes the groupings overlap, such as when we look at `White Blood Cell Count` and `Glucose`.

In [None]:
plt.figure(figsize=(4,4))

groups = data.groupby('Class')
for name, group in groups:
    plt.scatter(group['White Blood Cell Count'], group['Glucose'], label=name)

plt.xlabel('White Blood Cell Count')
plt.ylabel('Glucose')
plt.legend()
plt.grid()
plt.show()

If Alice's data point is somewhere in the cluster of blue dots, we can't be confident about which class she belongs to.

In this case we use a technique called _k Nearest Neighbors_. Instead of looking for the one closest neighbor, we look at the _k_ closest neighbors. If these neighbors are not all in the same class, then we go with the class of the majority of the k neighbors. For example, out of the nearest 3 neighbors, if 2 of them are in the non-CKD class then we conclude that Alice is also in the non-CKD class.
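
The majority vote can be sketched with a few lines of code. This is only an illustration under stated assumptions: the `train` DataFrame holds 5 made-up patients, and the `classify` function is a hypothetical helper, not the algorithm we will build in the next notebook.

```python
import numpy as np
import pandas as pd

# made-up training data in standard units, for illustration only
train = pd.DataFrame({
    'White Blood Cell Count': [-0.5, -0.3, -0.4, 0.8, -0.2],
    'Glucose':                [-0.6, -0.4, -0.2, 1.5, -0.5],
    'Class':                  [0, 0, 1, 1, 0],
})

def classify(new_point, k):
    # distance from new_point to every training row
    run = train['White Blood Cell Count'] - new_point[0]
    rise = train['Glucose'] - new_point[1]
    distances = np.sqrt(run**2 + rise**2)
    # Class values of the k rows with the smallest distances
    nearest = train.loc[distances.nsmallest(k).index, 'Class']
    # the most common class among the k neighbors
    return nearest.mode().iloc[0]

alice = [-0.4, -0.3]
print("k = 1:", classify(alice, 1))
print("k = 3:", classify(alice, 3))
```

In this made-up data the single closest neighbor is in class 1, but 2 of the 3 closest neighbors are in class 0, so the prediction changes from class 1 to class 0 when k goes from 1 to 3. Choosing an odd k avoids ties when there are 2 classes.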

---

## Training and Testing

An important concept in machine learning prediction is the _training_ of the algorithm to do the prediction and then _testing_ the algorithm to see how accurate the predictions are.

The algorithm to do the prediction is the sequence of steps it takes to find the nearest neighbor(s). In the previous examples with Alice's data point, the prediction algorithm is the code that:
- finds the distance between Alice's data point and all existing patients' data points
- pinpoints the nearest patient's or k patients' data point(s)

### Training Dataset and Testing Dataset

To fully test how well a prediction algorithm works, we need to test it with many new data points or many "Alice" data points that the algorithm has not seen in the dataset. The standard practice is to take the given dataset and divide it into 2 separate (non-overlapping) parts: a training dataset and a testing dataset.

<u>Training Dataset</u><br>
First we run the algorithm with the data in the training dataset. For the CKD example above, the k nearest neighbor algorithm uses the training data to map the relationship between the `Glucose` and `White Blood Cell Count` for each `Class` group, just like with the scatterplot above.

In machine learning terminology, when we run an algorithm with data that have all relevant features so that the algorithm can map out the data, it is said that we're _training_ the algorithm, and the algorithm _learns_ from the data.

<u>Testing Dataset</u><br>
After the algorithm has mapped the training data, we take the testing dataset and set aside its `Class` column. In the CKD example above, each testing data point is like the "Alice" data point, which has the `Glucose` and `White Blood Cell Count` data but no `Class` data. Then the algorithm finds the k nearest neighbors for each of the testing data points and produces the predicted classes.

To measure the accuracy of the prediction, we compare the predicted values with the `Class` column that we removed from the testing dataset.
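
One common way to do this comparison is to compute the proportion of predictions that match the actual classes. The sketch below uses made-up `actual` and `predicted` arrays, since we have not produced any real predictions yet.

```python
import numpy as np

# made-up values, for illustration only
actual = np.array([0, 1, 1, 0, 1, 0, 0, 1])      # the Class column we set aside
predicted = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # what the algorithm predicted

# the comparison gives True or False for each patient,
# and the mean of True/False values is the proportion of True
accuracy = np.mean(predicted == actual)
print("Accuracy:", accuracy)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75.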

---

### Create the Training and Testing Datasets

We use random sampling to create the 2 sets. First we shuffle all the rows of the dataset, then we choose the first half of the rows for the training set, and the second half of the rows for a testing set.

From Module 7 AB Testing class notes, we shuffle by sampling all of the dataset, which means we set the `frac` or fraction of the dataset to 1.

In [None]:
shuffled_data = data.sample(frac=1)
# take first half
training_set = shuffled_data.iloc[:len(shuffled_data)//2]
# take second half
testing_set = shuffled_data.iloc[len(shuffled_data)//2:]
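
The sketch below repeats the same shuffle and split on a small made-up DataFrame named `toy`, so that we can check that the 2 halves don't overlap. It also passes `random_state` to `sample`, an optional argument that is not used in the cell above, so that the shuffle is the same each time the code runs.

```python
import pandas as pd

# a small made-up DataFrame standing in for `data`
toy = pd.DataFrame({'Glucose': [0.1, -0.4, 1.2, 0.7, -0.9, 0.3],
                    'Class':   [0, 0, 1, 1, 0, 1]})

# random_state fixes the shuffle so it is repeatable
shuffled = toy.sample(frac=1, random_state=0)
half = len(shuffled) // 2
training = shuffled.iloc[:half]   # first half
testing = shuffled.iloc[half:]    # second half

print("Training rows:", len(training))
print("Testing rows:", len(testing))
# count the original row labels that appear in both sets
print("Rows in both sets:", len(training.index.intersection(testing.index)))
```

Because each row of the shuffled DataFrame lands in exactly one of the 2 slices, no patient is used for both training and testing.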

We confirm that the training dataset still has the same characteristics as the original dataset by plotting `Glucose` vs `White Blood Cell Count`.

In [None]:
plt.figure(figsize=(4,4))

groups = training_set.groupby('Class')
for name, group in groups:
    plt.scatter(group['White Blood Cell Count'], group['Glucose'], label=name)

plt.xlabel('White Blood Cell Count')
plt.ylabel('Glucose')
plt.legend()
plt.grid()
plt.show()

We see that there are fewer data points than in the original dataset, but the non-CKD group is still clustered in the lower right hand corner, and the CKD group is still spread out throughout the plot, including having some data points inside the blue cluster.

---

## Distance Calculation

Now that we've explored the nearest neighbor classification algorithm and the concept of training and testing datasets, we will learn some new math and Python techniques to prepare for the next notebook, where we will code the nearest neighbor algorithm.

One of the most important parts of the k nearest neighbor algorithm is to find the distance between two points, as that will help us find which training data points are the closest to the testing data point.

In 2-dimensional space, the distance between 2 points $(x_1, y_1)$ and $(x_2, y_2)$ is:
$$
D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
$$

The formula is derived from the [Pythagorean theorem](https://en.wikipedia.org/wiki/Pythagorean_theorem), a foundational concept in geometry.

The formula looks complicated but we'll break it down step by step in the code below. Then we'll plot it so we can see how the equation works.

In [None]:
# create 2 points
point_1 = (0, 0)   # (x1, y1)
point_2 = (3, 4)   # (x2, y2)

# Calculate the rise and run
# - the rise is the difference along the y-axis between 2 points
# - the run is the difference along the x-axis between 2 points
run = point_1[0] - point_2[0]    # x1 - x2
rise = point_1[1] - point_2[1]   # y1 - y2

# Calculate the distance
distance = np.sqrt(run**2 + rise**2)
print("Distance between Point 1 and Point 2:", distance)

# Plot
plt.figure(figsize=(3,3))

plt.plot(point_1[0], point_1[1], 'o', color='blue', label="Point 1")
plt.plot(point_2[0], point_2[1], 'o', color='orange', label="Point 2")

plt.plot([point_1[0], point_2[0]], [point_1[1], point_1[1]], color='purple',
         linestyle='--', label="Run")
plt.plot([point_2[0], point_2[0]], [point_1[1], point_2[1]], color='green',
         linestyle='--', label="Rise")
plt.plot([point_1[0], point_2[0]], [point_1[1], point_2[1]], color='black',
         linestyle='-', label="Distance")

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.show()

We see that to calculate the distance between 2 points, we:
- Find the run, which is the distance along the x-axis between the 2 points. In the plot above, it's the purple dashed line.
- Find the rise, which is the distance along the y-axis between the 2 points. In the plot above, it's the green dashed line.
- Add the square of the rise and the square of the run, and take the square root of the sum. The result is the distance between the 2 points.
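
Written out for the 2 points $(0, 0)$ and $(3, 4)$, with a run of $0 - 3 = -3$ and a rise of $0 - 4 = -4$, the steps are:
$$
D = \sqrt{(0 - 3)^2 + (0 - 4)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
$$

The run and rise are negative here, but squaring them makes both terms positive, so the order of the 2 points doesn't change the distance.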

We can write a function `find_distance` that goes through these steps, given the input point_1 and point_2.

In [None]:
# a point is an array with 2 values: [x, y]
def find_distance(point_1, point_2):
    run = point_1[0] - point_2[0]   # point[0] is the x component
    rise = point_1[1] - point_2[1]  # point[1] is the y component
    return np.sqrt(run**2 + rise**2)
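
As a quick usage check, the sketch below calls the same function with the 2 points from the plot. It also compares the result with `np.linalg.norm`, a NumPy function that computes the same straight-line distance when it is given the difference between 2 arrays. The comparison is added here as an illustration and is not part of the original notes.

```python
import numpy as np

# same function as in the cell above
def find_distance(point_1, point_2):
    run = point_1[0] - point_2[0]
    rise = point_1[1] - point_2[1]
    return np.sqrt(run**2 + rise**2)

a = np.array([0, 0])
b = np.array([3, 4])

print("a to b:", find_distance(a, b))
print("b to a:", find_distance(b, a))          # swapping the points gives the same distance
print("np.linalg.norm:", np.linalg.norm(a - b))
```

All 3 lines print 5.0.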

---

## The `apply` Function

The DataFrame has an `apply` function that will run a given function for the data in each row.

<u>Example 1</u><br>
We review the CKD `data` from above.

In [None]:
print("first 5 rows:")
data.head()

Suppose that for each row, we want to find the higher of the `Hemoglobin` and the `Glucose` absolute values. For example, in the first row, the absolute value of `Hemoglobin` is 0.865744, and the absolute value of `Glucose` is 0.221549. The higher absolute value is 0.865744.



First we create a new DataFrame with the `Hemoglobin` and `Glucose` columns.

In [None]:
two_columns = data[['Hemoglobin', 'Glucose']].copy()
two_columns.head()

We write a function to find the higher absolute value.

In [None]:
def higher_abs_value(row):
    return np.max(np.abs(row))

We run the `higher_abs_value` function with the first row to check that it produces the correct result.

In [None]:
higher_abs_value(two_columns.iloc[0])

We now use the `apply` function to run a function with each row in the DataFrame, using the format:

> `a_DataFrame.apply(function_name, axis=1)`

The `axis=1` tells `apply` to run `function_name` with each _row_ of data. If we don't add the `axis=1`, then the default behavior for `apply` is to run the function with each column of the data.
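
The difference between the 2 settings is easier to see on a tiny example. The sketch below uses a made-up DataFrame `df` and a hypothetical helper function `total`, both only for illustration.

```python
import pandas as pd

# a small made-up DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def total(values):
    return values.sum()

by_row = df.apply(total, axis=1)   # one result for each row
by_column = df.apply(total)        # default: one result for each column

print("By row:")
print(by_row)
print("By column:")
print(by_column)
```

With `axis=1` we get 3 results (11, 22, 33), one per row. Without it we get 2 results (6 for `a` and 60 for `b`), one per column.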

In the code below we run the `higher_abs_value` function with each row by using `apply` and we store the results as a new column in the `two_columns` DataFrame. Then we print the DataFrame to check that the output is correct.

In [None]:
two_columns['higher'] = two_columns.apply(higher_abs_value, axis=1)
print("First 5 rows:")
two_columns.head()

<u>Example 2</u><br>
As a second example of using `apply`, given the function `find_distance` above, we write the function to find the distance between Alice and one data point in the `data` DataFrame.

Alice's data point is already set at (0, 0.5) above.

In [None]:
def find_distance_to_alice(row):
    return find_distance(alice, [row['Hemoglobin'], row['Glucose']])

We run the function with the first row of `data` to check that it works.

In [None]:
print("Distance to Alice:", find_distance_to_alice(data.iloc[0]))

Now we use the `apply` function to:
- find the distance between Alice and all the data points
- store the distances in the variable named `distances`
- find the smallest distance and print it

In [None]:
distances = data.apply(find_distance_to_alice, axis=1)
print("Smallest distance:", np.min(distances))
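
The smallest distance alone doesn't tell us which patient is closest. The sketch below, which uses a made-up 3-row DataFrame named `toy`, shows that `idxmin` gives the row label of the nearest patient. It also shows, as an alternative that is not used in these notes, that the same distances can be computed directly from the columns without `apply`.

```python
import numpy as np
import pandas as pd

# made-up patients in standard units, for illustration only
toy = pd.DataFrame({'Hemoglobin': [0.3, 3.0, -1.0],
                    'Glucose':    [0.9, 4.5, 0.5]})
alice = [0, 0.5]

def find_distance(point_1, point_2):
    run = point_1[0] - point_2[0]
    rise = point_1[1] - point_2[1]
    return np.sqrt(run**2 + rise**2)

def find_distance_to_alice(row):
    return find_distance(alice, [row['Hemoglobin'], row['Glucose']])

# distances computed one row at a time with apply
with_apply = toy.apply(find_distance_to_alice, axis=1)

# the same distances computed from whole columns at once
with_columns = np.sqrt((toy['Hemoglobin'] - alice[0])**2 +
                       (toy['Glucose'] - alice[1])**2)

print("Same distances:", np.allclose(with_apply, with_columns))
print("Row label of the nearest patient:", with_apply.idxmin())
```

Both approaches give the same distances. The `apply` version follows the step-by-step reasoning more closely, while the column version is shorter and usually runs faster on large datasets.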

---

In this notebook we learned the concepts behind a classification problem and how the nearest neighbor method can be used to classify data.

With this background knowledge, we will create a classification model in the next and final notebook.