# Lab Assignment 10 - Part 2 (split between weeks 10 and 11)

Complete the following steps by **filling in your answer in the Code or Text cell below each question**, and then **run the cell** to observe the output.

Before you start working on the lab, please do the following steps:
- Find the Settings (gear icon) in the upper right, to the left of the blue "Share" button.
- Click on the gear icon.
- In the pop-up menu, click on "AI Assistance".
- Uncheck all 3 boxes, if they're not already unchecked.
- Click "Close".

This setting will let you do your own work instead of having the AI do all the work.

For the questions below:
- If the question has no reference to a class notes Colab notebook, then the question is a review of previous material.
- If the question refers to Module 10 class notes, then it's new material for Module 10 and the specific section should be helpful as you work on the answer.

---

This lab works with two different prediction techniques:
- In Part 1 (last week) we continued with the linear regression work by calculating the accuracy of the regression line.
- In Part 2 (this week) we work on classification.

First we **import the necessary modules**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---

## Part 2: Introduction to Classification

We now switch to another prediction technique called _classification_, where we will explore the foundational concepts of k nearest neighbor classification:
- The type of data that can be used for classification.
- The nearest neighbor and the type of clustering that works well with nearest neighbor prediction.
- How to find the distance to the nearest neighbor.

We work with a dataset from [Kaggle](https://www.kaggle.com/datasets/sakshisatre/social-advertisement-dataset), which has data on consumer behavior. The dataset contains data on the consumer's age, salary, and whether or not they purchased a product after seeing the advertisement for the product.



### Question 5

We read in data and inspect the data.

5a. Write code to **read data from the URL and store in a DataFrame called `ads`** for advertisements.<br>
Then **print the number of rows and columns of `ads`**
and **print the first 5 rows**.

Make sure to print a text description for the number of rows and columns so it's clear what the numbers are.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_labs/social_ads.csv"
ads =



5b. We check whether the type of prediction we want to do is a classification.

The predicted data is in the `Purchased` column, which shows whether or not the consumer purchased the product after seeing the ad for it.

Based on Module 10 Classification class notes, "Nearest Neighbor" section, write code to **show the unique values (categories) in `Purchased` along with the count of each category**.

In [None]:
ads.

We see that there are 2 categories in `Purchased`: 0 is for not purchased, and 1 is for purchased. This makes the prediction a binary classification, which means we can continue with our k nearest neighbor classification.
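As a small illustration of what "unique values with their counts" means, here is a sketch on made-up data. The Series `toy_purchased` and its values are hypothetical and are not taken from the lab dataset; `value_counts` and `nunique` are standard pandas methods, and the class notes may show a different approach.

```python
import pandas as pd

# Hypothetical toy data: 1 = purchased, 0 = not purchased
toy_purchased = pd.Series([0, 1, 0, 0, 1], name="Purchased")

# value_counts() lists each unique value with the number of times it appears
counts = toy_purchased.value_counts()
print(counts)

# Exactly two unique values means the prediction is a binary classification
print("Number of categories:", toy_purchased.nunique())
```

In this toy example the value 0 appears 3 times and the value 1 appears 2 times, so there are 2 categories.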

---

### Question 6

The prediction will be based on the age and salary of the consumers, and we want to see if there's a relationship between the 2 data sequences.

6a. Add code to the Code cell below to **plot the relationship or correlation between `Age` and `EstimatedSalary`**.

In [None]:
plt.figure(figsize=(4, 4))


plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.grid()
plt.show()

There doesn't seem to be any correlation between `Age` and `EstimatedSalary`, so using a regression method for prediction would not work.

But for classification, we would like to see whether there's any pattern in the plot when we separate the consumers into 2 groups: purchased or not purchased.

6b. Based on Module 10 Classification class notes, "Nearest Neighbor" section, **add code to do the following tasks**:
- Separate the `Purchased` data into 2 groups
- Create the scatterplot for each group.

In [None]:
plt.figure(figsize=(4, 4))

# separate Purchased data into groups

# scatterplot for each group


plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.grid()
plt.show()

With the addition of the `Purchased` groups, the plot looks promising for k nearest neighbor classification.

6c. Based on Module 10 Classification class notes, "Nearest Neighbor" section:
- a. What is the `Age` range and `EstimatedSalary` range where a consumer is likely to <u>not</u> purchase the product? Why?
- b. What is the `Age` range and `EstimatedSalary` range where a consumer is likely to purchase the product? Why?
- c. Would you use 1 nearest neighbor or k nearest neighbors to do the prediction? Why?

Your answer to the "Why?" question should refer to the nearest neighbors.

a. A consumer who likely will not purchase the product ...

b. A consumer who's likely to purchase the product ...

c. I would use ...

---

### Question 7

Suppose we have a consumer named Alice, who's 27 years old and has a salary of 110,000. Would she likely buy the product?

7a. Similar to Question 6b, **add code to group the `ads` data into 2 groups and display the scatterplot of each group.**

The Code cell below already has code to plot Alice's data in the scatterplot.

In [None]:
plt.figure(figsize=(4, 4))

# separate Purchased data into groups

# scatterplot for each group


# Alice's location
plt.scatter(27, 110000, color='purple')

plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.grid()
plt.show()

From the plot we see that there are 3 data points that are Alice's nearest neighbors.


7b. To find those 3 data points, we look for all rows of `ads` that have an `Age` between 20 and 30 and an `EstimatedSalary` between 100,000 and 120,000.

Add to the code below to **find these rows** and **store their _index_ values with the name `nearest_neighbor_index`**.

When `nearest_neighbor_index` is printed, there should be the index values of 3 rows, for the 3 data points closest to Alice.

In [None]:
nearest_neighbor_index =

print("Nearest neighbors:", nearest_neighbor_index)
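The kind of row filtering described in 7b can be sketched on made-up data. Everything in this example is hypothetical: the DataFrame `toy`, its values, and the name `toy_index` are for illustration only, and the range bounds are assumed to be inclusive.

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame({
    "Age": [22, 35, 28, 41],
    "EstimatedSalary": [105000, 110000, 60000, 115000],
})

# Combine conditions with & and wrap each condition in parentheses
in_range = (
    (toy["Age"] >= 20) & (toy["Age"] <= 30)
    & (toy["EstimatedSalary"] >= 100000) & (toy["EstimatedSalary"] <= 120000)
)

# .index gives the row labels of the rows that passed the filter
toy_index = toy[in_range].index
print("Matching index values:", list(toy_index))
```

Only the first toy row satisfies both ranges, so a single index value is printed. Keeping the index rather than the rows themselves makes it possible to look the same rows up again later with `.loc`.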

7c. Next we need to find the distance between those 3 data points and Alice so we can see which one is the nearest neighbor. To find the distance, we need to convert the data into standard units, since the `EstimatedSalary` data range is much larger than the `Age` range.

The `standard_units` function from Module 8 class notes is in the Code cell below. **Run the cell to let Python know about the function**.

In [None]:
def standard_units(array):
    return (array - np.mean(array))/np.std(array)
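To see what the function does, here is a minimal sketch that applies it to a small made-up array. The names `toy_ages` and `toy_su` are hypothetical, and the function is repeated so the example runs on its own.

```python
import numpy as np

def standard_units(array):
    return (array - np.mean(array)) / np.std(array)

# Hypothetical toy values
toy_ages = np.array([20, 30, 40])
toy_su = standard_units(toy_ages)
print(toy_su)

# After conversion, the mean is 0 and the standard deviation is 1
print("mean:", np.mean(toy_su), "std:", np.std(toy_su))
```

The middle value equals the mean, so it becomes 0 in standard units, and the other two values end up the same distance on either side of 0. Putting `Age` and `EstimatedSalary` on this common scale keeps the much larger salary numbers from dominating the distance.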

7d. First we change Alice's data to standard units, using the same formula as the `standard_units` function above.

**Run the Code cell** to change Alice's data to standard units.

In [None]:
alice_age = (27 - np.mean(ads['Age'])) / np.std(ads['Age'])
alice_salary = (110000 - np.mean(ads['EstimatedSalary'])) / np.std(ads['EstimatedSalary'])

7e. Based on Module 10 Classification class notes, "Nearest Neighbor" section, write code to **convert the `Age` and `EstimatedSalary` columns of `ads` to standard units**.

Then **display the first 5 rows** of `ads`.

In [None]:


print("First 5 rows in standard units:")
ads.head()

7f. Next **copy the code to display the scatterplot and Alice's data** from step 7a.<br>
**Change Alice's location to `alice_age` and `alice_salary`** in the code.<br>
Then **run the Code cell** to see that the plot has the same shape, but the data are now centered around 0.

In [None]:
# display the scatterplot with standard units
plt.figure(figsize=(4, 4))



plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.grid()
plt.show()

Now we're ready to find the distance between Alice and the 3 nearest neighbors.

7g. Based on Module 10 Classification class notes, "Distance Calculation" section, the following `find_distance` function calculates the distance between 2 data points.

```
def find_distance(point_1, point_2):
    run = point_1[0] - point_2[0]
    rise = point_1[1] - point_2[1]
    return np.sqrt(run**2 + rise**2)
```
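As a quick check of how `find_distance` behaves, the sketch below calls it on two made-up points that form a 3-4-5 right triangle. The points are hypothetical and chosen only because the answer is easy to verify by hand.

```python
import numpy as np

def find_distance(point_1, point_2):
    run = point_1[0] - point_2[0]
    rise = point_1[1] - point_2[1]
    return np.sqrt(run**2 + rise**2)

# run = -3 and rise = -4, so the distance is sqrt(9 + 16)
d = find_distance((0, 0), (3, 4))
print(d)  # 5.0
```

Because the run and rise are squared, the order of the two points does not change the result.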

**Add to the code below to change `find_distance` to `find_distance_to_Alice`**:
- `find_distance_to_Alice` accepts one row of the DataFrame.
- The `run` variable is the difference between `alice_age` and the row's `Age`.
- The `rise` variable is the difference between `alice_salary` and the row's `EstimatedSalary`.

**Run the Code cell** to let Python know about the new function.

In [None]:
def find_distance_to_Alice(row):
    run =
    rise =
    return np.sqrt(run**2 + rise**2)

7h. In the Code cell below, the 3 nearest neighbors to Alice, which are the rows with the index values in `nearest_neighbor_index` from Question 7b, are put in a new DataFrame called `nearest_neighbors`.

Based on Module 10 Classification class notes, "The Apply Function" section, **write code to do the following steps**:
- Use the `apply` function to run the `find_distance_to_Alice` function on each row of `nearest_neighbors`.
- Save the output as a new column of `nearest_neighbors`, name the column `distance`.

**Display `nearest_neighbors`**.
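The general pattern of applying a function to each row and saving the result as a new column can be sketched on made-up data. The DataFrame `toy`, its columns `x` and `y`, and the function `distance_to_origin` are all hypothetical, and the class notes may call `apply` in a slightly different form.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame({"x": [3.0, 0.0], "y": [4.0, 1.0]})

def distance_to_origin(row):
    # row is one row of the DataFrame; columns are accessed by name
    return np.sqrt(row["x"]**2 + row["y"]**2)

# axis=1 passes one row at a time to the function
toy["distance"] = toy.apply(distance_to_origin, axis=1)
print(toy)
```

The new `distance` column holds one value per row, which makes it easy to compare rows or sort by distance.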


In [None]:
nearest_neighbors = ads.loc[nearest_neighbor_index]



7i. Given the `nearest_neighbors` data above, do you predict that Alice will buy the product? Why do you think so?

**Enter your explanation** in the Text cell below.

Alice will ...