# Lab 09 K-nearest neighbors and Multinomial regression!

Welcome to Lab!

In general, there are three types of machine learning: supervised learning, unsupervised learning, and semi-supervised learning. In this lab, we will focus on supervised learning, which is the most commonly used type of machine learning. It involves training a model on a labeled dataset, where the desired output is already known for each input. The model learns to map inputs to outputs by minimizing the difference between its predicted output and the actual output. The ultimate goal is to use the trained model to make predictions on new data where $y$ is unknown.

There are two main types of supervised learning: regression and classification. Regression involves predicting a continuous numerical value for a given input, for example, predicting the price of a house based on features such as location, square footage, and number of bedrooms. Classification, on the other hand, involves predicting a categorical label or class for a given input, for example, predicting whether an email is spam or not spam based on its content.

In this lab, you will:
- Use the [`K nearest neighbors regression`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
- Use the [`K nearest neighbors classifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
- Use [`multinomial regression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (Note: in sklearn they call it logistic regression whether there are 2 classes or more)


Please import specific components of `scikit-learn` when needed. For example: `from sklearn import neighbors`.

In [1]:
### standard imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from interactive2 import *

import warnings
warnings.filterwarnings('ignore')

## 1. Regression task
In this part of the lab, we will be using $K$-nearest neighbors to perform regression. Here are two interactive demos that you can use to understand $K$-nearest neighbors regression better. We have a 1-D example (as done in class) and also a 2-D example which is an extension of the 1-D case.

### Interactive $k$-nearest neighbors regression (with 1 covariate)
Remember from lectures that $k$-nearest neighbors regression (using 1 covariate) attempts to guess a continuous $y$ from (any) $x$ by **averaging** the $y$ values of the $k$ points nearest to it. Let's try to pick $k$ to get an idea of how **$k$-nearest neighbor regression** works.
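
The averaging rule can also be checked by hand. The following is a minimal sketch using only NumPy; the toy arrays `x_train` and `y_train` and the helper `knn_average` are made up for illustration and are not part of the lab's `interactive2` module.

```python
import numpy as np

# Hypothetical toy data: five training points with one covariate.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

def knn_average(x_new, k):
    """Predict y at x_new by averaging the y values of the k nearest x values."""
    dist = np.abs(x_train - x_new)       # distance from x_new to every training x
    nearest = np.argsort(dist)[:k]       # indices of the k smallest distances
    return y_train[nearest].mean()       # average their y values

pred_k1 = knn_average(2.2, k=1)  # only x=2.0 is used, so the prediction is 3.0
pred_k3 = knn_average(2.2, k=3)  # x=2.0, 3.0, 1.0 are used: (3 + 2 + 1) / 3 = 2.0
pred_k5 = knn_average(2.2, k=5)  # every point is used, so this is the overall mean
print(pred_k1, pred_k3, pred_k5)
```

Notice that with $k$ equal to the number of training points, the prediction no longer depends on $x$ at all. The same effect should be visible in the demo when the slider for $k$ is pushed to a large value.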

In [2]:
knn_1d_reg().interact()


### Interactive $k$-nearest neighbors regression (with 2 covariates)
Remember from lectures that $k$-nearest neighbors regression (using 2 covariates) attempts to guess a continuous $y$ from (any) $x_1$ and $x_2$ by **averaging** the $k$ points nearest to it. Let's try to pick $k$ to get an idea of how **$k$-nearest neighbor regression** works in 2D. 
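
With 2 covariates, "nearest" is measured in the $(x_1, x_2)$ plane. Here is a small sketch, assuming a made-up DataFrame with columns `x1`, `x2`, and `y`, that fits `KNeighborsRegressor` and compares its prediction with an average computed by hand using straight-line (Euclidean) distance, which is the default distance in scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data with two covariates.
df = pd.DataFrame({
    "x1": [0.0, 1.0, 0.0, 1.0, 5.0],
    "x2": [0.0, 0.0, 1.0, 1.0, 5.0],
    "y":  [1.0, 2.0, 3.0, 4.0, 10.0],
})
X = df[["x1", "x2"]]                      # double brackets keep X as a DataFrame

model = KNeighborsRegressor(n_neighbors=3).fit(X, df["y"])
new_point = pd.DataFrame({"x1": [0.2], "x2": [0.1]})
pred = model.predict(new_point)[0]

# The same prediction by hand: Euclidean distance, then average the 3 nearest y values.
dist = np.sqrt(((X.to_numpy() - new_point.to_numpy()) ** 2).sum(axis=1))
manual = df["y"].to_numpy()[np.argsort(dist)[:3]].mean()
print(pred, manual)                       # both average the y values 1, 2, and 3
```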

In [3]:
knn_2d_reg().interact()


### 1.1 Get some data

This lab will be a little different in that you will be getting your own data! The University of California, Irvine has a very nice [repository of datasets](https://archive.ics.uci.edu/datasets) that you can look through. Find a dataset that interests you and that you can perform regression on (Hint: use the filters to narrow down your options). A dataset in .csv format is preferred since you already know how to load those, but I won't stop you if you want to try loading other formats. In fact, some datasets have a .data extension (or something similar), but when you open them in Notepad (Windows) or TextEdit (Mac) you will see that they are in CSV format. The only thing to be careful of is that some datasets are separated not by commas but by other characters, such as semicolons (`';'`), spaces (`' '`), or tabs (`'\t'`). To load these datasets, use the `delimiter` parameter in your pandas function call. **Do not pick the same dataset as your neighbor!**
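
As a rough illustration of the delimiter issue, the sketch below reads a small semicolon-separated table from an in-memory string. The column names and values are invented; with a real file you would pass the file path to `pd.read_csv` instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated, with a header row.
raw_text = "height;weight;age\n1.70;65.2;31\n1.82;80.1;45\n1.65;58.9;27\n"

# Without a delimiter, pandas looks for commas and finds a single wide column.
wrong = pd.read_csv(io.StringIO(raw_text))
print(wrong.shape)

# With delimiter=";", the three columns are split correctly.
df = pd.read_csv(io.StringIO(raw_text), delimiter=";")
print(df.shape)
print(df.head())
```

If the loaded DataFrame has only one column, a wrong delimiter is a likely cause. Some files also have no header row; in that case `pd.read_csv` accepts `header=None` together with `names=[...]` to supply column names yourself.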


<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 1.** 
1. Load in the dataset and get some summary statistics for each variable.
2. Additionally, use `df.corr` to see if any features are correlated with the response variable.
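
For reference, here is a small sketch of what these two calls produce on a made-up table (not your dataset). The column names are invented, and `price` plays the role of the response variable. The `numeric_only=True` argument, available in recent pandas versions, tells `corr` to skip text columns such as `city`.

```python
import pandas as pd

# Hypothetical toy data; "price" is the response variable here.
df = pd.DataFrame({
    "sqft":  [800, 1000, 1200, 1500, 2000],
    "age":   [40, 35, 20, 10, 5],
    "city":  ["A", "B", "A", "B", "A"],
    "price": [100, 130, 160, 210, 280],
})

summary = df.describe()                      # count, mean, std, min, quartiles, max
corr_with_y = df.corr(numeric_only=True)["price"].drop("price")
print(summary)
print(corr_with_y)                           # one correlation per numeric feature
```

Values near 1 or -1 suggest a strong linear relationship with the response, while values near 0 suggest a weak linear one (a nonlinear relationship may still exist).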

In [4]:
#your answer to Q1.1
...

In [5]:
#your answer to Q1.2
...

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 2.** Plot some of the variables against the response variable $y$ and choose 3 that you think can help you predict $y$.

In [6]:
#your answer to Q2
...

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 3.** Use [`K nearest neighbors regression`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) on one of your 3 variables to predict $y$. Remember that $x$ should be a DataFrame and not a Series for sklearn models to work!
1. Make a scatter plot of $x$ vs $y$ like you did in the previous question, but this time add the $K$-nearest neighbor prediction line. How do you do this?
   - Define a new variable `x_line` that spans your entire plot. For example, if your x-axis goes from -10 to 10, `x_line` should be an array-like object that looks like `[-10, -9.8, -9.6, ..., 9.6, 9.8, 10]`.
   - Then define `y_line` to be the model's predictions for each value in `x_line` using your $K$-NN model
   - Plot `x_line` vs `y_line` on the same plot as your data
   - Adjust $K$ to get the best looking line (you can eyeball it)
2. Repeat for the second of your 3 variables
3. Repeat for the last variable
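
The `x_line` / `y_line` steps can be sketched as follows on synthetic data. Everything here is an assumption for illustration: the data is randomly generated, the column is simply called `x`, and $K=5$ is an arbitrary choice. Your own column names, axis range, and $K$ will differ.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a real dataset: one covariate and a noisy response.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-10, 10, size=100)})
df["y"] = np.sin(df["x"] / 2) + rng.normal(0, 0.2, size=100)

X = df[["x"]]                                   # double brackets give a DataFrame
model = KNeighborsRegressor(n_neighbors=5).fit(X, df["y"])

# x_line spans the plot in steps of 0.2; it uses the same column name as X.
x_line = pd.DataFrame({"x": np.linspace(-10, 10, 101)})
y_line = model.predict(x_line)                  # one prediction per x_line value

fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"], s=10, label="data")
ax.plot(x_line["x"], y_line, color="red", label="5-NN prediction")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Because each prediction is an average of observed $y$ values, the line can never go above the largest or below the smallest $y$ in the data.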

In [7]:
#your answer to Q3.1
...

In [8]:
#your answer to Q3.2
...

In [9]:
#your answer to Q3.3
...

## 2. Classification task

In this part of the lab, we will be using $K$-nearest neighbors and multinomial regression to perform classification. Here are four interactive demos that you can use to understand $K$-nearest neighbors classification and multinomial regression better. Just as with regression, we also have 2-D examples here, which are extensions of the 1-D case.

### Interactive $k$-nearest neighbors classification (with 1 covariate)
Remember from lectures that $k$-nearest neighbors classification (using 1 covariate) attempts to guess a categorical $y$ from (any) $x$ by taking the class with the most **votes** among the $k$ points nearest to it. Let's try to pick $k$ to get an idea of how **$k$-nearest neighbor classification** works.
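
The voting rule can be written out by hand in a few lines. This is a minimal sketch with invented toy data and a hypothetical helper `knn_vote`. It does not try to reproduce how scikit-learn breaks ties, so the example uses an odd $k$ with two classes to avoid tied votes.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: one covariate and two classes.
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

def knn_vote(x_new, k):
    """Predict the class at x_new by majority vote among the k nearest points."""
    dist = np.abs(x_train - x_new)            # distance to every training x
    nearest = np.argsort(dist)[:k]            # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]         # label with the most votes

pred_left = knn_vote(4.0, k=3)    # neighbors are x=3, 2, 6: two votes "a", one vote "b"
pred_right = knn_vote(6.5, k=3)   # neighbors are x=6, 7, 8: three votes "b"
print(pred_left, pred_right)
```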

In [10]:
knn_1d_class().interact()


### Interactive $k$-nearest neighbors classification (with 2 covariates)
Remember from lectures that $k$-nearest neighbors classification (using 2 covariates) attempts to guess a categorical $y$ from (any) $x_1$ and $x_2$ by taking the most **votes** of the $k$ points nearest to it. Let's try to pick $k$ to get an idea of how **$k$-nearest neighbor classification** works in 2D. 
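
In scikit-learn, the vote shares behind a prediction can be inspected with `predict_proba`. The sketch below uses a made-up table with columns `x1`, `x2`, and `label`; with the default settings of `KNeighborsClassifier`, each of the $k$ neighbors gets one equal vote.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two covariates and two classes.
df = pd.DataFrame({
    "x1":    [0.0, 1.0, 0.0, 4.0, 5.0, 4.0],
    "x2":    [0.0, 0.0, 1.0, 4.0, 4.0, 5.0],
    "label": ["red", "red", "red", "blue", "blue", "blue"],
})

model = KNeighborsClassifier(n_neighbors=3).fit(df[["x1", "x2"]], df["label"])

new_point = pd.DataFrame({"x1": [2.2], "x2": [2.0]})
pred = model.predict(new_point)[0]              # class with the most votes
vote_share = model.predict_proba(new_point)[0]  # fraction of votes per class
print(model.classes_)                           # the order of the columns in vote_share
print(pred, vote_share)
```

Here the three nearest points are two red points and one blue point, so red wins with a 2/3 share of the votes.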

In [11]:
knn_2d_class().interact()


### Interactive multinomial regression (with 1 covariate)
Remember from lectures that the formula for multinomial regression with 1 covariate and 3 classes is:
$$
f(x, 0) = \frac{1}{1 + e^{c_0x + b_0} + e^{c_1x + b_1}} \approx \text{Probability that } y=0;
$$
$$
f(x, 1) = \frac{e^{c_0x + b_0}}{1 + e^{c_0x + b_0} + e^{c_1x + b_1}}  \approx \text{Probability that } y=1;
$$
$$
f(x, 2) = \frac{e^{c_1x + b_1}}{1 + e^{c_0x + b_0} + e^{c_1x + b_1}}  \approx \text{Probability that } y=2;
$$
where the second argument (the 0, 1, and 2) specifies which class we want to get the probability for. Let's try to pick $c_0, c_1, b_0$ and $b_1$ manually to get an idea of how **multinomial regression** works. When you actually use `sklearn` this is done automatically using built-in mathematics (which is based on multivariate calculus and linear algebra).
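
To see the three formulas in action, here is a direct NumPy translation. The function name `multinomial_probs` and the parameter values are chosen only for illustration; the demo's sliders play the same role as the arguments `c0`, `b0`, `c1`, and `b1`.

```python
import numpy as np

def multinomial_probs(x, c0, b0, c1, b1):
    """Return [f(x, 0), f(x, 1), f(x, 2)] from the formulas above."""
    e0 = np.exp(c0 * x + b0)                 # numerator for class 1
    e1 = np.exp(c1 * x + b1)                 # numerator for class 2
    denom = 1.0 + e0 + e1                    # shared denominator
    return np.array([1.0 / denom, e0 / denom, e1 / denom])

# At x = 0 with b0 = b1 = 0, both exponentials equal 1, so each class gets 1/3.
probs_at_0 = multinomial_probs(x=0.0, c0=1.0, b0=0.0, c1=-1.0, b1=0.0)

# At x = 2, e^{c0 x} is large and e^{c1 x} is small, so class 1 is most likely.
probs_at_2 = multinomial_probs(x=2.0, c0=1.0, b0=0.0, c1=-1.0, b1=0.0)

print(probs_at_0, probs_at_0.sum())
print(probs_at_2, probs_at_2.argmax())
```

Whatever values are chosen for the four parameters, the three outputs are positive and add up to 1, which is what lets us read them as probabilities.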

In [12]:
multinomial1().interact()


### Interactive multinomial regression (with 2 covariates)
Remember from lectures that the formula for multinomial regression with 2 covariates and 3 classes is:
$$
f(x_1, x_2, 0) = \frac{1}{1 + e^{d_0x_1 + c_0x_2 + b_0} + e^{d_1x_1 + c_1x_2 + b_1}} \approx \text{Probability that } y=0;
$$
$$
f(x_1, x_2, 1) = \frac{e^{d_0x_1 + c_0x_2 + b_0}}{1 + e^{d_0x_1 + c_0x_2 + b_0} + e^{d_1x_1 + c_1x_2 + b_1}}  \approx \text{Probability that } y=1;
$$
$$
f(x_1, x_2, 2) = \frac{e^{d_1x_1 + c_1x_2 + b_1}}{1 + e^{d_0x_1 + c_0x_2 + b_0} + e^{d_1x_1 + c_1x_2 + b_1}}  \approx \text{Probability that } y=2;
$$
where the third argument (the 0, 1, and 2) specifies which class we want to get the probability for. Let's try to pick $d_0, d_1, c_0, c_1, b_0$ and $b_1$ manually to get an idea of how **multinomial regression** works in 2D. When you actually use `sklearn` this is done automatically using built-in mathematics (which is based on multivariate calculus and linear algebra).
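
The sketch below connects these formulas to `sklearn`'s `LogisticRegression`, under a few stated assumptions. The data is synthetic (three random clusters), and the default solver is used. For three classes, that solver fits one row of coefficients per class rather than treating class 0 as the baseline. Subtracting class 0's row from the others should recover parameters that play the role of $d_0, d_1, c_0, c_1, b_0, b_1$ above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: three overlapping clusters, 30 points each, labeled 0, 1, 2.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(center, 1.0, size=(30, 2)) for center in centers])
y = np.repeat([0, 1, 2], 30)

model = LogisticRegression(max_iter=1000).fit(X, y)
W, a = model.coef_, model.intercept_         # shapes (3, 2) and (3,)

# Re-express relative to class 0 so the parameters match the formulas above.
d = W[1:, 0] - W[0, 0]                       # (d_0, d_1): coefficients of x_1
c = W[1:, 1] - W[0, 1]                       # (c_0, c_1): coefficients of x_2
b = a[1:] - a[0]                             # (b_0, b_1): intercepts

def f(x1, x2):
    """Return [f(x1, x2, 0), f(x1, x2, 1), f(x1, x2, 2)] from the formulas."""
    e = np.exp(d * x1 + c * x2 + b)          # the two exponentials
    denom = 1.0 + e.sum()
    return np.array([1.0, e[0], e[1]]) / denom

manual = f(1.0, 1.0)
from_sklearn = model.predict_proba(np.array([[1.0, 1.0]]))[0]
print(manual)
print(from_sklearn)                          # should agree with the manual values
```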

In [13]:
multinomial2().interact()


### 2.1 Get some data

Find another dataset that you are interested in which you can perform classification on.

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 4.** Load in the dataset and get some summary statistics for each variable.

In [14]:
#your answer to Q4
...

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 5.** You will now need to plot 2 variables against each other on a scatter plot, with the points colored differently for each class in $y$. Do this for variables that you think separate the classes well. (Hint: If you don't remember how to do this, we did this in Demo 08 in class)

In [15]:
#your answer to Q5
...

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 6.** 
1. Perform [`K nearest neighbors classification`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) on these two variables to try to predict $y$
2. Calculate your training accuracy by writing a function with a loop in it to calculate how many points are correctly classified

In [16]:
#your answer to Q6.1
...

In [17]:
#your answer to Q6.2
...

<div style="background-color:rgba(0, 255, 0, 0.15);">

**Question 7.** Perform [`multinomial regression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instead of $K$-nearest neighbors and repeat Q6.1 and Q6.2. (Hint: if you wrote your function well for 6.2, you can just call it here instead of defining a new one)

In [18]:
#your answer to Q7.1
...

In [19]:
#your answer to Q7.2
...

Congratulations, you're done with the lab! Be sure to:

* **Save** from the File menu,
* **Review** the lab so that you understand each line,
* **Shut down your kernel** from the Kernel menu,
* **Rename your ipynb file**, replacing LASTNAMES with your last names,
* **Get your file ready** for grading during the next lab session.