# Lab 1 - Math 178, Spring 2024

This lab is due Thursday night of Week 2. You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It's fine for everyone in the group to submit the same link.)

The goal of this lab is to produce a plot like what is shown in Figures 2.9, 2.10, 2.11, and 2.17 in the *Introduction to Statistical Learning with Applications in Python* textbook.

Put the full names of everyone in your group (even if you're working alone) here. This makes grading easier.

**Names**: Katie Kim, Shun Iwata

## Generate the data

Our true underlying function will be $f(x) = 3x^2$.

Create a 2000-by-2 pandas DataFrame with two columns, `"x"` and `"y"`.  The x-column should contain 2000 random values distributed uniformly between -5 and 5.  The y-column should be defined using $y = f(x) + \epsilon$, where $\epsilon$ represents Gaussian random noise with mean `0`.  You can experiment with different standard deviations for this random noise (to set the standard deviation, use the `scale` keyword argument in NumPy).

In [None]:
import pandas as pd
import numpy as np

In [None]:
f = lambda x: 3*x**2

In [None]:
std = 1.5
epsilon = np.random.normal(loc=0, scale=std, size=2000)
col_x = np.random.uniform(-5,5,2000)
col_y = f(col_x) + epsilon

In [None]:
d = {'x': col_x, 'y': col_y}
df = pd.DataFrame(data=d)

In [None]:
df

| | x | y |
|---|---|---|
| 0 | 2.896898 | 25.219962 |
| 1 | 0.220546 | 1.468329 |
| 2 | 4.717970 | 68.000708 |
| 3 | 1.487437 | 7.513712 |
| 4 | 3.867895 | 42.902198 |
| ... | ... | ... |
| 1995 | -1.658986 | 4.918336 |
| 1996 | -0.377381 | 1.080899 |
| 1997 | -0.609085 | 1.807661 |
| 1998 | 4.208295 | 53.293890 |
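
Because the cells above draw from NumPy's global random state, the numbers change on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed `0` and the name `df_seeded` are illustrative choices, not part of the lab:

```python
import numpy as np
import pandas as pd

# Assumption: the seed and the standard deviation are arbitrary illustrative values.
rng = np.random.default_rng(seed=0)
n_rows = 2000
std = 1.5

x = rng.uniform(-5, 5, size=n_rows)                # uniform draws between -5 and 5
noise = rng.normal(loc=0, scale=std, size=n_rows)  # Gaussian noise with mean 0
df_seeded = pd.DataFrame({"x": x, "y": 3 * x**2 + noise})
```

Rerunning this cell with the same seed should give the same DataFrame, which can make it easier to compare charts between group members.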


## Plot the data

Draw a scatter-plot of this data.  Chris recommends using Altair (and can best help if you use Altair), but you are welcome to use whatever you like, including Plotly, Seaborn, or Matplotlib.

In [None]:
import altair as alt

In [None]:
alt.Chart(df).mark_circle().encode(
    x = 'x',
    y = 'y'
)

## A function to compute train error and test error

Write a function `get_error` that takes three inputs, `train_size`, `k`, and `set_used`.  Descriptions of these arguments:
* `train_size` is the number of rows to use in the training set, given as an integer.  (The `train_test_split` function also accepts a decimal between `0` and `1`, but we want to specify the absolute number of rows.)
* `k` represents the number of neighbors to use.
* `set_used` will be the string `"train"` or `"test"`, and indicates whether we are computing the training error or the test error.
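
To illustrate the two ways `train_size` can be interpreted, here is a small sketch on a placeholder DataFrame (the name `toy` and its contents are made up for illustration). An integer is treated as a row count, while a float is treated as a fraction of the rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with 2000 rows; the values themselves do not matter here.
toy = pd.DataFrame({"x": np.arange(2000.0), "y": np.arange(2000.0)})

# Integer train_size: exactly 1500 training rows, the rest go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy["y"], train_size=1500, random_state=1)

# Float train_size: 75% of the rows, which here is also 1500.
X_tr_f, X_te_f, y_tr_f, y_te_f = train_test_split(
    toy[["x"]], toy["y"], train_size=0.75, random_state=1)

print(len(X_tr), len(X_te), len(X_tr_f), len(X_te_f))
```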

Within the function:
* Divide the data into a training set and a test set using `train_test_split`.  Be sure to choose the number of training rows using the `train_size` argument.
* Instantiate a KNN object from scikit-learn.  (Is this a regression problem or a classification problem?)
* Fit the object to the training data.  (Fitting to all the data or to the test data is a major mistake.)
* Compute the train mean-squared error or the test mean-squared error, according to the `set_used` argument.
* Return this MSE.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [None]:
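def get_error(train_size, k, set_used):
    # Split the global DataFrame `df`; an integer train_size is an absolute row count.
    X_train, X_test, y_train, y_test = train_test_split(
        df[['x']], df[['y']], train_size=train_size, random_state=1)

    # Fit a KNN regressor on the training rows only.
    neigh = KNeighborsRegressor(n_neighbors=k)
    neigh.fit(X_train, y_train)

    if set_used == "train":
        pred = neigh.predict(X_train)
        mse = mean_squared_error(y_train, pred)
    elif set_used == "test":
        pred = neigh.predict(X_test)
        mse = mean_squared_error(y_test, pred)
    else:
        # Without this branch, an unexpected value would fail at `return mse`.
        raise ValueError('set_used must be "train" or "test"')

    return mse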

## Plot the results

* Experiment with different values of `train_size` and `k` with the goal of making a plot similar to what is in Figure 2.17 in the textbook.  If you use Altair, you can have a log scale along the x-axis as shown here: https://altair-viz.github.io/gallery/line_with_log_scale.html (Warning: Deepnote does not have the latest version of Altair pre-installed, so you will probably need to use the attribute syntax, not the method syntax.)  For showing both curves together in Altair, I followed the IMDB example [here](https://altair-viz.github.io/user_guide/compound_charts.html#repeated-charts), but it might be simpler to just make the train curve and the test curve separately, and then layer them using `+`.
* Using a log scale is not a requirement, but in my case it made the charts look better.  
* Be sure to use k-inverse rather than k for the x-axis, so that more flexible values (where overfitting is more likely) occur to the right of the chart.  That is the general convention for these charts.
* If your chart doesn't look at least approximately like what is shown in Figure 2.17, try changing parameters (including the standard deviation of the error from the very beginning of this lab) or check for mistakes.

In [None]:
df_mse = pd.DataFrame(columns=["train_mse","test_mse"])
df_mse


| | train_mse | test_mse |
|---|---|---|


In [None]:
df_mse["train_mse"] = [get_error(1500, k, "train") for k in range(1,101)]
df_mse["test_mse"] = [get_error(1500, k, "test") for k in range(1,101)]

In [None]:
df_mse

| | train_mse | test_mse |
|---|---|---|
| 0 | 0.000000 | 4.146613 |
| 1 | 1.141556 | 3.350497 |
| 2 | 1.564466 | 3.064501 |
| 3 | 1.772812 | 2.850118 |
| 4 | 1.933264 | 2.739945 |
| ... | ... | ... |
| 95 | 4.424147 | 4.146434 |
| 96 | 4.473588 | 4.195996 |
| 97 | 4.537213 | 4.246613 |
| 98 | 4.595525 | 4.286344 |


In [None]:
df_mse['1/K'] = [1/(k+1) for k in df_mse.index]

In [None]:
a_df = pd.DataFrame(columns=["Error Rate","1/K", 'Error Type'])
b_df = a_df.copy()

In [None]:
a_df['Error Rate'] = df_mse['train_mse']
a_df['1/K'] = df_mse['1/K']
a_df['Error Type'] = ['Training Errors']*100

b_df['Error Rate'] = df_mse['test_mse']
b_df['1/K'] = df_mse['1/K']
b_df['Error Type'] = ['Test Errors']*100

In [None]:
err_df = pd.concat([a_df, b_df])

In [None]:
err_df

| | Error Rate | 1/K | Error Type |
|---|---|---|---|
| 0 | 0.000000 | 1.000000 | Training Errors |
| 1 | 1.141556 | 0.500000 | Training Errors |
| 2 | 1.564466 | 0.333333 | Training Errors |
| 3 | 1.772812 | 0.250000 | Training Errors |
| 4 | 1.933264 | 0.200000 | Training Errors |
| ... | ... | ... | ... |
| 95 | 4.146434 | 0.010417 | Test Errors |
| 96 | 4.195996 | 0.010309 | Test Errors |
| 97 | 4.246613 | 0.010204 | Test Errors |
| 98 | 4.286344 | 0.010101 | Test Errors |
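
The same long-format table can probably be built in one step with `DataFrame.melt` instead of two intermediate DataFrames. A minimal sketch, using only the first three rows of the `df_mse` output shown earlier; the names `wide` and `long` are hypothetical:

```python
import pandas as pd

# First three rows of the df_mse output shown earlier (k = 1, 2, 3).
wide = pd.DataFrame({
    "train_mse": [0.000000, 1.141556, 1.564466],
    "test_mse": [4.146613, 3.350497, 3.064501],
})
wide["1/K"] = [1 / (i + 1) for i in wide.index]

# Stack the two error columns into one, keeping 1/K as the identifier.
long = wide.melt(
    id_vars="1/K",
    value_vars=["train_mse", "test_mse"],
    var_name="Error Type",
    value_name="Error Rate",
)

# Replace the column names with the labels used in the chart legend.
long["Error Type"] = long["Error Type"].map(
    {"train_mse": "Training Errors", "test_mse": "Test Errors"})
```

One design consideration: `melt` does not depend on a hard-coded length such as `*100`, so it keeps working if the range of `k` values changes.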


In [None]:
alt.Chart(err_df).mark_line(point=True).encode(
    x = alt.X('1/K', scale=alt.Scale(type="log")),
    y = 'Error Rate',
    color = 'Error Type:N'
)



## Submission

* Using the `Share` button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.

## Possible extensions

These are not required; they are just some ideas for extra practice.

* Our chart is like the right-hand panel of Figures 2.9 to 2.11.  Can you also make the left-hand panel?
* Conceptually harder but also using the basic functionality of the `get_error` function: Can you make something like one of the charts shown in Figure 2.12?  I don't think this will be possible using the information in ISLP, so you will need to look up the definition of bias and variance somewhere else.  (They involve averaging over many choices of equal-sized training sets.  It will not be practical to use every possible choice of training set, because there are too many.)  I haven't tried this myself so I'm not sure how similar the outcome will be to what's shown in Figure 2.12.
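
For the bias and variance extension, one possible starting point is sketched below. It relies on the standard definitions at a fixed point $x_0$: the squared bias is $(E[\hat{f}(x_0)] - f(x_0))^2$ and the variance is $E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]$, where the expectation is over training sets. Several things here are assumptions rather than part of the lab: training sets are drawn fresh from the known generating process (instead of being subsampled from `df`), and the function name `bias_variance`, the evaluation grid, and the values of `n_train`, `n_reps`, and `seed` are illustrative choices. It is not claimed to reproduce Figure 2.12.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def f(x):
    # True underlying function from the lab.
    return 3 * x**2

def bias_variance(k, n_train=1500, n_reps=100, std=1.5, seed=0):
    """Monte Carlo estimate of squared bias and variance for KNN with k neighbors."""
    rng = np.random.default_rng(seed)
    x0 = np.linspace(-4, 4, 50).reshape(-1, 1)  # fixed evaluation points
    preds = np.empty((n_reps, len(x0)))
    for r in range(n_reps):
        # Draw a fresh training set of the same size each repetition.
        x = rng.uniform(-5, 5, size=(n_train, 1))
        y = f(x[:, 0]) + rng.normal(0, std, size=n_train)
        preds[r] = KNeighborsRegressor(n_neighbors=k).fit(x, y).predict(x0)
    mean_pred = preds.mean(axis=0)                     # estimate of E[f_hat(x0)]
    bias_sq = np.mean((mean_pred - f(x0[:, 0])) ** 2)  # averaged over the grid
    variance = np.mean(preds.var(axis=0))              # averaged over the grid
    return bias_sq, variance

bias_flexible, var_flexible = bias_variance(k=1)
bias_rigid, var_rigid = bias_variance(k=500)
print(bias_flexible, var_flexible, bias_rigid, var_rigid)
```

Calling `bias_variance` for a range of `k` values and plotting both outputs against `1/k` would give curves in the spirit of the bias and variance decomposition. Keep in mind that the squared-bias estimate contains Monte Carlo noise that only shrinks as `n_reps` grows.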
