# Coding Assignment 3

CS 598 Practical Statistical Learning

2023-10-09

UIUC Fall 2023

**Authors**
* Ryan Fogle
    - rsfogle2@illinois.edu
    - UIN: 652628818
* Sean Enright
    - seanre2@illinois.edu
    - UIN: 661791377

**Contributions**

Sean contributed to Parts I and II and reviewed Part III. Ryan contributed to Part III and reviewed Parts I and II.

## Part I: Optimal span for LOESS

Here we implement LOO-CV and GCV to select the optimal span for LOESS.

In [None]:
# General imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# Part 1 imports
from csaps import csaps
from skmisc.loess import loess

Next, we define `onestep_cv()`, which calculates the LOO-CV and GCV values for a given span value, and `find_cv_vals()`, which returns these for all provided span values.

In [None]:
def onestep_cv(x, y, sp):
    """Calculate the LOO-CV and GCV for a given span"""
    # 1) Fit a LOESS model y ~ x with the given span and compute the
    #    fitted values, from which the residual vector is formed
    loess_fit = loess(x, y, span=sp)
    y_hat = loess_fit.predict(x).values
    # 2) Extract the diagonal entries of the smoother matrix S
    s_ii = loess_fit.outputs.diagonal
    # 3) Compute LOO-CV and GCV
    # LOOCV
    loocv = np.mean(np.power((y - y_hat) / (1 - s_ii), 2))
    # GCV
    m = np.mean(s_ii)
    gcv = np.mean(np.power((y - y_hat) / (1 - m), 2))
    return loocv, gcv

def find_cv_vals(x, y, span):
    """Find LOO-CV and GCV for all provided span values"""
    m = len(span)
    cv = np.zeros(m)
    gcv = np.zeros(m)

    for i in range(m):
        cv_i, gcv_i = onestep_cv(x, y, span[i])
        cv[i] = cv_i
        gcv[i] = gcv_i
    return cv, gcv
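The shortcut formulas used in `onestep_cv()` avoid refitting the model once per observation. As a minimal sketch of why this works, the example below compares the shortcut LOO-CV against a brute-force leave-one-out loop. It assumes synthetic data and uses a cubic polynomial least-squares fit as a stand-in linear smoother (not LOESS), because the identity is exact in that case. All names here (`x_demo`, `S_mat`, and so on) are hypothetical and not part of the assignment code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x_demo = np.sort(rng.uniform(0, 1, n))
y_demo = np.sin(4 * x_demo) + rng.normal(scale=0.3, size=n)

# Cubic polynomial design matrix: a simple linear smoother standing in for LOESS
X_poly = np.vander(x_demo, 4, increasing=True)

# Smoother ("hat") matrix S, so that y_hat = S @ y
S_mat = X_poly @ np.linalg.solve(X_poly.T @ X_poly, X_poly.T)
y_hat = S_mat @ y_demo
s_ii = np.diag(S_mat)

# Shortcut formulas, in the same form as onestep_cv()
loocv_shortcut = np.mean(((y_demo - y_hat) / (1 - s_ii)) ** 2)
gcv_demo = np.mean(((y_demo - y_hat) / (1 - s_ii.mean())) ** 2)

# Brute-force LOO-CV: refit n times, holding out one point each time
errs = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X_poly[keep], y_demo[keep], rcond=None)
    errs[i] = (y_demo[i] - X_poly[i] @ beta_i) ** 2
loocv_brute = errs.mean()

print(f"LOO-CV (shortcut):    {loocv_shortcut:.6f}")
print(f"LOO-CV (brute force): {loocv_brute:.6f}")
print(f"GCV:                  {gcv_demo:.6f}")
```

GCV replaces each individual leverage $s_{ii}$ with the average leverage, so it generally differs slightly from LOO-CV while following a similar trend.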

Next, we determine the span values that produce the lowest LOO-CV and GCV errors.

In [None]:
# https://liangfgithub.github.io/Data/Coding3_Data.csv
data_part1 = pd.read_csv("Coding3_Data.csv")
span_vec = np.linspace(0.2, 0.9, 15)

# Find optimal span by LOOCV and GCV
loo, gcv = find_cv_vals(data_part1["x"], data_part1["y"], span_vec)

# Display table of CV results
print("Span    LOOCV   GCV")
for s, l, g in zip(span_vec, loo, gcv):
    print(f"{s:.2f}\t{l:.3f}\t{g:.3f}")

The span optimization results are presented in the chart below.

In [None]:
sns.set()
mpl.rcParams['figure.dpi'] = 300
plt.scatter(span_vec, loo, color="darkorange", s=5, label="LOO-CV")
plt.plot(span_vec, loo, color="orange", alpha=1, linestyle="dotted")
plt.scatter(span_vec, gcv, color="blue", s=5, label="GCV")
plt.plot(span_vec, gcv, color="lightblue", alpha=0.75, linestyle="--")
plt.xlabel("Span")
plt.ylabel("CV Error")
plt.title("Span vs CV Error")
_ = plt.legend()

For this dataset and choice of span values, the best span value selected by LOOCV and GCV is the same.

In [None]:
# Select the span value with the lowest CV error
span_loocv = span_vec[np.argmin(loo)]
span_gcv = span_vec[np.argmin(gcv)]

print(f"Span by LOO-CV: {span_loocv}")
print(f"   Span by GCV: {span_gcv}")

The true curve is defined below.

In [None]:
def f(x):
    return np.sin(12 * (x + 0.2)) / (x + 0.2)

fx = np.linspace(min(data_part1["x"]), max(data_part1["x"]), 1001)
fy = f(fx)

Finally, we compare the LOESS curve with LOO-CV and GCV optimized span to the true curve.

In [None]:
y_loess = loess(data_part1["x"], data_part1["y"], span=span_loocv).predict(fx).values

sns.set()
mpl.rcParams['figure.dpi'] = 300
plt.scatter(data_part1["x"], data_part1["y"], color="red", s=6)
plt.plot(fx, fy, color="gray", linewidth=1, label="True Function")
plt.plot(fx, y_loess, color="blue", linewidth=1, linestyle="--", label="LOESS Fit")
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")

## Part II: Clustering time series

In this exercise, we cluster time series data, comparing the results of clustering with and without natural cubic splines.



The following imports are specific to Part II.

In [None]:
from numpy.linalg import inv
from scipy.interpolate import splev
from sklearn.cluster import KMeans

A random seed is set to ensure repeatability. The seed is the sum of the last four digits of our UINs. We add 1 to the sum to give a more favorable seed.

In [None]:
# Set random seed to the sum of the last four digits of our UINs, plus 1
np.random.seed(8818 + 1377 + 1)

In this part we use the Sales_Transactions_Dataset_Weekly dataset from the UCI Machine Learning Repository. After reading the dataset file, we select the 52 weekly sales columns and center each time series by subtracting its row mean, resulting in $\textbf{X}_{811 \times 52}$.

In [None]:
# https://archive.ics.uci.edu/dataset/396/sales+transactions+dataset+weekly
X = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv",
                         index_col=0, usecols=range(53))
# Center each time series, i.e., subtract each row's mean from that row
X = X.sub(X.mean(axis=1), axis=0).to_numpy()
X.shape

The time series features are simply a vector of indices corresponding to the weeks of the year.

In [None]:
t = np.arange(start=1, stop=53)
t.shape

Here, a Python implementation of R's `splines::ns` is provided. It will be used to generate a natural cubic spline basis function matrix. It was copied from [the course website](https://liangfgithub.github.io/Python_W5_RegressionSpline.html).

In [None]:
# ref: https://liangfgithub.github.io/Python_W5_RegressionSpline.html
# converted from R's ns()
def ns(x, df=None, knots=None, boundary_knots=None, include_intercept=False):
    degree = 3
    
    if boundary_knots is None:
        boundary_knots = [np.min(x), np.max(x)]
    else:
        boundary_knots = np.sort(boundary_knots).tolist()

    oleft = x < boundary_knots[0]
    oright = x > boundary_knots[1]
    outside = oleft | oright
    inside = ~outside

    if df is not None:
        nIknots = df - 1 - include_intercept
        if nIknots < 0:
            nIknots = 0
            
        if nIknots > 0:
            knots = np.linspace(0, 1, num=nIknots + 2)[1:-1]
            knots = np.quantile(x[~outside], knots)

    Aknots = np.sort(np.concatenate((boundary_knots * 4, knots)))
    n_bases = len(Aknots) - (degree + 1)

    if any(outside):
        basis = np.empty((x.shape[0], n_bases), dtype=float)
        e = 1 / 4 # in theory anything in (0, 1); was (implicitly) 0 in R <= 3.2.2

        if any(oleft):
            k_pivot = boundary_knots[0]
            xl = x[oleft] - k_pivot
            xl = np.c_[np.ones(xl.shape[0]), xl]

            # equivalent to splineDesign(Aknots, rep(k.pivot, ord), ord, derivs)
            tt = np.empty((xl.shape[1], n_bases), dtype=float)
            for j in range(xl.shape[1]):
                for i in range(n_bases):
                    coefs = np.zeros((n_bases,))
                    coefs[i] = 1
                    tt[j, i] = splev(k_pivot, (Aknots, coefs, degree), der=j)

            basis[oleft, :] = xl @ tt

        if any(oright):
            k_pivot = boundary_knots[1]
            xr = x[oright] - k_pivot
            xr = np.c_[np.ones(xr.shape[0]), xr]

            tt = np.empty((xr.shape[1], n_bases), dtype=float)
            for j in range(xr.shape[1]):
                for i in range(n_bases):
                    coefs = np.zeros((n_bases,))
                    coefs[i] = 1
                    tt[j, i] = splev(k_pivot, (Aknots, coefs, degree), der=j)
                    
            basis[oright, :] = xr @ tt
        
        if any(inside):
            xi = x[inside]
            tt = np.empty((len(xi), n_bases), dtype=float)
            for i in range(n_bases):
                coefs = np.zeros((n_bases,))
                coefs[i] = 1
                tt[:, i] = splev(xi, (Aknots, coefs, degree))

            basis[inside, :] = tt
    else:
        basis = np.empty((x.shape[0], n_bases), dtype=float)
        for i in range(n_bases):
            coefs = np.zeros((n_bases,))
            coefs[i] = 1
            basis[:, i] = splev(x, (Aknots, coefs, degree))

    const = np.empty((2, n_bases), dtype=float)
    for i in range(n_bases):
        coefs = np.zeros((n_bases,))
        coefs[i] = 1
        const[:, i] = splev(boundary_knots, (Aknots, coefs, degree), der=2)

    if include_intercept is False:
        basis = basis[:, 1:]
        const = const[:, 1:]

    qr_const = np.linalg.qr(const.T, mode='complete')[0]
    basis = (qr_const.T @ basis.T).T[:, 2:]

    return basis

We use `ns` to generate the basis function matrix $\textbf{F}_{52 \times 9}$ and remove the intercept by centering its columns by their means.

In [None]:
F_mat = ns(t, df=9, include_intercept=False)
F_mat = F_mat - F_mat.mean(axis=0)[np.newaxis, :]
F_mat.shape

With $\textbf{F}$ and $\textbf{X}$, we can calculate a matrix of spline coefficients for every observation, $\textbf{B}_{811 \times 9}$. This is found by the following formula, which is implemented below.

$$
\textbf{B}^{\top} = (\textbf{F}^{\top} \ \textbf{F})^{-1} \ \textbf{F}^{\top} \textbf{X}^{\top}
$$

In [None]:
B = inv(F_mat.T @ F_mat) @ F_mat.T @ X.T
B = B.T
B.shape
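The formula above is an ordinary least-squares fit of every row of $\textbf{X}$ on the columns of $\textbf{F}$. As a hedged sketch, the same coefficients can be obtained with `np.linalg.lstsq`, which avoids forming an explicit inverse. The matrices below are random stand-ins with the same column counts as $\textbf{F}$ and $\textbf{X}$. The names `F_demo` and `X_demo` are hypothetical and are not the assignment data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins: 52 time points, 9 basis functions, 20 series
F_demo = rng.normal(size=(52, 9))
F_demo = F_demo - F_demo.mean(axis=0)  # center columns, as done for F_mat
X_demo = rng.normal(size=(20, 52))

# Normal-equations form, matching the formula for B^T
B_inv = (np.linalg.inv(F_demo.T @ F_demo) @ F_demo.T @ X_demo.T).T

# Least-squares solver: each column of X_demo.T is treated as a response
B_lstsq = np.linalg.lstsq(F_demo, X_demo.T, rcond=None)[0].T

print(B_inv.shape, B_lstsq.shape)
print(np.allclose(B_inv, B_lstsq))
```

For a well-conditioned basis matrix the two approaches agree to numerical precision. The solver-based form is usually the safer choice when the columns of the basis are close to collinear.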

### Clustering using Matrix $\textbf{B}$

Here we cluster by $\textbf{B}$, the spline coefficients for each observation.

In [None]:
n_clusters = 6
n_row = 2
n_col = 3

km_B = KMeans(n_clusters=n_clusters, n_init=10).fit(B)
centers_B = F_mat @ km_B.cluster_centers_.T  

fig, axs = plt.subplots(nrows=2, ncols=3, dpi=300,
                        sharex="all", sharey="all")
for i in range(n_row):
    for j in range(n_col):
        series = X[km_B.labels_ == i * n_col + j, :]
        for k in range(series.shape[0]):
            axs[i, j].plot(t, series[k], color="darkgrey", linewidth=0.75)
        axs[i, j].plot(t, centers_B[:, i * n_col + j], color="red", linewidth=0.75)
        axs[i, j].set_xlim([1, 52])
        axs[i, j].set_ylim([-30, 30])
        axs[i, j].set_xticks(range(0, 52, 10))
        axs[i, j].set_yticks(np.linspace(-30, 30, 7))
        axs[i, j].set_xlabel("Weeks")
        axs[i, j].set_ylabel("Weekly Sales")

### Clustering using Matrix $\textbf{X}$

By comparison, we cluster by $\textbf{X}$, the raw time series data. The centers are noticeably less smooth than the NCS-clustered centers from matrix $\textbf{B}$.

In [None]:
km_X = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
centers_X = km_X.cluster_centers_.T

fig, axs = plt.subplots(nrows=2, ncols=3, dpi=300,
                        sharex="all", sharey="all")
for i in range(n_row):
    for j in range(n_col):
        series = X[km_X.labels_ == i * n_col + j, :]
        for k in range(series.shape[0]):
            axs[i, j].plot(t, series[k], color="darkgrey", linewidth=0.75)
        axs[i, j].plot(t, centers_X[:, i * n_col + j], color="red", linewidth=0.75)
        axs[i, j].set_xlim([1, 52])
        axs[i, j].set_ylim([-30, 30])
        axs[i, j].set_xticks(range(0, 52, 10))
        axs[i, j].set_yticks(np.linspace(-30, 30, 7))
        axs[i, j].set_xlabel("Weeks")
        axs[i, j].set_ylabel("Weekly Sales")

## Part III: Ridgeless and double descent

In [None]:
from sklearn.model_selection import train_test_split

# Set random seed to the sum of the last four digits of our UINs, plus 1
np.random.seed(8818 + 1377 + 1)

# read in data
df = pd.read_csv('Coding3_dataH.csv', header=None)

### Task 1: Ridgeless Function

To calculate the beta coefficients, we can simplify OLS using the properties of the SVD.

OLS has a closed-form solution, which can be found with linear algebra by solving the following system for $\hat{\beta}$ in the least-squares sense:
$$
y = F \hat{\beta}
$$

Multiply both sides by $F^T$:
$$
F^T y = F^T F \hat{\beta}
$$

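From the SVD $X = U D V^T$, where the columns of $U$ are orthonormal ($U^T U = I$), the design matrix of principal components $F = X V$ satisfies: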
$$
F = U D
$$

We can now simplify $F^T F$, using $U^T U = I$:
$$
F^T F = (U D)^T (U D) = D^T U^T U D = D^T D
$$

Now we have:
$$
F^T y = (D^T D) \hat{\beta}
$$

Then finally:
$$
\hat{\beta} = (D^T D)^{-1} F^T y
$$

Since $D$ is a diagonal matrix, $D^T D$ is a diagonal matrix whose entries are the squares of the diagonal entries of $D$. The inverse of a diagonal matrix with diagonal entries $d_{ii}$ is the diagonal matrix with entries $1/d_{ii}$. So we can compute $\hat{\beta}$ with simple matrix multiplication.
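The derivation above can be checked numerically. The sketch below, on synthetic data with hypothetical names (`X_demo`, `y_c`, `beta_shortcut`), confirms that $F^T F$ is diagonal with entries $d_{ii}^2$ and that the shortcut coefficients match a generic least-squares solver. It assumes a full-rank, column-centered design matrix with more rows than columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X_demo = rng.normal(size=(n, p))
X_demo = X_demo - X_demo.mean(axis=0)  # center each column
y_demo = rng.normal(size=n)
y_c = y_demo - y_demo.mean()           # center the response

# Thin SVD: X = U diag(S) Vh
U, S, Vh = np.linalg.svd(X_demo, full_matrices=False)
F = U @ np.diag(S)                     # equals X_demo @ Vh.T

# F^T F should be diagonal, with the squared singular values on the diagonal
gram_is_diag = np.allclose(F.T @ F, np.diag(S**2))

# Shortcut: beta = (D^T D)^{-1} F^T y reduces to elementwise division
beta_shortcut = (F.T @ y_c) / S**2
beta_lstsq = np.linalg.lstsq(F, y_c, rcond=None)[0]

# Rotating back by V gives the coefficients in the original feature space
coef_from_pc = Vh.T @ beta_shortcut
coef_direct = np.linalg.lstsq(X_demo, y_c, rcond=None)[0]

print(gram_is_diag)
print(np.allclose(beta_shortcut, beta_lstsq))
print(np.allclose(coef_from_pc, coef_direct))
```

When some singular values are close to zero, dividing by their squares is numerically unstable, which is why the implementation below drops components under a threshold before forming the coefficients.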

In [None]:
class PCR:
    """Class to handle Principal Component Regression
    """
    def __init__(self, eps=1e-10):
        self.eps = eps

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit model given X (n,m) and y (n,) array

        Args:
            X (np.ndarray): Data matrix
            y (np.ndarray): Response vector
        """

        # Perform SVD
        U, S, Vh = np.linalg.svd(X, full_matrices=False)

        # Count the PCs whose singular values exceed the threshold
        k = (S > self.eps).sum()

        # Create new design matrix
        F = U @ np.diag(S)

        # ignore small PC
        F = F[:, :k]

        # compute 1/ (singular value)**2 for each singular value
        D = np.diag( 1 / S[:k]**2)

        # create beta vector, center y 
        self.beta = D @ F.T @ (y - y.mean()).reshape(-1, 1)

        # create intercept
        self.b0 = y.mean()
        
        # save for later use
        self.Vh = Vh
        self.k = k

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Produce predictions for X (n,m) data matrix

        Args:
            X (np.ndarray): Data matrix

        Returns:
            np.ndarray: Returns (n,1) predictions
        """
        # compute design matrix
        F = X @ self.Vh.T

        # remove small PCs
        F = F[:, :self.k]
        
        # compute predictions
        return F @ self.beta + self.b0
    

def ridgeless_sim(X_train, X_test, y_train, y_test, eps):
    pcr = PCR(eps=eps)
    pcr.fit(X_train, y_train)
    y_test_pred = pcr.predict(X_test)
    y_train_pred = pcr.predict(X_train)

    return (
            np.mean((y_train - y_train_pred.reshape(-1))**2), 
            np.mean((y_test - y_test_pred.reshape(-1))**2)
        )


### Task 2: Simulation Study

In [None]:
from tqdm import tqdm
T = 30 # number of iterations
eps=1e-10 # ignore PC below this number

data = []
for i in tqdm(range(T)):
    X = df.iloc[:, 1:].to_numpy()
    y = df.iloc[:, 0].to_numpy()

    # center each column of X (X.mean() alone would subtract one global scalar)
    X = X - X.mean(axis=0)
    
    # create a fit with 5 to 240 features.
    for d in range(5, 241):
        X_train, X_test, y_train, y_test = train_test_split(X[:, :d], y, test_size=0.75)
        train_loss, test_loss = ridgeless_sim(X_train, X_test, y_train, y_test, eps)
        data.append((i, d, train_loss, test_loss, np.log(test_loss)))
results = pd.DataFrame(data,
                       columns=['Iteration', '# of Features', 'Training Error',
                                'Test Error', 'Log of Test Error'])
results

In [None]:
dd = results.groupby('# of Features')[
    ['Training Error', 'Test Error', 'Log of Test Error']].median().reset_index()
plt.figure()
sns.scatterplot(data=dd, x='# of Features', y='Log of Test Error')
plt.show()