In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("D2.ipynb")

# Discussion Lab 2

In this week's workbook, you will implement Ordinary Least Squares (OLS) regression using the `numpy` skills we built together previously. We will also explore and practice categorical encoding.

In [None]:
# !!!IMPORTANT!!! uncomment if needed for module installation
# Please comment these lines out again before you submit to Gradescope

# %pip install -q --force-reinstall git+https://github.com/COGS118A/quiz_module.git
# %pip install -q otter-grader
# print("Install Successful. Please Restart the Kernel")


In [None]:
# Setup for JupyterQuiz
from quiz import display_quiz, record_quiz, check_quiz_answer, show_chosen_option
from IPython.display import display, Javascript, HTML
# Setup for Otter Grader. 
# If you have not installed it, see the previous cell
import otter
grader = otter.Notebook()

import numpy as np

D2_path = "https://raw.githubusercontent.com/COGS118A/DiscussionLabExercises/main/D2/"

## Quiz A: Vector Calculus Recap / Cosine Similarity
Vector calculus is essential for your machine learning journey in this course and beyond. Mastering it helps you understand the foundations of many machine learning models.

In [None]:
HTML(display_quiz(f"{D2_path}A.txt"))

In [None]:
# show your choice
# Opening in "a+" mode creates record.txt if it does not exist yet
with open("record.txt", "a+") as r:
    pass
MCQ = ["A1", "A2", "A3", "A4"]
for q in MCQ:
    show_chosen_option(q)

## Implementing Linear Regression using Ordinary Least Squares (OLS)

Linear regression using OLS involves several distinct steps. In this part, we break OLS into small pieces so that you can implement it step by step.

### Derivation of OLS analytical solutions

If we have $n$ data points $$(x_1, y_1), (x_2, y_2), ... , (x_n, y_n)$$

Our goal is to predict each $y_i \in \mathbb{R}$ from its feature vector, to which we prepend a constant 1 for the intercept: $$x_i = (1, x_{i1}, x_{i2}, ..., x_{id}) \in \mathbb{R}^{d+1}$$

To achieve this, we also create a weight vector $$\vec w = (w_0, w_1, ..., w_d) \in \mathbb{R}^{d+1}$$

The prediction is a linear function of $x_i$, given by its inner product with $\vec w$: $$\hat y_i = \vec w^T x_i = \langle \vec w, x_i \rangle$$ Stacking the $x_i$ as rows gives the data matrix $X \in \mathbb{R}^{n \times (d+1)}$, so all predictions can be written at once as $\hat y = X\vec w$.

Our goal is to predict $y$ with low error. Recall that the residuals are, loosely speaking, the errors of our estimate: each residual measures the difference between an observed value and the corresponding predicted value.

In [None]:
HTML(display_quiz(f"{D2_path}OLS.txt"))

In [None]:
# show your choice
MCQ = ["OLS1"]
for q in MCQ:
    show_chosen_option(q)

Our objective is to find the weights that minimize the sum of squared residuals: $$\hat w = \arg\min_w  \sum^n_{i=1} e_i^2 = \arg\min_w e^Te$$
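As a small numerical illustration of this objective, the sketch below evaluates the sum of squared residuals for one candidate weight vector. It reuses the toy data that appears later in this notebook and assumes the residual convention $e = Xw - y$. The name `w_candidate` and its values are hypothetical, chosen arbitrarily rather than being the optimum.

```python
import numpy as np

# Toy data: the first column of X is the constant 1 for the intercept
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
y = np.array([10, 15, 16, 17, 20])

# Hypothetical candidate weights (intercept, slope), chosen arbitrarily
w_candidate = np.array([8.0, 2.0])

e = X @ w_candidate - y      # residual vector, one entry per data point
sse_sum = np.sum(e ** 2)     # sum of squared residuals
sse_dot = e @ e              # the same quantity written as e^T e

print(sse_sum, sse_dot)
```

Trying a few different values for `w_candidate` shows how the loss changes with the weights, which is the quantity OLS minimizes.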

First, we expand the equation using matrix multiplications.

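$$
\begin{align*}
L(w) = e^T e &= (Xw-y)^T (Xw - y) \\ 
&= y^Ty - y^TXw - (Xw)^Ty + (Xw)^T(Xw) \\
&= y^Ty - 2w^TX^Ty + w^TX^TXw
\end{align*}
$$

The last line uses the fact that $y^TXw$ is a scalar, so it equals its own transpose $(Xw)^Ty = w^TX^Ty$.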

Some matrix calculus rules that might be helpful: for vectors $a, b$ and a symmetric matrix $A$,

$$
\begin{align*}
\frac{\partial}{\partial b} a^Tb &= a \\
\frac{\partial}{\partial b} b^TAb &= 2Ab
\end{align*}
$$
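The quadratic-form rule can be checked numerically. The sketch below compares $2Ab$ with a central finite-difference estimate of the gradient of $b^TAb$. It assumes $A$ is symmetric, which is the case that matters here because $X^TX$ is always symmetric. The random test matrix and the helper `f` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M + M.T                  # build a symmetric matrix
b = rng.normal(size=3)

def f(v):
    """Quadratic form v^T A v."""
    return v @ A @ v

# Central finite differences, one coordinate direction at a time
eps = 1e-6
basis = np.eye(3)
numeric_grad = np.array([
    (f(b + eps * basis[i]) - f(b - eps * basis[i])) / (2 * eps)
    for i in range(3)
])

analytic_grad = 2 * A @ b    # the rule being checked
print(np.allclose(numeric_grad, analytic_grad, atol=1e-5))
```

For a non-symmetric $A$ the gradient is $(A + A^T)b$ instead, so the shortcut $2Ab$ should only be used when symmetry holds.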

Therefore, to calculate the optimum $\hat w$ that yields the minimum loss, we take the gradient w.r.t. $\vec w$ and set it to zero.

$$
\begin{align*}
\frac{\partial}{\partial w} L(w) &= 0 - 2X^Ty + 2X^TXw = 0 \\ 
X^TXw &= X^Ty \\
\hat w &= \left( X^TX \right)^{-1}X^Ty
\end{align*}
$$

In the function `OLS`, compute the optimal $\hat w$ for a given $X$ and $y$ using the equation above.

**Hint:** You might find `np.linalg.inv` useful for computing the inverse of a matrix.

In [None]:
X = np.array([[1, 1],[1, 2],[1, 3],[1, 4],[1, 5]])
y = np.array([10, 15, 16, 17, 20])

In [None]:
def OLS(X, y):
    ...

In [None]:
# See the result.
OLS(X, y)
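One possible way to sanity check your `OLS` output is to compare it against NumPy's built-in least-squares solver, `np.linalg.lstsq`. This is a different routine from the closed-form formula above and is not part of the assignment. The variable name `w_ref` is only illustrative.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
y = np.array([10, 15, 16, 17, 20])

# lstsq returns (solution, residuals, rank, singular values)
w_ref, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(w_ref)  # reference weights: intercept first, then slope

# Once OLS is implemented, a comparison such as
# np.allclose(OLS(X, y), w_ref) should evaluate to True.
```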

## Categorical Encoding
Analyzing numeric data is often straightforward. Suppose you have a list of ages of students in COGS 118A. Each age will be indicated as a number, and you can easily calculate the average, visualize the distribution, etc. 

However, not all data is numeric. Data comes in several other distinct types, such as categorical data. Examples of categorical data include nationality and which residential college you are in. Unlike numeric data, categorical data is more complicated to analyze because it cannot be directly interpreted mathematically. For example, which numbers should Revelle College and Sixth College map to? If we assign 1 to Revelle and 2 to Sixth, does that mean Sixth is worth more than Revelle? Probably not.

A common solution to this challenge is called One Hot Encoding. It transforms a categorical feature into a one-hot matrix with one column for each unique value of the feature. In each row, a 1 marks the column of the value that occurs and every other column is 0.
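To make the target format concrete, here is a hand-written illustration rather than an implementation. The student data is hypothetical, and the column order (sorted unique values) and integer dtype are assumptions, since the format expected by the grader is not specified here.

```python
import numpy as np

# Hypothetical feature: residential college of four students
colleges = np.array(["Revelle", "Sixth", "Revelle", "Sixth"])

# Assumed column order: one column per unique value, sorted alphabetically
categories = np.array(["Revelle", "Sixth"])

# Hand-written one-hot matrix for the feature above
one_hot = np.array([
    [1, 0],   # Revelle
    [0, 1],   # Sixth
    [1, 0],   # Revelle
    [0, 1],   # Sixth
])

# Every row contains exactly one 1
row_sums = one_hot.sum(axis=1)

# The original labels can be recovered from the position of the 1
decoded = categories[one_hot.argmax(axis=1)]
print(row_sums, decoded)
```

Because no column is numerically larger than another, this representation avoids implying an order between the colleges.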

In this question, you will implement `one_hot_encode` yourself.

**Note:** The public test provided is only a sanity check. Passing the public test does not guarantee that you will pass the hidden tests later on.

In [None]:
def one_hot_encode(x:np.ndarray):
    ...

In [None]:
grader.check("one_hot_encode")

## Quiz B: One-Hot Encoding Interpretation

Given the following one-hot-encoded table, answer the following questions.

| Name    |   Gender_Female |   Gender_Male |   Gender_Non-binary |
|:--------|----------------:|--------------:|--------------------:|
| Joan    |               1 |             0 |                   0 |
| Matt    |               0 |             1 |                   0 |
| Jeff    |               0 |             1 |                   0 |
| Melissa |               1 |             0 |                   0 |
| Devi    |               1 |             0 |                   0 |
| John    |               0 |             0 |                   1 |

In [None]:
np.array([[1,0,0],[0,1,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1]]).sum(axis=0)

In [None]:
np.array([[1,0,0],[0,1,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1]]).sum(axis=1)

In [None]:
HTML(display_quiz(f"{D2_path}B.txt"))

In [None]:
# show your choice
MCQ = ["B1", "B2"]
for q in MCQ:
    show_chosen_option(q)

In [None]:
# Make sure you have completed all the questions
MCQ = ["A1", "A2", "A3", "A4", "OLS1", "B1", "B2"]
for q in MCQ:
    show_chosen_option(q)

**The End of D2**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Now, please save this Jupyter notebook using `File -> Save and Checkpoint`. Then, submit the notebook file together with `record.txt` to Gradescope.