# ECE 364 Lecture 14: Midterm 1 Review

# Midterm Logistics
## When?
Tuesday, October 15th, 9:30-10:50am (in class)
## Where?
ECEB 3081
## What?
* Homeworks 1-4
    * Homework 1: PyTorch basics, tensors, storage and memory, indexing and slicing, derivatives and chain rule
    * Homework 2: Linear algebra (on paper, in PyTorch), more differentiation and chain rule, derivatives with matrices and vectors
    * Homework 3: Gradient descent, computational graphs, backpropagation
    * Homework 4: Linear regression: one-dimensional, multi-dimensional, transformations of variables (i.e. non-linear function approximation like HW4 P2)
* Lectures 2-12
## How?
* Mixture of free-response, fill-in-the-blank (complete the code), multiple choice
* Less focus on code syntax, more on the concepts of what code is trying to accomplish or the mathematical concepts
## Why?
Because grades :(
## Who?
All of you!

# Homework 1 Content
## PyTorch basics, tensors, storage and memory, indexing and slicing, derivatives and chain rule
* Tensors are the primary class we use for storing and acting on data in PyTorch.
    * Choosing the appropriate data type and shape for a tensor is important to efficiently store data.
* Operations on tensors can either create a new tensor or create a **view** of a Tensor.
    * For example, operations with a trailing underscore, e.g. ``torch.cos_(x)``, operate **in-place** and thus do not allocate new memory.
* Tensor views share the same underlying memory and thus point to the same locations in memory.
* Tensor/array slicing is an efficient way to access or collect values from a tensor/array, optionally according to certain conditions.
    * We may use slicing to extract only a particular column or every other element in a row, for example.
    * We may use Boolean operators to create **truth arrays** to separate elements based on a condition
* We also reviewed differentiation and the chain rule, as they play a core role in auto-differentiation engines like PyTorch.

## Tensors and storage practice

Consider the below code snippet:

In [2]:
import torch
a = torch.zeros(3, 2)
b = a.cos_()
c = a+b
print(a)
print(b)
print(c)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])
tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])


a) Does ``b`` share the same pointer as ``a``? Does ``c`` share the same pointer as ``b``?

``b`` shares the same pointer as ``a`` because the in-place cosine function returns the same tensor it modifies. ``c`` is a newly allocated tensor because the given addition is not in-place.

b) What would be printed if we ran ``print(a, b, c)``?

``a`` = 3x2 matrix of ones

``b`` = 3x2 matrix of ones

``c`` = 3x2 matrix of twos
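
One way to check the answers above is to compare the memory addresses directly. The following is a minimal sketch, assuming the intended snippet assigns ``b`` from the in-place cosine only; the names ``same_ab`` and ``same_bc`` are introduced here for illustration.

```python
import torch

a = torch.zeros(3, 2)
b = a.cos_()   # in-place: cos(0) = 1 is written into a, and a itself is returned
c = a + b      # out-of-place: the result is stored in newly allocated memory

# data_ptr() gives the address of the first element of a tensor
same_ab = a.data_ptr() == b.data_ptr()
same_bc = b.data_ptr() == c.data_ptr()
print(same_ab, same_bc)  # True False
```

Because ``a`` and ``b`` refer to the same memory, any later in-place change to one is visible through the other.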

## Slicing and truth arrays practice
Consider the below code snippet, which creates a (10, 10) tensor of random integers in the range $[-99, 99]$.

In [11]:
import torch
a = torch.randint(low=-99, high=100, size=(10, 10))
print(a)
d = a[0:5, 5:10]
print(d)
e = a[:, 3]
print(e)
h = e[e%4==0]
print(h)

tensor([[ 73, -99,  31, -59,  -5, -59, -86,  41, -11, -11],
        [-73,   9, -91,  67,  27, -96, -26,  98, -91,   0],
        [ 49, -69, -45,  71, -38, -26,  25, -51,  49, -35],
        [-40,  25, -57,  -9, -29, -67,  -7,  -6, -33,  12],
        [ 41, -58, -48, -80, -84,  20,  -7, -79, -59, -90],
        [-16,  22,  -1, -79,  50,   7, -77, -76, -99,  -2],
        [-98, -10,  -8, -27,  88, -36, -31, -99,  72, -22],
        [ 30,  58,  46,  97,  54,   2, -88, -73, -29,   1],
        [-90,  16,  40, -23, -79,  -3, -42,  46, -62,  82],
        [-82,  21, -30,  33,  46, -66, -35,  -8,  17, -94]])
tensor([[-59, -86,  41, -11, -11],
        [-96, -26,  98, -91,   0],
        [-26,  25, -51,  49, -35],
        [-67,  -7,  -6, -33,  12],
        [ 20,  -7, -79, -59, -90]])
tensor([-59,  67,  71,  -9, -80, -79, -27,  97, -23,  33])
tensor([-80])


a) We would like for ``d`` to contain the top-right quadrant of ``a``. If the line defining ``d`` were written as ``d = a[b, c]``, what should ``b`` and ``c`` be to accomplish this?

``b`` = 0:5 or :5

``c`` = 5:10 or 5:

b) We would like for ``h`` to contain all multiples of 4 in the third column of ``a``. If the lines defining ``e`` and ``h`` were written as ``e = a[f, g]`` and ``h = e[i]``, what should ``f``, ``g`` and ``i`` be to accomplish this? Note: you may assume the "zeroth" column refers to index 0, "first" to index 1, and so on.

``f`` = :

``g`` = 3

``i`` = e%4==0
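
The same slicing and truth-array pattern can be tried on a tensor whose entries are known in advance. This is a small sketch, not the exam snippet: the multiplication-table tensor built with ``torch.outer`` is an assumption chosen so the result can be checked by hand, and ``mask`` is an illustrative name.

```python
import torch

# Deterministic stand-in for the random tensor: a[i, j] = i * j
a = torch.outer(torch.arange(10), torch.arange(10))

d = a[:5, 5:]      # top-right quadrant: rows 0-4, columns 5-9
e = a[:, 3]        # column at index 3: 0, 3, 6, ..., 27
mask = e % 4 == 0  # truth array, True where the entry is a multiple of 4
h = e[mask]

print(d.shape)  # torch.Size([5, 5])
print(h)        # tensor([ 0, 12, 24])
```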


# Homework 2 Content
## Linear algebra review, linear algebra with PyTorch, more differentiation and chain rule including matrices and vectors
* We reviewed linear algebra basics including dot products, matrix operations: multiplication, element-wise product, transpose, inverse, norms, gradients, and basic eigendecomposition.
* PyTorch readily performs linear algebra operations and may help us automate common calculations like solving systems of equations (via matrix inverse), for example.
* Multivariable functions have partial derivatives with respect to each function variable. The collection of partial derivatives is referred to as the gradient.
* Scalar or vector-valued functions with vector and matrix arguments may also have partial derivatives with respect to the vector or matrix arguments.
* For vector-valued functions, we have the **Jacobian** which represents the partial derivative of each entry in the output vector with respect to each element in the input vector.
* Scalar-valued example (gradient): $x\in\mathbb{R}^n,~y\in\mathbb{R}^n, A\in\mathbb{R}^{n\times n}$
$$
f(x, y, A) = y^\top Ax \in\mathbb{R}.
$$
$$
\frac{\partial  f}{\partial x} = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^{\top}
$$
Note that the shape of $\frac{\partial  f}{\partial x}$ matches the shape of $x$!

* Vector-valued example (Jacobian): $x\in\mathbb{R}^n, A\in\mathbb{R}^{m\times n}$
$$
\mathbf{f}(x, A) = Ax \in\mathbb{R}^{m}
$$
$$
\frac{\partial  \mathbf{f}}{\partial x} = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n}\\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}\\
\end{bmatrix}\in\mathbb{R}^{m\times n}
$$
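
As a quick numerical illustration of the Jacobian shape, the sketch below uses ``torch.autograd.functional.jacobian`` on $\mathbf{f}(x, A)=Ax$. The sizes ``m = 3``, ``n = 4`` and the random values are arbitrary choices made for this example.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
m, n = 3, 4
A = torch.randn(m, n)
x = torch.randn(n)

# Jacobian of the map v -> A v, evaluated at x
J = jacobian(lambda v: A @ v, x)

print(J.shape)               # torch.Size([3, 4]), i.e. m x n
print(torch.allclose(J, A))  # True: the Jacobian of Ax with respect to x is A
```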

## Gradients with vectors and matrices
Compute each of the requested partial derivatives for the given functions. You may assume $x\in\mathbb{R}^n$ and $A\in\mathbb{R}^{m\times n}$.

a) Compute $\frac{\partial f}{\partial x}$ for $f(x) = \mathrm{Tr}(xx^\top)$. Recall that the trace of an $n\times n$ matrix $M$ is $\mathrm{Tr}(M)=\sum_{i=1}^{n}M_{ii}$.

**Solution**:

We start by observing that $xx^\top$ is an $n\times n$ matrix as follows:

$$
xx^\top = \begin{bmatrix}
x_1x_1 & x_1x_2 & \cdots & x_1x_n\\
x_2x_1 & x_2x_2 & \cdots & x_2x_n\\
\vdots & \vdots & \ddots & \vdots\\
x_nx_1 & x_nx_2 & \cdots & x_nx_n
\end{bmatrix}
$$
Thus, the trace of $xx^\top$ will be given by
$$
\mathrm{Tr}(xx^\top) = \sum_{i=1}^{n}x_i^2.
$$
The partial derivative of $f(x)$ with respect to each $x_i$ will then be $2x_i$. Therefore, the overall partial derivative will be 
$$
\frac{\partial f}{\partial x} = 2x.
$$
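
The result $\frac{\partial f}{\partial x}=2x$ can be sanity-checked with autograd. This is a minimal sketch with an arbitrarily chosen test vector.

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

f = torch.trace(torch.outer(x, x))  # Tr(x x^T) = sum of x_i^2
f.backward()

print(x.grad)  # tensor([ 2., -4.,  6.]), which equals 2x
```
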
b) Compute $\frac{\partial f}{\partial x}$ for $f(x) = \frac{1}{2}\lVert Ax\rVert_2^2$.

**Solution**:

We may approach the partial $\frac{\partial f}{\partial x}$ in a couple different ways.

**Option 1**:

Let $y=Ax$. We may then write
$$
\begin{align*}
    \frac{1}{2}\lVert Ax\rVert_2^2 &= \frac{1}{2}\lVert y\rVert_2^2\\
    \frac{\partial f}{\partial x} &= \frac{\partial f}{\partial y}\frac{\partial y}{\partial x}.
\end{align*}
$$
Above, we use chain rule to re-write the desired partial derivative where $\frac{\partial y}{\partial x}\in\mathbb{R}^{m\times n}$ is the Jacobian of $y$ with respect to $x$. Let $J_y\in \mathbb{R}^{m\times n}$ be this Jacobian $\frac{\partial y}{\partial x}$. In matrix form, $J_y$ will look like
$$
J_y =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}.
$$
The partial derivative $\frac{\partial f}{\partial y}$ will be a vector in $\mathbb{R}^{m}$. Thus, we have two ways to express $\frac{\partial f}{\partial x}$:
$$
\begin{align*}
    \frac{\partial f}{\partial x} &= J_y^\top\frac{\partial f}{\partial y} = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}\begin{bmatrix}
\frac{\partial f}{\partial y_1}\\
\frac{\partial f}{\partial y_2}\\
\vdots\\
\frac{\partial f}{\partial y_m}
\end{bmatrix}\\
    \frac{\partial f}{\partial x} &= \left(\frac{\partial f}{\partial y}\right)^\top J_y = \begin{bmatrix}
\frac{\partial f}{\partial y_1} &
\frac{\partial f}{\partial y_2} &
\cdots &
\frac{\partial f}{\partial y_m}
\end{bmatrix}\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
\end{align*}
$$
The first solution would give us $\frac{\partial f}{\partial x}\in\mathbb{R}^n$ while the second solution is the transpose of this with $\frac{\partial f}{\partial x}\in\mathbb{R}^{1\times n}$.

With the application of the chain rule set up, we may determine the required partial derivatives and obtain the solution.
$$
\begin{align*}
f(y) &= \frac{1}{2}\lVert y\rVert_2^2\\
&= \frac{1}{2}y^\top y\\
\frac{\partial f}{\partial y} &= y
\end{align*}
$$
To find the Jacobian $J_y$, it is helpful to consider the partial derivative of one element in $y$ with respect to $x$. Let $a_i^\top$ represent the $i$'th row in $A$. Thus, $y_i=a_i^\top x$. Therefore, $\frac{\partial y_i}{\partial x}$, which is the $i$'th row in $J_y$ will be $a_i^\top$. In other words, the rows of $A$ become the rows of $J_y$. Therefore, $J_y=A$. Plugging these results into our expression of chain rule, we find
$$
\begin{align*}
\frac{\partial f}{\partial x} &= J_y^\top\frac{\partial f}{\partial y}\\
&= A^\top y\\
&= A^\top Ax.
\end{align*}
$$
**It is important to note that the Jacobian is defined as being $m\times n$ for functions mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$, as in the above example. However, we had to consider transposing the Jacobian and left-multiplying to make sure the chain rule was carried out correctly, as in the $J_y^\top y$ equations above. Notice how the matrix-vector multiplication accumulates the partial derivatives of each $x_i$ from each $y_j$.**

**Option 2**

The alternative we may examine involves expanding the L2 norm and using the product rule of differentiation.
$$
\begin{align*}
f(x) &= \frac{1}{2}\lVert Ax\rVert_2^2\\
&= \frac{1}{2}(Ax)^\top(Ax)\\
\end{align*}
$$
Computing the partial derivative:
$$
\begin{align*}
\frac{\partial f}{\partial x} &= \frac{1}{2}\left(\frac{\partial (Ax)^\top}{\partial x}Ax+(Ax)^\top\frac{\partial (Ax)}{\partial x}\right)\\
&= \frac{1}{2}\left(\left(\frac{\partial Ax}{\partial x}\right)^\top Ax+\left(\frac{\partial Ax}{\partial x}\right)^\top Ax\right)\\
&= \frac{1}{2}\left(\frac{\partial (Ax)^\top}{\partial x}Ax+\left(\frac{\partial (Ax)^\top}{\partial x}\right) Ax\right)\\
&= \frac{\partial (Ax)^\top}{\partial x}Ax\\
&= A^\top Ax.
\end{align*}
$$
Above, note that we re-distribute the transpose in the second term in the second line so that we may combine like terms, i.e. $(CD)^\top=D^\top C^\top$ where we have $C=(Ax)^\top$ and $D=\frac{\partial Ax}{\partial x}$. In the last line, we borrow our result from Option 1 where $\frac{\partial Ax}{\partial x}=A$ and thus $\frac{\partial (Ax)^\top}{\partial x}=A^\top$.
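
Both options arrive at $A^\top Ax$, which can be compared against autograd. The sketch below assumes arbitrary sizes ($m=3$, $n=4$) and random values; ``expected`` is a name introduced for this check.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4)
x = torch.randn(4, requires_grad=True)

f = 0.5 * torch.sum((A @ x) ** 2)  # (1/2) * ||Ax||_2^2
f.backward()

expected = A.T @ A @ x.detach()    # closed-form gradient A^T A x
print(torch.allclose(x.grad, expected, atol=1e-5))  # True
```

Note that ``x.grad`` has the same shape as ``x``, which corresponds to the $J_y^\top\frac{\partial f}{\partial y}$ form of the chain rule.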

# Homework 3 Content
## Gradient descent, computational graphs, backpropagation
* Derivatives and gradients are critical to minimizing or maximizing functions. Function optimization is at the heart of machine learning and data science problems where we seek to minimize a cost function or maximize a reward function by learning from data.
* Most functions and real-world problems cannot be optimized in closed-form by solving for where the derivative is equal to zero. This motivates the use of gradient-based methods like **gradient descent**.
* Gradient descent is an iterative algorithm that updates parameters by stepping in the direction of the negative gradient (direction of steepest descent).
* The gradient descent equation for function $f(x)$ at iteration $k+1$ is stated as
$$
x^{(k+1)} = x^{(k)}-\alpha\nabla f(x^{(k)})
$$
where $\alpha$ is the **step-size** or **learning rate** to control the size of each update step.
* Auto-differentiation engines operate based on **computational graphs** that build up complicated functions from simple mathematical operations using directed acyclic graphs. These graphs give important structure for computation.
* Performing gradient descent by hand is intractable for larger or more complicated functions even if we may determine the derivatives using the chain rule. Thus, we would like to automate computing gradients. Methods like numeric differentiation or symbolic differentiation are possible; however, we focus on auto-differentiation via **backpropagation** as the most scalable method for machine learning.
* Backpropagation works in two stages: (1) forward pass and (2) backward pass.
    * Forward pass: inputs are passed to the computational graph and intermediate values are stored at each node of computation of the graph
    * Backward pass: well-defined gradient functions at each node utilize the values from forward propagation to transmit partial derivatives to each predecessor node. The **adjoint** of a node, i.e. the partial derivative of the **seed node** (the output) with respect to that node, is composed of the adjoints of all successors and the partial derivatives of each successor with respect to the current node:
 $$
 \bar{w}_i = \sum_{j\in\textrm{successors}(i)}\bar{w}_j\frac{\partial w_j}{\partial w_i}.
 $$
 The above equation demonstrates chain rule!
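
The update rule and the forward/backward passes described above can be sketched in a few lines. This is only an illustrative example: the function $f(x)=(x-3)^2$, the step-size ``alpha = 0.1``, and the 100 iterations are assumptions chosen so the minimizer ($x=3$) is known in advance.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)  # initial guess x^(0)
alpha = 0.1                                # step-size / learning rate

for k in range(100):
    f = (x - 3.0) ** 2        # forward pass builds the computational graph
    f.backward()              # backward pass fills x.grad with df/dx
    with torch.no_grad():     # the update itself should not be tracked
        x -= alpha * x.grad   # x^(k+1) = x^(k) - alpha * grad f(x^(k))
    x.grad.zero_()            # gradients accumulate, so reset each iteration

print(x.item())  # close to 3.0
```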

## Backpropagation practice
Consider the following multivariable function
$$
f(x, y) = 2\cos^2(xy)+\ln(\cos(xy))
$$

a) Determine each partial derivative $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$.

$$
\begin{align*}
\frac{\partial f}{\partial x} &= -4y\cos(xy)\sin(xy)-\frac{y\sin(xy)}{\cos(xy)}\\
\frac{\partial f}{\partial y} &= -4x\cos(xy)\sin(xy)-\frac{x\sin(xy)}{\cos(xy)}
\end{align*}
$$

The below computational graph depicts $f(x, y)$.

<div>
<center><img src="computational-graph.png" width="800"/> </center>
</div>

b) Determine the values of each node $w_i$
$$
\begin{align*}
    w_1 &= x &w_2&=y\\
    w_3 &=w_1w_2 &w_4&= \cos(w_3)\\
    w_5 &= w_4^2 &w_6&=2w_5\\
    w_7 &= \ln(w_4) &w_8&=w_6+w_7
\end{align*}
$$

c) Determine the partial derivatives for each successor node with respect to its predecessor nodes.
$$
\begin{align*}
    \frac{\partial w_8}{\partial w_7} &= 1 &\frac{\partial w_8}{\partial w_6} &= 1\\
    \frac{\partial w_7}{\partial w_4} &= \frac{1}{w_4} &\frac{\partial w_6}{\partial w_5} &= 2\\
    \frac{\partial w_5}{\partial w_4} &= 2w_4 &\frac{\partial w_4}{\partial w_3} &= -\sin(w_3)\\
    \frac{\partial w_3}{\partial w_2} &= w_1 &\frac{\partial w_3}{\partial w_1} &= w_2\\
\end{align*}
$$

d) Compute the adjoints of all nodes. Verify that your expressions for $\bar{w}_1$ and $\bar{w}_2$ match your partial derivatives from part (a).
$$
\begin{align*}
    \bar{w}_8 &= 1 &\bar{w}_7&=1\\
    \bar{w}_6 &= 1 &\bar{w}_5&=2\\
    \bar{w}_4 &= \frac{1}{w_4} + 4w_4 &\bar{w}_3&= -\frac{\sin(w_3)}{w_4} -4w_4\sin(w_3)\\
    \bar{w}_2 &= -\frac{w_1\sin(w_3)}{w_4} -4w_1w_4\sin(w_3) &\bar{w}_1 &= -\frac{w_2\sin(w_3)}{w_4} -4w_2w_4\sin(w_3) 
\end{align*}
$$
Recursively plugging in using the expressions from part (b):
$$
\begin{align*}
\bar{w}_1 &= \frac{\partial f}{\partial w_1} = -4y\cos(xy)\sin(xy)-\frac{y\sin(xy)}{\cos(xy)} \\
\bar{w}_2 &= \frac{\partial f}{\partial w_2} = -4x\cos(xy)\sin(xy)-\frac{x\sin(xy)}{\cos(xy)} 
\end{align*}
$$
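
The hand-computed adjoints can be checked against PyTorch's backward pass. The sketch below mirrors the node definitions from part (b); the input values $x=0.5$, $y=0.8$ are arbitrary assumptions (chosen so that $\cos(xy)>0$ and the logarithm is defined), and the ``*_bar`` names are illustrative.

```python
import torch

x = torch.tensor(0.5, requires_grad=True)  # w_1
y = torch.tensor(0.8, requires_grad=True)  # w_2

# Forward pass, node by node
w3 = x * y
w4 = torch.cos(w3)
w5 = w4 ** 2
w6 = 2 * w5
w7 = torch.log(w4)
w8 = w6 + w7

# Backward pass via autograd (seed adjoint of w8 is 1)
w8.backward()

# Adjoints from part (d), computed by hand from the forward values
with torch.no_grad():
    w4_bar = 1 / w4 + 4 * w4
    w3_bar = -torch.sin(w3) * w4_bar
    w1_bar = y * w3_bar
    w2_bar = x * w3_bar

print(torch.allclose(x.grad, w1_bar), torch.allclose(y.grad, w2_bar))  # True True
```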

# Homework 4 Content
## Linear regression: one-dimensional, multi-dimensional, transformations of inputs

* Linear regression in one dimension seeks to find a line-of-best-fit for a dataset of $(x, y)$ coordinates.
* Consider $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$. Linear regression minimizes:
$$
\min_{w_1,~w_0}~\frac{1}{2}\sum_{i=1}^{N}(y_i-w_1x_i-w_0)^2,
$$
where $w_0$ represents the **bias** term, i.e. y-intercept.
* This may also be written in vector form:
$$
\min_{w}~\frac{1}{2}\lVert \mathbf{X}^\top w - y\rVert_2^2,
$$
where
$$
\mathbf{X}^\top = \begin{bmatrix}
    x_1 & 1\\
    x_2 & 1\\
    \vdots & \vdots \\
    x_N & 1
\end{bmatrix}
$$
* By taking the gradient of the objective function and setting it to zero, we may obtain a closed-form solution for linear regression:

$$
w^*=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}y.
$$
* We may also have linear regression in higher dimensions, e.g. plane-of-best-fit, hyperplane-of-best-fit.
* Alternatively, linear regression may be applied to more complicated regression problems where transformations are applied to input variables, e.g. polynomial regression like third-order polynomial regression below.
$$
\min_{w_3,~w_2,~w_1,~w_0}~\frac{1}{2}\sum_{i=1}^{N}(y_i-w_3x_i^3-w_2x_i^2-w_1x_i-w_0)^2
$$
* Where a transformation $\Phi(\mathbf{X})$ is applied to input data for linear regression, we may now describe the closed-form solution as
$$
w^*=(\mathbf{\Phi}\mathbf{\Phi}^\top)^{-1}\mathbf{\Phi}y.
$$
* What happens if $\mathbf{X}^\top\in\mathbb{R}^{N\times d}$ has $d>N$?
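
The closed-form solution for the one-dimensional case can be sketched as follows. The synthetic data (true slope 2, intercept -1, small Gaussian noise) are assumptions made for this example, and ``torch.linalg.solve`` is used in place of an explicit matrix inverse, which is a numerical choice rather than something the notes prescribe.

```python
import torch

torch.manual_seed(0)
N = 50
x = torch.linspace(-1, 1, N, dtype=torch.float64)
y = 2.0 * x - 1.0 + 0.01 * torch.randn(N, dtype=torch.float64)

# X^T in the notation above: one row [x_i, 1] per data point, shape (N, 2)
Xt = torch.stack([x, torch.ones_like(x)], dim=1)
X = Xt.T

# w* = (X X^T)^{-1} X y, computed by solving (X X^T) w = X y
w_star = torch.linalg.solve(X @ Xt, X @ y)

print(w_star)  # approximately [2, -1], i.e. [w_1, w_0]
```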

## Linear regression practice
a) Determine the $\mathbf{\Phi}$ matrix for the above third-order polynomial regression problem.

$$
\mathbf{\Phi}^\top = \begin{bmatrix}
x_1^3 & x_1^2 & x_1 & 1\\
x_2^3 & x_2^2 & x_2 & 1\\
\vdots & \vdots & \vdots & \vdots\\
x_N^3 & x_N^2 & x_N & 1
\end{bmatrix}
$$

b) What is the minimum number of data points $N$ such that we may obtain a unique solution according to the closed-form solution?

We have four unknowns to solve for in the third-order polynomial regression problem (don't forget this includes the bias term!). Thus, we need $N\geq 4$ data points, with at least four distinct $x_i$ values, to obtain a unique closed-form solution.
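
A small numerical check of this answer is sketched below. The helper ``phi_T``, the coefficients in ``w_true``, and the sample points are all hypothetical choices for illustration. With four distinct points the closed-form solution recovers the coefficients, while with only three points $\mathbf{\Phi}\mathbf{\Phi}^\top$ is rank-deficient and cannot be inverted.

```python
import torch

def phi_T(x):
    # Builds Phi^T: one row [x^3, x^2, x, 1] per data point
    return torch.stack([x**3, x**2, x, torch.ones_like(x)], dim=1)

w_true = torch.tensor([1.0, -2.0, 0.5, 3.0], dtype=torch.float64)

# N = 4 distinct points: Phi Phi^T is invertible
x4 = torch.tensor([-1.0, 0.0, 1.0, 2.0], dtype=torch.float64)
Pt = phi_T(x4)
y4 = Pt @ w_true  # noiseless targets
w_star = torch.linalg.solve(Pt.T @ Pt, Pt.T @ y4)
print(torch.allclose(w_star, w_true, atol=1e-6))  # True

# N = 3 points: the 4x4 matrix Phi Phi^T has rank at most 3
Pt3 = phi_T(x4[:3])
rank3 = torch.linalg.matrix_rank(Pt3.T @ Pt3)
print(rank3.item())  # 3
```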