# Exercise 7 - Theory Recap 

We give you a few examples to train your theoretical understanding, covering some topics from the first part of the course.

------
### A) PageRank

Recall the adjacency matrix we defined in the lecture for a directed graph with $n$ nodes:
$$
 A_{ij} = \begin{cases}
        1 & \text{if website $j$ links towards website $i$} \\
        0 & \text{otherwise}
    \end{cases} \, .
$$

-  **Question 1**: How can you determine the in-degree and out-degree of a node $i$ from its corresponding row and column in the adjacency matrix $A$? 

- **Question 2**: In which cases is the use of a dense array suboptimal as a data structure in your algorithm? 

- **Question 3**: How can you use the given adjacency matrix to find the number of directed triangles in a graph? The existence of a directed triangle on the internet means that starting on website $i$, a surfer can return to the starting website by clicking three links (e.g. `Physics->Universe->Time->Physics` on Wikipedia).
Sketch the idea of an algorithm that would count these triangles, and explain why it works. (If you feel like it, you can go back to the solution of exercise 1 and implement it.)
*Hint: What conditions would the matrix $A$ fulfill if there existed a directed triangle between the nodes $i,j$ and $k$?*

Recall our definition of the matrix $S$ which we used for PageRank in Lecture 1:
$$
    S_{ij} = \begin{cases}
        \frac{A_{ij}}{d_j} & \text{if $d_j \geq 1$} \\
        \frac{1}{n} & \text{if $d_j = 0$}
    \end{cases} \,,
$$
with 
$$
d_j = \sum_{i = 1}^n A_{ij} \, .
$$

- **Question 4**: Prove that $S$ is *column-wise stochastic*, i.e. $$\sum_{i=1}^n S_{ij} = 1 \quad \text{for} \quad j = 1, \dots, n \,.$$


### A) - Answers
- **Answer 1**: In a directed graph, you can determine the in-degree of a vertex by summing the values in the corresponding row of the adjacency matrix. Each entry in that row represents an incoming edge to the vertex, and the sum of these entries is equal to the in-degree of the vertex. To determine the out-degree of a vertex, sum the values in the corresponding column of the adjacency matrix. Each entry in that column represents an outgoing edge from the vertex, and the sum of these entries is equal to the out-degree of the vertex. I.e. the in-degree of a node $i$ is $d^{\mathrm{in}}_i=\sum_{j=1}^nA_{ij}$ and its out-degree is $d^{\mathrm{out}}_i=\sum_{j=1}^nA_{ji}$.

- **Answer 2**: A dense array is inefficient for sparse graphs: it requires memory proportional to $n^2$, the square of the number of vertices, while in sparse graphs the number of edges (the non-zero entries of $A$) usually only grows linearly in $n$.

- **Answer 3**: To find the number of directed triangles in a graph using its adjacency matrix, you can raise the adjacency matrix to the third power and then examine the diagonal entries. Each diagonal entry $(A^3)_{ii}=\sum_{j,k}A_{ij}A_{jk}A_{ki}$ counts the closed walks of length three that start and end at node $i$, which, for a graph without self-loops, is the number of directed triangles that include node $i$. Summing up all the diagonal entries and dividing by 3 (since each directed triangle is counted once for each of its three nodes) gives the total number of directed triangles in the graph, i.e. $\mathrm{tr}(A^3)/3$.

- **Answer 4**: See Lecture Notes.
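
The statements in Answers 1 and 3, as well as the claim of Question 4, can be checked numerically on a toy graph. The following is a minimal sketch in plain Python, assuming a hypothetical 4-node graph without self-loops; the helper names `matmul`, `count_directed_triangles` and `pagerank_matrix` are illustrative and are not taken from the lecture or from exercise 1.

```python
from fractions import Fraction

# Hypothetical toy graph: A[i][j] = 1 if node j links towards node i.
# Edges: 0->1, 1->2, 2->0 (a directed triangle) and 0->3 (node 3 is a dead end).
A = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
n = len(A)

def matmul(P, Q):
    """Plain-Python product of two square matrices stored as lists of rows."""
    size = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

# Answer 1: row sums give in-degrees, column sums give out-degrees.
in_degree = [sum(A[i][j] for j in range(n)) for i in range(n)]
out_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]

def count_directed_triangles(A):
    """Answer 3: trace(A^3) / 3, valid when the graph has no self-loops."""
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 3

def pagerank_matrix(A):
    """S from the text: column j is A[:, j] / d_j, or uniform 1/n if d_j = 0."""
    size = len(A)
    d = [sum(A[i][j] for i in range(size)) for j in range(size)]
    return [[Fraction(A[i][j], d[j]) if d[j] >= 1 else Fraction(1, size)
             for j in range(size)] for i in range(size)]

S = pagerank_matrix(A)
column_sums = [sum(S[i][j] for i in range(n)) for j in range(n)]

print(in_degree, out_degree)        # degrees of the four nodes
print(count_directed_triangles(A))  # number of directed triangles
print(column_sums)                  # every column of S should sum to 1
```

Exact fractions are used for $S$ so that the column sums can be compared to 1 without floating-point tolerance. For large sparse graphs a sparse matrix format would be preferable to nested lists, in line with Answer 2.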

-------
### B) Principal Component Analysis and Singular Value Decomposition

Recall the singular value decomposition we discussed in Lecture 2. Assume that we are given a data matrix $A \in \mathbb{C}^{n\times d}$.

- **Question 1**: Define precisely what is meant by the SVD for the matrix $A$. 

- **Question 2**: What is a singular value of a matrix $A$? What are left and right singular vectors in this context? How are they related to the SVD?

- **Question 3**: You want to obtain a low rank approximation of rank $k$ of your original matrix $A$ using its SVD. The quality of the approximation is measured in terms of the Frobenius norm to the original matrix. How does one find such an approximation? Why?

- **Question 4**: How can you calculate the Frobenius norm of matrix $A$ using the singular values obtained from SVD? Prove your suggestion.

### B) - Answers

- **Answer 1**: The SVD of $A$ is a factorization $A=U\Sigma V^*$ with unitary matrices $U\in \mathbb C^{n\times n}$ and $V\in \mathbb C^{d\times d}$ and $\Sigma=\mathrm{diag}(\sigma_1,\ldots,\sigma_{\min(n,d)}) \in \mathbb R^{n\times d}\ ,$ where $\sigma_i\ge 0$ are real and $\sigma_1\ge\ldots\ge\sigma_{\min(n,d)}$.

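- **Answer 2**: A singular value of $A$ is a real number $\sigma\ge 0$ for which there exist non-zero vectors $u\in \mathbb C^n$ and $v\in \mathbb C^d$ with $A^*u=\sigma v$ and $Av=\sigma u$; $u$ and $v$ are then called left and right singular vectors. The singular values of $A$ are the diagonal entries of $\Sigma$, and the columns of $U$ and the columns of $V$ are left and right singular vectors of $A$.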

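- **Answer 3**: Keep the $k$ largest singular values together with their singular vectors and set the others to zero, i.e. $A_k=\sum_{i=1}^k\sigma_i u_i v_i^*$, where $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. By the Eckart-Young theorem, $A_k$ minimizes the error $\|A-B\|_F^2$ over all matrices $B$ of rank at most $k$.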

- **Answer 4**: The norm $\|A\|_F$ is the square root of the sum of the squared singular values of the SVD, $\|A\|_F=\sqrt{\sum_{i}\sigma_i^2}$. The proof is contained in the Eckart-Young theorem in the lecture notes.
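
For reference, the identity in Answer 4 can also be derived directly from the unitarity of $U$ and $V$ and the cyclic property of the trace. The following is a short sketch and not necessarily the argument used in the lecture notes:

$$
\begin{align}
\|A\|_F^2 &= \sum_{i=1}^n\sum_{j=1}^d |A_{ij}|^2 = \mathrm{tr}(A^*A) \\
&= \mathrm{tr}(V\Sigma^*U^*U\Sigma V^*) = \mathrm{tr}(V\Sigma^*\Sigma V^*) \\
&= \mathrm{tr}(\Sigma^*\Sigma V^*V) = \mathrm{tr}(\Sigma^*\Sigma) = \sum_{i=1}^{\min(n,d)} \sigma_i^2 \, .
\end{align}
$$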

------
### C) Linear Regression
*Disclaimer: This exercise is probably a bit more difficult than what you would see in the exam.*

Let $X\in\mathbb R^{n\times d}$ be such that $X^TX$ is invertible and $Y\in\mathbb R^n$. This is the case where we have enough data points, and we select exactly the solution which minimizes the least squares loss

$$
{\arg \min}_{\vec \alpha} \| Y - \hat Y( \vec \alpha) \|^2
$$

where ${\vec \alpha} \in \mathbb{R}^d$ are the parameters we are fitting and $\hat Y(\vec \alpha) = X \vec \alpha$.
The goal of this exercise is to prove geometrically that under the given assumptions, the least squares minimizer is such that

$$
\hat {\vec \alpha} = (X^TX)^{-1}X^TY.
$$
- **Question 1**: What is the dimension of the vector $Y$? What is the basis $\mathcal B=\{\vec b_i\}_{i=1}^r$ of the subspace of $\mathbb R^n$ in which you can express $\hat Y(\vec \alpha)$? I.e. what is the value of $r$, what are the $\vec b_i$'s?

- **Question 2**: Note that the least squares loss minimizes the Euclidean distance between the two points $Y$ and $\hat Y$ in $\mathbb R^n$. The degrees of freedom to achieve this minimization are $\vec \alpha \in \mathbb R^d$. Re-express the condition for a minimal loss in geometric terms.
*Hint: Decompose $Y=\hat Y + \vec e$ with $\vec e \in \mathbb R^n$. What is the norm of $\vec e$ related to the least squares?  How can you interpret it geometrically? Use orthogonality to express the least squares condition.*

- **Question 3**: Using this new formulation, deduce the above expression for $\hat{\vec \alpha}$.

### C) - Answers
- **Answer 1**: $Y$ is $n$-dimensional. $\hat Y(\vec \alpha)$ is in the span of the $r=d$ basis vectors that are the columns of $X$. These columns are linearly independent because $X^TX$ is invertible. 

- **Answer 2**: Inspecting $Y=\hat Y + \vec e$, we can interpret $\hat Y(\vec \alpha)$ as a projection of $Y$ onto the $d$-dimensional subspace spanned by the column vectors of $X$. Then $\|\vec e\|^2=\|Y-\hat Y\|^2$ is the least squares objective value. 
Minimizing this geometrically means that we want to select $\vec \alpha$ such that we obtain the smallest possible Euclidean distance $\|\vec e\|$ between the original $Y$ and the projected $\hat Y(\vec \alpha)$. From the linear algebra class, we know that this condition is realized by the orthogonal projection. We can write this condition as

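$$(Y-\hat Y(\vec \alpha))\perp \sum_{i=1}^d c_i\vec x_i$$ 

or 

$$\vec e \perp \sum_{i=1}^d c_i\vec x_i$$ 

for the column vectors $\vec x_i$ of $X$ and all possible coefficients $\{c_i\}$.

This holds if and only if $\vec e$ is orthogonal to every column, $\vec x_i^T\vec e=0$ for $i=1,\dots,d$. Equivalently, this gives $X^T(Y-X\hat{\vec \alpha})=0$.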

- **Answer 3**: We deduce
$$
\begin{align}
X^T(Y-X\hat{\vec \alpha})&=0 \\
X^TY &= X^TX\hat{\vec \alpha} \\
(X^TX)^{-1}X^TY &= \hat{\vec \alpha}.
\end{align}
$$
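
The normal equations and the orthogonality condition can be checked numerically on a small example. The sketch below uses plain Python with a hypothetical data set ($n=4$, $d=2$, with a constant first column of $X$ so that the fit is a line with intercept). The data and the helper `solve_normal_equations_2d` are illustrative assumptions, and the explicit $2\times 2$ inverse is used only to keep the example free of dependencies.

```python
from fractions import Fraction

# Hypothetical data: n = 4 points, d = 2 parameters (intercept and slope).
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 4, 8]

def solve_normal_equations_2d(X, Y):
    """Return alpha = (X^T X)^{-1} X^T Y for a design matrix with two columns."""
    # Entries of the symmetric 2x2 matrix X^T X and of the vector X^T Y.
    g00 = sum(row[0] * row[0] for row in X)
    g01 = sum(row[0] * row[1] for row in X)
    g11 = sum(row[1] * row[1] for row in X)
    b0 = sum(row[0] * y for row, y in zip(X, Y))
    b1 = sum(row[1] * y for row, y in zip(X, Y))
    det = g00 * g11 - g01 * g01  # non-zero because X^T X is assumed invertible
    # Apply the explicit inverse of a 2x2 matrix to X^T Y.
    return [Fraction(g11 * b0 - g01 * b1, det),
            Fraction(g00 * b1 - g01 * b0, det)]

alpha = solve_normal_equations_2d(X, Y)

# Residual e = Y - X alpha and its inner products with the columns of X.
residual = [y - (row[0] * alpha[0] + row[1] * alpha[1]) for row, y in zip(X, Y)]
orthogonality = [sum(row[j] * e for row, e in zip(X, residual)) for j in range(2)]

print(alpha)          # fitted intercept and slope
print(orthogonality)  # X^T e, expected to vanish
```

With exact fractions, both entries of `orthogonality` are exactly zero, which is the geometric statement $X^T\vec e=0$ from Answer 2.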

-------
### D) Gradient Descent

- **Question 1**: When and why do you need a validation and a test dataset? 
- **Question 2**: Gradient descent often requires selecting a learning rate. What is the learning rate, and what are the trade-offs in choosing a learning rate that is too small or too large?
- **Question 3**: Why is it generally not a good idea to use the step function $$\theta(x) = \begin{cases}
        0 & \text{if $x > 0$} \\
        1 & \text{otherwise}
    \end{cases}$$ as a loss for gradient-based optimization?

### D) - Answers

- **Answer 1**: The validation dataset is required during training to fine-tune hyperparameters and choose the best model. It ensures that the model is well-suited to the specific task without overfitting the training data. The test dataset, separate from the training and validation data, serves to evaluate the model's ability to generalize to unseen, real-world data. It provides an unbiased assessment of model performance.
- **Answer 2**: It controls the size of steps taken during training to minimize the cost function. If the learning rate is too small, the training process can be extremely slow, and the model might get stuck or take too long to converge. If it's too large, there's a risk of overshooting the optimal solution, leading to divergence. The ideal learning rate balances the trade-offs between convergence speed and stability.
- **Answer 3**: Its derivative is zero almost everywhere, so the gradient carries no information about the direction in which the loss decreases and the algorithm will not move anywhere.
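
The trade-offs from Answer 2 and the stalling described in Answer 3 can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions: the toy loss $f(x)=x^2$, the starting point, the three learning rates and the function names are all hypothetical choices made for illustration.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Run `steps` updates x <- x - learning_rate * grad(x) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Smooth toy loss f(x) = x^2 with gradient 2x and minimizer x = 0.
def grad_quadratic(x):
    return 2.0 * x

# Gradient of the step-function loss: zero wherever it is defined (x != 0).
def grad_step(x):
    return 0.0

x_small = gradient_descent(grad_quadratic, x0=1.0, learning_rate=0.001, steps=50)
x_good = gradient_descent(grad_quadratic, x0=1.0, learning_rate=0.4, steps=50)
x_large = gradient_descent(grad_quadratic, x0=1.0, learning_rate=1.1, steps=50)
x_step = gradient_descent(grad_step, x0=1.0, learning_rate=0.4, steps=50)

print(x_small)  # still close to the starting point: slow convergence
print(x_good)   # very close to the minimizer 0
print(x_large)  # magnitude has grown: divergence
print(x_step)   # unchanged: a zero gradient gives no update
```

For this particular loss each update multiplies $x$ by $1-2\eta$, where $\eta$ is the learning rate, so the iterates shrink when $0<\eta<1$ and grow in magnitude when $\eta>1$.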

### E) Probability

Recall the definitions of expectation of a random variable $X$:

$\mathbb{E}(X)= \sum_{x} x \mathbb{P}_X(x)$ for a discrete random variable with probability mass function $\mathbb{P}_X$ and
$\mathbb{E}(X)= \int x p_X(x) dx$ for a continuous random variable with probability density $p_X(x)$.

The variance $\text{Var}(X)$ is defined as:

$$\text{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}(X))^2\right] .$$

Also recall that two events $\mathcal{E}_1, \mathcal{E}_2$ are said to be independent if:

$$\mathbb{P}[\mathcal{E}_1 \cap \mathcal{E}_2] = \mathbb{P}[\mathcal{E}_1]\mathbb{P}[\mathcal{E}_2]. $$

Answer the following questions.

- **Question 1**
Suppose that $X$ is a discrete random variable taking values in a countable set $\mathcal{X}$ with probability mass function $\mathbb{P}_X$, and $a,b$ are two positive numbers. Which of these statements are always true?

    a. $\text{Var}(X) \geq 0$

    b. $\mathbb{E}(X) \in \mathcal{X}$

    c. $\mathbb{E}( a X^2 +b) = a \mathbb{E}(X)^2+b$

    d. If two events $\mathcal{E}_1$,  $\mathcal{E}_2$ are disjoint ($\mathcal{E}_1 \cap  \mathcal{E}_2 = \emptyset$) then they are independent

    e. If two events $\mathcal{E}_1$,  $\mathcal{E}_2$ are disjoint then $\mathbb{P}( \mathcal{E}_1 \cup \mathcal{E}_2) = \mathbb{P}(\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_2)$

    f. If two events $\mathcal{E}_1$,  $\mathcal{E}_2$ are disjoint  then $\mathbb{P}( \mathcal{E}_1 | \mathcal{E}_2)  = \mathbb{P}(\mathcal{E}_1)$. 


- **Question 2**
Let $X$ be a continuous random variable with probability density $p_X$ on the real line. Find the correct claim(s):

   a. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} x p_X(x)^3 \mathrm{d}x$

   b. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} x p_X(x^3) \mathrm{d}x$

   c. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} x^3 p_X(x) \mathrm{d}x$

   d. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} x^3 p_X(x^3) \mathrm{d}x$

   e. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} (x p_X(x))^3 \mathrm{d}x$

   f. $\mathbb{E}(X^3)=\int_{-\infty}^{\infty} p_X(x^{1/3}) \frac{x^{1/3}}{3} \mathrm{d}x$

- **Question 3** A pair of dice is rolled until a sum of either 5 or 7 appears. Find the
probability that a 5 occurs first.

- **Question 4** Let $P_A,P_B, P_C$ denote three joint distributions over real-valued random variables $X$ and $Y$. Below we plot $1000$ samples from each of the distributions $P_A,P_B, P_C$.

![Local Image](plots_theoryex.png)

Based on the above plots, for each of the distributions $P_A,P_B, P_C$, state and explain which of the following statements appear to be true:

 a.  $X$ and $Y$ are uncorrelated, i.e. $\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]$ (which reads $\mathbb{E}[XY]=0$ for centered variables).

 b.  $X$ and $Y$ are independent.

### E) - Answers

- **Answer 1** a and e. a is true since $\text{Var}(X)$ is a weighted average of non-negative values. e is an axiom of probability theory.
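- **Answer 2** c and f. c is the standard formula for the expectation of a function of $X$ (here $x\mapsto x^3$); f is obtained from c by the change of variable $y=x^3$, for which $x=y^{1/3}$ and $\mathrm{d}x=\frac{1}{3}y^{-2/3}\mathrm{d}y$.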
- **Answer 3** 2/5. Let $\mathcal{E}_{5,n}$ denote the event that a 5 occurs on the $n$-th roll and no 5 or 7 occurs on the first $n-1$ rolls. Similarly, let $\mathcal{E}_{7,n}$ denote the event that a 7 occurs on the $n$-th roll and no 5 or 7 occurs on the first $n-1$ rolls. Observe that the ratio $\frac{\mathbb{P}[\mathcal{E}_{5,n}]}{\mathbb{P}[\mathcal{E}_{7,n}]}$ is just the ratio of the probabilities of getting a sum of 5 or 7 in a single throw. Since the number of ways of getting a 5 is 4 while that of getting a 7 is 6, this ratio is $2/3$ for every $n$. Summing over $n$, the probabilities that a 5 or a 7 occurs first are also in the ratio $2/3$, and since they sum to 1, the probability that a 5 occurs first is $\frac{2}{2+3}=2/5$.
- **Answer 4** For $P_A$, both a and b are true: looking at a vertical slice of the plot, we see that for any fixed value of $X$, the values of $Y$ are centered and distributed identically (independent of $X$), and vice versa for $X$ at fixed $Y$. For $P_B$, only claim a is true since the values of $Y$ at fixed $X$ are centered but their distribution depends on $X$. This illustrates that an absence of correlation doesn't imply independence. For $P_C$, neither of the claims is true since $X,Y$ are correlated and therefore also non-independent.
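
Answers 3 and 4 can be illustrated with exact arithmetic. The sketch below is plain Python; the first part enumerates the 36 outcomes of two fair dice, and the second part uses a hypothetical discrete example ($X$ uniform on $\{-1,0,1\}$ and $Y=X^2$) that is not one of the plotted distributions but shows the same phenomenon as $P_B$: no correlation without independence.

```python
from fractions import Fraction
from itertools import product

# Answer 3: exact single-roll probabilities of the sums 5 and 7 for two fair dice.
rolls = list(product(range(1, 7), repeat=2))
p5 = Fraction(sum(1 for a, b in rolls if a + b == 5), len(rolls))
p7 = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))

# The game stops at the first 5 or 7, so P(5 first) = p5 / (p5 + p7).
p_five_first = p5 / (p5 + p7)

# Answer 4 (hypothetical example, not one of the plotted distributions):
# X uniform on {-1, 0, 1} and Y = X^2 are uncorrelated but not independent.
support = [(-1, 1), (0, 0), (1, 1)]  # possible (x, y) pairs, each with probability 1/3
p = Fraction(1, 3)
e_x = sum(p * x for x, y in support)
e_y = sum(p * y for x, y in support)
e_xy = sum(p * x * y for x, y in support)
covariance = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = p        # the pair (0, 0) has probability 1/3
p_product = p * p  # P(X=0) = 1/3 and P(Y=0) = 1/3

print(p5, p7, p_five_first)
print(covariance, p_joint, p_product)
```

The covariance vanishes while `p_joint` and `p_product` differ, so the two variables in this example are uncorrelated but dependent.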


