# Home Quiz 1 - Logistic Regression

## Problem 1: Gradient Descent

#### Team F
Andreas Chouliaras 2143


Our model for this problem is:

$$y = w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + \epsilon \qquad \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where the learning algorithm will estimate the parameters $w_0$, $w_1$, $w_2$, and $w_3$.

Because $\epsilon$ follows a normal distribution, $y$ conditioned on $\mathbf{x} = (x_1, x_2)$ also follows a normal distribution:

$$
y | \mathbf x \sim \mathcal N ({ w_0 + w_1x_1 + w_2x_2 + w_3x_1^2},{ \sigma^2})
$$

\textbf{a)} So $P(y|x_1,x_2)$ is given by the following expression:

$$P(y|x_1,x_2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} \left(y -  w_0 - w_1x_1 - w_2x_2 - w_3x_1^2\right)^2 \right)$$

\textbf{b)} Given a set of training observations $(x^{(i)}_1 , x^{(i)}_2 , y^{(i)})$ for $i = 1, \ldots, n$, assumed independent, the conditional log-likelihood of this training data is:
$$\sum_{i=1}^n \log P(y^{(i)}|x^{(i)}_1,x^{(i)}_2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)^2$$
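
As a numerical sanity check, the log-likelihood can be computed in two ways: by summing the log of the density from a) over the training set, and by the closed form with the constant term written out. The sketch below is only illustrative; the toy data, the weights, $\sigma^2 = 0.25$, and the function names `mean`, `density`, and `log_likelihood` are assumptions made for the example, not part of the problem statement.

```python
import math

def mean(w, x1, x2):
    """Model mean w0 + w1*x1 + w2*x2 + w3*x1^2."""
    w0, w1, w2, w3 = w
    return w0 + w1 * x1 + w2 * x2 + w3 * x1 ** 2

def density(y, x1, x2, w, sigma2):
    """Gaussian density P(y | x1, x2) from part a)."""
    r = y - mean(w, x1, x2)
    return math.exp(-r * r / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(data, w, sigma2):
    """Closed form: -(n/2) * log(2*pi*sigma^2) - SSE / (2*sigma^2)."""
    n = len(data)
    sse = sum((y - mean(w, x1, x2)) ** 2 for x1, x2, y in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

# Hypothetical toy data: (x1, x2, y) triples, chosen only for illustration.
data = [(0.0, 1.0, 1.2), (1.0, 0.5, 2.9), (-1.0, 2.0, 0.4)]
w = (1.0, 0.5, 0.3, 1.0)
sigma2 = 0.25

# Sum of per-example log densities versus the closed-form expression.
direct = sum(math.log(density(y, x1, x2, w, sigma2)) for x1, x2, y in data)
closed = log_likelihood(data, w, sigma2)
print(direct, closed)
```

The two printed values should coincide up to floating-point rounding, since the closed form is just the sum of the per-example log densities.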

\textbf{c)} To find the desired parameter estimates we need to maximize the conditional log-likelihood from above with respect to the weights.

Up to an additive constant and the positive factor $\frac{1}{2\sigma^2}$, neither of which depends on the weights, the negative conditional log-likelihood is the sum of squared residuals. The problem is therefore equivalent to minimizing the function $f(w_0,w_1,w_2,w_3)$ defined as:

$$
 f(w_0,w_1,w_2,w_3) = \sum_{i=1}^n \left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)^2
$$

So our aim is now to minimize $f$ to find the desired parameter estimates.

\textbf{d)}  So now we need to calculate the gradient of $f(w)$  where $w = [w_0,w_1,w_2,w_3]^T$. We get:

$$
 \nabla_w{f(w)} = \left[ \frac{\partial f(w)}{\partial w_0} , \frac{\partial f(w)}{\partial w_1} , \frac{\partial f(w)}{\partial w_2} , \frac{\partial f(w)}{\partial w_3} \right]^T
$$

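By the chain rule, each squared residual contributes twice the residual times the derivative of the residual with respect to the weight:

$$
\nabla_w{f(w)}=
\left[\begin{array}{c} 
\sum_{i=1}^n 2\left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)(-1) \\
\sum_{i=1}^n 2\left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)\left(-x^{(i)}_1\right) \\
\sum_{i=1}^n 2\left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)\left(-x^{(i)}_2\right) \\
\sum_{i=1}^n 2\left(y^{(i)} - w_0 - w_1x^{(i)}_1 - w_2x^{(i)}_2 - w_3\left(x^{(i)}_1\right)^2\right)\left(-\left(x^{(i)}_1\right)^2\right) \\
\end{array}
\right]
$$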
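
A hand-derived gradient is easy to get subtly wrong, so it can help to compare it against central finite differences of $f$. The following is a minimal sketch using NumPy, with the gradient of the sum of squared residuals written in matrix form as $-2X^T(y - Xw)$, where the columns of $X$ are $1, x_1, x_2, x_1^2$. The random data, the test point `w`, and the helper names `design_matrix`, `f`, `grad_f`, and `numerical_grad` are assumptions made for this illustration.

```python
import numpy as np

def design_matrix(x1, x2):
    """Columns [1, x1, x2, x1^2], one row per training example."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2])

def f(w, X, y):
    """Sum of squared residuals."""
    r = y - X @ w
    return float(r @ r)

def grad_f(w, X, y):
    """Analytic gradient of f: -2 * X^T (y - X w)."""
    return -2.0 * X.T @ (y - X @ w)

def numerical_grad(w, X, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e, X, y) - f(w - e, X, y)) / (2 * h)
    return g

# Hypothetical random data, used only to exercise the two gradients.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 50)
x2 = rng.uniform(-1, 1, 50)
y = rng.normal(size=50)
X = design_matrix(x1, x2)
w = np.array([0.5, -1.0, 2.0, 0.3])

# Largest absolute difference between analytic and numerical gradients.
print(np.max(np.abs(grad_f(w, X, y) - numerical_grad(w, X, y))))
```

Because $f$ is quadratic in $w$, the central difference has no truncation error, so the printed discrepancy should be at the level of floating-point rounding.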

\textbf{e)} The gradient descent update rule for our weights $\mathbf{w}$, with learning rate (step size) $\lambda > 0$, is:

$${\mathbf{w}^{(t+1)}} = {\mathbf{w}^{(t)}} - \lambda \nabla_w f({\mathbf{w}^{(t)}}), \quad t =1,2,3,\ldots $$
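
The update rule can be sketched as a short loop. This is an illustrative implementation under stated assumptions, not a tuned solver: the synthetic data, the true weights `w_true`, the noise level, the zero initialization, the learning rate `lam=1e-3`, and the iteration count are all choices made for the example. The result is compared with NumPy's least-squares solution, which minimizes the same sum of squared residuals.

```python
import numpy as np

def gradient_descent(X, y, lam, num_iters):
    """Iterate w <- w - lam * grad f(w) for the sum of squared residuals."""
    w = np.zeros(X.shape[1])  # assumed starting point
    for _ in range(num_iters):
        grad = -2.0 * X.T @ (y - X @ w)
        w = w - lam * grad
    return w

# Hypothetical synthetic data generated from the model with known weights.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
X = np.column_stack([np.ones(n), x1, x2, x1 ** 2])
y = X @ w_true + rng.normal(scale=0.1, size=n)

w_gd = gradient_descent(X, y, lam=1e-3, num_iters=5000)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference minimizer
print(w_gd)
print(w_ls)
```

Because $f$ is a sum over all $n$ examples, the size of its gradient grows with $n$, so a learning rate that works for one dataset size may diverge for a larger one; dividing $f$ by $n$ is a common way to make $\lambda$ less sensitive to the dataset size.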

\textbf{f)}
We use the SymPy library to calculate the partial derivatives of a single squared-residual term, divided by 2:

```python
import sympy as sym

# Symbols for the target, the two inputs, and the four weights
y, x1, x2 = sym.symbols('y x1 x2')
w0, w1, w2, w3 = sym.symbols('w0 w1 w2 w3')

# Squared residual of a single training example
f = (y - w0 - w1*x1 - w2*x2 - w3*x1**2)**2

# Partial derivatives, halved to drop the factor 2 from the chain rule
print('Partial derivative with respect to w0', sym.diff(f, w0) / 2)
print('Partial derivative with respect to w1', sym.diff(f, w1) / 2)
print('Partial derivative with respect to w2', sym.diff(f, w2) / 2)
print('Partial derivative with respect to w3', sym.diff(f, w3) / 2)
```

```
Partial derivative with respect to w0 w0 + w1*x1 + w2*x2 + w3*x1**2 - y
Partial derivative with respect to w1 -x1*(-w0 - w1*x1 - w2*x2 - w3*x1**2 + y)
Partial derivative with respect to w2 -x2*(-w0 - w1*x1 - w2*x2 - w3*x1**2 + y)
Partial derivative with respect to w3 -x1**2*(-w0 - w1*x1 - w2*x2 - w3*x1**2 + y)
```


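As we see, these are the per-example terms of the partial derivatives of $f$, each divided by 2. Note that the derivative with respect to $w_3$ carries the factor $x_1^2$.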
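
The comparison can also be done programmatically instead of by eye. A minimal sketch, assuming the per-example derivative with respect to each weight is $-2$ times the residual times the corresponding feature ($1$, $x_1$, $x_2$, $x_1^2$); the names `residual`, `features`, and `checks` are introduced here only for illustration.

```python
import sympy as sym

y, x1, x2, w0, w1, w2, w3 = sym.symbols('y x1 x2 w0 w1 w2 w3')
residual = y - w0 - w1*x1 - w2*x2 - w3*x1**2
f = residual**2

# Feature multiplying each weight in the model.
features = {w0: 1, w1: x1, w2: x2, w3: x1**2}

# For each weight, check that diff(f, w) - (-2 * feature * residual) expands to 0.
checks = {
    str(w): sym.expand(sym.diff(f, w) + 2 * phi * residual) == 0
    for w, phi in features.items()
}
print(checks)
```

Each entry of `checks` is `True` when SymPy's derivative matches the hand-derived expression exactly.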