**COMP3670/6670 Tutorial Week 7 - Regression**
---

**THEORY SECTION**
---

Regression and gradient descent are pillars of machine learning.  

The first part of this tutorial is to go over the lecture slides on regression and gradient descent.  

**PRIORITIES:**
1. Ensure you understand **every step** of (stochastic) gradient descent.
2. Ensure you can derive the gradient of regression problems with or without regularisation. 

Once that's all done, revisit the clustering section. Did you understand everything there as well? Unsupervised learning is important because it not only has immediate practical uses but is more relevant to the development of AGI (Artificial General Intelligence) than supervised (labelled) learning. If you don't understand this paragraph, please ask your tutor to clarify the meaning of supervised and unsupervised learning.

**PROGRAMMING SECTION**
---

We're going to do a simple scalar linear regression with gradient descent.


-----------

   **TASK:** 
   
   1. Randomly generate a matrix $X \in \mathbb{R}^{m \times n}$, where each row of $X$ is a training example.
   2. Choose a vector $t \in \mathbb{R}^{n \times 1}$.
   3. Generate $Y$ by $Y = Xt$.
   4. Then generate a random vector $\theta \in \mathbb{R}^{n \times 1}$.
   5. Implement gradient descent to approximate $t$ with $\theta$.
   6. Check that your gradient descent algorithm correctly approximated $t$. If you're unsure, talk to your classmates and tutor.
   7. Verify your answer with the closed form solution employing the Moore-Penrose inverse.
   
Note that in the above we're essentially pretending we don't know $t$. Obviously, if we have $t$, linear regression with gradient descent would be unnecessary, but the point is to help you understand what gradient descent is doing.

Also note: we should use the squared loss function, computed as the square of the difference between the predicted function values and the observed function values (or ground truth). 
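
As a sketch of what step 5 relies on, assume the squared loss is averaged over the $m$ training examples and scaled by a conventional factor of $\tfrac{1}{2}$ (this normalisation is an assumption; the text above does not fix it). The loss and its gradient with respect to $\theta$ are then:

$$
L(\theta) = \frac{1}{2m}\,\lVert X\theta - Y \rVert_2^2,
\qquad
\nabla_\theta L(\theta) = \frac{1}{m}\, X^\top (X\theta - Y).
$$

Gradient descent repeats the update $\theta \leftarrow \theta - \alpha\, \nabla_\theta L(\theta)$, where the learning rate $\alpha > 0$ is a symbol introduced here for illustration. Other normalisations (for example, dropping the $\tfrac{1}{2m}$) only rescale the gradient and can be absorbed into $\alpha$.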


-----------

**GENERAL COURSE HINTS:** 
- $n$ can be any number you like, but be reasonable.
- If you need extra study materials, Stanford and MIT both have some amazing freely available course content online. Look up "Machine Learning Stanford CS229" or "CS221" or "CS231N" for details.
- Wikipedia is your friend. It's not always right, but it's always there for you.

In [3]:
import numpy as np
! pip install matplotlib
import matplotlib.pyplot as plt



In [43]:
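# YOUR CODE HERE.
# Generate a noiseless linear regression problem Y = X t.
m = 5  # number of training examples (rows of X)
n = 5  # number of features
X = np.random.rand(m, n)
t = np.random.rand(n, 1)      # ground-truth parameters
Y = X @ t
theta = np.random.rand(n, 1)  # random initial guess
print(theta)

def grad(X, theta, Y):
    # Gradient of the squared loss, averaged over the examples, w.r.t. theta.
    return X.T @ (X @ theta - Y) / len(Y)

def grad_descent(X, Y, theta, iterations, learning_rate):
    # Take `iterations` full-batch gradient steps starting from theta.
    for _ in range(iterations):
        theta = theta - learning_rate * grad(X, theta, Y)
    return theta

new_theta = grad_descent(X, Y, theta, 5000, 0.2)
print(new_theta)  # estimate of t
print(t)          # ground truth, for comparison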

[[0.86417955]
 [0.89924213]
 [0.79641104]
 [0.61700092]
 [0.63738377]]
[[0.09796915]
 [0.98491521]
 [0.91701654]
 [0.05733517]
 [0.26664093]]
[[0.09777152]
 [0.98472275]
 [0.91735593]
 [0.05724705]
 [0.26665602]]
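
Step 7 of the task (checking against the closed-form solution) can be sketched as follows. This is a minimal example, not the reference solution: it assumes a tall, full-column-rank $X$ (the sizes $m = 50$, $n = 5$ and the fixed seed are illustrative choices), and the names `theta_closed` and `theta_gd` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
m, n = 50, 5                    # illustrative sizes, not prescribed by the task
X = rng.random((m, n))
t = rng.random((n, 1))
Y = X @ t                       # noiseless targets

# Closed-form least-squares solution via the Moore-Penrose pseudoinverse.
theta_closed = np.linalg.pinv(X) @ Y

# Plain full-batch gradient descent on the averaged squared loss, for comparison.
theta_gd = rng.random((n, 1))
for _ in range(5000):
    theta_gd = theta_gd - 0.2 * X.T @ (X @ theta_gd - Y) / m

print("max |closed form - t|:", np.abs(theta_closed - t).max())
print("max |gradient descent - closed form|:", np.abs(theta_gd - theta_closed).max())
```

With noiseless data and full column rank, the pseudoinverse solution should match $t$ up to floating-point error, so it gives a reference point for judging how far gradient descent has converged.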


-----------
**Extended Task:** In this task we study several factors that influence the training of a linear regression model.

1. Noise. Real-world data usually contains noise. Add noise to your generated data and see how this influences the parameter estimation.

2. Sample amount. Sometimes data is expensive to collect even though many parameters need to be trained. Study the effect of the number of training examples: change $m$ for $X \in \mathbb{R}^{m \times n}$ and compare the final loss, keeping the number of training epochs and the learning rate fixed. 

3. Learning rate. How does the learning rate influence the convergence of the optimization process?

-----------
**Hint**
- You can add noise by setting $Y=Xt+\epsilon$ where $\epsilon \sim \mathcal{N}(\mu,\sigma^2)$.
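
The first two extended tasks (noise and sample amount) can be explored together with a sketch like the one below. To separate the effect of the data from the effect of the optimiser, it swaps gradient descent for NumPy's least-squares solver, which stands in for a fully converged run; that substitution, the noise level $\sigma = 0.1$, the values of $m$, the number of trials and the helper name `mean_param_error` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption, for repeatability)
n, sigma, trials = 5, 0.1, 20    # illustrative values

def mean_param_error(m):
    """Average ||theta_hat - t|| over several random noisy problems with m examples."""
    errors = []
    for _ in range(trials):
        X = rng.random((m, n))
        t = rng.random((n, 1))
        Y = X @ t + rng.normal(0.0, sigma, size=(m, 1))
        # Least-squares fit; stands in for fully converged gradient descent.
        theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
        errors.append(np.linalg.norm(theta_hat - t))
    return float(np.mean(errors))

avg_err = {m: mean_param_error(m) for m in (5, 20, 100, 1000)}
for m, err in avg_err.items():
    print(f"m = {m:4d}: mean parameter error = {err:.4f}")
```

When $m = n$ the fit can reproduce the noisy targets exactly, so the noise is absorbed entirely into the parameters; with many more examples than parameters the noise tends to average out and the estimate moves closer to $t$.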

In [64]:
# YOUR CODE HERE.

# Adding noise to your generated data 
m = 5
n = 5
X = np.random.rand(m,n)
t = np.random.rand(n,1)
mu,sigma = 0,0.1
e = np.random.normal(mu, sigma, size=(m, 1))
Y = X@t+e
print("The initial Y is ",Y)
theta = np.random.rand(n,1)
print("The initial theta is ",theta)

new_theta = grad_descent(X, Y, theta,20000, 0.1)
print("After we add a normal error to the Y, the outcome is ",new_theta)  # Noise in Y degrades every component of the theta estimate
print("The initial t is ",t)



The initial Y is  [[1.87689516]
 [0.78796779]
 [2.09011557]
 [1.56566321]
 [1.46735578]]
The initial theta is  [[0.02358686]
 [0.21625371]
 [0.24875499]
 [0.10278459]
 [0.12265653]]
After we add a normal error to the Y, the outcome is  [[0.02538885]
 [0.93488344]
 [1.25034703]
 [0.16142739]
 [0.26647301]]
The initial t is  [[0.004002  ]
 [0.98574362]
 [0.75081542]
 [0.36654095]
 [0.60710437]]


In [69]:
# Change the sample amount (m = 100 instead of 5)
m = 100
n = 5
X = np.random.rand(m,n)
t = np.random.rand(n,1)
mu,sigma = 0,0.1
e = np.random.normal(mu, sigma, size=(m, 1))
Y = X@t +e
print("The initial Y is ",Y)
theta = np.random.rand(n,1)
print("The initial theta is ",theta)

new_theta = grad_descent(X, Y, theta,10000, 0.1)
print("After changing sample amount, the outcome theta is ",new_theta)
print("The initial t is ",t)
# Increasing the sample size improves the estimate

The initial Y is  [[0.91332575]
 [1.0815174 ]
 [1.67551892]
 [0.59464297]
 [1.52280865]
 [0.61340883]
 [1.56566667]
 [1.0495464 ]
 [0.72370564]
 [1.19331662]
 [1.22240439]
 [1.02689012]
 [0.68425347]
 [1.1020185 ]
 [1.43416533]
 [0.94068382]
 [1.32119537]
 [0.46817941]
 [1.14845274]
 [1.01399539]
 [1.20230635]
 [1.01689513]
 [1.05620694]
 [0.72274382]
 [1.25726859]
 [1.51775101]
 [0.79738437]
 [0.57433226]
 [1.30551421]
 [1.31442523]
 [0.38517785]
 [1.38005087]
 [1.06508577]
 [1.32238513]
 [1.35633212]
 [0.8203235 ]
 [1.43910035]
 [1.16243306]
 [0.84352568]
 [0.80192598]
 [1.20779304]
 [1.09080915]
 [0.83154598]
 [1.33653659]
 [1.6080151 ]
 [1.16507139]
 [0.92617373]
 [0.90977508]
 [1.15088573]
 [1.49312828]
 [1.04292199]
 [1.10313388]
 [1.27225881]
 [1.26653351]
 [1.67632866]
 [1.41288644]
 [1.04312378]
 [0.84749094]
 [1.08547789]
 [0.73029364]
 [1.12525954]
 [1.00279579]
 [1.2385895 ]
 [1.13119156]
 [1.53685053]
 [1.21689394]
 [1.32543221]
 [0.69015676]
 [0.88001325]
 [0.48912674]
 ...

In [80]:
m = 100
n = 5
X = np.random.rand(m,n)
t = np.random.rand(n,1)
mu,sigma = 0,0.1
e = np.random.normal(mu, sigma, size=(m, 1))
Y = X@t +e
print("The initial t is ",t)


# theta (the initial guess) is reused from the previous cell.
new_theta = grad_descent(X, Y, theta,500, 0.2)
print("With 500 iterations, the outcome theta is ",new_theta)

new_theta = grad_descent(X, Y, theta,1000, 0.2)
print("With 1000 iterations, the outcome theta is ",new_theta)

new_theta = grad_descent(X, Y, theta,20000, 0.2)
print("With 20000 iterations, the outcome theta is ",new_theta)  # Extra iterations barely change theta: it has converged to the fit of the noisy data, not to t


The initial t is  [[0.59077559]
 [0.42660289]
 [0.69155614]
 [0.86850741]
 [0.77219976]]
With 500 iterations, the outcome theta is  [[0.57198403]
 [0.46781438]
 [0.75894822]
 [0.8219425 ]
 [0.74691118]]
With 1000 iterations, the outcome theta is  [[0.5718215 ]
 [0.4677852 ]
 [0.75923962]
 [0.82167408]
 [0.74702282]]
With 20000 iterations, the outcome theta is  [[0.57182116]
 [0.46778529]
 [0.75924009]
 [0.8216736 ]
 [0.74702295]]
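
The third extended task (learning rate) can be explored with a sketch like the one below, which holds the data and the number of iterations fixed and varies only the step size. The list of learning rates, the 200-iteration budget, the fixed seed and the helper names `loss` and `run` are illustrative assumptions, not part of the original exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
m, n = 100, 5
X = rng.random((m, n))
t = rng.random((n, 1))
Y = X @ t + rng.normal(0.0, 0.1, size=(m, 1))
theta0 = rng.random((n, 1))     # shared starting point for every run

def loss(theta):
    # Squared loss averaged over the examples, with the conventional factor of 1/2.
    return float(np.sum((X @ theta - Y) ** 2) / (2 * m))

def run(learning_rate, iterations=200):
    # Full-batch gradient descent from theta0; returns the final loss.
    theta = theta0.copy()
    for _ in range(iterations):
        theta = theta - learning_rate * X.T @ (X @ theta - Y) / m
    return loss(theta)

final_loss = {lr: run(lr) for lr in (0.001, 0.01, 0.1, 1.0, 2.0)}
print("initial loss:", loss(theta0))
for lr, value in final_loss.items():
    print(f"learning rate {lr}: loss after 200 iterations = {value:.3e}")
```

For this quadratic loss, gradient descent is stable only while the learning rate stays below $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $X^\top X / m$. Very small rates converge slowly, and rates above that threshold make the loss grow instead of shrink.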
