### Handin 2


# Info
Everything should be completed and approved in person. Group work is fine, but one randomly chosen member will have to present on behalf of the group.

The objectives for this handin are:
* Investigate loss curves
* Linear Regression
* Feature Encoding 
* Simple Interface with Dash
* Speeding up with Numba


# Task 1

Prove that there exists an $\alpha \in \mathbb{R}$ such that $y$ becomes 2.  (Taken from a math exam at BI Nydalen)

I: $\alpha x + y = 4$   
II: $-x + 3y = 2$  

## solution

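1. Assume $y = 2$.
2. Substituting into II: $- x + 3(2) = 2$
3. so $x = 4$.
4. Substituting into I: $\alpha (4) + 2 = 4$
5. so $\alpha = \frac{1}{2}$.
6. Answer: $\alpha = \frac{1}{2} \in \mathbb{R}$. Check: with $\alpha = \frac{1}{2}$, $x = 4$ and $y = 2$ satisfy both I ($2 + 2 = 4$) and II ($-4 + 6 = 2$).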
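
The derivation starts from $y = 2$ and solves for $\alpha$, so it is worth confirming the other direction as well: that $\alpha = \frac{1}{2}$ really leads to $y = 2$. A minimal numerical check, assuming `numpy` is available (the variable names are only illustrative):

```python
import numpy as np

alpha = 0.5

# Coefficient matrix and right-hand side of
#   I:  alpha * x + y = 4
#   II:       -x + 3y = 2
A = np.array([[alpha, 1.0],
              [-1.0, 3.0]])
b = np.array([4.0, 2.0])

# The determinant is 3 * alpha + 1 = 2.5, so the system has a unique solution.
x, y = np.linalg.solve(A, b)
print(f"x = {x:.2f}, y = {y:.2f}")
```

Because the determinant is non-zero for $\alpha = \frac{1}{2}$, the solution $(x, y) = (4, 2)$ is the only one.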


# Task 2 -- Investigating the loss curve


We are going to investigate how an algorithm navigates the L2 loss curve.

To this end we will first use our very simple model $f_\theta(x) = \theta$ to model the training data given below.


### Task 2a
Visualize the model $f_\theta(x)$ with  $\theta=0.34$  alongside the training data in the plot below.


In [5]:
import numpy as np
import plotly.express as px


x_train = np.arange(0.0, 1.0, 0.025)
y_train = 0.4 + x_train * 0.55 + np.random.randn(x_train.shape[0])*0.2


fig = px.scatter(x=x_train, y=y_train, title="train dataset")

# add to plot here
theta = 0.34
fig.add_hline(y=theta, line_color="red")

fig.show()

### Task 2b

Create a plot that shows the loss curve for $\theta$ in the range [0, 1], using the Mean Squared Error loss function.  
That is, $L(x, y) = \frac{1}{m} \sum_{k=1}^{m} ( f_\theta(x_k) - y_k )^2$, where $m$ is the number of data points in the training set. Remember: $f_{\theta}(x) = \theta$.


Using the plot, find the value of $\theta$ that minimizes the loss.

In [6]:
# -- CODE -- for Task 2b goes here.
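def mse(y_true, y_pred):
    # mean squared error between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

thetas_mse = []
for n in range(0, 101):  # theta = 0.00, 0.01, ..., 1.00 (covers the whole range [0, 1])
    theta = n / 100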
    y_pred = np.full_like(y_train, theta)
    thetas_mse.append((theta, mse(y_train, y_pred)))

fig = px.line(x=[t[0] for t in thetas_mse], y=[t[1] for t in thetas_mse], title="MSE vs Theta", labels={"x":"Theta", "y":"MSE"})
fig.show()
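
For the constant model the minimum can also be found without a plot: setting $\frac{\partial L}{\partial \theta} = \frac{2}{m}\sum_k (\theta - y_k) = 0$ gives $\theta = \bar{y}$, the mean of the targets. The sketch below compares the grid minimum with that mean. It regenerates the training data with a fixed seed, which is an assumption made only so the example is self-contained and reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, not part of the handin
x_train = np.arange(0.0, 1.0, 0.025)
y_train = 0.4 + x_train * 0.55 + rng.standard_normal(x_train.shape[0]) * 0.2

thetas = np.linspace(0.0, 1.0, 101)
# MSE of the constant model f(x) = theta for every theta on the grid
losses = np.array([np.mean((theta - y_train) ** 2) for theta in thetas])

theta_grid = thetas[np.argmin(losses)]   # best theta on the grid
theta_exact = y_train.mean()             # analytic minimizer
print(f"grid minimum: {theta_grid:.2f}, mean of y_train: {theta_exact:.3f}")
```

The two values should agree up to the grid resolution of 0.01.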

### Task 2c 
We have the following loss curves (same loss function as in 2b):
![title](loss_curves_mse.png)



# Model:
The model is of the form $f_\theta(x) = ax + b$ with $\theta = (a, b)$. 
The curves show the loss for three fixed values of $b$:
1) $b = 0.1$
2) $b = 0.75$
3) $b = 1.5$

In each case $a$ ranges over $[-1, 1]$ (the x-axis in the plot).


Objective: Find a set of datapoints that reproduces these graphs.


In [21]:
# -- CODE for 2c -- goes here.
import numpy as np

def gen_datapoints(num_points, x_lower_bound, x_upper_bound, y_lower_bound, y_upper_bound):
    x = np.random.uniform(x_lower_bound, x_upper_bound, num_points)
    y = np.random.uniform(y_lower_bound, y_upper_bound, num_points)
    return x, y

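# Generate the dataset once, so that every slope is scored on the same points.
x_test, y_test = gen_datapoints(10000, 0.5, 0.8, 0.9, 1.2)

losses = dict(model1=[],model2=[],model3=[])
for n in range(-100, 101):
    theta = n / 100  # theta plays the role of the slope a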
    y_hat1 = theta * x_test + 0.1
    mse1 = mse(y_test, y_hat1)
    losses['model1'].append((theta, mse1))
    y_hat2 = theta * x_test + 0.75
    mse2 = mse(y_test, y_hat2)
    losses['model2'].append((theta, mse2))
    y_hat3 = theta * x_test + 1.5
    mse3 = mse(y_test, y_hat3)
    losses['model3'].append((theta, mse3))

fig = px.scatter(title='MSE vs slope a', labels={'x':'a', 'y':'MSE'})
fig.add_scatter(x=[t[0] for t in losses['model1']], y=[t[1] for t in losses['model1']], mode='lines', name='Model 1')
fig.add_scatter(x=[t[0] for t in losses['model2']], y=[t[1] for t in losses['model2']], mode='lines', name='Model 2')
fig.add_scatter(x=[t[0] for t in losses['model3']], y=[t[1] for t in losses['model3']], mode='lines', name='Model 3')
fig.show()


# Task 3
Train a linear regression with an L2 loss on the training data using Gradient Descent. 
The code below gives a (non-vectorized) example of how it is done.  

The gradient is found as:  
$ L = \frac{1}{2}(\hat{y} - y )^2 $  

$ \hat{y} = f_\theta(x) = \theta$  


$ \frac{\partial L}{\partial\theta} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \theta}$ (using the chain rule).

With:  
$\frac{\partial L}{\partial \hat{y}} = (\hat{y} - y) \times 1 = (\hat{y} - y)$  

$\frac{\partial \hat{y}}{\partial \theta} = 1  $


Gives us:  
$\frac{\partial L}{\partial\theta} = (\hat{y} - y)$
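
A derivation like this can be sanity-checked numerically with a central finite difference. The sketch below does so for a single data point; the function names and the values of `theta`, `y` and `eps` are illustrative assumptions.

```python
def loss(theta, y):
    # L = 1/2 * (y_hat - y)^2 with y_hat = theta
    return 0.5 * (theta - y) ** 2

def analytic_gradient(theta, y):
    # dL/dtheta = dL/dy_hat * dy_hat/dtheta = (y_hat - y) * 1
    return theta - y

theta, y, eps = 5.5, 0.7, 1e-6
# central difference approximation of dL/dtheta
numeric_gradient = (loss(theta + eps, y) - loss(theta - eps, y)) / (2 * eps)
print(analytic_gradient(theta, y), numeric_gradient)
```

The same check works for any other loss or model: only `loss` and `analytic_gradient` need to change.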


### Questions:
1) Draw the graph/tree that shows how these partial derivatives are connected. <br>
![graph](assets/Gradient.drawio.svg)
2) Find a set of hyperparameters that converge for $\theta_{\text{init}} = 5.5$.  How can we determine if the algorithm has converged?<br>
   The algorithm has converged once the change in loss between steps falls below a small threshold. Learning rate = 0.01 and n_steps = 600 seems to work.
3) Can you find a learning rate that the algorithm does not converge for? <br>
   Learning rate = 2 makes $\theta$ oscillate between two values without ever converging; any learning rate above 2 makes the algorithm diverge.
4) What is the "best" learning rate for this particular dataset?  <br>
   Since the prediction is a constant and the mean gradient is $\theta - \bar{y}$, learning rate = 1 moves $\theta$ directly to the mean of $y$, so the minimum is reached after a single step (n_steps = 2 is enough to confirm it).
5) You might be asked to show how the gradients flow in another loss/function: be prepared.
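
For this constant model the mean gradient is $\theta - \bar{y}$, so each update satisfies $\theta_{t+1} - \bar{y} = (1 - \eta)(\theta_t - \bar{y})$ with learning rate $\eta$. Gradient descent therefore converges exactly when $|1 - \eta| < 1$. The sketch below illustrates the different regimes; `run_gd` and the value of `y_mean` are illustrative assumptions, not part of the handin code.

```python
def run_gd(theta_init, y_mean, learning_rate, n_steps):
    """Gradient descent on the constant model; the mean gradient is theta - y_mean."""
    theta = theta_init
    for _ in range(n_steps):
        theta = theta - learning_rate * (theta - y_mean)
    return theta

y_mean = 0.67  # hypothetical mean of y_train, for illustration only
for lr in (0.01, 1.0, 2.0, 2.5):
    final = run_gd(5.5, y_mean, lr, n_steps=20)
    print(f"lr={lr}: |theta - y_mean| after 20 steps = {abs(final - y_mean):.3g}")
```

With `lr=1.0` the distance drops to (numerically) zero after the first step, with `lr=2.0` it stays at its initial value, and with `lr=2.5` it grows by a factor of 1.5 per step.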



In [33]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go


x_train = np.arange(0.0, 1.0, 0.025)
y_train = 0.4 + x_train * 0.55 + np.random.randn(x_train.shape[0])*0.2


def gradient_of_J(theta, x, y):
    # constant model: the prediction does not depend on x
    y_hat = theta
    
    # dL / dy_hat
    dLdy = (y_hat - y)
    
    # dy_hat / dTheta
    dy_HatdTheta = 1
    
    # chain rule
    dLdTheta = dLdy * dy_HatdTheta
    
    return dLdTheta


def calculate_l2_loss_non_vectorized(theta, xs, ys):
    loss = 0.0
    for k in range(ys.shape[0]):
        y_pred = theta 
        loss += (y_pred - ys[k])**2

    
    mean_loss = loss / ys.shape[0]
    return mean_loss


    

initial_theta = 5.5

learning_rate = 0.01
theta = np.array([initial_theta])
m = x_train.shape[0]
n_steps = 600

print("Running GD with initial theta: {:.2f}, learning rate: {} over {} datapoints for {} steps".format(
    theta.item(),
    learning_rate,
    m,
    n_steps))



search_history = []
for steps in range(n_steps):    
        
    gradient_theta_sum = 0.0
    for k in range(m):
        gradient_theta_sum += gradient_of_J(theta, x_train[k], y_train[k]) 

    mean_gradient = (1/m) * gradient_theta_sum
    loss = calculate_l2_loss_non_vectorized(theta, x_train, y_train)

    print("[visit] theta: {:.2f} => loss: {:.2f}".format(theta.item(), loss.item()))

    # record the visited theta together with its own loss, before updating
    search_history.append((theta, loss))

    # update theta using GD
    theta = theta - (learning_rate * mean_gradient)

    


# quick helper to generate plots 
loss_x = np.arange(-4, 6, 0.01)

loss_y = np.array([calculate_l2_loss_non_vectorized(t, x_train, y_train) for t in loss_x])

fig = px.line(x=loss_x, y=loss_y, title="GD History : Marks are iterations.")


x_visit, _ = list(zip(*search_history))
x_visit = np.concatenate(x_visit)
y_visit = np.array([calculate_l2_loss_non_vectorized(t, x_train, y_train) for t in x_visit])

fig.add_trace(go.Scatter(x=x_visit, y=y_visit, name='GD history',
                         line = dict(color='firebrick', width=8, dash='dot')))

fig.show()

Running GD with initial theta: 5.50, learning rate: 0.01 over 40 datapoints for 600 steps
[visit] theta: 5.50 => loss: 23.36
[visit] theta: 5.45 => loss: 22.89
[visit] theta: 5.40 => loss: 22.44
[visit] theta: 5.36 => loss: 21.99
[visit] theta: 5.31 => loss: 21.56
[visit] theta: 5.26 => loss: 21.13
[visit] theta: 5.22 => loss: 20.71
[visit] theta: 5.17 => loss: 20.30
[visit] theta: 5.13 => loss: 19.90
[visit] theta: 5.08 => loss: 19.50
[visit] theta: 5.04 => loss: 19.12
[visit] theta: 4.99 => loss: 18.74
[visit] theta: 4.95 => loss: 18.37
[visit] theta: 4.91 => loss: 18.00
[visit] theta: 4.87 => loss: 17.64
[visit] theta: 4.82 => loss: 17.29
[visit] theta: 4.78 => loss: 16.95
[visit] theta: 4.74 => loss: 16.62
[visit] theta: 4.70 => loss: 16.29
[visit] theta: 4.66 => loss: 15.96
[visit] theta: 4.62 => loss: 15.65
[visit] theta: 4.58 => loss: 15.34
[visit] theta: 4.54 => loss: 15.03
[visit] theta: 4.50 => loss: 14.73
[visit] theta: 4.47 => loss: 14.44
[visit] theta: 4.43 => loss: 14.16


## Task 4: Gradient Descent
Below is a simple vectorized implementation of GD that *CAN* be used as a starting point. 
Please make sure you understand exactly HOW it works (so that you could have implemented one yourself).
(Note that it uses an augmented matrix to skip the bias term).

1) Change the code to handle the bias parameter directly. (No augmented matrix).
2) Change the code to use Stochastic Gradient Descent with mini-batches. (Batch size should be 2 or more).
3) Re-organize the code and add numba to make the SGD go pew pew (faster). To make it easier for yourself: numba does not play nicely with Jupyter, so consider running it in its own Python file.
4) *OPTIONAL* numba also supports GPUs (https://numba.readthedocs.io/en/stable/cuda); implement GPU acceleration for SGD. (Again: numba does not like notebooks.)

Numba: https://numba.readthedocs.io/en/stable/
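
One possible shape for points 1 to 3 is sketched below: a mini-batch SGD epoch with an explicit bias, written with plain loops so that `numba.njit` can compile it. This is a sketch under assumptions, not the expected solution: the function name `sgd_epoch`, its signature, the hyperparameters and the fallback decorator (used when numba is not installed) are all illustrative choices.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python if numba is missing
    def njit(func):
        return func

@njit
def sgd_epoch(xs, ys, weights, bias, order, batch_size, learning_rate):
    """One pass over the data in mini-batches; updates weights in place, returns the new bias."""
    n_samples, n_features = xs.shape
    grad_w = np.zeros(n_features)
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        grad_w[:] = 0.0
        grad_b = 0.0
        for i in range(start, stop):
            row = order[i]
            # prediction error for one sample: w . x + b - y
            error = bias - ys[row]
            for j in range(n_features):
                error += weights[j] * xs[row, j]
            grad_b += error
            for j in range(n_features):
                grad_w[j] += error * xs[row, j]
        # average the gradient over the mini-batch, then take a step
        scale = learning_rate / (stop - start)
        for j in range(n_features):
            weights[j] -= scale * grad_w[j]
        bias -= scale * grad_b
    return bias

# Same three points as the reference cell, but without the column of ones.
xs = np.array([[0.5], [1.0], [2.0]])
ys = np.array([1.0, 1.5, 2.5])
weights = np.zeros(1)
bias = 0.0
rng = np.random.default_rng(0)
for epoch in range(2000):
    order = rng.permutation(xs.shape[0])  # reshuffle the samples every epoch
    bias = sgd_epoch(xs, ys, weights, bias, order, 2, 0.1)
print("weights:", weights, "bias:", bias)
```

The three points lie exactly on the line $y = x + 0.5$, so the weight should approach 1 and the bias 0.5. Shuffling is done outside the compiled function to keep the kernel limited to simple array loops.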


In [None]:
import numpy as np
import plotly.express as px

def predict(theta, xs): 
    return np.dot(xs, theta)

def J_squared_residual(theta, xs, y):
    h = predict(theta, xs)
    sr = ((h - y)**2).sum()    
    return sr

def gradient_J_squared_residual(theta, xs, y):
    h = predict(theta, xs) 
    grad = np.dot(xs.transpose(), (h - y)) 
    return grad


# the dataset (already augmented so that we get an intercept coefficient)
# remember: augmented x -> we add a column of 1's instead of using a bias term.
data_x = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
data_y = np.array([[1.0], [1.5], [2.5]])
n_features = data_x.shape[1]

# variables we need 
theta = np.zeros((n_features, 1))
learning_rate = 0.1
m = data_x.shape[0]

# run GD
j_history = []
n_iters = 10
for it in range(n_iters):
    j = J_squared_residual(theta, data_x, data_y)
    j_history.append(j)
    
    theta = theta - (learning_rate * (1/m) * gradient_J_squared_residual(theta, data_x, data_y))
    
print("theta shape:", theta.shape)

# append the final result.
j = J_squared_residual(theta, data_x, data_y)
j_history.append(j)
print("The L2 error is: {:.2f}".format(j))


# find the L1 error.
y_pred = predict(theta, data_x)
l1_error = np.abs(y_pred - data_y).sum()
print("The L1 error is: {:.2f}".format(l1_error))


# Find the R^2 
# if the data is normalized: use the normalized data not the original data (task 3 hint).
# https://en.wikipedia.org/wiki/Coefficient_of_determination
u = ((data_y - y_pred)**2).sum()
v = ((data_y - data_y.mean())** 2).sum()
print("R^2: {:.2f}".format(1 - (u/v)))


# plot the result
fig = px.line(j_history, title="J(theta) - Loss History")
fig.show()


theta shape: (2, 1)
The L2 error is: 0.03
The L1 error is: 0.25
R^2: 0.97
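
To judge how far ten GD iterations are from the optimum, the result can be compared with the closed-form least-squares solution. The sketch below uses `np.linalg.lstsq` purely as a reference check on the same augmented data; it is not a replacement for implementing GD yourself.

```python
import numpy as np

# Same augmented dataset as in the GD cell (first column is the constant 1).
data_x = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
data_y = np.array([[1.0], [1.5], [2.5]])

# Closed-form least-squares fit, used only as a reference for GD.
theta_exact, *_ = np.linalg.lstsq(data_x, data_y, rcond=None)
residual = ((data_x @ theta_exact - data_y) ** 2).sum()
print("intercept, slope:", theta_exact.ravel(), "squared residual:", residual)
```

The points lie on the line $y = 0.5 + x$, so a fully converged GD run should approach an intercept of 0.5, a slope of 1 and an error of zero.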


# KKD Real Estate

### Note: no pandas, sklearn or similar libraries should be used; numpy, dash, numba and plotly should be sufficient. Ask if you are unsure about a library.
### Implementing your own SGD/GD is a core component of this task.

The project consists of 5 parts: 

1) 
Go through the data and understand how to encode the various features. 
* Clean the data for potential noise and simply wrong input.
* Make sure you identify how a linear classifier will be affected by the encoding scheme. 
* How do you handle missing data?
* How are the different features connected?
* Encode the features.
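
As an illustration of the encoding step, the sketch below one-hot encodes a categorical feature with numpy only and maps missing values to their own category. The field values are hypothetical, the real dataset may look different, and treating missing data as a separate category is only one of several reasonable choices.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix); each row has a single 1 marking its category."""
    categories, inverse = np.unique(np.asarray(values), return_inverse=True)
    inverse = inverse.reshape(-1)
    encoded = np.zeros((len(values), len(categories)))
    encoded[np.arange(len(values)), inverse] = 1.0
    return categories, encoded

# Hypothetical district values; None marks a missing entry.
districts = ["north", "south", None, "north"]
cleaned = ["unknown" if d is None else d for d in districts]

categories, encoded = one_hot(cleaned)
print(categories)
print(encoded)
```

With a bias term in the model, the one-hot columns sum to the constant column, so it is common to drop one category to avoid perfectly collinear features.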

2) 
Identify objectives that could be valuable for KKD Real Estate. The objectives come in two flavors:
* Answering questions, such as: What is a fair price for our ad packages? Who is our best agent?
* Creating dashboards/interfaces that, for instance, can predict the market price for a house or tell the probability of the house being sold within X days.
* To pass the handin you need to implement at least a price model. 

3) 
Train one/many linear model(s) based on the data to solve the objectives.

4) 
Implement the dashboard interface such that we can input the necessary parameters.
(See 'kkd_dashboard.py' on canvas for a potential starting point for the dashboard.)  
This interface should be powered by your models.

5) 
A client wants to know how the price model works. By inspecting the weights (that is: $\theta$), give an 
overview of the most important factors in the model. What are the *key drivers* in the model?
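
Weights are only comparable across features when the features share a scale. The sketch below standardizes the inputs, fits a linear model with plain gradient descent and ranks the features by absolute weight. The data is synthetic and the feature names are hypothetical; they are assumptions made for illustration and say nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["size_m2", "rooms", "distance_to_school_km"]  # hypothetical names

# Synthetic data: price depends strongly on size, weakly on the other two.
raw = rng.uniform([30, 1, 0.1], [200, 6, 5.0], size=(500, 3))
price = (20000 * raw[:, 0] + 50000 * raw[:, 1]
         - 30000 * raw[:, 2] + rng.normal(0, 1e4, 500))

# Standardize features and target so that the weights share one scale.
x = (raw - raw.mean(axis=0)) / raw.std(axis=0)
y = (price - price.mean()) / price.std()

weights = np.zeros(3)
bias = 0.0
for _ in range(500):
    error = x @ weights + bias - y
    weights -= 0.1 * (x.T @ error) / len(y)
    bias -= 0.1 * error.mean()

# Largest absolute weight first: these are the candidate key drivers.
ranking = sorted(zip(feature_names, weights), key=lambda pair: -abs(pair[1]))
for name, weight in ranking:
    print(f"{name:>24}: {weight:+.3f}")
```

The sign of each weight shows the direction of the effect and the magnitude shows its relative importance, under the assumption that the features are not strongly correlated with each other.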


### Deliverables
A presentation of the findings, a short written log of the work (bullet points / jupyter), a dashboard, and all code that you have.


### Dataset
* agents.jsonl - the real estate agents.
* districts.jsonl - the city districts
* houses.jsonl - the houses that have been on the market the last year
* schools.jsonl - info about the schools in the districts
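
Since pandas is not allowed, the `.jsonl` files can be read with the standard library: each non-empty line holds one JSON object. A minimal sketch follows; the helper name `read_jsonl` and the sample fields are hypothetical, and the real field names in the dataset may differ.

```python
import io
import json

def read_jsonl(file_obj):
    """Parse one JSON object per non-empty line of an open text file."""
    records = []
    for line in file_obj:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records

# Hypothetical sample standing in for a real file such as houses.jsonl.
sample = io.StringIO('{"id": 1, "price": 2500000}\n\n{"id": 2, "price": null}\n')
houses = read_jsonl(sample)
print(houses)
```

For a real file, pass the handle from `open("houses.jsonl", encoding="utf-8")` instead of the `StringIO` sample. JSON `null` values arrive as `None`, which is a convenient place to start looking for missing data.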


