### Handin 2


# Info
Everything should be completed and approved in person. Groups are fine, as 1 random person will have to present on behalf of the group.

The objectives for this handin is:
* Investigate loss curves
* Linear Regression
* Feature Encoding 
* Simple Interface with Dash
* Speeding up with Numba


# Task 1

Prove that there exist an $\alpha \in R$ such that $y$ becomes 2.  (Taken from a math exam at BI Nydalen)

1) $\alpha x + y = 4$   
2) $-x + 3y = 2$  


In [1]:
# out of the second equation we get x = 4 if we set y = 2
x = 4
y = 2
alpha = (4 - y)/x

print(f"y becomes 2 for alpha = {alpha} and x = {x}")



y becomes 2 for alpha = 0.5 and x = 4


# Task 2 -- Investigating the loss curve


We are going to investigate how an algorithm navigates the L2 loss curve.

To this end we will first use our very simple model $f_\theta(x) = \theta$ to model the training data given below.


### Task 2a
Visualize the model $f_\theta(x)$ with  $\theta=0.34$  alongside the training data in the plot below.


In [2]:
import numpy as np
import plotly.express as px
import pandas as pd


x_train = np.arange(0.0, 1.0, 0.025)
y_train = 0.4 + x_train * 0.55 + np.random.randn(x_train.shape[0])*0.2


theta = 0.34
y_hat = np.full_like(x_train, theta)


df_points = pd.DataFrame({'x': x_train, 'y': y_train, 'type': 'Training data'})


df_line = pd.DataFrame({'x': x_train, 'y': y_hat, 'type': 'Model fθ(x)=0.34'})


df_all = pd.concat([df_points, df_line])


fig = px.scatter(df_all, x='x', y='y', color='type',
                 title="Training data and model fθ(x)=0.34",
                 labels={'x':'x', 'y':'y'})


fig.update_traces(mode='lines+markers', selector=dict(name='Model fθ(x)=0.34'))

fig.show()


### Task 2b

Create a plot that shows the loss curve for $\theta$ in the range [0, 1], using the Mean Squared Error loss function.  
That is, $L(x, y) = \frac{1}{m} \sum [ f_\theta(x_k) - y_k)^2 ]$. Where $m$ is the number of data points in the training set. Remember: $f_{\theta}(x) = \theta$.


Using the plot find the value of $\theta$ that minimize the loss.

In [3]:
# -- CODE -- for Task 2b goes here.

def model(x, theta):
    return theta

def loss(x, y, theta):
    loss = []
    for th in theta:
        loss.append((1/len(x)) * sum((model(x, th) - y)**2))
    return loss

theta = np.linspace(0, 1, len(x_train))
loss_values = loss(x_train, y_train, theta)

fig = px.line(x=theta, y=loss_values)
fig.show()

# As one can see in the plot the loss is the lowest for theta = 0.7

    

### Task 2c 
We have the following loss curves (same loss function as in 2b):
![title](loss_curves_mse.png)



# Model:
The model is of the form $f_\theta(x) = ax + b$ with $a,b \in \theta$. 
Here the different curves is the loss for:
1) Set $b = 0.1$ and $a \in [-1, 1]$.
2) Set $b = 0.75$ and $a \in [-1, 1]$.
3) Set $b = 1.5$ and $a \in [-1, 1]$.  
While a is between [-1, 1] (the x-axis in the plot).


Objective: Find a set of datapoints that duplicate these graphs


In [4]:
import numpy as np
import plotly.express as px

np.random.seed(42)

# --- Model ---
def model_2(x, a, b):
    return a * x + b

# --- Loss Function ---
def loss_2(x, y, theta):
    a_values, b_values = theta
    all_losses = []
    for b in b_values:
        losses_b = []
        for a in a_values:
            mse = (1/len(x)) * np.sum((model_2(x, a, b) - y)**2)
            losses_b.append(mse)
        all_losses.append(losses_b)
    return all_losses 


# Parameter
a_values = np.linspace(-1, 1, 100)
b_values = [0.1, 0.75, 1.5]
theta = [a_values, b_values]

# Data
# x can stay the same because the length does not matter due to 1/m
#y_train = np.random.uniform(0.2, 1.6, len(x_train))
y_train = 0.3 * x_train + 0.75 + np.random.normal(0, 0.05, size=len(x_train))

loss_values = loss_2(x_train, y_train, theta)


import plotly.graph_objects as go

fig = go.Figure()

for i, b in enumerate(b_values):
    fig.add_trace(go.Scatter(
        x=a_values,
        y=loss_values[i],
        mode='lines',
        name=f"b = {b}"
    ))

fig.update_layout(
    title="Loss (MSE) über a für verschiedene b-Werte",
    xaxis_title="a",
    yaxis_title="Loss (MSE)",
    template="plotly_white"
)

fig.show()


# Task 3
Train a  linear regression with a L2 loss on the training data using Gradient Descent. 
The code below should give a (non-vectorized) on how it is found.  

The gradient is found as:  
$ L = \frac{1}{2}(\hat{y} - y )^2 $  

$ \hat{y} = f_\theta(x) = \theta$  


$ \frac{\partial L}{\partial\theta} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \theta}$ (using the chain rule).

With:  
$\frac{\partial L}{\partial \hat{y}} = (\hat{y} - y) \times 1 = (\hat{y} - y)$  

$\frac{\partial \hat{y}}{\partial \theta} = 1  $


Gives us:  
$\frac{\partial L}{\partial\theta} = (\hat{y} - y)$


### Questions:
1) Draw the graph/tree that shows how these partial derivatives are connected. 
![](derivative_tree.jpg)
2) Find a set of hyperparameters that converge for $\theta_{\text{init}} = 5.5$.  How can we determine if the algorithm has converged?  
    -> For leaning_rate = 0.5 the algorithm converges to 0.01. You can see that it converges if the loss is not changing over a couple of interations.
3) Can you find a learning rate that the algorithm does not converge for?  
    -> for learning_rate = 3 the algorithm does not converge.
4) What is the "best" learning rate for this particular dataset? 
    -> The "best" learning rate would be the one who converges the fastest but still getting to the lowest possible loss. Here it would be 1.
5) You might be asked to show how the gradients flow in another loss/function : be prepared.



In [5]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go


def gradient_of_J(theta, x, y):
    # 
    y_hat = theta
    
    # dL / dy_hat
    dLdy = (y_hat - y)
    
    # dy_hat / dTheta
    dy_HatdTheta = 1
    
    # chain rule
    dLdTheta = dLdy * dy_HatdTheta
    
    return dLdTheta


def calculate_l2_loss_non_vectorized(theta, xs, ys):
    loss = 0.0
    for k in range(ys.shape[0]):
        y_pred = theta 
        loss += (y_pred - ys[k])**2

    
    mean_loss = loss / ys.shape[0]
    return mean_loss


    

initial_theta = 5.5

learning_rate = 1
theta = np.array([initial_theta])
m = x_train.shape[0]
n_steps = 10

print("Running GD with initial theta: {:.2f}, learning rate: {} over {} datapoints for {} steps".format(
    theta.item(),
    learning_rate,
    m,
    n_steps))



search_history = []
for steps in range(n_steps):    
        
    gradient_theta_sum = 0.0
    for k in range(m):
        gradient_theta_sum += gradient_of_J(theta, x_train[k], y_train[k]) 

    # to switch between Mean squared error and squared error replace m with one
    mean_gradient = (1/m) * gradient_theta_sum
    loss = calculate_l2_loss_non_vectorized(theta, x_train, y_train)

    print("[visit] theta: {:.2f} => loss: {:.2f}".format(theta.item(), loss.item()))

    # update theta using GD
    theta = theta - (learning_rate * mean_gradient)
    search_history.append((theta, loss))

    


# quick helper to generate plots 
loss_x = np.arange(-4, 6, 0.01)

loss_y = np.array([calculate_l2_loss_non_vectorized(t, x_train, y_train) for t in loss_x])

fig = px.line(x=loss_x, y=loss_y, title="GD History : Marks are iterations.")


x_visit, _ = list(zip(*search_history))
x_visit = np.concatenate(x_visit)
y_visit = np.array([calculate_l2_loss_non_vectorized(t, x_train, y_train) for t in x_visit])

fig.add_trace(go.Scatter(x=x_visit, y=y_visit, name='GD history',
                         line = dict(color='firebrick', width=8, dash='dot')))

fig.show()

Running GD with initial theta: 5.50, learning rate: 1 over 40 datapoints for 10 steps
[visit] theta: 5.50 => loss: 21.30
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01
[visit] theta: 0.89 => loss: 0.01


## Task 4: Gradient Descent
Below is a simple vectorized impl. of GD that *CAN* be used as a starting point. 
Please make sure you understand exactly HOW it works (so that you could have implemented one yourself).
(Note that it uses a augmented matrix to skip the bias term).

1) Change the code to handle the bias parameter directly. (No agumented matrix).
2) Change the code to use Stochastic Gradient Descent with mini-batches. (Batch size should be 2 or more).
3) Re-organize the code and add numba as to make the SGD go pew pew (faster). To make it easier for yourself - numba does not play nice with Jupyter, so consider running it in its own python file.
4) *OPTIONAL* numba also supports GPU (https://numba.readthedocs.io/en/stable/cuda) implement GPU acceleration for SDG. (again: numba dont like notebooks)

Numba: https://numba.readthedocs.io/en/stable/


In [6]:
import numpy as np
import plotly.express as px

def predict(theta, xs, batch):
    
    bias = theta[0]
    weight = theta[1:] 
    return bias + np.dot(xs[batch], weight)

def J_squared_residual(theta, xs, y, batch):
    h = predict(theta, xs, batch)
    sr = ((h - y[batch])**2).sum()    
    return sr

def gradient_J_squared_residual(theta, xs, y, batch):
    h = predict(theta, xs, batch) 
    #sum(x_i^2) = x^T * x (with x = h - y and h = X*0)
    # ----> J(0) = (h-y)^T * (h-y)
    # grad = dJ/d0 ------> dJ/d0 = xs^t * (h-y)
    # this is gradient computed with some fancy math I probably cant remember 
    # simplfied: calculates partial derivatives from L ------> 0 (backward propagation)
    # dJ/d0 = dJ/dh * dh/d0
    grad = np.dot(xs[batch].transpose(), (h - y[batch])) 
    return grad


# the dataset (already augmented so that we get a intercept coef)
# remember: augmented x -> we add a colum of 1's instead of using a bias term.
#data_x = np.array([[0.5], [1.0], [2.0]])  # (3,1)
#data_y = np.array([[1.0], [1.5], [2.5]]) # (3,1)
data_x = x_train
data_y = y_train
# make to collumn vektor
data_x = np.array(data_x).reshape(-1, 1)
data_y = np.array(data_y).reshape(-1, 1)
print(data_x.shape)
n_features = data_x.shape[1]

# variables we need 
theta = np.zeros((n_features + 1, 1)) #(2,1)
learning_rate = 0.1
m = data_x.shape[0]
batch_size = 30


# run GD
j_history = []
n_iters = 30
for it in range(n_iters):
    batch = np.random.randint(0, batch_size + 1, size=batch_size)
    j = J_squared_residual(theta, data_x, data_y, batch)
    j_history.append(j)
    
    theta = theta - (learning_rate * (1/m) * gradient_J_squared_residual(theta, data_x, data_y, batch))
    
print("theta shape:", theta.shape)

# append the final result.
j = J_squared_residual(theta, data_x, data_y, batch)
j_history.append(j)
print("The L2 error is: {:.2f}".format(j))


# find the L1 error.
y_pred = predict(theta, data_x, batch)
l1_error = np.abs(y_pred - data_y[batch]).sum()
print("The L1 error is: {:.2f}".format(l1_error))


# Find the R^2 
# if the data is normalized: use the normalized data not the original data (task 3 hint).
# https://en.wikipedia.org/wiki/Coefficient_of_determination
u = ((data_y[batch] - y_pred)**2).sum()
v = ((data_y[batch] - data_y[batch].mean())** 2).sum()
print("R^2: {:.2f}".format(1 - (u/v)))


# plot the result
fig = px.line(j_history, title="J(theta) - Loss History")
fig.show()


(40, 1)
theta shape: (2, 1)
The L2 error is: 1.94
The L1 error is: 7.45
R^2: -15.04


# KKD Real Estate

### Note: no pandas, sklearn or similar libraries should be used, numpy, dash, numba and plotly should be sufficient. Ask if you wonder about a library.
### Implementing your own SGD/GD is a core component of this task.

The project consists of 5 parts: 

1) 
Go though the data and understand how encode the various features. 
* Clean the data for potential noise and simply wrong input.
* Make sure you identify how a linear classifier will be affected by the encoding scheme. 
* How do you handle missing data?
* How are the different features connected?
* Encode the features.

2) 
Identify objectives that could be valuable for KKD Real Estate, the objectives comes in two flavors:
* Answering questions, such as: What is a fair price for our ad packages? How is our best agent?
* Creating dashboards/interfaces that for instance: can predict the marked price for a house or maybe tell 
the probability of the house being sold in X amount of days?
* To pass the handin you at least need to implement a price model. 

3) 
Train one/many linear model(s) based on the data to solve the objectives.

4) 
Implement the dashboard interface such that we can input the nessesary parameters.
(See 'kkd_dashboard.py' on canvas for a potential starting point for the dashboard.)  
This interface should be powered by your models.

5) 
A client wants to know how the price model works, my inspecting the weights (that is: $\theta$) give a 
overview of the most important factors in the model. What are *key-drives* in the model?


### Deliverables
A presentation of the findings, a short written log of the work (bullet points / jupyter), a dashboard, and all code that you have.


### Dataset
* agents.jsonl - the real estate agents.
* districts.jsonl - the city districts
* houses.jsonl - the houses that has been on the marked the last year
* schools.jsonl - info about the schools in the districts


