[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/psse-cpu/ml-workshop/blob/main/notebooks/workshop-templates/multivariate.ipynb)

In this workshop, you'll predict the salary to offer an applicant, given data on other
applicants and the salaries they were offered.

- `column 1:` [Microsoft Python Certification Exam][1] score
  * passing score is 28, so the range here is 28 to 40
- `column 2:` years of experience
  * so far, we have applicants with 1 to 12 years of experience
- `column 3:` monthly salary offered, in units of 100,000 pesos (so `0.801` means 80,100 Php)

> 👀 Large values like the salary need to be scaled down; otherwise your MSE cost can blow up to $\infty$.
> Remember that before taking the average, we take the SUM first,
> and the sum of squared errors of LARGE numbers, like a 150k salary, can head towards $\infty$.  
>  
> You'll get a similar error with [house prices](https://www.kaggle.com/swathiachath/kc-housesales-data?select=kc_house_data.csv) in the hundred-thousand dollar range.

[1]: https://www.udemy.com/course/microsoft-python-certification-exam-98-381-practice-tests/
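
To put numbers on the scaling note above, here is a minimal sketch comparing the MSE of the same errors in raw pesos and in units of 100,000 pesos. The three offers and predictions are made up for illustration (they are not model output), and `mse` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical offers and predictions in pesos (illustrative values only)
y_true_php = np.array([80_100.0, 75_800.0, 78_500.0])
y_pred_php = np.array([90_000.0, 70_000.0, 85_000.0])

def mse(y_true, y_pred):
    """Mean squared error: sum the squared errors first, then average."""
    return np.sum((y_true - y_pred) ** 2) / len(y_true)

mse_raw = mse(y_true_php, y_pred_php)                         # pesos squared
mse_scaled = mse(y_true_php / 100_000, y_pred_php / 100_000)  # (x100k pesos) squared

print(mse_raw)     # tens of millions
print(mse_scaled)  # well below 1
```

Dividing the salaries by 100,000 shrinks the squared-error cost by a factor of $10^{10}$. The costs here are still finite, but during gradient descent such large errors also produce very large weight updates, which is one way a cost can run away.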

In [13]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

import tensorflow as tf
from tensorflow import keras
from random import randrange

# 3 decimal places, suppress scientific notation
np.set_printoptions(precision=3, suppress=True)

Normally, data like this is loaded from CSV files or databases, but that would require participants
to learn another library, with its own [invented syntax](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) that relies heavily on
[operator overloading](https://datapythonista.me/blog/python-operators-and-how-they-affect-pandas.html).  

Ain't got no time for that in our 3-hour workshop.  So let's just use a hardcoded NumPy array.

In [14]:
data = np.array([
    [28.   ,  1.   ,  0.801],
    [28.   ,  2.   ,  0.758],
    [28.   ,  3.   ,  0.785],
    [28.   ,  4.   ,  0.862],
    [28.   ,  5.   ,  0.799],
    [28.   ,  6.   ,  0.836],
    [28.   ,  7.   ,  0.783],
    [28.   ,  8.   ,  0.8  ],
    [28.   ,  9.   ,  0.887],
    [28.   , 10.   ,  0.904],
    [28.   , 11.   ,  0.831],
    [28.   , 12.   ,  0.848],
    [29.   ,  1.   ,  0.839],
    [29.   ,  2.   ,  0.836],
    [29.   ,  3.   ,  0.803],
    [29.   ,  4.   ,  0.83 ],
    [29.   ,  5.   ,  0.877],
    [29.   ,  6.   ,  0.884],
    [29.   ,  7.   ,  0.871],
    [29.   ,  8.   ,  0.888],
    [29.   ,  9.   ,  0.915],
    [29.   , 10.   ,  0.912],
    [29.   , 11.   ,  0.859],
    [29.   , 12.   ,  0.876],
    [30.   ,  1.   ,  0.857],
    [30.   ,  2.   ,  0.854],
    [30.   ,  3.   ,  0.841],
    [30.   ,  4.   ,  0.818],
    [30.   ,  5.   ,  0.825],
    [30.   ,  6.   ,  0.852],
    [30.   ,  7.   ,  0.849],
    [30.   ,  8.   ,  0.876],
    [30.   ,  9.   ,  0.873],
    [30.   , 10.   ,  0.88 ],
    [30.   , 11.   ,  0.947],
    [30.   , 12.   ,  0.964],
    [31.   ,  1.   ,  0.825],
    [31.   ,  2.   ,  0.912],
    [31.   ,  3.   ,  0.909],
    [31.   ,  4.   ,  0.856],
    [31.   ,  5.   ,  0.893],
    [31.   ,  6.   ,  0.91 ],
    [31.   ,  7.   ,  0.867],
    [31.   ,  8.   ,  0.974],
    [31.   ,  9.   ,  0.921],
    [31.   , 10.   ,  0.888],
    [31.   , 11.   ,  0.905],
    [31.   , 12.   ,  0.952],
    [32.   ,  1.   ,  0.943],
    [32.   ,  2.   ,  0.88 ],
    [32.   ,  3.   ,  0.877],
    [32.   ,  4.   ,  0.904],
    [32.   ,  5.   ,  0.891],
    [32.   ,  6.   ,  0.898],
    [32.   ,  7.   ,  0.965],
    [32.   ,  8.   ,  0.962],
    [32.   ,  9.   ,  0.989],
    [32.   , 10.   ,  0.976],
    [32.   , 11.   ,  1.003],
    [32.   , 12.   ,  1.02 ],
    [33.   ,  1.   ,  0.891],
    [33.   ,  2.   ,  0.888],
    [33.   ,  3.   ,  0.975],
    [33.   ,  4.   ,  0.932],
    [33.   ,  5.   ,  0.929],
    [33.   ,  6.   ,  0.976],
    [33.   ,  7.   ,  0.983],
    [33.   ,  8.   ,  0.97 ],
    [33.   ,  9.   ,  0.957],
    [33.   , 10.   ,  1.014],
    [33.   , 11.   ,  0.981],
    [33.   , 12.   ,  1.048],
    [34.   ,  1.   ,  0.929],
    [34.   ,  2.   ,  0.986],
    [34.   ,  3.   ,  0.943],
    [34.   ,  4.   ,  0.93 ],
    [34.   ,  5.   ,  0.967],
    [34.   ,  6.   ,  0.974],
    [34.   ,  7.   ,  1.021],
    [34.   ,  8.   ,  0.978],
    [34.   ,  9.   ,  1.045],
    [34.   , 10.   ,  0.972],
    [34.   , 11.   ,  0.979],
    [34.   , 12.   ,  0.986],
    [35.   ,  1.   ,  1.027],
    [35.   ,  2.   ,  0.964],
    [35.   ,  3.   ,  1.041],
    [35.   ,  4.   ,  1.048],
    [35.   ,  5.   ,  1.065],
    [35.   ,  6.   ,  0.972],
    [35.   ,  7.   ,  0.979],
    [35.   ,  8.   ,  1.036],
    [35.   ,  9.   ,  0.993],
    [35.   , 10.   ,  1.07 ],
    [35.   , 11.   ,  1.017],
    [35.   , 12.   ,  1.014],
    [36.   ,  1.   ,  1.055],
    [36.   ,  2.   ,  1.052],
    [36.   ,  3.   ,  1.059],
    [36.   ,  4.   ,  0.986],
    [36.   ,  5.   ,  1.023],
    [36.   ,  6.   ,  1.05 ],
    [36.   ,  7.   ,  1.047],
    [36.   ,  8.   ,  1.114],
    [36.   ,  9.   ,  1.081],
    [36.   , 10.   ,  1.088],
    [36.   , 11.   ,  1.045],
    [36.   , 12.   ,  1.112],
    [37.   ,  1.   ,  1.003],
    [37.   ,  2.   ,  1.1  ],
    [37.   ,  3.   ,  1.077],
    [37.   ,  4.   ,  1.054],
    [37.   ,  5.   ,  1.021],
    [37.   ,  6.   ,  1.048],
    [37.   ,  7.   ,  1.055],
    [37.   ,  8.   ,  1.112],
    [37.   ,  9.   ,  1.109],
    [37.   , 10.   ,  1.136],
    [37.   , 11.   ,  1.153],
    [37.   , 12.   ,  1.16 ],
    [38.   ,  1.   ,  1.061],
    [38.   ,  2.   ,  1.128],
    [38.   ,  3.   ,  1.135],
    [38.   ,  4.   ,  1.042],
    [38.   ,  5.   ,  1.079],
    [38.   ,  6.   ,  1.056],
    [38.   ,  7.   ,  1.133],
    [38.   ,  8.   ,  1.08 ],
    [38.   ,  9.   ,  1.087],
    [38.   , 10.   ,  1.184],
    [38.   , 11.   ,  1.191],
    [38.   , 12.   ,  1.178],
    [39.   ,  1.   ,  1.049],
    [39.   ,  2.   ,  1.056],
    [39.   ,  3.   ,  1.133],
    [39.   ,  4.   ,  1.15 ],
    [39.   ,  5.   ,  1.127],
    [39.   ,  6.   ,  1.134],
    [39.   ,  7.   ,  1.101],
    [39.   ,  8.   ,  1.188],
    [39.   ,  9.   ,  1.145],
    [39.   , 10.   ,  1.122],
    [39.   , 11.   ,  1.219],
    [39.   , 12.   ,  1.206],
    [40.   ,  1.   ,  1.107],
    [40.   ,  2.   ,  1.134],
    [40.   ,  3.   ,  1.161],
    [40.   ,  4.   ,  1.108],
    [40.   ,  5.   ,  1.135],
    [40.   ,  6.   ,  1.182],
    [40.   ,  7.   ,  1.179],
    [40.   ,  8.   ,  1.226],
    [40.   ,  9.   ,  1.143],
    [40.   , 10.   ,  1.18 ],
    [40.   , 11.   ,  1.237],
    [40.   , 12.   ,  1.194]
])

Verifying we sliced $X$ correctly, just showing the first 5 rows.

In [15]:
X = data[:, 0:2]
X[0: 5]

array([[28.,  1.],
       [28.,  2.],
       [28.,  3.],
       [28.,  4.],
       [28.,  5.]])

Verifying we sliced $y$ correctly, just showing the first 5 rows.

In [16]:
y = data[:, [-1]]
y[0:5]

array([[0.801],
       [0.758],
       [0.785],
       [0.862],
       [0.799]])
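
The slice for $y$ uses `[-1]` (a list) rather than a plain `-1`, which is why the output stays a column. A small sketch of the difference, using only the first three rows of the data as a stand-in (`sample`, `y_column`, and `y_flat` are illustrative names):

```python
import numpy as np

# First three rows of the dataset, as a small stand-in
sample = np.array([
    [28.0, 1.0, 0.801],
    [28.0, 2.0, 0.758],
    [28.0, 3.0, 0.785],
])

y_column = sample[:, [-1]]  # list index keeps the column axis
y_flat = sample[:, -1]      # integer index drops it

print(y_column.shape)  # (3, 1)
print(y_flat.shape)    # (3,)
```

Keeping $y$ two-dimensional gives one row per applicant, matching the layout of $X$.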

Displaying the lowest and highest values of each feature:
- certification score
- years of experience

In [17]:
# first the lowest of each
np.min(X, axis=0) # axis=0 -> minimum of each column (taken down the rows)

array([28.,  1.])

In [18]:
# then the highest of each
np.max(X, axis=0)

array([40., 12.])

Next, we scale the features. Scikit-learn's `MinMaxScaler` (imported above) is the easier route.  NumPy's vectorized arithmetic also works, and that's one less library to learn.

Note that:
- certification scores of `40` become `1`, and scores of `28` become `0`.
- `1` year of experience becomes `0`, and `12` years become `1`.

`X[-14:-9]` shows a few applicants who scored 39 (veterans, with 11 to 12 years) and 40 (with only 1 to 3 years of experience).
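
For reference, min-max scaling maps each column to the 0 to 1 range with $(x - \min) / (\max - \min)$. Below is a minimal NumPy sketch on a made-up three-row array, not the full `X`; the names `sample`, `col_min`, `col_max`, and `sample_norm` are illustrative only.

```python
import numpy as np

# Made-up stand-in for X: [certification score, years of experience]
sample = np.array([
    [28.0, 1.0],
    [34.0, 6.0],
    [40.0, 12.0],
])

col_min = sample.min(axis=0)  # per-column minimum
col_max = sample.max(axis=0)  # per-column maximum

# Broadcasting applies the column-wise min and range to every row
sample_norm = (sample - col_min) / (col_max - col_min)

print(sample_norm)
```

The lowest row maps to `[0, 0]` and the highest to `[1, 1]`. Scikit-learn's `MinMaxScaler().fit_transform(...)` should give the same result with its default settings.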

In [19]:
X_norm = X # <YOUR CODE TO FEATURE-SCALE X HERE>

# Showing the original values
X[-14:-9]

array([[39., 11.],
       [39., 12.],
       [40.,  1.],
       [40.,  2.],
       [40.,  3.]])

In [20]:
# and the normalized ones
X_norm[-14:-9]

array([[39., 11.],
       [39., 12.],
       [40.,  1.],
       [40.,  2.],
       [40.,  3.]])

In [21]:
# <COMPILE AND TRAIN YOUR MODEL HERE>

In [22]:
# Showing our training data again, so we can compare
# `vstack` -> stack two matrices vertically
np.vstack((data[0:13], data[-4:-1]))

array([[28.   ,  1.   ,  0.801],
       [28.   ,  2.   ,  0.758],
       [28.   ,  3.   ,  0.785],
       [28.   ,  4.   ,  0.862],
       [28.   ,  5.   ,  0.799],
       [28.   ,  6.   ,  0.836],
       [28.   ,  7.   ,  0.783],
       [28.   ,  8.   ,  0.8  ],
       [28.   ,  9.   ,  0.887],
       [28.   , 10.   ,  0.904],
       [28.   , 11.   ,  0.831],
       [28.   , 12.   ,  0.848],
       [29.   ,  1.   ,  0.839],
       [40.   ,  9.   ,  1.143],
       [40.   , 10.   ,  1.18 ],
       [40.   , 11.   ,  1.237]])

How much will we offer each of these applicants?
- a candidate who barely passed, with 12 years of experience
- another _"kabit"_ (barely passing) candidate, but a 15-year veteran
  * we've never had a 15-year veteran apply before
- another who barely passed, with only half a year of experience
  * we've never had an almost-fresh graduate so far
- perfect score, 15 years of experience
- perfect score, 12 years of experience, like the applicant who was offered $119.4k$

Scale the inputs for prediction as well, using the same minimum and maximum as the training data, before passing them to `model.predict`.
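
The key point is that new inputs reuse the minimum and maximum found in the training data instead of computing their own. A minimal sketch for two of the applicants, assuming the training minimums `[28, 1]` and maximums `[40, 12]` shown earlier (`train_min`, `train_max`, and `new_applicants` are illustrative names):

```python
import numpy as np

# Minimum and maximum of the TRAINING features, as displayed earlier
train_min = np.array([28.0, 1.0])
train_max = np.array([40.0, 12.0])

# Two new applicants whose years of experience fall outside the training range
new_applicants = np.array([
    [28.0, 15.0],
    [28.0, 0.5],
])

scaled = (new_applicants - train_min) / (train_max - train_min)

print(np.round(scaled, 3))  # years become 1.273 and -0.045
```

Values outside the 0 to 1 range are expected here, since 15 years is above and 0.5 years is below anything seen in training. With Scikit-learn, the equivalent is calling `transform` on the scaler that was already fitted on `X`, rather than fitting a new one.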

In [23]:
scaled_input = np.array([ # THIS IS NOT SCALED YET, ALSO SCALE INPUT FOR PREDICTION
    [28, 12], 
    [28, 15],
    [28, 0.5],
    [40, 15],
    [40, 12]
])

"""
If you wrote your code correctly, the scaled values should be:
array([[ 0.   ,  1.   ],
       [ 0.   ,  1.273],
       [ 0.   , -0.045], # -0.045 because 0.5 is lower than the lowest years exp seen earlier
       [ 1.   ,  1.273], # likewise, 15 years becomes 1.273 because it is above the highest
       [ 1.   ,  1.   ]])
"""

scaled_input 

array([[28. , 12. ],
       [28. , 15. ],
       [28. ,  0.5],
       [40. , 15. ],
       [40. , 12. ]])

In [24]:
# <YOUR CODE TO PREDICT HERE>

You should get values somewhere around these (in Php, after multiplying the predictions by 100,000):

- 86,130.559
- 89,092.261
- 74,777.365
- 126,128.817
- 123,167.109

😉 This dataset was artificially generated, with one entry per _"combo"_ of
score and years_exp, using the formula:

80% * certification_score +  
20% * years_experience  
$\pm$ a random 1000, 2000, 3000, 4000, or 5000 Php


The learned weights should be somewhere around that ratio.  I got (bias first, then the score weight, then the years weight):
$$
  w = \begin{bmatrix}
    0.753 \\
    0.37 \\
    0.109
  \end{bmatrix}
$$

The score and years weights (0.37 and 0.109) are in a 77:23 ratio, pretty close to 80:20 😁
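
As a quick check of that ratio, the sketch below normalizes the two feature weights so they sum to 100%. It leaves out 0.753, which the by-hand computation further down adds on its own as the bias. The variable names are illustrative.

```python
# Rounded feature weights reported above
w_score = 0.37
w_years = 0.109

total = w_score + w_years
score_share = w_score / total * 100  # percent of the combined weight
years_share = w_years / total * 100

print(round(score_share), round(years_share))  # 77 23
```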

To predict for a given input, say `score = 40`, `years_exp = 15`, the model computes:

1. Feature-scale the inputs: 40 becomes `1`, and 15 becomes `1.273`
2. $\hat{y} = 0.109 \cdot 1.273 + 0.37 \cdot 1 + 0.753$  
3. $\hat{y} = 1.261757$

That value is still in units of 100k, so multiplying by 100,000 gives a predicted salary of 126,175.7 Php (actually 126,128.817 when our code
is run).

The slightly larger value from our _"mano-mano"_ (by-hand) computation comes from rounding: the values shown here are rounded off 
to 3 decimal places, but internally in our code, they're not.
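
The by-hand computation above is just a dot product, which can be sketched in NumPy as follows. This uses the rounded weights from the text rather than the trained model, and it assumes a leading `1` in the input vector to pick up the bias term.

```python
import numpy as np

# Rounded values from the text: bias, score weight, years weight
w = np.array([0.753, 0.37, 0.109])

# 1 for the bias, then the feature-scaled score (40 -> 1) and years (15 -> 1.273)
x = np.array([1.0, 1.0, 1.273])

y_hat = w @ x                   # prediction in units of 100k pesos
salary_php = y_hat * 100_000    # convert back to pesos

print(round(y_hat, 6))       # 1.261757
print(round(salary_php, 1))  # 126175.7
```

Swapping in the unrounded weights from a trained model should move the result closer to what `model.predict` returns.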
