# 04 - Multi-variable Linear Regression

<img width="200" src="https://i.imgur.com/hbPVe1T.png">

# Multi-variable linear regression

Predicting exam score - regression using three inputs (x1, x2, x3)

x1 (quiz 1) | x2 (quiz 2) | x3 (mid 1) | Y (final)
---- | ---- | ----| ----
73 | 80 | 75 | 152
93 | 88 | 93 | 185
89 | 91 | 90 | 180
96 | 98 | 100 | 196
73 | 66 | 70 | 142

Data source: Test Scores for General Psychology (https://goo.gl/g2T8Kp)


# Matrix multiplication

## Dot product (= scalar product, inner product)
<img src="https://www.mathsisfun.com/algebra/images/matrix-multiply-a.svg" >


https://www.mathsisfun.com/algebra/matrix-multiplying.html
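
As a minimal sketch of the row-by-column rule, the snippet below computes a single dot product and then a full matrix product. NumPy is used here only for illustration, and the arrays `row`, `col`, `A`, and `B` are made-up example values, not data from this lecture.

```python
import numpy as np

# Dot product: multiply matching entries, then sum.
row = np.array([1, 2, 3])
col = np.array([7, 9, 11])
dot = int(np.dot(row, col))        # 1*7 + 2*9 + 3*11

# A matrix product repeats that rule for every (row, column) pair.
A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])           # shape (3, 2)
C = A @ B                          # shape (2, 2)

print(dot)
print(C)
```

The inner dimensions (3 and 3) must match, and the result takes the outer dimensions (2 and 2).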

# Multi-feature regression

### Hypothesis

$$ H(x) = w x + b $$

$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $$

# Hypothesis using matrix

$$ H(x_1, x_2, x_3) = \underline{w_1 x_1 + w_2 x_2 + w_3 x_3} + b $$

$$ w_1 x_1 + w_2 x_2 + w_3 x_3 $$ 

$$ \begin{pmatrix} w_{ 1 } & w_{ 2 } & w_{ 3 } \end{pmatrix}\cdot \begin{pmatrix} x_{ 1 } \\ x_{ 2 } \\ x_{ 3 } \end{pmatrix} $$

$$ WX $$ (W and X are matrices)

# Hypothesis without b

$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$

$$ = b + w_1 x_1 + w_2 x_2 + w_3 x_3 $$

$$ = \begin{pmatrix} b & x_{ 1 } & x_{ 2 } & x_{ 3 } \end{pmatrix}\cdot \begin{pmatrix} 1 \\ w_{ 1 } \\ w_{ 2 } \\ w_{ 3 } \end{pmatrix} $$

$$ = XW $$
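
The rewrite above folds the bias into the matrix product. A small numerical check, assuming made-up values for the weights, inputs, and bias (none of them come from the lecture data), might look like this:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])   # hypothetical weights w1, w2, w3
x = np.array([4.0, 5.0, 6.0])   # hypothetical inputs  x1, x2, x3
b = 0.5                         # hypothetical bias

# Explicit form: w1*x1 + w2*x2 + w3*x3 + b
explicit = w @ x + b

# Folded form: (b, x1, x2, x3) . (1, w1, w2, w3)
row = np.concatenate(([b], x))
col = np.concatenate(([1.0], w))
folded = row @ col

print(explicit, folded)
```

Both expressions give the same number, which is why the separate `+ b` term can be dropped from the notation.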



# Hypothesis using matrix 

### Many x instances

$$ \begin{pmatrix} x_{ 11 } & x_{ 12 } & x_{ 13 } \\ x_{ 21 } & x_{ 22 } & x_{ 23 } \\ x_{ 31 } & x_{ 32 } & x_{ 33 }\\ x_{ 41 } & x_{ 42 } & x_{ 43 }\\ x_{ 51 } & x_{ 52 } & x_{ 53 }\end{pmatrix} \cdot \begin{pmatrix} w_{ 1 } \\ w_{ 2 } \\ w_{ 3 } \end{pmatrix}=\begin{pmatrix} x_{ 11 }w_{ 1 }+x_{ 12 }w_{ 2 }+x_{ 13 }w_{ 3 } \\ x_{ 21 }w_{ 1 }+x_{ 22 }w_{ 2 }+x_{ 23 }w_{ 3 }\\ x_{ 31 }w_{ 1 }+x_{ 32 }w_{ 2 }+x_{ 33 }w_{ 3 } \\ x_{ 41 }w_{ 1 }+x_{ 42 }w_{ 2 }+x_{ 43 }w_{ 3 } \\ x_{ 51 }w_{ 1 }+x_{ 52 }w_{ 2 }+x_{ 53 }w_{ 3 } \end{pmatrix} $$

$$ [5, 3] \cdot [3, 1] = [5, 1] $$

$$ H(X) = XW $$

Here 5 is the number of data instances, 3 is the number of features (variables), and 1 is the number of outputs.
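
The shape rule [5, 3] · [3, 1] = [5, 1] can be checked directly on the exam-score inputs from the table at the top. This is only a sketch: the weight values in `W` are hypothetical, and NumPy stands in for TensorFlow.

```python
import numpy as np

# Five instances (rows), three features (columns): quiz 1, quiz 2, mid 1.
X = np.array([[73., 80., 75.],
              [93., 88., 93.],
              [89., 91., 90.],
              [96., 98., 100.],
              [73., 66., 70.]])       # shape (5, 3)

W = np.array([[1.0], [0.5], [0.5]])   # hypothetical weights, shape (3, 1)

H = X @ W                             # one prediction per instance, shape (5, 1)

# The first row of H is the weighted sum of the first instance.
first_by_hand = 73. * 1.0 + 80. * 0.5 + 75. * 0.5

print(H.shape)
print(H[0, 0], first_by_hand)
```

Each row of `H` is the hypothesis for one student, so a single matrix product replaces five separate weighted sums.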

# Hypothesis using matrix (n output)

$$ [n, 3] \cdot [?, ?] = [n, 2] $$

$$ H(X) = XW $$

* n is the number of data instances, and 2 is the number of output values.
* In this case, W [?, ?] ⇒ [3, 2]
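
A quick shape check for the multi-output case, as a sketch with placeholder arrays (the value of `n` and the array contents are arbitrary assumptions):

```python
import numpy as np

n = 4                    # hypothetical number of instances
X = np.ones((n, 3))      # [n, 3]: n instances, 3 features
W = np.zeros((3, 2))     # [3, 2]: rows match X's columns, columns match the outputs

H = X @ W                # [n, 3] . [3, 2] = [n, 2]
print(H.shape)
```

The rows of `W` are fixed by the number of features and its columns by the number of outputs, so `W` does not depend on `n`.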

# WX vs XW

### Theory (Lecture) :
 $$ H(x) = Wx + b  $$

### TensorFlow (Implementation) :

$$ H(X) = XW $$
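
The two conventions differ only by a transpose, since $(XW)^T = W^T X^T$. The following sketch checks this numerically with random placeholder arrays (an assumption for illustration, using NumPy rather than TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # instances as rows, the implementation layout
W = rng.normal(size=(3, 1))   # one weight per feature

implementation = X @ W        # XW, shape (5, 1)
lecture = W.T @ X.T           # weights first, instances as columns, shape (1, 5)

print(implementation.shape, lecture.shape)
print(np.allclose(implementation, lecture.T))
```

Storing one instance per row is the common layout for datasets, which is why the implementation writes the product as XW.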

# Simple Example (2 variables)

x1 | x2 | y
---- | ---- | ----
1  |  0  |  1
0  |  2  |  2
3  |  0  |  3
0  |  4  |  4
5  |  0  |  5

In [1]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.3.0


In [2]:
tf.random.set_seed(0)

In [4]:
x1_data = [1, 0, 3, 0, 5]
x2_data = [0, 2, 0, 4, 0]
y_data = [1, 2, 3, 4, 5]

W1 = tf.Variable(tf.random.uniform((1,), -10.0, 10.0))
W2 = tf.Variable(tf.random.uniform((1,), -10.0, 10.0))
b = tf.Variable(tf.random.uniform((1,), -10.0, 10.0))

learning_rate = tf.Variable(0.001)

for i in range(1000+1):
  with tf.GradientTape() as tape:
    hypothesis = W1 * x1_data + W2 * x2_data + b
    cost = tf.reduce_mean(tf.square(hypothesis - y_data))
  
  W1_grad, W2_grad, b_grad = tape.gradient(cost, [W1, W2, b])
  W1.assign_sub(learning_rate * W1_grad)
  W2.assign_sub(learning_rate * W2_grad)
  b.assign_sub(learning_rate * b_grad)

  if i % 50 == 0:
    print("{:5} | {:10.6f} | {:10.4f} | {:10.4f} | {:10.6f} |".format(i, cost.numpy(), W1.numpy()[0], W2.numpy()[0], b.numpy()[0]))



    0 | 966.489624 |    -6.3849 |    -9.6386 |  -1.997182 |
   50 | 290.719482 |    -2.5520 |    -6.0406 |   0.083186 |
  100 |  97.416336 |    -0.8520 |    -3.7830 |   1.185548 |
  150 |  36.060600 |    -0.1155 |    -2.3524 |   1.777458 |
  200 |  14.622516 |     0.1927 |    -1.4381 |   2.096439 |
  250 |   6.551729 |     0.3151 |    -0.8494 |   2.265759 |
  300 |   3.350713 |     0.3601 |    -0.4675 |   2.350938 |
  350 |   2.032307 |     0.3750 |    -0.2176 |   2.387711 |
  400 |   1.469417 |     0.3797 |    -0.0524 |   2.396086 |
  450 |   1.216593 |     0.3823 |     0.0583 |   2.387553 |
  500 |   1.092846 |     0.3854 |     0.1338 |   2.368865 |
  550 |   1.023482 |     0.3899 |     0.1865 |   2.344070 |
  600 |   0.977404 |     0.3956 |     0.2244 |   2.315640 |
  650 |   0.941599 |     0.4023 |     0.2527 |   2.285101 |
  700 |   0.910600 |     0.4097 |     0.2748 |   2.253407 |
  750 |   0.882098 |     0.4175 |     0.2927 |   2.221159 |
  800 |   0.855110 |     0.4255 |     0.

# Simple Example (2 variables with Matrix)

In [11]:
x_data = [
    [1., 0., 3., 0., 5.],
    [0., 2., 0., 4., 0.]
]
y_data  = [1, 2, 3, 4, 5]

W = tf.Variable(tf.random.uniform((1, 2), -1.0, 1.0))
b = tf.Variable(tf.random.uniform((1,), -1.0, 1.0))

learning_rate = tf.Variable(0.001)

for i in range(1000+1):
    with tf.GradientTape() as tape:
        hypothesis = tf.matmul(W, x_data) + b  # (1, 2) x (2, 5) -> (1, 5)
        cost = tf.reduce_mean(tf.square(hypothesis - y_data))

    # compute gradients and update after leaving the tape context
    W_grad, b_grad = tape.gradient(cost, [W, b])
    W.assign_sub(learning_rate * W_grad)
    b.assign_sub(learning_rate * b_grad)
    
    if i % 50 == 0:
        print("{:5} | {:10.6f} | {:10.4f} | {:10.4f} | {:10.6f}".format(
            i, cost.numpy(), W.numpy()[0][0], W.numpy()[0][1], b.numpy()[0]))

    0 |  30.003744 |    -0.8693 |    -0.6099 |   0.562821
   50 |   7.836334 |    -0.0238 |    -0.1530 |   0.904018
  100 |   2.377153 |     0.3633 |     0.1291 |   1.070535
  150 |   0.924887 |     0.5399 |     0.3063 |   1.149366
  200 |   0.495451 |     0.6204 |     0.4197 |   1.183373
  250 |   0.350969 |     0.6575 |     0.4935 |   1.194003
  300 |   0.294564 |     0.6755 |     0.5426 |   1.192192
  350 |   0.268330 |     0.6851 |     0.5760 |   1.183638
  400 |   0.253377 |     0.6913 |     0.5994 |   1.171381
  450 |   0.242959 |     0.6961 |     0.6164 |   1.157086
  500 |   0.234499 |     0.7004 |     0.6293 |   1.141689
  550 |   0.226969 |     0.7046 |     0.6394 |   1.125728
  600 |   0.219946 |     0.7087 |     0.6478 |   1.109518
  650 |   0.213252 |     0.7129 |     0.6551 |   1.093248
  700 |   0.206807 |     0.7171 |     0.6615 |   1.077033
  750 |   0.200576 |     0.7213 |     0.6674 |   1.060942
  800 |   0.194542 |     0.7254 |     0.6730 |   1.045016
  850 |   0.18
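
For the mean squared error used above, the gradients that `tape.gradient` returns can also be written by hand: with `err = W x + b - y`, the gradient with respect to `W` is `(2/n) * err @ x.T` and the gradient with respect to `b` is `(2/n) * sum(err)`. The sketch below reimplements the same loop in NumPy under two assumptions: the parameters start at zero instead of random values, and the helper names `cost_fn` and `grads` are hypothetical.

```python
import numpy as np

x = np.array([[1., 0., 3., 0., 5.],
              [0., 2., 0., 4., 0.]])   # shape (2, 5), same layout as x_data
y = np.array([1., 2., 3., 4., 5.])     # shape (5,)

def cost_fn(W, b):
    """Mean squared error of the hypothesis H = W x + b."""
    return float(np.mean((W @ x + b - y) ** 2))

def grads(W, b):
    """Analytic gradients of the mean squared error."""
    err = W @ x + b - y                     # shape (1, 5)
    n = y.size
    W_grad = (2.0 / n) * err @ x.T          # shape (1, 2)
    b_grad = (2.0 / n) * err.sum()          # scalar
    return W_grad, np.array([b_grad])

W = np.zeros((1, 2))   # assumption: zeros instead of tf.random.uniform
b = np.zeros(1)
learning_rate = 0.001

initial_cost = cost_fn(W, b)
for _ in range(1000 + 1):
    W_grad, b_grad = grads(W, b)
    W -= learning_rate * W_grad
    b -= learning_rate * b_grad

print(initial_cost, cost_fn(W, b))
```

Because the starting point differs from the random initialization above, the printed costs will not match that log line by line, but the cost should fall in the same way.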

# Gradient descent using a TensorFlow optimizer

In [20]:
X = tf.constant([[1., 2.],
                 [3., 4.]])
y = tf.constant([[1.5], [3.5]])

W = tf.Variable(tf.random.normal((2,1)))
b = tf.Variable(tf.random.normal((1,)))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
# optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.01)

n_epoch = 1000+1
for i in range(n_epoch):
  with tf.GradientTape() as tape:
    y_pred = tf.matmul(X, W) + b
    cost = tf.reduce_mean(tf.square(y_pred - y))
  
  grads = tape.gradient(cost, [W, b])

  optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]))
  if i % 50 == 0:
    print("{:5} | {:10.6f}".format(i, cost.numpy()))


    0 |  13.620182
   50 |   0.017823
  100 |   0.012185
  150 |   0.008331
  200 |   0.005696
  250 |   0.003895
  300 |   0.002663
  350 |   0.001821
  400 |   0.001245
  450 |   0.000851
  500 |   0.000582
  550 |   0.000398
  600 |   0.000272
  650 |   0.000186
  700 |   0.000127
  750 |   0.000087
  800 |   0.000059
  850 |   0.000041
  900 |   0.000028
  950 |   0.000019
 1000 |   0.000013
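
Plain SGD without momentum applies the same update as the manual `assign_sub` calls in the earlier examples: each variable moves against its gradient, scaled by the learning rate. The sketch below mimics that step in NumPy on the same `X` and `y`. It is not the Keras implementation; `sgd_step` and `mse` are hypothetical helper names, and the parameters start at zero rather than from `tf.random.normal`.

```python
import numpy as np

X = np.array([[1., 2.],
              [3., 4.]])
y = np.array([[1.5], [3.5]])

W = np.zeros((2, 1))   # assumption: zeros instead of tf.random.normal
b = np.zeros(1)

def sgd_step(params, grads, learning_rate):
    """Plain SGD: move each parameter against its gradient, in place."""
    for p, g in zip(params, grads):
        p -= learning_rate * g

def mse(W, b):
    """Mean squared error of the prediction X W + b."""
    return float(np.mean((X @ W + b - y) ** 2))

initial_cost = mse(W, b)
for _ in range(1000 + 1):
    err = X @ W + b - y                          # shape (2, 1)
    W_grad = (2.0 / len(X)) * X.T @ err          # shape (2, 1)
    b_grad = (2.0 / len(X)) * err.sum(axis=0)    # shape (1,)
    sgd_step([W, b], [W_grad, b_grad], learning_rate=0.01)
final_cost = mse(W, b)

print(initial_cost, final_cost)
```

Using an optimizer object mainly pays off when switching to other update rules, since the training loop itself stays unchanged.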


# Predicting exam score
regression using three inputs (x1, x2, x3)

x1 (quiz 1) | x2 (quiz 2) | x3 (mid 1) | Y (final)
---- | ---- | ----| ----
73 | 80 | 75 | 152
93 | 88 | 93 | 185
89 | 91 | 90 | 180
96 | 98 | 100 | 196
73 | 66 | 70 | 142

In [25]:
x1 = [ 73.,  93.,  89.,  96.,  73.]
x2 = [ 80.,  88.,  91.,  98.,  66.]
x3 = [ 75.,  93.,  90., 100.,  70.]
Y  = [152., 185., 180., 196., 142.]

# weights
w1 = tf.Variable(10.)
w2 = tf.Variable(10.)
w3 = tf.Variable(10.)
b  = tf.Variable(10.)

hypothesis = w1 * x1 +  w2 * x2 + w3 * x3 + b


In [26]:
learning_rate = 0.000001

for i in range(1000+1):
  with tf.GradientTape() as tape:
    hypothesis = w1 * x1 + w2 * x2 + w3 * x3 + b
    cost = tf.reduce_mean(tf.square(hypothesis - Y))

  w1_grad, w2_grad, w3_grad, b_grad = tape.gradient(cost, [w1, w2, w3, b])

  w1.assign_sub(learning_rate * w1_grad)
  w2.assign_sub(learning_rate * w2_grad)
  w3.assign_sub(learning_rate * w3_grad)
  b.assign_sub(learning_rate * b_grad)

  if i % 50 == 0:
    print("{:5} | {:12.4f}".format(i, cost.numpy()))

    0 | 5793889.5000
   50 |   64291.1562
  100 |     715.2902
  150 |       9.8461
  200 |       2.0153
  250 |       1.9252
  300 |       1.9210
  350 |       1.9177
  400 |       1.9145
  450 |       1.9114
  500 |       1.9081
  550 |       1.9050
  600 |       1.9018
  650 |       1.8986
  700 |       1.8955
  750 |       1.8923
  800 |       1.8892
  850 |       1.8860
  900 |       1.8829
  950 |       1.8798
 1000 |       1.8767


## Multi-variable linear regression (1)
* Random initialization: tf.random.normal()


In [28]:
tf.random.set_seed(0)

In [29]:
# data and label
x1 = [ 73.,  93.,  89.,  96.,  73.]
x2 = [ 80.,  88.,  91.,  98.,  66.]
x3 = [ 75.,  93.,  90., 100.,  70.]
Y  = [152., 185., 180., 196., 142.]

# random weights
w1 = tf.Variable(tf.random.normal((1,)))
w2 = tf.Variable(tf.random.normal((1,)))
w3 = tf.Variable(tf.random.normal((1,)))
b  = tf.Variable(tf.random.normal((1,)))

learning_rate = 0.000001

for i in range(1000+1):
    # tf.GradientTape() to record the gradient of the cost function
    with tf.GradientTape() as tape:
        hypothesis = w1 * x1 +  w2 * x2 + w3 * x3 + b
        cost = tf.reduce_mean(tf.square(hypothesis - Y))
    # calculates the gradients of the cost
    w1_grad, w2_grad, w3_grad, b_grad = tape.gradient(cost, [w1, w2, w3, b])
    
    # update w1,w2,w3 and b
    w1.assign_sub(learning_rate * w1_grad)
    w2.assign_sub(learning_rate * w2_grad)
    w3.assign_sub(learning_rate * w3_grad)
    b.assign_sub(learning_rate * b_grad)

    if i % 50 == 0:
        print("{:5} | {:12.4f}".format(i, cost.numpy()))


    0 |   11325.9121
   50 |     135.3618
  100 |      11.1817
  150 |       9.7940
  200 |       9.7687
  250 |       9.7587
  300 |       9.7489
  350 |       9.7389
  400 |       9.7292
  450 |       9.7194
  500 |       9.7096
  550 |       9.6999
  600 |       9.6903
  650 |       9.6806
  700 |       9.6709
  750 |       9.6612
  800 |       9.6517
  850 |       9.6421
  900 |       9.6325
  950 |       9.6229
 1000 |       9.6134


In [31]:
data = np.array([
    # X1,   X2,    X3,   y
    [ 73.,  80.,  75., 152. ],
    [ 93.,  88.,  93., 185. ],
    [ 89.,  91.,  90., 180. ],
    [ 96.,  98., 100., 196. ],
    [ 73.,  66.,  70., 142. ]
], dtype=np.float32)

# slice data
X = data[:, :-1]
y = data[:, [-1]]

W = tf.Variable(tf.random.normal((3, 1)))
b = tf.Variable(tf.random.normal((1,)))

learning_rate = 0.000001

# hypothesis, prediction function
def predict(X):
    return tf.matmul(X, W) + b

n_epochs = 1000  # number of training steps (the log below ends at step 1000)

for i in range(n_epochs + 1):
  with tf.GradientTape() as tape:
    cost = tf.reduce_mean((tf.square(predict(X) - y)))
  
  W_grad, b_grad = tape.gradient(cost, [W, b])

  W.assign_sub(learning_rate * W_grad)
  b.assign_sub(learning_rate * b_grad)

  if i % 100 == 0:
    print("{:5} | {:10.6f}".format(i, cost.numpy()))


    0 | 9563.789062
  100 |   4.500271
  200 |   3.312405
  300 |   3.301463
  400 |   3.290700
  500 |   3.280007
  600 |   3.269355
  700 |   3.258698
  800 |   3.248233
  900 |   3.237767
 1000 |   3.227355


In [32]:
W

<tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[ 1.861813  ],
       [ 0.53110313],
       [-0.3634701 ]], dtype=float32)>

In [33]:
b

<tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([-0.5749455], dtype=float32)>

In [34]:
tf.matmul(X, W) + b

<tf.Tensor: shape=(5, 1), dtype=float32, numpy=
array([[150.56538],
       [185.50801],
       [180.74448],
       [193.86018],
       [144.9473 ]], dtype=float32)>

In [35]:
Y

[152.0, 185.0, 180.0, 196.0, 142.0]

In [36]:
predict(X)

<tf.Tensor: shape=(5, 1), dtype=float32, numpy=
array([[150.56538],
       [185.50801],
       [180.74448],
       [193.86018],
       [144.9473 ]], dtype=float32)>

In [37]:
# Predict on new data

predict([[ 89.,  95.,  92.],[ 84.,  92.,  85.]]).numpy()

array([[182.14195],
       [173.78387]], dtype=float32)
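
As a sanity check, the prediction is just `X @ W + b` evaluated with the learned parameters. The sketch below copies the printed values of `W` and `b` from the cells above into NumPy arrays; `predict_np` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

# Learned parameters copied from the printed W and b above.
W = np.array([[1.861813],
              [0.53110313],
              [-0.3634701]])        # shape (3, 1)
b = np.array([-0.5749455])          # shape (1,)

def predict_np(X):
    """NumPy counterpart of predict(): X is (n, 3), the result is (n, 1)."""
    return np.asarray(X, dtype=np.float64) @ W + b

new_scores = [[89., 95., 92.],
              [84., 92., 85.]]
predictions = predict_np(new_scores)
print(predictions)
```

The values should agree with the TensorFlow output above up to floating-point rounding, since the arithmetic is the same.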