# Lecture 6. Softmax classification - Multinomial classification
---

## Logistic regression

- Logistic function
- Sigmoid function
 - sigmoid is curved in two directions, like the letter "S", or the Greek $\varsigma$ (sigma)
$$g(z) = \frac{1}{(1 + e^{-z})}$$
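
The formula above can be sketched in a few lines of plain Python. This is only an illustration; the branch on the sign of `z` is an added assumption to avoid overflow in `math.exp` for large negative inputs and is not part of the lecture.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The output is squeezed into the range 0..1 and equals 0.5 at z = 0.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```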

### Hypothesis
$$
\begin{align}
H(X) &= g(z) \\
z &= W^T X \\
H(X) &= \frac{1}{1+e^{-W^T X}}
\end{align}
$$

## Multinomial classification

### OvR ( One vs Rest )
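- For a classification problem with three classes A, B, and C,
 - one classifier decides whether the input is A or not,
 - one decides whether it is B or not,
 - and one decides whether it is C or not, so the problem can be implemented with 3 binary classifiers.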
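
As a rough sketch of the one-vs-rest idea, the snippet below runs three independent binary (sigmoid) classifiers and picks the class whose classifier is most confident. The weights in `classifiers` and the input vector are hypothetical values chosen for illustration, not learned from any data in this lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_score(w, x):
    """H(x) = sigmoid(w . x) for one 'class vs rest' classifier."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical weights, one vector per binary classifier (assumption, not trained).
classifiers = {
    "A": [1.0, -2.0, 0.5],
    "B": [-0.5, 0.3, 0.2],
    "C": [0.0, 1.0, -1.0],
}

def predict_ovr(x):
    """Score x with every binary classifier and return the most confident class."""
    scores = {label: binary_score(w, x) for label, w in classifiers.items()}
    return max(scores, key=scores.get), scores

label, scores = predict_ovr([1.0, 2.0, 1.0])
print(label, scores)
```

Note that the three scores are independent sigmoid outputs and do not have to sum to 1, which is one motivation for the softmax formulation that follows.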
 
### Matrix form

![](./img/06-softmax01.png?raw=true)

![](./img/06-softmax02.png?raw=true)

### Softmax - hypothesis
![](./img/06-softmax03.png?raw=true)
![](./img/06-softmax04.png?raw=true)
![](./img/06-softmax05.png?raw=true)
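
For reference, softmax maps a vector of scores $y$ to probabilities $S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$. A minimal sketch in plain Python follows; the input scores are illustrative, and subtracting the maximum is a common numerical-stability trick assumed here rather than something taken from the slides.

```python
import math

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores
print(probs)       # roughly [0.66, 0.24, 0.10]
print(sum(probs))  # probabilities sum to 1 (up to rounding)
```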

### Cost function
- L : label (actual value)
- S : softmax output (predicted value)

![](./img/06-softmax06.png?raw=true)

### Cross entropy cost function
- $-\sum_i L_i \log{S_i} = -\sum_i L_i \log{\hat{y}_i} = \sum_i L_i \odot (-\log{\hat{y}_i})$
- $0 \le \hat{y}_i \le 1$ -> $0 \le -\log{\hat{y}_i} \le +\infty$

![](./img/06-softmax07.png?raw=true)

- **ex.** $L = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = B$ (actual)
 - $\hat{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = B$ (correct) -> $\begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot - \log \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot \begin{bmatrix} \infty \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0$
 - $\hat{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = A$ (wrong) -> $\begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot - \log \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot \begin{bmatrix} 0 \\ \infty \end{bmatrix} = \begin{bmatrix} 0 \\ \infty \end{bmatrix} = \infty$

- **ex.** $L = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = A$ (actual)
 - $\hat{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = A$ (correct) -> $\begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot - \log \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot \begin{bmatrix} 0 \\ \infty \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0$
 - $\hat{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = B$ (wrong) -> $\begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot - \log \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot \begin{bmatrix} \infty \\ 0 \end{bmatrix} = \begin{bmatrix} \infty \\ 0 \end{bmatrix} = \infty$
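
The worked examples above can be checked with a small sketch. It assumes the usual convention that a term with $L_i = 0$ contributes nothing (so $0 \cdot \infty$ is treated as 0), and the soft prediction `[0.3, 0.7]` at the end is an extra illustrative input, not one from the lecture.

```python
import math

def cross_entropy(label, pred):
    """D(S, L) = -sum_i L_i * log(S_i), skipping terms where L_i == 0."""
    total = 0.0
    for l, s in zip(label, pred):
        if l == 0:
            continue  # 0 * (-log s) is taken as 0, even when s == 0
        if s > 0:
            total += -l * math.log(s)
        else:
            total += math.inf  # -log(0) is infinite
    return total

print(cross_entropy([0, 1], [0, 1]))      # correct prediction -> 0.0
print(cross_entropy([0, 1], [1, 0]))      # wrong prediction   -> inf
print(cross_entropy([0, 1], [0.3, 0.7]))  # mostly right: small positive cost
```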

## Logistic cost VS cross entropy

- The two are the same
 - Logistic cost
   - $c(H(x), y) = -y \cdot \log(H(x)) - (1-y) \cdot \log(1-H(x))$
 - Cross entropy
   - $D(S, L) = -\sum_i L_i \log(S_i)$
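
One way to see the equivalence is to write the binary label as the two-element vector $L = [y, 1-y]$ and the prediction as $S = [H(x), 1-H(x)]$; the cross entropy sum over two classes then expands to exactly the logistic cost. The sketch below checks this numerically, with `h = 0.8` as an arbitrary, hypothetical prediction.

```python
import math

def logistic_cost(h, y):
    """c(H(x), y) = -y log(H(x)) - (1 - y) log(1 - H(x))."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def cross_entropy(label, pred):
    """D(S, L) = -sum_i L_i log(S_i)."""
    return -sum(l * math.log(s) for l, s in zip(label, pred))

h = 0.8  # hypothetical predicted probability of class 1
results = []
for y in (0, 1):
    a = logistic_cost(h, y)
    b = cross_entropy([y, 1 - y], [h, 1 - h])
    results.append((a, b))
    print(y, a, b)  # the two costs agree for both labels
```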

## Cost function
- with training set

![](./img/06-softmax08.png?raw=true)

- $\mathcal{L} : Loss$
- $D(w_1, w_2) : Distance$

## Gradient descent
- Step : $-\alpha \times \text{derivative} = -\alpha \nabla \mathcal{L}(w_1, w_2)$
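
To make the update rule concrete, here is a minimal sketch of gradient descent on a two-parameter loss. The quadratic `loss` below is a hypothetical stand-in chosen because its gradient is easy to write by hand; it is not the cross entropy loss of the lecture, and `alpha = 0.1` is an arbitrary learning rate.

```python
def loss(w1, w2):
    # Hypothetical bowl-shaped loss with its minimum at (3, -1).
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2

def grad(w1, w2):
    # Partial derivatives of the loss with respect to w1 and w2.
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)

alpha = 0.1          # learning rate
w1, w2 = 0.0, 0.0    # starting point
for step in range(100):
    g1, g2 = grad(w1, w2)
    # Step in the direction opposite to the gradient: w <- w - alpha * dL/dw
    w1 -= alpha * g1
    w2 -= alpha * g2

print(w1, w2, loss(w1, w2))  # approaches the minimum at (3, -1)
```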

# Lab 6. Softmax classifier
---
- `tf.matmul(X, W)`
- `hypothesis = tf.nn.softmax(tf.matmul(X, W))`
- `cost = tf.reduce_mean(-tf.reduce_sum(Y*tf.log(hypothesis), reduction_indices=1))`
- `optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)`

```python
# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
import tensorflow as tf
import numpy as np

# unpack=True loads the file column-wise, so transpose back to (samples, columns)
xy = np.loadtxt('./data/train_softmax.txt', unpack=True, dtype='float32')
x_data = np.transpose(xy[0:3])
y_data = np.transpose(xy[3:])

# tf Graph Input
# x1, x2 and 1 (for bias); None: the number of samples is not known in advance
X = tf.placeholder("float", [None, 3])
y = tf.placeholder("float", [None, 3])  # A, B, C => 3 classes

# Set model weights
W = tf.Variable(tf.zeros([3, 3]))  # W: 3 x 3 (3 features, 3 labels)

# Construct model
# https://www.tensorflow.org/tutorials/mnist/beginners/
# "First, we multiply x by W with the expression tf.matmul(x, W).
# This is flipped from when we multiplied them in our equation,
# where we had Wx, as a small trick to deal with x being a 2D tensor with multiple inputs.
# We then add b, and finally apply tf.nn.softmax."
hypothesis = tf.nn.softmax(tf.matmul(X, W))  # Softmax

# Minimize error using cross entropy
learning_rate = 0.001

# Cross entropy
# reduce_sum has no axis argument here, so it sums over all samples and classes;
# the outer reduce_mean then receives a scalar and leaves it unchanged.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(hypothesis)))

# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.global_variables_initializer()

# Launch
with tf.Session() as sess:
    sess.run(init)

    for step in range(2001):
        sess.run(optimizer, feed_dict={X: x_data, y: y_data})
        if step % 500 == 0:
            print(step, sess.run(cost, feed_dict={X: x_data, y: y_data}), sess.run(W))
```

```
0 8.73609 [[-0.00066667  0.00033333  0.00033333]
 [ 0.00133333  0.00233333 -0.00366667]
 [ 0.00133333  0.00333333 -0.00466667]]
500 7.40905 [[-0.41474834 -0.10312591  0.51787436]
 [ 0.03761362 -0.10297523  0.06536176]
 [ 0.07454177  0.17755088 -0.2520926 ]]
1000 6.7281 [[-0.76251483 -0.19494899  0.95746374]
 [ 0.05222586 -0.12449741  0.07227183]
 [ 0.13092487  0.22215752 -0.35308245]]
1500 6.24793 [[-1.0621376  -0.26723006  1.3293674 ]
 [ 0.06807303 -0.11824042  0.05016781]
 [ 0.17547619  0.23513761 -0.41061375]]
2000 5.88735 [[-1.3272748  -0.32214603  1.64941895]
 [ 0.08332979 -0.10558683  0.02225749]
 [ 0.21334152  0.23823248 -0.45157382]]
```
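
The cost in the training code calls `tf.reduce_sum` without an axis, while the summary at the top of the lab passes `reduction_indices=1`. The plain-Python sketch below illustrates the difference between the two on a tiny, hypothetical batch (the labels `Y` and softmax outputs `S` are made up): summing everything gives the total cross entropy, whereas summing per sample and then averaging gives the mean, which is smaller by a factor of the batch size.

```python
import math

# Hypothetical one-hot labels and softmax outputs: 2 samples, 3 classes.
Y = [[1, 0, 0], [0, 0, 1]]
S = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

# Cross entropy of each sample: sum over the class axis (axis=1).
per_sample = [
    -sum(l * math.log(s) for l, s in zip(labels, probs))
    for labels, probs in zip(Y, S)
]

total_cost = sum(per_sample)                   # like reduce_sum over all elements
mean_cost = sum(per_sample) / len(per_sample)  # like reduce_sum over axis 1, then reduce_mean

print(per_sample)
print(total_cost, mean_cost)
```

Both variants have the same minimizer; the total simply scales the gradient by the number of samples, which interacts with the choice of learning rate.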


### Predict

```python
import tensorflow as tf
import numpy as np

# Load data
xy = np.loadtxt("./data/train_softmax.txt", dtype="float32")
x_data = xy[:, 0:3]
y_data = xy[:, 3:]

# tf Graph Input
X = tf.placeholder("float", [None, 3])
y = tf.placeholder("float", [None, 3])

# Set model weights
W = tf.Variable(tf.zeros([3, 3]))

# Construct model
hypothesis = tf.nn.softmax(tf.matmul(X, W))  # Softmax

# Cost function (cross entropy)
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(hypothesis)))

# Minimize
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost)

# Init
init = tf.global_variables_initializer()

# Launch
with tf.Session() as sess:
    sess.run(init)

    for step in range(2001):
        sess.run(optimizer, feed_dict={X: x_data, y: y_data})
        if step % 500 == 0:
            print(step, sess.run(cost, feed_dict={X: x_data, y: y_data}), sess.run(W))

    # Predict: hypothesis returns class probabilities, argmax picks the most likely class
    a = sess.run(hypothesis, feed_dict={X: [[1, 11, 7]]})
    print(a, sess.run(tf.argmax(a, 1)))

    b = sess.run(hypothesis, feed_dict={X: [[1, 3, 4]]})
    print(b, sess.run(tf.argmax(b, 1)))

    c = sess.run(hypothesis, feed_dict={X: [[1, 1, 0]]})
    print(c, sess.run(tf.argmax(c, 1)))

    # Several inputs can be classified in one call (renamed from `all` to avoid shadowing the builtin)
    all_preds = sess.run(hypothesis, feed_dict={X: [[1, 11, 7], [1, 3, 4], [1, 1, 0]]})
    print(all_preds, sess.run(tf.argmax(all_preds, 1)))
```

```
0 8.73609 [[-0.00066667  0.00033333  0.00033333]
 [ 0.00133333  0.00233333 -0.00366667]
 [ 0.00133333  0.00333333 -0.00466667]]
500 7.40905 [[-0.41474834 -0.10312591  0.51787436]
 [ 0.03761362 -0.10297523  0.06536176]
 [ 0.07454177  0.17755088 -0.2520926 ]]
1000 6.7281 [[-0.76251483 -0.19494899  0.95746374]
 [ 0.05222586 -0.12449741  0.07227183]
 [ 0.13092487  0.22215752 -0.35308245]]
1500 6.24793 [[-1.0621376  -0.26723006  1.3293674 ]
 [ 0.06807303 -0.11824042  0.05016781]
 [ 0.17547619  0.23513761 -0.41061375]]
2000 5.88735 [[-1.3272748  -0.32214603  1.64941895]
 [ 0.08332979 -0.10558683  0.02225749]
 [ 0.21334152  0.23823248 -0.45157382]]
[[ 0.66555417  0.27094138  0.06350452]] [0]
[[ 0.25935882  0.44414687  0.29649431]] [1]
[[ 0.04603587  0.10412923  0.84983486]] [2]
[[ 0.66555417  0.27094138  0.06350452]
 [ 0.25935882  0.44414687  0.29649431]
 [ 0.04603587  0.10412923  0.84983486]] [0 1 2]
```
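
The last step of the prediction, turning each row of probabilities into a class index, can be reproduced without TensorFlow. The sketch below copies the probabilities from the printed output above; mapping index 0, 1, 2 to the names A, B, C is an assumption based on the order in the placeholder comment.

```python
# Probabilities copied from the printed output above (one row per input).
probs = [
    [0.66555417, 0.27094138, 0.06350452],
    [0.25935882, 0.44414687, 0.29649431],
    [0.04603587, 0.10412923, 0.84983486],
]
labels = ["A", "B", "C"]  # assumed index-to-class order

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

indices = [argmax(row) for row in probs]
print(indices, [labels[i] for i in indices])  # [0, 1, 2] ['A', 'B', 'C']
```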
