## Sect5. Logistic (regression) classifier

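Logistic regression

- Logistic regression is a probabilistic model proposed by D. R. Cox in 1958. It is a statistical technique that uses a linear combination of the independent variables to predict the probability that an event occurs.
- Unlike linear regression, logistic regression targets a categorical dependent variable. Because each input is assigned to one of a fixed set of classes, it can also be regarded as a classification technique.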

In [4]:
# Binary outcome as a function of study time

![image.png](attachment:image.png)

$$ H(X) = W X $$
$$ z = H(X) $$
$$ g(z) $$

### When the dependent variable is binary

![image.png](attachment:image.png)

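Logistic function
- The logistic model guarantees that, for any value of the independent variable in (-∞, ∞), the dependent variable (the output) always lies in the range [0, 1].
- This is obtained from the logit transformation of the odds: setting the log-odds log(p / (1 - p)) equal to the linear predictor and solving for p gives the logistic function.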
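
The relationship between the logistic function and the logit can be checked numerically. The following is a minimal NumPy sketch, not part of the original lab; the helper names `sigmoid` and `logit` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return np.log(p / (1.0 - p))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
print(p)         # every value lies strictly between 0 and 1
print(logit(p))  # recovers z up to floating-point error
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5, which is why 0.5 is a natural decision threshold.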

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Logistic Regression Classifier

$$ H(X) = \frac{1}{1 + e^{-W^T X}} $$
$$ cost(W) = -\frac{1}{m} \sum \left[ y \log(H(x)) + (1-y) \log(1 - H(x)) \right] $$
$$ W := W - \alpha \frac{\partial}{\partial W} cost(W) $$
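
To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on the cross-entropy cost. It relies on the standard result that the gradient of this cost with respect to W is (1/m) Xᵀ(H(X) - y). The bias term `b`, the zero initialization, and the step count are assumptions made for this sketch; the data is the small study-hours example used in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(W, b, X, y):
    """Mean binary cross-entropy of the logistic hypothesis."""
    h = sigmoid(X @ W + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# x1 (hours), x2 (attendance) and pass/fail labels from this section
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 3], [6, 2]], dtype=float)
y = np.array([[0], [0], [0], [1], [1], [1]], dtype=float)
m = X.shape[0]

W = np.zeros((2, 1))  # assumption: zero init instead of random init
b = 0.0
alpha = 0.01          # learning rate

costs = []
for step in range(2000):
    h = sigmoid(X @ W + b)
    costs.append(cost_fn(W, b, X, y))
    grad_W = X.T @ (h - y) / m  # d cost / d W
    grad_b = np.mean(h - y)     # d cost / d b
    W -= alpha * grad_W         # W := W - alpha * gradient
    b -= alpha * grad_b

print(costs[0], costs[-1])  # the cost should decrease over training
```

With all parameters at zero every prediction is 0.5, so the first recorded cost equals ln 2 (about 0.693); each update then moves the parameters downhill on the cost surface.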

![image.png](attachment:image.png)

In [5]:
import tensorflow as tf

tf.set_random_seed(777)  # for reproducibility

# Training data, 2 features: x1 (hours), x2 (attendance)
x_data = [[1, 2], [2, 3], [3, 1], [4, 3], [5, 3], [6, 2]]  
# Result Data : y(0:fail or 1:pass)
y_data = [[0], [0], [0], [1], [1], [1]]                     

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 2])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([2, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid; equivalent to tf.div(1., 1. + tf.exp(-(tf.matmul(X, W) + b)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# Accuracy computation
# True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))
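
The cost above takes `tf.log` of the sigmoid output directly, which can yield infinities once the sigmoid saturates to exactly 0 or 1. As a hedged aside, the sketch below shows in NumPy a numerically stable form of the same cost computed from the logits `z = XW + b`; this is the formulation documented for `tf.nn.sigmoid_cross_entropy_with_logits`. The function names and sample values are made up for illustration.

```python
import numpy as np

def naive_cost(z, y):
    """Cross-entropy computed from the sigmoid output, as in the lab code."""
    h = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def stable_cost(z, y):
    """Same cost rewritten as max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# Moderate logits: both forms agree
z = np.array([[-2.0], [0.5], [3.0]])
y = np.array([[0.0], [1.0], [1.0]])
print(naive_cost(z, y), stable_cost(z, y))

# Extreme logit with the wrong label: the sigmoid rounds to exactly 1.0
z_big = np.array([[50.0]])
y_zero = np.array([[0.0]])
with np.errstate(divide="ignore"):
    naive_big = naive_cost(z_big, y_zero)  # log(0) makes this infinite
stable_big = stable_cost(z_big, y_zero)    # stays finite
print(naive_big, stable_big)
```

The lab's small dataset does not trigger this problem, but float32 saturates at much smaller logits than float64, so the stable form is worth knowing when costs turn into `nan` or `inf`.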

In [7]:
from tqdm import tqdm_notebook
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in tqdm_notebook(range(10001)):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0 or step < 10 :
            print("Step : {} \t Cost : {}".format(step, cost_val))
            
    # Accuracy report: h = predicted probabilities, c = thresholded predictions (0/1), a = accuracy
    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})


Step : 0 	 Cost : 4.00310754776001
Step : 1 	 Cost : 3.921337127685547
Step : 2 	 Cost : 3.8396365642547607
Step : 3 	 Cost : 3.7580204010009766
Step : 4 	 Cost : 3.6765053272247314
Step : 5 	 Cost : 3.5950968265533447
Step : 6 	 Cost : 3.5137970447540283
Step : 7 	 Cost : 3.432631254196167
Step : 8 	 Cost : 3.351590871810913
Step : 9 	 Cost : 3.270707845687866
Step : 200 	 Cost : 0.4376770257949829
Step : 400 	 Cost : 0.40408918261528015
Step : 600 	 Cost : 0.38253891468048096
Step : 800 	 Cost : 0.36638668179512024
Step : 1000 	 Cost : 0.3529997766017914
Step : 1200 	 Cost : 0.34122005105018616
Step : 1400 	 Cost : 0.3304935693740845
Step : 1600 	 Cost : 0.32053497433662415
Step : 1800 	 Cost : 0.3111867904663086
Step : 2000 	 Cost : 0.30235520005226135
Step : 2200 	 Cost : 0.29397961497306824
Step : 2400 	 Cost : 0.28601738810539246
Step : 2600 	 Cost : 0.2784360945224762
Step : 2800 	 Cost : 0.27120915055274963
Step : 3000 	 Cost : 0.26431378722190857
Step : 3200 	 Cost : 0.2577296

In [8]:
print("# Hypothesis: \n{h} \n\n# Correct (Y): \n{c} \n\n# Accuracy: {a}".format(
    h = h, c = c, a = a
))

# Hypothesis: 
[[0.0258438 ]
 [0.15162882]
 [0.28050214]
 [0.79274476]
 [0.94650894]
 [0.9825203 ]] 

# Correct (Y): 
[[0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]] 

# Accuracy: 1.0
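
The thresholding and accuracy steps can be reproduced outside TensorFlow. The sketch below, an illustration rather than part of the lab, applies the same 0.5 cutoff to the hypothesis values printed above and compares them with the training labels.

```python
import numpy as np

# Hypothesis values copied from the output above
h = np.array([[0.0258438], [0.15162882], [0.28050214],
              [0.79274476], [0.94650894], [0.9825203]])
# Training labels (y_data) from the lab
y = np.array([[0.], [0.], [0.], [1.], [1.], [1.]])

predicted = (h > 0.5).astype(np.float32)  # 1.0 where probability exceeds 0.5
accuracy = np.mean(predicted == y)        # fraction of matching predictions

print(predicted.ravel())  # [0. 0. 0. 1. 1. 1.]
print(accuracy)           # 1.0
```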


### Classifying diabetes
A hands-on example of predicting diabetes

- A predicted value of 1 means diabetes
- A predicted value of 0 means no diabetes

![image.png](attachment:image.png)

In [10]:
import pandas as pd
df = pd.read_csv('data-03-diabetes.csv', header = None)
df.head(10)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.294118 | 0.487437 | 0.180328 | -0.292929 | 0.0 | 0.00149 | -0.53117 | -0.033333 | 0 |
| 1 | -0.882353 | -0.145729 | 0.081967 | -0.414141 | 0.0 | -0.207153 | -0.766866 | -0.666667 | 1 |
| 2 | -0.058824 | 0.839196 | 0.04918 | 0.0 | 0.0 | -0.305514 | -0.492741 | -0.633333 | 0 |
| 3 | -0.882353 | -0.105528 | 0.081967 | -0.535354 | -0.777778 | -0.162444 | -0.923997 | 0.0 | 1 |
| 4 | 0.0 | 0.376884 | -0.344262 | -0.292929 | -0.602837 | 0.28465 | 0.887276 | -0.6 | 0 |
| 5 | -0.411765 | 0.165829 | 0.213115 | 0.0 | 0.0 | -0.23696 | -0.894962 | -0.7 | 1 |
| 6 | -0.647059 | -0.21608 | -0.180328 | -0.353535 | -0.791962 | -0.076006 | -0.854825 | -0.833333 | 0 |
| 7 | 0.176471 | 0.155779 | 0.0 | 0.0 | 0.0 | 0.052161 | -0.952178 | -0.733333 | 1 |
| 8 | -0.764706 | 0.979899 | 0.147541 | -0.090909 | 0.283688 | -0.090909 | -0.931682 | 0.066667 | 0 |
| 9 | -0.058824 | 0.256281 | 0.57377 | 0.0 | 0.0 | 0.0 | -0.868488 | 0.1 | 0 |


In [13]:
# Lab 5 Logistic Regression Classifier
import tensorflow as tf
import numpy as np
tf.set_random_seed(777)  # for reproducibility

xy = np.loadtxt('data-03-diabetes.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

# print(x_data.shape, y_data.shape)
print(" x_data.shape : {x_shape} \n y_data.shape : {y_shape}".format(
        x_shape = x_data.shape, 
        y_shape = y_data.shape
    ))

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 8])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([8, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid; equivalent to tf.div(1., 1. + tf.exp(-(tf.matmul(X, W) + b)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) *
                       tf.log(1 - hypothesis))

train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

# Accuracy computation
# True if hypothesis>0.5 else False
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

 x_data.shape : (759, 8) 
 y_data.shape : (759, 1)
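
The label slice `xy[:, [-1]]` uses a list index so that the result stays two-dimensional and matches the `[None, 1]` shape of placeholder `Y`. A small sketch with a made-up 3x4 array (not the diabetes data) shows the difference from a plain integer index:

```python
import numpy as np

# Toy array: 3 samples, 3 features plus a label in the last column
xy = np.array([[0.1, 0.2, 0.3, 1.0],
               [0.4, 0.5, 0.6, 0.0],
               [0.7, 0.8, 0.9, 1.0]], dtype=np.float32)

x_data = xy[:, 0:-1]  # every column except the last
y_col = xy[:, [-1]]   # list index keeps a column vector
y_flat = xy[:, -1]    # integer index drops the axis

print(x_data.shape)  # (3, 3)
print(y_col.shape)   # (3, 1)
print(y_flat.shape)  # (3,)
```

Feeding the flat `(3,)` version into a placeholder declared with shape `[None, 1]` would be rejected as a shape mismatch, which is why the lab uses the list form.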


In [14]:
print(len(xy)) 
xy[:10]

759


array([[-0.294118  ,  0.487437  ,  0.180328  , -0.292929  ,  0.        ,
         0.00149028, -0.53117   , -0.0333333 ,  0.        ],
       [-0.882353  , -0.145729  ,  0.0819672 , -0.414141  ,  0.        ,
        -0.207153  , -0.766866  , -0.666667  ,  1.        ],
       [-0.0588235 ,  0.839196  ,  0.0491803 ,  0.        ,  0.        ,
        -0.305514  , -0.492741  , -0.633333  ,  0.        ],
       [-0.882353  , -0.105528  ,  0.0819672 , -0.535354  , -0.777778  ,
        -0.162444  , -0.923997  ,  0.        ,  1.        ],
       [ 0.        ,  0.376884  , -0.344262  , -0.292929  , -0.602837  ,
         0.28465   ,  0.887276  , -0.6       ,  0.        ],
       [-0.411765  ,  0.165829  ,  0.213115  ,  0.        ,  0.        ,
        -0.23696   , -0.894962  , -0.7       ,  1.        ],
       [-0.647059  , -0.21608   , -0.180328  , -0.353535  , -0.791962  ,
        -0.0760059 , -0.854825  , -0.833333  ,  0.        ],
       [ 0.176471  ,  0.155779  ,  0.        ,  0.        ,  0

In [15]:
print(len(x_data)) 
x_data[:10]

759


array([[-0.294118  ,  0.487437  ,  0.180328  , -0.292929  ,  0.        ,
         0.00149028, -0.53117   , -0.0333333 ],
       [-0.882353  , -0.145729  ,  0.0819672 , -0.414141  ,  0.        ,
        -0.207153  , -0.766866  , -0.666667  ],
       [-0.0588235 ,  0.839196  ,  0.0491803 ,  0.        ,  0.        ,
        -0.305514  , -0.492741  , -0.633333  ],
       [-0.882353  , -0.105528  ,  0.0819672 , -0.535354  , -0.777778  ,
        -0.162444  , -0.923997  ,  0.        ],
       [ 0.        ,  0.376884  , -0.344262  , -0.292929  , -0.602837  ,
         0.28465   ,  0.887276  , -0.6       ],
       [-0.411765  ,  0.165829  ,  0.213115  ,  0.        ,  0.        ,
        -0.23696   , -0.894962  , -0.7       ],
       [-0.647059  , -0.21608   , -0.180328  , -0.353535  , -0.791962  ,
        -0.0760059 , -0.854825  , -0.833333  ],
       [ 0.176471  ,  0.155779  ,  0.        ,  0.        ,  0.        ,
         0.052161  , -0.952178  , -0.733333  ],
       [-0.764706  ,  0.979899  

In [16]:
print(len(y_data)) 
y_data[:10]

759


array([[0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]], dtype=float32)

In [17]:
# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())

    for step in tqdm_notebook(range(10001)):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})       
        if step % 200 == 0 or step < 10 :
            print("Step : {} \t Cost : {}".format(step, cost_val))    

    # Accuracy report: h = predicted probabilities, c = thresholded predictions (0/1), a = accuracy
    h, c, a = sess.run([hypothesis, predicted, accuracy],
                       feed_dict={X: x_data, Y: y_data})


Step : 0 	 Cost : 1.0964912176132202
Step : 1 	 Cost : 1.0936403274536133
Step : 2 	 Cost : 1.090810775756836
Step : 3 	 Cost : 1.0880019664764404
Step : 4 	 Cost : 1.0852138996124268
Step : 5 	 Cost : 1.082446575164795
Step : 6 	 Cost : 1.079700231552124
Step : 7 	 Cost : 1.076973795890808
Step : 8 	 Cost : 1.074268102645874
Step : 9 	 Cost : 1.0715829133987427
Step : 200 	 Cost : 0.809843897819519
Step : 400 	 Cost : 0.7428123950958252
Step : 600 	 Cost : 0.7107452154159546
Step : 800 	 Cost : 0.6862677931785583
Step : 1000 	 Cost : 0.6653096079826355
Step : 1200 	 Cost : 0.6469992399215698
Step : 1400 	 Cost : 0.6309411525726318
Step : 1600 	 Cost : 0.6168333292007446
Step : 1800 	 Cost : 0.6044145226478577
Step : 2000 	 Cost : 0.5934560298919678
Step : 2200 	 Cost : 0.5837594866752625
Step : 2400 	 Cost : 0.5751535892486572
Step : 2600 	 Cost : 0.5674916505813599
Step : 2800 	 Cost : 0.5606480240821838
Step : 3000 	 Cost : 0.5545151829719543
Step : 3200 	 Cost : 0.5490014553070068


In [18]:
print("# Hypothesis: \n{h} \n\n# Correct (Y): \n{c} \n\n# Accuracy: {a}".format(
    # h = h, c = c, a = a
    h = h[:20], c = c[:20], a = a
))

# Hypothesis: 
[[0.43222722]
 [0.916368  ]
 [0.26018637]
 [0.93752694]
 [0.34695348]
 [0.7429852 ]
 [0.95144475]
 [0.66804147]
 [0.2602672 ]
 [0.47056407]
 [0.64339465]
 [0.20578018]
 [0.28667298]
 [0.30300844]
 [0.7840216 ]
 [0.4736861 ]
 [0.7301069 ]
 [0.92308855]
 [0.8083443 ]
 [0.5410607 ]] 

# Correct (Y): 
[[0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]] 

# Accuracy: 0.7641633749008179
