# Logistic Regression on Income Data  

Build an algorithm that uses personal attributes to decide whether a person's annual income exceeds 50K dollars  
dataset: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset  

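The input features of the logistic regression are age, length of education, and weekly working hours ("age","educational-num","hours-per-week"); the label is whether annual income exceeds 50K (income)  
Gradient descent is used to find the optimal parameters  
sympy is the Python package used for differentiation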
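
For reference, the model and loss described above can be written out as follows. This is a sketch in the notation of the code cells in this notebook ($x_1, x_2, x_3$ for the standardized features, $a, b, c, d$ for the parameters); $\eta$ for the learning rate and $N$ for the number of rows are symbols added here for illustration. The closed-form gradients are the standard result for a sigmoid combined with binary cross-entropy, and the symbolic derivatives from sympy should be equivalent to them.

$$
z = a x_1 + b x_2 + c x_3 + d, \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
L(y, p) = -y \log p - (1 - y) \log (1 - p)
$$

$$
\frac{\partial L}{\partial a} = (p - y)\,x_1, \quad
\frac{\partial L}{\partial b} = (p - y)\,x_2, \quad
\frac{\partial L}{\partial c} = (p - y)\,x_3, \quad
\frac{\partial L}{\partial d} = p - y
$$

$$
a \leftarrow a - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L^{(i)}}{\partial a}
\quad \text{(and likewise for } b, c, d\text{)}
$$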

In [None]:
!kaggle datasets download -d wenruliu/adult-income-dataset
!unzip "adult-income-dataset.zip"

Dataset URL: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
License(s): unknown
Downloading adult-income-dataset.zip to /content
  0% 0.00/652k [00:00<?, ?B/s]
100% 652k/652k [00:00<00:00, 72.9MB/s]
Archive:  adult-income-dataset.zip
  inflating: adult.csv               


Building the dataset  
For convenience, no separate test dataset was built  
The features were standardized (zero mean, unit standard deviation) for the logistic regression  
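
As a small illustration of the standardization step, the sketch below converts one column to z-scores using only the standard library. The helper name `standardize` is hypothetical, and the five ages are just the first rows of adult.csv, so the resulting values differ from those computed over the 100 rows used in this notebook. `statistics.stdev` is the sample standard deviation (dividing by n - 1), which matches the default of pandas' `.std()`.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like pandas' default .std()
    return [(v - mean) / std for v in values]

# Illustrative input: the ages of the first five rows of adult.csv.
ages = [25, 38, 28, 44, 18]
z_ages = standardize(ages)
print([round(z, 3) for z in z_ages])
```

After standardization the column has mean 0 and standard deviation 1, which keeps the three features on a comparable scale for gradient descent.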

In [None]:
import pandas as pd

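# Use only the first 100 rows
dataset = pd.read_csv("adult.csv")[:100]
# Copy the selected columns so the assignments below modify an independent DataFrame
dataset = dataset[["age","educational-num","hours-per-week","income"]].copy()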
print("dataset length:",len(dataset))
print(dataset.head())
dataset["age"] = (dataset["age"] - dataset["age"].mean()) / dataset["age"].std()  # standardize
dataset["educational-num"] = (dataset["educational-num"] - dataset["educational-num"].mean()) / dataset["educational-num"].std()  # standardize
dataset["hours-per-week"] = (dataset["hours-per-week"] - dataset["hours-per-week"].mean()) / dataset["hours-per-week"].std()  # standardize
print("after normalization")
print(dataset.head())

# Encode the label: "<=50K" becomes 0, ">50K" becomes 1
dataset["income"] = dataset["income"].apply(lambda x: 0 if x == "<=50K" else 1)
dataset.head()


dataset length: 100
   age  educational-num  hours-per-week income
0   25                7              40  <=50K
1   38                9              50  <=50K
2   28               12              40   >50K
3   44               10              40   >50K
4   18               10              30  <=50K
after normalization
        age  educational-num  hours-per-week income
0 -0.878511        -1.090694        0.041897  <=50K
1  0.038074        -0.306022        0.863416  <=50K
2 -0.666992         0.870986        0.041897   >50K
3  0.461113         0.086314        0.041897   >50K
4 -1.372057         0.086314       -0.779621  <=50K


        age  educational-num  hours-per-week  income
0 -0.878511        -1.090694        0.041897       0
1  0.038074        -0.306022        0.863416       0
2 -0.666992         0.870986        0.041897       1
3  0.461113         0.086314        0.041897       1
4 -1.372057         0.086314       -0.779621       0


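Code to write  
Use 0.1, 200, and 0.3 for the learning rate, the number of iterations (epoch), and the initial values of a, b, c, d (tmp_a, tmp_b, tmp_c, tmp_d), respectively  
1. Create the linear function and apply the sigmoid
2. Create the binary cross entropy function
3. Create the derivative (gradient) of the binary cross entropy function with respect to each parameter
4. Compute the gradient for each sample in the dataset
5. Average the gradients over the whole dataset
6. Update the parameters with the averaged gradients
7. Repeat the update for epoch iterations  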
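
The steps above can be sketched on a tiny one-feature problem as follows. This is a minimal illustration under stated assumptions, not the notebook's solution: the toy data, the symbol names `w` and `b`, and the helper `mean_loss` are hypothetical, and `sp.lambdify` is used to compile the symbolic expressions into plain Python functions once instead of calling `.subs()` for every sample.

```python
import sympy as sp

# Hypothetical toy data: one feature x and a binary label y (not from adult.csv).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

# Steps 1-2: linear function + sigmoid, then binary cross-entropy.
x, y, w, b = sp.symbols("x y w b")
p = 1 / (1 + sp.exp(-(w * x + b)))
bce = -y * sp.log(p) - (1 - y) * sp.log(1 - p)

# Step 3: symbolic partial derivatives, compiled once into numeric functions.
args = (x, y, w, b)
loss_fn = sp.lambdify(args, bce, modules="math")
grad_w_fn = sp.lambdify(args, sp.diff(bce, w), modules="math")
grad_b_fn = sp.lambdify(args, sp.diff(bce, b), modules="math")

def mean_loss(w_val, b_val):
    """Average binary cross-entropy over the toy dataset."""
    return sum(loss_fn(xi, yi, w_val, b_val) for xi, yi in zip(xs, ys)) / len(xs)

learning_rate = 0.1
w_val, b_val = 0.3, 0.3
initial_loss = mean_loss(w_val, b_val)

# Steps 4-7: per-sample gradients, averaged, applied, and repeated.
for _ in range(100):
    mean_grad_w = sum(grad_w_fn(xi, yi, w_val, b_val) for xi, yi in zip(xs, ys)) / len(xs)
    mean_grad_b = sum(grad_b_fn(xi, yi, w_val, b_val) for xi, yi in zip(xs, ys)) / len(xs)
    w_val -= learning_rate * mean_grad_w
    b_val -= learning_rate * mean_grad_b

final_loss = mean_loss(w_val, b_val)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

Note that the averaged gradients are stored under their own names (`mean_grad_w`, `mean_grad_b`), so the compiled gradient functions remain available in every iteration.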

In [None]:
import sympy as sp

learning_rate = 0.1
epoch = 200

# add here
# Declare the symbols needed to define the functions (input features, label, parameters)
x1,x2,x3,y,a,b,c,d = sp.symbols("x1 x2 x3 y a b c d")

# Initial value of each parameter
tmp_a = 0.3
tmp_b = 0.3
tmp_c = 0.3
tmp_d = 0.3

# Sigmoid function, applied to the linear function below
def sigmoid(z):
    return 1 / (1 + sp.exp(-z))

# start here
# add here
# Implement the logistic regression formula (linear function followed by sigmoid)
p = sigmoid(a*x1 + b*x2 + c*x3 + d)

# add here
# Implement the binary cross entropy formula
# Build the binary cross entropy function from the model output p (use sp.log(x) for the logarithm)
BinaryCrossEntropy = -y*sp.log(p) - (1-y)*sp.log(1-p)
# this part is important

# add here
# Create the partial derivative with respect to each parameter
gradient_a = sp.diff(BinaryCrossEntropy,a)
gradient_b = sp.diff(BinaryCrossEntropy,b)
gradient_c = sp.diff(BinaryCrossEntropy,c)
gradient_d = sp.diff(BinaryCrossEntropy,d)

# Repeat gradient descent for `epoch` iterations
for E in range(epoch):

  # Evaluation code
  print(f"Epoch {E}")
  print(f"a = {tmp_a}, b = {tmp_b}, c = {tmp_c}, d = {tmp_d}")

  correct_counter = 0
  for i in range(len(dataset)):
    data_x1, data_x2, data_x3, data_y = dataset.iloc[i]

    # Run the logistic regression (predicted probability for this sample)
    result = p.subs({x1:data_x1, x2:data_x2, x3:data_x3, a:tmp_a, b:tmp_b, c:tmp_c, d:tmp_d})
    if result >= 0.5 and data_y == 1:
      correct_counter += 1

    elif result < 0.5 and data_y == 0:
      correct_counter += 1

  print(f"train accuracy: {correct_counter/len(dataset)}")

  # Update the parameters with gradient descent
  # Compute the gradient for every sample, then update with the average
  gradient_a_list = []
  gradient_b_list = []
  gradient_c_list = []
  gradient_d_list = []
  for i in range(len(dataset)):
    # Compute and store the gradient one sample at a time
    data_x1, data_x2, data_x3, data_y = dataset.iloc[i]

    # add here
    # Evaluate the gradient of each parameter at the current sample
    grad_a = gradient_a.subs({x1: data_x1, x2: data_x2, x3: data_x3, y: data_y, a: tmp_a, b: tmp_b, c: tmp_c, d: tmp_d})
    grad_b = gradient_b.subs({x1: data_x1, x2: data_x2, x3: data_x3, y: data_y, a: tmp_a, b: tmp_b, c: tmp_c, d: tmp_d})
    grad_c = gradient_c.subs({x1: data_x1, x2: data_x2, x3: data_x3, y: data_y, a: tmp_a, b: tmp_b, c: tmp_c, d: tmp_d})
    grad_d = gradient_d.subs({x1: data_x1, x2: data_x2, x3: data_x3, y: data_y, a: tmp_a, b: tmp_b, c: tmp_c, d: tmp_d})

    gradient_a_list.append(grad_a)
    gradient_b_list.append(grad_b)
    gradient_c_list.append(grad_c)
    gradient_d_list.append(grad_d)

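  # Average the per-sample gradients to get the final gradient.
  # The averages get their own names so that the symbolic expressions
  # gradient_a ... gradient_d stay available for the next epoch.
  mean_gradient_a = sum(gradient_a_list)/len(dataset)
  mean_gradient_b = sum(gradient_b_list)/len(dataset)
  mean_gradient_c = sum(gradient_c_list)/len(dataset)
  mean_gradient_d = sum(gradient_d_list)/len(dataset)

  # add here
  # Update each parameter with its averaged gradient
  tmp_a = tmp_a - learning_rate*mean_gradient_a
  tmp_b = tmp_b - learning_rate*mean_gradient_b
  tmp_c = tmp_c - learning_rate*mean_gradient_c
  tmp_d = tmp_d - learning_rate*mean_gradient_d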

Epoch 0
a = 0.3, b = 0.3, c = 0.3, d = 0.3
train accuracy: 0.51
Epoch 1
a = 0.304387066331472, b = 0.308015977162173, c = 0.302460734427025, d = 0.302460734427025
train accuracy: 0.51
Epoch 2
a = 0.308774132662944, b = 0.316031954324346, c = 0.304921468854050, d = 0.304921468854050
train accuracy: 0.51
Epoch 3
a = 0.313161198994417, b = 0.324047931486519, c = 0.307382203281076, d = 0.307382203281076
train accuracy: 0.52
Epoch 4
a = 0.317548265325889, b = 0.332063908648692, c = 0.309842937708101, d = 0.309842937708101
train accuracy: 0.52
Epoch 5
a = 0.321935331657361, b = 0.340079885810865, c = 0.312303672135126, d = 0.312303672135126
train accuracy: 0.52
Epoch 6
a = 0.326322397988833, b = 0.348095862973038, c = 0.314764406562151, d = 0.314764406562151
train accuracy: 0.52
Epoch 7
a = 0.330709464320306, b = 0.356111840135211, c = 0.317225140989176, d = 0.317225140989176
train accuracy: 0.52
Epoch 8
a = 0.335096530651778, b = 0.364127817297384, c = 0.319685875416202, d = 0.3196858754162

In [None]:
# prompt: logistic regression of dataset using scikit-learn

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(dataset.drop("income", axis=1), dataset["income"])
print(model.coef_)
print(model.intercept_)

print("Accuracy:", model.score(dataset.drop("income", axis=1), dataset["income"]))


[[0.76599138 1.10095021 0.65505929]]
[-1.61672803]
Accuracy: 0.8
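
To relate these numbers to the parameters a, b, c, d of the hand-written model, the sketch below plugs the printed coefficients and intercept into the same sigmoid formula for the first five standardized rows shown earlier. It assumes the order of `coef_` follows the column order passed to `fit` (age, educational-num, hours-per-week); the helper name `predict_proba_row` is hypothetical. Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so its coefficients need not coincide with those of plain gradient descent.

```python
import math

# Coefficients and intercept as printed by the scikit-learn cell.
coef = [0.76599138, 1.10095021, 0.65505929]  # plays the role of a, b, c
intercept = -1.61672803                      # plays the role of d

# First five standardized rows (age, educational-num, hours-per-week, income),
# copied from the dataset.head() output.
rows = [
    (-0.878511, -1.090694, 0.041897, 0),
    (0.038074, -0.306022, 0.863416, 0),
    (-0.666992, 0.870986, 0.041897, 1),
    (0.461113, 0.086314, 0.041897, 1),
    (-1.372057, 0.086314, -0.779621, 0),
]

def predict_proba_row(features):
    """Sigmoid of the linear combination, i.e. the estimated P(income > 50K)."""
    z = sum(w * x for w, x in zip(coef, features)) + intercept
    return 1 / (1 + math.exp(-z))

probabilities = [predict_proba_row(row[:3]) for row in rows]
for row, prob in zip(rows, probabilities):
    print(f"p(>50K) = {prob:.3f}, predicted = {int(prob >= 0.5)}, label = {row[3]}")
```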

