# The data is the Prayers of the Faithful (보편지향 기도) dataset created by Brother 김도휘 and Brother 김명찬.

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Read the prayer texts from the CSV file
def read_data(path_to_file):
    df = pd.read_csv(path_to_file, dtype=str)
    return df

df = read_data('../../data/pray456_v3.csv')

In [39]:
df.to_csv('../../data/pray456_v3withid.csv')

In [40]:
X = df['content']
y = df['label']
print(len(X))
print(len(y))

774
774


In [41]:
X[0]

'주님, 대림시기를 맞는 교회가 회개와 화해의 생활을 하며 저희에게  오실 아기 예수님을 기쁜 마음으로 맞이할 수 있도록 도와주소서.'

In [42]:
y_quiz = df['content'].sample(50)
y_quiz.shape

(50,)

In [43]:
y_quiz.sort_index().to_csv('../../data/quiz_pray1_sample50.csv')

## y data encoding

In [44]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

print(type(y[0]), y[:5])

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(y)
print(integer_encoded[:5])

# # one_hot encode
# onehot_encoder = OneHotEncoder(sparse=False)
# integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
# onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
# print(onehot_encoded[:5])

# setup y 
# y = onehot_encoded

y = integer_encoded
print(y[:5])

<class 'str'> 0    1
1    2
2    3
3    4
4    1
Name: label, dtype: object
[0 1 2 3 0]
[0 1 2 3 0]
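
`LabelEncoder` assigns integers by sorting the distinct label values, which is why the string labels `'1'` to `'4'` above become `0` to `3`. A minimal pure-Python sketch of that mapping follows; the helper name `encode_labels` is hypothetical and not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to its rank among the sorted unique labels."""
    classes = sorted(set(labels))
    class_to_index = {label: i for i, label in enumerate(classes)}
    return [class_to_index[label] for label in labels], classes


encoded, classes = encode_labels(['1', '2', '3', '4', '1'])
print(encoded)   # [0, 1, 2, 3, 0]
print(classes)   # ['1', '2', '3', '4']
```

Indexing `classes` with an encoded value recovers the original label, which plays the role of `inverse_transform` in this sketch.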


## Split on whitespace

In [45]:
X = [x.split() for x in X]

In [46]:
X[0]

['주님,',
 '대림시기를',
 '맞는',
 '교회가',
 '회개와',
 '화해의',
 '생활을',
 '하며',
 '저희에게',
 '오실',
 '아기',
 '예수님을',
 '기쁜',
 '마음으로',
 '맞이할',
 '수',
 '있도록',
 '도와주소서.']
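
Splitting on whitespace leaves punctuation attached, so `'주님,'` and `'주님'` would later receive different indices. One possible normalization, sketched here under the assumption that only a small set of edge punctuation matters, strips those marks from each token. The helper `tokenize` and the `PUNCTUATION` set are illustrative and not part of the original pipeline.

```python
PUNCTUATION = '.,!?'  # assumed set; extend as the corpus requires


def tokenize(sentence):
    """Split on whitespace, then strip edge punctuation from each token."""
    tokens = (token.strip(PUNCTUATION) for token in sentence.split())
    return [token for token in tokens if token]


print(tokenize('주님, 저희를 도와주소서.'))  # ['주님', '저희를', '도와주소서']
```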

## Index unique tokens

In [47]:
from collections import defaultdict

In [48]:
# Dictionary that assigns a unique index to each word
token_to_index = defaultdict(lambda : len(token_to_index))

In [49]:
# Function that converts each word to its unique index
def convert_token_to_idx(token_ls):
    for tokens in token_ls:
        yield [token_to_index[token] for token in tokens]
    return

In [50]:
X = list(convert_token_to_idx(X))

In [51]:
# Once words are converted to indices it is hard to tell what they were,
# so build a dictionary that maps each index back to its original word
index_to_token = {val : key for key,val in token_to_index.items()}

#### Check the indexing result

In [52]:
import operator

In [53]:
for k,v in sorted(token_to_index.items(), key=operator.itemgetter(1))[:5]:
    print(k, v)

주님, 0
대림시기를 1
맞는 2
교회가 3
회개와 4


In [54]:
X[0]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
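
The `defaultdict(lambda: len(...))` idiom hands out the next free index the first time a token is looked up. A side effect worth knowing is that every lookup of an unseen token grows the vocabulary, which matters if new text is indexed after the bag-of-words width has been fixed. The sketch below uses a hypothetical `vocab` dictionary and toy tokens to show both behaviours.

```python
from collections import defaultdict

vocab = defaultdict(lambda: len(vocab))

train_ids = [vocab[token] for token in ['주님,', '교회가', '주님,']]
print(train_ids)     # [0, 1, 0]
print(len(vocab))    # 2

# Looking up an unseen token silently adds it to the vocabulary.
unseen_id = vocab['새단어']
print(unseen_id, len(vocab))   # 2 3

# Freezing into a plain dict makes unseen tokens explicit instead.
frozen = dict(vocab)
print(frozen.get('또다른단어', -1))   # -1
```

In this notebook all texts are indexed before the train/test split, so the growth side effect does not come into play.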

### Create an empty Bag of Words

In [55]:
n_train_reviews = len(X)       # total number of texts (prayers)
n_unique_word = len(token_to_index)  # number of unique words (dimensionality of the BoW)

In [56]:
n_unique_word

3329

### A dense numpy array can raise a memory error on large corpora (this 774 x 3329 corpus fits)

In [57]:
import numpy as np

In [58]:
bow = np.zeros((n_train_reviews, n_unique_word), dtype=np.float32)
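
A quick size estimate shows why the dense array is fine here but not for a corpus of 150,000 x 450,541, the sizes quoted in the commented-out SciPy cell. The helper `dense_bytes` is a hypothetical name used only for this arithmetic.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=4):
    """Size in bytes of a dense array; float32 uses 4 bytes per value."""
    return n_rows * n_cols * bytes_per_value


small = dense_bytes(774, 3329)          # this notebook's corpus
large = dense_bytes(150_000, 450_541)   # sizes quoted in the SciPy cell

print(f'{small / 2**20:.1f} MiB')   # 9.8 MiB
print(f'{large / 2**30:.1f} GiB')   # 251.8 GiB
```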

### Using the SciPy package

In [59]:
# import scipy.sparse as sps

In [60]:
# Create an empty bag of words of size (number of training reviews: 150,000) x (number of unique words: 450,541)
# bow_data = sps.lil_matrix((n_train_reviews, n_unique_word), dtype=np.int8)
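
The commented-out cell relies on a sparse matrix so that zeros are never allocated. The same idea can be sketched without SciPy by storing only the nonzero counts of each document. This is a stand-in that illustrates sparsity, not the `lil_matrix` API, and the names `sparse_bow` and `density` are hypothetical.

```python
from collections import Counter


def sparse_bow(indexed_docs):
    """One Counter per document: {token_index: count}, zeros are not stored."""
    return [Counter(doc) for doc in indexed_docs]


def density(rows, n_columns):
    """Fraction of cells in the equivalent dense matrix that are nonzero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * n_columns)


docs = [[0, 1, 2, 1], [2, 3]]       # toy indexed documents
rows = sparse_bow(docs)
print(rows[0][1])                   # 2
print(rows[1][0])                   # 0 (missing keys count as zero)
print(density(rows, n_columns=4))   # 0.625
```

A prayer of roughly 18 tokens touches at most 18 of the 3329 columns, so most cells of the dense matrix are zero.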

### Fill the bag of words

In [61]:
for i, tokens in enumerate(X):
    for token in tokens:
        # Count the words in the i-th text by adding 1 at each word's index.
        bow[i, token] += 1.0
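
The nested loop can also be written as one vectorized update. The sketch below assumes the same input shape as `X` (a list of lists of token indices) and uses `np.add.at`, which accumulates repeated indices correctly, whereas a plain `bow[rows, cols] += 1` would count a repeated token only once. The function name `fill_bow` is hypothetical.

```python
import numpy as np


def fill_bow(indexed_docs, n_unique_word):
    """Build a dense count matrix from lists of token indices."""
    bow = np.zeros((len(indexed_docs), n_unique_word), dtype=np.float32)
    rows = np.repeat(np.arange(len(indexed_docs)),
                     [len(doc) for doc in indexed_docs])
    cols = np.concatenate([np.asarray(doc, dtype=np.int64)
                           for doc in indexed_docs])
    # np.add.at is unbuffered, so a token repeated in one document is
    # counted every time it occurs.
    np.add.at(bow, (rows, cols), 1.0)
    return bow


toy = fill_bow([[0, 1, 1], [2]], n_unique_word=3)
print(toy.tolist())   # [[1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
```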

### Train / test split

In [62]:
bow_train, bow_test, y_train, y_test = train_test_split(bow, y, test_size=0.2, random_state=1212)
print(bow_train.shape, bow_test.shape, y_train.shape, y_test.shape)
print(y_train[:5])

(619, 3329) (155, 3329) (619,) (155,)
[3 2 3 1 0]
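
The sizes 619 and 155 are consistent with rounding the 20% test share of 774 rows up to a whole number. A minimal sketch of a seeded shuffle split follows; `shuffled_split` is a hypothetical helper, and its permutation will not match the one `train_test_split` draws for `random_state=1212`.

```python
import math
import random


def shuffled_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(774, test_size=0.2, seed=1212)
print(len(train_idx), len(test_idx))   # 619 155
```

The split above is not stratified; `train_test_split` accepts a `stratify` argument if the class proportions should be preserved in both parts.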


## Logistic Regression

In [63]:
from sklearn.linear_model import LogisticRegression

In [64]:
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

### Train

In [65]:
model.fit(bow_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

### Test

In [66]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [67]:
predict = model.predict(bow_test)
accuracy = accuracy_score(y_test, predict)

In [68]:
print('Accuracy : ',accuracy)
print(classification_report(y_test, predict))

Accuracy :  0.7870967741935484
             precision    recall  f1-score   support

          0       0.86      0.95      0.90        38
          1       0.84      0.76      0.80        42
          2       0.81      0.71      0.75        41
          3       0.64      0.74      0.68        34

avg / total       0.79      0.79      0.79       155
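
An accuracy of 0.7871 on 155 test rows corresponds to 122 correct predictions (122 / 155). The per-class columns of the report can be reproduced from raw predictions as sketched below. The helper `per_class_metrics` is hypothetical and the labels are toy values, not the notebook's data.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class from raw predictions."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [0, 0, 1, 1, 2, 2]   # toy labels, not the notebook's data
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = per_class_metrics(y_true, y_pred, label=1)
print(precision, recall, f1)
```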



## PyTorch

In [69]:
import torch

In [70]:
print(type(y_train), y_train[:5])

<class 'numpy.ndarray'> [3 2 3 1 0]


In [71]:
# dataset : bow_train, bow_test, y_train, y_test
bow_train, y_train, bow_test, y_test = map(
    torch.tensor, (bow_train, y_train, bow_test, y_test)
)

In [72]:
n, c = bow_train.shape
print(bow_train.shape)

torch.Size([619, 3329])


In [73]:
print(y_train.min(), y_train.max())
print(y_test.min(), y_test.max())

tensor(0) tensor(3)
tensor(0) tensor(3)


In [74]:
bs = 64  # batch size

xb = bow_train[0:bs]  # a mini-batch from x
yb = y_train[0:bs]
xv = bow_test[0:bs]
yv = y_test[0:bs]

print(xb[:5])
print(yb[:5])


tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.]])
tensor([3, 2, 3, 1, 0])


### Logistic regression with equations

In [75]:
import math

weights = torch.randn(3329, 4) / math.sqrt(3329)
weights.requires_grad_()
bias = torch.zeros(4, requires_grad=True)

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def model(xb):
    return log_softmax(xb @ weights + bias)

def nll(pred, gt):
    return -pred[range(gt.shape[0]), gt].mean()

def accuracy(out, y):
    preds = torch.argmax(out, dim=1)
    return (preds == y).float().mean()

loss_func = nll

print(loss_func(model(xb), yb), accuracy(model(xb), yb))
print(loss_func(model(xv), yv), accuracy(model(xv), yv))


tensor(1.4015, grad_fn=<NegBackward>) tensor(0.0938)
tensor(1.3805, grad_fn=<NegBackward>) tensor(0.2969)
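
The initial losses of about 1.40 and 1.38 sit close to ln 4 ≈ 1.386, which is the loss of a model that spreads its probability evenly over four classes. The sketch below checks that value in plain Python and also shows a max-shifted variant of log-softmax. The shift is an addition for illustration: the notebook's `log_softmax` exponentiates directly, which works for these small logits but would overflow for large ones.

```python
import math


def log_softmax(logits):
    """Log-softmax of one row, shifted by the max to avoid overflow."""
    shift = max(logits)
    log_sum = math.log(sum(math.exp(v - shift) for v in logits))
    return [v - shift - log_sum for v in logits]


# With equal logits every class gets probability 1/4, so the loss is ln 4.
uniform_loss = -log_softmax([0.0, 0.0, 0.0, 0.0])[0]
print(uniform_loss, math.log(4))

# The unshifted form would overflow in exp(1000); the shifted form does not.
print(log_softmax([1000.0, 0.0]))   # [0.0, -1000.0]
```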


In [76]:
from IPython.core.debugger import set_trace

lr = 0.5  # learning rate
epochs = 30  # how many epochs to train for

for epoch in range(epochs):
    ## Feed forward
    pred = model(bow_train)
    loss = loss_func(pred, y_train)

    ## Backpropagation
    loss.backward()

    with torch.no_grad():
        weights -= weights.grad * lr
        bias -= bias.grad * lr
        weights.grad.zero_()
        bias.grad.zero_()
    
    print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))

Epoch: 0 tensor(1.3504, grad_fn=<NegBackward>) tensor(0.4475) tensor(1.3518, grad_fn=<NegBackward>) tensor(0.4774)
Epoch: 1 tensor(1.3148, grad_fn=<NegBackward>) tensor(0.5784) tensor(1.3234, grad_fn=<NegBackward>) tensor(0.5419)
Epoch: 2 tensor(1.2817, grad_fn=<NegBackward>) tensor(0.6672) tensor(1.2965, grad_fn=<NegBackward>) tensor(0.6065)
Epoch: 3 tensor(1.2506, grad_fn=<NegBackward>) tensor(0.7270) tensor(1.2711, grad_fn=<NegBackward>) tensor(0.6194)
Epoch: 4 tensor(1.2213, grad_fn=<NegBackward>) tensor(0.7431) tensor(1.2471, grad_fn=<NegBackward>) tensor(0.6000)
Epoch: 5 tensor(1.1936, grad_fn=<NegBackward>) tensor(0.7431) tensor(1.2243, grad_fn=<NegBackward>) tensor(0.6258)
Epoch: 6 tensor(1.1675, grad_fn=<NegBackward>) tensor(0.7512) tensor(1.2028, grad_fn=<NegBackward>) tensor(0.6258)
Epoch: 7 tensor(1.1427, grad_fn=<NegBackward>) tensor(0.7641) tensor(1.1824, grad_fn=<NegBackward>) tensor(0.6258)
Epoch: 8 tensor(1.1192, grad_fn=<NegBackward>) tensor(0.7658) tensor(1.1632, gra
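
The loop above lets autograd compute the gradients. For softmax regression they also have a closed form: the gradient of the mean negative log-likelihood with respect to the logits is `(probs - one_hot) / n`. The NumPy sketch below applies that formula with the same learning rate and epoch count on randomly generated toy data; all array names and the data are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(probs, y):
    return -np.log(probs[np.arange(len(y)), y]).mean()


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6)).astype(np.float64)   # toy BoW counts
y = rng.integers(0, 4, size=20)                             # toy labels
W = np.zeros((6, 4))
b = np.zeros(4)
lr = 0.5

losses = []
for _ in range(30):
    probs = softmax(X @ W + b)
    losses.append(nll(probs, y))
    # Gradient of the mean NLL w.r.t. the logits is (probs - one_hot) / n.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    W -= lr * (X.T @ grad_logits)
    b -= lr * grad_logits.sum(axis=0)

print(losses[0], losses[-1])
```

With zero-initialized weights the first loss equals ln 4, and the later losses should be lower as the parameters fit the toy labels.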

In [77]:
xb

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.]])

In [78]:
yb

tensor([3, 2, 3, 1, 0, 2, 2, 0, 1, 3, 0, 2, 2, 1, 3, 1, 0, 1, 0, 3, 2, 0, 1, 1,
        2, 1, 0, 3, 1, 1, 0, 2, 0, 1, 1, 0, 3, 3, 1, 1, 3, 2, 2, 3, 0, 1, 2, 1,
        3, 0, 1, 2, 0, 2, 3, 0, 3, 3, 2, 3, 1, 1, 1, 3])

## Train with PyTorch CrossEntropy

In [79]:
from torch import nn
import torch.nn.functional as F

loss_func = F.cross_entropy

class Logistic_Regression(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(3329, 4) / math.sqrt(3329))
        self.bias = nn.Parameter(torch.zeros(4))

    def forward(self, xb):
        return xb @ self.weights + self.bias

model = Logistic_Regression()

with torch.no_grad(): # Inspect the model's parameters without tracking gradients
    for p in model.parameters(): 
        print(p)
    print(model(xb).shape)
    print(yb.shape)
    
    print(loss_func(model(xb), yb).item(), accuracy(model(xb), yb).item())

Parameter containing:
tensor([[ 0.0009,  0.0155, -0.0365, -0.0214],
        [ 0.0047, -0.0089, -0.0168,  0.0142],
        [-0.0022, -0.0303, -0.0123, -0.0138],
        ...,
        [-0.0124,  0.0385, -0.0274,  0.0409],
        [ 0.0161, -0.0038,  0.0064,  0.0242],
        [ 0.0248, -0.0109,  0.0333, -0.0075]], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
torch.Size([64, 4])
torch.Size([64])
1.3926472663879395 0.171875


In [80]:
loss = loss_func(model(xb), yb)
loss.backward()

In [81]:
loss_func

<function torch.nn.functional.cross_entropy>

In [82]:
def fit():
    for epoch in range(epochs):
        pred = model(bow_train)
        loss = loss_func(pred, y_train)

        #back propagation
        loss.backward()

        # update weights
        with torch.no_grad():
            for p in model.parameters():
                 p -= p.grad * lr
            model.zero_grad()
        
        print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))

fit()

Epoch: 0 tensor(1.3184, grad_fn=<NllLossBackward>) tensor(0.4814) tensor(1.3244, grad_fn=<NllLossBackward>) tensor(0.4452)
Epoch: 1 tensor(1.2842, grad_fn=<NllLossBackward>) tensor(0.6026) tensor(1.2961, grad_fn=<NllLossBackward>) tensor(0.5226)
Epoch: 2 tensor(1.2529, grad_fn=<NllLossBackward>) tensor(0.6494) tensor(1.2701, grad_fn=<NllLossBackward>) tensor(0.6065)
Epoch: 3 tensor(1.2237, grad_fn=<NllLossBackward>) tensor(0.6850) tensor(1.2460, grad_fn=<NllLossBackward>) tensor(0.6323)
Epoch: 4 tensor(1.1962, grad_fn=<NllLossBackward>) tensor(0.7108) tensor(1.2233, grad_fn=<NllLossBackward>) tensor(0.6387)
Epoch: 5 tensor(1.1702, grad_fn=<NllLossBackward>) tensor(0.7270) tensor(1.2018, grad_fn=<NllLossBackward>) tensor(0.6645)
Epoch: 6 tensor(1.1455, grad_fn=<NllLossBackward>) tensor(0.7447) tensor(1.1815, grad_fn=<NllLossBackward>) tensor(0.6774)
Epoch: 7 tensor(1.1221, grad_fn=<NllLossBackward>) tensor(0.7464) tensor(1.1623, grad_fn=<NllLossBackward>) tensor(0.6710)
Epoch: 8 tensor(

## Train with PyTorch Layers & optimizer

In [83]:
from torch import nn, optim
import torch.nn.functional as F

class Perceptron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3329,4)
        self.relu = nn.ReLU() # instead of Heaviside step fn
        
    def forward(self, x):
        output = self.fc(x)
        output = self.relu(output) # instead of Heaviside step fn
        return output

model = Perceptron()
for param in model.parameters():
    print(param)
#     param.requires_grad_(True)

Parameter containing:
tensor([[-0.0010, -0.0089, -0.0043,  ...,  0.0163, -0.0024, -0.0043],
        [-0.0060,  0.0144,  0.0026,  ..., -0.0031,  0.0007,  0.0127],
        [ 0.0168, -0.0105,  0.0068,  ..., -0.0145, -0.0073,  0.0084],
        [-0.0171, -0.0101,  0.0166,  ...,  0.0014,  0.0117,  0.0157]],
       requires_grad=True)
Parameter containing:
tensor([-0.0127, -0.0072, -0.0150,  0.0161], requires_grad=True)
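
`nn.CrossEntropyLoss` applies log-softmax to whatever the model returns, so it expects raw scores. Passing the scores through ReLU first clamps every negative score to zero, which erases the ranking among them. The plain-Python sketch below uses toy scores to show the effect; whether this slows training on this dataset is not tested here.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(values):
    return [max(0.0, v) for v in values]


logits = [-2.0, -0.5, -3.0, -1.0]   # toy linear-layer outputs
raw_probs = softmax(logits)
clamped_probs = softmax(relu(logits))
print(raw_probs)        # class 1 has the highest probability
print(clamped_probs)    # [0.25, 0.25, 0.25, 0.25]
```

A common alternative is to return the output of the linear layer unchanged and let the loss function handle the normalization.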


In [84]:
# criterion = nn.BCELoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr = lr)

model.train()

for epoch in range(epochs):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(bow_train)
    
    # Compute Loss
    loss = criterion(y_pred, y_train)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    # Log
    val_loss = criterion(model(bow_test), y_test)
    print("Epoch:", epoch, criterion(model(bow_train), y_train), accuracy(model(bow_train), y_train), criterion(model(bow_test), y_test), accuracy(model(bow_test), y_test))


Epoch: 0 tensor(1.3765, grad_fn=<NllLossBackward>) tensor(0.3554) tensor(1.3783, grad_fn=<NllLossBackward>) tensor(0.3548)
Epoch: 1 tensor(1.3618, grad_fn=<NllLossBackward>) tensor(0.4346) tensor(1.3655, grad_fn=<NllLossBackward>) tensor(0.4452)
Epoch: 2 tensor(1.3440, grad_fn=<NllLossBackward>) tensor(0.4798) tensor(1.3478, grad_fn=<NllLossBackward>) tensor(0.5097)
Epoch: 3 tensor(1.3212, grad_fn=<NllLossBackward>) tensor(0.5816) tensor(1.3274, grad_fn=<NllLossBackward>) tensor(0.5742)
Epoch: 4 tensor(1.2935, grad_fn=<NllLossBackward>) tensor(0.6365) tensor(1.3028, grad_fn=<NllLossBackward>) tensor(0.5871)
Epoch: 5 tensor(1.2653, grad_fn=<NllLossBackward>) tensor(0.6866) tensor(1.2786, grad_fn=<NllLossBackward>) tensor(0.6258)
Epoch: 6 tensor(1.2375, grad_fn=<NllLossBackward>) tensor(0.7141) tensor(1.2565, grad_fn=<NllLossBackward>) tensor(0.6387)
Epoch: 7 tensor(1.2111, grad_fn=<NllLossBackward>) tensor(0.7383) tensor(1.2346, grad_fn=<NllLossBackward>) tensor(0.6516)
Epoch: 8 tensor(

## Train with PyTorch MLP

In [88]:
from torch import nn, optim
import torch.nn.functional as F

class MultiLayerPerceptron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc_hidden = nn.Linear(3329,2000)
        self.relu = nn.ReLU() # instead of Heaviside step fn
        self.fc_output = nn.Linear(2000, 4)

    def forward(self, x):
        output = self.fc_hidden(x)
        output = self.relu(output) # instead of Heaviside step fn
        output = self.fc_output(output)
        return output
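
Counting parameters puts the hidden layer in perspective: the training set has 619 rows, while the two-layer network has several million weights. The helper `linear_params` is a hypothetical name for the usual weights-plus-biases count of a fully connected layer.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


single_layer = linear_params(3329, 4)
mlp = linear_params(3329, 2000) + linear_params(2000, 4)

print(single_layer)   # 13320
print(mlp)            # 6668004
```

With this many parameters per training example, a gap between training and test accuracy would not be surprising, so the test metrics are the ones to watch.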

In [89]:

model = MultiLayerPerceptron()
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 1.2820e-02, -1.0636e-02,  4.4281e-03,  ..., -4.3550e-03,
         -1.4767e-02,  4.0531e-03],
        [-7.6278e-03, -8.9340e-03, -1.0001e-02,  ..., -1.4296e-02,
         -1.1692e-02,  3.9751e-03],
        [ 9.7446e-03, -5.8376e-03, -1.5760e-02,  ...,  5.7479e-04,
          9.7562e-03,  1.4891e-02],
        ...,
        [-2.2288e-03,  1.6520e-02, -1.3872e-02,  ..., -1.4716e-02,
         -1.2840e-02, -1.4484e-02],
        [ 1.5840e-02,  4.9464e-03,  6.3476e-03,  ...,  1.5330e-02,
          9.9402e-04,  1.0017e-02],
        [-8.8604e-03,  7.4830e-03,  7.2563e-05,  ...,  9.7049e-03,
          1.0375e-03, -5.7526e-04]], requires_grad=True)
Parameter containing:
tensor([ 0.0048,  0.0141, -0.0101,  ...,  0.0034,  0.0037,  0.0099],
       requires_grad=True)
Parameter containing:
tensor([[-0.0139, -0.0140, -0.0155,  ..., -0.0137,  0.0066, -0.0025],
        [-0.0038,  0.0140,  0.0033,  ..., -0.0083,  0.0068, -0.0052],
        [-0.0169, -0.0215, -0.0017,  ..., -0.0

In [91]:
# criterion = nn.BCELoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr = lr)
loss_func = criterion

model.train()

for epoch in range(epochs):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(bow_train)
    
    # Compute Loss
    loss = criterion(y_pred, y_train)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    # Log
    val_loss = criterion(model(bow_test), y_test)
    print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))


Epoch: 0 tensor(0.8654, grad_fn=<NllLossBackward>) tensor(0.8368) tensor(0.9679, grad_fn=<NllLossBackward>) tensor(0.7548)
Epoch: 1 tensor(0.8349, grad_fn=<NllLossBackward>) tensor(0.8417) tensor(0.9433, grad_fn=<NllLossBackward>) tensor(0.7548)
Epoch: 2 tensor(0.8050, grad_fn=<NllLossBackward>) tensor(0.8449) tensor(0.9194, grad_fn=<NllLossBackward>) tensor(0.7613)
Epoch: 3 tensor(0.7757, grad_fn=<NllLossBackward>) tensor(0.8514) tensor(0.8963, grad_fn=<NllLossBackward>) tensor(0.7613)
Epoch: 4 tensor(0.7472, grad_fn=<NllLossBackward>) tensor(0.8562) tensor(0.8741, grad_fn=<NllLossBackward>) tensor(0.7677)
Epoch: 5 tensor(0.7196, grad_fn=<NllLossBackward>) tensor(0.8627) tensor(0.8530, grad_fn=<NllLossBackward>) tensor(0.7742)
Epoch: 6 tensor(0.6928, grad_fn=<NllLossBackward>) tensor(0.8691) tensor(0.8328, grad_fn=<NllLossBackward>) tensor(0.7742)
Epoch: 7 tensor(0.6669, grad_fn=<NllLossBackward>) tensor(0.8740) tensor(0.8137, grad_fn=<NllLossBackward>) tensor(0.7742)
Epoch: 8 tensor(