# The data used here is the Prayer of the Faithful (보편지향 기도) dataset created by Brother 김도휘 and Brother 김명찬.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Read the prayer texts from the CSV file
def read_data(path_to_file):
    df = pd.read_csv(path_to_file, dtype=str)
    return df

df = read_data('../../data/pray456_v3.csv')

In [5]:
df.to_csv('../../data/pray456_v3withid.csv')

In [6]:
X = df['content']
y = df['label']
print(len(X))
print(len(y))

774
774


In [7]:
X[0]

'주님, 대림시기를 맞는 교회가 회개와 화해의 생활을 하며 저희에게  오실 아기 예수님을 기쁜 마음으로 맞이할 수 있도록 도와주소서.'

In [8]:
y_quiz = df['content'].sample(50)
y_quiz.shape

(50,)

In [9]:
y_quiz.sort_index().to_csv('../../data/quiz_pray1_sample50.csv')

## Encode the y labels as integers (one-hot encoding left disabled)

In [10]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

print(type(y[0]), y[:5])

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(y)
print(integer_encoded[:5])

# # one_hot encode (disabled: the models below take integer class labels)
# onehot_encoder = OneHotEncoder(sparse=False)
# integer_encoded_2d = integer_encoded.reshape(len(integer_encoded), 1)  # OneHotEncoder expects a 2D array
# onehot_encoded = onehot_encoder.fit_transform(integer_encoded_2d)
# print(onehot_encoded[:5])

# setup y 
# y = onehot_encoded
y = integer_encoded
print(y[:5])

<class 'str'> 0    1
1    2
2    3
3    4
4    1
Name: label, dtype: object
[0 1 2 3 0]
[0 1 2 3 0]
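
The mapping printed above (string labels '1' to '4' become integers 0 to 3) can be imitated in plain Python to make the encoding explicit. This is a minimal sketch, not scikit-learn's implementation; `encode_labels` and the sample labels are hypothetical. Labels are sorted as strings, so a label such as '10' would sort before '2'.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return [to_index[label] for label in labels], classes

# Hypothetical sample mirroring the first five labels printed above.
sample_labels = ['1', '2', '3', '4', '1']
encoded, classes = encode_labels(sample_labels)
decoded = [classes[i] for i in encoded]  # inverse mapping back to the strings
print(encoded)
print(decoded)
```

Keeping `classes` around is what allows predictions to be turned back into the original label strings later.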


## Tokenize on whitespace

In [11]:
X = [x.split() for x in X]

In [12]:
X[0]

['주님,',
 '대림시기를',
 '맞는',
 '교회가',
 '회개와',
 '화해의',
 '생활을',
 '하며',
 '저희에게',
 '오실',
 '아기',
 '예수님을',
 '기쁜',
 '마음으로',
 '맞이할',
 '수',
 '있도록',
 '도와주소서.']
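
Whitespace splitting leaves punctuation attached, so '주님,' and '주님' would receive different indices. The sketch below is an optional variation, not what this notebook does: it strips ASCII punctuation from both ends of each token. `tokenize` and the sample sentence are hypothetical, and `string.punctuation` covers ASCII characters only.

```python
import string

def tokenize(sentence):
    """Split on whitespace, then strip ASCII punctuation from both ends of each token."""
    tokens = (tok.strip(string.punctuation) for tok in sentence.split())
    return [tok for tok in tokens if tok]  # drop tokens that were only punctuation

tokens = tokenize('주님, 교회가 도와주소서.')
print(tokens)
```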

## Index unique tokens

In [13]:
from collections import defaultdict

In [14]:
# Dictionary that assigns a unique index to each word
token_to_index = defaultdict(lambda : len(token_to_index))

In [15]:
# Generator that converts each token to its unique index
def convert_token_to_idx(token_ls):
    for tokens in token_ls:
        yield [token_to_index[token] for token in tokens]

In [16]:
X = list(convert_token_to_idx(X))

In [17]:
# Once tokens are converted to indices it is hard to tell which word each one was,
# so build a reverse dictionary that maps each index back to its original word
index_to_token = {val : key for key,val in token_to_index.items()}

#### Check the indexing result

In [18]:
import operator

In [19]:
for k,v in sorted(token_to_index.items(), key=operator.itemgetter(1))[:5]:
    print(k, v)

주님, 0
대림시기를 1
맞는 2
교회가 3
회개와 4


In [20]:
X[0]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
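
The `defaultdict` indexing and its inverse can be checked on a toy example. This is a minimal sketch with hypothetical names (`vocab`, `docs`, `frozen`); it also shows one caveat: a `defaultdict` silently adds any unseen token on lookup, so converting it to a plain `dict` after indexing is one way to prevent accidental vocabulary growth.

```python
from collections import defaultdict

# Each unseen token gets the next free integer: the current size of the dict.
vocab = defaultdict(lambda: len(vocab))

docs = [['a', 'b', 'a'], ['b', 'c']]  # hypothetical toy documents
encoded = [[vocab[tok] for tok in doc] for doc in docs]

# Invert the mapping to decode indices back to tokens.
inverse = {idx: tok for tok, idx in vocab.items()}
decoded = [[inverse[i] for i in doc] for doc in encoded]

# A plain dict raises KeyError on unseen tokens instead of adding them.
frozen = dict(vocab)

print(encoded)
print(decoded == docs)
print('z' in frozen)
```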

### Create an empty Bag of Words

In [21]:
n_train_reviews = len(X)       # total number of documents (all 774 prayers, before the train/test split)
n_unique_word = len(token_to_index)  # number of unique words (the dimensionality of the BoW)

In [22]:
n_unique_word

3329

### Dense numpy arrays can cause a memory error on large corpora

In [23]:
import numpy as np

In [24]:
bow = np.zeros((n_train_reviews, n_unique_word), dtype=np.float32)

### Using the SciPy package

In [25]:
# import scipy.sparse as sps

In [26]:
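# Create an empty bag of words of size (number of training reviews: 150,000) x (number of unique words: 450,541)
# (these are review-corpus figures; this dataset is 774 x 3329)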
# bow_data = sps.lil_matrix((n_train_reviews, n_unique_word), dtype=np.int8)
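
The commented-out SciPy route can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's pipeline: `docs`, `n_docs`, `n_words`, `bow_sparse`, and `bow_csr` are hypothetical names, and the dtype is chosen to match the dense array used above.

```python
import numpy as np
import scipy.sparse as sps

docs = [[0, 1, 1], [2]]   # hypothetical token-index lists
n_docs, n_words = len(docs), 3

# LIL format supports cheap incremental writes while building the matrix.
bow_sparse = sps.lil_matrix((n_docs, n_words), dtype=np.float32)
for i, tokens in enumerate(docs):
    for token in tokens:
        bow_sparse[i, token] += 1.0

bow_csr = bow_sparse.tocsr()  # CSR is the usual format for downstream arithmetic
print(bow_csr.toarray())
```

Only the non-zero counts are stored, which is why this approach scales to vocabularies where a dense array would not fit in memory.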

### Fill the bag of words

In [27]:
for i, tokens in enumerate(X):
    for token in tokens:
        # Count the words in the i-th document by adding 1 at each word's index.
        bow[i, token] += 1.0
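
What one row of the bag of words contains can be shown with `collections.Counter`. This is a small sketch with a hypothetical helper `bow_row`; it produces the same counts as the loop above for a single document.

```python
from collections import Counter

def bow_row(token_ids, n_words):
    """Return a dense count vector of length n_words for one document."""
    counts = Counter(token_ids)
    return [counts.get(j, 0) for j in range(n_words)]

row = bow_row([0, 2, 2, 3], n_words=5)
print(row)
```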

### Train / test split

In [28]:
bow_train, bow_test, y_train, y_test = train_test_split(bow, y, test_size=0.2, random_state=1212)
print(bow_train.shape, bow_test.shape, y_train.shape, y_test.shape)
print(y_train[:5])

(619, 3329) (155, 3329) (619,) (155,)
[3 2 3 1 0]


## Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

### Train

In [31]:
model.fit(bow_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

### Test

In [32]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [33]:
predict = model.predict(bow_test)
accuracy = accuracy_score(y_test, predict)

In [34]:
print('Accuracy : ',accuracy)
print(classification_report(y_test, predict))

Accuracy :  0.7870967741935484
             precision    recall  f1-score   support

          0       0.86      0.95      0.90        38
          1       0.84      0.76      0.80        42
          2       0.81      0.71      0.75        41
          3       0.64      0.74      0.68        34

avg / total       0.79      0.79      0.79       155
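
The per-class columns in the report can be reproduced by hand. The sketch below computes precision and recall for one class from raw label lists; `precision_recall` and the toy labels are hypothetical and are not the notebook's predictions.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall for a single class, computed from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)   # TP + FP
    actual = sum(1 for t in y_true if t == cls)      # TP + FN (the "support")
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical toy labels.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
p, r = precision_recall(y_true_toy, y_pred_toy, cls=1)
print(p, r)
```

Here class 1 is predicted three times and two of those are correct, so precision is 2/3, while both true class-1 items are found, so recall is 1.0.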



## PyTorch

In [35]:
import torch

In [36]:
# dataset : bow_train, bow_test, y_train, y_test
bow_train, y_train, bow_test, y_test = map(
    torch.tensor, (bow_train, y_train, bow_test, y_test)
)

In [37]:
n, c = bow_train.shape
print(bow_train.shape)

torch.Size([619, 3329])


In [38]:
print(y_train.min(), y_train.max())
print(y_test.min(), y_test.max())

tensor(0) tensor(3)
tensor(0) tensor(3)


In [39]:
bs = 64  # batch size

xb = bow_train[0:bs]  # a mini-batch from x
yb = y_train[0:bs]
xv = bow_test[0:bs]
yv = y_test[0:bs]

print(xb[:5])
print(yb[:5])


tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.]])
tensor([3, 2, 3, 1, 0])


### Logistic regression with equations

In [40]:
import math

weights = torch.randn(3329, 4) / math.sqrt(3329)
weights.requires_grad_()
bias = torch.zeros(4, requires_grad=True)

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def model(xb):
    return log_softmax(xb @ weights + bias)

def nll(pred, gt):
    return -pred[range(gt.shape[0]), gt].mean()

def accuracy(out, y):
    preds = torch.argmax(out, dim=1)
    return (preds == y).float().mean()

loss_func = nll

print(loss_func(model(xb), yb), accuracy(model(xb), yb))
print(loss_func(model(xv), yv), accuracy(model(xv), yv))


tensor(1.3930, grad_fn=<NegBackward>) tensor(0.2344)
tensor(1.3910, grad_fn=<NegBackward>) tensor(0.2031)
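
The initial losses above sit close to ln 4 ≈ 1.386, which is what a 4-class model gives when its predictions are near uniform. The sketch below checks that in NumPy (not torch, for illustration only) and shows a numerically stable variant of log-softmax: the hand-written version above exponentiates raw scores, which can overflow for large values. `log_softmax_stable` and the toy arrays are hypothetical names.

```python
import numpy as np

def log_softmax_stable(x):
    """Log-softmax along the last axis using the log-sum-exp trick."""
    shifted = x - x.max(axis=-1, keepdims=True)  # largest entry becomes 0
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def nll(log_probs, targets):
    """Mean negative log-likelihood of the target classes."""
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.zeros((3, 4))            # uniform predictions over 4 classes
targets = np.array([0, 3, 1])
uniform_loss = nll(log_softmax_stable(logits), targets)
print(uniform_loss, np.log(4))

big = np.array([[1000.0, 0.0, 0.0, 0.0]])   # exp(1000) would overflow naively
stable_out = log_softmax_stable(big)
print(stable_out)
```

Subtracting the row maximum does not change the result mathematically, because the same constant cancels between the score and the log-sum-exp term.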


In [41]:
from IPython.core.debugger import set_trace

lr = 0.5  # learning rate
epochs = 30  # how many epochs to train for

for epoch in range(epochs):
    ## Feed forward
    pred = model(bow_train)
    loss = loss_func(pred, y_train)

    ## Backpropagation
    loss.backward()

    with torch.no_grad():
        weights -= weights.grad * lr
        bias -= bias.grad * lr
        weights.grad.zero_()
        bias.grad.zero_()
    
    print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))

Epoch: 0 tensor(1.3524, grad_fn=<NegBackward>) tensor(0.4233) tensor(1.3595, grad_fn=<NegBackward>) tensor(0.4129)
Epoch: 1 tensor(1.3172, grad_fn=<NegBackward>) tensor(0.5767) tensor(1.3302, grad_fn=<NegBackward>) tensor(0.5226)
Epoch: 2 tensor(1.2841, grad_fn=<NegBackward>) tensor(0.6478) tensor(1.3027, grad_fn=<NegBackward>) tensor(0.5871)
Epoch: 3 tensor(1.2530, grad_fn=<NegBackward>) tensor(0.6753) tensor(1.2769, grad_fn=<NegBackward>) tensor(0.6000)
Epoch: 4 tensor(1.2236, grad_fn=<NegBackward>) tensor(0.6979) tensor(1.2527, grad_fn=<NegBackward>) tensor(0.6000)
Epoch: 5 tensor(1.1959, grad_fn=<NegBackward>) tensor(0.7027) tensor(1.2298, grad_fn=<NegBackward>) tensor(0.5871)
Epoch: 6 tensor(1.1697, grad_fn=<NegBackward>) tensor(0.7205) tensor(1.2082, grad_fn=<NegBackward>) tensor(0.6000)
Epoch: 7 tensor(1.1449, grad_fn=<NegBackward>) tensor(0.7383) tensor(1.1878, grad_fn=<NegBackward>) tensor(0.6065)
Epoch: 8 tensor(1.1213, grad_fn=<NegBackward>) tensor(0.7544) tensor(1.1685, gra
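
The loop above trains on the full training set at every step; `bs = 64` is only used to slice `xb` and `yb` for inspection. A mini-batch variant would iterate over index ranges like the ones below. This is a sketch of the slicing arithmetic only; `batch_slices` is a hypothetical helper.

```python
def batch_slices(n, bs):
    """Yield (start, end) index pairs covering range(n) in chunks of size bs."""
    for i in range((n - 1) // bs + 1):
        start = i * bs
        yield start, min(start + bs, n)

# 619 training rows with batch size 64, as in this notebook.
slices = list(batch_slices(n=619, bs=64))
print(len(slices), slices[-1])
```

With 619 rows and a batch size of 64 this gives 10 batches, the last one holding the remaining 43 rows.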

In [42]:
xb

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.]])

In [43]:
yb

tensor([3, 2, 3, 1, 0, 2, 2, 0, 1, 3, 0, 2, 2, 1, 3, 1, 0, 1, 0, 3, 2, 0, 1, 1,
        2, 1, 0, 3, 1, 1, 0, 2, 0, 1, 1, 0, 3, 3, 1, 1, 3, 2, 2, 3, 0, 1, 2, 1,
        3, 0, 1, 2, 0, 2, 3, 0, 3, 3, 2, 3, 1, 1, 1, 3])

## Train with PyTorch CrossEntropy

In [44]:
from torch import nn
import torch.nn.functional as F

loss_func = F.cross_entropy

class Logistic_Regression(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(3329, 4) / math.sqrt(3329))
        self.bias = nn.Parameter(torch.zeros(4))

    def forward(self, xb):
        return xb @ self.weights + self.bias

model = Logistic_Regression()

with torch.no_grad(): # Inspect the model's parameters without tracking gradients
    for p in model.parameters(): 
        print(p)
    print(model(xb).shape)
    print(yb.shape)
    
    print(loss_func(model(xb), yb).item(), accuracy(model(xb), yb).item())

Parameter containing:
tensor([[-0.0202,  0.0377,  0.0199,  0.0299],
        [ 0.0017, -0.0119,  0.0091,  0.0166],
        [ 0.0142, -0.0208,  0.0059, -0.0066],
        ...,
        [ 0.0326, -0.0266, -0.0374, -0.0289],
        [ 0.0034,  0.0013,  0.0098,  0.0164],
        [ 0.0126, -0.0108,  0.0148,  0.0119]], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
torch.Size([64, 4])
torch.Size([64])
1.3815109729766846 0.234375


In [45]:
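loss = loss_func(model(xb), yb)
loss.backward()
# Note: the gradients from this call stay in each parameter's .grad and are added to
# the first backward pass inside fit() below unless model.zero_grad() is called first.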

In [46]:
loss_func

<function torch.nn.functional.cross_entropy>

In [47]:
def fit():
    for epoch in range(epochs):
        pred = model(bow_train)
        loss = loss_func(pred, y_train)

        #back propagation
        loss.backward()

        # update weights
        with torch.no_grad():
            for p in model.parameters():
                p -= p.grad * lr
            model.zero_grad()
        
        print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))

fit()

Epoch: 0 tensor(1.3116, grad_fn=<NllLossBackward>) tensor(0.4960) tensor(1.3298, grad_fn=<NllLossBackward>) tensor(0.4452)
Epoch: 1 tensor(1.2776, grad_fn=<NllLossBackward>) tensor(0.6171) tensor(1.3019, grad_fn=<NllLossBackward>) tensor(0.5484)
Epoch: 2 tensor(1.2466, grad_fn=<NllLossBackward>) tensor(0.6656) tensor(1.2764, grad_fn=<NllLossBackward>) tensor(0.6000)
Epoch: 3 tensor(1.2176, grad_fn=<NllLossBackward>) tensor(0.7076) tensor(1.2526, grad_fn=<NllLossBackward>) tensor(0.6387)
Epoch: 4 tensor(1.1903, grad_fn=<NllLossBackward>) tensor(0.7173) tensor(1.2301, grad_fn=<NllLossBackward>) tensor(0.6258)
Epoch: 5 tensor(1.1645, grad_fn=<NllLossBackward>) tensor(0.7318) tensor(1.2089, grad_fn=<NllLossBackward>) tensor(0.6581)
Epoch: 6 tensor(1.1401, grad_fn=<NllLossBackward>) tensor(0.7431) tensor(1.1888, grad_fn=<NllLossBackward>) tensor(0.6645)
Epoch: 7 tensor(1.1169, grad_fn=<NllLossBackward>) tensor(0.7496) tensor(1.1698, grad_fn=<NllLossBackward>) tensor(0.6645)
Epoch: 8 tensor(

## Train with PyTorch Layers & optimizer

In [56]:
from torch import nn, optim
import torch.nn.functional as F

class Perceptron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3329,4)
        self.relu = nn.ReLU() # instead of Heaviside step fn
        
    def forward(self, x):
        output = self.fc(x)
        output = self.relu(output) # instead of Heaviside step fn
        return output

model = Perceptron()
for param in model.parameters():
    print(param)
#     param.requires_grad_(True)

Parameter containing:
tensor([[-0.0008,  0.0078, -0.0139,  ...,  0.0017, -0.0140, -0.0072],
        [ 0.0157,  0.0065, -0.0055,  ..., -0.0070, -0.0044,  0.0069],
        [ 0.0026, -0.0089,  0.0142,  ...,  0.0109, -0.0079,  0.0038],
        [-0.0068, -0.0107,  0.0038,  ...,  0.0169, -0.0057, -0.0092]],
       requires_grad=True)
Parameter containing:
tensor([-0.0135,  0.0025, -0.0138, -0.0117], requires_grad=True)


In [59]:
# criterion = nn.BCELoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr = lr)

model.train()

for epoch in range(epochs):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(bow_train)
    
    # Compute Loss
    loss = criterion(y_pred, y_train)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    # Log
    val_loss = criterion(model(bow_test), y_test)
    print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))


Epoch: 0 tensor(0.6852, grad_fn=<NllLossBackward>) tensor(0.8821) tensor(0.8250, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 1 tensor(0.6787, grad_fn=<NllLossBackward>) tensor(0.8837) tensor(0.8205, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 2 tensor(0.6723, grad_fn=<NllLossBackward>) tensor(0.8853) tensor(0.8160, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 3 tensor(0.6661, grad_fn=<NllLossBackward>) tensor(0.8885) tensor(0.8117, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 4 tensor(0.6600, grad_fn=<NllLossBackward>) tensor(0.8901) tensor(0.8075, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 5 tensor(0.6540, grad_fn=<NllLossBackward>) tensor(0.8918) tensor(0.8035, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 6 tensor(0.6481, grad_fn=<NllLossBackward>) tensor(0.8918) tensor(0.7995, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 7 tensor(0.6424, grad_fn=<NllLossBackward>) tensor(0.8934) tensor(0.7956, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 8 tensor(
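
The logging line in the loop above runs the model four more times with gradient tracking on, which is why the printed losses carry a `grad_fn`. One common alternative is to evaluate under `torch.no_grad()` and report plain floats. The sketch below uses hypothetical synthetic data and names (`X_toy`, `y_toy`, `toy_model`, `evaluate`), not the prayer dataset.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Hypothetical synthetic data: 40 samples, 5 features, 2 classes.
X_toy = torch.randn(40, 5)
y_toy = (X_toy[:, 0] > 0).long()

toy_model = nn.Linear(5, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(toy_model.parameters(), lr=0.1)

def evaluate(model, X, y):
    """Return (loss, accuracy) as plain floats without building an autograd graph."""
    model.eval()
    with torch.no_grad():
        logits = model(X)
        loss = criterion(logits, y).item()
        acc = (logits.argmax(dim=1) == y).float().mean().item()
    return loss, acc

initial_loss, _ = evaluate(toy_model, X_toy, y_toy)
for epoch in range(30):
    toy_model.train()
    optimizer.zero_grad()
    loss = criterion(toy_model(X_toy), y_toy)
    loss.backward()
    optimizer.step()
final_loss, final_acc = evaluate(toy_model, X_toy, y_toy)
print(initial_loss, final_loss, final_acc)
```

Calling `.item()` turns each metric into a Python float, so the log stays readable and no computation graph is kept alive between epochs.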

## Train with PyTorch MLP

In [68]:
from torch import nn, optim
import torch.nn.functional as F

class MultiLayerPerceptron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc_hidden = nn.Linear(3329,2000)
        self.relu = nn.ReLU() # instead of Heaviside step fn
        self.fc_output = nn.Linear(2000, 4)

    def forward(self, x):
        output = self.fc_hidden(x)
        output = self.relu(output) # instead of Heaviside step fn
        output = self.fc_output(output)
        return output
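
For a sense of scale, the parameter count of this network follows directly from the layer sizes in the class above. The sketch below is plain arithmetic; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Parameter count of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out

hidden = linear_params(3329, 2000)   # fc_hidden
output = linear_params(2000, 4)      # fc_output
total = hidden + output
single_layer = linear_params(3329, 4)  # the one-layer model used earlier
print(hidden, output, total, single_layer)
```

That is 6,668,004 parameters for the MLP against 13,320 for the single linear layer, with 619 training rows available, so overfitting is worth watching in the train and test columns of the log.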

In [69]:

model = MultiLayerPerceptron()
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[-0.0097, -0.0005,  0.0110,  ..., -0.0024, -0.0136,  0.0062],
        [-0.0121, -0.0127, -0.0126,  ...,  0.0012, -0.0140,  0.0079],
        [ 0.0153,  0.0111,  0.0146,  ..., -0.0097,  0.0013,  0.0003],
        ...,
        [-0.0162,  0.0105, -0.0173,  ..., -0.0047,  0.0095,  0.0153],
        [ 0.0148, -0.0108, -0.0026,  ..., -0.0084,  0.0101, -0.0113],
        [ 0.0128, -0.0171,  0.0108,  ..., -0.0169, -0.0107, -0.0085]],
       requires_grad=True)
Parameter containing:
tensor([ 0.0004, -0.0110,  0.0043,  ..., -0.0001,  0.0060,  0.0130],
       requires_grad=True)
Parameter containing:
tensor([[-1.5774e-02,  1.8338e-02, -1.9357e-02,  ..., -1.1281e-02,
          2.1894e-02,  6.8926e-04],
        [-6.3298e-03,  1.7564e-02,  4.9387e-03,  ..., -3.7360e-03,
          5.1971e-03,  1.2944e-02],
        [-1.1440e-02,  2.1414e-02, -2.0968e-02,  ..., -1.6232e-02,
          3.0976e-03, -3.1110e-05],
        [ 1.9348e-02, -1.0044e-02, -1.7488e-02,  ..., -1.3856e-02,
 

In [None]:
# criterion = nn.BCELoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr = lr)

model.train()

for epoch in range(epochs):
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(bow_train)
    
    # Compute Loss
    loss = criterion(y_pred, y_train)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    # Log
    val_loss = criterion(model(bow_test), y_test)
    print("Epoch:", epoch, loss_func(model(bow_train), y_train), accuracy(model(bow_train), y_train), loss_func(model(bow_test), y_test), accuracy(model(bow_test), y_test))


Epoch: 0 tensor(0.7056, grad_fn=<NllLossBackward>) tensor(0.8659) tensor(0.8508, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 1 tensor(0.6793, grad_fn=<NllLossBackward>) tensor(0.8772) tensor(0.8313, grad_fn=<NllLossBackward>) tensor(0.7935)
Epoch: 2 tensor(0.6540, grad_fn=<NllLossBackward>) tensor(0.8805) tensor(0.8129, grad_fn=<NllLossBackward>) tensor(0.8000)
Epoch: 3 tensor(0.6297, grad_fn=<NllLossBackward>) tensor(0.8837) tensor(0.7954, grad_fn=<NllLossBackward>) tensor(0.8000)
Epoch: 4 tensor(0.6062, grad_fn=<NllLossBackward>) tensor(0.8918) tensor(0.7790, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 5 tensor(0.5837, grad_fn=<NllLossBackward>) tensor(0.8966) tensor(0.7636, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 6 tensor(0.5621, grad_fn=<NllLossBackward>) tensor(0.8982) tensor(0.7490, grad_fn=<NllLossBackward>) tensor(0.7806)
Epoch: 7 tensor(0.5413, grad_fn=<NllLossBackward>) tensor(0.9095) tensor(0.7354, grad_fn=<NllLossBackward>) tensor(0.7871)
Epoch: 8 tensor(