# [[Dacon] Monthly Dacon Novel Author Classification AI Competition](https://dacon.io/competitions/open/235670/overview/description)

**[References]**  
[[Code] Dacon basic baseline](https://dacon.io/competitions/open/235670/codeshare/1738?page=1&dtype=recent)  
[Preprocessing methods in NLP (Part 1)](https://developer-kelvin.tistory.com/13)  
[22. Natural language processing 1](https://codetorial.net/tensorflow/natural_language_processing_in_tensorflow_01.html)  
[07-08 A quick tour of Keras](https://wikidocs.net/32105)  
[09-01 Word Embedding](https://wikidocs.net/33520)  

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import warnings 
warnings.filterwarnings(action='ignore')
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import re

## Exploring the data

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/소설 작가 분류 AI 경진대회/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/소설 작가 분류 AI 경진대회/test_x.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/소설 작가 분류 AI 경진대회/sample_submission.csv')

In [4]:
train

| | index | text | author |
|---|---|---|---|
| 0 | 0 | He was almost choking. There was so much, so m... | 3 |
| 1 | 1 | “Your sister asked for it, I suppose?” | 2 |
| 2 | 2 | She was engaged one day as she walked, in per... | 1 |
| 3 | 3 | The captain was in the porch, keeping himself ... | 4 |
| 4 | 4 | “Have mercy, gentlemen!” odin flung up his han... | 3 |
| ... | ... | ... | ... |
| 54874 | 54874 | “Is that you, Mr. Smith?” odin whispered. “I h... | 2 |
| 54875 | 54875 | I told my plan to the captain, and between us ... | 4 |
| 54876 | 54876 | "Your sincere well-wisher, friend, and sister... | 1 |
| 54877 | 54877 | “Then you wanted me to lend you money?” | 3 |


In [5]:
test

| | index | text |
|---|---|---|
| 0 | 0 | “Not at all. I think she is one of the most ch... |
| 1 | 1 | "No," replied he, with sudden consciousness, "... |
| 2 | 2 | As the lady had stated her intention of scream... |
| 3 | 3 | “And then suddenly in the silence I heard a so... |
| 4 | 4 | His conviction remained unchanged. So far as I... |
| ... | ... | ... |
| 19612 | 19612 | At the end of another day or two, odin growing... |
| 19613 | 19613 | All afternoon we sat together, mostly in silen... |
| 19614 | 19614 | odin, having carried his thanks to odin, proc... |
| 19615 | 19615 | Soon after this, upon odin's leaving the room,... |


In [6]:
sample_submission

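| | index | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 19612 | 19612 | 0 | 0 | 0 | 0 | 0 |
| 19613 | 19613 | 0 | 0 | 0 | 0 | 0 |
| 19614 | 19614 | 0 | 0 | 0 | 0 | 0 |
| 19615 | 19615 | 0 | 0 | 0 | 0 | 0 |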


## Text preprocessing

In [7]:
# Remove punctuation (drop everything except ASCII letters, digits, and spaces)
def alpha_num(text):
  return re.sub(r'[^a-zA-Z0-9 ]', '', text) # re.sub(pattern, replacement, string)

train['text'] = train['text'].apply(alpha_num)
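
To make the effect of `alpha_num` concrete, here is a small self-contained check on two phrases taken from the rows shown above (rows 1 and 54876). The function is repeated so the snippet runs on its own.

```python
import re

def alpha_num(text):
    # Keep only ASCII letters, digits, and spaces; delete everything else.
    return re.sub(r'[^a-zA-Z0-9 ]', '', text)

samples = [
    "“Your sister asked for it, I suppose?”",
    "Your sincere well-wisher, friend, and sister",
]
cleaned = [alpha_num(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that deleted characters are not replaced by a space, so a hyphenated word such as "well-wisher" is fused into "wellwisher", as the cleaned table for row 54876 also shows.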

In [8]:
train

| | index | text | author |
|---|---|---|---|
| 0 | 0 | He was almost choking There was so much so muc... | 3 |
| 1 | 1 | Your sister asked for it I suppose | 2 |
| 2 | 2 | She was engaged one day as she walked in peru... | 1 |
| 3 | 3 | The captain was in the porch keeping himself c... | 4 |
| 4 | 4 | Have mercy gentlemen odin flung up his hands D... | 3 |
| ... | ... | ... | ... |
| 54874 | 54874 | Is that you Mr Smith odin whispered I hardly d... | 2 |
| 54875 | 54875 | I told my plan to the captain and between us w... | 4 |
| 54876 | 54876 | Your sincere wellwisher friend and sister LUC... | 1 |
| 54877 | 54877 | Then you wanted me to lend you money | 3 |


In [9]:
# Remove stopwords from a text
def remove_stopwords(text):
  final_text = []
  for i in text.split():
    if i.strip().lower() not in stopwords:
      final_text.append(i.strip())
  return " ".join(final_text)

# Stopword list
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", 
             "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", 
             "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", 
             "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", 
             "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", 
             "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", 
             "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", 
             "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", 
             "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", 
             "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", 
             "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

In [10]:
# Apply the preprocessing: lowercase, strip punctuation, then remove stopwords
train['text'] = train['text'].str.lower()
test['text'] = test['text'].str.lower()
train['text'] = train['text'].apply(alpha_num).apply(remove_stopwords)
test['text'] = test['text'].apply(alpha_num).apply(remove_stopwords)
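
One side effect of this ordering is worth checking. `alpha_num` deletes apostrophes before `remove_stopwords` runs, so a contraction such as "he'd" becomes "hed" and no longer matches the apostrophe forms in the stopword list. The sketch below compares both orders; it uses a tiny hypothetical stopword subset (`demo_stopwords`) and a made-up sentence, not the notebook's full list or data.

```python
import re

def alpha_num(text):
    # Keep only ASCII letters, digits, and spaces.
    return re.sub(r'[^a-zA-Z0-9 ]', '', text)

def remove_stopwords(text, stopwords):
    # Drop every whitespace-separated token found in the stopword set.
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

# Hypothetical subset of the stopword list, for illustration only.
demo_stopwords = {"he'd", "we've", "to", "the", "and", "it", "all"}

raw = "He'd gone to the porch and we've seen it all"
lowered = raw.lower()

# Order used in the notebook: strip punctuation first, then remove stopwords.
strip_first = remove_stopwords(alpha_num(lowered), demo_stopwords)

# Alternative order: remove stopwords first, then strip punctuation.
stopwords_first = alpha_num(remove_stopwords(lowered, demo_stopwords))

print(strip_first)
print(stopwords_first)
```

With the notebook's order the tokens "hed" and "weve" survive; with the alternative order they are removed. The notebook does not measure whether this affects accuracy, so the sketch only shows the mechanical difference.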

In [11]:
# Extract the texts and labels as NumPy arrays
X_train = np.array([x for x in train['text']])
X_test = np.array([x for x in test['text']])
y_train = np.array([x for x in train['author']])

## Modeling

In [12]:
# Hyperparameters
vocab_size = 20000
embedding_dim = 16
max_length = 500
padding_type = 'post' # pad at the end of each sequence; the default is 'pre'

In [13]:
# Fit the tokenizer on the training texts
tokenizer = Tokenizer(num_words = vocab_size)
tokenizer.fit_on_texts(X_train) # builds the vocabulary from the training texts, ranking words by frequency
word_index = tokenizer.word_index # dictionary mapping each word to its integer index
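
As a rough mental model of what `fit_on_texts` and `texts_to_sequences` do, the sketch below ranks words by frequency, numbers them from 1 (leaving 0 free for padding), and then maps texts to index lists. It is a simplified stand-in written with `collections.Counter`, not the Keras implementation; `build_word_index`, `to_sequences`, and the sample texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None or (num_words and idx >= num_words):
                continue
            seq.append(idx)
        sequences.append(seq)
    return sequences

demo_texts = ["captain porch captain", "porch captain sister"]
demo_index = build_word_index(demo_texts)
demo_seqs = to_sequences(["sister porch captain stranger"], demo_index, num_words=3)

print(demo_index)
print(demo_seqs)
```

In the demo, "stranger" is dropped because it was never seen during fitting, and "sister" is dropped because its index (3) is not below `num_words`. By the same rule, `num_words = 20000` keeps at most 19,999 distinct words.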

In [23]:
# Convert the texts to integer sequences, then pad them to a fixed length
train_sequences = tokenizer.texts_to_sequences(X_train) 
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)

test_sequences = tokenizer.texts_to_sequences(X_test) 
test_padded = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)
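
The padding behaviour configured here can be sketched in plain Python. `pad_sequences` pads with 0 and, unless `truncating` is set, cuts over-long sequences from the front, so with `padding='post'` a sequence longer than `maxlen` keeps its last tokens. The helper `pad_post_truncate_pre` below is a hypothetical stand-in for illustration, not the Keras function.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front when too long."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

demo_padded = pad_post_truncate_pre([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4)
print(demo_padded)
```

If the beginning of a long passage matters more than its end, `pad_sequences` also accepts `truncating='post'`, which the notebook does not use.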

In [26]:
# Build a lightweight NLP model
model = tf.keras.Sequential([ # Sequential: stacks the layers in order
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length), # vocabulary size, embedding vector dimension (size of each output vector), input sequence length
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'), # 24 output units
    tf.keras.layers.Dense(5, activation='softmax') # 5 output units, one per author
])
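
The layer sizes above determine the number of trainable weights. The short calculation below works them out by hand; the variable names are illustrative, and the results agree with the `Param #` column of the model summary in this notebook.

```python
vocab_size, embedding_dim = 20000, 16
hidden_units, num_classes = 24, 5

# Embedding: one 16-dimensional vector per token index.
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages over the sequence axis; no weights.
pooling_params = 0

# Dense layers: (inputs x outputs) weights plus one bias per output unit.
dense_hidden_params = embedding_dim * hidden_units + hidden_units
dense_output_params = hidden_units * num_classes + num_classes

total_params = (embedding_params + pooling_params
                + dense_hidden_params + dense_output_params)

print(embedding_params, dense_hidden_params, dense_output_params, total_params)
```

Almost all of the weights sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model size.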

In [27]:
# Compile the model (configure how it will be trained)
model.compile(loss='sparse_categorical_crossentropy', # use when the training labels y are integers; use categorical_crossentropy when they are one-hot vectors
              optimizer='adam',
              metrics=['accuracy'])

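# Model summary (model.summary() prints the table itself and returns None,
# so wrapping it in print() also prints a trailing "None")
print(model.summary())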

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 500, 16)           320000    
                                                                 
 global_average_pooling1d_1   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_2 (Dense)             (None, 24)                408       
                                                                 
 dense_3 (Dense)             (None, 5)                 125       
                                                                 
Total params: 320,533
Trainable params: 320,533
Non-trainable params: 0
_________________________________________________________________
None
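
For reference, `sparse_categorical_crossentropy` takes an integer label and a predicted probability vector and returns the negative log of the probability assigned to the true class. Below is a minimal pure-Python sketch; the helper name is hypothetical, the probability vectors are made up for illustration, and the small clipping Keras applies to avoid `log(0)` is not modelled.

```python
import math

def sparse_cce(label, probs):
    """Negative log-probability of the true class (single example)."""
    return -math.log(probs[label])

uniform = [0.2] * 5                        # no preference among 5 authors
confident = [0.01, 0.02, 0.02, 0.90, 0.05]  # strongly favours author 3

chance_loss = sparse_cce(3, uniform)    # true author is 3, model is guessing
good_loss = sparse_cce(3, confident)    # true author is 3, model agrees
bad_loss = sparse_cce(0, confident)     # true author is 0, model is wrong

print(round(chance_loss, 4), round(good_loss, 4), round(bad_loss, 4))
```

A model that guesses uniformly over five authors scores ln 5 ≈ 1.609, which is a useful baseline when reading the per-epoch loss values in the training log.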


In [28]:
# fit model
num_epochs = 20
history = model.fit(train_padded, y_train, epochs=num_epochs, verbose=2, validation_split = 0.2)

Epoch 1/20
1372/1372 - 9s - loss: 1.5669 - accuracy: 0.2750 - val_loss: 1.5526 - val_accuracy: 0.2700 - 9s/epoch - 6ms/step
Epoch 2/20
1372/1372 - 9s - loss: 1.4481 - accuracy: 0.3892 - val_loss: 1.3129 - val_accuracy: 0.4732 - 9s/epoch - 6ms/step
Epoch 3/20
1372/1372 - 8s - loss: 1.2078 - accuracy: 0.5152 - val_loss: 1.1519 - val_accuracy: 0.5250 - 8s/epoch - 6ms/step
Epoch 4/20
1372/1372 - 8s - loss: 1.0805 - accuracy: 0.5648 - val_loss: 1.0775 - val_accuracy: 0.5560 - 8s/epoch - 6ms/step
Epoch 5/20
1372/1372 - 9s - loss: 1.0027 - accuracy: 0.5960 - val_loss: 1.0359 - val_accuracy: 0.5867 - 9s/epoch - 7ms/step
Epoch 6/20
1372/1372 - 8s - loss: 0.9427 - accuracy: 0.6231 - val_loss: 1.0095 - val_accuracy: 0.5880 - 8s/epoch - 6ms/step
Epoch 7/20
1372/1372 - 8s - loss: 0.8919 - accuracy: 0.6477 - val_loss: 0.9603 - val_accuracy: 0.6284 - 8s/epoch - 6ms/step
Epoch 8/20
1372/1372 - 8s - loss: 0.8453 - accuracy: 0.6702 - val_loss: 0.9422 - val_accuracy: 0.6321 - 8s/epoch - 6ms/step
Epoch 9/
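
The `1372` at the start of each log line is the number of batches per epoch. It can be reproduced from the settings above under two assumptions: the training table has 54,878 rows (the last displayed index is 54877), and `fit` used the Keras default batch size of 32, since no `batch_size` was passed.

```python
import math

n_rows = 54878           # assumption: indices 0..54877 with no gaps
validation_split = 0.2
batch_size = 32          # Keras default when batch_size is not given

# validation_split holds out the last fraction of the data, before shuffling.
n_train = math.floor(n_rows * (1 - validation_split))
n_val = n_rows - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Because the held-out rows are simply the last 20% of the table, the validation scores are only representative if the rows are not ordered by author or by book, which this notebook does not check.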

In [38]:
# predict values
pred = model.predict(test_padded)



In [39]:
pred

array([[1.6974120e-03, 8.8672921e-02, 6.0592336e-03, 9.0290284e-01,
        6.6753320e-04],
       [1.8136653e-01, 4.3614256e-01, 8.9928947e-02, 2.0071972e-02,
        2.7248996e-01],
       [9.8510158e-01, 1.3356958e-02, 2.0308341e-06, 5.3054521e-09,
        1.5393761e-03],
       ...,
       [1.3515345e-03, 9.9863356e-01, 1.1369999e-08, 8.6366963e-06,
        6.2100744e-06],
       [9.6233306e-04, 9.9899566e-01, 5.0529696e-08, 6.9750008e-06,
        3.5036053e-05],
       [9.9746227e-01, 1.7245600e-04, 1.4108724e-04, 1.5038708e-08,
        2.2241410e-03]], dtype=float32)
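
Each row of `pred` is a probability distribution over the five authors. To read off the most likely author per row, take the position of the largest value. The sketch below does this in plain Python for the first three rows printed above; `pred_head` is just a hand-copied excerpt of that output.

```python
# First three rows of `pred`, copied from the output above.
pred_head = [
    [1.6974120e-03, 8.8672921e-02, 6.0592336e-03, 9.0290284e-01, 6.6753320e-04],
    [1.8136653e-01, 4.3614256e-01, 8.9928947e-02, 2.0071972e-02, 2.7248996e-01],
    [9.8510158e-01, 1.3356958e-02, 2.0308341e-06, 5.3054521e-09, 1.5393761e-03],
]

# Index of the largest probability in each row (the argmax).
predicted_authors = [max(range(len(row)), key=row.__getitem__) for row in pred_head]

# Softmax outputs should sum to 1 up to float32 rounding.
row_sums = [sum(row) for row in pred_head]

print(predicted_authors)
print([round(s, 6) for s in row_sums])
```

On the full array the same result comes from `pred.argmax(axis=1)`. The notebook submits the probabilities themselves rather than these hard labels, so this step is only for inspection.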

In [41]:
# submission
sample_submission[['0','1','2','3','4']] = pred
sample_submission

| | index | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.001697 | 8.867292e-02 | 6.059234e-03 | 9.029028e-01 | 6.675332e-04 |
| 1 | 1 | 0.181367 | 4.361426e-01 | 8.992895e-02 | 2.007197e-02 | 2.724900e-01 |
| 2 | 2 | 0.985102 | 1.335696e-02 | 2.030834e-06 | 5.305452e-09 | 1.539376e-03 |
| 3 | 3 | 0.000027 | 1.280081e-12 | 9.987916e-01 | 5.478291e-10 | 1.181176e-03 |
| 4 | 4 | 0.769114 | 4.397665e-02 | 1.230180e-01 | 5.023449e-02 | 1.365710e-02 |
| ... | ... | ... | ... | ... | ... | ... |
| 19612 | 19612 | 0.000023 | 9.999765e-01 | 4.830500e-16 | 2.251747e-10 | 1.186422e-11 |
| 19613 | 19613 | 0.004875 | 1.547373e-07 | 2.966265e-04 | 1.152473e-13 | 9.948279e-01 |
| 19614 | 19614 | 0.001352 | 9.986336e-01 | 1.137000e-08 | 8.636696e-06 | 6.210074e-06 |
| 19615 | 19615 | 0.000962 | 9.989957e-01 | 5.052970e-08 | 6.975001e-06 | 3.503605e-05 |


In [42]:
sample_submission.to_csv('/content/drive/MyDrive/Colab Notebooks/data/소설 작가 분류 AI 경진대회/submission.csv', index = False, encoding = 'utf-8')