# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 3: Text classification

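### Problem Definition
> Classifying the content of 1:1 inquiries<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of the inquiry classification models

### Training Data
> * 1:1 inquiry data: train.csv

### Variables
> * text: inquiry content
> * label: inquiry type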

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

## 1. Development Environment Setup

### 1-1. Installing Libraries

In [None]:
# Install the required libraries first.
!pip install konlpy pandas seaborn wordcloud python-mecab-ko wget transformers



In [None]:
# A runtime restart is required after running this cell.
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
fonts-nanum is already the newest version (20200506-1).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.
/usr/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/usr/share/fonts/truetype: caching, new cache contents: 0 fonts, 3 dirs
/usr/share/fonts/truetype/humor-sans: caching, new cache contents: 1 fonts, 0 dirs
/usr/share/fonts/truetype/liberation: caching, new cache contents: 16 fonts, 0 dirs
/usr/share/fonts/truetype/nanum: caching, new cache contents: 12 fonts, 0 dirs
/usr/local/share/fonts: caching, new cache contents: 0 fonts, 0 dirs
/root/.local/share/fonts: skipping, no such directory
/root/.fonts: skipping, no such directory
/usr/share/fonts/truetype: skipping, looped directory detected
/usr/share/fonts/truetype/humor-sans: skipping, looped directory detected
/usr/share/fonts/truetype/liberation: skipping, looped directory detected
/usr/share/fonts/truetype/


### 1-2. Importing Libraries

In [None]:
import os

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import wget
from IPython.display import display
from mecab import MeCab

### 1-3. Korean Font Setup

In [None]:
FONT_PATH = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font_name = fm.FontProperties(fname=FONT_PATH, size=10).get_name()
print(font_name)
plt.rcParams['font.family'] = font_name
assert plt.rcParams['font.family'] == [font_name], "The Korean font was not applied."

NanumGothic


### 1-4. Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


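## 2. Loading the Preprocessed Data
* Load the data that was preprocessed on days 1 and 2.
* For sparse data, use `scipy.sparse.load_npz`.

In [None]:
import scipy.sparse  # import the submodule explicitly; `import scipy` alone may not expose it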
data_path = '/content/drive/MyDrive/미프4_2/'
x_train = scipy.sparse.load_npz(data_path + 'X_tfidf_train.npz')
x_val = scipy.sparse.load_npz(data_path + 'X_tfidf_val.npz')
y_train = np.load(data_path + 'y_train.npy')
y_val = np.load(data_path + 'y_val.npy')
x_train.shape, x_val.shape, y_train.shape, y_val.shape

((2779, 9303), (927, 9303), (2779,), (927,))

## 3. Machine Learning(N-grams)
* Train at least three machine learning models on the N-gram preprocessed data and analyze their performance
> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

### 3-1. LGB

In [None]:
!pip install lightgbm



In [None]:
from lightgbm import LGBMClassifier

In [None]:
%%time
model = LGBMClassifier(random_state=2023)
model.fit(x_train, y_train)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 33497
[LightGBM] [Info] Number of data points in the train set: 2779, number of used features: 945
[LightGBM] [Info] Start training from score -0.845620
[LightGBM] [Info] Start training from score -1.614488
[LightGBM] [Info] Start training from score -1.651325
[LightGBM] [Info] Start training from score -1.863738
[LightGBM] [Info] Start training from score -3.695740
CPU times: user 5.45 s, sys: 55.2 ms, total: 5.51 s
Wall time: 7.08 s


In [None]:
y_pred = model.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.80      0.88      0.83       392
           1       0.75      0.77      0.76       179
           2       0.76      0.67      0.71       195
           3       0.87      0.75      0.80       130
           4       0.97      0.97      0.97        31

    accuracy                           0.80       927
   macro avg       0.83      0.80      0.81       927
weighted avg       0.80      0.80      0.79       927

[[343  17  28   4   0]
 [ 32 137   8   1   1]
 [ 43  12 130  10   0]
 [ 12  15   6  97   0]
 [  0   1   0   0  30]]
79.50377562028046
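The figures in the report can be recomputed by hand from the confusion matrix. The sketch below is an illustration in plain Python (the helper `per_class_metrics` is a hypothetical name, not part of scikit-learn). It uses the LightGBM matrix printed above, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed by the LightGBM cell above
# (rows = true class, columns = predicted class).
cm = [
    [343, 17, 28, 4, 0],
    [32, 137, 8, 1, 1],
    [43, 12, 130, 10, 0],
    [12, 15, 6, 97, 0],
    [0, 1, 0, 0, 30],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))  # diagonal = correct predictions
accuracy = correct / total


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k."""
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(len(cm)))  # column sum
    actual_k = sum(cm[k])                                # row sum
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


print(f"accuracy = {accuracy:.4f}")
for k in range(n_classes):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The diagonal sums to 737 of 927 samples, which is the 79.50% accuracy printed by the cell.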


### 3-2. RF

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
%%time
model = RandomForestClassifier(random_state=2023)
model.fit(x_train, y_train)

CPU times: user 1.78 s, sys: 10.2 ms, total: 1.79 s
Wall time: 2 s


In [None]:
y_pred = model.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.70      0.94      0.80       392
           1       0.87      0.58      0.70       179
           2       0.77      0.58      0.66       195
           3       0.84      0.75      0.79       130
           4       1.00      0.71      0.83        31

    accuracy                           0.76       927
   macro avg       0.84      0.71      0.76       927
weighted avg       0.78      0.76      0.75       927

[[367   6  13   6   0]
 [ 61 104   9   5   0]
 [ 73   3 113   6   0]
 [ 16   5  11  98   0]
 [  4   2   1   2  22]]
75.94390507011866


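### 3-3. CBC

CatBoost, short for "Category Boosting", is a tree-based gradient boosting algorithm designed to handle categorical data well.

Categorical data handling: CatBoost can encode categorical features internally when they are declared through the `cat_features` argument, so separate preprocessing such as one-hot or label encoding is not required. The TF-IDF features used here are all numeric, so this benefit does not come into play in this notebook.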

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [None]:
from catboost import CatBoostClassifier

In [None]:
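%%time
# Note: the output recorded below matches the RandomForest results from 3-2 exactly.
# It was produced while `model` still pointed to the RandomForest (the classifier had
# been assigned to a misspelled name), so rerun this cell to obtain CatBoost metrics.
model = CatBoostClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred) * 100)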

              precision    recall  f1-score   support

           0       0.70      0.94      0.80       392
           1       0.87      0.58      0.70       179
           2       0.77      0.58      0.66       195
           3       0.84      0.75      0.79       130
           4       1.00      0.71      0.83        31

    accuracy                           0.76       927
   macro avg       0.84      0.71      0.76       927
weighted avg       0.78      0.76      0.75       927

[[367   6  13   6   0]
 [ 61 104   9   5   0]
 [ 73   3 113   6   0]
 [ 16   5  11  98   0]
 [  4   2   1   2  22]]
75.94390507011866
CPU times: user 1.72 s, sys: 12 ms, total: 1.73 s
Wall time: 1.73 s


### 3-4. SVC

In [None]:
from sklearn.svm import SVC

In [None]:
%%time
model = SVC()
model.fit(x_train, y_train)

CPU times: user 6.68 s, sys: 9.57 ms, total: 6.69 s
Wall time: 8.96 s


In [None]:
y_pred = model.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.80      0.93      0.86       392
           1       0.86      0.79      0.83       179
           2       0.78      0.73      0.76       195
           3       0.89      0.78      0.83       130
           4       1.00      0.55      0.71        31

    accuracy                           0.83       927
   macro avg       0.87      0.75      0.80       927
weighted avg       0.83      0.83      0.82       927

[[363   3  24   2   0]
 [ 27 142   8   2   0]
 [ 44   5 142   4   0]
 [ 13   9   7 101   0]
 [  4   6   0   4  17]]
82.52427184466019


### 3-5. Logistic

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.81      0.93      0.87       392
           1       0.86      0.81      0.83       179
           2       0.80      0.74      0.77       195
           3       0.89      0.78      0.83       130
           4       1.00      0.42      0.59        31

    accuracy                           0.83       927
   macro avg       0.87      0.74      0.78       927
weighted avg       0.83      0.83      0.82       927

[[365   2  22   3   0]
 [ 25 145   7   2   0]
 [ 42   5 144   4   0]
 [ 13   9   7 101   0]
 [  6   8   0   4  13]]
82.84789644012946


### 3-6. MLP

N-grams (TF-IDF): word combinations are generated as N-grams, and each N-gram's frequency-based importance is scored with TF-IDF to vectorize the text.
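To make the N-gram and TF-IDF step concrete, here is a minimal pure-Python sketch. It assumes the textbook weighting (term frequency times `log(N / df)`), and the helper names `ngrams` and `tfidf` as well as the toy documents are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values would differ.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams (joined by a space) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_min=1, n_max=2):
    """Textbook TF-IDF: tf = count / number of n-grams in the doc, idf = log(N / df)."""
    doc_grams = [ngrams(doc, n_min, n_max) for doc in docs]
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))  # count each n-gram once per document
    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams)
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in counts.items()}
        )
    return vectors


# Hypothetical, already tokenized inquiries (toy data, not from train.csv).
docs = [
    ["code", "error", "fix"],
    ["code", "submit", "deadline"],
    ["lecture", "video", "error"],
]
vectors = tfidf(docs)
for g, w in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{g!r}: {w:.3f}")
```

N-grams that occur in only one document (for example the bigram `code error`) receive a higher weight than unigrams shared across documents, which is the property the classifiers above rely on.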

In [None]:
!pip uninstall keras

Found existing installation: keras 2.12.0
Uninstalling keras-2.12.0:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/keras-2.12.0.dist-info/*
    /usr/local/lib/python3.10/dist-packages/keras/*
Proceed (Y/n)? y
  Successfully uninstalled keras-2.12.0


In [None]:
!pip uninstall tensorflow
!pip install tensorflow==2.12.0

Found existing installation: tensorflow 2.12.0
Uninstalling tensorflow-2.12.0:
  Would remove:
    /usr/local/bin/estimator_ckpt_converter
    /usr/local/bin/import_pb_to_tensorboard
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.10/dist-packages/tensorflow-2.12.0.dist-info/*
    /usr/local/lib/python3.10/dist-packages/tensorflow/*
Proceed (Y/n)? y
  Successfully uninstalled tensorflow-2.12.0
Collecting tensorflow==2.12.0
  Using cached tensorflow-2.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (585.9 MB)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12.0)
  Using cached keras-2.12.0-py2.py3-none-any.whl (1.7 MB)
Installing collected packages: keras, tensorflow
Successfully installed keras-2.12.0 tensorflow-2.12.0


In [None]:
def get_last_layer_units_and_activation(num_classes):
    """Gets the # units and activation function for the last network layer.

    # Arguments
        num_classes: int, number of classes.

    # Returns
        units, activation values.
    """
    if num_classes == 2:
        activation = 'sigmoid'
        units = 1
    else:
        activation = 'softmax'
        units = num_classes
    return units, activation
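The choice of `softmax` with one unit per class pairs with the `sparse_categorical_crossentropy` loss used later in training. As a rough illustration in plain Python (the function names and logits below are hypothetical, not Keras internals), the loss for one sample is the negative log of the probability assigned to the true class index:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log-probability of the true class index."""
    return -math.log(probs[label])


logits = [2.0, 1.0, 0.1, -1.0, 0.5]  # hypothetical outputs for 5 classes
probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, label=0)
print([round(p, 3) for p in probs], round(loss, 3))

# An untrained model that predicts all 5 classes equally often has loss ln(5).
uniform_loss = sparse_categorical_crossentropy(softmax([0.0] * 5), label=3)
print(round(uniform_loss, 3))
```

The uniform-prediction loss of ln 5 is a useful reference point when reading the first-epoch losses in the training logs.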

In [None]:
import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Dropout

def mlp_model(layers, units, dropout_rate, input_shape, num_classes):
    """Creates an instance of a multi-layer perceptron model.

    # Arguments
        layers: int, number of `Dense` layers in the model.
        units: int, output dimension of the layers.
        dropout_rate: float, percentage of input to drop at Dropout layers.
        input_shape: tuple, shape of input to the model.
        num_classes: int, number of output classes.

    # Returns
        An MLP model instance.
    """
    op_units, op_activation = get_last_layer_units_and_activation(num_classes)
    model = models.Sequential()
    model.add(Dropout(rate=dropout_rate, input_shape=input_shape))

    for _ in range(layers-1):
        model.add(Dense(units=units, activation='relu'))
        model.add(Dropout(rate=dropout_rate))

    model.add(Dense(units=op_units, activation=op_activation))
    return model

In [None]:
def train_ngram_model(x_train, x_val, train_labels, val_labels,
                      learning_rate=1e-3,
                      epochs=1000,
                      batch_size=128,
                      layers=2,
                      units=64,
                      dropout_rate=0.2):
    """Trains n-gram model on the given dataset.

    # Arguments
        x_train: array, vectorized training texts.
        x_val: array, vectorized validation texts.
        train_labels: array, integer class labels for the training texts.
        val_labels: array, integer class labels for the validation texts.
        learning_rate: float, learning rate for training model.
        epochs: int, number of epochs.
        batch_size: int, number of samples per batch.
        layers: int, number of `Dense` layers in the model.
        units: int, output dimension of Dense layers in the model.
        dropout_rate: float: percentage of input to drop at Dropout layers.

    # Returns
        Tuple of (validation accuracy, validation loss) from the last epoch.

    # Raises
        ValueError: If validation data has label values which were not seen
            in the training data.
    """

    # Verify that validation labels are in the same range as training labels.
    num_classes = len(np.unique(train_labels))
    unexpected_labels = [v for v in val_labels if v not in range(num_classes)]

    if len(unexpected_labels):
        raise ValueError('Unexpected label values found in the validation set:'
                         ' {unexpected_labels}. Please make sure that the '
                         'labels in the validation set are in the same range '
                         'as training labels.'.format(
                             unexpected_labels=unexpected_labels))

    # Create model instance.
    model = mlp_model(layers=layers,
                      units=units,
                      dropout_rate=dropout_rate,
                      input_shape=x_train.shape[1:],
                      num_classes=num_classes)

    # Compile model with learning parameters.
    if num_classes == 2:
        loss = 'binary_crossentropy'
    else:
        loss = 'sparse_categorical_crossentropy'
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])

    # Create callback for early stopping on validation loss. If the loss does
    # not decrease in two consecutive tries, stop training.
    callbacks = [tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=2)]

    # Train and validate model.
    history = model.fit(
            x_train,
            train_labels,
            epochs=epochs,
            callbacks=callbacks,
            validation_data=(x_val, val_labels),
            verbose=2,  # Logs once per epoch.
            batch_size=batch_size)

    # Print results.
    history = history.history
    print('Validation accuracy: {acc}, loss: {loss}'.format(
            acc=history['val_acc'][-1], loss=history['val_loss'][-1]))

    # Save model.
    model.save('IMDb_mlp_model.h5')
    return history['val_acc'][-1], history['val_loss'][-1]
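The `EarlyStopping(monitor='val_loss', patience=2)` callback in the function above decides when training ends. The following is a simplified pure-Python sketch of that rule, not the Keras implementation; the function name and the loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation-loss curve.
val_losses = [1.45, 1.27, 1.09, 0.94, 0.95, 0.96, 0.90]
stop = early_stopping_epoch(val_losses)
print(stop)
```

Because `restore_best_weights` defaults to `False` in Keras, the metrics returned from the last epoch may be slightly worse than the best epoch.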

In [None]:
x_train_dense = x_train.toarray()
x_val_dense = x_val.toarray()

In [None]:
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import SMOTE

# Oversample only the training split; the validation split stays untouched.
adasyn = ADASYN(random_state=42)
smote = SMOTE(random_state=42)

x_train_ada, y_train_ada = adasyn.fit_resample(x_train_dense, y_train)
x_train_smt, y_train_smt = smote.fit_resample(x_train_dense, y_train)
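Both oversamplers create synthetic minority-class rows rather than duplicating existing ones. The core idea of SMOTE is linear interpolation between a sample and one of its same-class nearest neighbors. The sketch below shows only that step, with hypothetical names and toy vectors; it is not the imbalanced-learn implementation and skips the neighbor search.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
x = [0.0, 0.2, 0.5]         # hypothetical minority-class TF-IDF row
neighbor = [0.1, 0.0, 0.7]  # hypothetical same-class nearest neighbor
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

ADASYN follows the same interpolation idea but generates more synthetic points around minority samples whose neighborhoods contain many samples from other classes.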

In [None]:
# Train the model
val_acc, val_loss = train_ngram_model(x_train_dense, x_val_dense, y_train, y_val)

Epoch 1/1000
22/22 - 3s - loss: 1.5424 - acc: 0.4207 - val_loss: 1.4524 - val_acc: 0.4229 - 3s/epoch - 134ms/step
Epoch 2/1000
22/22 - 1s - loss: 1.3519 - acc: 0.4520 - val_loss: 1.2659 - val_acc: 0.4358 - 710ms/epoch - 32ms/step
Epoch 3/1000
22/22 - 1s - loss: 1.1525 - acc: 0.5365 - val_loss: 1.0946 - val_acc: 0.5297 - 704ms/epoch - 32ms/step
Epoch 4/1000
22/22 - 1s - loss: 0.9582 - acc: 0.7056 - val_loss: 0.9392 - val_acc: 0.6969 - 768ms/epoch - 35ms/step
Epoch 5/1000
22/22 - 1s - loss: 0.7836 - acc: 0.8330 - val_loss: 0.8107 - val_acc: 0.7724 - 782ms/epoch - 36ms/step
Epoch 6/1000
22/22 - 1s - loss: 0.6377 - acc: 0.8759 - val_loss: 0.7114 - val_acc: 0.8080 - 684ms/epoch - 31ms/step
Epoch 7/1000
22/22 - 1s - loss: 0.5260 - acc: 0.9010 - val_loss: 0.6409 - val_acc: 0.8177 - 781ms/epoch - 36ms/step
Epoch 8/1000
22/22 - 1s - loss: 0.4535 - acc: 0.9172 - val_loss: 0.5886 - val_acc: 0.8285 - 716ms/epoch - 33ms/step
Epoch 9/1000
22/22 - 1s - loss: 0.3819 - acc: 0.9298 - val_loss: 0.5504 - 

In [None]:
val_acc, val_loss = train_ngram_model(x_train_smt, x_val_dense, y_train_smt, y_val)

Epoch 1/1000
47/47 - 2s - loss: 1.4536 - acc: 0.7210 - val_loss: 1.3140 - val_acc: 0.7648 - 2s/epoch - 48ms/step
Epoch 2/1000
47/47 - 1s - loss: 0.9832 - acc: 0.8954 - val_loss: 0.9424 - val_acc: 0.8414 - 1s/epoch - 30ms/step
Epoch 3/1000
47/47 - 1s - loss: 0.5906 - acc: 0.9356 - val_loss: 0.6910 - val_acc: 0.8479 - 1s/epoch - 29ms/step
Epoch 4/1000
47/47 - 1s - loss: 0.3798 - acc: 0.9474 - val_loss: 0.5656 - val_acc: 0.8447 - 1s/epoch - 31ms/step
Epoch 5/1000
47/47 - 1s - loss: 0.2679 - acc: 0.9641 - val_loss: 0.4999 - val_acc: 0.8490 - 1s/epoch - 30ms/step
Epoch 6/1000
47/47 - 2s - loss: 0.1997 - acc: 0.9717 - val_loss: 0.4655 - val_acc: 0.8544 - 2s/epoch - 37ms/step
Epoch 7/1000
47/47 - 4s - loss: 0.1600 - acc: 0.9750 - val_loss: 0.4476 - val_acc: 0.8501 - 4s/epoch - 83ms/step
Epoch 8/1000
47/47 - 2s - loss: 0.1319 - acc: 0.9799 - val_loss: 0.4375 - val_acc: 0.8501 - 2s/epoch - 51ms/step
Epoch 9/1000
47/47 - 2s - loss: 0.1056 - acc: 0.9854 - val_loss: 0.4342 - val_acc: 0.8468 - 2s/e

In [None]:
val_acc, val_loss = train_ngram_model(x_train_ada, x_val_dense, y_train_ada, y_val)

Epoch 1/1000
48/48 - 4s - loss: 1.4500 - acc: 0.6916 - val_loss: 1.3029 - val_acc: 0.7940 - 4s/epoch - 74ms/step
Epoch 2/1000
48/48 - 2s - loss: 0.9916 - acc: 0.8980 - val_loss: 0.9287 - val_acc: 0.8231 - 2s/epoch - 34ms/step
Epoch 3/1000
48/48 - 1s - loss: 0.6086 - acc: 0.9267 - val_loss: 0.6784 - val_acc: 0.8371 - 1s/epoch - 31ms/step
Epoch 4/1000
48/48 - 1s - loss: 0.3909 - acc: 0.9452 - val_loss: 0.5565 - val_acc: 0.8479 - 1s/epoch - 31ms/step
Epoch 5/1000
48/48 - 1s - loss: 0.2711 - acc: 0.9649 - val_loss: 0.4938 - val_acc: 0.8533 - 1s/epoch - 30ms/step
Epoch 6/1000
48/48 - 1s - loss: 0.2012 - acc: 0.9733 - val_loss: 0.4590 - val_acc: 0.8608 - 1s/epoch - 30ms/step
Epoch 7/1000
48/48 - 1s - loss: 0.1582 - acc: 0.9797 - val_loss: 0.4401 - val_acc: 0.8598 - 1s/epoch - 30ms/step
Epoch 8/1000
48/48 - 2s - loss: 0.1245 - acc: 0.9831 - val_loss: 0.4313 - val_acc: 0.8576 - 2s/epoch - 45ms/step
Epoch 9/1000
48/48 - 3s - loss: 0.1050 - acc: 0.9862 - val_loss: 0.4287 - val_acc: 0.8565 - 3s/e

### 3-7. Hyperparameter Tuning(Optional)
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Hyperparameter
param_grid = {
    'C': [0.1, 1, 10],
    # 'kernel': ['linear', 'rbf','sigmoid'],
    'gamma': [0.1, 1, 'scale']
}

# Cross-validation splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create the GridSearchCV object and run the search
grid_search = GridSearchCV(SVC(), param_grid, cv=skf)
grid_search.fit(x_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on the validation set
y_pred = best_model.predict(x_val)

accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred)

print(f"Best Hyperparameters: {best_params}")
print(f"Test Set Accuracy: {accuracy:.2f}")
print(report)


Best Hyperparameters: {'C': 10, 'gamma': 1}
Test Set Accuracy: 0.83
              precision    recall  f1-score   support

           0       0.83      0.92      0.87       392
           1       0.86      0.83      0.84       179
           2       0.79      0.74      0.77       195
           3       0.88      0.79      0.83       130
           4       1.00      0.58      0.73        31

    accuracy                           0.83       927
   macro avg       0.87      0.77      0.81       927
weighted avg       0.84      0.83      0.83       927
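The cost of the grid search above grows with the product of the grid sizes. As a small sketch in plain Python (independent of scikit-learn), the candidates it evaluates can be enumerated with `itertools.product`:

```python
from itertools import product

# Same grid and fold count as the search above.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.1, 1, "scale"],
}
n_splits = 5  # folds in the stratified cross-validation

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_splits

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
for params in candidates[:3]:
    print(params)
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, which is the model stored in `best_estimator_`.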



## 4. Deep Learning(Sequence)
* Train at least three deep learning models (e.g. DNN, 1-D CNN, LSTM) on the sequence-preprocessed data and analyze their performance
> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)

### 4-1. DNN

https://github.com/google/eng-edu/blob/main/ml/guides/text_classification/vectorize_data.py

In [None]:
data_path = '/content/drive/MyDrive/미프4_2/'
x_train_seq = np.load(data_path + 'X_mor_sequence_train.npy')
x_val_seq = np.load(data_path + 'X_mor_sequence_val.npy')

print(x_train_seq.shape, x_val_seq.shape)

(2779, 500) (927, 500)
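Both arrays have exactly 500 columns because every token-id sequence was padded or truncated to a fixed length during preprocessing. The sketch below illustrates that step in plain Python. It assumes padding and truncation at the end of the sequence with pad value 0; the helper name and toy ids are hypothetical, and the side actually used in the earlier preprocessing is not shown in this notebook (Keras `pad_sequences` defaults to the front).

```python
def pad_or_truncate(seq, max_length, pad_value=0):
    """Return seq as a list of exactly max_length ids (post-padding, post-truncation)."""
    trimmed = list(seq[:max_length])
    return trimmed + [pad_value] * (max_length - len(trimmed))


sequences = [[5, 12, 7], [3], [9, 4, 8, 2, 6, 1]]  # hypothetical token ids
padded = [pad_or_truncate(s, max_length=4) for s in sequences]
print(padded)
```

Index 0 is reserved for padding, which is why the vocabulary size passed to an embedding layer is one larger than the number of distinct tokens.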


In [None]:
# Get max sequence length.
MAX_SEQUENCE_LENGTH = 500

# The sequences are already padded to a fixed width, so the row length is the
# sequence length; cap it at MAX_SEQUENCE_LENGTH.
max_length = min(x_train_seq.shape[1], MAX_SEQUENCE_LENGTH)

In [None]:
max_length

In [None]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.11.0-py3-none-any.whl (235 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m235.6/235.6 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: imbalanced-learn
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.10.1
    Uninstalling imbalanced-learn-0.10.1:
      Successfully uninstalled imbalanced-learn-0.10.1
Successfully installed imbalanced-learn-0.11.0


Oversample the training sequences with ADASYN and SMOTE

In [None]:
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import SMOTE

In [None]:
adasyn = ADASYN(random_state=42)
smote = SMOTE(random_state=42)

# Keep the two results under separate names; `x_train_rsm` holds the SMOTE
# result that the later cells use.
x_train_ada_seq, y_train_ada_seq = adasyn.fit_resample(x_train_seq, y_train)
x_train_rsm, y_train_rsm = smote.fit_resample(x_train_seq, y_train)

In [None]:
x_train_rsm.shape, y_train_rsm.shape

((5965, 500), (5965,))

### 4-2. 1-D CNN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

def sepcnn_model(max_length, num_features, num_classes):
    """Builds a 1-D CNN classifier (plain Conv1D, despite the function name)."""
    embedding_dim = 256  # dimension of the embedding vectors
    dropout_ratio = 0.3  # dropout rate
    num_filters = 256    # number of convolution filters
    kernel_size = 3      # width of each filter
    hidden_units = 128   # units in the hidden Dense layer

    # Match the output layer to the number of classes.
    op_units, op_activation = get_last_layer_units_and_activation(num_classes)

    model = Sequential()
    # input_dim must be the vocabulary size, not the sequence length.
    model.add(Embedding(num_features, embedding_dim, input_length=max_length))
    model.add(Dropout(dropout_ratio))
    model.add(Conv1D(num_filters, kernel_size, padding='valid', activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(hidden_units, activation='relu'))
    model.add(Dropout(dropout_ratio))
    model.add(Dense(op_units, activation=op_activation))

    return model

In [None]:
TOP_K = 20000  # Limit on the number of features. We use the top 20K features.

def train_sequence_model(x_train, x_val, train_labels, val_labels,
                         learning_rate=1e-3,
                         epochs=1000):

    # Verify that validation labels are in the same range as training labels.
    num_classes = len(np.unique(train_labels))
    unexpected_labels = [v for v in val_labels if v not in range(num_classes)]
    if len(unexpected_labels):
        raise ValueError('Unexpected label values found in the validation set:'
                         ' {unexpected_labels}. Please make sure that the '
                         'labels in the validation set are in the same range '
                         'as training labels.'.format(
                             unexpected_labels=unexpected_labels))

    # Number of features is the embedding input dimension. The tokenizer's
    # word_index is not available in this notebook, so it is derived from the
    # largest token id, plus 1 for the reserved index 0. If preprocessing kept
    # only the top TOP_K tokens, this value is at most TOP_K.
    num_features = int(max(x_train.max(), x_val.max())) + 1

    # Get max sequence length.
    MAX_SEQUENCE_LENGTH = 500
    max_length = min(x_train.shape[1], MAX_SEQUENCE_LENGTH)

    # Create model instance.
    model = sepcnn_model(max_length, num_features, num_classes)

    # Compile model with learning parameters.
    if num_classes == 2:
        loss = 'binary_crossentropy'
    else:
        loss = 'sparse_categorical_crossentropy'
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
    mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
    # Train and validate model.
    history = model.fit(
            x_train,
            train_labels,
            epochs=epochs,
            callbacks=[es, mc],
            validation_data=(x_val, val_labels),
            verbose=2)

    # Reload the best checkpoint and report its validation metrics.
    loaded_model = load_model('best_model.h5')
    val_loss, val_acc = loaded_model.evaluate(x_val, val_labels)
    print("\n Validation accuracy: %.4f" % val_acc)
    return val_acc, val_loss

In [None]:
val_acc, val_loss = train_ngram_model(x_train_rsm, x_val_seq, y_train_rsm, y_val)

Epoch 1/1000
47/47 - 3s - loss: 260.3794 - acc: 0.2543 - val_loss: 120.2219 - val_acc: 0.2805 - 3s/epoch - 60ms/step
Epoch 2/1000
47/47 - 0s - loss: 105.6900 - acc: 0.2877 - val_loss: 68.1104 - val_acc: 0.2718 - 435ms/epoch - 9ms/step
Epoch 3/1000
47/47 - 0s - loss: 56.1612 - acc: 0.3014 - val_loss: 40.9331 - val_acc: 0.2708 - 423ms/epoch - 9ms/step
Epoch 4/1000
47/47 - 0s - loss: 31.4046 - acc: 0.3039 - val_loss: 22.8873 - val_acc: 0.2621 - 386ms/epoch - 8ms/step
Epoch 5/1000
47/47 - 0s - loss: 17.5530 - acc: 0.2895 - val_loss: 13.1867 - val_acc: 0.2093 - 415ms/epoch - 9ms/step
Epoch 6/1000
47/47 - 0s - loss: 9.9230 - acc: 0.2875 - val_loss: 8.6239 - val_acc: 0.1931 - 328ms/epoch - 7ms/step
Epoch 7/1000
47/47 - 0s - loss: 6.3845 - acc: 0.2753 - val_loss: 6.5632 - val_acc: 0.1704 - 286ms/epoch - 6ms/step
Epoch 8/1000
47/47 - 0s - loss: 4.7363 - acc: 0.2773 - val_loss: 5.1706 - val_acc: 0.1629 - 324ms/epoch - 7ms/step
Epoch 9/1000
47/47 - 0s - loss: 3.8106 - acc: 0.2652 - val_loss: 4.41
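For reference, the two layers at the heart of the 1-D CNN in this section, `Conv1D(..., padding='valid')` followed by `GlobalMaxPooling1D`, can be sketched for a single channel and a single filter in plain Python. The function names, signal, and kernel are hypothetical, and this is an illustration of the operations rather than the Keras implementation.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding ('valid' mode)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def global_max_pool(features):
    """Collapse the time axis to its single largest activation."""
    return max(features)


signal = [0.0, 1.0, 3.0, 2.0, 0.0, 1.0]  # hypothetical 1-channel sequence
kernel = [1.0, 0.0, -1.0]                 # hypothetical filter of size 3
features = [max(0.0, v) for v in conv1d_valid(signal, kernel)]  # ReLU
pooled = global_max_pool(features)
print(features, pooled)
```

With 'valid' padding an input of length L and a kernel of size k yields L - k + 1 positions, so a 500-step sequence with a kernel of size 3 produces 498 positions per filter before pooling reduces them to one value.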

### 4-3. LSTM

## 5. Using pre-trained model(Optional)
* Fine-tune a Korean pre-trained model and analyze its performance
> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
> * [HuggingFace-Korean](https://huggingface.co/models?language=korean)