# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 3: Text classification

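### Problem definition
> Classifying the content of 1:1 inquiries<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of the inquiry classification model

### Training data
> * 1:1 inquiry content data: train.csv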

### Variables
> * text : inquiry content
> * label : inquiry type

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

In [25]:
# !sudo apt-get install -y fonts-nanum
# !sudo fc-cache -fv
# !rm ~/.cache/matplotlib -rf

In [26]:
import matplotlib.pyplot as plt

# plt.rc('font', family='NanumBarunGothic')

## 1. Development environment setup

### 1-1. Install libraries

In [27]:
# Install the required libraries first.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### 1-2. Import libraries

In [28]:
import os

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import wget
from IPython.display import display
from mecab import MeCab

In [29]:
import tensorflow as tf
import numpy as np
import random
import os

def my_seed_everywhere(seed: int = 42):
    random.seed(seed) # random
    np.random.seed(seed) # np
    os.environ["PYTHONHASHSEED"] = str(seed) # os
    tf.random.set_seed(seed) # tensorflow

seed = 42
my_seed_everywhere(seed)

### 1-4. Mount Google Drive (Colab)

In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Load the preprocessed data
* Load the data that was preprocessed on days 1 and 2.
* For sparse data, use `scipy.sparse.load_npz`.

In [31]:
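import scipy.sparse  # import the submodule explicitly; a bare `import scipy` does not expose it on older SciPy versions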
data_path = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/NNG_NNP_SL_TFIDF/'
x_train = scipy.sparse.load_npz(data_path + 'x_train.npz')
x_test = scipy.sparse.load_npz(data_path + 'x_test.npz')
y_train = np.load(data_path + 'y_train.npy')
y_test = np.load(data_path + 'y_test.npy')
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((2964, 3735), (742, 3735), (2964,), (742,))
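
The sparse feature matrices above are read back with `scipy.sparse.load_npz`. As a minimal sketch of that save/load round trip, the cell below uses a toy matrix, a temporary directory, and the file name `x_demo.npz`, all of which are made up for illustration and are not the project files:

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy TF-IDF-like matrix: 3 documents x 4 terms, mostly zeros (illustrative values).
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 1.0],
])
x_demo = scipy.sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "x_demo.npz")  # hypothetical file name
    scipy.sparse.save_npz(demo_path, x_demo)          # write the sparse matrix to disk
    x_loaded = scipy.sparse.load_npz(demo_path)       # read it back as a sparse matrix

# Shape and number of stored non-zero entries survive the round trip.
print(x_loaded.shape, x_loaded.nnz)
```

Dense label arrays such as `y_train` do not need this format, which is why the cell above reads them with `np.load`.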

## 3. Machine Learning

In [32]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

### 3-1. Model 1: LightGBM

In [33]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [34]:
from lightgbm import LGBMClassifier

In [35]:
%%time
model = LGBMClassifier(random_state=2023)
model.fit(x_train, y_train)

CPU times: user 4.78 s, sys: 71.9 ms, total: 4.85 s
Wall time: 2.89 s


In [36]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.82      0.85      0.84       317
           1       0.77      0.74      0.76       147
           2       0.72      0.70      0.71       146
           3       0.82      0.78      0.80       112
           4       0.87      1.00      0.93        20

    accuracy                           0.79       742
   macro avg       0.80      0.81      0.81       742
weighted avg       0.79      0.79      0.79       742

[[271  17  20   8   1]
 [ 18 109  14   4   2]
 [ 31   6 102   7   0]
 [ 10   9   6  87   0]
 [  0   0   0   0  20]]
79.38005390835579
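
The classification report and the confusion matrix above are two views of the same counts. As a rough check, accuracy, recall, and precision can be recomputed directly from the printed matrix, where rows are true labels and columns are predicted labels (scikit-learn's convention). The variable names are illustrative:

```python
import numpy as np

# Confusion matrix copied from the LightGBM output above.
cm = np.array([
    [271,  17,  20,   8,   1],
    [ 18, 109,  14,   4,   2],
    [ 31,   6, 102,   7,   0],
    [ 10,   9,   6,  87,   0],
    [  0,   0,   0,   0,  20],
])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / true count (row sum)
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted count (column sum)

print(round(accuracy * 100, 2))
print(np.round(recall, 2))
print(np.round(precision, 2))
```

The diagonal sums to 589 of 742 samples, which reproduces the 79.38% accuracy printed above, and the per-class values agree with the report after rounding to two decimals.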


### 3-2. Model 2: Random Forest

In [37]:
from sklearn.ensemble import RandomForestClassifier

In [38]:
%%time
model = RandomForestClassifier(random_state=2023)
model.fit(x_train, y_train)

CPU times: user 2.07 s, sys: 11.3 ms, total: 2.08 s
Wall time: 2.09 s


In [39]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.78      0.85      0.82       317
           1       0.75      0.69      0.72       147
           2       0.72      0.64      0.68       146
           3       0.83      0.80      0.81       112
           4       0.86      0.95      0.90        20

    accuracy                           0.77       742
   macro avg       0.79      0.79      0.79       742
weighted avg       0.77      0.77      0.77       742

[[271  21  18   6   1]
 [ 30 101  10   4   2]
 [ 36   7  94   9   0]
 [  9   5   8  90   0]
 [  0   1   0   0  19]]
77.49326145552561


### 3-3. Model 3: CatBoost

In [40]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [41]:
from catboost import CatBoostClassifier

In [42]:
%%time
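# NOTE: the output recorded below is identical to the Random Forest output in 3-2.
# It was produced while this line assigned to `moel`, so `model` was still the Random Forest.
# Re-run this cell to obtain CatBoost metrics.
model = CatBoostClassifier()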
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.78      0.85      0.82       317
           1       0.75      0.69      0.72       147
           2       0.72      0.64      0.68       146
           3       0.83      0.80      0.81       112
           4       0.86      0.95      0.90        20

    accuracy                           0.77       742
   macro avg       0.79      0.79      0.79       742
weighted avg       0.77      0.77      0.77       742

[[271  21  18   6   1]
 [ 30 101  10   4   2]
 [ 36   7  94   9   0]
 [  9   5   8  90   0]
 [  0   1   0   0  19]]
77.49326145552561
CPU times: user 2.48 s, sys: 17 ms, total: 2.49 s
Wall time: 2.53 s


### 3-4. SVC

In [43]:
from sklearn.svm import SVC

In [44]:
%%time
model = SVC()
model.fit(x_train, y_train)

CPU times: user 2.91 s, sys: 13 ms, total: 2.93 s
Wall time: 2.99 s


In [45]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.82      0.91      0.86       317
           1       0.89      0.81      0.85       147
           2       0.77      0.72      0.74       146
           3       0.88      0.79      0.83       112
           4       0.94      0.85      0.89        20

    accuracy                           0.83       742
   macro avg       0.86      0.82      0.84       742
weighted avg       0.84      0.83      0.83       742

[[290   4  18   5   0]
 [ 17 119   8   2   1]
 [ 35   2 105   4   0]
 [ 12   6   6  88   0]
 [  0   2   0   1  17]]
83.42318059299191


### 3-5. Logistic Regression

In [46]:
from sklearn.linear_model import LogisticRegression

In [47]:
model = LogisticRegression()
model.fit(x_train, y_train)

In [48]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.83      0.90      0.87       317
           1       0.88      0.82      0.85       147
           2       0.76      0.73      0.74       146
           3       0.86      0.79      0.82       112
           4       0.94      0.80      0.86        20

    accuracy                           0.83       742
   macro avg       0.85      0.81      0.83       742
weighted avg       0.83      0.83      0.83       742

[[286   7  18   6   0]
 [ 14 121   9   2   1]
 [ 33   1 106   6   0]
 [  9   7   7  89   0]
 [  1   2   0   1  16]]
83.28840970350404
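
The fit / predict / report pattern repeats for every model in this section. One way to reduce the repetition is a small helper. In the sketch below, `evaluate_model` is a hypothetical name, and a random sparse matrix stands in for the TF-IDF features so that the cell runs on its own:

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import SVC


def evaluate_model(model, x_tr, y_tr, x_te, y_te):
    """Fit the model, print the usual reports, and return accuracy in percent."""
    model.fit(x_tr, y_tr)
    y_hat = model.predict(x_te)
    print(classification_report(y_te, y_hat, zero_division=0))
    print(confusion_matrix(y_te, y_hat))
    return accuracy_score(y_te, y_hat) * 100


# Synthetic stand-in data (not the project data): 200 samples, 50 sparse features, 3 classes.
rng = np.random.default_rng(42)
x_demo = scipy.sparse.random(200, 50, density=0.1, format="csr", random_state=42)
y_demo = rng.integers(0, 3, size=200)
x_tr, x_te = x_demo[:160], x_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}
scores = {
    name: evaluate_model(model, x_tr, y_tr, x_te, y_te)
    for name, model in candidates.items()
}
print(scores)
```

With the real data, `x_train`, `y_train`, `x_test`, and `y_test` would be passed in place of the synthetic splits, and the returned dictionary gives a side-by-side accuracy comparison. Scores on the random stand-in data are not meaningful.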


### 3-6. Hyperparameter Tuning (Optional)
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)
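
As one concrete instance of the grid search option listed above, the following sketch tunes `C` for logistic regression with scikit-learn's `GridSearchCV`. The parameter grid, the 3-fold setting, and the synthetic data are assumptions chosen for illustration, not values taken from this project:

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and labels (not the project data).
rng = np.random.default_rng(2023)
x_demo = scipy.sparse.random(300, 40, density=0.2, format="csr", random_state=2023)
y_demo = rng.integers(0, 3, size=300)

param_grid = {"C": [0.1, 1.0, 10.0]}  # illustrative grid of regularization strengths
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the training data only
)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

With the real data, `search.fit(x_train, y_train)` would replace the synthetic inputs, and `search.best_estimator_` could then be evaluated on `x_test` in the same way as the models above.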