# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 1: Data Exploration

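### Problem Definition
> Classifying the content of 1:1 inquiries<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of the inquiry classification model

### Training Data
> * 1:1 inquiry data: train.csv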

### Variables
> * text: inquiry content
> * label: inquiry type

### References
> * Korean language processing
>> * [konlpy - Korean language processing library](https://konlpy.org/ko/latest/)
>> * [Comparison table of Korean POS tags](https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)
>> * [Performance comparison of Korean POS taggers](https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes)
>> * [Korean system dictionaries](https://konlpy.org/ko/latest/data/#corpora)

> * Natural language processing
>> * [NLTK](https://www.nltk.org/book/)
>> * [gensim](https://radimrehurek.com/gensim/)
>> * [Google guide](https://developers.google.com/machine-learning/guides/text-classification/step-2)
>> * [WordCloud](https://amueller.github.io/word_cloud/)

In [None]:
# !sudo apt-get install -y fonts-nanum
# !sudo fc-cache -fv
# !rm ~/.cache/matplotlib -rf

In [1]:
import matplotlib.pyplot as plt

# plt.rc('font', family='NanumBarunGothic') 

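## 1. Development Environment Setup

* Detailed requirements
  - Code that imports the basic required libraries is already provided.
  - Add any other libraries you find necessary.
  - After installing konlpy and mecab, create a morphological analysis function.
  - On a Windows PC, mecab must be installed using a different procedure.
  - On Windows, the KoNLPy library may not install correctly.
  - Reference link for installing on Windows:
    - https://liveyourit.tistory.com/56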

### 1-1. Library Installation

In [2]:
# Install the required libraries first.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (575 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Collecting python-mecab-ko-dic
  Downloading python_mecab_ko_dic-2.1.1.post2-py3-none-any.whl (34.5 MB)


### 1-2. Library Imports

In [3]:
import os
from collections import Counter

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import wget
from IPython.display import display
from mecab import MeCab
from wordcloud import WordCloud

### 1-4. Mount Google Drive (Colab)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Load the Data

* Provided data
  - Training and validation data: train.csv

### 2-1. Data Loading

* Load the following data.
    * Training and validation data: train.csv
    * Check its shape.

In [5]:
data_path = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/train.csv'
data = pd.read_csv(data_path)

In [6]:
data.shape

(3706, 2)

### 2-2. Inspect the Data
* Check the distribution of inquiry types
* Check data types and missing values

In [8]:
data.head()

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,웹
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,코드2
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",코드2
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,코드2


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3706 entries, 0 to 3705
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3706 non-null   object
 1   label   3706 non-null   object
dtypes: object(2)
memory usage: 58.0+ KB


In [10]:
data.describe()

Unnamed: 0,text,label
count,3706,3706
unique,3706,6
top,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
freq,1,1097


# Preprocessing

In [73]:
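# Map the text labels to integer classes.
# '코드1' and '코드2' are merged into a single class 0, leaving 5 classes.
label_dict = {
    '코드1': 0,
    '코드2': 0,
    '웹': 1,
    '이론': 2,
    '시스템 운영': 3,
    '원격': 4
}

# Restrict the replacement to the label column so the text column is never touched.
df = data.replace({'label': label_dict})
df.head()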

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0


### Remove Some Special Characters

In [74]:
import re
test_str = '/adafasfd...........'
re.sub(r"\.+", ".", test_str)
# re.sub(r"\s+"," ", test_str)
# mecab.pos(test_str)

'/adafasfd.'

In [75]:
# Collapse runs of repeated periods or commas into a single character.
df['text'] = df['text'].apply(lambda x: re.sub(r"\.+", ".", x))
df['text'] = df['text'].apply(lambda x: re.sub(r",+", ",", x))

### Keep Only Common Nouns (NNG) and Proper Nouns (NNP), plus Foreign Words (SL) and Other Symbols (SY)

In [76]:
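%%time
# Keep only tokens tagged as common noun, proper noun, foreign word, or other symbol.
tags = {'NNG', 'NNP', 'SL', 'SY'}
mecab = MeCab()
result = []
for text in df['text']:
    # mecab.pos() returns (token, tag) pairs; lowercase the tokens that are kept.
    tokens = [word.lower() for word, tag in mecab.pos(text) if tag in tags]
    result.append(' '.join(tokens))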

CPU times: user 4.1 s, sys: 49.6 ms, total: 4.15 s
Wall time: 4.11 s
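
The filtering step above can be illustrated without installing MeCab. In this sketch the `(token, tag)` pairs are hand-written, hypothetical stand-ins for what `mecab.pos()` might return, and `filter_by_tag` is a helper name introduced here for illustration only.

```python
# Tags kept by the notebook: common noun, proper noun, foreign word, other symbol.
KEEP_TAGS = {'NNG', 'NNP', 'SL', 'SY'}

def filter_by_tag(tagged_tokens, keep_tags=KEEP_TAGS):
    """Return the lowercased tokens whose POS tag is in keep_tags, joined by spaces."""
    return ' '.join(token.lower() for token, tag in tagged_tokens if tag in keep_tags)

# Hypothetical tagger output for a short sentence; real MeCab tags may differ.
sample = [('이미지', 'NNG'), ('를', 'JKO'), ('업로드', 'NNG'),
          ('하', 'XSV'), ('는', 'ETM'), ('API', 'SL')]

print(filter_by_tag(sample))
```

Particles and endings (here `JKO`, `XSV`, `ETM`) are dropped, while nouns and foreign-language tokens such as identifiers from code snippets are kept.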


In [77]:
df['train'] = result

In [78]:
df.head()

Unnamed: 0,text,label,train
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0,self . convs = nn . modulelist nn . conv d co ...
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1,이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0,glob . glob path 사용 때 질문 path 포함 작동 질문 제공 파일 a...
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0,"tmpp = tmp . groupby by = ' addr ', as _ index..."
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0,filename = test _ image + str round frame _ se...


### Stopword Removal

In [83]:
# Load the stopword list, split on newlines.
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/불용어.txt'
with open(filename, encoding='utf-8') as f:
    stop = f.read()
stop_words = list(set(stop.split('\n')))

In [84]:
# Load the rare-word list, split on newlines.
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/희귀단어.txt'
with open(filename, encoding='utf-8') as f:
    sparse = f.read()
sparse_words = list(set(sparse.split('\n')))

In [85]:
stop_words.extend(sparse_words)

In [86]:
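# A set gives constant-time membership tests, unlike scanning a list for every token.
stop_set = set(stop_words)
result = []
for i in df['train']:
    temp = i.split()
    result.append(' '.join([j for j in temp if j not in stop_set]))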

In [87]:
df['train'] = result

In [88]:
df.head()

Unnamed: 0,text,label,train
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0,self . = nn . nn . conv d co k for k in 커널 사이즈...
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1,이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0,glob . glob path 사용 질문 path 포함 작동 질문 제공 파일 aiv...
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0,"tmpp = tmp . groupby by = ' addr ', as _ index..."
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0,filename = test _ image + str round frame _ se...


## Train Test Split

In [89]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df['train'], df['label'], test_size=0.2, 
    random_state=2023, stratify=df['label'])
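
With a fractional `test_size`, scikit-learn rounds the test-set size up, as far as I understand its behavior, which would explain why the classification reports further down show a total support of 742. A quick arithmetic check, using the row count from `data.shape`:

```python
import math

n_samples = 3706   # rows in train.csv, from data.shape
test_size = 0.2

# The test-set size is rounded up; the remainder goes to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```

Because `stratify=df['label']` is set, each class should appear in the test set in roughly the same proportion as in the full data.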

## TF-IDF + N-Gram

In [90]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
x_train = x_train.astype(str)
x_test = x_test.astype(str)

In [92]:
Tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
Tfidf_vect.fit(x_train)

In [93]:
x_train_tfidf = Tfidf_vect.transform(x_train)
x_test_tfidf = Tfidf_vect.transform(x_test)
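
One detail worth keeping in mind: with default settings, `TfidfVectorizer` lowercases the text and tokenizes with the pattern `(?u)\b\w\w+\b`, so single-character tokens and standalone symbols such as `.` or `=` that survived preprocessing do not become features. The sketch below mimics that tokenization and the `ngram_range=(1, 2)` expansion with the standard library only. It is an approximation for illustration, not scikit-learn's implementation, and `unigrams_and_bigrams` is a hypothetical helper name.

```python
import re

# Same pattern as TfidfVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def unigrams_and_bigrams(text):
    """Approximate the terms produced by ngram_range=(1, 2)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

# A preprocessed row similar in shape to the `train` column.
print(unigrams_and_bigrams('glob . glob path 사용 질문'))
```

If the symbols are meant to carry signal (for example to separate code-heavy inquiries from prose), a custom `token_pattern` or `tokenizer` argument would be needed; whether that helps here is untested.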

# Models

In [94]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 1) Logistic Regression

In [95]:
from sklearn.linear_model import LogisticRegression

In [96]:
model = LogisticRegression()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86       317
           1       0.88      0.79      0.83       147
           2       0.78      0.72      0.75       146
           3       0.86      0.79      0.82       112
           4       0.94      0.80      0.86        20

    accuracy                           0.83       742
   macro avg       0.85      0.80      0.83       742
weighted avg       0.83      0.83      0.83       742

[[290   6  16   5   0]
 [ 18 116   8   4   1]
 [ 34   2 105   5   0]
 [ 11   6   6  89   0]
 [  1   2   0   1  16]]
Accuracy : 0.8301886792452831
F1 Score : 0.8264385277100498
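
The accuracy and macro F1 printed above can be recomputed by hand from the confusion matrix, which makes the definitions explicit. The matrix is copied from the Logistic Regression output (rows are true labels, columns are predictions); the variable names are illustrative.

```python
# Confusion matrix from the Logistic Regression output (rows: true, columns: predicted).
cm = [
    [290,   6,  16,  5,  0],
    [ 18, 116,   8,  4,  1],
    [ 34,   2, 105,  5,  0],
    [ 11,   6,   6, 89,  0],
    [  1,   2,   0,  1, 16],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

# Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (column sum + row sum).
f1_per_class = []
for k in range(n_classes):
    tp = cm[k][k]
    predicted_k = sum(cm[r][k] for r in range(n_classes))  # TP + FP
    actual_k = sum(cm[k])                                   # TP + FN
    f1_per_class.append(2 * tp / (predicted_k + actual_k))

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_per_class) / n_classes

print(f'Accuracy : {accuracy:.4f}')
print(f'F1 Score : {macro_f1:.4f}')
```

Because the macro average is unweighted, the 20-sample class 4 counts as much as the 317-sample class 0, so macro F1 is more sensitive than accuracy to errors on the small classes.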


## 2) SVC

In [97]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.78      0.92      0.85       317
           1       0.90      0.78      0.83       147
           2       0.80      0.66      0.72       146
           3       0.89      0.80      0.85       112
           4       0.89      0.80      0.84        20

    accuracy                           0.82       742
   macro avg       0.85      0.79      0.82       742
weighted avg       0.83      0.82      0.82       742

[[293   5  13   6   0]
 [ 22 114   7   2   2]
 [ 44   2  97   3   0]
 [ 13   4   5  90   0]
 [  2   2   0   0  16]]
Accuracy : 0.8221024258760108
F1 Score : 0.8182438761450583


## 3) LGBM

In [98]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [99]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.80      0.82      0.81       317
           1       0.76      0.70      0.73       147
           2       0.71      0.68      0.69       146
           3       0.77      0.80      0.79       112
           4       0.83      1.00      0.91        20

    accuracy                           0.77       742
   macro avg       0.77      0.80      0.79       742
weighted avg       0.77      0.77      0.77       742

[[260  20  26  10   1]
 [ 22 103  12   7   3]
 [ 33   4  99  10   0]
 [ 11   8   3  90   0]
 [  0   0   0   0  20]]
Accuracy : 0.77088948787062
F1 Score : 0.7853260863822796


## 4) RandomForestClassifier

In [100]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.81      0.88      0.85       317
           1       0.79      0.76      0.77       147
           2       0.80      0.62      0.70       146
           3       0.77      0.84      0.80       112
           4       0.86      0.90      0.88        20

    accuracy                           0.80       742
   macro avg       0.81      0.80      0.80       742
weighted avg       0.80      0.80      0.80       742

[[280  17  11   8   1]
 [ 20 111   7   7   2]
 [ 36   6  91  13   0]
 [  9   4   5  94   0]
 [  0   2   0   0  18]]
Accuracy : 0.8005390835579514
F1 Score : 0.8001816395641465
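
To compare the four models evaluated so far, the scores printed above can be collected in one place. The numbers below are copied from those outputs and rounded to four decimals; the dictionary layout is only one possible way to organize them.

```python
# (accuracy, macro F1) copied from the outputs above, rounded to four decimals.
results = {
    'LogisticRegression':     (0.8302, 0.8264),
    'SVC':                    (0.8221, 0.8182),
    'LGBMClassifier':         (0.7709, 0.7853),
    'RandomForestClassifier': (0.8005, 0.8002),
}

# Rank the models by macro F1, best first.
ranking = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (acc, f1) in ranking:
    print(f'{name:<24} accuracy={acc:.4f}  macro_f1={f1:.4f}')
```

These scores come from a single 80/20 split with default hyperparameters, so small gaps, such as the one between Logistic Regression and SVC, should be read with caution.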


## 5) CatBoost

In [101]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [102]:
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

Learning rate set to 0.083635
0:	learn: 1.5463752	total: 1.03s	remaining: 17m 11s
1:	learn: 1.5022807	total: 1.59s	remaining: 13m 14s
2:	learn: 1.4589139	total: 2.13s	remaining: 11m 49s
3:	learn: 1.4224804	total: 2.69s	remaining: 11m 11s
4:	learn: 1.3863395	total: 3.24s	remaining: 10m 44s
5:	learn: 1.3601942	total: 3.79s	remaining: 10m 27s
6:	learn: 1.3346596	total: 4.33s	remaining: 10m 14s
7:	learn: 1.3111512	total: 5.03s	remaining: 10m 23s
8:	learn: 1.2914698	total: 5.94s	remaining: 10m 54s
9:	learn: 1.2716455	total: 6.82s	remaining: 11m 15s
10:	learn: 1.2519877	total: 7.69s	remaining: 11m 31s
11:	learn: 1.2327189	total: 8.63s	remaining: 11m 50s
12:	learn: 1.2203280	total: 9.18s	remaining: 11m 36s
13:	learn: 1.2058634	total: 9.72s	remaining: 11m 24s
14:	learn: 1.1877611	total: 10.3s	remaining: 11m 14s
15:	learn: 1.1758477	total: 10.8s	remaining: 11m 4s
16:	learn: 1.1650205	total: 11.4s	remaining: 10m 56s
17:	learn: 1.1549191	total: 11.9s	remaining: 10m 49s
18:	learn: 1.1441229	total: