# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 1: Data Exploration

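### Problem Definition
> Classify the content of 1:1 inquiries.<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of the inquiry classification model

### Training Data
> * 1:1 inquiry data: train.csv

### Variables
> * text: inquiry content
> * label: inquiry type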

### References
> * Korean language processing
>> * [konlpy - Korean language processing library](https://konlpy.org/ko/latest/)
>> * [Korean POS tag comparison table](https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)
>> * [Korean POS tagging performance comparison](https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes)
>> * [Korean system dictionaries](https://konlpy.org/ko/latest/data/#corpora)

> * Natural language processing
>> * [NLTK](https://www.nltk.org/book/)
>> * [gensim](https://radimrehurek.com/gensim/)
>> * [Google guide](https://developers.google.com/machine-learning/guides/text-classification/step-2)
>> * [WordCloud](https://amueller.github.io/word_cloud/)

In [1]:
# !sudo apt-get install -y fonts-nanum
# !sudo fc-cache -fv
# !rm ~/.cache/matplotlib -rf

In [2]:
import matplotlib.pyplot as plt

# plt.rc('font', family='NanumBarunGothic') 

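## 1. Development Environment Setup

* Detailed requirements
  - Code that imports the basic required libraries is already provided.
  - Add any other libraries you find necessary.
  - After installing konlpy and mecab, create a morphological analysis function.
  - On a Windows PC, mecab must be installed with a different procedure.
  - On Windows, the KoNLPy libraries may not install correctly.
  - Reference link for installing on Windows:
    - https://liveyourit.tistory.com/56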

### 1-1. Library Installation

In [3]:
# Install the required libraries first.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (575 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Collecting python-mecab-ko-dic
  Downloading python_mecab_ko_dic-2.1.1.post2-py3-none-any.whl (34.5 MB)


### 1-2. Library Imports

In [4]:
import os
from collections import Counter

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import wget
from IPython.display import display
from mecab import MeCab
from wordcloud import WordCloud

### 1-4. Mounting Google Drive (Colab)

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Loading the Data

* Provided data
  - Training and validation data: train.csv

### 2-1. Data Loading

* Load the following data.
    * Training and validation data: train.csv
    * Check the shape.

In [18]:
data_path = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/train.csv'
data = pd.read_csv(data_path)

In [19]:
data.shape

(3706, 2)

### 2-2. Inspecting the Data
* Check the distribution of inquiry types
* Check data types and missing values
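
The checks listed above (label distribution, data types, missing values) can be sketched as follows. The small DataFrame is a hypothetical stand-in for train.csv, used only so the snippet runs on its own; the column names `text` and `label` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for train.csv with the same two columns (text, label).
toy = pd.DataFrame({
    "text": ["glob 사용 질문", "폼 생성 오류", "groupby 결과 질문", "원격 접속 문의"],
    "label": ["코드2", "웹", "코드2", "원격"],
})

# Label distribution: absolute counts and relative share per inquiry type.
label_counts = toy["label"].value_counts()
label_share = toy["label"].value_counts(normalize=True)

# Missing values per column and column dtypes.
missing_per_column = toy.isna().sum()
column_dtypes = toy.dtypes

print(label_counts)
print(label_share.round(2))
print(missing_per_column)
print(column_dtypes)
```

Looking at the relative shares next to the raw counts makes class imbalance easier to spot before any model is trained.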

In [20]:
data.head()

|   | text | label |
|---|------|-------|
| 0 | self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ... | 코드2 |
| 1 | 현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데... | 웹 |
| 2 | glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ... | 코드2 |
| 3 | tmpp = tmp.groupby(by = 'Addr1', as_index=Fals... | 코드2 |
| 4 | filename = TEST_IMAGE + str(round(frame_sec)) ... | 코드2 |


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3706 entries, 0 to 3705
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3706 non-null   object
 1   label   3706 non-null   object
dtypes: object(2)
memory usage: 58.0+ KB


In [22]:
data.describe()

|   | text | label |
|---|------|-------|
| count | 3706 | 3706 |
| unique | 3706 | 6 |
| top | self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ... | 코드2 |
| freq | 1 | 1097 |


# Preprocessing

In [23]:
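# Map each inquiry type to an integer class; '코드1' and '코드2' are merged into class 0.
label_dict = {
    '코드1': 0,
    '코드2': 0,
    '웹': 1,
    '이론': 2,
    '시스템 운영': 3,
    '원격': 4
}

# Encode only the label column so that values in the text column are never replaced.
df = data.copy()
df['label'] = df['label'].map(label_dict)
df.head()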

|   | text | label |
|---|------|-------|
| 0 | self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ... | 0 |
| 1 | 현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데... | 1 |
| 2 | glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ... | 0 |
| 3 | tmpp = tmp.groupby(by = 'Addr1', as_index=Fals... | 0 |
| 4 | filename = TEST_IMAGE + str(round(frame_sec)) ... | 0 |
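
`DataFrame.replace` with a dict is applied to every column, so a text cell that exactly equals a label string would be rewritten as well. A minimal sketch of one alternative, assuming the same `label_dict` as above: encode only the label column with `Series.map` and fail loudly on labels the mapping does not cover. The rows are hypothetical.

```python
import pandas as pd

# Same mapping as in the notebook: '코드1' and '코드2' share class 0.
label_dict = {"코드1": 0, "코드2": 0, "웹": 1, "이론": 2, "시스템 운영": 3, "원격": 4}

# Hypothetical rows; the second text equals a label string on purpose.
toy = pd.DataFrame({
    "text": ["glob 사용 질문", "웹", "원격 접속 문의"],
    "label": ["코드2", "웹", "원격"],
})

encoded = toy.copy()
# Series.map touches only the label column; unmapped labels become NaN.
encoded["label"] = encoded["label"].map(label_dict)

# Fail loudly if any label was missing from the mapping.
unmapped = int(encoded["label"].isna().sum())
if unmapped:
    raise ValueError(f"{unmapped} labels are not covered by label_dict")

encoded["label"] = encoded["label"].astype(int)
print(encoded)
```

Note that merging '코드1' and '코드2' reduces the six original label values to five classes, which matches the five rows in the classification reports.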


### Keep only tokens tagged NNG (common noun), NNP (proper noun), or SL (foreign word)

In [24]:
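%%time
# POS tags to keep: common nouns (NNG), proper nouns (NNP), foreign words (SL)
keep_tags = {'NNG', 'NNP', 'SL'}
mecab = MeCab()
result = []
for text in df['text']:
    # mecab.pos returns (word, tag) pairs; keep matching words in lowercase
    tokens = [word.lower() for word, tag in mecab.pos(text) if tag in keep_tags]
    result.append(' '.join(tokens))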

CPU times: user 5.75 s, sys: 30.2 ms, total: 5.78 s
Wall time: 5.83 s
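
The filtering step can be tried without installing mecab by feeding it pre-tagged pairs. This is a minimal sketch: `filter_by_pos` is a hypothetical helper name, and the (word, tag) pairs are hand-written to imitate the shape of `mecab.pos()` output rather than real analyzer output.

```python
def filter_by_pos(tagged_tokens, keep_tags):
    """Keep lowercased surface forms whose POS tag is in keep_tags."""
    return [word.lower() for word, tag in tagged_tokens if tag in keep_tags]


# Hand-written (word, tag) pairs that imitate the shape of mecab.pos() output.
# The tags are illustrative assumptions, not real analyzer output.
tagged = [
    ("PATH", "SL"),
    ("에", "JKB"),
    ("파일", "NNG"),
    ("이", "JKS"),
    ("포함", "NNG"),
    ("됩니다", "XSV+EF"),
]

keep_tags = {"NNG", "NNP", "SL"}
tokens = filter_by_pos(tagged, keep_tags)
print(" ".join(tokens))
```

Because the comparison is an exact tag match, tokens carrying a compound tag are dropped even if one component is a noun tag.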


In [25]:
df['nouns'] = result

In [26]:
df.head()

|   | text | label | nouns |
|---|------|-------|-------|
| 0 | self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ... | 0 | self convs nn modulelist nn conv d co k for k ... |
| 1 | 현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데... | 1 | 이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유 |
| 2 | glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ... | 0 | glob glob path 사용 때 질문 path 포함 작동 질문 제공 파일 aiv... |
| 3 | tmpp = tmp.groupby(by = 'Addr1', as_index=Fals... | 0 | tmpp tmp groupby by addr as index false catego... |
| 4 | filename = TEST_IMAGE + str(round(frame_sec)) ... | 0 | filename test image str round frame sec jpg te... |


### Stopword Removal

In [29]:
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/불용어.txt'
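# Read the stopword list (one word per line); state the encoding explicitly for Korean text.
with open(filename, encoding='utf-8') as f:
    stop = f.read()
stop_words = set(stop.splitlines())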

In [30]:
# Add '은' as an extra stopword; keeping a set gives constant-time membership checks.
stop_words.add('은')

In [31]:
result = []
for i in df['nouns']:
    temp = i.split()
    result.append(' '.join([j for j in temp if j not in stop_words]))
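
The same filtering can be wrapped in a small function and tried on a single string. This is a sketch under assumptions: `remove_stopwords` is a hypothetical helper, and the stopword content is invented for illustration (the real list lives in the stopword file, whose content is not shown here).

```python
def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(token for token in text.split() if token not in stop_words)


# Hypothetical stopword file content, one word per line.
raw_stopwords = "때\n것\n수\n"
stop_words = set(raw_stopwords.splitlines())

cleaned = remove_stopwords("glob glob path 사용 때 질문", stop_words)
print(cleaned)  # glob glob path 사용 질문
```

Using a set rather than a list keeps each membership test constant-time, which matters when every token of every document is checked.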

In [34]:
df['nouns'] = result

In [35]:
df.head()

|   | text | label | nouns |
|---|------|-------|-------|
| 0 | self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ... | 0 | self convs nn modulelist nn conv d co k for k ... |
| 1 | 현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데... | 1 | 이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유 |
| 2 | glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ... | 0 | glob glob path 사용 질문 path 포함 작동 질문 제공 파일 aivle... |
| 3 | tmpp = tmp.groupby(by = 'Addr1', as_index=Fals... | 0 | tmpp tmp groupby by addr as index false catego... |
| 4 | filename = TEST_IMAGE + str(round(frame_sec)) ... | 0 | filename test image str round frame sec jpg te... |


## Train Test Split

In [62]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df['nouns'], df['label'], test_size=0.2, 
    random_state=2023, stratify=df['label'])
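
The effect of `stratify` can be checked on a toy label list. The sketch below uses hypothetical, deliberately imbalanced labels and the same `test_size` and `random_state` as above; it only illustrates that both splits keep the original class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 4.
labels = [0] * 80 + [4] * 20
texts = [f"doc {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2023, stratify=labels
)

# With stratification both splits keep the 80:20 class ratio.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without `stratify`, a rare class could end up over- or under-represented in the test split, which makes per-class metrics for that class unstable.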

## TF-IDF + N-Gram

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
x_train = x_train.astype(str)
x_test = x_test.astype(str)

In [65]:
Tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
Tfidf_vect.fit(x_train)

In [66]:
x_train_tfidf = Tfidf_vect.transform(x_train)
x_test_tfidf = Tfidf_vect.transform(x_test)
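
A minimal sketch of what `ngram_range=(1, 2)` produces, on hypothetical noun strings shaped like `df['nouns']`. Two details are worth seeing on a small example: the vocabulary is learned from the training documents only, and the default `token_pattern` of `TfidfVectorizer` keeps only tokens of two or more word characters, so single-character nouns such as 폼 are dropped before n-grams are built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical space-joined noun strings in the same shape as df['nouns'].
train_docs = ["glob path 사용 질문", "이미지 업로드 폼 생성"]
test_docs = ["path 질문 새단어"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# Learn vocabulary and IDF weights from the training split only.
train_matrix = vectorizer.fit_transform(train_docs)
# Reuse the fitted vocabulary; unseen test tokens are ignored.
test_matrix = vectorizer.transform(test_docs)

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(train_matrix.shape, test_matrix.shape)
```

If single-character nouns carry signal for this task, passing a custom `token_pattern` (for example `r"(?u)\b\w+\b"`) is one option to keep them.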

# Models

In [71]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 1) Logistic Regression

In [67]:
from sklearn.linear_model import LogisticRegression

In [73]:
model = LogisticRegression()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       317
           1       0.89      0.79      0.83       147
           2       0.79      0.73      0.76       146
           3       0.88      0.79      0.84       112
           4       0.94      0.75      0.83        20

    accuracy                           0.84       742
   macro avg       0.86      0.80      0.83       742
weighted avg       0.84      0.84      0.83       742

[[293   6  14   4   0]
 [ 19 116   8   3   1]
 [ 34   1 107   4   0]
 [ 10   6   7  89   0]
 [  2   2   0   1  15]]
Accuracy : 0.8355795148247979
F1 Score : 0.8261119709965721
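
To make the link between the confusion matrix and the summary scores explicit, the sketch below recomputes accuracy and macro F1 by hand from the matrix printed above. It uses plain Python and follows the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above (rows: true class, columns: predicted class).
cm = [
    [293, 6, 14, 4, 0],
    [19, 116, 8, 3, 1],
    [34, 1, 107, 4, 0],
    [10, 6, 7, 89, 0],
    [2, 2, 0, 1, 15],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

f1_per_class = []
for k in range(n_classes):
    tp = cm[k][k]
    support = sum(cm[k])                                 # true samples of class k
    predicted = sum(cm[r][k] for r in range(n_classes))  # predictions of class k
    precision = tp / predicted
    recall = tp / support
    f1_per_class.append(2 * precision * recall / (precision + recall))

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_per_class) / n_classes
print(round(accuracy, 4), round(macro_f1, 4))  # 0.8356 0.8261
```

Because macro F1 weights every class equally, the 20-sample class 4 influences it as much as the 317-sample class 0, which is why it can diverge from accuracy.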


## 2) SVC

In [74]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.78      0.93      0.85       317
           1       0.90      0.77      0.83       147
           2       0.80      0.66      0.72       146
           3       0.89      0.79      0.84       112
           4       0.88      0.75      0.81        20

    accuracy                           0.82       742
   macro avg       0.85      0.78      0.81       742
weighted avg       0.83      0.82      0.82       742

[[295   5  12   5   0]
 [ 23 113   7   2   2]
 [ 46   1  96   3   0]
 [ 14   4   5  89   0]
 [  2   2   0   1  15]]
Accuracy : 0.8194070080862533
F1 Score : 0.8099210503954412


## 3) CatBoost

In [75]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [76]:
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

Learning rate set to 0.083635
0:	learn: 1.5504269	total: 1.16s	remaining: 19m 20s
1:	learn: 1.4951612	total: 1.75s	remaining: 14m 33s
2:	learn: 1.4546042	total: 2.36s	remaining: 13m 3s
3:	learn: 1.4149947	total: 2.95s	remaining: 12m 14s
4:	learn: 1.3772949	total: 3.53s	remaining: 11m 43s
5:	learn: 1.3534279	total: 4.14s	remaining: 11m 26s
6:	learn: 1.3316972	total: 4.73s	remaining: 11m 11s
7:	learn: 1.3090984	total: 5.35s	remaining: 11m 3s
8:	learn: 1.2912653	total: 5.95s	remaining: 10m 55s
9:	learn: 1.2723552	total: 6.53s	remaining: 10m 46s
10:	learn: 1.2533516	total: 7.17s	remaining: 10m 44s
11:	learn: 1.2351021	total: 8.11s	remaining: 11m 7s
12:	learn: 1.2199093	total: 9.05s	remaining: 11m 27s
13:	learn: 1.2037738	total: 10s	remaining: 11m 45s
14:	learn: 1.1940392	total: 10.9s	remaining: 11m 57s
15:	learn: 1.1837021	total: 11.5s	remaining: 11m 48s
16:	learn: 1.1704464	total: 12.1s	remaining: 11m 40s
17:	learn: 1.1575827	total: 12.7s	remaining: 11m 32s
18:	learn: 1.1487880	total: 13.

## 4) LGBM

In [77]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [78]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       317
           1       0.74      0.70      0.72       147
           2       0.70      0.69      0.70       146
           3       0.81      0.81      0.81       112
           4       0.83      0.95      0.88        20

    accuracy                           0.78       742
   macro avg       0.78      0.80      0.79       742
weighted avg       0.78      0.78      0.78       742

[[263  21  26   6   1]
 [ 20 103  13   8   3]
 [ 30   7 101   8   0]
 [  9   8   4  91   0]
 [  1   0   0   0  19]]
Accuracy : 0.7776280323450134
F1 Score : 0.7862632527078197


## 5) RandomForestClassifier

In [79]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.78      0.90      0.83       317
           1       0.91      0.63      0.75       147
           2       0.79      0.60      0.68       146
           3       0.66      0.84      0.74       112
           4       0.86      0.90      0.88        20

    accuracy                           0.78       742
   macro avg       0.80      0.77      0.78       742
weighted avg       0.79      0.78      0.77       742

[[285   4  10  17   1]
 [ 30  93   6  16   2]
 [ 41   2  87  16   0]
 [  9   2   7  94   0]
 [  1   1   0   0  18]]
Accuracy : 0.7776280323450134
F1 Score : 0.77530651499172
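
Each model section repeats the same fit, predict, and score steps. One way to factor that out is sketched below: `evaluate` is a hypothetical helper, and the toy corpus and labels are invented so the snippet runs on its own. In the notebook the helper would receive `x_train_tfidf`, `y_train`, `x_test_tfidf`, and `y_test` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC


def evaluate(model, x_tr, y_tr, x_te, y_te):
    """Fit a model and return accuracy and macro F1 on the held-out split."""
    model.fit(x_tr, y_tr)
    pred = model.predict(x_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "macro_f1": f1_score(y_te, pred, average="macro"),
    }


# Hypothetical toy corpus; the notebook uses x_train_tfidf / x_test_tfidf instead.
train_docs = ["glob path 오류", "groupby 결과 오류", "이미지 업로드 화면", "업로드 화면 클릭"]
train_labels = [0, 0, 1, 1]
test_docs = ["path 오류 질문", "이미지 클릭"]
test_labels = [0, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x_tr = vectorizer.fit_transform(train_docs)
x_te = vectorizer.transform(test_docs)

models = {"LogisticRegression": LogisticRegression(), "SVC": SVC()}
scores = {name: evaluate(m, x_tr, train_labels, x_te, test_labels)
          for name, m in models.items()}

# Rank the models by macro F1, highest first.
for name, s in sorted(scores.items(), key=lambda kv: kv[1]["macro_f1"], reverse=True):
    print(f"{name}: accuracy={s['accuracy']:.3f}, macro_f1={s['macro_f1']:.3f}")
```

Collecting the scores in one dictionary makes it straightforward to rank the models on the same split instead of comparing printed reports by eye.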
