# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 1: Data Exploration

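### Problem definition
> Classifying the content of 1:1 inquiries<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of the inquiry classification model

### Training data
> * 1:1 inquiry content data: train.csv

### Variables
> * text: inquiry content
> * label: inquiry type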

### References
> * Korean language processing
>> * [konlpy - Korean NLP library](https://konlpy.org/ko/latest/)
>> * [Comparison table of Korean POS tags](https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)
>> * [Korean POS tagging performance comparison](https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes)
>> * [Korean system dictionaries](https://konlpy.org/ko/latest/data/#corpora)

> * Natural language processing
>> * [NLTK](https://www.nltk.org/book/)
>> * [gensim](https://radimrehurek.com/gensim/)
>> * [Google guide](https://developers.google.com/machine-learning/guides/text-classification/step-2)
>> * [WordCloud](https://amueller.github.io/word_cloud/)

In [None]:
# !sudo apt-get install -y fonts-nanum
# !sudo fc-cache -fv
# !rm ~/.cache/matplotlib -rf

In [None]:
import matplotlib.pyplot as plt

# plt.rc('font', family='NanumBarunGothic') 

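## 1. Development environment setup

* Detailed requirements
  - The code already imports the basic libraries that are needed.
  - Add any other libraries you consider necessary.
  - Install konlpy and mecab, then create a morphological analysis function.
  - On a Windows PC, mecab must be installed by a different procedure.
  - On Windows, the KoNLPy library installation may not complete properly.
  - Reference link for installing on Windows:
    - https://liveyourit.tistory.com/56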

### 1-1. Install libraries

In [None]:
# Install the required libraries first.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### 1-2. Import libraries

In [None]:
import os
from collections import Counter

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import wget
from IPython.display import display
from mecab import MeCab
from wordcloud import WordCloud

### 1-4. Mount Google Drive (Colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Load the data

* Provided data
 - Training and validation data: train.csv

### 2-1. Data loading

* Load the following data.
    * Training and validation data: train.csv
    * Check its shape.

In [None]:
data_path = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/train.csv'
data = pd.read_csv(data_path)

In [None]:
data.shape

(3706, 2)

### 2-2. Inspect the data
* Check the distribution of inquiry types
* Check data types and missing values

In [None]:
data.head()

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,웹
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,코드2
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",코드2
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,코드2


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3706 entries, 0 to 3705
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3706 non-null   object
 1   label   3706 non-null   object
dtypes: object(2)
memory usage: 58.0+ KB


In [None]:
data.describe()

Unnamed: 0,text,label
count,3706,3706
unique,3706,6
top,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
freq,1,1097
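
The cells above report the number of unique labels and the most frequent one, but not the full distribution of inquiry types. A minimal sketch of how that could be checked follows. The `toy` frame is a hypothetical stand-in for `data` with made-up rows; only the `text` and `label` column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame is loaded from train.csv.
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'label': ['코드2', '코드2', '웹', '이론', '코드2'],
})

# Absolute count per inquiry type, most frequent first
label_counts = toy['label'].value_counts()

# Relative share of each inquiry type
label_share = toy['label'].value_counts(normalize=True)

# Number of missing values per column
missing = toy.isna().sum()

print(label_counts)
print(label_share.round(2))
print(missing)
```

Running the same three calls on `data` would show how imbalanced the inquiry types are before any modeling decision is made.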


# Preprocessing

In [None]:
label_dict = {
    '코드1': 0,
    '코드2': 0,
    '웹': 1,
    '이론': 2,
    '시스템 운영': 3,
    '원격': 4
}

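# Encode labels as integers; '코드1' and '코드2' are merged into class 0.
# Mapping only the label column leaves the text column untouched.
df = data.copy()
df['label'] = df['label'].map(label_dict)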
df.head()

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0
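
One possible sanity check for the label encoding is sketched below. It relies on `Series.map`, which turns any label missing from the dictionary into `NaN`, so unmapped labels can be listed explicitly. The sample labels, including `'기타'`, are hypothetical and chosen only to trigger that case.

```python
import pandas as pd

label_dict = {'코드1': 0, '코드2': 0, '웹': 1, '이론': 2, '시스템 운영': 3, '원격': 4}

# Hypothetical sample; '기타' is deliberately absent from label_dict
labels = pd.Series(['코드1', '코드2', '웹', '원격', '기타'])

# Labels without an entry in label_dict become NaN
encoded = labels.map(label_dict)

# Collect the original labels that could not be encoded
unmapped = labels[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```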


### Keep only common nouns (NNG), proper nouns (NNP), and foreign words (SL)

In [None]:
%%time
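# POS tags to keep: common nouns (NNG), proper nouns (NNP), foreign words (SL)
nouns = ['NNG', 'NNP', 'SL']
mecab = MeCab()
result = []
for text in df['text']:
    # mecab.pos returns (word, tag) pairs; keep matching words, lower-cased
    words = [word.lower() for word, tag in mecab.pos(text) if tag in nouns]
    result.append(' '.join(words))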

CPU times: user 5.54 s, sys: 24.3 ms, total: 5.57 s
Wall time: 5.66 s
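
The filtering step above can be tried without installing MeCab by wrapping it in a small function. This is only a sketch: `keep_by_tag` is a hypothetical helper name, and the `(word, tag)` pairs are hand-written stand-ins, not real tagger output.

```python
def keep_by_tag(tagged, keep_tags=('NNG', 'NNP', 'SL')):
    """Join the lower-cased words whose POS tag is in keep_tags."""
    return ' '.join(word.lower() for word, tag in tagged if tag in keep_tags)


# Hand-written (word, tag) pairs standing in for tagger output
tagged = [('PATH', 'SL'), ('에', 'JKB'), ('파일', 'NNG'), ('이', 'JKS'), ('있', 'VV')]

filtered = keep_by_tag(tagged)
print(filtered)  # path 파일
```

In the notebook the same function could be applied as `df['text'].apply(lambda t: keep_by_tag(mecab.pos(t)))`, assuming `mecab.pos` returns `(word, tag)` pairs as it does in the loop above.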


In [None]:
df['nouns'] = result

In [None]:
df.head()

Unnamed: 0,text,label,nouns
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0,self convs nn modulelist nn conv d co k for k ...
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1,이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0,glob glob path 사용 때 질문 path 포함 작동 질문 제공 파일 aiv...
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0,tmpp tmp groupby by addr as index false catego...
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0,filename test image str round frame sec jpg te...


### Stopword removal

In [None]:
# Stopword list (one word per line)
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/불용어.txt'
with open(filename, encoding='utf-8') as f:
    stop = f.read()
stop_words = set(stop.split('\n'))

# Rare-word list (one word per line)
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/희귀단어.txt'
with open(filename, encoding='utf-8') as f:
    sparse = f.read()
sparse_words = set(sparse.split('\n'))

# Merge both lists into one set so membership tests stay fast
stop_words.update(sparse_words)

In [None]:
# Drop every token that appears in the stopword / rare-word list
result = []
for doc in df['nouns']:
    tokens = doc.split()
    result.append(' '.join([tok for tok in tokens if tok not in stop_words]))

In [None]:
df['nouns'] = result

In [None]:
df.head()

Unnamed: 0,text,label,nouns
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0,self nn nn conv d co k for k in 커널 사이즈 k 은 단어 ...
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1,이미지 업로드 자바 스크립트 동적 폼 생성 클릭 기본 예제 코드 이유
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0,glob glob path 사용 질문 path 포함 작동 질문 제공 파일 aivle...
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0,tmpp tmp groupby by addr as index false catego...
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0,filename test image str round frame sec jpg te...
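
The removal step can be illustrated on a single row. The sketch below assumes a tiny hypothetical stopword set. The actual contents of 불용어.txt and 희귀단어.txt are not shown in this notebook, so `'때'` is used only because it disappears from row 2 in the output above.

```python
def remove_stopwords(text, stop_words):
    """Return text with every whitespace-separated token in stop_words dropped."""
    return ' '.join(tok for tok in text.split() if tok not in stop_words)


# Hypothetical entries; the real lists are read from the two text files
stop_words = {'때', 'modulelist'}

cleaned = remove_stopwords('glob glob path 사용 때 질문', stop_words)
print(cleaned)  # glob glob path 사용 질문
```

Using a `set` rather than a `list` keeps each membership test at constant time on average, which matters once the combined word list grows large.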


## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df['nouns'], df['label'], test_size=0.2, 
    random_state=2023, stratify=df['label'])
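
The effect of `stratify` can be checked on toy data. The following sketch uses hypothetical labels (80 of class 0, 20 of class 1) and the same `test_size` and `random_state` as the cell above. It shows that both splits keep the original class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 documents of class 0, 20 of class 1
y = pd.Series([0] * 80 + [1] * 20)
X = pd.Series([f'doc {i}' for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2023, stratify=y)

# Class shares in each split
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

print(train_share.to_dict())
print(test_share.to_dict())
```

Without `stratify`, a rare class such as label 4 (20 test samples in this notebook) could end up over- or under-represented in the test split purely by chance.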

## TF-IDF + N-Gram

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
x_train = x_train.astype(str)
x_test = x_test.astype(str)

In [None]:
Tfidf_vect = TfidfVectorizer(ngram_range=(1, 3))
Tfidf_vect.fit(x_train)

In [None]:
x_train_tfidf = Tfidf_vect.transform(x_train)
x_test_tfidf = Tfidf_vect.transform(x_test)

# Models

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 1) Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.81      0.93      0.86       317
           1       0.88      0.79      0.83       147
           2       0.79      0.69      0.74       146
           3       0.85      0.78      0.81       112
           4       0.93      0.65      0.76        20

    accuracy                           0.82       742
   macro avg       0.85      0.77      0.80       742
weighted avg       0.83      0.82      0.82       742

[[295   5  13   4   0]
 [ 18 116   8   4   1]
 [ 38   2 101   5   0]
 [ 13   6   6  87   0]
 [  2   3   0   2  13]]
Accuracy : 0.8247978436657682
F1 Score : 0.8020787016164552
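
The reported accuracy and macro F1 can be recomputed from the confusion matrix alone, which makes explicit how the two numbers are defined. The sketch below copies the Logistic Regression confusion matrix printed above and assumes rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Logistic Regression confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([
    [295,   5,  13,   4,   0],
    [ 18, 116,   8,   4,   1],
    [ 38,   2, 101,   5,   0],
    [ 13,   6,   6,  87,   0],
    [  2,   3,   0,   2,  13],
])

tp = np.diag(cm)                    # correctly classified samples per class
precision = tp / cm.sum(axis=0)     # column sums = predicted count per class
recall = tp / cm.sum(axis=1)        # row sums = true count per class
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()      # share of all samples on the diagonal
macro_f1 = f1.mean()                # unweighted mean over the five classes

print(recall.round(2))
print(round(accuracy, 4), round(macro_f1, 4))
```

Because macro F1 weights every class equally, the 20 samples of class 4 influence it as much as the 317 samples of class 0, whereas accuracy is dominated by the large classes.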


## 2) SVC

In [None]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.76      0.94      0.84       317
           1       0.89      0.77      0.82       147
           2       0.80      0.62      0.70       146
           3       0.90      0.77      0.83       112
           4       0.86      0.60      0.71        20

    accuracy                           0.81       742
   macro avg       0.84      0.74      0.78       742
weighted avg       0.82      0.81      0.80       742

[[298   5  10   4   0]
 [ 23 113   7   2   2]
 [ 49   3  90   4   0]
 [ 17   4   5  86   0]
 [  6   2   0   0  12]]
Accuracy : 0.807277628032345
F1 Score : 0.7789467972870779


## 3) LGBM

In [None]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84       317
           1       0.76      0.74      0.75       147
           2       0.72      0.69      0.70       146
           3       0.78      0.79      0.79       112
           4       0.87      1.00      0.93        20

    accuracy                           0.79       742
   macro avg       0.79      0.81      0.80       742
weighted avg       0.79      0.79      0.79       742

[[267  19  25   5   1]
 [ 18 109  11   7   2]
 [ 27   5 101  13   0]
 [  9  10   4  89   0]
 [  0   0   0   0  20]]
Accuracy : 0.7897574123989218
F1 Score : 0.8020781327528242


## 4) RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       317
           1       0.76      0.71      0.74       147
           2       0.81      0.58      0.68       146
           3       0.73      0.83      0.78       112
           4       0.86      0.90      0.88        20

    accuracy                           0.78       742
   macro avg       0.79      0.78      0.78       742
weighted avg       0.78      0.78      0.78       742

[[279  18   9  10   1]
 [ 24 105   6  10   2]
 [ 37   9  85  15   0]
 [  9   5   5  93   0]
 [  1   1   0   0  18]]
Accuracy : 0.7816711590296496
F1 Score : 0.7807526863099554
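
For a quick side-by-side view, the scores printed so far can be collected and ranked. This is a sketch in plain Python. The numbers are copied from the outputs above, and sorting by macro F1 is an assumption about which metric matters most here.

```python
# Accuracy and macro F1 copied from the outputs above
results = {
    'LogisticRegression': {'accuracy': 0.8247978436657682, 'macro_f1': 0.8020787016164552},
    'SVC': {'accuracy': 0.807277628032345, 'macro_f1': 0.7789467972870779},
    'LGBMClassifier': {'accuracy': 0.7897574123989218, 'macro_f1': 0.8020781327528242},
    'RandomForestClassifier': {'accuracy': 0.7816711590296496, 'macro_f1': 0.7807526863099554},
}

# Rank models by macro F1, best first
ranked = sorted(results.items(), key=lambda kv: kv[1]['macro_f1'], reverse=True)

for name, scores in ranked:
    print(f"{name:<24} acc={scores['accuracy']:.4f}  macro_f1={scores['macro_f1']:.4f}")
```

Logistic Regression and LGBM are nearly tied on macro F1, but for different reasons: LGBM recalls all 20 samples of class 4, while Logistic Regression is more accurate on the large classes.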


## 5) CatBoost

In [None]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [None]:
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

Learning rate set to 0.083635
0:	learn: 1.5456717	total: 1.51s	remaining: 25m 9s
1:	learn: 1.4964686	total: 2.87s	remaining: 23m 53s
2:	learn: 1.4596484	total: 4.19s	remaining: 23m 11s
3:	learn: 1.4264169	total: 5.85s	remaining: 24m 16s
4:	learn: 1.4008876	total: 8s	remaining: 26m 31s
5:	learn: 1.3769488	total: 9.7s	remaining: 26m 46s
6:	learn: 1.3474899	total: 11.1s	remaining: 26m 8s
7:	learn: 1.3186513	total: 12.4s	remaining: 25m 39s
8:	learn: 1.3035506	total: 13.8s	remaining: 25m 15s
9:	learn: 1.2825091	total: 15.1s	remaining: 24m 58s
10:	learn: 1.2635593	total: 16.5s	remaining: 24m 40s
11:	learn: 1.2502193	total: 17.8s	remaining: 24m 26s
12:	learn: 1.2316449	total: 19.3s	remaining: 24m 21s
13:	learn: 1.2116725	total: 21.4s	remaining: 25m 9s
14:	learn: 1.2013314	total: 23.3s	remaining: 25m 28s
15:	learn: 1.1924720	total: 24.6s	remaining: 25m 14s
16:	learn: 1.1842177	total: 25.9s	remaining: 24m 59s
17:	learn: 1.1721749	total: 27.3s	remaining: 24m 50s
18:	learn: 1.1621083	total: 28.7s