# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 1: Data Exploration

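### Problem Definition
> Classifying the content of 1:1 inquiries<br>
> 1. Analyze the inquiry text
> 2. Evaluate the performance of the inquiry classification model
### Training Data
> * 1:1 inquiry data: train.csv

### Variables
> * text: inquiry content
> * label: inquiry type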

### References
> * Korean language processing
>> * [konlpy - Korean language processing library](https://konlpy.org/ko/latest/)
>> * [Comparison table of Korean POS tags](https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)
>> * [Performance comparison of Korean POS taggers](https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes)
>> * [Korean system dictionaries](https://konlpy.org/ko/latest/data/#corpora)

> * Natural language processing
>> * [NLTK](https://www.nltk.org/book/)
>> * [gensim](https://radimrehurek.com/gensim/)
>> * [Google guide](https://developers.google.com/machine-learning/guides/text-classification/step-2)
>> * [WordCloud](https://amueller.github.io/word_cloud/)

In [None]:
# !sudo apt-get install -y fonts-nanum
# !sudo fc-cache -fv
# !rm ~/.cache/matplotlib -rf

In [None]:
# import matplotlib.pyplot as plt

# plt.rc('font', family='NanumBarunGothic') 

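## 1. Development Environment Setup

* Detailed requirements
  - The code that imports the basic libraries is already provided.
  - Add any other libraries you find necessary.
  - After installing konlpy and mecab, write a morphological analysis function.
  - On a Windows PC, mecab has to be installed with a different procedure.
  - On Windows, the KoNLPy libraries may not install correctly.
  - Reference link for installing on Windows:
    - https://liveyourit.tistory.com/56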

### 1-1. Installing Libraries

In [1]:
# Install the required libraries first.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (575 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Collecting python-mecab-ko-dic
  Downloading python_mecab_ko_dic-2.1.1.post2-py3-none-any.whl (34.5 MB)


### 1-2. Importing Libraries

In [2]:
from mecab import MeCab
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from IPython.display import display
import matplotlib.font_manager as fm
from wordcloud import WordCloud
from collections import Counter
import wget, os

### 1-4. Connecting Google Drive (Colab)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Loading the Data

* Provided data
 - Training and validation data: train.csv

### 2-1. Data Loading

* Load the following data.
    * Training and validation data: train.csv
    * Check its shape.

In [4]:
data_path = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/train.csv'
data = pd.read_csv(data_path)

In [5]:
data.shape

(3706, 2)

### 2-2. Inspecting the Data
* Check the distribution of inquiry types
* Check data types and missing values
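
The two checks listed above can be sketched on a small stand-in frame. The `toy` DataFrame and its rows are invented for illustration; only the column names `text` and `label` come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for train.csv (same column names, invented rows)
toy = pd.DataFrame({
    'text': ['How do I mount a drive?', 'The page returns a 404.', None, 'What is overfitting?'],
    'label': ['system', 'web', 'web', 'theory'],
})

# Distribution of inquiry types: absolute counts and relative frequencies
label_counts = toy['label'].value_counts()
label_ratio = toy['label'].value_counts(normalize=True)

# Missing values per column and the dtype of each column
missing = toy.isna().sum()
dtypes = toy.dtypes

print(label_counts)
print(label_ratio.round(2))
print(missing)
print(dtypes)
```

On the real data, the same calls on `data` would show how unevenly the six labels are distributed, which is worth knowing before choosing a split strategy and an averaging mode for F1.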

In [6]:
data.head()

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,웹
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,코드2
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",코드2
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,코드2


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3706 entries, 0 to 3705
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3706 non-null   object
 1   label   3706 non-null   object
dtypes: object(2)
memory usage: 58.0+ KB


In [8]:
data.describe()

Unnamed: 0,text,label
count,3706,3706
unique,3706,6
top,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",코드2
freq,1,1097


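# Preprocessing

In [9]:
# Map each inquiry type to an integer class id.
# 코드1 and 코드2 share id 0, so this mapping merges them into one class.
label_dict = {
    '코드1': 0,      # code 1
    '코드2': 0,      # code 2
    '웹': 1,         # web
    '이론': 2,       # theory
    '시스템 운영': 3,  # system operation
    '원격': 4        # remote
}

# Map only the label column; a frame-wide replace() would also rewrite any text cell equal to a key.
# Note: the cells below keep using `data` with the original string labels, not `df`.
df = data.assign(label=data['label'].map(label_dict))
df.head()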

Unnamed: 0,text,label
0,"self.convs1 = nn.ModuleList([nn.Conv2d(1, Co, ...",0
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...,1
2,glob.glob(PATH) 를 사용할 때 질문입니다.\n\nPATH에 [ ] 가 ...,0
3,"tmpp = tmp.groupby(by = 'Addr1', as_index=Fals...",0
4,filename = TEST_IMAGE + str(round(frame_sec)) ...,0


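### Extract nouns only

In [10]:
# Create the tagger once and reuse it for every row
mecab = MeCab()

def nouns_mecab(text):
    # Return the list of nouns that MeCab finds in the text
    return mecab.nouns(text)

data['text_k_nouns'] = data['text'].apply(nouns_mecab)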

### Extract English nouns only

In [11]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
import re

def nouns_e_nltk(text):
    # Lowercase, tokenize, and tag each token with its part of speech
    text = text.lower()
    word_tokens = nltk.word_tokenize(text)
    tokens_pos = nltk.pos_tag(word_tokens)

    # Keep tokens whose tag contains 'NN' (NN, NNS, NNP, NNPS)
    NN_words = [word for word, pos in tokens_pos if 'NN' in pos]

    # Reduce each noun to its lemma
    wlem = nltk.WordNetLemmatizer()
    lemmatized_words = [wlem.lemmatize(word) for word in NN_words]

    # Drop English stop words using the list that NLTK provides
    stopwords_set = set(stopwords.words('english'))
    final_NN_words = [word for word in lemmatized_words if word not in stopwords_set]

    # Replace every character outside [a-zA-Z0-9] with a space, then split into tokens again
    return re.sub('[^a-zA-Z0-9]', ' ', str(final_NN_words)).split()

data['text_e_nouns'] = data['text'].apply(nouns_e_nltk)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


### Korean nouns + English nouns

In [12]:
data['text_nouns'] = data['text_k_nouns'] + data['text_e_nouns']
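
Because both columns hold Python lists, `+` concatenates the two lists row by row rather than adding numbers. A minimal sketch with an invented two-row frame (the values are hypothetical; the column names match the cell above):

```python
import pandas as pd

# Hypothetical noun lists for two rows
toy = pd.DataFrame({
    'text_k_nouns': [['이미지', '업로드'], []],
    'text_e_nouns': [['form'], ['glob', 'path']],
})

# `+` on two object-dtype Series is applied element by element,
# so each pair of lists is concatenated
toy['text_nouns'] = toy['text_k_nouns'] + toy['text_e_nouns']

combined = toy['text_nouns'].tolist()
print(combined)
```

A row with no Korean nouns simply keeps its English nouns, so no row is lost in the merge.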

## Train Test Split

In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    data['text_nouns'], data['label'], test_size=0.2, 
    random_state=2023, stratify=data['label'])
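
The effect of `stratify` can be checked on a small imbalanced label list. This is only a sketch: `X` and `y` below are invented, while `test_size` and `random_state` mirror the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 'a', 20 of class 'b'
X = list(range(100))
y = ['a'] * 80 + ['b'] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2023, stratify=y)

train_counts = Counter(y_tr)  # class counts in the training part
test_counts = Counter(y_te)   # class counts in the test part
print(train_counts)
print(test_counts)
```

Both parts keep the 4:1 class ratio of the full list, which matters here because the smallest class in the report has only 20 test samples.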

## TF-IDF + N-Gram

In [17]:
# Load the stop-word list (the file is split on newlines, one word per line)
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/불용어.txt'
with open(filename, encoding='utf-8') as f:
    stop = f.read()
stop_words = list(set(stop.split('\n')))

# Load the rare-word list in the same way
filename = '/content/drive/MyDrive/에이블스쿨/실습파일/2023.04.03_미니프로젝트4차_실습자료/data/희귀단어.txt'
with open(filename, encoding='utf-8') as f:
    sparse = f.read()
sparse_words = list(set(sparse.split('\n')))

# Treat rare words as stop words too
stop_words.extend(sparse_words)
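
Splitting on `'\n'` leaves an empty string in the list whenever the file ends with a newline or contains a blank line. One way to avoid that is sketched below; `load_word_list` and the temporary file are hypothetical helpers, not part of the original notebook.

```python
import os
import tempfile

def load_word_list(path, encoding='utf-8'):
    """Read one word per line, dropping blank lines, duplicates, and surrounding whitespace."""
    with open(path, encoding=encoding) as f:
        return sorted({line.strip() for line in f if line.strip()})

# Write a small hypothetical word file with a duplicate and a blank line
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('the\nand\n\nthe\n')
    words = load_word_list(path)

print(words)
```

The stop-word and rare-word files could both be read with such a helper and the two results concatenated.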

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
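# Each row is a list of nouns; astype(str) turns it into its string form, e.g. "['a', 'b']".
# TfidfVectorizer then re-tokenizes that string with its default token pattern.
x_train = x_train.astype(str)
x_test = x_test.astype(str)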

In [20]:
Tfidf_vect = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_words)
Tfidf_vect.fit(x_train)



In [21]:
x_train_tfidf = Tfidf_vect.transform(x_train)
x_test_tfidf = Tfidf_vect.transform(x_test)

# Models

In [22]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 1) Logistic Regression

In [23]:
from sklearn.linear_model import LogisticRegression

In [24]:
model = LogisticRegression()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

      시스템 운영       0.92      0.70      0.79       112
          원격       0.93      0.65      0.76        20
           웹       0.79      0.75      0.77       146
          이론       0.75      0.73      0.74       146
         코드1       0.94      0.77      0.84        98
         코드2       0.69      0.89      0.78       220

    accuracy                           0.78       742
   macro avg       0.84      0.75      0.78       742
weighted avg       0.79      0.78      0.78       742

[[ 78   0   8   7   3  16]
 [  0  13   4   1   0   2]
 [  2   1 110  10   0  23]
 [  2   0   7 106   1  30]
 [  2   0   2   3  75  16]
 [  1   0   9  14   1 195]]
Accuracy : 0.7776280323450134
F1 Score : 0.7806799736421773
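
The accuracy and macro F1 reported above can be recomputed directly from the Logistic Regression confusion matrix. This is a sketch in plain NumPy, not the scikit-learn implementation; the matrix values are copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([
    [78,  0,   8,   7,  3,  16],
    [ 0, 13,   4,   1,  0,   2],
    [ 2,  1, 110,  10,  0,  23],
    [ 2,  0,   7, 106,  1,  30],
    [ 2,  0,   2,   3, 75,  16],
    [ 1,  0,   9,  14,  1, 195],
])

tp = np.diag(cm)              # correctly classified samples per class
support = cm.sum(axis=1)      # true samples per class
predicted = cm.sum(axis=0)    # predicted samples per class

accuracy = tp.sum() / cm.sum()
precision = tp / predicted
recall = tp / support
f1 = 2 * tp / (support + predicted)   # same as 2PR / (P + R)
macro_f1 = f1.mean()                  # unweighted mean over the six classes

print('Accuracy :', accuracy)
print('F1 Score :', macro_f1)
```

Macro averaging gives the 20-sample class the same weight as the 220-sample class, which is why macro F1 and accuracy can rank the models differently.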


## 2) SVC

In [25]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

      시스템 운영       0.92      0.63      0.75       112
          원격       0.94      0.75      0.83        20
           웹       0.84      0.75      0.79       146
          이론       0.75      0.71      0.73       146
         코드1       0.95      0.74      0.83        98
         코드2       0.66      0.90      0.76       220

    accuracy                           0.77       742
   macro avg       0.84      0.75      0.78       742
weighted avg       0.80      0.77      0.77       742

[[ 71   0   6  11   3  21]
 [  0  15   2   1   0   2]
 [  2   1 110   7   0  26]
 [  2   0   6 104   0  34]
 [  1   0   0   3  73  21]
 [  1   0   7  12   1 199]]
Accuracy : 0.77088948787062
F1 Score : 0.7844257092860193


## 3) LGBM

In [26]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [27]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

      시스템 운영       0.76      0.69      0.72       112
          원격       0.91      1.00      0.95        20
           웹       0.64      0.67      0.65       146
          이론       0.68      0.68      0.68       146
         코드1       0.80      0.82      0.81        98
         코드2       0.71      0.70      0.71       220

    accuracy                           0.71       742
   macro avg       0.75      0.76      0.75       742
weighted avg       0.72      0.71      0.71       742

[[ 77   0  13  10   4   8]
 [  0  20   0   0   0   0]
 [  9   1  98  12   4  22]
 [  7   0  13 100   3  23]
 [  1   0   3   5  80   9]
 [  7   1  27  21   9 155]]
Accuracy : 0.7142857142857143
F1 Score : 0.7544090080840101


## 4) RandomForestClassifier

In [28]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=2023)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

      시스템 운영       0.79      0.75      0.77       112
          원격       0.89      0.85      0.87        20
           웹       0.62      0.74      0.68       146
          이론       0.73      0.67      0.70       146
         코드1       0.89      0.85      0.87        98
         코드2       0.74      0.72      0.73       220

    accuracy                           0.74       742
   macro avg       0.78      0.76      0.77       742
weighted avg       0.74      0.74      0.74       742

[[ 84   0  13   7   2   6]
 [  1  17   1   1   0   0]
 [  7   1 108  11   3  16]
 [  6   0  18  98   0  24]
 [  0   0   2   3  83  10]
 [  9   1  32  15   5 158]]
Accuracy : 0.738544474393531
F1 Score : 0.7681079338309296


## 5) CatBoost

In [29]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [30]:
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(x_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('F1 Score :', f1_score(y_test, y_pred, average='macro'))

Learning rate set to 0.083635
0:	learn: 1.7211642	total: 791ms	remaining: 13m 10s
1:	learn: 1.6665397	total: 1.34s	remaining: 11m 7s
2:	learn: 1.6275366	total: 2s	remaining: 11m 5s
3:	learn: 1.5925692	total: 2.95s	remaining: 12m 13s
4:	learn: 1.5661446	total: 3.79s	remaining: 12m 35s
5:	learn: 1.5325586	total: 4.78s	remaining: 13m 11s
6:	learn: 1.5047102	total: 5.62s	remaining: 13m 16s
7:	learn: 1.4818843	total: 6.17s	remaining: 12m 44s
8:	learn: 1.4606002	total: 6.71s	remaining: 12m 19s
9:	learn: 1.4406856	total: 7.27s	remaining: 11m 59s
10:	learn: 1.4172894	total: 7.8s	remaining: 11m 41s
11:	learn: 1.4031936	total: 8.36s	remaining: 11m 28s
12:	learn: 1.3892300	total: 8.9s	remaining: 11m 15s
13:	learn: 1.3721973	total: 9.47s	remaining: 11m 6s
14:	learn: 1.3602857	total: 10s	remaining: 10m 57s
15:	learn: 1.3452632	total: 10.6s	remaining: 10m 49s
16:	learn: 1.3320357	total: 11.1s	remaining: 10m 42s
17:	learn: 1.3203177	total: 11.6s	remaining: 10m 35s
18:	learn: 1.3093430	total: 12.2s	re