# Text topic classification
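- The task splits into two main parts: feature generation and classifier selection
- Features are built with count vectors and TF-IDF
- In addition, numeric features can be generated, such as the number of words in a sentence, the number of verbs/nouns/adverbs, and the count of specific words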
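
The numeric features mentioned above can be sketched roughly as follows. This is a minimal illustration, not the feature set used in this notebook: the function name `numeric_features`, the whitespace tokenization, and the example sentence are all assumptions. Part-of-speech counts (verbs, nouns, adverbs) would additionally need a tagger, such as KoNLPy for Korean, and are left out here.

```python
import re


def numeric_features(sentence, keyword):
    """Return simple count-based features for one sentence (hypothetical helper)."""
    # Whitespace tokenization is an assumption made for brevity;
    # Korean text would normally go through a morphological analyzer.
    words = sentence.split()
    return {
        "char_count": len(sentence),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        # Case-insensitive count of one specific word.
        "keyword_count": sum(1 for w in words if w.lower() == keyword.lower()),
        # Anything that is neither a word character nor whitespace.
        "punctuation_count": len(re.findall(r"[^\w\s]", sentence)),
    }


features = numeric_features(
    "Hydrogen cars: the new growth industry, the president says.", "the"
)
print(features)
```

Features like these are dense numeric columns, so they would typically be stacked next to the sparse count or TF-IDF matrix before training.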

## feature
- Count Vectors
- TF-IDF Vectors
    - Word level
    - N-Gram level
    - Character level
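
As a rough sketch of how the feature types listed above differ, the following builds each of them with scikit-learn on a two-sentence toy corpus. The corpus, the variable names, and the n-gram ranges `(2, 3)` are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (an assumption for illustration, not the article data).
corpus = ["the market rose today", "the team won today"]

# Count vectors: raw term frequencies per document.
count_vect = CountVectorizer()
X_count = count_vect.fit_transform(corpus)

# Word-level TF-IDF: one column per word.
tfidf_word = TfidfVectorizer(analyzer="word")
X_word = tfidf_word.fit_transform(corpus)

# N-gram-level TF-IDF: one column per word bigram or trigram.
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3))
X_ngram = tfidf_ngram.fit_transform(corpus)

# Character-level TF-IDF: one column per character bigram or trigram.
tfidf_char = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_char = tfidf_char.fit_transform(corpus)

for name, X in [("count", X_count), ("word", X_word),
                ("ngram", X_ngram), ("char", X_char)]:
    print(name, X.shape)
```

Character-level n-grams need no tokenizer, which can make them a reasonable baseline for languages such as Korean where whitespace tokens carry attached particles.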

## classifier
- Naive Bayes Classifier
- Linear Classifier
- Support Vector Machine
- Bagging Models
- Boosting Models
- Shallow Neural Networks
- Deep Neural Networks
    - Convolutional Neural Network (CNN)
    - Long Short-Term Memory (LSTM)
    - Gated Recurrent Unit (GRU)
    - Bidirectional RNN
    - Recurrent Convolutional Neural Network (RCNN)
    - Other Variants of Deep Neural Networks
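
Several of the classical classifiers listed above can be swapped into the same pipeline without touching the feature step. The sketch below assumes `LogisticRegression` as the example of a linear classifier and `LinearSVC` as the support vector machine; the toy texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data (an assumption, not the article set).
texts = [
    "central bank raises interest rates",
    "stock market closes higher",
    "inflation slows as prices fall",
    "exports boost quarterly growth",
    "team wins the championship final",
    "striker scores twice in league match",
    "coach praises young players",
    "club signs new goalkeeper",
]
labels = ["economy"] * 4 + ["sports"] * 4

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(),
    "linear_svm": LinearSVC(),
}

predictions = {}
for name, estimator in candidates.items():
    # Same TF-IDF features for every classifier; only the last step changes.
    pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", estimator)])
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["stock prices and the central bank"])[0]
    print(name, predictions[name])
```

Keeping the vectorizer inside the pipeline means each candidate is fitted and evaluated on identically built features, which keeps the comparison fair.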

# Article category classification

In [1]:
import pickle
import pandas as pd  # used by the commented-out DataFrame-building code below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn import decomposition, ensemble

## load the crawled article data

# cat_ls = ['정치','세계','생활문화','사회','경제','IT과학']
# data_ls = list()
# for cat in cat_ls:
#     with open('a_ls_{}.pickle'.format(cat), 'rb') as f:
#         temp = pickle.load(f)
#         data_ls.append([{'label': cat, 'article': elem} for elem in temp])
        
# temp_df = pd.concat([pd.DataFrame(d) for d in data_ls], ignore_index=True)

# with open('article_df.pickle', 'wb') as f:
#     pickle.dump(temp_df, f)

# note: pandas==0.24 was used here; unpickling with a different pandas version may fail
with open('article_df.pickle', 'rb') as f:
    df = pickle.load(f)        

In [2]:
print(df.head())

                               article label
0    文대통령, 반년새 7번 PK행...이번엔 김경수 지사도 만나    정치
1     이번엔 수소차…빨라지는 文대통령 '3대 新산업' 육성 행보    정치
2            문 대통령 "미세먼지 원인은 석탄발전·경유차"    정치
3  [전문]文, 환경의날에 "미세먼지 배출 30%감축·수소차 확대"    정치
4   文정부, 800만 달러 첫 대북 인도적 지원···쌀도 추진 중    정치


In [3]:
print('data size: ', len(df))

data size:  10794


In [4]:
x = df['article']
y = df['label']
lb = LabelEncoder()
y_label = lb.fit_transform(y)

x_train, x_test, y_train, y_test = train_test_split(x, y_label, test_size=0.2)
print(' size of x_train:, ', len(x_train), '\n', 'size of y_train:, ', len(y_train))

 size of x_train:,  8635 
 size of y_train:,  8635
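
The split above does not set `random_state`, so the rows that land in the test set change from run to run. A possible variant, shown here on made-up data, fixes the seed and stratifies on the labels so that each class keeps the same proportion in both parts. The seed value `42` and the toy data are assumptions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 documents, two balanced classes (illustration only).
texts = ["doc {}".format(i) for i in range(10)]
labels = [0] * 5 + [1] * 5

x_tr, x_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,      # same ratio as in the notebook
    random_state=42,    # fixed seed for a repeatable split (arbitrary value)
    stratify=labels,    # preserve class proportions in train and test
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```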


In [5]:
lb.classes_

array(['IT과학', '경제', '사회', '생활문화', '세계', '정치'], dtype=object)
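
Since `LabelEncoder` assigns integers in sorted order of the class names, the integer labels can be mapped back to category names with `inverse_transform`. The following is a small self-contained sketch using the six category names; the variable names `encoder` and `index_to_name` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['정치', '세계', '생활문화', '사회', '경제', 'IT과학']

# Fitting sorts the class names; the position in classes_ is the integer label.
encoder = LabelEncoder().fit(categories)
index_to_name = dict(enumerate(encoder.classes_))

# Decode integer predictions back into category names.
decoded = encoder.inverse_transform([0, 5])

print(index_to_name)
print(list(decoded))
```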

# TF-IDF

In [6]:
clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB(alpha=0.09))])

model = clf.fit(x_train, y_train)

y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.69      0.72       278
           1       0.65      0.64      0.64       329
           2       0.77      0.69      0.73       376
           3       0.83      0.73      0.77       408
           4       0.70      0.85      0.77       391
           5       0.79      0.86      0.83       377

   micro avg       0.75      0.75      0.75      2159
   macro avg       0.75      0.74      0.74      2159
weighted avg       0.75      0.75      0.75      2159
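
The report above identifies classes only by integer, and `confusion_matrix` is imported but never used. A hedged sketch of how both could be made more readable follows, using made-up labels and predictions rather than the notebook's results: `target_names` puts category names in the report, and the confusion matrix shows which classes get mixed up.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels and predictions (illustration only, not the notebook's results).
class_names = ["IT과학", "경제", "사회"]
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)

# target_names replaces the integer labels with readable class names.
report = classification_report(y_true, y_hat, target_names=class_names)
print(report)
```

In the notebook itself, the equivalent call would presumably pass `target_names=lb.classes_` together with `y_test` and `y_pred`.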



In [7]:
with open('clf_model.pickle', 'wb') as f:
    pickle.dump(model, f)
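
Because the whole pipeline is pickled, the vectorizer and the classifier are restored together and can be applied directly to raw text. The sketch below demonstrates the round trip on a tiny made-up model written to a temporary directory; the training texts, labels, and file location are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in model with the same pipeline structure as above.
texts = ["bank raises rates", "market closes higher",
         "team wins final", "striker scores twice"]
labels = [0, 0, 1, 1]
model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB(alpha=0.09)),
]).fit(texts, labels)

# Save to a temporary location so the example leaves no files behind elsewhere.
path = os.path.join(tempfile.mkdtemp(), "clf_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the pipeline back and predict on unseen raw text.
with open(path, "rb") as f:
    restored = pickle.load(f)

new_texts = ["market rates today", "final team lineup"]
print(restored.predict(new_texts))
```

A pickled scikit-learn model is generally only safe to load with the same scikit-learn version that produced it, and only from a trusted source.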