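# Hello ^^
# Welcome to the AIVLE Mini Project.
* In this course, you will work through the full process of solving a problem based on real cases and data, as self-directed practice.
* Let's wrap up the earlier curriculum and solve a problem using what we have learned so far!
* The problem-solving process through a mini project, 'from A to Z', starts now!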

## Text Preprocessing
### reference
> * [Google guide](https://developers.google.com/machine-learning/guides/text-classification/step-3)
> * N-grams
>> * [scikit-learn working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#)
>> * [scikit-learn text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
>> * [Korean-language material](https://datascienceschool.net/03%20machine%20learning/03.01.03%20Scikit-Learn%EC%9D%98%20%EB%AC%B8%EC%84%9C%20%EC%A0%84%EC%B2%98%EB%A6%AC%20%EA%B8%B0%EB%8A%A5.html)
> * Sequence
>> * [keras text classification](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [tensorflow text classification](https://www.tensorflow.org/tutorials/keras/text_classification)

### 0. Install and import libraries

In [1]:
## import sklearn
import pandas as pd
import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import tensorflow as tf
import nltk

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

#fm.findSystemFonts()
plt.rcParams['font.family']= ["Malgun Gothic"]
plt.rcParams["axes.unicode_minus"]=False

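# Check the GPU environment
# (tf.test.is_gpu_available() is deprecated; list_physical_devices is the replacement)
# assert tf.config.list_physical_devices('GPU'), 'Check your GPU settings.'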
print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_logical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]


### 1. Load the data

In [2]:
# Load the data and drop rows with missing values.
data = pd.read_csv('./data/spam.csv')
data.dropna(axis=0, inplace=True)

#### 1-1. processing label

In [None]:
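# Convert the label column to numeric values (ham -> 0, spam -> 1).
# A single .loc call avoids chained assignment, which may fail to modify the original DataFrame.
data.loc[data['label']=='ham', 'label'] = 0
data.loc[data['label']=='spam', 'label'] = 1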
data.head()

In [4]:
x = data['text']
y = data['label']
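
The label conversion above can also be written as a single mapping that yields an integer column. A minimal sketch on a hypothetical toy frame (the rows below are made up for illustration and are not taken from `spam.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for spam.csv
toy = pd.DataFrame({
    "text": ["see you at lunch", "win a free prize now", "call me later"],
    "label": ["ham", "spam", "ham"],
})

# Map string labels to integers. Any value other than 'ham' or 'spam'
# would become NaN here and make the astype call fail, which surfaces bad labels early.
toy["label"] = toy["label"].map({"ham": 0, "spam": 1}).astype("int64")

labels = toy["label"].tolist()
print(labels)
```

Compared with assigning through `.loc`, this leaves the column with an integer dtype rather than `object`.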

### 2. Train Validation(Test) Split

In [5]:
# Split the data into train and validation sets (80% / 20%).
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=.2)
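
The split above is random on every run and does not control the class ratio. If reproducibility or class balance matters, `train_test_split` accepts `random_state` and `stratify`. A small sketch on hypothetical imbalanced toy data (the 80/20 class ratio and the seed `42` are assumptions chosen for illustration):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 ham (0) and 20 spam (1)
texts = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible
x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_counts = Counter(y_tr)
val_counts = Counter(y_va)
print(train_counts, val_counts)
```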

### 3. Vectorize texts

#### 3-1. N-grams Vectorize [reference](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files)

In [6]:
# 1. Count Vectorize
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
x_train_c = count_vec.fit_transform(x_train)
x_val_c = count_vec.transform(x_val)
x_train_c.shape 

(16071, 29532)
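
The cell above uses the default `CountVectorizer()`, which counts single words (unigrams) only. A small sketch of how the `ngram_range` parameter would add bigrams, on a made-up two-sentence corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration
corpus = ["free prize now", "call me now"]

# ngram_range=(1, 2) extracts both unigrams and bigrams
vec = CountVectorizer(ngram_range=(1, 2))
counts = vec.fit_transform(corpus)

# vocabulary_ maps each n-gram to its column index
features = sorted(vec.vocabulary_)
print(features)
print(counts.toarray())
```

Adding bigrams grows the vocabulary quickly on a real corpus, which is one reason a feature selection step is useful afterwards.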

In [7]:
# 2. Tf-idf Transform
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_vec = TfidfTransformer()
x_train_tfidf = tfidf_vec.fit_transform(x_train_c)
x_val_tfidf = tfidf_vec.transform(x_val_c)
x_train_tfidf.shape

(16071, 29532)
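
`CountVectorizer` followed by `TfidfTransformer` can be collapsed into `TfidfVectorizer`, which is already imported at the top of this notebook. A minimal sketch on a made-up corpus, checking that both routes give the same matrix with default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus for illustration
corpus = ["free prize now", "call me now", "free call"]

# Two-step version: raw counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step version
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```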

In [8]:
# Select top 'k' of the vectorized features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

TOP_K = 20000
selector = SelectKBest(f_classif, k=min(TOP_K, x_train_tfidf.shape[1]))
selector.fit(x_train_tfidf, y_train)
x_train_ngram = selector.transform(x_train_tfidf).astype('float32')
x_val_ngram = selector.transform(x_val_tfidf).astype('float32')

In [9]:
x_train_ngram.shape

(16071, 20000)
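
`SelectKBest(f_classif, k=...)` scores each feature with an ANOVA F-test against the labels and keeps the `k` highest-scoring columns. A toy sketch with hand-made numbers (not from the spam data), where only the first column separates the two classes:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical features: column 0 separates the classes clearly,
# columns 1 and 2 carry little or no class information
X = np.array([
    [1.0, 0.2, 0.5],
    [0.9, 0.8, 0.4],
    [0.1, 0.3, 0.6],
    [0.0, 0.7, 0.5],
])
y = np.array([1, 1, 0, 0])

selector = SelectKBest(f_classif, k=1).fit(X, y)

# Indices of the columns that were kept
kept = selector.get_support(indices=True).tolist()
X_selected = selector.transform(X)
print(kept, X_selected.shape)
```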

#### 3-2. Sequence Vectorize [reference](https://developers.google.com/machine-learning/guides/text-classification/step-3)

In [None]:
# !pip install torchtext

In [10]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from eunjeon import Mecab

mecab = Mecab()
train_iter = x  # the vocabulary is built from the full text column (train + validation)

def yield_tokens(data_iter):
    for text in data_iter:
        yield mecab.morphs(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
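
The vocabulary built above maps each token to an integer index, but the notebook stops before turning texts into fixed-length sequences. A minimal pure-Python sketch of that step, assuming whitespace tokenization as a stand-in for `mecab.morphs` and a plain `dict` as a stand-in for the torchtext vocabulary. The names `tokenize`, `encode`, `MAX_LEN`, and the `<pad>` token are hypothetical and introduced only for this illustration:

```python
from collections import Counter

# Made-up texts for illustration
texts = ["free prize now", "call me now please"]
MAX_LEN = 5  # assumed fixed sequence length


def tokenize(text):
    # Stand-in tokenizer; the notebook uses mecab.morphs instead
    return text.split()


# Build the vocabulary: special tokens first, then tokens by frequency
counter = Counter(tok for t in texts for tok in tokenize(t))
itos = ["<pad>", "<unk>"] + [tok for tok, _ in counter.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}


def encode(text, max_len=MAX_LEN):
    # Unknown tokens fall back to the <unk> index
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokenize(text)]
    # Truncate long sequences, then right-pad short ones with <pad>
    ids = ids[:max_len]
    return ids + [stoi["<pad>"]] * (max_len - len(ids))


encoded = [encode(t) for t in texts]
unseen = encode("free unknownword")
print(encoded)
print(unseen)
```

The fallback to `<unk>` plays the same role as `vocab.set_default_index(vocab["<unk>"])` in the cell above.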

### 4. Save data

In [11]:
import pickle

#### 4-1. Save N-grams

In [12]:
# Check the shape of the data vectorized with the N-grams approach.
print('X_train ngram shape:', x_train_ngram.shape)
print('X_val ngram shape:', x_val_ngram.shape)

X_train ngram shape: (16071, 20000)
X_val ngram shape: (4018, 20000)


In [13]:
# Save the preprocessed data so that it can be reused when training a model.
with open('./data/x_train_ngram.pickle', 'wb') as f:
    pickle.dump(x_train_ngram, f)

with open('./data/x_val_ngram.pickle', 'wb') as f:
    pickle.dump(x_val_ngram, f)

#### 4-2. Save sequence

In [14]:
with open('./data/vocab.pickle', 'wb') as f:
    pickle.dump(vocab, f)

#### 4-3. Save label

In [15]:
# Check the shape of the label data before saving it.
print('Y_train shape:', y_train.shape)
print('Y_val shape:', y_val.shape)

Y_train shape: (16071,)
Y_val shape: (4018,)


In [16]:
with open('./data/y_train.pickle', 'wb') as f:
    pickle.dump(y_train, f)

with open('./data/y_val.pickle', 'wb') as f:
    pickle.dump(y_val, f)
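
The saved files can be read back with `pickle.load` in the modeling notebook. A self-contained round-trip sketch, using a temporary directory and a hypothetical 2x2 sparse matrix in place of the real `x_train_ngram`:

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small sparse matrix standing in for x_train_ngram
matrix = csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]], dtype="float32"))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "x_train_ngram.pickle")

    # Save, as in the cells above
    with open(path, "wb") as f:
        pickle.dump(matrix, f)

    # Load it back, as a later notebook would
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = np.array_equal(matrix.toarray(), restored.toarray())
print(round_trip_ok, restored.dtype, restored.shape)
```

Pickle files should only be loaded from sources you trust, and loading works most reliably with the same library versions that wrote them.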