In [None]:
# Change the embedding algorithm
# Change the classification algorithm

In [None]:
from tensorflow.python.client import device_lib 
device_lib.list_local_devices()

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [None]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation and symbols such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # convert to lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (matches nothing here, since '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # remove leading whitespace
    review = re.sub(r'\s+$', '', review) # remove trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (a no-op here, since '_' was already stripped above)
    corpus.append(review) 
  
  return corpus
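A minimal sketch of the same cleaning steps applied to a single string, assuming the HTML tags should be stripped before the punctuation pass. The name `clean_one` and the simplified "drop every non-alphanumeric symbol" character class are illustrative choices, not part of the notebook above.

```python
import re

def clean_one(text):
    """Clean one document: tags first, then symbols, digits, case, whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)        # strip HTML tags while '<' and '>' still exist
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # assumption: drop every non-alphanumeric symbol
    text = re.sub(r'\d+', '', text)             # drop digits
    text = text.lower()                         # lowercase
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace and trim both ends
    return text

print(clean_one("<b>Hello</b>, World! 123"))
```

Running the tag pattern first matters only if the raw text actually contains markup; for plain newsgroup-style text the two orderings should behave the same.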

In [None]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

In [None]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

In [None]:
train_X = train['text']
train_y = train['target']

In [None]:
train_X

# baseline

## CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectorizer.fit(train_X)

In [None]:
train_X = vectorizer.transform(train_X)

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])

In [None]:
test_X = test.text # test documents

test_X_vect = vectorizer.transform(test_X) # transform the test documents
# Calling fit_transform on the test data would fit the vectorizer on test data, which is data leakage, so only transform is used here.

In [None]:
test_X_vect
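The fit-on-train, transform-on-test pattern can be sketched without scikit-learn. This is a simplified stand-in, not `CountVectorizer` itself: it splits on whitespace only, and the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Build a token -> column index map from the training documents only."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["space shuttle launch", "hockey game tonight"]
test_docs = ["space game unseenword"]

vocab = fit_vocabulary(train_docs)       # learned from train only
test_rows = transform(test_docs, vocab)  # test is only transformed, never fitted
print(vocab)
print(test_rows)
```

Because the vocabulary never sees the test set, a word that appears only in test (`unseenword` here) simply contributes nothing to the feature row.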

## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
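vectorizer.fit(train['text']) # fit on the raw text; train_X already holds the CountVectorizer matrix

In [None]:
train_X = vectorizer.transform(train['text'])
test_X_vect = vectorizer.transform(test_X) # keep the test features consistent with the TF-IDF train features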

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])
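To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which is my understanding of the scikit-learn defaults; the tokenisation is plain whitespace splitting and `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero for empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["space shuttle space", "hockey space"])
print(vocab)
print(rows[0])
```

A word that occurs in every document (`space`) gets the minimum IDF of 1, so its weight comes mostly from its term frequency, while rarer words are boosted.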

# Algorithm lab

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [None]:
ohe = OneHotEncoder(sparse = False) # on scikit-learn 1.2 or newer, this argument is named sparse_output
y = ohe.fit_transform(train[['target']])

In [None]:
model = Sequential()
model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

In [None]:
model.fit(train_X, y, epochs=10, batch_size=128)

In [None]:
pred = model.predict(test_X_vect)

In [None]:
pred

In [None]:
np.argmax(pred, axis = 1)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(pred, axis = 1)

submission

submission.to_csv('submission_ngram(1,2).csv',index=False)

# Kfold

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [None]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [None]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1])) # one row per test sample, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint for this fold
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation split
    val_pred = best.predict(val_x)
    ### Map each row to the class with the highest probability
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute the validation accuracy for this fold
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test data for this fold and add it to the ensemble average
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred
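The fold-averaging step in the loop above can be sketched in isolation. The probabilities below are made up for illustration, and `average_fold_predictions` and `argmax` are hypothetical helpers; the point is only that the accumulator is shaped by the test set and that each fold contributes `1 / n_folds` of the final probability.

```python
def average_fold_predictions(fold_preds):
    """Average per-fold probability tables of shape (n_test_rows, n_classes)."""
    n_folds = len(fold_preds)
    n_rows, n_classes = len(fold_preds[0]), len(fold_preds[0][0])
    # the accumulator is sized by the test set, not by the training set
    avg = [[0.0] * n_classes for _ in range(n_rows)]
    for pred in fold_preds:
        for r in range(n_rows):
            for c in range(n_classes):
                avg[r][c] += pred[r][c] / n_folds
    return avg

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

# hypothetical probabilities from two folds for two test rows and three classes
fold_1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
fold_2 = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

avg = average_fold_predictions([fold_1, fold_2])
labels = [argmax(row) for row in avg]
print(labels)
```

Averaging probabilities before taking the argmax lets a confident fold outweigh an uncertain one, which a per-fold majority vote would not do.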

In [None]:
np.mean(nn_acc)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

## Stopword removal

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [None]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation and symbols such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # convert to lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (matches nothing here, since '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # remove leading whitespace
    review = re.sub(r'\s+$', '', review) # remove trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (a no-op here, since '_' was already stripped above)
    corpus.append(review) 
  
  return corpus

In [None]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

In [None]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

In [None]:
train_X = train['text']
train_y = train['target']

In [None]:
train_X

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

stop_words = set(stopwords.words('english')) 

In [None]:
# Remove English stopwords; assigning the whole column avoids chained assignment on train_X[i]
train['text'] = train['text'].apply(lambda t: ' '.join(w for w in t.split(' ') if w not in stop_words))
train_X = train['text']

train_X

In [None]:
# Remove English stopwords from the test documents as well
test['text'] = test['text'].apply(lambda t: ' '.join(w for w in t.split(' ') if w not in stop_words))

test
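A minimal sketch of the stopword filter on plain strings. The tiny `STOP_WORDS` set is a hand-written stand-in for NLTK's English list, and `remove_stopwords` is a hypothetical helper name.

```python
# assumption: a tiny hand-written stopword set standing in for NLTK's English list
STOP_WORDS = {"the", "a", "is", "on", "of"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop stopwords and empty tokens, then rejoin with single spaces."""
    return " ".join(w for w in text.split() if w not in stop_words)

docs = ["the shuttle is on the pad", "a game of hockey"]
print([remove_stopwords(d) for d in docs])
```

Using `" ".join(...)` over a generator avoids the leading space that repeated string concatenation leaves behind, and it keeps the membership test against a `set` cheap.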

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectorizer.fit(train_X)

In [None]:
train_X = vectorizer.transform(train_X)

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])

In [None]:
test_X = test.text # test documents

test_X_vect = vectorizer.transform(test_X) # transform the test documents
# Calling fit_transform on the test data would fit the vectorizer on test data, which is data leakage, so only transform is used here.

In [None]:
test_X_vect

In [None]:
skf = StratifiedKFold(n_splits = 6, random_state = 1, shuffle = True)

In [None]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1])) # one row per test sample, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint for this fold
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation split
    val_pred = best.predict(val_x)
    ### Map each row to the class with the highest probability
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute the validation accuracy for this fold
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test data for this fold and add it to the ensemble average
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred

In [None]:
np.mean(nn_acc)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

## Stopwords + TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
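vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# Fit on the raw text column; train_X was overwritten with the CountVectorizer matrix above
vectorizer.fit(train['text'])

In [None]:
train_X_vec = vectorizer.transform(train['text'])

In [None]:
test_vec = vectorizer.transform(test['text']) # pass the text column, not the whole DataFrame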

In [None]:
test_vec[0]

In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

vectorizer.fit(np.array(train["text"]))

train_vec = vectorizer.transform(train["text"])
train_y = train["target"]

test_vec = vectorizer.transform(test["text"])

In [None]:
train_vec # display the sparse matrix itself; pd.DataFrame() does not expand a scipy sparse matrix into feature columns

In [None]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [None]:
nn_acc = []
nn_pred = np.zeros((test_vec.shape[0], y.shape[1])) # one row per test sample, one column per class

# Train on train_vec so the features match test_vec (both come from the ngram_range=(1, 2) vectorizer)
for i, (tr_idx, val_idx) in enumerate(skf.split(train_vec, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_vec[tr_idx], y[tr_idx]
    val_x, val_y = train_vec[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_vec.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint for this fold
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation split
    val_pred = best.predict(val_x)
    ### Map each row to the class with the highest probability
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute the validation accuracy for this fold
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test data for this fold and add it to the ensemble average
    fold_pred = best.predict(test_vec) / skf.n_splits
    nn_pred += fold_pred

In [None]:
val_x

In [None]:
tr_x

In [None]:
tr_y

In [None]:
len(val_y)

## Tokenizer + RNN

In [None]:
train_X

In [None]:
test

In [None]:
X_train = train['text'] # the Tokenizer needs raw text; train_X may hold a vectorized matrix at this point

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_encoded = tokenizer.texts_to_sequences(X_train)
print(X_train_encoded[:5])

In [None]:
# Check which integer index was assigned to each word.
word_to_index = tokenizer.word_index
print(word_to_index)

In [None]:
threshold = 2
total_cnt = len(word_to_index) # vocabulary size
rare_cnt = 0 # number of words that appear fewer than threshold times
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words that appear fewer than threshold times

# Iterate over (word, count) pairs.
for key, value in tokenizer.word_counts.items():
    total_freq = total_freq + value

    # If the word appears fewer than threshold times
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('Number of rare words appearing %s time(s) or fewer: %s'%(threshold - 1, rare_cnt))
print("Share of rare words in the vocabulary (%):", (rare_cnt / total_cnt)*100)
print("Share of rare-word occurrences in the total frequency (%):", (rare_freq / total_freq)*100)
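The same rare-word statistics can be computed on a toy corpus with `collections.Counter`. This is only a sketch with whitespace tokenisation; `rare_word_stats` is a hypothetical helper and the two documents are made up.

```python
from collections import Counter

def rare_word_stats(docs, threshold=2):
    """Return (vocab share, frequency share) of words seen fewer than `threshold` times."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    rare = {tok: c for tok, c in counts.items() if c < threshold}
    vocab_share = len(rare) / len(counts)
    freq_share = sum(rare.values()) / sum(counts.values())
    return vocab_share, freq_share

docs = ["space space shuttle", "space hockey"]
vocab_share, freq_share = rare_word_stats(docs)
print(vocab_share, freq_share)
```

When rare words make up a large share of the vocabulary but a small share of all occurrences, capping the vocabulary (for example through the Keras `Tokenizer`'s `num_words` argument) is one common way to shrink the embedding table at little cost.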

In [None]:
vocab_size = len(word_to_index) + 1
print('Vocabulary size: {}'.format((vocab_size)))

In [None]:
print('Maximum document length : %d' % max(len(sample) for sample in X_train_encoded))
print('Average document length : %f' % (sum(map(len, X_train_encoded))/len(X_train_encoded)))
plt.hist([len(sample) for sample in X_train_encoded], bins=50) # token counts, consistent with the two lines above
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()

In [None]:
max_len = 6737
X_train_padded = pad_sequences(X_train_encoded, maxlen = max_len)
print("Training data shape:", X_train_padded.shape)
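What the padding step does to each sequence can be sketched in a few lines. This assumes the behaviour I understand to be the `pad_sequences` default (pad and truncate at the front, fill with 0) and a positive `maxlen`; `pad_pre` is a hypothetical name, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncate from the front, keeping the tail
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 3], [1, 2, 3, 4, 5]], maxlen=4))
```

Padding every document to the longest one makes each batch as wide as that single outlier, so a shorter `maxlen` chosen from the length histogram is usually worth considering for speed.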

In [None]:
from tensorflow.keras.layers import SimpleRNN, Embedding, Dense
from tensorflow.keras.models import Sequential

embedding_dim = 32
hidden_units = 32

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
model.add(SimpleRNN(hidden_units))
model.add(Dense(20, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(X_train_padded, y, epochs=4, batch_size=256, validation_split=0.3)

# Training takes too long, and the accuracy is low as well.

## Word2Vec

In [None]:
# with CountVectorizer: 9233 x 143522
train_X

In [None]:
train_X

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
sent_text = sent_tokenize(train['text'][0]) # read from the raw text column rather than a vectorized train_X

In [None]:
sent_text

In [None]:
result = word_tokenize(sent_text[0])

In [None]:
result

In [None]:
len(train_X)

In [None]:
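train_X_token = []
for i, text in enumerate(train['text']):
    try:
        sent_text = sent_tokenize(text)
        result = word_tokenize(sent_text[0]) # tokenize the first sentence only
        train_X_token.append(result)
    except IndexError:
        # empty documents produce no sentences; report their index and skip them
        print(i)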

In [None]:
train_X_token

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

model = Word2Vec(sentences=train_X_token, vector_size=100, sg=0) # train on every tokenized document, not just the last one


In [None]:
model.

## Voting

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [None]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation and symbols such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # convert to lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (matches nothing here, since '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # remove leading whitespace
    review = re.sub(r'\s+$', '', review) # remove trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (a no-op here, since '_' was already stripped above)
    corpus.append(review) 
  
  return corpus

In [None]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

In [None]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

In [None]:
train_X = train['text']
train_y = train['target']

In [None]:
train_X

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

stop_words = set(stopwords.words('english')) 

In [None]:
# Remove English stopwords; assigning the whole column avoids chained assignment on train_X[i]
train['text'] = train['text'].apply(lambda t: ' '.join(w for w in t.split(' ') if w not in stop_words))
train_X = train['text']

train_X

In [None]:
# Remove English stopwords from the test documents as well
test['text'] = test['text'].apply(lambda t: ' '.join(w for w in t.split(' ') if w not in stop_words))

test

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words = 'english')

vectorizer.fit(np.array(train["text"]))

train_vec = vectorizer.transform(train["text"])
train_y = train["target"]

test_vec = vectorizer.transform(test["text"])

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier

In [None]:
MLP = MLPClassifier()
NB = MultinomialNB()
LGBM = LGBMClassifier()

VC = VotingClassifier(estimators=[('mlp',MLP),('nb',NB),('lgbm',LGBM)],voting = 'soft')

In [None]:
VC.fit(train_vec,train_y)
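Soft voting, as requested with `voting = 'soft'` above, averages the class probabilities of the individual models and then picks the most probable class. The sketch below illustrates that idea on made-up probabilities; `soft_vote` is a hypothetical helper and does not reproduce `VotingClassifier` internals such as per-model weights.

```python
def soft_vote(prob_tables):
    """Average class probabilities across models and pick the best class per row."""
    n_models = len(prob_tables)
    n_rows, n_classes = len(prob_tables[0]), len(prob_tables[0][0])
    labels = []
    for r in range(n_rows):
        mean = [sum(t[r][c] for t in prob_tables) / n_models for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels

# hypothetical predict_proba outputs from three models for two rows and two classes
mlp = [[0.9, 0.1], [0.4, 0.6]]
nb = [[0.4, 0.6], [0.3, 0.7]]
lgbm = [[0.4, 0.6], [0.6, 0.4]]

print(soft_vote([mlp, nb, lgbm]))
```

In the first row two of the three models lean towards class 1, yet the averaged probability favours class 0 because one model is very confident; this is the main difference from hard (majority) voting.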

In [None]:
model = MLPClassifier()
model.fit(train_vec, train_y)

## bertclassifier

In [None]:
from bert_sklearn import BertClassifier

In [None]:
model = BertClassifier(bert_model="bert-large-cased")

In [None]:
model.fit(train_X, train_y)