In [None]:
# Change the embedding algorithm
# Change the classification algorithm

In [None]:
from tensorflow.python.client import device_lib 
device_lib.list_local_devices()

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [3]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (no-op here: '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # trim leading whitespace
    review = re.sub(r'\s+$', '', review) # trim trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (no-op here: '_' was already stripped above)
    corpus.append(review) 
  
  return corpus
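
The order of the substitutions matters: the first pattern already deletes `>`, so the later tag pattern `<[^>]+>` has nothing left to match. A minimal sketch of a tag-first variant follows. The name `clean_one` and the simplified punctuation class are assumptions for illustration; this is not the function that produced the outputs in this notebook.

```python
import re

# Hypothetical helper (not part of the original notebook): the same kind of steps
# as clean_text, but HTML tags are removed before punctuation is stripped.
PUNCT = re.compile(r'[@%\\*=()/~#&+?\-|.:;!,_$\'"\n\]\[<>]')

def clean_one(text):
    text = re.sub(r'<[^>]+>', ' ', text)      # drop HTML tags while '<' and '>' still exist
    text = PUNCT.sub('', text)                # drop punctuation
    text = re.sub(r'\d+', '', text)           # drop digits
    text = text.lower()                       # lowercase
    return re.sub(r'\s+', ' ', text).strip()  # collapse and trim whitespace

print(clean_one('<b>Hello</b>, World! 123'))
```

Replacing tags with a space rather than an empty string keeps the words on either side of a tag from being glued together.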

In [4]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

Unnamed: 0,text,target
0,they were and even if washington might conside...,10
1,we run spacenews views on our stareach bbs a l...,14
2,not to worry the masons have been demonized an...,19
3,only brendan mckay or maybe arf would come to ...,17
4,help i am running some sample problems from or...,5


In [5]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

Unnamed: 0,text
0,the vlide adapter can be much faster then the ...
1,yeah in a fire that reportedly burned hotter t...
2,judge i grant you immunity from whatever may b...
3,i too put a corbin seat on my hawk i got the s...
4,do i ever after years of having health problem...


In [None]:
train_X = train['text']
train_y = train['target']

In [None]:
train_X

# baseline

## CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectorizer.fit(train_X)

In [None]:
train_X = vectorizer.transform(train_X)

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])

In [None]:
test_X = test.text # raw test documents

test_X_vect = vectorizer.transform(test_X) # transform only, using the vocabulary learned from train
# Calling fit_transform on the test data would train the vectorizer on the test set, which is data leakage.
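
The fit/transform split can be sketched on toy documents (made up for illustration): the vocabulary is learned from the training texts only, and a word that appears only in the test text is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for illustration).
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]

vec = CountVectorizer()
vec.fit(train_docs)                      # vocabulary comes from train only
test_counts = vec.transform(test_docs)   # test is only projected onto that vocabulary

print(sorted(vec.vocabulary_))           # 'meowed' is absent
print(test_counts.toarray())
```

Because the test matrix has exactly as many columns as the training matrix, a model trained on one can score the other.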

In [None]:
test_X_vect

## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
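vectorizer.fit(train['text'])  # fit on the raw text; train_X was already overwritten with the CountVectorizer matrix

In [None]:
train_X = vectorizer.transform(train['text'])
test_X_vect = vectorizer.transform(test_X)  # re-encode the test set with the same TF-IDF vocabulary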

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])
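
For intuition, the weighting that `TfidfVectorizer` applies with its default settings (raw counts, smoothed IDF, L2 row normalization) can be reproduced by hand. The count matrix below is made up, and the snippet is a sketch of the formula rather than a replacement for the library call.

```python
import numpy as np

# Toy term-count matrix: 3 documents x 4 terms (made-up numbers).
counts = np.array([[2, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 3]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                          # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1              # smoothed inverse document frequency
tfidf = counts * idf                                   # weight the raw counts
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(np.round(tfidf, 3))
```

Terms that occur in fewer documents get a larger `idf`, so they contribute more to a document's vector than terms that occur everywhere.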

# Algorithm lab

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [36]:
ohe = OneHotEncoder(sparse = False)
y = ohe.fit_transform(train[['target']])

In [None]:
model = Sequential()
model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
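# An LSTM layer expects 3-D input (batch, timesteps, features), so it cannot follow a Dense layer fed with 2-D bag-of-words vectors.
model.add(Dense(128, activation='elu'))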
model.add(Dropout(0.5))
model.add(Dense(64, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

In [None]:
model.fit(train_X, y, epochs=10, batch_size=128)

In [None]:
pred = model.predict(test_X_vect)

In [None]:
pred

In [None]:
np.argmax(pred, axis = 1)
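
`np.argmax` returns column positions, which equal the class labels only when the labels are exactly 0..19 in sorted order. A small sketch with made-up labels shows the general mapping; `classes` stands in for `ohe.categories_[0]`, and the probabilities are invented.

```python
import numpy as np

# Hypothetical example: three classes whose original labels are not 0..2.
classes = np.array([3, 7, 9])            # plays the role of ohe.categories_[0]
pred = np.array([[0.1, 0.7, 0.2],        # made-up softmax outputs
                 [0.6, 0.3, 0.1]])

col_idx = np.argmax(pred, axis=1)        # column positions of the largest probability
labels = classes[col_idx]                # map positions back to the original labels
print(col_idx, labels)
```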

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

# Kfold

In [24]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [None]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [None]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1]))  # one row per test sample, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation fold
    val_pred = best.predict(val_x)
    ### Map the highest probability to a class
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute per-fold validation accuracy
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test set for this fold and add it to the ensemble
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred
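
The ensembling step is a soft vote: each fold's class probabilities are divided by the number of folds and summed, so the accumulator needs one row per test sample. A sketch with toy sizes, where random probabilities stand in for `best.predict(...)`:

```python
import numpy as np

n_splits, n_test, n_classes = 5, 4, 3     # toy sizes, chosen for illustration
rng = np.random.default_rng(0)

# The accumulator is sized by the test set, not the training set.
ensemble = np.zeros((n_test, n_classes))

for _ in range(n_splits):
    # Stand-in for a fold model's test predictions: random rows that sum to 1.
    fold_pred = rng.dirichlet(np.ones(n_classes), size=n_test)
    ensemble += fold_pred / n_splits

print(ensemble.sum(axis=1))               # each row still sums to 1
print(np.argmax(ensemble, axis=1))        # final class position per test sample
```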

In [None]:
np.mean(nn_acc)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

## Stopword removal

In [6]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [7]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (no-op here: '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # trim leading whitespace
    review = re.sub(r'\s+$', '', review) # trim trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (no-op here: '_' was already stripped above)
    corpus.append(review) 
  
  return corpus

In [8]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

Unnamed: 0,text,target
0,they were and even if washington might conside...,10
1,we run spacenews views on our stareach bbs a l...,14
2,not to worry the masons have been demonized an...,19
3,only brendan mckay or maybe arf would come to ...,17
4,help i am running some sample problems from or...,5


In [9]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

Unnamed: 0,text
0,the vlide adapter can be much faster then the ...
1,yeah in a fire that reportedly burned hotter t...
2,judge i grant you immunity from whatever may b...
3,i too put a corbin seat on my hawk i got the s...
4,do i ever after years of having health problem...


In [10]:
train_X = train['text']
train_y = train['target']

In [11]:
train_X

0       they were and even if washington might conside...
1       we run spacenews views on our stareach bbs a l...
2       not to worry the masons have been demonized an...
3       only brendan mckay or maybe arf would come to ...
4       help i am running some sample problems from or...
                              ...                        
9228    precisely why not cuba why not the hatians are...
9229    your custom resume on disk macintosh or ibm co...
9230    throughout the years of the israelarabpalestin...
9231    does anyone know if there are any devices avai...
9232    give me a break chum are you telling me that c...
Name: text, Length: 9233, dtype: object

In [12]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

stop_words = set(stopwords.words('english')) 

In [13]:
for i in range(len(train_X)):
    a = train_X[i].split(' ')
    result = ''
    for word in a: 
        if word not in stop_words: 
            result = result + ' ' + word 
    train_X[i] = result

train_X

0        even washington might consider patty bust id ...
1        run spacenews views stareach bbs localoperati...
2        worry masons demonized harrassed almost every...
3        brendan mckay maybe arf would come rescue naz...
4        help running sample problems oreilly volume x...
                              ...                        
9228     precisely cuba hatians ruled thugs elected le...
9229     custom resume disk macintosh ibm compatible n...
9230     throughout years israelarabpalestinian confli...
9231     anyone know devices available mac whichwill i...
9232     give break chum telling clinton reno know bat...
Name: text, Length: 9233, dtype: object
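
The same filtering can be written without the index-based loop. The sketch below uses a tiny stand-in stopword set and a hypothetical helper `remove_stopwords`; the notebook itself uses nltk's `stopwords.words('english')`.

```python
# Small stand-in stopword set (assumption); the notebook uses
# nltk's stopwords.words('english') instead.
stop_words = {"the", "a", "is", "and", "to", "of"}

def remove_stopwords(text, stop_words):
    """Keep whitespace-separated tokens that are not stopwords."""
    return " ".join(w for w in text.split() if w not in stop_words)

docs = ["the cat is on the mat", "a dog and a bone"]
cleaned = [remove_stopwords(d, stop_words) for d in docs]
print(cleaned)
```

On a pandas Series the helper could be applied with something like `train_X.apply(lambda s: remove_stopwords(s, stop_words))`, which avoids element-wise assignment and the leading space the loop version leaves behind.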

In [14]:
for i in range(len(test.text)):
    a = test.text[i].split(' ')
    result = ''
    for word in a: 
        if word not in stop_words: 
            result = result + ' ' + word 
    test.text[i] = result

test

Unnamed: 0,text
0,vlide adapter much faster normal ide depends ...
1,yeah fire reportedly burned hotter degrees ho...
2,judge grant immunity whatever may learned key...
3,put corbin seat hawk got solo seat whichcould...
4,ever years health problems beencleared waller...
...,...
9228,texas cannot carry handgun period either conc...
9229,yes want concentrate development issues ive c...
9230,know megadrives worked perfectly mymac plus p...
9231,oops quite right got busy saved franks last p...


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectorizer.fit(train_X)

In [None]:
train_X = vectorizer.transform(train_X)

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])

In [None]:
test_X = test.text # raw test documents

test_X_vect = vectorizer.transform(test_X) # transform only, using the vocabulary learned from train
# Calling fit_transform on the test data would train the vectorizer on the test set, which is data leakage.

In [None]:
test_X_vect

In [26]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [None]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1]))  # one row per test sample, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation fold
    val_pred = best.predict(val_x)
    ### Map the highest probability to a class
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute per-fold validation accuracy
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test set for this fold and add it to the ensemble
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred

In [None]:
np.mean(nn_acc)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

## Stopwords + BERT tokenizer

In [None]:
from transformers import BertTokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [15]:
import numpy
import tensorflow as tf
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Embedding
  

In [None]:
token = Tokenizer()
token.fit_on_texts(train_X)
print(token.word_index)

In [None]:
train_X

In [None]:
word_size = len(token.word_index)+1

In [None]:
word_size

In [None]:
print('Max article length (characters): {}'.format(max(len(sample) for sample in train_X)))
print('Mean article length (characters): {}'.format(sum(map(len, train_X))/len(train_X)))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.hist([len(sample) for sample in train_X], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()

In [16]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_X)
train_X_encoded = tokenizer.texts_to_sequences(train_X)
print(train_X_encoded[:5])

[[20, 747, 57, 416, 14464, 14465, 137, 40487, 923, 1824, 25889, 620, 6063, 14465, 20200, 1157], [123, 40488, 1459, 40489, 1290, 40490, 254, 40491, 100, 2348, 1853, 40492, 10677, 773, 716, 40493, 381, 365, 450, 277, 6584, 40494, 16698, 1419, 1246, 40495, 865, 7, 287, 919, 474, 334, 245, 40496, 1661, 8111, 727, 10678, 20201, 1309, 1016, 265, 141, 3287, 40497, 312, 33], [2017, 9903, 20202, 16699, 373, 119, 402, 20203, 290, 3444, 42, 1017, 16700, 12787, 10679, 2824, 20204, 40498, 16, 20, 841, 102, 107, 5, 214, 1782, 20204, 1083, 16701, 1125, 2184, 804, 14466, 9903, 420, 291, 17, 1966, 2, 92, 362, 25890, 8112, 2824, 40499, 9903, 9903, 231, 2074, 47, 3222, 25891, 40500, 40, 40501, 231, 16702, 3223, 366, 25891, 14467, 5823, 16, 214, 40502, 2872, 25892, 20205, 2824, 20204, 1083, 290, 76, 1017, 123, 16703, 25893, 169, 901, 1190, 2633, 5619, 1639, 1485, 2322, 2678, 2018, 440, 122, 1295, 16704, 40503, 7686, 40504, 9904, 162, 20206, 4126, 40505, 40506, 7, 379, 3445, 2634, 16705, 40507, 674, 79, 83

In [32]:
test_X_encoded = tokenizer.texts_to_sequences(test.text)

In [17]:
word_to_index = tokenizer.word_index
print(list(word_to_index.items())[:10])  # print a sample only; the full index exceeded the notebook's output rate limit



In [18]:
vocab_size = len(word_to_index) + 1
print('Vocabulary size: {}'.format((vocab_size)))

Vocabulary size: 143566


In [None]:
print('Max sequence length (tokens): %d' % max(len(sample) for sample in train_X_encoded))
print('Mean sequence length (tokens): %f' % (sum(map(len, train_X_encoded))/len(train_X_encoded)))

In [None]:
max_len = 6737
train_X_padded = pad_sequences(train_X_encoded, maxlen = max_len)
print("Training data shape:", train_X_padded.shape)
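
Padding every document to the longest one means each sample becomes as long as the single outlier. One common alternative, shown here only as a sketch on made-up lengths, is to cap `max_len` at a percentile of the length distribution and let longer documents be truncated.

```python
import numpy as np

# Made-up sequence lengths with one extreme outlier.
lengths = np.array([12, 40, 35, 80, 22, 6737, 51, 64, 18, 90])

max_len_full = int(lengths.max())                # pads everything to the outlier
max_len_p90 = int(np.percentile(lengths, 90))    # covers roughly 90% of documents

print(max_len_full, max_len_p90)
print((lengths <= max_len_p90).mean())           # fraction of documents kept whole
```

Whether a percentile cap helps here depends on how much signal sits in the truncated part of long posts, so it is worth comparing validation accuracy for a few values.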

In [None]:
embedding_dim = 128
hidden_units = 128
num_classes = 20

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))  # vocab_size == 143566, computed above
model.add(LSTM(hidden_units))
model.add(Dense(num_classes, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(train_X_padded, y, epochs=10, batch_size=128)

In [19]:
train_X

0        even washington might consider patty bust id ...
1        run spacenews views stareach bbs localoperati...
2        worry masons demonized harrassed almost every...
3        brendan mckay maybe arf would come rescue naz...
4        help running sample problems oreilly volume x...
                              ...                        
9228     precisely cuba hatians ruled thugs elected le...
9229     custom resume disk macintosh ibm compatible n...
9230     throughout years israelarabpalestinian confli...
9231     anyone know devices available mac whichwill i...
9232     give break chum telling clinton reno know bat...
Name: text, Length: 9233, dtype: object

In [23]:
train_X_encoded[0]

[20,
 747,
 57,
 416,
 14464,
 14465,
 137,
 40487,
 923,
 1824,
 25889,
 620,
 6063,
 14465,
 20200,
 1157]

In [26]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [37]:
nn_acc = []
nn_pred = np.zeros((len(test_X_encoded), y.shape[1]))  # one row per test sample, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X_encoded, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X_encoded[tr_idx], y[tr_idx]
    val_x, val_y = train_X_encoded[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X_encoded.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation fold
    val_pred = best.predict(val_x)
    ### Map the highest probability to a class
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute per-fold validation accuracy
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test set for this fold and add it to the ensemble
    fold_pred = best.predict(test_X_encoded) / skf.n_splits
    nn_pred += fold_pred

1 Fold Training.....


TypeError: only integer scalar arrays can be converted to a scalar index
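
The error comes from the data type: `texts_to_sequences` returns a plain Python list of ragged lists, and a list cannot be indexed with a NumPy index array such as `tr_idx`. The sketch below reproduces the failure on toy data and shows that a padded 2-D array accepts the same index. `pad_to_length` is a hypothetical stand-in that assumes the default `pad_sequences` behaviour (zeros on the left, keep the last `maxlen` ids).

```python
import numpy as np

# Ragged token-id lists, like the output of texts_to_sequences (toy values).
seqs = [[20, 747, 57], [123, 4], [2017, 9903, 20202, 16699]]
idx = np.array([0, 2])                    # like tr_idx from skf.split

try:
    seqs[idx]                             # a list cannot be indexed by an array
except TypeError as err:
    print("TypeError:", err)

def pad_to_length(sequences, maxlen):
    """Hypothetical stand-in for pad_sequences: left-pad with 0, keep the last maxlen ids."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]
        out[row, maxlen - len(tail):] = tail
    return out

padded = pad_to_length(seqs, maxlen=3)
print(padded[idx])                        # fancy indexing works on a 2-D array
```

In this notebook that would mean splitting and indexing `train_X_padded` rather than `train_X_encoded`, and padding the test sequences to the same `maxlen` before calling `predict`.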

In [47]:
train_X_encoded

[[20,
  747,
  57,
  416,
  14464,
  14465,
  137,
  40487,
  923,
  1824,
  25889,
  620,
  6063,
  14465,
  20200,
  1157],
 [123,
  40488,
  1459,
  40489,
  1290,
  40490,
  254,
  40491,
  100,
  2348,
  1853,
  40492,
  10677,
  773,
  716,
  40493,
  381,
  365,
  450,
  277,
  6584,
  40494,
  16698,
  1419,
  1246,
  40495,
  865,
  7,
  287,
  919,
  474,
  334,
  245,
  40496,
  1661,
  8111,
  727,
  10678,
  20201,
  1309,
  1016,
  265,
  141,
  3287,
  40497,
  312,
  33],
 [2017,
  9903,
  20202,
  16699,
  373,
  119,
  402,
  20203,
  290,
  3444,
  42,
  1017,
  16700,
  12787,
  10679,
  2824,
  20204,
  40498,
  16,
  20,
  841,
  102,
  107,
  5,
  214,
  1782,
  20204,
  1083,
  16701,
  1125,
  2184,
  804,
  14466,
  9903,
  420,
  291,
  17,
  1966,
  2,
  92,
  362,
  25890,
  8112,
  2824,
  40499,
  9903,
  9903,
  231,
  2074,
  47,
  3222,
  25891,
  40500,
  40,
  40501,
  231,
  16702,
  3223,
  366,
  25891,
  14467,
  5823,
  16,
  214,
  40502,
  287

In [44]:
tr_idx

array([   1,    2,    3, ..., 9230, 9231, 9232])