In [None]:
Change the embedding algorithm
Change the classification algorithm

In [1]:
from tensorflow.python.client import device_lib 
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 17558348593734113906
 xla_global_id: -1]

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [200]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [201]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # convert to lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (no effect here: '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # remove leading whitespace
    review = re.sub(r'\s+$', '', review) # remove trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (no effect here: '_' was already stripped above)
    corpus.append(review) 
  
  return corpus
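
Because the first substitution in `clean_text` already deletes `>`, the later HTML-tag pattern can no longer match anything. A minimal sketch of a reordered variant follows. The function name `clean_text_tags_first` and the simplified "letters and whitespace only" rule are assumptions for illustration, not the notebook's method, and the variant would change the vocabulary sizes reported below.

```python
import re

def clean_text_tags_first(texts):
    """Hypothetical variant: strip HTML tags before removing punctuation."""
    cleaned = []
    for text in texts:
        # Remove tags while '<' and '>' are still present in the string.
        text = re.sub(r'<[^>]+>', ' ', text)
        # Simplification (assumption): keep ASCII letters and whitespace only.
        text = re.sub(r'[^A-Za-z\s]', '', text)
        # Collapse whitespace runs, trim both ends, and lowercase.
        text = re.sub(r'\s+', ' ', text).strip().lower()
        cleaned.append(text)
    return cleaned

sample = ["Hello <b>World</b>!  Call 911_now"]
print(clean_text_tags_first(sample))
```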

In [202]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

Unnamed: 0,text,target
0,they were and even if washington might conside...,10
1,we run spacenews views on our stareach bbs a l...,14
2,not to worry the masons have been demonized an...,19
3,only brendan mckay or maybe arf would come to ...,17
4,help i am running some sample problems from or...,5


In [203]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

Unnamed: 0,text
0,the vlide adapter can be much faster then the ...
1,yeah in a fire that reportedly burned hotter t...
2,judge i grant you immunity from whatever may b...
3,i too put a corbin seat on my hawk i got the s...
4,do i ever after years of having health problem...


In [204]:
train_X = train['text']
train_y = train['target']

In [205]:
train_X

0       they were and even if washington might conside...
1       we run spacenews views on our stareach bbs a l...
2       not to worry the masons have been demonized an...
3       only brendan mckay or maybe arf would come to ...
4       help i am running some sample problems from or...
                              ...                        
9228    precisely why not cuba why not the hatians are...
9229    your custom resume on disk macintosh or ibm co...
9230    throughout the years of the israelarabpalestin...
9231    does anyone know if there are any devices avai...
9232    give me a break chum are you telling me that c...
Name: text, Length: 9233, dtype: object

# baseline

## CountVectorizer

In [206]:
from sklearn.feature_extraction.text import CountVectorizer

In [207]:
vectorizer = CountVectorizer()

In [208]:
vectorizer.fit(train_X)

CountVectorizer()

In [209]:
train_X = vectorizer.transform(train_X)

In [210]:
train_X

<9233x143548 sparse matrix of type '<class 'numpy.int64'>'
	with 861642 stored elements in Compressed Sparse Row format>

In [211]:
vectorizer.inverse_transform(train_X[0])

[array(['and', 'been', 'bust', 'complete', 'consider', 'druce', 'even',
        'goals', 'has', 'hereonly', 'id', 'if', 'in', 'might', 'minute',
        'patty', 'reworkthat', 'they', 'trade', 'utter', 'washington',
        'were'], dtype='<U650')]

In [212]:
test_X = test.text # raw test documents

test_X_vect = vectorizer.transform(test_X) # transform the test documents with the vectorizer fitted on train
# Calling fit_transform on the test data would fit the vectorizer on test data, which is data leakage.

In [213]:
test_X_vect

<9233x143548 sparse matrix of type '<class 'numpy.int64'>'
	with 740685 stored elements in Compressed Sparse Row format>
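
The fit-on-train, transform-on-test pattern can be sketched on a toy corpus. The documents and variable names here are made up for illustration. Words that appear only in the test documents are dropped, so the test matrix keeps the same column count as the train matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumed for illustration only).
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # vocabulary is learned from train only
test_counts = vec.transform(test_docs)        # test reuses that vocabulary

print(sorted(vec.vocabulary_))   # 'meowed' is absent: it never occurred in train
print(test_counts.toarray())     # counts over the train vocabulary
```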

## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
vectorizer.fit(train['text'])  # fit on the raw text; train_X was already converted to a count matrix above

In [None]:
train_X = vectorizer.transform(train['text'])

In [None]:
train_X

In [None]:
vectorizer.inverse_transform(train_X[0])
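
`TfidfVectorizer` expects raw strings, not an already vectorized matrix. A small sketch on made-up documents shows its effect: with the default settings, a word that occurs in every document gets a lower weight than a rarer word in the same row, and each row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumed for illustration only).
docs = ["the cat sat", "the dog barked", "the cat barked"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)   # input is raw text
vocab = tfidf.vocabulary_             # maps each word to its column index

row0 = weights[0].toarray().ravel()   # weights for "the cat sat"
print(row0[vocab["the"]], row0[vocab["cat"]], row0[vocab["sat"]])
print(np.linalg.norm(row0))           # rows have unit L2 norm by default
```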

# Algorithm Lab

In [3]:
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [11]:
ohe = OneHotEncoder(sparse = False)  # on scikit-learn >= 1.2, use sparse_output=False instead
y = ohe.fit_transform(train[['target']])
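
The later cells recover class labels from one-hot rows with `np.argmax`. The sketch below uses plain NumPy and made-up labels to show that round trip. It assumes the labels are the integers 0 to K-1, so that a column index equals its label. With other label values, the encoder's `categories_` attribute would be needed to map columns back to labels.

```python
import numpy as np

# Made-up integer labels in the range 0..K-1 (assumption).
labels = np.array([2, 0, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes)[labels]

# argmax along the class axis returns the column holding the 1.
recovered = np.argmax(one_hot, axis=1)
print(one_hot)
print(recovered)
```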

In [None]:
model = Sequential()
model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

In [None]:
model.fit(train_X, y, epochs=10, batch_size=128)

In [None]:
pred = model.predict(test_X_vect)

In [None]:
pred

In [None]:
np.argmax(pred, axis = 1)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

# Kfold

In [236]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *

In [None]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)
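
As a quick illustration of what stratification means, the sketch below splits a made-up, imbalanced label vector and counts the classes in each validation fold. The data and the name `skf_demo` are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 8 samples of class 0 and 4 samples of class 1.
X_demo = np.zeros((12, 1))
y_demo = np.array([0] * 8 + [1] * 4)

skf_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

# Count how many samples of each class land in every validation fold.
val_counts = [
    np.bincount(y_demo[val_idx], minlength=2).tolist()
    for _, val_idx in skf_demo.split(X_demo, y_demo)
]
print(val_counts)  # each fold keeps the 2:1 class ratio
```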

In [None]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1])) # one row per test document, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint saved for this fold
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation split
    val_pred = best.predict(val_x)
    ### Map the highest probability to its class index
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute the validation accuracy for this fold
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test data with this fold's model and add it to the ensemble average
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred
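
The ensembling step above is a running average of per-fold class probabilities. A minimal NumPy sketch follows, with random softmax outputs standing in for `best.predict(...)` (an assumption, since no model is trained here). The accumulator is sized by the number of test rows, and the averaged rows still sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, n_test, n_classes = 5, 4, 3  # made-up sizes

# One row per test sample, one column per class.
ensemble = np.zeros((n_test, n_classes))

for _ in range(n_splits):
    # Stand-in for one fold's predicted probabilities (random softmax).
    logits = rng.normal(size=(n_test, n_classes))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Adding probs / n_splits on every fold yields the mean over folds.
    ensemble += probs / n_splits

final_classes = ensemble.argmax(axis=1)
print(ensemble.sum(axis=1))
print(final_classes)
```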

In [None]:
np.mean(nn_acc)

In [None]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)

## Stopword removal

In [240]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

In [241]:
import re 

def clean_text(texts): 
  corpus = [] 
  for i in range(0, len(texts)): 

    review = re.sub(r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"\n\]\[\>]', '',texts[i]) # remove punctuation such as @%*=()/+
    review = re.sub(r'\d+','', review) # remove digits
    review = review.lower() # convert to lowercase
    review = re.sub(r'\s+', ' ', review) # collapse extra whitespace
    review = re.sub(r'<[^>]+>','',review) # remove HTML tags (no effect here: '>' was already stripped above)
    review = re.sub(r'\s+', ' ', review) # collapse whitespace again
    review = re.sub(r"^\s+", '', review) # remove leading whitespace
    review = re.sub(r'\s+$', '', review) # remove trailing whitespace
    review = re.sub(r'_', ' ', review) # replace underscores with spaces (no effect here: '_' was already stripped above)
    corpus.append(review) 
  
  return corpus

In [242]:
temp = clean_text(train['text']) # apply the cleaning function
train['text'] = temp
train.head()

Unnamed: 0,text,target
0,they were and even if washington might conside...,10
1,we run spacenews views on our stareach bbs a l...,14
2,not to worry the masons have been demonized an...,19
3,only brendan mckay or maybe arf would come to ...,17
4,help i am running some sample problems from or...,5


In [243]:
temp = clean_text(test['text']) # apply the cleaning function
test['text'] = temp
test.head()

Unnamed: 0,text
0,the vlide adapter can be much faster then the ...
1,yeah in a fire that reportedly burned hotter t...
2,judge i grant you immunity from whatever may b...
3,i too put a corbin seat on my hawk i got the s...
4,do i ever after years of having health problem...


In [244]:
train_X = train['text']
train_y = train['target']

In [245]:
train_X

0       they were and even if washington might conside...
1       we run spacenews views on our stareach bbs a l...
2       not to worry the masons have been demonized an...
3       only brendan mckay or maybe arf would come to ...
4       help i am running some sample problems from or...
                              ...                        
9228    precisely why not cuba why not the hatians are...
9229    your custom resume on disk macintosh or ibm co...
9230    throughout the years of the israelarabpalestin...
9231    does anyone know if there are any devices avai...
9232    give me a break chum are you telling me that c...
Name: text, Length: 9233, dtype: object

In [246]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

# Requires the NLTK stopword corpus (one-time setup: nltk.download('stopwords')).
stop_words = set(stopwords.words('english')) 

In [247]:
for i in range(len(train_X)):
    a = train_X[i].split(' ')
    result = ''
    for word in a: 
        if word not in stop_words: 
            result = result + ' ' + word 
    train_X[i] = result

train_X

0        even washington might consider patty bust id ...
1        run spacenews views stareach bbs localoperati...
2        worry masons demonized harrassed almost every...
3        brendan mckay maybe arf would come rescue naz...
4        help running sample problems oreilly volume x...
                              ...                        
9228     precisely cuba hatians ruled thugs elected le...
9229     custom resume disk macintosh ibm compatible n...
9230     throughout years israelarabpalestinian confli...
9231     anyone know devices available mac whichwill i...
9232     give break chum telling clinton reno know bat...
Name: text, Length: 9233, dtype: object
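
The same filtering can be written without an explicit index loop. The sketch below uses a tiny hand-written stopword set in place of NLTK's English list, and the helper name `drop_stopwords` is hypothetical. Unlike the loop above, it joins words with single spaces and leaves no leading space, which does not affect what `CountVectorizer` extracts.

```python
import pandas as pd

# Tiny stand-in for NLTK's English stopword list (assumption for this demo).
stop_words_demo = {"the", "a", "is", "on"}

def drop_stopwords(text, stop_words):
    """Return `text` with every whitespace-separated stopword removed."""
    return " ".join(word for word in text.split() if word not in stop_words)

texts = pd.Series(["the cat is on a mat", "a dog barked"])

# apply() builds a new Series, so nothing is assigned row by row.
filtered = texts.apply(lambda t: drop_stopwords(t, stop_words_demo))
print(filtered.tolist())
```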

In [248]:
for i in range(len(test.text)):
    a = test.text[i].split(' ')
    result = ''
    for word in a: 
        if word not in stop_words: 
            result = result + ' ' + word 
    test.loc[i, 'text'] = result  # .loc avoids chained assignment, which may not write back to `test`

test

Unnamed: 0,text
0,vlide adapter much faster normal ide depends ...
1,yeah fire reportedly burned hotter degrees ho...
2,judge grant immunity whatever may learned key...
3,put corbin seat hawk got solo seat whichcould...
4,ever years health problems beencleared waller...
...,...
9228,texas cannot carry handgun period either conc...
9229,yes want concentrate development issues ive c...
9230,know megadrives worked perfectly mymac plus p...
9231,oops quite right got busy saved franks last p...


In [249]:
from sklearn.feature_extraction.text import CountVectorizer

In [250]:
vectorizer = CountVectorizer()

In [251]:
vectorizer.fit(train_X)

CountVectorizer()

In [252]:
train_X = vectorizer.transform(train_X)

In [253]:
train_X

<9233x143522 sparse matrix of type '<class 'numpy.int64'>'
	with 649556 stored elements in Compressed Sparse Row format>

In [254]:
vectorizer.inverse_transform(train_X[0])

[array(['bust', 'complete', 'consider', 'druce', 'even', 'goals',
        'hereonly', 'id', 'might', 'minute', 'patty', 'reworkthat',
        'trade', 'utter', 'washington'], dtype='<U650')]

In [255]:
test_X = test.text # raw test documents

test_X_vect = vectorizer.transform(test_X) # transform the test documents with the vectorizer fitted on train
# Calling fit_transform on the test data would fit the vectorizer on test data, which is data leakage.

In [256]:
test_X_vect

<9233x143522 sparse matrix of type '<class 'numpy.int64'>'
	with 530833 stored elements in Compressed Sparse Row format>

In [257]:
skf = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

In [258]:
nn_acc = []
nn_pred = np.zeros((test_X_vect.shape[0], y.shape[1])) # one row per test document, one column per class

for i, (tr_idx, val_idx) in enumerate(skf.split(train_X, train_y)) :
    print(f'{i + 1} Fold Training.....')
    tr_x, tr_y = train_X[tr_idx], y[tr_idx]
    val_x, val_y = train_X[val_idx], y[val_idx]
    
    ### NN model
    model = Sequential()
    model.add(Dense(256, input_dim=train_X.shape[1], activation = 'elu'))
    model.add(Dense(128, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='elu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='softmax'))

    mc = ModelCheckpoint(f'model_{i + 1}.h5', save_best_only = True, monitor = 'val_accuracy', mode = 'max', verbose = 0)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

    result = model.fit(tr_x, tr_y, validation_data = (val_x, val_y), epochs = 10, batch_size = 128, callbacks = [mc], verbose = 1)

    ### Load the best checkpoint saved for this fold
    best = load_model(f'model_{i + 1}.h5')
    ### Predict on the validation split
    val_pred = best.predict(val_x)
    ### Map the highest probability to its class index
    val_cls = np.argmax(val_pred, axis = 1)
    ### Compute the validation accuracy for this fold
    fold_nn_acc = accuracy_score(np.argmax(val_y, axis = 1), val_cls)
    nn_acc.append(fold_nn_acc)
    print(f'{i + 1} Fold nn acc = {fold_nn_acc}\n')

    ### Predict on the test data with this fold's model and add it to the ensemble average
    fold_pred = best.predict(test_X_vect) / skf.n_splits
    nn_pred += fold_pred

1 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
1 Fold nn acc = 0.7153679653679653

2 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
2 Fold nn acc = 0.696969696969697

3 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
3 Fold nn acc = 0.7056277056277056

4 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
4 Fold nn acc = 0.7009750812567714

5 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
5 Fold nn acc = 0.71397616468039

6 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
6 Fold nn acc = 0.6923076923076923

7 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
7 Fold nn acc = 0.7150595882990249

8 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
8 Fold nn acc = 0.7085590465872156

9 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
9 Fold nn acc = 0.6836403033586133

10 Fold Training.....
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
10 Fold nn acc = 0.7313109425785482



In [259]:
np.mean(nn_acc)

0.7063794187033625

In [260]:
submission = pd.read_csv('sample_submission.csv')

submission['target'] = np.argmax(nn_pred, axis = 1)

submission

submission.to_csv('submission.csv',index=False)