# Approaching (Almost) Any NLP Problem on Kaggle
https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle

We will build a very basic first model and then improve it using different features. We will also see how deep neural networks can be used, and we close the post with some general ideas about ensembling.

In [4]:
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [5]:
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
sample = pd.read_csv('./input/sample_submission.csv')

In [6]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [7]:
test.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


In [8]:
sample.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.403494,0.287808,0.308698
1,id24541,0.403494,0.287808,0.308698
2,id00134,0.403494,0.287808,0.308698
3,id27757,0.403494,0.287808,0.308698
4,id04081,0.403494,0.287808,0.308698


The problem requires us to predict the author (EAP, HPL or MWS) given a piece of text. In simpler words, it is text classification with 3 classes.

The evaluation metric for this problem is multi-class log loss.

In [18]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    # Convert 'actual' to a binary (one-hot) array if it holds class indices
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[-1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
        
    clip = np.clip(predicted, eps, 1-eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
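
To get a feel for the scale of this metric, here is a small worked example. It is a minimal sketch, not the notebook's function: `logloss_demo` and the toy arrays are hypothetical names introduced for illustration. A model that predicts a uniform 1/3 for every class scores ln(3), while a model that puts 0.8 on the correct class scores -ln(0.8).

```python
import numpy as np

def logloss_demo(actual, predicted, eps=1e-15):
    """Multi-class log loss for integer labels and an (n, k) probability array."""
    predicted = np.clip(predicted, eps, 1 - eps)
    # Pick the predicted probability of the true class in each row
    true_class_probs = predicted[np.arange(len(actual)), actual]
    return -np.mean(np.log(true_class_probs))

y_true = np.array([0, 2, 1])
uniform = np.full((3, 3), 1 / 3)
confident = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.1, 0.8, 0.1]])

print(logloss_demo(y_true, uniform))    # ln(3), about 1.0986
print(logloss_demo(y_true, confident))  # -ln(0.8), about 0.2231
```

Any score well below 1.0986 therefore means the model is doing better than guessing uniformly.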

We use scikit-learn's LabelEncoder to convert the text labels to the integers 0, 1 and 2.

In [19]:
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
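
As a quick illustration of what the encoder does, the sketch below runs `LabelEncoder` on a hand-written list of author codes (the list is made up for the example). Classes are sorted alphabetically, so the integer order matches the EAP, HPL, MWS column order of the sample submission.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['EAP', 'HPL', 'EAP', 'MWS'])

print(codes)                               # [0 1 0 2]
print(list(enc.classes_))                  # ['EAP', 'HPL', 'MWS']
print(list(enc.inverse_transform([2, 0]))) # ['MWS', 'EAP']
```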

Before going further, it is important to split the data into training and validation sets. We can do this with scikit-learn's train_test_split.

In [20]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y, stratify=y,
                                                  random_state=42,
                                                  test_size=0.1, shuffle=True)

In [21]:
print(xtrain.shape)
print(xvalid.shape)

(17621,)
(1958,)
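
The `stratify` argument keeps the class proportions the same in both parts of the split. The following sketch shows this on synthetic labels (the 60/30/10 class mix and the `doc i` strings are invented for the example, not taken from the competition data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
texts = np.array(['doc %d' % i for i in range(100)])

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, stratify=labels,
    random_state=42, test_size=0.1, shuffle=True)

# Both parts keep the 60/30/10 class mix
print(np.bincount(y_tr))  # [54 27  9]
print(np.bincount(y_va))  # [6 3 1]
```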


## Building Basic Models
Let's build our first model.

The first model is a simple TF-IDF (Term Frequency - Inverse Document Frequency) representation followed by a simple logistic regression.

In [22]:
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1,3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                      stop_words='english')

# Fit TF-IDF on both the training and validation sets
tfv.fit(list(xtrain) +list(xvalid))
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
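
To make the effect of `ngram_range` and the token pattern concrete, here is a minimal sketch on two made-up sentences. It uses only a subset of the settings above (no `min_df`, no stop words), so the vocabulary is small enough to read. Each row of the result is L2-normalized, which is the `TfidfVectorizer` default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the raven flew', 'the raven spoke']
vec = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(docs)

# Unigrams and bigrams both become features
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)  # (2, 7)

# Each document vector has unit L2 norm
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```

With `ngram_range=(1, 3)` and a real corpus the vocabulary grows very quickly, which is why the notebook prunes rare n-grams with `min_df=3`.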

In [23]:
# Fit a simple logistic regression on the TF-IDF features
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.572


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


We have our first model, with a multi-class log loss of 0.572.

We want a better score, though, so let's look at the same model with different features.

Instead of TF-IDF, we can also use word counts as features. This is easy to do with scikit-learn's CountVectorizer.

In [24]:
ctv = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), stop_words='english')
# Fit the CountVectorizer on both the training and validation sets
ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv = ctv.transform(xtrain)
xvalid_ctv = ctv.transform(xvalid)

In [28]:
# Fit a simple logistic regression on the counts
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.527


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


We improved on the first model by 0.045.

Next, let's try a very simple model that used to be quite popular: naive Bayes.

Let's see what happens when we use naive Bayes on these two datasets.

In [30]:
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.578


Logistic regression on counts is still better (0.527 versus 0.578). Let's see what happens when we use naive Bayes on the count data instead.

In [31]:
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)
print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.485


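The old methods still seem to work well. Another popular algorithm is the SVM.

Because SVMs take a long time to train, we first reduce the number of TF-IDF features with Singular Value Decomposition (SVD).

We also need to standardize the data before applying the SVM.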

In [33]:
# Apply SVD with 120 components; 120-200 components usually work well for SVM models
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from SVD
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
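
The idea behind these two steps can be sketched with NumPy alone. This is an illustration of the technique under stated assumptions, not scikit-learn's implementation: `TruncatedSVD` uses an approximate solver on sparse input, whereas the sketch runs an exact SVD on a small random dense matrix standing in for the TF-IDF features.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(8, 5)       # stand-in for a (documents x terms) matrix
X_new = rng.rand(3, 5)   # stand-in for validation documents
k = 2                    # number of components to keep

# Truncated SVD: project onto the top-k right singular vectors
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T            # equals U[:, :k] * S[:k]

# Standardize using statistics from the training part only
mean = X_reduced.mean(axis=0)
std = X_reduced.std(axis=0)
X_scaled = (X_reduced - mean) / std
X_new_scaled = (X_new @ Vt[:k].T - mean) / std

print(X_scaled.shape, X_new_scaled.shape)  # (8, 2) (3, 2)
```

As in the notebook, the projection and the scaling statistics are learned on the training rows and then reused unchanged for the validation rows.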

Now it is time to apply the SVM.

In [34]:
# SVM
clf = SVC(C=1.0, probability=True) 
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.729


The score is quite poor. The SVM does not seem to be a good fit for this data.

Going further, let's try xgboost.

In [35]:
# Fit xgboost on the TF-IDF features
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))

logloss: 0.782


xgboost does not do well either. However, we have not done any hyperparameter optimization yet. The next section covers that.

## Grid Search
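Grid search is a technique for hyperparameter optimization. It is not very efficient, but it can give good results if you know which grid to use.

This section covers grid search using logistic regression.

Before starting the grid search, we need to create a scoring function. We do this with scikit-learn's make_scorer function.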

In [36]:
mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)
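
Setting `greater_is_better=False` makes the scorer return the negated loss, so that grid search can always maximize the score. The sketch below shows the same sign convention with scikit-learn's built-in `'neg_log_loss'` scorer, used here as a stand-in for the custom scorer (an assumption made for the example, along with the synthetic three-class data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

# Synthetic 3-class data: each class is shifted by its label
rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2], 20)
X = rng.randn(60, 4) + y[:, None]

clf = LogisticRegression(max_iter=1000).fit(X, y)

scorer = get_scorer('neg_log_loss')
score = scorer(clf, X, y)                 # negative number
plain = log_loss(y, clf.predict_proba(X)) # positive log loss

print(score, plain)
```

This is why a grid search driven by such a scorer reports a negative best score: the log loss itself is the absolute value.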

Next we need a pipeline. For demonstration, we use a pipeline made of SVD, scaling and logistic regression. A pipeline is easier to understand with several modules in it than with just one.

In [42]:
# Initialize SVD
svd = TruncatedSVD()

# Initialize the standard scaler
scl = preprocessing.StandardScaler()

# Logistic regression
lr_model = LogisticRegression()

# Build the pipeline
clf = pipeline.Pipeline([('svd', svd),
                         ('scl', scl),
                         ('lr', lr_model)])

Next we create the parameter grid.

In [43]:
param_grid = {'svd__n_components': [120, 180],
              'lr__C': [0.1, 1.0, 10],
              'lr__penalty': ['l1', 'l2']}

For SVD we evaluate 120 and 180 components, and for logistic regression we evaluate three values of C with the l1 and l2 penalties. We can now start the grid search over these parameters.
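
The number of fits grows multiplicatively with the grid. A small sketch with scikit-learn's `ParameterGrid` enumerates the candidates for the grid above; the `cv_folds` variable is a hypothetical name that mirrors the `cv=2` setting used in the search.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'svd__n_components': [120, 180],
              'lr__C': [0.1, 1.0, 10],
              'lr__penalty': ['l1', 'l2']}

candidates = list(ParameterGrid(param_grid))
cv_folds = 2  # assumption: same as cv=2 in the grid search

print(len(candidates))             # 12
print(len(candidates) * cv_folds)  # 24 fits in total
print(candidates[0])
```

One caveat: in recent scikit-learn versions the default `lbfgs` solver of `LogisticRegression` does not support the `l1` penalty, so those candidates fail to fit unless a solver such as `liblinear` or `saga` is set.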

In [44]:
# Initialize the grid search model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                     verbose=10, n_jobs=-1, refit=True, cv=2)

# Fit the grid search model
model.fit(xtrain_tfv, ytrain) 
print('Best Score: %0.3f' %model.best_score_)
print('Best parameters set:')
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Fitting 2 folds for each of 12 candidates, totalling 24 fits




Best Score: -0.739
Best parameters set:
	lr__C: 10
	lr__penalty: 'l2'
	svd__n_components: 180


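The score is similar to what we got with the SVM (0.739 here versus 0.729). The same technique can be used to fine-tune xgboost or multinomial naive Bayes. Here we use the TF-IDF data with naive Bayes: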

In [46]:
nb_model = MultinomialNB()

# Create the pipeline
clf = pipeline.Pipeline([('nb', nb_model)])

# Parameter grid
param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Initialize the grid search model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                     verbose=10, n_jobs=-1, refit=True, cv=2)

# Fit the grid search model
model.fit(xtrain_tfv, ytrain)
print('Best Score: %0.3f' %model.best_score_)
print('Best parameters set:')
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Fitting 2 folds for each of 6 candidates, totalling 12 fits
Best Score: -0.492
Best parameters set:
	nb__alpha: 0.1


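This is a clear improvement over the untuned naive Bayes on TF-IDF (0.578 before, 0.492 now), although the two numbers are not strictly comparable: the grid search score comes from 2-fold cross-validation on the training split, not from the validation set. It is close to the 0.485 that naive Bayes reached on raw counts.

In NLP problems it is customary to look at word vectors. Word vectors give a lot of insight into the data. Let's take a closer look.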

## Word vectors
Without going into too much detail, I will explain how to create sentence vectors and how to build a machine learning model on top of them. This post uses GloVe vectors.
http://www-nlp.stanford.edu/data/glove.840B.300d.zip

In [53]:
# Load the GloVe vectors into a dictionary
embeddings_index = {}
f = open('./input/glove.840B.300d.txt', 'rt', encoding='UTF8')
for line in tqdm(f):
    try:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    except:
        f.__next__()
f.close()

print('Found %s word vectors.' % len(embeddings_index))


0it [00:00, ?it/s][A
1193it [00:00, 11919.18it/s][A
2408it [00:00, 11984.28it/s][A
3611it [00:00, 11994.68it/s][A
4828it [00:00, 12043.50it/s][A
6042it [00:00, 12068.96it/s][A
7259it [00:00, 12095.87it/s][A
8470it [00:00, 12096.79it/s][A
9681it [00:00, 12097.46it/s][A
10898it [00:00, 12115.82it/s][A
12100it [00:01, 12083.61it/s][A
13275it [00:01, 11941.67it/s][A
14479it [00:01, 11967.77it/s][A
15689it [00:01, 12003.87it/s][A
16880it [00:01, 11972.26it/s][A
18091it [00:01, 12010.00it/s][A
19303it [00:01, 12039.53it/s][A
20505it [00:01, 12030.38it/s][A
21706it [00:01, 12020.97it/s][A
22907it [00:01, 11871.76it/s][A
24115it [00:02, 11930.25it/s][A
25308it [00:02, 11926.92it/s][A
26513it [00:02, 11960.34it/s][A
27730it [00:02, 12019.24it/s][A
28944it [00:02, 12051.94it/s][A
30153it [00:02, 12060.05it/s][A
31359it [00:02, 12056.74it/s][A
32568it [00:02, 12063.41it/s][A
33775it [00:02, 12062.10it/s][A
34982it [00:02, 12061.18it/s][A
36189it [00:03, 12060.53it/s

293914it [00:24, 11805.32it/s][A
295136it [00:24, 11923.56it/s][A
296331it [00:24, 11928.23it/s][A
297539it [00:24, 11970.12it/s][A
298748it [00:24, 12002.57it/s][A
299958it [00:24, 12028.37it/s][A
301170it [00:25, 12052.43it/s][A
302391it [00:25, 12096.01it/s][A
303602it [00:25, 12096.91it/s][A
304817it [00:25, 12109.48it/s][A
306030it [00:25, 12112.34it/s][A
307253it [00:25, 12144.10it/s][A
308468it [00:25, 12142.56it/s][A
309688it [00:25, 12156.43it/s][A
310904it [00:25, 12117.81it/s][A
312116it [00:25, 12115.16it/s][A
313328it [00:26, 12113.32it/s][A
314542it [00:26, 12118.00it/s][A
315765it [00:26, 12148.09it/s][A
316980it [00:26, 12145.35it/s][A
318195it [00:26, 12143.43it/s][A
319410it [00:26, 12142.09it/s][A
320625it [00:26, 12141.15it/s][A
321840it [00:26, 12104.16it/s][A
323051it [00:26, 12102.62it/s][A
324262it [00:26, 12101.54it/s][A
325477it [00:27, 12112.74it/s][A
326689it [00:27, 12075.36it/s][A
327899it [00:27, 12079.47it/s][A
329107it [00:2

584476it [00:48, 12032.53it/s][A
585690it [00:48, 12061.30it/s][A
586900it [00:48, 12069.60it/s][A
588115it [00:48, 12090.33it/s][A
589327it [00:48, 12095.91it/s][A
590543it [00:49, 12111.77it/s][A
591755it [00:49, 12110.93it/s][A
592967it [00:49, 12074.13it/s][A
594175it [00:49, 11964.86it/s][A
595381it [00:49, 11989.99it/s][A
596589it [00:49, 12013.60it/s][A
597800it [00:49, 12039.09it/s][A
599005it [00:49, 12039.08it/s][A
600209it [00:49, 11928.66it/s][A
601403it [00:49, 11857.64it/s][A
602590it [00:50, 11822.65it/s][A
603773it [00:50, 11681.44it/s][A
604942it [00:50, 11680.81it/s][A
606147it [00:50, 11786.02it/s][A
607345it [00:50, 11840.36it/s][A
608530it [00:50, 11840.02it/s][A
609733it [00:50, 11893.18it/s][A
610924it [00:50, 11894.97it/s][A
612138it [00:50, 11964.21it/s][A
613342it [00:50, 11983.60it/s][A
614559it [00:51, 12035.66it/s][A
615772it [00:51, 12060.53it/s][A
616979it [00:51, 12024.02it/s][A
618184it [00:51, 12028.52it/s][A
619396it [00:5

874537it [01:12, 12045.02it/s][A
875742it [01:12, 12043.22it/s][A
876947it [01:12, 12041.97it/s][A
878158it [01:12, 12059.02it/s][A
879367it [01:13, 12065.01it/s][A
880578it [01:13, 12075.19it/s][A
881786it [01:13, 12073.32it/s][A
882994it [01:13, 12000.02it/s][A
884195it [01:13, 11963.84it/s][A
885399it [01:13, 11983.33it/s][A
886608it [01:13, 12011.88it/s][A
887822it [01:13, 12046.77it/s][A
889033it [01:13, 12062.38it/s][A
890241it [01:13, 12064.38it/s][A
891450it [01:14, 12068.77it/s][A
892657it [01:14, 12065.83it/s][A
893872it [01:14, 12087.69it/s][A
895081it [01:14, 12085.08it/s][A
896292it [01:14, 12089.26it/s][A
897511it [01:14, 12116.00it/s][A
898723it [01:14, 12113.90it/s][A
899938it [01:14, 12121.41it/s][A
901151it [01:14, 12120.67it/s][A
902364it [01:14, 12120.17it/s][A
903577it [01:15, 12119.81it/s][A
904794it [01:15, 12131.51it/s][A
906008it [01:15, 12130.76it/s][A
907222it [01:15, 12093.93it/s][A
908440it [01:15, 12116.31it/s][A
909652it [01:1

1161185it [01:36, 12063.27it/s][A
1162395it [01:36, 12070.97it/s][A
1163603it [01:36, 11718.78it/s][A
1164778it [01:36, 11689.91it/s][A
1165978it [01:36, 11778.09it/s][A
1167192it [01:37, 11881.20it/s][A
1168402it [01:37, 11942.78it/s][A
1169598it [01:37, 11944.69it/s][A
1170814it [01:37, 12005.24it/s][A
1172023it [01:37, 12027.28it/s][A
1173227it [01:37, 12027.82it/s][A
1174431it [01:37, 12028.18it/s][A
1175634it [01:37, 11989.46it/s][A
1176851it [01:37, 12039.80it/s][A
1178063it [01:37, 12060.47it/s][A
1179275it [01:38, 12074.99it/s][A
1180490it [01:38, 12094.11it/s][A
1181708it [01:38, 12116.45it/s][A
1182920it [01:38, 12114.22it/s][A
1184132it [01:38, 12076.40it/s][A
1185340it [01:38, 12038.06it/s][A
1186544it [01:38, 12035.35it/s][A
1187748it [01:38, 11997.45it/s][A
1188948it [01:38, 11994.95it/s][A
1190148it [01:38, 11993.18it/s][A
1191348it [01:39, 11956.08it/s][A
1192556it [01:39, 11989.76it/s][A
1193760it [01:39, 12001.50it/s][A
1194968it [01:39, 12

1444646it [02:00, 12085.28it/s][A
1445855it [02:00, 12083.41it/s][A
1447064it [02:00, 12082.08it/s][A
1448273it [02:00, 12081.15it/s][A
1449482it [02:00, 12044.39it/s][A
1450689it [02:00, 12048.77it/s][A
1451895it [02:00, 12048.84it/s][A
1453110it [02:00, 12075.75it/s][A
1454318it [02:00, 12073.72it/s][A
1455526it [02:00, 12072.32it/s][A
1456741it [02:01, 12092.23it/s][A
1457957it [02:01, 12109.18it/s][A
1459175it [02:01, 12127.05it/s][A
1460391it [02:01, 12133.61it/s][A
1461605it [02:01, 12132.21it/s][A
1462819it [02:01, 12094.94it/s][A
1464038it [02:01, 12120.02it/s][A
1465251it [02:01, 12119.71it/s][A
1466463it [02:01, 12116.49it/s][A
1467675it [02:01, 12114.24it/s][A
1468887it [02:02, 12058.27it/s][A
1470101it [02:02, 12079.38it/s][A
1471310it [02:02, 12079.28it/s][A
1472523it [02:02, 12091.16it/s][A
1473744it [02:02, 12123.28it/s][A
1474957it [02:02, 12121.98it/s][A
1476170it [02:02, 12121.08it/s][A
1477383it [02:02, 12120.46it/s][A
1478596it [02:02, 12

1728392it [02:23, 12052.70it/s][A
1729603it [02:23, 12066.55it/s][A
1730813it [02:23, 12073.28it/s][A
1732021it [02:23, 12072.00it/s][A
1733229it [02:23, 12071.11it/s][A
1734437it [02:23, 12070.48it/s][A
1735653it [02:24, 12093.93it/s][A
1736868it [02:24, 12107.38it/s][A
1738079it [02:24, 12104.87it/s][A
1739290it [02:24, 12066.89it/s][A
1740497it [02:24, 12064.53it/s][A
1741708it [02:24, 12074.86it/s][A
1742921it [02:24, 12088.05it/s][A
1744137it [02:24, 12106.25it/s][A
1745348it [02:24, 12067.86it/s][A
1746555it [02:24, 12065.21it/s][A
1747762it [02:25, 12063.35it/s][A
1748969it [02:25, 12062.06it/s][A
1750178it [02:25, 12067.15it/s][A
1751395it [02:25, 12094.53it/s][A
1752607it [02:25, 12098.86it/s][A
1753820it [02:25, 12104.89it/s][A
1755044it [02:25, 12141.82it/s][A
1756259it [02:25, 12140.96it/s][A
1757474it [02:25, 12104.04it/s][A
1758685it [02:25, 12102.53it/s][A
1759896it [02:26, 12101.46it/s][A
1761107it [02:26, 11957.22it/s][A
1762312it [02:26, 11

2011515it [02:46, 11983.37it/s][A
2012726it [02:47, 12017.83it/s][A
2013939it [02:47, 12047.99it/s][A
2015144it [02:47, 12045.31it/s][A
2016357it [02:47, 12067.31it/s][A
2017571it [02:47, 12085.74it/s][A
2018783it [02:47, 12092.70it/s][A
2019997it [02:47, 12103.56it/s][A
2021208it [02:47, 12102.19it/s][A
2022419it [02:47, 12101.23it/s][A
2023630it [02:47, 12100.56it/s][A
2024841it [02:48, 12100.09it/s][A
2026056it [02:48, 12111.72it/s][A
2027268it [02:48, 12110.90it/s][A
2028480it [02:48, 12110.33it/s][A
2029692it [02:48, 12109.92it/s][A
2030903it [02:48, 12070.41it/s][A
2032114it [02:48, 12078.97it/s][A
2033328it [02:48, 12093.93it/s][A
2034542it [02:48, 12104.41it/s][A
2035764it [02:48, 12135.58it/s][A
2036985it [02:49, 12154.51it/s][A
2038201it [02:49, 12152.83it/s][A
2039417it [02:49, 12151.66it/s][A
2040633it [02:49, 12150.85it/s][A
2041849it [02:49, 12113.93it/s][A
2043061it [02:49, 12112.45it/s][A
2044273it [02:49, 12111.41it/s][A
2045485it [02:49, 12

Found 2195864 word vectors.





In [56]:
# Function that creates a normalized vector for a whole sentence
def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    M = np.array(M) 
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
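
The function sums the word vectors of a sentence and rescales the result to unit length. The sketch below reproduces that idea with a hypothetical three-dimensional embedding table and plain `str.split` instead of NLTK tokenization, so it runs without the GloVe file. The names `toy_embeddings` and `toy_sent2vec` are made up for the example.

```python
import numpy as np

# Hypothetical 3-d vectors, not real GloVe values
toy_embeddings = {
    'raven': np.array([1.0, 0.0, 0.0]),
    'night': np.array([0.0, 2.0, 0.0]),
}

def toy_sent2vec(sentence, embeddings, dim=3):
    words = [w for w in sentence.lower().split() if w.isalpha()]
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        # No known word: fall back to a zero vector
        return np.zeros(dim)
    v = np.sum(vectors, axis=0)
    norm = np.sqrt((v ** 2).sum())
    return v / norm if norm > 0 else np.zeros(dim)

vec = toy_sent2vec('Raven night raven', toy_embeddings)
empty = toy_sent2vec('unknown words only', toy_embeddings)

print(vec)    # sum is [2, 2, 0], scaled to unit length
print(empty)  # [0. 0. 0.]
```

Because of the normalization, sentence length does not affect the magnitude of the feature vector, only its direction.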

In [57]:
# Create sentence vectors for the training and validation sets with the function above
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]



In [58]:
xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)

Let's see how xgboost performs on the GloVe features.

In [59]:
clf = xgb.XGBClassifier(nthread=10, silent=False)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict_proba(xvalid_glove)

print('logloss: %0.3f' %multiclass_logloss(yvalid, predictions))

logloss: 0.797


## Deep Learning
But this is the era of deep learning. We will train an LSTM and a simple dense network on the GloVe features, starting with the dense network.

In [64]:
# Scale the data before feeding it to a neural network
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)

In [65]:
# One-hot encode the labels for the neural network
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)
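
One-hot encoding turns each integer label into a row with a single 1 in the column of its class, which is the target format that `categorical_crossentropy` expects. A NumPy-only sketch of the same transformation, on made-up labels, looks like this:

```python
import numpy as np

labels = np.array([0, 2, 1, 0])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)  # [0 2 1 0]
```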

In [66]:
# Build a simple 3-layer sequential neural network
model = Sequential()

model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(3))
model.add(Activation('softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [67]:
model.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=64,
          epochs=5, verbose=1,
          validation_data=(xvalid_glove_scl, yvalid_enc))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x27c50448d48>

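To get better results, you need to keep tuning the parameters of the neural network, add more layers and increase the dropout. Here I only want to show that, without any optimization, it is quick to implement and run and gives better results than xgboost.

To go further with LSTMs, we need to tokenize the text data.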

In [68]:
token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
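
`pad_sequences` turns variable-length lists of word indices into a fixed-width integer matrix. With its default settings it pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The following is a NumPy sketch of that default behavior; `pad_pre` is a hypothetical helper written for illustration, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Zero-pad on the left and truncate from the left, like the Keras defaults."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded

toy_seqs = [[5, 3, 8], [7], []]
result = pad_pre(toy_seqs, maxlen=2)
print(result)
# [[3 8]
#  [0 7]
#  [0 0]]
```

Index 0 is never assigned to a real word by the tokenizer, so it can safely act as the padding value.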

In [70]:
# Create an embedding matrix for the words in our dataset
embedding_matrix = np.zeros((len(word_index)+1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

100%|████████████████████████████████████████████████████████████████████████| 25943/25943 [00:00<00:00, 381169.47it/s]
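
The matrix has `len(word_index) + 1` rows because tokenizer indices start at 1 and row 0 is reserved for padding. Words without a pretrained vector keep an all-zero row. A toy version with hypothetical two-dimensional vectors makes the layout visible:

```python
import numpy as np

# Hypothetical word index and 2-d vectors, for illustration only
toy_word_index = {'the': 1, 'raven': 2, 'zzzunknown': 3}
toy_vectors = {'the': np.array([0.1, 0.2]),
               'raven': np.array([0.3, 0.4])}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
# row 0: padding, row 3: word with no pretrained vector
```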


In [71]:
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    300,
                    weights = [embedding_matrix],
                    input_length = max_len,
                    trainable = False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')



In [72]:
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, verbose=1, validation_data=(xvalid_pad, yvalid_enc))

Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x27e856a3248>

The score is now below 0.5. I ran the model for all epochs without early stopping, but it is better to stop at the best iteration. Let's run it again with early stopping.

In [73]:
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    300,
                    weights=[embedding_matrix],
                    input_length = max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with an early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc),
          callbacks=[earlystop])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100


<tensorflow.python.keras.callbacks.History at 0x27e8d438f88>
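
The patience rule used by the callback can be sketched in a few lines of plain Python. This is a simplified model of the logic, assuming a loss that should decrease; `early_stop_epoch` and the list of validation losses are hypothetical and not taken from the run above.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

toy_losses = [0.9, 0.7, 0.6, 0.65, 0.63, 0.61, 0.5]
stopped_at = early_stop_epoch(toy_losses, patience=3)
print(stopped_at)  # 6: three epochs in a row without beating 0.6
```

Note that in this toy series the later value 0.5 is never reached, which is the trade-off of a small patience. Also, unless `restore_best_weights=True` is passed, Keras keeps the weights from the last epoch rather than from the best one.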

The dropout rate is high because the model tends to overfit with little or no dropout.

Let's see whether a bi-directional LSTM gives better results.

In [74]:
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    300,
                    weights = [embedding_matrix],
                    input_length = max_len,
                    trainable = False))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')

model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100


<tensorflow.python.keras.callbacks.History at 0x27e97660848>

Let's try two stacked GRU layers.

In [75]:
model = Sequential()
model.add(Embedding(len(word_index)+1,
                    300,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')

model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100


<tensorflow.python.keras.callbacks.History at 0x27e9c82a088>

This gives better results. Keep optimizing and the performance will keep improving.

To get a top score, you need to ensemble the models.

## Ensembling


In [82]:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s %(message)s",
    datefmt="%H:%M:%S", stream=sys.stdout)
logger = logging.getLogger(__name__)


class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        """
        Ensembler init function
        :param model_dict: model dictionary, see README for its format
        :param num_folds: the number of folds for ensembling
        :param task_type: classification or regression
        :param optimize: the function to optimize for, e.g. AUC, logloss, etc. Must have two arguments y_test and y_pred
        :param lower_is_better: is lower value of optimization function better or higher
        :param save_path: path to which model pickles will be dumped to along with generated predictions, or None
        """

        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

    def fit(self, training_data, y, lentrain):
        """
        :param training_data: training data in tabular format
        :param y: binary, multi-class or regression
        :return: chain of models to be used in prediction
        """

        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info("Found %d classes", self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):

            if level == 0:
                temp_train = self.training_data
            else:
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info("Training Level %d Fold # %d. Model # %d", level, foldnum, model_num)

                    if level != 0:
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:
                        l0_training_data = temp_train[0][model_num]
                        if isinstance(l0_training_data, list):
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info("Predicting Level %d. Fold # %d. Model # %d", level, foldnum, model_num)

                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                        (model_num * self.num_classes):(model_num * self.num_classes) +
                                                       self.num_classes] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info("Level %d. Fold # %d. Model # %d. Validation Score = %f", level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info("Level %d. Model # %d. Mean Score = %f. Std Dev = %f", level, model_num,
                            avg_score, std_score)

            logger.info("Saving predictions for level # %d", level)
            train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
            train_predictions_df.to_csv(os.path.join(self.save_path, "train_predictions_level_" + str(level) + ".csv"),
                                        index=False, header=None)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        """
        Refit every model on the full training data, level by level, and predict on the test data.
        Requires a prior call to fit(), which sets self.y_enc and the out-of-fold predictions.

        :param test_data: dict mapping level -> list of feature matrices, one per model
        :param lentest: number of test rows
        :return: dict mapping level -> test predictions
        """
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        for level in range(self.levels):
            if level == 0:
                temp_train = self.training_data
                temp_test = self.test_data
            else:
                temp_train = self.train_prediction_dict[level - 1]
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                logger.info("Training Fulldata Level %d. Model # %d", level, model_num)
                if level == 0:
                    model.fit(temp_train[0][model_num], self.y_enc)
                else:
                    model.fit(temp_train, self.y_enc)

                logger.info("Predicting Test Level %d. Model # %d", level, model_num)

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): (model_num * self.num_classes) +
                                                                                        self.num_classes] = temp_test_predictions

                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

            test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
            test_predictions_df.to_csv(os.path.join(self.save_path, "test_predictions_level_" + str(level) + ".csv"),
                                       index=False, header=None)

        return self.test_prediction_dict

In [83]:
train_data_dict = {0: [xtrain_tfv, xtrain_ctv, xtrain_tfv, xtrain_ctv], 1: [xtrain_glove]}
test_data_dict = {0: [xvalid_tfv, xvalid_ctv, xvalid_tfv, xvalid_ctv], 1: [xvalid_glove]}

model_dict = {0: [LogisticRegression(), LogisticRegression(), MultinomialNB(alpha=0.1), MultinomialNB()],
              1: [xgb.XGBClassifier(silent=True, n_estimators=120, max_depth=7)]}
ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_glove.shape[0])
preds = ens.predict(test_data_dict, lentest=xvalid_glove.shape[0])

[22:24:22] INFO Found 3 classes
[22:24:22] INFO Training Level 0 Fold # 1. Model # 0
[22:24:23] INFO Predicting Level 0. Fold # 1. Model # 0
[22:24:23] INFO Level 0. Fold # 1. Model # 0. Validation Score = 0.626621
[22:24:23] INFO Training Level 0 Fold # 2. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[22:24:24] INFO Predicting Level 0. Fold # 2. Model # 0
[22:24:24] INFO Level 0. Fold # 2. Model # 0. Validation Score = 0.616474
[22:24:24] INFO Training Level 0 Fold # 3. Model # 0
[22:24:24] INFO Predicting Level 0. Fold # 3. Model # 0
[22:24:24] INFO Level 0. Fold # 3. Model # 0. Validation Score = 0.619633
[22:24:24] INFO Level 0. Model # 0. Mean Score = 0.620909. Std Dev = 0.004239
[22:24:24] INFO Training Level 0 Fold # 1. Model # 1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[22:24:34] INFO Predicting Level 0. Fold # 1. Model # 1
[22:24:34] INFO Level 0. Fold # 1. Model # 1. Validation Score = 0.573485
[22:24:34] INFO Training Level 0 Fold # 2. Model # 1
[22:24:44] INFO Predicting Level 0. Fold # 2. Model # 1
[22:24:44] INFO Level 0. Fold # 2. Model # 1. Validation Score = 0.563451
[22:24:44] INFO Training Level 0 Fold # 3. Model # 1
[22:24:54] INFO Predicting Level 0. Fold # 3. Model # 1
[22:24:54] INFO Level 0. Fold # 3. Model # 1. Validation Score = 0.567765
[22:24:54] INFO Level 0. Model # 1. Mean Score = 0.568233. Std Dev = 0.004110
[22:24:54] INFO Training Level 0 Fold # 1. Model # 2
[22:24:54] INFO Predicting Level 0. Fold # 1. Model # 2
[22:24:54] INFO Level 0. Fold # 1. Model # 2. Validation Score = 0.463292
[22:24:54] INFO Training Level 0 Fold # 2. Model # 2
[22:24:54] INFO Predicting Level 0. Fold # 2. Model # 2
[22:24:54] INFO Level 0. Fold # 2. Model # 2. Validation Score = 0.456477
[22:24:54] INFO Training Level 0 Fold # 3. Model # 2
[22:24:

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[22:25:21] INFO Predicting Test Level 0. Model # 1
[22:25:21] INFO Training Fulldata Level 0. Model # 2
[22:25:21] INFO Predicting Test Level 0. Model # 2
[22:25:21] INFO Training Fulldata Level 0. Model # 3
[22:25:21] INFO Predicting Test Level 0. Model # 3
[22:25:21] INFO Training Fulldata Level 1. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[22:25:29] INFO Predicting Test Level 1. Model # 0
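
The "ITERATIONS REACHED LIMIT" messages in the output above come from `LogisticRegression` stopping at its default iteration cap before the solver has converged. One common remedy, sketched here on synthetic data rather than on the notebook's TF-IDF features (an assumption made so the example runs on its own), is to raise `max_iter`. The value 1000 is an illustrative choice, not a setting taken from the original kernel.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for the real text features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Give the solver more room to converge than the default iteration cap.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print("iterations used:", clf.n_iter_)
```

If the warning persists at a higher cap, scaling the features or trying another solver, as the warning text itself suggests, may be the better fix.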


In [84]:
# Check the error (log loss) of the level-1 predictions
multiclass_logloss(yvalid, preds[1])

0.42416139841997214

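Ensembling improved the score considerably: the stacked model reaches a multiclass log loss of about 0.424 on the validation set.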
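
The core idea behind the `Ensembler` class is stacking on out-of-fold predictions: each level-0 model predicts probabilities for rows it was not trained on, and those probabilities become the features of the level-1 model. The following is a minimal sketch of that idea using scikit-learn helpers, not a reproduction of the class above. The synthetic dataset, the choice of `GaussianNB`, and the logistic-regression meta-model are all assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the text features (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

base_models = [LogisticRegression(max_iter=1000), GaussianNB()]
kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

oof_blocks, valid_blocks = [], []
for model in base_models:
    # Out-of-fold probabilities: every training row is predicted by a model
    # that did not see it during fitting.
    oof_blocks.append(cross_val_predict(model, X_tr, y_tr, cv=kf,
                                        method="predict_proba"))
    # For held-out data, refit on the full training set and predict.
    valid_blocks.append(model.fit(X_tr, y_tr).predict_proba(X_va))

# Level-1 features: one block of class probabilities per base model.
level1_train = np.hstack(oof_blocks)
level1_valid = np.hstack(valid_blocks)

# Level-1 (meta) model trained on the stacked probabilities.
meta = LogisticRegression(max_iter=1000).fit(level1_train, y_tr)
stacked_loss = log_loss(y_va, meta.predict_proba(level1_valid))
print(level1_train.shape, level1_valid.shape, stacked_loss)
```

Using out-of-fold predictions rather than in-sample ones keeps the meta-model from learning on probabilities that the base models produced for data they had already memorized.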