Text classification methods covered:

- TF-IDF
- Count Features
- Logistic Regression
- Naive Bayes
- SVM
- Xgboost
- Grid Search
- Word Vectors
- Dense Network
- LSTM
- GRU
- Ensembling

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping

Using TensorFlow backend.


## Data

In [2]:
data = pd.read_excel('./datasets/复旦大学中文文本分类语料.xlsx', sheet_name='sheet1')

In [3]:
data.sample(5)

Unnamed: 0,分类,正文
4871,农业,﻿【 文献号 】1-784\n【原文出处】农业技术经济\n【原刊地名】京\n【原刊期号】19...
6267,体育,﻿【 文献号 】1-3107\n【原文出处】启蒙\n【原刊地名】津\n【原刊期号】19950...
9082,经济,﻿【 文献号 】2-799\n【原文出处】预测\n【原刊地名】合肥\n【原刊期号】20000...
110,艺术,﻿【 文献号 】2-568\n【原文出处】太原日报\n【原刊期号】19950418\n【原刊...
756,文学,﻿【 日期 】19960906\n【 版号 】11\n【 标题 】剖析当代知识分子心灵\n【...


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9249 entries, 0 to 9248
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   分类      9249 non-null   object
 1   正文      9249 non-null   object
dtypes: object(2)
memory usage: 144.6+ KB


In [5]:
data.分类.unique()

array(['艺术', '文学', '哲学', '通信', '能源', '历史', '矿藏', '空间', '教育', '交通', '计算机',
       '环境', '电子', '农业', '体育', '时政', '医疗', '经济', '法律'], dtype=object)

### Word Segmentation

Special symbols can be filtered out first so that only Chinese characters are kept.

In [None]:
# Not used in this run.
# import re


# def keep_chinese(txt, pattern=r'[\u4e00-\u9fa5]+'):
#     """Return the runs of Chinese characters found in txt."""
#     return re.findall(pattern, txt)


# data['正则结果'] = data['正文'].apply(lambda txt: ' '.join(keep_chinese(txt)))
# # If enabled, apply the segmentation step below to the '正则结果' column instead of '正文'.
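
As a minimal runnable sketch of the filtering idea above, the snippet below applies the same character range to a short sample string. The helper name `keep_chinese` and the sample text are assumptions made for illustration; the notebook itself did not run this step.

```python
import re

# Same character range as in the commented-out cell; the helper name is hypothetical.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(txt):
    """Return the runs of Chinese characters in txt, joined by single spaces."""
    return ' '.join(CHINESE_RUN.findall(txt))


sample = '【 文献号 】1-784\n【原文出处】农业技术经济'
cleaned = keep_chinese(sample)
print(cleaned)
```

Brackets, digits, and line breaks fall outside the range, so only the Chinese words survive.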

A segmenter with better segmentation quality can be used, such as pyltp, THULAC, or HanLP.

In [7]:
import os

# Path to the LTP model directory
LTP_DATA_DIR = r'D:\ProgramData\nlp_package\ltp_v34'
# Path to the word segmentation model, named `cws.model`
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')

In [8]:
from pyltp import Segmentor


segmentor = Segmentor()  # create an instance
segmentor.load(cws_model_path)  # load the model

data['分词结果'] = data['正文'].apply(lambda i: ' '.join(segmentor.segment(i)))

In [9]:
segmentor.release()  # release the model

In [10]:
data.sample(5)

Unnamed: 0,分类,正文,分词结果
3122,计算机,﻿微型机与应用\nMICROCOMPUTER & ITS APPLICATIONS\n199...,﻿ 微型机 与 应用 \n MICROCOMPUTER & ITS APPLICATIONS...
7885,经济,﻿【 文献号 】2-1862\n【原文出处】上海社会科学院学术季刊\n【原刊期号】20000...,﻿ 【 文献号 】 2-1862 \n 【 原文 出处 】 上海 社会 科学院 学术季 刊\...
8731,经济,﻿【 文献号 】2-1117\n【原文出处】财政研究\n【原刊地名】京\n【原刊期号】199...,﻿ 【 文献号 】 2-1117 \n 【 原文 出处 】 财政 研究 \n 【 原刊 地名...
5498,农业,﻿湖北农业科学\nHUBEI AGRICULTURAL SCIENCES\n1998年第6期...,﻿ 湖北 农业 科学\n HUBEI AGRICULTURAL SCIENCES \n 19...
2369,计算机,﻿自动化学报\nACTA AUTOMATICA SINICA\n2000　Vol.26　No...,﻿ 自动化学 报\n ACTA AUTOMATICA SINICA \n 2000 Vol....


### Loss

In [11]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multiclass version of the logarithmic loss metric.
    :param actual: array of actual target classes
    :param predicted: matrix of predicted probabilities, one column per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
    
    # Clip probabilities away from 0 and 1 so that the logarithm stays finite
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
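
To make the metric concrete, here is a small worked example. It is a sketch of an equivalent formulation, not the notebook's function: instead of building a one-hot matrix it indexes the probability of the true class directly. The name `logloss_indexed` and the toy numbers are assumptions for illustration.

```python
import math

import numpy as np


def logloss_indexed(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs)))


labels = np.array([0, 1])
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
loss = logloss_indexed(labels, probs)

# Hand computation for the same two rows: -(ln 0.9 + ln 0.8) / 2
expected = -(math.log(0.9) + math.log(0.8)) / 2
print(round(loss, 4), round(expected, 4))

# A fully confident wrong prediction stays finite thanks to the clipping.
worst = logloss_indexed(np.array([0]), np.array([[0.0, 1.0]]))
```

Lower values are better, and a single confident mistake is penalized heavily, which is why the clipping constant matters.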

### label

In [13]:
label_encoder = preprocessing.LabelEncoder()
y = label_encoder.fit_transform(data.分类.values)

### dataset split

In [15]:
xtrain, xvalid, ytrain, yvalid = train_test_split(data.分词结果.values, y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)

In [16]:
print(xtrain.shape)
print(xvalid.shape)

(8324,)
(925,)


## Models

### Basic Models

TF-IDF (Term Frequency - Inverse Document Frequency) + Logistic Regression

Numeric tokens in the text are all mapped to the single placeholder "#NUMBER", which reduces noise to some extent.

In [17]:
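def number_normalizer(tokens):
    """Map all numeric tokens to a single placeholder.
    For many practical applications, tokens that begin with a digit are not very useful
    individually, so they are all treated as one 'number' class. Representing every
    number by the same symbol also reduces the dimensionality of the feature space.
    """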
    return ('#NUMBER' if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenizer = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: list(number_normalizer(tokenizer(doc)))
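
The effect of the normalizer is easiest to see on a handful of tokens. The snippet below repeats the function so that it runs on its own; the token list is made up for illustration.

```python
def number_normalizer(tokens):
    """Replace every token that starts with a digit by the '#NUMBER' placeholder."""
    return ('#NUMBER' if token[0].isdigit() else token for token in tokens)


tokens = ['2000', '年', '第6期', 'Vol.26', '3d']
normalized = list(number_normalizer(tokens))
print(normalized)
```

Only the first character is checked, so `'第6期'` and `'Vol.26'` are left unchanged. Note also that scikit-learn's default `token_pattern` keeps only tokens of two or more word characters, so single-character words never reach this function when it is used inside the vectorizer.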

In [18]:
with open('datasets/stopwords.txt', 'r', encoding='utf-8') as f: 
    stopwords_list = [w.strip() for w in f.readlines()]

In [19]:
tfidf_vectorizer = NumberNormalizingVectorizer(min_df=3,  
                                  max_df=0.5,
                                  max_features=None,                 
                                  ngram_range=(1, 2), 
                                  use_idf=True,
                                  smooth_idf=True,
                                  stop_words = stopwords_list)

tfidf_vectorizer.fit(data.分词结果.values)
xtrain_tfidf = tfidf_vectorizer.transform(xtrain)
xvalid_tfidf = tfidf_vectorizer.transform(xvalid)



In [None]:
clf = LogisticRegression(C=1.0,solver='lbfgs',multi_class='multinomial')

clf.fit(xtrain_tfidf, ytrain)

In [22]:
predictions = clf.predict_proba(xvalid_tfidf)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.589 


### BOW feature

In [23]:
count_vectorizer = CountVectorizer(min_df=3,
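                                    max_df=5,  # integer = absolute count: drops terms found in more than 5 documents (the TF-IDF cell uses the proportion 0.5)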
                                    ngram_range=(1, 2),
                                    stop_words=stopwords_list)

count_vectorizer.fit(data.分词结果.values)
xtrain_bow = count_vectorizer.transform(xtrain)
xvalid_bow = count_vectorizer.transform(xvalid)



In [24]:
clf = LogisticRegression(C=1.0,solver='lbfgs',multi_class='multinomial')

clf.fit(xtrain_bow, ytrain)
predictions = clf.predict_proba(xvalid_bow)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.774 
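
The `min_df` and `max_df` arguments mean different things depending on their type: scikit-learn treats an integer as an absolute document count and a float as a proportion of the corpus. The pure-Python sketch below mimics that rule on a toy corpus; `filter_by_document_frequency` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are absolute document counts; floats are proportions of the corpus.
    """
    n_docs = len(docs)
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = ['aa bb cc', 'aa bb', 'aa bb dd', 'aa ee']
kept_int = filter_by_document_frequency(docs, max_df=2)      # at most 2 documents
kept_float = filter_by_document_frequency(docs, max_df=0.8)  # at most 80% of documents
print(kept_int, kept_float)
```

With `min_df=3` and an integer `max_df=5`, only terms that occur in three to five documents remain, which is a much narrower vocabulary than the one kept by `max_df=0.5`.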


### Naive Bayes

In [25]:
# tf-idf feature
clf = MultinomialNB()
clf.fit(xtrain_tfidf, ytrain)
predictions = clf.predict_proba(xvalid_tfidf)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.886 


In [27]:
# bow feature
clf = MultinomialNB()
clf.fit(xtrain_bow, ytrain)
predictions = clf.predict_proba(xvalid_bow)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 1.914 


### SVM

In [None]:
# Before using the SVM, standardize the data and reduce its dimensionality with SVD.
# For SVM, a suitable range for the number of SVD components is generally 120 to 200.
svd = decomposition.TruncatedSVD(n_components=120)

svd.fit(xtrain_tfidf)
xtrain_svd = svd.transform(xtrain_tfidf)
xvalid_svd = svd.transform(xvalid_tfidf)

In [30]:
scaler = preprocessing.StandardScaler()
scaler.fit(xtrain_svd)
xtrain_svd_scl = scaler.transform(xtrain_svd)
xvalid_svd_scl = scaler.transform(xvalid_svd)
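
The two preprocessing steps can be sketched with plain NumPy. This is an illustration under simplifying assumptions: it uses an exact SVD on a small random matrix, whereas `TruncatedSVD` uses an approximate solver by default and works on sparse input. The matrix sizes and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 20))  # stand-in for a (documents x terms) TF-IDF matrix
k = 5                     # number of components to keep

# Project onto the top-k right singular vectors (no centering, as in TruncatedSVD).
_, _, vt = np.linalg.svd(X, full_matrices=False)
X_svd = X @ vt[:k].T

# Standardize each component to zero mean and unit variance.
mean = X_svd.mean(axis=0)
std = X_svd.std(axis=0)
X_svd_scl = (X_svd - mean) / std
print(X_svd_scl.shape)
```

In the notebook the mean and standard deviation are estimated on the training split only and then reused for the validation split, which is what `scaler.fit` followed by two `transform` calls achieves.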

In [33]:
clf = SVC(C=1.0, probability=True)

clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.358 


### XGBoost

This part runs slowly; LightGBM is faster.

In [38]:
# Usage without the sklearn API

# # tf-idf
# xtrain_xgb = xgb.DMatrix(xtrain_tfidf, label=ytrain)
# xvalid_xgb = xgb.DMatrix(xvalid_tfidf, label=yvalid)

# # setup parameters for xgboost
# param = {}
# # use softmax multi-class classification
# param['objective'] = 'multi:softmax'
# # learning rate (step size shrinkage)
# param['eta'] = 0.1
# param['max_depth'] = 7
# param['silent'] = 1
# param['nthread'] = 4
# param['num_class'] = len(data.分类.unique())
# param['eval_metric'] = 'mlogloss'
# param['colsample_bytree']  = 0.8
# param['subsample']  = 0.8

# num_round = 100

# clf = xgb.train(param, xtrain_xgb, num_boost_round=num_round)

# predictions = clf.predict(xvalid_xgb)  # class id list

In [None]:
# sklearn interface

clf = xgb.XGBClassifier(objective='multi:softmax',
                       max_depth=7,
                       n_estimators=50, 
                       colsample_bytree=0.8, 
                       subsample=0.8, 
                       nthread=10, 
                       learning_rate=0.1)

clf.fit(xtrain_tfidf.tocsc(), ytrain)  # Sparse col
predictions = clf.predict_proba(xvalid_tfidf.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
# Bow feature
clf = xgb.XGBClassifier(objective='multi:softmax', 
                       max_depth=7,
                       n_estimators=50, 
                       colsample_bytree=0.8, 
                       subsample=0.8, 
                       nthread=10, 
                       learning_rate=0.1)

clf.fit(xtrain_bow.tocsc(), ytrain)  # Sparse col
predictions = clf.predict_proba(xvalid_bow.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
# tf-idf + svd
clf = xgb.XGBClassifier(objective='multi:softmax',
                       max_depth=7,
                       n_estimators=50, 
                       colsample_bytree=0.8, 
                       subsample=0.8, 
                       nthread=10, 
                       learning_rate=0.1)

clf.fit(xtrain_svd, ytrain)  # SVD output is a dense array, so no sparse conversion is needed
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
# tf-idf + svd + scaled
clf = xgb.XGBClassifier(objective='multi:softmax',
                       max_depth=7,
                       n_estimators=50, 
                       colsample_bytree=0.8, 
                       subsample=0.8, 
                       nthread=10, 
                       learning_rate=0.1)

clf.fit(xtrain_svd_scl, ytrain)  # scaled SVD output is a dense array, so no sparse conversion is needed
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

### LightGBM

In [49]:
import lightgbm as lgb

In [50]:
# With the native LightGBM API, the datasets would be built like this:
# xtrain_lgb = lgb.Dataset(xtrain_tfidf, ytrain)
# xvalid_lgb = lgb.Dataset(xvalid_tfidf, yvalid, reference=xtrain_lgb)

clf = lgb.LGBMClassifier(num_leaves=31,
                        max_depth=7,
                        n_estimators=50,
                        objective='multiclass',
                        subsample=0.8,
                        colsample_bytree=0.8,
                        learning_rate=0.1)

clf.fit(xtrain_tfidf, ytrain)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.8,
               importance_type='split', learning_rate=0.1, max_depth=7,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=50, n_jobs=-1, num_leaves=31,
               objective='multiclass', random_state=None, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=0.8,
               subsample_for_bin=200000, subsample_freq=0)

In [51]:
predictions = clf.predict_proba(xvalid_tfidf)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.238 


### Pipeline Grid Search

In [None]:
# This cell takes a long time to run.

# Custom scorer: log loss on predicted probabilities (lower is better)
mll_scorer = metrics.make_scorer(multiclass_logloss, 
                                    greater_is_better=False, 
                                    needs_proba=True)

# Initialize SVD
svd = TruncatedSVD()
    
# Initialize the standard scaler
scl = preprocessing.StandardScaler()

# Logistic regression again; the 'saga' solver supports both the l1 and l2 penalties searched below
lr_model = LogisticRegression(solver='saga')

# Build the pipeline
clf = pipeline.Pipeline([('svd', svd),
                        ('scl', scl),
                        ('lr', lr_model)])

# Parameter grid to search
param_grid = {'svd__n_components' : [120, 180],
              'lr__C': [0.1, 1.0, 10], 
              'lr__penalty': ['l1', 'l2']}

# Initialize the grid search
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                                 verbose=2, n_jobs=1, refit=True, cv=2)

# Fit the grid search
model.fit(xtrain_tfidf, ytrain)

print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
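
A quick way to estimate the cost of this search is to enumerate the grid by hand. The sketch below uses only the standard library and the same parameter values as the cell above; it illustrates what the search iterates over and is not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {'svd__n_components': [120, 180],
              'lr__C': [0.1, 1.0, 10],
              'lr__penalty': ['l1', 'l2']}
cv_folds = 2

# One candidate per combination of parameter values.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

# Each candidate is fitted once per fold; the final refit on all data is not counted.
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

Because the scorer is built with `greater_is_better=False`, the reported best score is the negated log loss, so values closer to zero are better.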

### Word2Vec Feature

In [52]:
# One document per row
doc_word_list = [dwords.split() for dwords in data['分词结果']]

In [59]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


# sentences = LineSentence(file_path)
model = Word2Vec(doc_word_list, min_count=5, window=7, size=100, workers=4)

# model.save('word2vec_model_100v.w2v')

In [60]:
embeddings_index = dict(zip(model.wv.index2word, model.wv.vectors))

In [62]:
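def sent2vec(s):
    # Convert a sentence into a normalized (unit-length) vector
    from pyltp import Segmentor

    # NOTE: the segmentation model is reloaded on every call, which is likely why this step is slow
    segmentor = Segmentor()  # create an instance
    segmentor.load(cws_model_path)  # load the model
    words = segmentor.segment(s)
    segmentor.release()
    words = [w for w in words if w not in stopwords_list]
    
    M = []
    for w in words:
        try:
            #M.append(embeddings_index[w])
            M.append(model.wv.get_vector(w))
        except KeyError:  # word not in the Word2Vec vocabulary
            continue
            
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        # No known words: return a zero vector with the Word2Vec dimension (size=100)
        return np.zeros(100)
    return v / np.sqrt((v ** 2).sum())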
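
The core of `sent2vec` is a sum of word vectors followed by L2 normalization. The sketch below isolates that step with a hypothetical three-dimensional embedding table, so it runs without pyltp or a trained Word2Vec model; `toy_embeddings` and `sentence_vector` are names invented for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, standing in for the trained Word2Vec vectors.
toy_embeddings = {
    '农业': np.array([1.0, 0.0, 0.0]),
    '技术': np.array([0.0, 2.0, 0.0]),
    '经济': np.array([0.0, 0.0, 2.0]),
}


def sentence_vector(words, embeddings, dim):
    """Sum the vectors of known words and scale the result to unit length."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known word: fall back to a zero vector
    v = np.sum(vectors, axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


vec = sentence_vector(['农业', '技术', '未知词'], toy_embeddings, dim=3)
empty = sentence_vector(['未知词'], toy_embeddings, dim=3)
print(vec, empty)
```

The fallback vector must have the same dimension as the embeddings; otherwise the per-document vectors cannot be stacked into one matrix.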

In [68]:
from tqdm.notebook import tqdm as tqdm_notebook

# Slow: takes about an hour
xtrain_w2v = [sent2vec(x) for x in tqdm_notebook(xtrain)]
xvalid_w2v = [sent2vec(x) for x in tqdm_notebook(xvalid)]

In [69]:
xtrain_w2v = np.array(xtrain_w2v)
xvalid_w2v = np.array(xvalid_w2v)

### Deep Model

In [77]:
# Simple test

max_len = 70

# One-hot encode the labels
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)

# Use the Keras tokenizer
token = text.Tokenizer(num_words=None)
token.fit_on_texts(data.分词结果.values)
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# Zero-pad the text sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
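
With its default arguments, Keras `pad_sequences` pads at the front and also truncates from the front. The pure-Python sketch below mimics that behavior on two short sequences; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


padded = pad_pre([[5, 8, 2, 9], [7]], maxlen=3)
print(padded)
```

With `max_len = 70`, this means each document is represented by its last 70 tokens only, which is a strong truncation for full-length articles.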

In [78]:
# Build an embedding matrix for the vocabulary of the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 100))

for word, i in tqdm_notebook(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [84]:
# Build a 2-layer GRU model on top of the Word2Vec vectors trained earlier
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     100,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(GRU(128, return_sequences=True))
model.add(GRU(128, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1024, activation='selu'))
model.add(Dropout(0.8))

model.add(Dense(256, activation='selu'))
model.add(Dropout(0.8))

model.add(Dense(19))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [86]:
# Use the early stopping callback when fitting the model
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=64, epochs=50, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
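
The stopping rule configured above (`patience=3`, `min_delta=0`) can be illustrated with a simplified simulation. This is a sketch of the idea rather than the Keras implementation, and the validation losses are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stopped_at = early_stopping_epoch(losses)
print(stopped_at)
```

The best loss in this toy run is reached at epoch 3, and training stops three epochs later even though a later epoch would have improved on it. That is the trade-off a small `patience` makes.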

Train on 8324 samples, validate on 925 samples
Epoch 1/50
...
Epoch 25/50


<keras.callbacks.callbacks.History at 0x19d0e19a748>

### Model Ensembling

Stacking: the outputs of several base classifiers with comparable performance are fed into XGBoost, which makes the final classification.

In [70]:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
import pandas as pd
import os
import sys
import logging


logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s %(message)s",
    datefmt="%H:%M:%S", stream=sys.stdout)
logger = logging.getLogger(__name__)

In [71]:
class Ensembler(object):
    def __init__(self, model_dict, num_folds=3, task_type='classification', optimize=roc_auc_score,
                 lower_is_better=False, save_path=None):
        """
        Ensembler init function
        :param model_dict: dictionary mapping each level to its list of models
        :param num_folds: number of folds used for cross-validation
        :param task_type: 'classification' or 'regression'
        :param optimize: metric function such as AUC, logloss, or F1; it must take two arguments, y_test and y_pred
        :param lower_is_better: whether lower values of the metric are better
        :param save_path: directory where the per-level predictions are saved
        """
        self.model_dict = model_dict
        self.levels = len(self.model_dict)
        self.num_folds = num_folds
        self.task_type = task_type
        self.optimize = optimize
        self.lower_is_better = lower_is_better
        self.save_path = save_path

        self.training_data = None
        self.test_data = None
        self.y = None
        self.lbl_enc = None
        self.y_enc = None
        self.train_prediction_dict = None
        self.test_prediction_dict = None
        self.num_classes = None

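    def fit(self, training_data, y, lentrain):
        """
        :param training_data: dictionary whose entry 0 lists the training matrix for each level-0 model
        :param y: binary or multiclass labels, or regression targets
        :param lentrain: number of training samples
        :return: dictionary of out-of-fold training predictions for each level
        """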
        self.training_data = training_data
        self.y = y

        if self.task_type == 'classification':
            self.num_classes = len(np.unique(self.y))
            logger.info("Found %d classes", self.num_classes)
            self.lbl_enc = LabelEncoder()
            self.y_enc = self.lbl_enc.fit_transform(self.y)
            kf = StratifiedKFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, self.num_classes)
        else:
            self.num_classes = -1
            self.y_enc = self.y
            kf = KFold(n_splits=self.num_folds)
            train_prediction_shape = (lentrain, 1)

        # Allocate the prediction matrix for each level
        self.train_prediction_dict = {}
        for level in range(self.levels):
            self.train_prediction_dict[level] = np.zeros((train_prediction_shape[0],
                                                          train_prediction_shape[1] * len(self.model_dict[level])))

        for level in range(self.levels):
            if level == 0:  # level 0: input for the base classifiers
                temp_train = self.training_data
            else:  # higher level: input is the previous level's predictions
                temp_train = self.train_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):
                validation_scores = []
                foldnum = 1
                for train_index, valid_index in kf.split(self.train_prediction_dict[0], self.y_enc):
                    logger.info("Training Level %d Fold # %d. Model # %d", level, foldnum, model_num)

                    if level != 0:  # higher level: re-classify the previous level's predictions
                        l_training_data = temp_train[train_index]
                        l_validation_data = temp_train[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])
                    else:  # level 0: base classifiers
                        l0_training_data = temp_train[0][model_num]
                        if isinstance(l0_training_data, list):
                            l_training_data = [x[train_index] for x in l0_training_data]
                            l_validation_data = [x[valid_index] for x in l0_training_data]
                        else:
                            l_training_data = l0_training_data[train_index]
                            l_validation_data = l0_training_data[valid_index]
                        model.fit(l_training_data, self.y_enc[train_index])

                    logger.info("Predicting Level %d. Fold # %d. Model # %d", level, foldnum, model_num)

                    # valid results
                    if self.task_type == 'classification':
                        temp_train_predictions = model.predict_proba(l_validation_data)
                        self.train_prediction_dict[level][valid_index,
                            (model_num * self.num_classes):((model_num + 1) * self.num_classes)] = temp_train_predictions

                    else:
                        temp_train_predictions = model.predict(l_validation_data)
                        self.train_prediction_dict[level][valid_index, model_num] = temp_train_predictions
                    
                    validation_score = self.optimize(self.y_enc[valid_index], temp_train_predictions)
                    validation_scores.append(validation_score)
                    logger.info("Level %d. Fold # %d. Model # %d. Validation Score = %f", level, foldnum, model_num,
                                validation_score)
                    foldnum += 1
                
                # Base classifiers should perform comparably; if they differ too much, stacking brings little gain
                avg_score = np.mean(validation_scores)
                std_score = np.std(validation_scores)
                logger.info("Level %d. Model # %d. Mean Score = %f. Std Dev = %f", level, model_num,
                            avg_score, std_score)

            logger.info("Saving predictions for level # %d", level)
            train_predictions_df = pd.DataFrame(self.train_prediction_dict[level])
            train_predictions_df.to_csv(os.path.join(self.save_path, "train_predictions_level_" + str(level) + ".csv"),
                                        index=False, header=None)

        return self.train_prediction_dict

    def predict(self, test_data, lentest):
        self.test_data = test_data
        if self.task_type == 'classification':
            test_prediction_shape = (lentest, self.num_classes)
        else:
            test_prediction_shape = (lentest, 1)

        self.test_prediction_dict = {}
        for level in range(self.levels):
            self.test_prediction_dict[level] = np.zeros((test_prediction_shape[0],
                                                         test_prediction_shape[1] * len(self.model_dict[level])))
        self.test_data = test_data
        for level in range(self.levels):
            if level == 0:
                temp_test = self.test_data
            else:
                temp_test = self.test_prediction_dict[level - 1]

            for model_num, model in enumerate(self.model_dict[level]):

                if self.task_type == 'classification':
                    if level == 0:
                        temp_test_predictions = model.predict_proba(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict_proba(temp_test)
                    self.test_prediction_dict[level][:, (model_num * self.num_classes): 
                        ((model_num + 1) * self.num_classes)] = temp_test_predictions
                else:
                    if level == 0:
                        temp_test_predictions = model.predict(temp_test[0][model_num])
                    else:
                        temp_test_predictions = model.predict(temp_test)
                    self.test_prediction_dict[level][:, model_num] = temp_test_predictions

            test_predictions_df = pd.DataFrame(self.test_prediction_dict[level])
            test_predictions_df.to_csv(os.path.join(self.save_path, "test_predictions_level_" + str(level) + ".csv"),
                                       index=False, header=None)

        return self.test_prediction_dict
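
The key idea in `Ensembler.fit` is the out-of-fold prediction matrix: each base model writes its class probabilities into its own block of columns, and every row is predicted by a model that never saw that row during training. The NumPy sketch below shows that layout under simplifying assumptions: `PriorModel` is a toy stand-in for a real classifier, and the folds are contiguous rather than stratified as in the class above. All names here are hypothetical.

```python
import numpy as np


class PriorModel:
    """Toy base model that predicts smoothed class frequencies of its training fold."""

    def __init__(self, n_classes, smoothing):
        self.n_classes = n_classes
        self.smoothing = smoothing

    def fit(self, X, y):
        counts = np.bincount(y, minlength=self.n_classes) + self.smoothing
        self.proba_ = counts / counts.sum()
        return self

    def predict_proba(self, X):
        return np.tile(self.proba_, (len(X), 1))


def out_of_fold_features(models, X, y, n_classes, n_folds=3):
    """Build the level-1 feature matrix: one block of n_classes columns per base model."""
    n = len(X)
    features = np.zeros((n, n_classes * len(models)))
    folds = np.array_split(np.arange(n), n_folds)
    for m, model in enumerate(models):
        for valid_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), valid_idx)
            model.fit(X[train_idx], y[train_idx])
            # Rows of the held-out fold, columns of model m's block.
            features[valid_idx, m * n_classes:(m + 1) * n_classes] = \
                model.predict_proba(X[valid_idx])
    return features


X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 2, 0, 1, 2])
models = [PriorModel(3, smoothing=1.0), PriorModel(3, smoothing=0.1)]
level1 = out_of_fold_features(models, X, y, n_classes=3)
print(level1.shape)
```

A level-1 model is then trained on `level1` and the original labels, which is the role the XGBoost classifier plays in the notebook.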

In [72]:
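# Specify the data used at each level of the ensemble.
# Level 1 is trained on the level-0 predictions, so the Ensembler never reads entry 1.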
train_data_dict = {0: [xtrain_tfidf, xtrain_bow, xtrain_tfidf, xtrain_bow], 1: [xtrain_w2v]}
test_data_dict = {0: [xvalid_tfidf, xvalid_bow, xvalid_tfidf, xvalid_bow], 1: [xvalid_w2v]}

model_dict = {0: [LogisticRegression(),
                      LogisticRegression(), 
                      MultinomialNB(alpha=0.1), 
                      MultinomialNB()],
              1: [xgb.XGBClassifier(silent=True, 
                                    objective='multi:softmax',
                                    n_estimators=25, 
                                    max_depth=6,
                                    colsample_bytree=0.8, 
                                    subsample=0.8, 
                                    learning_rate=0.1)]}

ens = Ensembler(model_dict=model_dict, num_folds=3, task_type='classification',
                optimize=multiclass_logloss, lower_is_better=True, save_path='')

ens.fit(train_data_dict, ytrain, lentrain=xtrain_w2v.shape[0])

preds = ens.predict(test_data_dict, lentest=xvalid_w2v.shape[0])

[22:35:09] INFO Found 19 classes
[22:35:09] INFO Training Level 0 Fold # 1. Model # 0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[22:39:22] INFO Predicting Level 0. Fold # 1. Model # 0
[22:39:23] INFO Level 0. Fold # 1. Model # 0. Validation Score = 0.363176
[22:39:23] INFO Training Level 0 Fold # 2. Model # 0
[22:43:40] INFO Predicting Level 0. Fold # 2. Model # 0
[22:43:40] INFO Level 0. Fold # 2. Model # 0. Validation Score = 0.368816
[22:43:40] INFO Training Level 0 Fold # 3. Model # 0
[22:48:07] INFO Predicting Level 0. Fold # 3. Model # 0
[22:48:07] INFO Level 0. Fold # 3. Model # 0. Validation Score = 0.339124
[22:48:07] INFO Level 0. Model # 0. Mean Score = 0.357038. Std Dev = 0.012875
[22:48:07] INFO Training Level 0 Fold # 1. Model # 1
[22:49:49] INFO Predicting Level 0. Fold # 1. Model # 1
[22:49:50] INFO Level 0. Fold # 1. Model # 1. Validation Score = 0.974753
[22:49:50] INFO Training Level 0 Fold # 2. Model # 1
[22:51:09] INFO Predicting Level 0. Fold # 2. Model # 1
[22:51:09] INFO Level 0. Fold # 2. Model # 1. Validation Score = 0.966990
[22:51:09] INFO Training Level 0 Fold # 3. Model # 1
[22:53:00] INFO Predicting Level 0. Fold # 3. Model # 1
[22:53:00] INFO Level 0. Fold # 3. Model # 1. Validation Score = 0.960663
[22:53:00] INFO Level 0. Model # 1. Mean Score = 0.967469. Std Dev = 0.005762
[22:53:00] INFO Training Level 0 Fold # 1. Model # 2
[22:53:01] INFO Predicting Level 0. Fold # 1. Model # 2
[22:53:01] INFO Level 0. Fold # 1. Model # 2. Validation

In [73]:
# Log loss of the level-1 (stacked) predictions
multiclass_logloss(yvalid, preds[1])

0.4194193846150403