### 1. 数据预处理

#### 1.1 导入并读取文件


In [2]:

import os
import jieba
import glob
## 读取文件列表
def get_filelist(data_dir):
    """
    :param data_dir: raw data directory
    :return: A list of news file
    """

    file_list = glob.glob(os.path.join(data_dir, '*_*.txt'))
    return file_list


## 用jieba进行中文分词
def words_parsing(file_list):
    """
    :param file_list: A list or news files
    :return:
        class_label: text label list of each news
        words: news record after parsing
    """
    class_label = []
    words = []
    for f in file_list:
        with open(f, 'rb', ) as fh:
            each_news = ''.join([l.decode(encoding='utf-8').strip() for l in fh.readlines()])
        class_label.append(os.path.basename(f).split('_')[0].lower())
        parsed_words = ' '.join(list(jieba.cut(each_news)))
        words.append(parsed_words)
    return class_label, words


#### 1.2 分词后的语料向量化

In [4]:
import sklearn.feature_extraction.text as t2v
## 使用tfidf向量化
def to_vector_Tfidf(parsed_words):
    """

    :param parsed_words: string list of parsed by jieba
    :return: vectors transformed using Tf-IDF algorithms
    """
    vectorizer = t2v.TfidfVectorizer(max_features=10000)
    vectors = vectorizer.fit_transform(parsed_words)
    return vectors


## 使用tf向量化
def to_vector_countVectors(parsed_words):
    vectorizer = t2v.CountVectorizer(max_df=0.95, min_df=2, max_features=10000)
    vectors = vectorizer.fit_transform(parsed_words)
    return vectors


In [5]:
## 创建wrapper 函数， 整合数据预处理部分
def data_prepare(data_dir, vectorizer):
    """
    :param data_dir: raw data director
    :return: vectors labels and d_label(label dict)
    """
    file_list = get_filelist(data_dir)
    class_label, words = words_parsing(file_list)
    vectors = vectorizer(words)
    s_label = set(class_label)
    d_label = dict(zip(s_label, range(len(s_label))))
    labels = [d_label[k] for k in class_label]
    return vectors, labels, d_label

用两种向量化方式进行向量化

In [7]:
from sklearn.model_selection import train_test_split

## 使用Tfidf进行向量化
data1, labels1, label_dict1 = data_prepare("data/news", to_vector_Tfidf)
train_data1, test_data1, train_label1, test_label1 = train_test_split(data1, labels1)

## 使用tf进行向量化
data2, labels2, label_dict2 = data_prepare("data/news", to_vector_countVectors)
train_data2, test_data2, train_label2, test_label2 = train_test_split(data1, labels1)

Building prefix dict from the default dictionary ...


Loading model from cache /var/folders/f4/74s3c6ln0d1dh3k95wlkysvh0000gn/T/jieba.cache


Loading model cost 1.146 seconds.


Prefix dict has been built succesfully.


### 2. 模型选择与优化


#### 2.1 使用逻辑回归作为基线模型，选择最好的向量化方法

In [13]:
from sklearn.linear_model import LogisticRegression
import time

def lr_model(train_data, train_label, test_data, test_label, penalty, C):
    lr = LogisticRegression(multi_class='multinomial', solver='newton-cg', penalty=penalty, C=C)
    train_pred, test_pred, train_score, test_score =_performance(lr, train_data, train_label, test_data, test_label)
    return train_pred, test_pred, train_score, test_score

def _performance(md, train_data, train_label, test_data, test_label):
    fit_begin = time.time()
    md.fit(train_data, train_label)
    fit_time = time.time() - fit_begin
    train_score = md.score(train_data, train_label)
    test_score = md.score(test_data, test_label)
    train_pred_begin = time.time()
    train_pred = md.predict(train_data)
    train_pred_time = time.time() - train_pred_begin
    test_pred_begin = time.time()
    test_pred = md.predict(test_data)
    test_pred_time = time.time() - test_pred_begin
    print("Predict accuracy: train "+str(train_score) + ", test: " + str(test_score))
    print("Running time: fit " + str(fit_time) + ", Predict on Train: " + str(train_pred_time) + ", Predict on Test: " + str(test_pred_time))
    return train_pred, test_pred, train_score, test_score

## 逻辑回归是我们的基模型， 使用基模型来测试两种向量化方式哪种更好
# tifidf
_, _, _, _=lr_model(train_data1, train_label1, test_data1, test_label1, penalty='l2', C=1)

# tf
_, _, _, _ = lr_model(train_data2, train_label2, test_data1, test_label1, penalty='l2', C=1)

Predict accuracy: train 0.983944444444, test: 0.968333333333
Running time: fit 6.230064868927002, Predict on Train: 0.021807193756103516, Predict on Test: 0.007370948791503906


Predict accuracy: train 0.982722222222, test: 0.9785
Running time: fit 5.664386034011841, Predict on Train: 0.034255027770996094, Predict on Test: 0.011272907257080078


由此可见对于基模型， 使用tf效果更好， 因此后续的模型调优中持续使用tf向量化结果。 后续会持续有这一组数据寻找模型。

#### 2.2 随机森林


In [18]:
from sklearn.ensemble import RandomForestClassifier

## 定义一个helper 函数， 方便参数调试
def rd_model(train_data, train_label, test_data, test_label, n, depth):
    rd = RandomForestClassifier(n_estimators=n, max_depth=depth)
    train_pred, test_pred, _, _ =_performance(rd, train_data, train_label, test_data, test_label)
    return train_pred, test_pred

_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=10,  depth=None)



Predict accuracy: train 0.9985, test: 0.984
Running time: fit 3.6753311157226562, Predict on Train: 0.1686561107635498, Predict on Test: 0.05652594566345215


结果可见训练精度比比基模型有所提升， 测试也有所提升。但训练和测试有一定差异， 说明模型有些过拟合。 所以后面尝试增大树个数

In [19]:
_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=20,  depth=None)

_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=30,  depth=None)

_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=40,  depth=None)

_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=50,  depth=None)

Predict accuracy: train 0.999611111111, test: 0.989166666667
Running time: fit 7.208452939987183, Predict on Train: 0.29341578483581543, Predict on Test: 0.10657191276550293


Predict accuracy: train 0.999777777778, test: 0.988833333333
Running time: fit 10.417667865753174, Predict on Train: 0.43949294090270996, Predict on Test: 0.15072202682495117


Predict accuracy: train 0.999833333333, test: 0.990333333333
Running time: fit 14.381725072860718, Predict on Train: 0.6063001155853271, Predict on Test: 0.20412707328796387


Predict accuracy: train 0.999888888889, test: 0.990333333333
Running time: fit 18.092178106307983, Predict on Train: 0.7303857803344727, Predict on Test: 0.2601778507232666


结果可以看到， 测试集预测结果在30棵树的时候已经达到99%, 再增加树的个数，性能提升不明显。 随机森林性能调优到此为止。

#### 2.3 尝试使用深度神经网络。
我们看到，基模型的原始性能就不错， 但本质来说逻辑回归就是一个单层神经网络， 所以这里考虑使用层次更多的神经网络来测试结果是否有提升。

In [22]:
from sklearn.neural_network import MLPClassifier

def nn_model(train_data, train_label, test_data, test_label, hidden_layer_sizes = 4, alpha=0.00001):
    nn = MLPClassifier(hidden_layer_sizes, activation='relu', alpha=alpha)
    train_pred, test_pred, _, _ =_performance(nn, train_data, train_label, test_data, test_label)
    return train_pred, test_pred


## 初始使用4个隐藏层网络，加0.00001的L2惩罚项
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.00001)

Predict accuracy: train 0.999833333333, test: 0.968666666667
Running time: fit 18.935806035995483, Predict on Train: 0.019276857376098633, Predict on Test: 0.006644010543823242


结果显示测试数据较差， 训练数据特别好， 结论是过拟合。 调优方向增大惩罚项。

In [23]:
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.0001)

Predict accuracy: train 0.999666666667, test: 0.97
Running time: fit 21.17419195175171, Predict on Train: 0.01936197280883789, Predict on Test: 0.006600141525268555


_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.001)

In [25]:
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.01)

Predict accuracy: train 0.999611111111, test: 0.972833333333
Running time: fit 28.278118133544922, Predict on Train: 0.035012006759643555, Predict on Test: 0.008856058120727539


In [26]:
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.1)

Predict accuracy: train 0.996222222222, test: 0.972666666667
Running time: fit 25.884929895401, Predict on Train: 0.01914691925048828, Predict on Test: 0.006716012954711914


In [27]:
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=10, alpha = 0.1)

Predict accuracy: train 0.996777777778, test: 0.9725
Running time: fit 22.74985098838806, Predict on Train: 0.03649187088012695, Predict on Test: 0.01559591293334961


通过多次调整惩罚项， 直到测试集性能不再提升后。 又加深了网络， 发现性能依然没有显著提高。 4层网络和alpha为0.01时候达到最佳性能。 

后续陆续尝试了, SVM, GN环境下的bayes，运行都很缓慢。 无法再个人电脑上很快算出结果。后尝试用LDA语义模型进行压缩，发现依然很缓慢。于是放弃对原始数据进行压缩。

#### 2.1 adaboost 和 GBDT

In [33]:
from sklearn.ensemble import AdaBoostClassifier
def ada_model(train_data, train_label, test_data, test_label, n):
    ada = AdaBoostClassifier(n_estimators=n)
    train_pred, test_pred, _, _ =_performance(ada, train_data, train_label, test_data, test_label)
    return train_pred, test_pred

for n in [5, 10, 20, 40, 80, 160]:
    print('Test with ' + str(n) + ' trees')
    _, _ = ada_model(train_data2, train_label2, test_data2, test_label2, n)
    print()

Test with 5trees


Predict accuracy: train 0.548944444444, test: 0.544666666667
Running time: fit 2.349236011505127, Predict on Train: 0.08429813385009766, Predict on Test: 0.026417016983032227

Test with 10trees


Predict accuracy: train 0.701166666667, test: 0.6975
Running time: fit 4.555819988250732, Predict on Train: 0.16355681419372559, Predict on Test: 0.055429935455322266

Test with 20trees


Predict accuracy: train 0.792444444444, test: 0.789
Running time: fit 8.74092698097229, Predict on Train: 0.2732670307159424, Predict on Test: 0.0889279842376709

Test with 40trees


Predict accuracy: train 0.824, test: 0.821666666667
Running time: fit 17.413145065307617, Predict on Train: 0.5463409423828125, Predict on Test: 0.1847219467163086

Test with 80trees


Predict accuracy: train 0.831333333333, test: 0.8255
Running time: fit 35.18995809555054, Predict on Train: 1.0784039497375488, Predict on Test: 0.3642890453338623

Test with 160trees


Predict accuracy: train 0.851388888889, test: 0.843333333333
Running time: fit 69.55302906036377, Predict on Train: 2.13319993019104, Predict on Test: 0.7491791248321533



结果可以看到，弱分类器个数到达40个开始性能分类准确度提升就比较缓慢， 但是时间开销却增加很快。 

#### 2.2 尝试使用GBDT

In [35]:
from sklearn.ensemble import GradientBoostingClassifier

def gbdt_model(train_data, train_label, test_data, test_label,n):
    gbdt = GradientBoostingClassifier()
    train_pred, test_pred, _, _ =_performance(gbdt, train_data, train_label, test_data, test_label)
    return train_pred, test_pred

## 结果跑的太慢， 放弃这个方案。 

### 总结
经过以上测试， 找到了三个还说的过去的分类器和配套参数
使用tf向量化模型。
1. 逻辑回归，配套L2惩罚项。
2. 40个分类器的随机森林。
3. 四层神经网络， 配合l2惩罚项和0.01的惩罚系数
总体对比如下：

In [40]:
## 基模型
print('base model:')
_, _, _, _ = lr_model(train_data2, train_label2, test_data1, test_label1, penalty='l2', C=1)

## 30棵树的随机森林
print('Random Forest with 50 trees:')
_, _= rd_model(train_data2, train_label2, test_data1, test_label1, n=50,  depth=None)

## 四层神经网络
print('FNN with 4 hiden layer:')
_, _ = nn_model(train_data2, train_label2, test_data2, test_label2, hidden_layer_sizes=4, alpha = 0.01)

base model:


Predict accuracy: train 0.982722222222, test: 0.9785
Running time: fit 7.0465850830078125, Predict on Train: 0.02595996856689453, Predict on Test: 0.006911039352416992
Random Forest with 50 trees:


Predict accuracy: train 0.999888888889, test: 0.990666666667
Running time: fit 17.748448848724365, Predict on Train: 0.7399890422821045, Predict on Test: 0.25372982025146484
FNN with 4 hiden layer:


Predict accuracy: train 0.999777777778, test: 0.971833333333
Running time: fit 30.017409801483154, Predict on Train: 0.019389867782592773, Predict on Test: 0.006413936614990234


回顾一下， 最初的文本向量化在精确度提升上起到很大作用。 不同的向量化对结果影响巨大。 
总体而言我们的基模型表现应该是最平衡的， 其次是随机森林算法。不过却显示预测的事件较长接近1秒钟。 
深度神经网络训练集表现远远大于测试集表现， 考虑是不是需要进一步增加数据提高模型的泛化能力。 