## 基于机器学习算法的文本分类

- TF-IDF
- Count Features
- Logistic Regression
- Naive Bayes
- SVM
- Xgboost
- Grid Search

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

### 读取数据

In [2]:
data = pd.read_excel("复旦大学中文文本分类语料.xlsx","sheet1")
data.head()

Unnamed: 0,分类,正文
0,艺术,﻿【 文献号 】1-2432\n【原文出处】出版发行研究\n【原刊地名】京\n【原刊期号】1...
1,艺术,﻿【 文献号 】1-2435\n【原文出处】扬州师院学报：社科版\n【原刊期号】199504...
2,艺术,﻿【 文献号 】1-2785\n【原文出处】南通师专学报：社科版\n【原刊期号】199503...
3,艺术,﻿【 文献号 】1-3021\n【原文出处】社会科学战线\n【原刊地名】长春\n【原刊期号】...
4,艺术,﻿【 文献号 】1-3062\n【原文出处】上海文化\n【原刊期号】199505\n【原刊页...


In [4]:
data['分类'].unique() # 共19类

array(['艺术', '文学', '哲学', '通信', '能源', '历史', '矿藏', '空间', '教育', '交通', '计算机',
       '环境', '电子', '农业', '体育', '时政', '医疗', '经济', '法律'], dtype=object)

In [5]:
data.shape

(9249, 2)

### 分词

In [8]:
import jieba

data['文本分词'] = data['正文'].apply(lambda x:jieba.cut(x)) # 生成器形式
data['文本分词'] = [' '.join(i) for i in data['文本分词']] # 空格拼接
data.head()

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\蓝沛辉\AppData\Local\Temp\jieba.cache
Loading model cost 0.660 seconds.
Prefix dict has been built succesfully.


Unnamed: 0,分类,正文,文本分词
0,艺术,﻿【 文献号 】1-2432\n【原文出处】出版发行研究\n【原刊地名】京\n【原刊期号】1...,﻿ 【 文献号 】 1 - 2432 \n 【 原文 出处 】 出版发行 研究 \n...
1,艺术,﻿【 文献号 】1-2435\n【原文出处】扬州师院学报：社科版\n【原刊期号】199504...,﻿ 【 文献号 】 1 - 2435 \n 【 原文 出处 】 扬州 师院 学报 ：...
2,艺术,﻿【 文献号 】1-2785\n【原文出处】南通师专学报：社科版\n【原刊期号】199503...,﻿ 【 文献号 】 1 - 2785 \n 【 原文 出处 】 南通 师专 学报 ：...
3,艺术,﻿【 文献号 】1-3021\n【原文出处】社会科学战线\n【原刊地名】长春\n【原刊期号】...,﻿ 【 文献号 】 1 - 3021 \n 【 原文 出处 】 社会科学 战线 \n...
4,艺术,﻿【 文献号 】1-3062\n【原文出处】上海文化\n【原刊期号】199505\n【原刊页...,﻿ 【 文献号 】 1 - 3062 \n 【 原文 出处 】 上海 文化 \n 【...


In [9]:
# 保存本地
data[['分类','文本分词']].to_csv("data.csv",index=False)

### 编码类别标签

In [10]:
from sklearn import preprocessing

encode = preprocessing.LabelEncoder()
label =  encode.fit_transform(data['分类'].values)

In [12]:
label[:10]

array([16, 16, 16, 16, 16, 16, 16, 16, 16, 16])

### 划分训练集和验证集

In [13]:
from sklearn.model_selection import train_test_split

X_train,X_valid,Y_train,Y_valid = train_test_split(data['文本分词'].values, label,stratify=label, random_state=42, test_size=0.1,shuffle=True ) 
# stratify=y : 按照数据集中y的比例分配给train和test，使得train和test中各类别数据的比例与原数据集的比例一致。
print (X_train.shape)
print (X_valid.shape)

(8324,)
(925,)


### 文本向量化

- TF-IDF

In [22]:
def number_encode(tokens):
    """将数字映射为同一个符号：#NUMBER"""
    return ("#NUMBER" if tokens[0].isdigit() else token for token in tokens)

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

class Vectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super(Vectorizer, self).build_tokenizer()
        return lambda doc:list(number_encode(tokenize(doc)))

In [23]:
stop_words = [line.strip() for line in open('中文停用词表.txt', 'r', encoding='utf-8').readlines()] # 停用词表

tf_idf = Vectorizer(min_df=3,max_df=0.5,max_features=None,ngram_range=(1,2),use_idf=True,smooth_idf=True,stop_words=stop_words)

# min_df 构建词汇表时忽略文档频率低于该值的词(整数为绝对值，浮点数为比例，默认1)
# max_df 构建词汇表时忽略文档频率高于该值的词(整数为绝对值，浮点数为比例，默认1)
# max_feature 构建词汇表时仅考虑按语料词频排序的前N个
# ngram_range 要提取的ngram特征的上下限
# user_idf 启动idf重新计算权重
# smooth_idf 对文档频率加1以平滑权重，防止除零
# stop_words 停用词表，用于剔除停用词，若为english，启用内建词表

# 词袋模型
# count_vec = CountVectorizer(min_df=3,max_df=0.5, ngram_range=(1,2),stop_words = stwlist)

In [25]:
tf_idf.fit(list(X_train) + list(X_valid))
X_train_vec = tf_idf.transform(X_train)
X_valid_vec = tf_idf.transform(X_valid)
print (X_train_vec.shape)
print (X_valid_vec.shape)

(8324, 932205)
(925, 932205)


### 多分类交叉熵损失函数

In [28]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """
     对数损失度量的多分类版本。
    :param actual: 包含actual target classes的数组
    :param predicted: 分类预测结果矩阵, 每个类别都有一个概率
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1])) # 将真实标签转化为矩阵[0,0,1,0,0]
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps) # eps最小，1-eps最大，避免求对数出现问题
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip)) # 两个矩阵各元素乘机之和
    return -1.0 / rows * vsota # 负的交叉熵之和除以样本数

### 逻辑回归分类

In [30]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0, solver='lbfgs', multi_class='multinomial') # 参数 https://blog.csdn.net/qq_27972567/article/details/81949023
lr.fit(X_train_vec,Y_train)
pred = lr.predict_proba(X_valid_vec)
print("多分类交叉熵损失：%0.3f" % multiclass_logloss(Y_valid,pred))



多分类交叉熵损失：0.607


### 朴素贝叶斯分类

In [31]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_vec,Y_train)
pred = nb.predict_proba(X_valid_vec)
print("多分类交叉熵损失：%0.3f" % multiclass_logloss(Y_valid,pred))

多分类交叉熵损失：0.968


### SVD降维

In [32]:
from sklearn import preprocessing, decomposition

svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(X_train_vec)

X_train_svd = svd.transform(X_train_vec)
X_valid_svd = svd.transform(X_valid_vec)

In [33]:
print (X_train_svd.shape)
print (X_valid_svd.shape)

(8324, 120)
(925, 120)


### 数据标准化

In [34]:
scale = preprocessing.StandardScaler()
scale.fit(X_train_svd)

X_train_svd_st = scale.transform(X_train_svd)
X_valid_svd_st = scale.transform(X_valid_svd)

### SVM分类

In [36]:
from sklearn.svm import SVC

svc = SVC(C=1.0, probability=True)
svc.fit(X_train_svd_st, Y_train)
pred = svc.predict_proba(X_valid_svd_st)

print("多分类交叉熵损失：%0.3f" % multiclass_logloss(Y_valid,pred))

多分类交叉熵损失：0.356


### Xgboost

In [38]:
import xgboost as xgb

boost = xgb.XGBClassifier(max_depth=7, n_estimator=200,colsample_bytree=0.8,subsample=0.8, nthread=10,learning_rate=0.1)
boost.fit(X_train_svd,Y_train)
pred = boost.predict_proba(X_valid_svd)
print("多分类交叉熵损失：%0.3f" % multiclass_logloss(Y_valid,pred))

多分类交叉熵损失：0.371


### 网格搜索参数

In [45]:
from sklearn import metrics, pipeline
from sklearn.model_selection import GridSearchCV

scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True) # 构建评分函数

# 构建pipeline
svd = decomposition.TruncatedSVD()
scale = preprocessing.StandardScaler()
lr = LogisticRegression()

pipelines = pipeline.Pipeline([('svd',svd),
                          ('scale',scale),
                          ('lr',lr)])

# 参数字典
params = {'svd__n_components':[120,180],
         'lr__C':[0.1,1.0,10],
         'lr__penalty':['l1','l2']}

In [None]:
model = GridSearchCV(estimator=pipelines, param_grid=params,scoring=scorer,verbose=10,n_jobs=-1,iid=True,refit=True,cv=2)
model.fit(X_train_vec,Y_train)
print("最佳分数：%0.3f" % model.best_score_)
print("最佳参数集：")
best_params = model.best_estimator_.get_params()
for param in sorted(params.keys()):
    print("\t%s: %r" % (param, best_params[param]))

Fitting 2 folds for each of 12 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed: 21.5min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed: 23.4min
[Parallel(n_jobs=-1)]: Done  12 out of  24 | elapsed: 23.8min remaining: 23.8min
[Parallel(n_jobs=-1)]: Done  15 out of  24 | elapsed: 24.4min remaining: 14.6min
