- Continuing from Lesson 7

In [1]:
import jieba
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings

In [2]:
warnings.filterwarnings('ignore')

In [3]:
%matplotlib inline

# Data preprocessing

In [4]:
news = pd.read_csv('../datasource/sqlResult_1558435.csv', encoding='gb18030')

In [5]:
news.head(2)

Unnamed: 0,id,author,source,content,feature,title,url
0,89617,,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""37""...",小米MIUI 9首批机型曝光：共计15款,http://www.cnbeta.com/articles/tech/623597.htm
1,89616,,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""15""...",骁龙835在Windows 10上的性能表现有望改善,http://www.cnbeta.com/articles/tech/623599.htm


In [6]:
news_dropna = news.dropna(subset=['source', 'content'])

In [7]:
news_dropna.head(2)

Unnamed: 0,id,author,source,content,feature,title,url
0,89617,,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""37""...",小米MIUI 9首批机型曝光：共计15款,http://www.cnbeta.com/articles/tech/623597.htm
1,89616,,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""15""...",骁龙835在Windows 10上的性能表现有望改善,http://www.cnbeta.com/articles/tech/623599.htm


In [8]:
def transform(line):
    # Label 1 when the source is exactly '新华社' (Xinhua News Agency), otherwise 0
    class_ = 1 if line['source'] == '新华社' else 0
    return pd.Series([class_, line['content']], index=['y', 'content'])

In [9]:
data = news_dropna.apply(transform, axis=1)

In [10]:
data.head(2)

Unnamed: 0,y,content
0,0,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...
1,0,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...


In [11]:
corpus = data.content.to_list()

In [12]:
y = data.y.values.astype(int)  # np.int was removed from recent NumPy releases; the builtin int is equivalent

# Text vectorization with TF-IDF

In [13]:
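corpus_cut = []
for sentence in tqdm(corpus):
    if not isinstance(sentence, str):
        continue
    # Keep only word characters (drops punctuation and whitespace)
    sentence = ''.join(re.findall(r'\w+', string=sentence))
    # Segment with jieba and join the tokens with spaces so TfidfVectorizer can split them
    corpus_cut.append(' '.join(jieba.cut(sentence=sentence)))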

  0%|                                                                                        | 0/87052 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\JEREMY~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.799 seconds.
Prefix dict has been built succesfully.
100%|███████████████████████████████████████████████████████████████████████████| 87052/87052 [03:34<00:00, 406.18it/s]


In [14]:
corpus_cut[0]

'此外 自 本周 6 月 12 日起 除 小米 手机 6 等 15 款 机型 外 其余 机型 已 暂停 更新 发布 含 开发 版 体验版 内测 稳定版 暂不受 影响 以 确保 工程师 可以 集中 全部 精力 进行 系统优化 工作 有人 猜测 这 也 是 将 精力 主要 用到 MIUI9 的 研发 之中 MIUI8 去年 5 月 发布 距今已有 一年 有余 也 是 时候 更新换代 了 当然 关于 MIUI9 的 确切 信息 我们 还是 等待 官方消息'

In [15]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=300)

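**`ngram_range=(min, max)` tells the vectorizer to extract every n-gram made of min, min+1, ..., max consecutive tokens. For 'Python is useful', `ngram_range=(1, 3)` yields 'Python', 'is', 'useful', 'Python is', 'is useful', and 'Python is useful'. With `ngram_range=(1, 1)` only the single words 'Python', 'is', and 'useful' are produced. (With its default settings the vectorizer also lowercases the text first.)**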
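
The n-gram expansion described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper name `word_ngrams` is made up for this example.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return every n-gram of min_n..max_n consecutive tokens, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


tokens = 'Python is useful'.split()
print(word_ngrams(tokens, 1, 3))  # unigrams, bigrams and the single trigram
print(word_ngrams(tokens, 1, 1))  # unigrams only
```

Because the segmented corpus is joined with spaces, the same windowing applies to the jieba tokens: a bigram is simply two adjacent segmented words.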

In [16]:
X = vectorizer.fit_transform(corpus_cut)

In [17]:
X = X.toarray()

In [18]:
X.shape, y.shape

((87052, 300), (87052,))
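
To make the 300 columns of `X` less of a black box, here is a small pure-Python sketch of how a TF-IDF weight can be computed. It assumes the weighting that scikit-learn documents as its default (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document); the function `tfidf` and the two toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalised weight rows)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf (assumed to mirror smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in row])
    return vocab, rows


docs = [['小米', '手机', '更新'], ['骁龙', '手机', '性能']]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

A term that appears in every document ('手机' here) gets a lower weight than a term unique to one document, which is the point of the idf factor.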

# Classification models

In [19]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [20]:
X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=2019)

## Naive Bayes

In [22]:
x_train, x_valid, y_train, y_valid = train_test_split(X_train, Y_train, test_size=0.15, random_state=1002)

In [22]:
gnb = GaussianNB()

In [23]:
gnb.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [24]:
y_pred = gnb.predict(x_valid)
y_pred_proba = gnb.predict_proba(x_valid)

In [25]:
print('gnb.score is {}'.format(gnb.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba[:, 1])))

gnb.score is 0.807027027027027
precision_score is 0.9981153411232567
recall_score is 0.7887995233839737
f1_score is 0.8811980033277869
roc_auc_score is 0.9410166438307452


## SVM

### Default parameters

In [24]:
x_train, x_valid, y_train, y_valid = train_test_split(X_train, Y_train, test_size=0.15, random_state=45)

In [27]:
svc = SVC(verbose=5)

In [28]:
svc.fit(x_train, y_train)

[LibSVM]

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=5)

In [29]:
y_pred = svc.predict(x_valid)
y_pred_proba = svc.decision_function(x_valid)

In [30]:
print('svc.score is {}'.format(svc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba)))

svc.score is 0.9061261261261261
precision_score is 0.9061261261261261
recall_score is 1.0
f1_score is 0.9507514887985632
roc_auc_score is 0.9835542624371735
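
A recall of exactly 1.0 together with a precision equal to the accuracy is the pattern produced by a classifier that labels every validation sample as positive. The sketch below reproduces that pattern with made-up labels (906 positives out of 1,000, chosen only to resemble the ratio above; they are not the real validation set). The helper `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


y_true = [1] * 906 + [0] * 94   # illustrative class ratio only
y_pred = [1] * 1000             # predict "positive" for everything
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

This is why accuracy alone is misleading on imbalanced data, and why the next step tries `class_weight='balanced'`.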


### class_weight: balanced

In [32]:
svc = SVC(class_weight='balanced', verbose=5)

In [33]:
svc.fit(x_train, y_train)

[LibSVM]

SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=5)

In [34]:
y_pred = svc.predict(x_valid)
y_pred_proba = svc.decision_function(x_valid)

In [35]:
print('svc.score is {}'.format(svc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba)))

svc.score is 0.8273873873873874
precision_score is 0.9995092024539878
recall_score is 0.8099025651222908
f1_score is 0.8947715289982426
roc_auc_score is 0.9860458572525036
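
With `class_weight='balanced'`, scikit-learn documents the per-class weight as `n_samples / (n_classes * count_of_class)`, and `SVC` multiplies `C` by that weight for each class. A minimal sketch of the formula, again on made-up labels with roughly the same imbalance (the function name is hypothetical):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [1] * 906 + [0] * 94   # illustrative class ratio only
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives a much larger weight, so mistakes on it cost more. That matches the shift seen above: precision rises to about 0.9995 while recall drops to about 0.81.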


- The regularization is too strong, so weaken it (a larger `C` means weaker regularization).

In [36]:
svc = SVC(C=5000, class_weight='balanced', verbose=5)   
svc.fit(x_train, y_train)

[LibSVM]

SVC(C=5000, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=5)

In [37]:
y_pred = svc.predict(x_valid)   
y_pred_proba = svc.decision_function(x_valid)

In [None]:
print('svc.score is {}'.format(svc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba)))

svc.score is 0.98
precision_score is 0.9981766612641815
recall_score is 0.9797176377013322
f1_score is 0.988861013547416
roc_auc_score is 0.9958786542849936


- Switch to a linear kernel

In [None]:
svc = SVC(C=5000, kernel='linear', class_weight='balanced', verbose=5)    
svc.fit(x_train, y_train)

[LibSVM]

In [None]:
y_pred = svc.predict(x_valid)   
y_pred_proba = svc.decision_function(x_valid)

In [None]:
print('svc.score is {}'.format(svc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba)))

- The linear kernel trained far more slowly than the Gaussian (RBF) kernel and performed slightly worse.

- Tune $\gamma$. The default here is $\frac{1}{300}$ (one over the number of features, which is 300); try 0.5 instead.
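
To see why $\gamma$ matters, recall the RBF kernel $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. TF-IDF rows are L2-normalised with non-negative entries, so the squared distance between two documents is at most 2. The sketch below (plain Python, with two hypothetical unit vectors at that maximum distance) compares the two $\gamma$ values.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x = [1.0, 0.0]
z = [0.0, 1.0]   # orthogonal unit vectors, squared distance 2

print(rbf_kernel(x, z, 1 / 300))  # close to 1: distant points still look alike
print(rbf_kernel(x, z, 0.5))      # exp(-1): distant points are clearly distinguished
```

With $\gamma = \frac{1}{300}$ the kernel value stays above 0.99 for every pair of such vectors, so the model can barely tell documents apart. A larger $\gamma$ makes the kernel more local.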

In [40]:
svc = SVC(C=5000, class_weight='balanced', gamma=0.5, verbose=5)   
svc.fit(x_train, y_train)

[LibSVM]

SVC(C=5000, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=5)

In [41]:
y_pred = svc.predict(x_valid)   
y_pred_proba = svc.decision_function(x_valid)

In [42]:
print('svc.score is {}'.format(svc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba)))

svc.score is 0.9888288288288288
precision_score is 0.9935413354531002
recall_score is 0.9941340226685226
f1_score is 0.9938375906967498
roc_auc_score is 0.9961956258308337


- Performance is still improving.

## Random Forest

In [25]:
rfc = RandomForestClassifier(oob_score=True, class_weight='balanced', verbose=5, random_state=42, n_jobs=4)
rfc.fit(x_train, y_train)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.


building tree 1 of 10building tree 2 of 10
building tree 3 of 10
building tree 4 of 10

building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=4)]: Done   6 out of  10 | elapsed:    0.6s remaining:    0.4s
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.9s finished


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=4, oob_score=True, random_state=42,
            verbose=5, warm_start=False)

In [28]:
y_pred = rfc.predict(x_valid)   
y_pred_proba = rfc.predict_proba(x_valid)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished


In [29]:
print('rfc.score is {}'.format(rfc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba[:, 1])))

rfc.score is 0.9899099099099099
precision_score is 0.9933531746031746
recall_score is 0.9955259494929409
f1_score is 0.994438375211044
roc_auc_score is 0.990183900746114


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished


- Increase the number of estimators.

In [30]:
rfc = RandomForestClassifier(n_estimators=15, oob_score=True, class_weight='balanced', verbose=5, random_state=42, n_jobs=4)
rfc.fit(x_train, y_train)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.


building tree 1 of 15building tree 2 of 15
building tree 3 of 15
building tree 4 of 15

building tree 5 of 15
building tree 6 of 15
building tree 7 of 15
building tree 8 of 15
building tree 9 of 15
building tree 10 of 15
building tree 11 of 15
building tree 12 of 15
building tree 13 of 15
building tree 14 of 15
building tree 15 of 15


[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:    1.0s remaining:    0.2s
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    1.2s finished


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=15, n_jobs=4, oob_score=True, random_state=42,
            verbose=5, warm_start=False)

In [31]:
y_pred = rfc.predict(x_valid)   
y_pred_proba = rfc.predict_proba(x_valid)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    0.0s finished


In [32]:
print('rfc.score is {}'.format(rfc.score(x_valid, y_valid)))
print('precision_score is {}'.format(precision_score(y_valid, y_pred)))
print('recall_score is {}'.format(recall_score(y_valid, y_pred)))
print('f1_score is {}'.format(f1_score(y_valid, y_pred)))
print('roc_auc_score is {}'.format(roc_auc_score(y_valid, y_pred_proba[:, 1])))

rfc.score is 0.9900900900900901
precision_score is 0.9928656361474435
recall_score is 0.9962219129051502
f1_score is 0.9945409429280396
roc_auc_score is 0.9916845062552742


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    0.0s finished


## Clustering

In [33]:
from sklearn.cluster import KMeans
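# The metric is spelled calinski_harabasz_score; the old "harabaz" spelling is deprecated and absent from newer scikit-learn releases
from sklearn.metrics import calinski_harabasz_score as calinski_harabaz_score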

In [34]:
k_means = KMeans()
y_pred = k_means.fit_predict(X_train)

In [35]:
calinski_harabaz_score(X_train, y_pred)

2338.165960168884

- Since the labels `y` have two classes, try setting `n_clusters` to 2.

In [36]:
k_means = KMeans(n_clusters=2, verbose=5, random_state=45, n_jobs=4)
y_pred = k_means.fit_predict(X_train)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 65910.96608725289
start iteration
done sorting
end inner loop
Iteration 1, inertia 63389.96200838976
start iteration
done sorting
end inner loop
Iteration 2, inertia 63024.875038738
start iteration
done sorting
end inner loop
Iteration 3, inertia 62882.370797365315
start iteration
done sorting
end inner loop
Iteration 4, inertia 62833.29332917934
start iteration
done sorting
end inner loop
Iteration 5, inertia 62796.77480112829
start iteration
done sorting
end inner loop
Iteration 6, inertia 62743.44678837915
start iteration
done sorting
end inner loop
Iteration 7, inertia 62633.475632676586
start iteration
done sorting
end inner loop
Iteration 8, inertia 62466.350615440744
start iteration
done sorting
end inner loop
Iteration 9, inertia 62367.94670229849
start iteration
done sorting
end inner loop
Iteration 10, inertia 62245.25026087506
start iteration
done sorting
end inner loop
Iteration 11, ine

In [37]:
calinski_harabaz_score(X_train, y_pred)

5741.110567894496

- The score improves.

In [38]:
k_means = KMeans(n_clusters=4, random_state=45, n_jobs=4)
y_pred = k_means.fit_predict(X_train)

In [39]:
calinski_harabaz_score(X_train, y_pred)

3628.5566517046277

- Increasing the number of clusters lowers the score.
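
The Calinski-Harabasz score used above is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom: $\frac{B / (k - 1)}{W / (n - k)}$, where higher means tighter, better separated clusters. A minimal pure-Python sketch follows. It assumes at least two clusters and non-zero within-cluster dispersion, and the toy points are invented.

```python
def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion, scaled by degrees of freedom."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Squared distance from the cluster centroid to the global mean, weighted by cluster size
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        # Squared distances from each member to its own centroid
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    return (between / (k - 1)) / (within / (n - k))


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups the two left and two right points
bad = calinski_harabasz(points, [0, 1, 0, 1])   # mixes left and right points
print(good, bad)
```

The score uses no ground-truth labels, so a high value says the clusters are compact and well separated, not that they coincide with the classes in `y`.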

# Strengths and weaknesses of each model

## Linear Regression

- Strengths:
<br>1. Fast to build and fast to run;
<br>2. Highly interpretable.
- Weaknesses:
<br>1. Sensitive to outliers;
<br>2. Cannot fit complex nonlinear relationships.
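
The sensitivity to outliers can be seen with a one-variable least-squares fit. This is a minimal sketch on invented numbers, where `fit_line` is a hypothetical helper implementing the closed-form solution.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0, 1, 2, 3, 4]
clean = fit_line(xs, [0, 2, 4, 6, 8])       # points lie exactly on y = 2x
skewed = fit_line(xs, [0, 2, 4, 6, 40])     # same data with one outlier at x = 4
print(clean, skewed)
```

A single outlying point is enough to pull the fitted slope far away from 2, because squared errors give large residuals a disproportionate influence.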

## Logistic Regression

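- Strengths:
<br>1. Simple form and very interpretable. A feature with a particularly large weight has a large influence on the result, which marks it as important;
<br>2. Good enough in practice. With solid feature engineering the results are usually acceptable for production, and feature engineering can be developed in parallel by several people, which speeds up development;
<br>3. Fast to train. The computation scales only with the number of features, and distributed SGD-based optimization for logistic regression is mature, so training can be sped up further;
<br>4. Low resource usage: only the weight of each feature needs to be stored;
<br>5. The output is easy to adjust by moving the decision threshold.
- Weaknesses:
<br>1. Accuracy is often limited. The form is very simple (essentially a linear model), so it struggles to fit the true data distribution;
<br>2. Nonlinear data is awkward to handle. Without additional techniques, logistic regression can only separate linearly separable data, and in its basic form it handles only binary classification;
<br>3. It does not select features by itself. A common workaround is to select features with GBDT first and then fit logistic regression.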
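
The threshold adjustment mentioned in the list can be sketched as follows. The weights, bias and sample are made-up values for illustration, not a trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    prob = sigmoid(z)
    return prob, int(prob >= threshold)


weights, bias = [2.0, -1.0], 0.0   # hypothetical coefficients
sample = [0.5, 0.2]                # linear score z = 0.8

prob_default, label_default = predict(sample, weights, bias)
prob_strict, label_strict = predict(sample, weights, bias, threshold=0.8)
print(prob_default, label_default, label_strict)
```

Raising the threshold trades recall for precision without retraining, since the model's probabilities stay the same and only the cut-off moves.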

## KNN

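- Strengths:
<br>1. Mature theory and a simple idea; works for both classification and regression;
<br>2. Can be used for nonlinear classification;
<br>3. Easy to implement.
- Weaknesses:
<br>1. Performs poorly on imbalanced samples;
<br>2. Needs a lot of memory;
<br>3. Computationally expensive on large datasets, because of the distance calculations;
<br>4. Every prediction triggers a fresh pass over the whole training set.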
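
A brute-force version makes the cost structure visible: there is no training step, and each query computes a distance to every stored point. This is a minimal sketch with invented data; practical libraries often use tree-based indexes instead of a full scan.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    # Every call scans the full training set, which is the global computation noted above
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

The whole training set has to stay in memory for this to work, which is where the memory cost in the list comes from.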

## SVM

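- Strengths:
<br>1. Very effective on high-dimensional data;
<br>2. Still performs well when there are more features than training samples;
<br>3. Works very well when the classes are cleanly separable;
<br>4. Low generalization error;
<br>5. Low computational cost: many subproblems are solved in a loop, but each one has an analytical solution, so each step is fast;
<br>6. Can capture nonlinear feature interactions, since mapping the input space to the feature space can combine features.
- Weaknesses:
<br>1. Very sensitive to parameter tuning and to the choice of kernel;
<br>2. Sensitive to noise and missing data.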

## Naive Bayes

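- Strengths:
<br>1. Rooted in classical mathematical theory, with stable classification performance;
<br>2. Performs well on small datasets and handles multi-class tasks; it also suits incremental training, so data that does not fit in memory can be learned batch by batch;
<br>3. Not very sensitive to missing data.
- Weaknesses:
<br>1. In practice the number of attributes is often large, or the attributes are strongly correlated, which violates the feature independence assumption and hurts classification quality;
<br>2. It needs prior probabilities, which often depend on assumptions; if the assumed model is a poor fit, predictions suffer;
<br>3. It is sensitive to how the input data is represented; representations that are too similar to one another also undermine the feature independence assumption.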