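### Text Classification

We will classify a social media text dataset using the KNN, decision tree, naive Bayes, and SVM algorithms.

#### Experiment Content

This experiment uses the text dataset file weibo-processed-B.csv, produced by the preprocessing and Chinese word segmentation in Section 3.

Third-party libraries: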

In [None]:
import pandas as pd
import numpy as np
import jieba

In [None]:
# Load the data
df_post = pd.read_csv(r"weibo-processed-B.csv", encoding="utf-8")
df_post.head(50)

In [None]:
# Drop rows whose post text ("微博内容") is missing
df_post = df_post.dropna(axis=0, subset=["微博内容"])
# drop=True discards the old row index; inplace=True updates df_post in place
df_post.reset_index(drop=True, inplace=True)
print(df_post.isnull().any())
print(df_post)
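
The effect of dropping rows and then resetting the index can be seen on a tiny example. This is a minimal sketch: the three-row DataFrame and its label values are made up for illustration and are not taken from weibo-processed-B.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (assumption): one post has missing text.
toy = pd.DataFrame({
    "微博内容": ["今天 天气 很 好", np.nan, "新 手机 发布"],
    "类标签": [1, 0, 2],  # hypothetical labels
})

# Drop rows with missing text; the surviving rows keep index labels 0 and 2.
cleaned = toy.dropna(axis=0, subset=["微博内容"])
print(list(cleaned.index))

# Renumber the rows 0..n-1 so positional and label-based access agree.
cleaned = cleaned.reset_index(drop=True)
print(list(cleaned.index))
```

Without the reset, a later `df_post.loc[1]` would raise a `KeyError` in this toy case, because the row labeled 1 no longer exists.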

## Dataset Split

First, import the required libraries.

In [None]:
import itertools
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score,train_test_split

In [None]:
# Load the stop words and build the term-frequency matrix
with open('stopwords/cn_stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]

def tokenizer(s):
    # Segment a string into a list of words with jieba
    return list(jieba.cut(s))

count = CountVectorizer(tokenizer=tokenizer, stop_words=stopwords)
countvector = count.fit_transform(df_post['微博内容']).toarray()
print(countvector)
print(countvector.shape)  # inspect the shape of the term-frequency matrix
# Count the words that occur exactly once in the first post
c = 0
for i in countvector[0]:
    if i == 1:
        c += 1
c

Use PCA dimensionality reduction to inspect how the dataset is distributed.
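
A minimal sketch of such a PCA projection follows. The matrix `toy_counts`, the labels `toy_labels`, and the helper `plot_projection` are all assumptions made up for illustration; in the notebook the inputs would be `countvector` and the `类标签` column.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy term-frequency matrix (assumption): 6 posts, 5 vocabulary words.
toy_counts = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [2, 2, 1, 0, 0],
    [0, 0, 2, 1, 1],
    [0, 1, 1, 2, 2],
    [0, 0, 2, 2, 1],
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])  # hypothetical class labels

# Project every post onto the two directions of largest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(toy_counts)
print(coords.shape)
print(pca.explained_variance_ratio_)


def plot_projection(coords, labels):
    """Scatter-plot the 2-D projection, one color per class (hypothetical helper)."""
    import matplotlib.pyplot as plt

    for cls in np.unique(labels):
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], label=f"class {cls}")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```

Calling `plot_projection(coords, toy_labels)` draws the scatter plot. If the classes overlap heavily in this 2-D view, that hints the classifiers may struggle, although two components usually explain only a small part of the variance of a large vocabulary.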


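Step 6: Dataset split

Common ways to split a dataset are:

1. the hold-out method;
2. cross-validation;
3. the bootstrap method.

Which method should be used for the dataset and task in this experiment? Analyze which of the three is the best fit.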
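
The three splitting methods can be sketched side by side as follows. The data here is random toy data (an assumption), and the bootstrap variant shown, which uses the never-drawn samples as the test set, is one common formulation rather than the only one.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy stand-in data (assumption): 20 samples, 3 features, 2 balanced classes.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(20, 3))
y_toy = np.array([0, 1] * 10)

# 1. Hold-out: a single random 80/20 split, stratified by label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)
print("hold-out:", X_tr.shape, X_te.shape)

# 2. Cross-validation: 5 folds; each sample lands in a test fold exactly once.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X_toy, y_toy)]
print("5-fold test sizes:", fold_sizes)

# 3. Bootstrap: draw n indices with replacement for training;
#    samples that were never drawn (out-of-bag) form the test set.
n = len(X_toy)
boot_idx = rng.integers(0, n, size=n)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)
print("bootstrap:", len(boot_idx), "training draws,", len(oob_idx), "out-of-bag")
```

Hold-out is the cheapest but depends on one random split, cross-validation averages over several splits at a higher training cost, and the bootstrap is mostly used when the dataset is small.
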
Hold-out method:

Print the label statistics of the dataset with the following code:

In [None]:
# y is the label column
y = df_post['类标签']
print(y)
# X is the term-frequency feature matrix built above
X = countvector
print(X)
print("X.shape", X.shape)
print("y.shape", y.shape)
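
Printing `y` shows the raw labels but not how often each class occurs. A minimal sketch of per-class statistics with `value_counts` follows; the label values in `y_toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical label column (assumption): three classes, unevenly represented.
y_toy = pd.Series(["pos", "neg", "pos", "neu", "pos", "neg"], name="类标签")

# Absolute number of posts per class.
counts = y_toy.value_counts()
print(counts)

# Share of each class; a strong imbalance argues for a stratified split.
proportions = y_toy.value_counts(normalize=True)
print(proportions)
```

On the real data the same two calls would be applied to `df_post['类标签']`.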

Use the train_test_split method to randomly split X and y into a training set (train_data), training labels (train_labels), a test set (test_data), and test labels (test_labels), with a training-to-test ratio of 8:2.

In [None]:
train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2)
print("train_data.shape", train_data.shape)
print("test_data.shape", test_data.shape)
print("train_labels.shape", train_labels.shape)
print("test_labels.shape", test_labels.shape)

### KNN Classification

In [None]:
kNN_classifier = KNeighborsClassifier(n_neighbors=5)  # initialize the k-nearest-neighbors classifier
kNN_classifier.fit(train_data, train_labels)  # fit on the training set
label_predict = kNN_classifier.predict(test_data)  # predict on the test set
# Predicted labels
print("Predict_rlt:", label_predict)
# Number of correct predictions
print("Correct_no:", sum(label_predict == test_labels))
# Accuracy
print("Accuracy:", sum(label_predict == test_labels) / len(test_labels))

from sklearn.metrics import classification_report  # per-class evaluation report
print(classification_report(test_labels, label_predict))

From the output of the code above, the accuracy is:

The recall is:
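
Accuracy and recall can also be computed directly, which makes clear what the report measures. This is a minimal sketch on six hypothetical test labels, not on the experiment's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels for six test posts.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# Accuracy: share of all predictions that are correct.
acc = accuracy_score(y_true, y_pred)

# Recall per class: of the posts truly in a class, the share that was found.
recall_per_class = recall_score(y_true, y_pred, average=None)
# Macro recall: unweighted mean of the per-class recalls.
recall_macro = recall_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print("accuracy:", acc)
print("recall per class:", recall_per_class)
print("macro recall:", recall_macro)
print(cm)
```

The per-class recall equals each diagonal entry of the confusion matrix divided by its row sum.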

### Model Evaluation

In [None]:
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
score_knn_accuracy = np.mean(cross_val_score(estimator=knn, X=X, y=y, cv=5, scoring='accuracy'))
print("Score_accuracy:",score_knn_accuracy)


Besides KNN, there are other classification algorithms; compare and analyze the results of the different algorithms.

For the principles and usage of the related material, see the author's personal GitHub link [https://github.com/JunTheRipper/Data-Mining-Access/blob/main/Summary-Of-Machine-Learning/B.sipervised-learning/B1.classification.ipynb](https://github.com/JunTheRipper/Data-Mining-Access/blob/main/Summary-Of-Machine-Learning/B.sipervised-learning/B1.classification.ipynb)

#### Test 1: Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
dtree = DecisionTreeClassifier()
dtree.fit(train_data, train_labels)  # fit on the training set
label_predict = dtree.predict(test_data)  # predict on the test set
# Predicted labels
print("Predict_rlt:", label_predict)
# Number of correct predictions
print("Correct_no:", sum(label_predict == test_labels))
# Accuracy
print("Accuracy:", sum(label_predict == test_labels) / len(test_labels))

from sklearn.metrics import classification_report  # per-class evaluation report
print(classification_report(test_labels, label_predict))

In [None]:
# Cross-validate the decision tree model
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
dtree = DecisionTreeClassifier()
# Mean accuracy over 5 folds on the full dataset, as for KNN above
score_dtree_accuracy = np.mean(cross_val_score(estimator=dtree, X=X, y=y, cv=5, scoring='accuracy'))
print("Score_accuracy:", score_dtree_accuracy)

From this test we can see:


#### Test 2: Naive Bayes (Gaussian Model)

In [None]:
from sklearn.naive_bayes import GaussianNB
# Train and evaluate the Gaussian model on the hold-out split
gNB = GaussianNB()
gNB.fit(train_data, train_labels)  # fit on the training set
label_predict = gNB.predict(test_data)  # predict on the test set
# Predicted labels
print("Predict_rlt:", label_predict)
# Number of correct predictions
print("Correct_no:", sum(label_predict == test_labels))
# Accuracy
print("Accuracy:", sum(label_predict == test_labels) / len(test_labels))

from sklearn.metrics import classification_report  # per-class evaluation report
print(classification_report(test_labels, label_predict))

From this test we can see:


#### Test 3: Naive Bayes (Multivariate Bernoulli Model)

In [None]:
from sklearn.naive_bayes import BernoulliNB
# BernoulliNB binarizes the counts: it only uses whether a word occurs
bNB = BernoulliNB()
bNB.fit(train_data, train_labels)  # fit on the training set
label_predict = bNB.predict(test_data)  # predict on the test set
# Predicted labels
print("Predict_rlt:", label_predict)
# Number of correct predictions
print("Correct_no:", sum(label_predict == test_labels))
# Accuracy
print("Accuracy:", sum(label_predict == test_labels) / len(test_labels))

from sklearn.metrics import classification_report  # per-class evaluation report
print(classification_report(test_labels, label_predict))

From this test we can see:


#### Test 4: Support Vector Machine

In [None]:
from sklearn import svm
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
# Try changing C and gamma yourself to see how the results change
print('Start training SVM')
clf = clf.fit(train_data, train_labels)  # fit on the training set
label_predict = clf.predict(test_data)  # predict on the test set
# Predicted labels
print("Predict_rlt:", label_predict)
# Number of correct predictions
print("Correct_no:", sum(label_predict == test_labels))
# Accuracy
print("Accuracy:", sum(label_predict == test_labels) / len(test_labels))

from sklearn.metrics import classification_report  # per-class evaluation report
print(classification_report(test_labels, label_predict))
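
Instead of changing `C` and `gamma` by hand, one option is a small cross-validated grid search. The sketch below runs on synthetic data from `make_classification` (an assumption, not the Weibo dataset), and the candidate values in `param_grid` are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 120 samples, 10 features, 2 classes.
X_toy, y_toy = make_classification(
    n_samples=120, n_features=10, n_informative=4, random_state=0)

# Candidate values to try; every (C, gamma) pair is scored with 5-fold CV.
param_grid = {"C": [0.1, 0.8, 10], "gamma": [0.01, 0.1, 20]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

With an RBF kernel, a very large `gamma` makes each training point influence only its immediate neighborhood, which tends to overfit; comparing the grid scores shows how sensitive the model is to that choice.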

Compare and summarize these models:
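
One way to put the models on an equal footing is to score all of them with the same cross-validation setup. The sketch below does this on synthetic data from `make_classification` (an assumption); the numbers it prints say nothing about the Weibo dataset, and the `models` dictionary is an illustrative arrangement with default parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 20 features, 2 classes.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    "SVM": SVC(kernel="rbf"),
}

# Score every model with the same 5-fold cross-validation.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:>12}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Replacing `X_toy` and `y_toy` with `X` and `y` from this experiment gives a table of mean accuracies to base the comparison on.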




