# Contemporary Artificial Intelligence, Experiment 1: Text Classification
## TF-IDF & Multilayer Perceptron (sklearn)

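### 1. Import the required modules
numpy is used for data handling.
time is used to record how long the code takes to run.
MLPClassifier is the multilayer perceptron model.
train_test_split splits the data into a training set and a validation set.
classification_report measures how well the trained model performs.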

In [1]:
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

### 2. Load the data

In [2]:
labelList = np.load('labelList.npy')
textList = np.load('textVectorList.npy')
result_list = np.load('result_list.npy')

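### 3. Train the multilayer perceptron on the training data
We use the MLPClassifier class from sklearn.
We evaluate the model with Monte Carlo cross-validation: each repetition draws a new random train/test split, and the metrics are averaged over all repetitions.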
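The splitting scheme behind Monte Carlo cross-validation can be sketched in plain Python. The helper name `monte_carlo_splits` and the toy sizes below are assumptions for illustration only, not part of the experiment. scikit-learn's `ShuffleSplit` implements the same repeated random-split scheme.

```python
import random


def monte_carlo_splits(n_samples, test_fraction, n_repeats, seed=None):
    """Yield (train_indices, test_indices) for repeated random sub-sampling.

    Hypothetical helper for illustration. Every repetition reshuffles all
    indices, so test sets from different repetitions may overlap.
    """
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]


# Toy usage: 16 samples, 12.5% held out, 5 repetitions (mirrors LOOP_NUMBER = 5).
splits = list(monte_carlo_splits(n_samples=16, test_fraction=0.125, n_repeats=5, seed=0))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Unlike k-fold cross-validation, the test sets here are not guaranteed to be disjoint across repetitions, so some samples may be tested several times and others never.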

In [3]:
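# Split the data into training and test sets: 12.5% is held out for testing and the rest is used for training.
# With random_state=None, train_test_split draws a different random split on every call; a fixed integer would reproduce the same split each time.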
accuracyTotal = 0
precisionTotal = 0
recallTotal = 0
f1Total = 0
LOOP_NUMBER = 5
target_names = ['class_0', 'class_1', 'class_2', 'class_3', 'class_4', 'class_5', 'class_6', 'class_7', 'class_8', 'class_9']

start_time = time.time()
for loop in range(LOOP_NUMBER):
    text_train, text_test, label_train, label_test = train_test_split(textList, labelList, test_size=0.125, random_state=None)
    mlp_classifier = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
    mlp_classifier.fit(text_train, label_train)
    y_pred = mlp_classifier.predict(text_test)
    accuracy = mlp_classifier.score(text_test, label_test)
    classification_rep = classification_report(label_test, y_pred, target_names=target_names, output_dict=True)
    # Extract the support-weighted average of each metric
    precision = classification_rep['weighted avg']['precision']
    recall = classification_rep['weighted avg']['recall']
    f1 = classification_rep['weighted avg']['f1-score']

    accuracyTotal += accuracy
    precisionTotal += precision
    recallTotal += recall
    f1Total += f1

end_time = time.time()
# Average over all repetitions
runtime = (end_time - start_time) / LOOP_NUMBER
accuracy_avg = accuracyTotal / LOOP_NUMBER
precision_avg = precisionTotal / LOOP_NUMBER
recall_avg = recallTotal / LOOP_NUMBER
f1_avg = f1Total / LOOP_NUMBER

print("Average run time (s):", runtime)
print("Accuracy:", accuracy_avg)
print("Precision (weighted):", precision_avg)
print("Recall (weighted):", recall_avg)
print("F1-score (weighted):", f1_avg)

Average run time (s): 204.26693372726442
Accuracy: 0.9636000000000001
Precision (weighted): 0.9642711530691942
Recall (weighted): 0.9636000000000001
F1-score (weighted): 0.9635754716251934
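In the output above, the weighted-average recall is identical to the accuracy. This is expected: weighting each class's recall by its share of the samples and summing gives the overall fraction of correct predictions. A minimal plain-Python sketch on made-up labels (the helper `weighted_recall` and the toy data are assumptions for illustration, not values from the experiment):

```python
from collections import Counter


def weighted_recall(y_true, y_pred):
    """Support-weighted average of per-class recall (hypothetical helper)."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    # recall_c = correct_c / support_c, weighted by support_c / total
    return sum((correct[c] / support[c]) * (support[c] / total) for c in support)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

Weighted precision and F1 do not share this property, which is why they differ slightly from the accuracy in the reported results.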


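The code takes a long time to run. One possible factor is that the hidden layer has too many neurons, so we now reduce the number of neurons from 10 to 5 and observe the effect.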
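To get a rough sense of how the hidden-layer size affects model size, the sketch below counts the trainable parameters of a fully connected network with one hidden layer. The input dimension is not stated in this notebook, so `N_FEATURES = 5000` is a purely hypothetical TF-IDF vocabulary size, and `mlp_parameter_count` is an illustrative helper.

```python
def mlp_parameter_count(n_features, hidden_units, n_classes):
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = (n_features + 1) * hidden_units  # +1 accounts for the bias
    output_layer = (hidden_units + 1) * n_classes
    return hidden_layer + output_layer


# Hypothetical TF-IDF vocabulary size; the notebook does not state the real value.
N_FEATURES = 5000
N_CLASSES = 10  # matches the ten entries in target_names

for hidden_units in (10, 5):
    print(hidden_units, mlp_parameter_count(N_FEATURES, hidden_units, N_CLASSES))
```

Halving the hidden layer roughly halves the parameter count, but total training time also depends on how many iterations the optimizer needs to converge, so the run time need not drop in proportion.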

In [4]:
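# Split the data into training and test sets: 12.5% is held out for testing and the rest is used for training.
# With random_state=None, train_test_split draws a different random split on every call; a fixed integer would reproduce the same split each time.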
accuracyTotal = 0
precisionTotal = 0
recallTotal = 0
f1Total = 0
LOOP_NUMBER = 5
target_names = ['class_0', 'class_1', 'class_2', 'class_3', 'class_4', 'class_5', 'class_6', 'class_7', 'class_8', 'class_9']

start_time = time.time()
for loop in range(LOOP_NUMBER):
    text_train, text_test, label_train, label_test = train_test_split(textList, labelList, test_size=0.125, random_state=None)
    mlp_classifier = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=42)
    mlp_classifier.fit(text_train, label_train)
    y_pred = mlp_classifier.predict(text_test)
    accuracy = mlp_classifier.score(text_test, label_test)
    classification_rep = classification_report(label_test, y_pred, target_names=target_names, output_dict=True)
    # Extract the support-weighted average of each metric
    precision = classification_rep['weighted avg']['precision']
    recall = classification_rep['weighted avg']['recall']
    f1 = classification_rep['weighted avg']['f1-score']

    accuracyTotal += accuracy
    precisionTotal += precision
    recallTotal += recall
    f1Total += f1

end_time = time.time()
# Average over all repetitions
runtime = (end_time - start_time) / LOOP_NUMBER
accuracy_avg = accuracyTotal / LOOP_NUMBER
precision_avg = precisionTotal / LOOP_NUMBER
recall_avg = recallTotal / LOOP_NUMBER
f1_avg = f1Total / LOOP_NUMBER

print("Average run time (s):", runtime)
print("Accuracy:", accuracy_avg)
print("Precision (weighted):", precision_avg)
print("Recall (weighted):", recall_avg)
print("F1-score (weighted):", f1_avg)

Average run time (s): 200.68755502700805
Accuracy: 0.9476000000000001
Precision (weighted): 0.9490088607438402
Recall (weighted): 0.9476000000000001
F1-score (weighted): 0.9475414237913533


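The average run time drops only slightly (from about 204 s to about 201 s), while accuracy, precision, recall, and F1-score all drop noticeably (accuracy falls from 0.9636 to 0.9476).
Next, we change the activation function of the multilayer perceptron, replacing the default ReLU with the logistic function, which performed well earlier.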
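For reference, the two activation functions and their derivatives can be written out in plain Python. This is only an illustrative sketch of the math, not sklearn's internal implementation. The logistic derivative never exceeds 0.25 and shrinks toward zero for large inputs, which is one commonly cited reason why networks with logistic hidden units can take longer to train.

```python
import math


def relu(x):
    """ReLU: passes positive inputs through, clips negatives to zero."""
    return max(0.0, x)


def logistic(x):
    """Logistic (sigmoid): squashes any input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


def logistic_grad(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = logistic(x)
    return s * (1.0 - s)


for x in (-4.0, 0.0, 4.0):
    print(x, relu(x), round(logistic(x), 4), relu_grad(x), round(logistic_grad(x), 4))
```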

In [8]:
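# Split the data into training and test sets: 12.5% is held out for testing and the rest is used for training.
# With random_state=None, train_test_split draws a different random split on every call; a fixed integer would reproduce the same split each time.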
accuracyTotal = 0
precisionTotal = 0
recallTotal = 0
f1Total = 0
LOOP_NUMBER = 5
target_names = ['class_0', 'class_1', 'class_2', 'class_3', 'class_4', 'class_5', 'class_6', 'class_7', 'class_8', 'class_9']

start_time = time.time()
for loop in range(LOOP_NUMBER):
    text_train, text_test, label_train, label_test = train_test_split(textList, labelList, test_size=0.125, random_state=None)
    mlp_classifier = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, activation='logistic', random_state=42)
    mlp_classifier.fit(text_train, label_train)
    y_pred = mlp_classifier.predict(text_test)
    accuracy = mlp_classifier.score(text_test, label_test)
    classification_rep = classification_report(label_test, y_pred, target_names=target_names, output_dict=True)
    # Extract the support-weighted average of each metric
    precision = classification_rep['weighted avg']['precision']
    recall = classification_rep['weighted avg']['recall']
    f1 = classification_rep['weighted avg']['f1-score']

    accuracyTotal += accuracy
    precisionTotal += precision
    recallTotal += recall
    f1Total += f1

end_time = time.time()
# Average over all repetitions
runtime = (end_time - start_time) / LOOP_NUMBER
accuracy_avg = accuracyTotal / LOOP_NUMBER
precision_avg = precisionTotal / LOOP_NUMBER
recall_avg = recallTotal / LOOP_NUMBER
f1_avg = f1Total / LOOP_NUMBER

print("Average run time (s):", runtime)
print("Accuracy:", accuracy_avg)
print("Precision (weighted):", precision_avg)
print("Recall (weighted):", recall_avg)
print("F1-score (weighted):", f1_avg)

Average run time (s): 495.3832332611084
Accuracy: 0.9612
Precision (weighted): 0.9617417502105898
Recall (weighted): 0.9612
F1-score (weighted): 0.961100645621103


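The average run time rises to 495.3832 seconds, roughly 2.4 times that of either of the two previous models, yet accuracy, precision, recall, and F1-score show no clear advantage (accuracy is 0.9612, against 0.9636 for ReLU with 10 neurons).
This suggests that training a neural network can give good results but is computationally expensive and sensitive to hyperparameter choices.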
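The comparison can be checked with a few lines of arithmetic on the figures reported in the three runs above. The dictionary keys are informal labels chosen here for readability.

```python
# Average run time (seconds) and accuracy, copied from the three outputs above.
runs = {
    "relu_10": {"runtime": 204.26693372726442, "accuracy": 0.9636000000000001},
    "relu_5": {"runtime": 200.68755502700805, "accuracy": 0.9476000000000001},
    "logistic_10": {"runtime": 495.3832332611084, "accuracy": 0.9612},
}

baseline = runs["logistic_10"]
for name in ("relu_10", "relu_5"):
    ratio = baseline["runtime"] / runs[name]["runtime"]
    gap = runs[name]["accuracy"] - baseline["accuracy"]
    print(f"logistic_10 vs {name}: {ratio:.2f}x run time, accuracy gap {gap:+.4f}")
```

Because each configuration was evaluated on only five random splits, small accuracy differences such as the one between `relu_10` and `logistic_10` should be read with caution.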

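The multilayer perceptron with 10 hidden neurons and ReLU activation achieves the highest accuracy of the three configurations and is good enough to use. We therefore train this model on all of the labelled data.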

In [5]:
mlp_classifier = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
mlp_classifier.fit(textList, labelList)

### 4. Predict the test set
Predict the labels and inspect the final result.

In [6]:
predictions = mlp_classifier.predict(result_list)
predictions

array([6, 1, 4, ..., 9, 6, 9])

Because this configuration performed best in the Monte Carlo cross-validation above, we submit its predictions.

In [7]:
with open("results.txt", "w") as file:
    file.write('id, pred\n')
    for index, item in enumerate(predictions):
        file.write(f'{index}, {item}\n')
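As an alternative, the same file could be produced with Python's standard `csv` module. The sketch below is an assumption about what a stricter CSV output might look like, and `write_predictions` is a hypothetical helper. Note that it writes `id,pred` without the space after the comma, so the original cell should be kept if the submission format requires that space.

```python
import csv
import io


def write_predictions(predictions, stream):
    """Write an id/pred table to any text stream (hypothetical helper)."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id", "pred"])
    for index, item in enumerate(predictions):
        writer.writerow([index, item])


# Demonstration on an in-memory buffer with three made-up predictions.
buffer = io.StringIO()
write_predictions([6, 1, 4], buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with `open("results.txt", "w", newline="")` and pass the file object as `stream`.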