# Text Classification (Sentiment Analysis) with a 1D CNN
    Sentiment classification implemented with a 1D CNN
    Runtime platform: Colab

In [2]:
import tensorflow as tf
tf.__version__

'2.11.0'

In [3]:
# # Unzip the uploaded review.zip file
# ! unzip review.zip

The `./train/pos` and `./train/neg` folders contain .txt files holding the positive and negative training samples, respectively.

In [4]:
import os
# Show the number of positive samples
len(os.listdir('./train/pos/'))

1247

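`tf.keras.utils.text_dataset_from_directory` builds a labeled `tf.data.Dataset` from text files organized into one folder per class. We use it to create the training, validation, and test sets. The test set comes from the `./test/` folder; the training and validation sets are a random 80%/20% split of the `./train/` folder. Passing the same `seed` to both calls keeps the two subsets disjoint.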

In [5]:
batch_size = 16
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    "./train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    "./train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    "./test", batch_size=batch_size
)
# Check the number of batches in each dataset
print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 1440 files belonging to 2 classes.
Using 1152 files for training.
Found 1440 files belonging to 2 classes.
Using 288 files for validation.
Found 151 files belonging to 2 classes.
Number of batches in raw_train_ds: 72
Number of batches in raw_val_ds: 18
Number of batches in raw_test_ds: 10
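
The file and batch counts printed above can be reproduced with simple arithmetic. The sketch below assumes that the validation subset size is the truncated product of `validation_split` and the file count, and that a final partial batch is kept. The helper name `split_and_batch` is illustrative and not part of Keras.

```python
import math


def split_and_batch(num_files, validation_split, batch_size):
    """Return (train_files, val_files, train_batches, val_batches).

    Assumption: validation size = int(validation_split * num_files),
    and a final partial batch still counts as one batch.
    """
    val_files = int(validation_split * num_files)
    train_files = num_files - val_files
    train_batches = math.ceil(train_files / batch_size)
    val_batches = math.ceil(val_files / batch_size)
    return train_files, val_files, train_batches, val_batches


# 1440 files in ./train, 20% held out, batch size 16
print(split_and_batch(1440, 0.2, 16))  # (1152, 288, 72, 18)

# 151 files in ./test, no split: the last batch holds only 7 samples
print(math.ceil(151 / 16))  # 10
```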


Inspect a few samples

In [6]:

for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b"Author's Views on Mao are a Disgrace The author says that the reason that communist China has became skeptical and even aggressive towards American interests is because America rejected Mao, who, according to the author, wanted to reach out to America.  This is complete nonsense.  For one thing, America had reason for rejecting Mao.  Communists like Mao have harmed millions of people all over the world.  Mao literally killed millions of Chinese with his communist ideology.  Americans can have a sense of pride that we rejected Mao and wanted nothing to do with him until Richard Nixon embraced him.  Chang, the nationalist leader of China, fled to Taiwan.  The author suggests that we should never have been friends with Chang and that Mao was a far better leader.  All we have to do is look at how Taiwan turned out to see that we were right to stand with Chang, and not Mao.  Taiwan has a representative government, capitalism, a much higher standard of living than mainland China, and they 

## Prepare the data



In [7]:
from tensorflow.keras.layers import TextVectorization
import string
import re


def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Model hyperparameters
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Define a vectorization layer
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)


# Build a text-only dataset (drop the labels)
text_ds = raw_train_ds.map(lambda x, y: x)

vectorize_layer.adapt(text_ds)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
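
To see what `custom_standardization` does to a single review, here is a plain-Python stand-in that applies the same three steps (lowercase, replace `<br />` with a space, strip punctuation). It is only an illustration: the function name `standardize` and the sample sentence are made up, and edge cases such as non-ASCII text may behave differently from the `tf.strings` version.

```python
import re
import string


def standardize(text):
    """Plain-Python stand-in for custom_standardization (illustration only)."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    # Remove every ASCII punctuation character
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)


print(standardize("Great movie!<br />Loved it."))  # great movie loved it
```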


In [8]:
# Vectorize the text data
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Cache and asynchronously prefetch the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
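
As a rough picture of what `output_mode="int"` with `output_sequence_length` produces, the sketch below maps tokens to vocabulary indices and then pads or truncates to a fixed length. It follows the `TextVectorization` convention that index 0 is padding and index 1 is the out-of-vocabulary token; the toy vocabulary and the helper name `vectorize` are assumptions for illustration, not taken from the notebook.

```python
def vectorize(text, vocabulary, sequence_length):
    """Map whitespace-split tokens to indices, then pad or truncate."""
    index = {token: i for i, token in enumerate(vocabulary)}
    # Unknown tokens map to index 1 (the out-of-vocabulary slot)
    ids = [index.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]
    # Right-pad with 0 up to the fixed length
    return ids + [0] * (sequence_length - len(ids))


# Hypothetical vocabulary: slot 0 is padding, slot 1 is out-of-vocabulary
toy_vocabulary = ["", "[UNK]", "the", "movie", "great"]

print(vectorize("the movie was great", toy_vocabulary, 6))  # [2, 3, 1, 4, 0, 0]
print(vectorize("the movie was great", toy_vocabulary, 2))  # [2, 3]
```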

## Build the model

Build a simple 1D CNN with an embedding layer

In [9]:
from tensorflow.keras import layers

# Integer input: each value is the vocabulary index of a token
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# Add the word embedding layer
x = layers.Embedding(max_features, embedding_dim)(inputs)
# Dropout to reduce overfitting
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# Dense layer used as the classifier
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# Binary classification: sigmoid constrains the output to the range 0-1
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = tf.keras.Model(inputs, predictions)

# Compile the model with the "binary_crossentropy" loss and the "adam" optimizer
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
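
For orientation, the sequence lengths and parameter counts of this architecture can be worked out by hand. The sketch below assumes the input sequences have length 500 (the `sequence_length` set for the vectorization layer) and uses the standard output-length formula for a convolution with `"valid"` padding; the helper name is illustrative.

```python
def conv1d_output_length(length, kernel_size, stride):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


sequence_length, max_features, embedding_dim = 500, 20000, 128
filters, kernel_size, stride = 128, 7, 3

# Sequence length after each strided convolution
after_conv1 = conv1d_output_length(sequence_length, kernel_size, stride)
after_conv2 = conv1d_output_length(after_conv1, kernel_size, stride)

# Trainable parameters per layer (weights + biases)
embedding_params = max_features * embedding_dim
conv1_params = kernel_size * embedding_dim * filters + filters
conv2_params = kernel_size * filters * filters + filters
dense_params = filters * 128 + 128
output_params = 128 * 1 + 1
total_params = (embedding_params + conv1_params + conv2_params
                + dense_params + output_params)

print(after_conv1, after_conv2)  # 165 53
print(total_params)              # 2806273
```

Most of the parameters (2,560,000) sit in the embedding layer, which is worth keeping in mind given that the training set has only 1152 reviews.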

## Train the model

In [13]:
# As a demo, train for only 10 epochs
epochs = 10
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1e8a167f5b0>

## Evaluate the model on the test set

In [14]:
model.evaluate(test_ds)



[0.7841019034385681, 0.9072847962379456]

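Because this is only a demo with a very small dataset, the training error drops to nearly 0 while the validation accuracy plateaus after a few epochs, which indicates overfitting.
In the end, the model reaches about 90.7% accuracy on the test set.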
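
The reported accuracy can be translated back into a count of correct predictions. This is a small arithmetic check, assuming the accuracy metric is simply the fraction of the 151 test files that were classified correctly.

```python
test_files = 151
reported_accuracy = 0.9072847962379456  # second value returned by model.evaluate

# Recover the number of correctly classified test files
correct = round(reported_accuracy * test_files)

print(correct)                        # 137
print(f"{correct / test_files:.2%}")  # 90.73%
```

Under that assumption, 137 of 151 reviews are classified correctly and 14 are misclassified.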

## Build an end-to-end model

Build an end-to-end model that takes raw text strings as input and outputs the probability of the positive class

In [15]:
# string input
inputs = tf.keras.Input(shape=(1,), dtype="string")
# Add vectorize_layer to convert raw strings into word indices
indices = vectorize_layer(inputs)
# Feed the indices into the embedding + 1D CNN model
outputs = model(indices)

# End-to-end model
end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Evaluate on raw_test_ds, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)



[0.7841019034385681, 0.9072847962379456]