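### Case study: classifying the IMDB dataset with BiLSTM and BERT

### Workflow:

#### Step 1: Load the IMDB dataset
#### Step 2: Tokenize the text
#### Step 3: Wrap a custom padding function with tf.py_function
#### Step 4: Load the pretrained GloVe embeddings
#### Step 5: Build the IMDB embedding matrix from GloVe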
#### Step 6: Define the BiLSTM model on the GloVe embeddings
#### Step 7: Classify with BERT

### Step 1: Load the IMDB dataset

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [2]:
train_dataset, info = tfds.load(name='imdb_reviews',  # dataset name
                                split='train',        # load the training split
                                with_info=True,       # also return the DatasetInfo object
                                as_supervised=True    # yield (input, label) tuples
                               )

# API reference: https://tensorflow.google.cn/datasets/api_docs/python/tfds/load


In [3]:
test_dataset = tfds.load(name='imdb_reviews',
                         split='test',
                         as_supervised=True)

In [4]:
info  # dataset metadata

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset.
    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='/home/tgl/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_shards=1>,
        'train': <SplitInfo

### Step 2: Tokenize the text

In [5]:
tfds_tokenizer = tfds.deprecated.text.Tokenizer()   # default tokenizer

# API reference: https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/Tokenizer

In [6]:
vocab_set = set()  # vocabulary
MAX_LEN = 0        # length of the longest review, in tokens


for text, label in train_dataset:
    tokens = tfds_tokenizer.tokenize(text.numpy())  # tokenize the review
    if MAX_LEN < len(tokens):
        MAX_LEN = len(tokens)  # track the longest review
    vocab_set.update(tokens)   # add the tokens to the set; duplicates are ignored

In [7]:
MAX_LEN

2525

In [8]:
# Build a text encoder from the tokenizer and vocab_set above

encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=vocab_set,    # vocabulary
                                                lowercase=True,          # lowercase everything
                                                tokenizer=tfds_tokenizer # tokenizer
                                               )

# API reference: https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/TokenTextEncoder

In [9]:
encoder.vocab_size  # vocabulary size

93931

In [10]:
encoder.oov_token  # token used for out-of-vocabulary words

'UNK'

In [11]:
encoder.tokens[:10]  # first 10 vocabulary tokens

['tilton',
 'frantisek',
 'catastrophic',
 'deduces',
 'prematurely',
 'dumbrille',
 'unneeded',
 'inserted',
 'movieworld',
 'crutchley']
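
The id layout matters for the padding and embedding steps that follow. As far as the outputs above suggest, id 0 is reserved for padding, vocabulary tokens get ids starting at 1, and the last id stands for out-of-vocabulary words, so the vocabulary size is the number of tokens plus 2. A minimal pure-Python sketch of that scheme follows; `ToyEncoder` and its regex tokenizer are hypothetical stand-ins, not the tfds implementation.

```python
import re

class ToyEncoder:
    """Hypothetical stand-in for a token-to-id text encoder."""

    def __init__(self, vocab):
        # Lowercase and de-duplicate the vocabulary, then fix an order.
        self.tokens = sorted({t.lower() for t in vocab})
        # Ids start at 1 because 0 is reserved for padding.
        self.token_to_id = {t: i + 1 for i, t in enumerate(self.tokens)}
        self.oov_id = len(self.tokens) + 1      # single out-of-vocabulary id
        self.vocab_size = len(self.tokens) + 2  # tokens + padding + OOV

    def encode(self, text):
        words = re.findall(r"\w+", text.lower())  # assumed toy tokenizer
        return [self.token_to_id.get(w, self.oov_id) for w in words]

toy_encoder = ToyEncoder(["Today", "is", "Christmas", "Day", "Merry"])
ids = toy_encoder.encode("Today is Christmas Day. Happy Christmas.")
print(ids)
```

"Happy" is not in the toy vocabulary, so it maps to the OOV id instead of raising an error.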

### Step 3: Wrap a custom padding function with tf.py_function

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_padding(text):
    encode_result = encoder.encode(text.numpy())  # tokenize the text and map each token to an integer id
    pad_result = pad_sequences([encode_result],   # pad or truncate to a fixed length of 150
                               padding='post',
                               truncating='post',
                               maxlen=150)
    return np.array(pad_result[0], dtype=np.int64)

In [13]:
# Example

encode_padding(tf.constant(b"Today is Christmas Day. Merry Christmas."))

array([88497, 70219, 71769, 93729, 59938, 71769,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,
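
With `padding='post'` and `truncating='post'`, short sequences get zeros appended and long sequences lose their tail. A minimal pure-Python sketch of that behavior (the helper `pad_or_truncate` is a hypothetical name, not part of Keras):

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Keep the first maxlen ids, then append pad_id until the length is maxlen."""
    clipped = ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))

short = pad_or_truncate([7, 3, 9], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long)
```

Since the longest review has 2525 tokens and `maxlen` is 150, long reviews are cut to their first 150 tokens.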

In [14]:
def encode_function(text, label):
    encode_func = tf.py_function(encode_padding, # Python function to wrap
                                 inp=[text],     # input tensors
                                 Tout=tf.int64   # dtype of the returned tensor
                                )
    return encode_func, label

# API reference: https://www.tensorflow.org/api_docs/python/tf/py_function

In [15]:
# Encode and pad train_dataset and test_dataset

encoded_train = train_dataset.map(encode_function,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE # elements are independent, so preprocessing can run in parallel across CPU cores
                                 )

# API reference: https://www.tensorflow.org/api_docs/python/tf/data/experimental

In [16]:
encoded_test = test_dataset.map(encode_function,
                                num_parallel_calls=tf.data.experimental.AUTOTUNE # elements are independent, so preprocessing can run in parallel across CPU cores
                               )

### Step 4: Load the pretrained GloVe embeddings

In [17]:
# Download the GloVe vectors first

!wget https://nlp.stanford.edu/data/glove.6B.zip

In [18]:
# Unzip the archive

!unzip glove.6B.zip

In [19]:
# Read the file glove.6B.50d.txt

word_embedding = dict()

with open('glove.6B.50d.txt', 'r', encoding='utf-8') as file:  # explicit encoding avoids platform-dependent defaults
    for line in file:
        tokens = line.split() # split on whitespace
        word = tokens[0] # the word
        vector = np.array(tokens[1:], dtype=np.float32) # its vector
        if vector.shape[0] == 50: 
            word_embedding[word] = vector
        else:
            print("wrong word embedding")

In [20]:
len(word_embedding)  # 400,000 words

400000
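
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses that format from an in-memory sample so it can be tried without downloading the archive; the function name `load_vectors` and the three-dimensional sample data are made up for illustration.

```python
import io

def load_vectors(file_obj, dim):
    """Parse 'word v1 v2 ... v_dim' lines into a dict; count malformed lines."""
    vectors = {}
    skipped = 0
    for line in file_obj:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            skipped += 1  # wrong number of components, ignore the line
            continue
        vectors[word] = [float(v) for v in values]
    return vectors, skipped

# Made-up sample in the same layout as glove.6B.50d.txt, but with dim=3.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\nbroken 0.1\n")
vectors, skipped = load_vectors(sample, dim=3)
print(len(vectors), skipped)
```

Counting skipped lines instead of printing inside the loop keeps the output short when a file has many malformed rows.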

### Step 5: Build the IMDB embedding matrix from GloVe

In [21]:
dim = 50

emb_matrix = np.zeros((encoder.vocab_size, dim))  # start from an all-zero matrix

In [22]:
emb_matrix.shape

(93931, 50)

In [23]:
unknown_word_count = 0
unknown_word = set()

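for word in encoder.tokens: # iterate over every vocabulary token
    vector = word_embedding.get(word)  # look up the GloVe vector; None if the word is missing
    
    if vector is not None:
        idx = encoder.encode(word)[0] # id of the word in the encoder
        emb_matrix[idx] = vector  # copy the GloVe vector into that row
    else:
        unknown_word_count += 1  # word not found in GloVe; its row stays all zeros
        unknown_word.add(word)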

In [24]:
unknown_word_count

14553
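
With 14553 misses, roughly 15% of the vocabulary has no GloVe vector and keeps an all-zero row. The same construction on a toy vocabulary, as a sketch: the names `toy_vectors` and `toy_tokens` and the two-dimensional vectors are invented, and the row layout (row 0 for padding, last row for OOV) is an assumption that matches the encoder ids used above.

```python
import numpy as np

toy_vectors = {"good": [0.1, 0.2], "bad": [-0.3, 0.4]}  # pretend GloVe table
toy_tokens = ["good", "bad", "zzzunseen"]                # pretend vocabulary
toy_dim = 2

# Row 0 is padding, rows 1..len(tokens) are words, the last row is OOV.
matrix = np.zeros((len(toy_tokens) + 2, toy_dim))
missing = []
for idx, token in enumerate(toy_tokens, start=1):
    vec = toy_vectors.get(token)
    if vec is None:
        missing.append(token)  # row stays all zeros
    else:
        matrix[idx] = vec

coverage = 1 - len(missing) / len(toy_tokens)
print(matrix.shape, missing, coverage)
```

Tracking coverage makes it easy to judge how much of the vocabulary actually benefits from the pretrained vectors.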

### Step 6: Define the BiLSTM model on the GloVe embeddings

In [25]:
# Hyperparameters

vocab_size = encoder.vocab_size  # vocabulary size

units = 64  # number of LSTM units per direction

batch_size = 100  # batch size

In [26]:
# Define the BiLSTM model
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

def build_bilstm_model(vocab_size, embedding_dim, units, batch_size, train=False):
    # batch_size is accepted for call-site compatibility but is not used here
    model = tf.keras.Sequential([
        Embedding(vocab_size,           # vocabulary size
                  embedding_dim,        # embedding dimension
                  mask_zero=True,       # treat id 0 as padding and mask it
                  weights=[emb_matrix], # initialize from the GloVe matrix
                  trainable=train),     # False freezes the embeddings during training
        Bidirectional(LSTM(units, return_sequences=True, dropout=0.5)),
        Bidirectional(LSTM(units, dropout=0.25)),
        Dense(1, activation='sigmoid')
    ])
    return model

# Reference: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

In [27]:
# Build the model with frozen embeddings

model = build_bilstm_model(vocab_size, embedding_dim=dim, units=units, batch_size=batch_size, train=False)

In [28]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 50)          4696550   
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        58880     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 4,854,375
Trainable params: 157,825
Non-trainable params: 4,696,550
_________________________________________________________________
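
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias, and the bidirectional wrapper doubles that. A short sketch of the arithmetic (the helper `lstm_params` is a hypothetical name):

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

vocab_size, embedding_dim, units = 93931, 50, 64

embedding = vocab_size * embedding_dim
bilstm_1 = 2 * lstm_params(embedding_dim, units)  # forward + backward pass
bilstm_2 = 2 * lstm_params(2 * units, units)      # input is the 128-d concatenation
dense = 2 * units + 1                             # 128 weights + 1 bias

total = embedding + bilstm_1 + bilstm_2 + dense
print(embedding, bilstm_1, bilstm_2, dense, total)
```

The embedding layer accounts for almost all parameters, which is why freezing it leaves only 157,825 trainable weights.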


In [29]:
# Compile the model

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy', 'Precision', 'Recall'])

In [30]:
# Train the model

train_batch_dataset = encoded_train.batch(batch_size).prefetch(100)  # prefetch batches so input preparation overlaps with training

model.fit(train_batch_dataset, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f57e3a83d60>

In [31]:
# Evaluate on the test set; returns [loss, accuracy, precision, recall]

model.evaluate(encoded_test.batch(batch_size))



[0.41081348061561584,
 0.8373600244522095,
 0.7978528141975403,
 0.9036800265312195]

In [32]:
# Variation: make the embedding layer trainable so that words missing from GloVe (all-zero rows) can also be learned

model_v2 = build_bilstm_model(vocab_size=vocab_size,
                              embedding_dim=dim,
                              units=units,
                              batch_size=batch_size,
                              train=True)

In [33]:
model_v2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 50)          4696550   
                                                                 
 bidirectional_2 (Bidirectio  (None, None, 128)        58880     
 nal)                                                            
                                                                 
 bidirectional_3 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 4,854,375
Trainable params: 4,854,375
Non-trainable params: 0
_________________________________________________________________


In [34]:
# Compile the model

model_v2.compile(loss='binary_crossentropy',
                 optimizer='adam',
                 metrics=['accuracy', 'Precision', 'Recall'])

In [35]:
# Train the model

model_v2.fit(train_batch_dataset, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f57166ec100>

In [36]:
# Evaluate on the test set; returns [loss, accuracy, precision, recall]

model_v2.evaluate(encoded_test.batch(batch_size))



[1.028915286064148, 0.8052399754524231, 0.7785239815711975, 0.8532000184059143]
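
Both `evaluate` calls return `[loss, accuracy, precision, recall]`, following the order of the metrics passed to `compile`. On this run the frozen-embedding model has the lower test loss and the higher accuracy. Precision and recall can be folded into a single F1 score for an easier comparison; the sketch below only reuses the numbers printed above, and the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# [loss, accuracy, precision, recall] as printed by the two evaluate() calls
frozen = [0.41081348061561584, 0.8373600244522095,
          0.7978528141975403, 0.9036800265312195]
trainable = [1.028915286064148, 0.8052399754524231,
             0.7785239815711975, 0.8532000184059143]

f1_frozen = f1_score(frozen[2], frozen[3])
f1_trainable = f1_score(trainable[2], trainable[3])
print(round(f1_frozen, 4), round(f1_trainable, 4))
```

The much higher test loss of the trainable variant would be consistent with overfitting, though the notebook does not record per-epoch metrics to confirm it.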

### Step 7: Classify with BERT

In [37]:
from transformers import BertTokenizer

In [38]:
bert_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(bert_name,                # name of the pretrained model
                                          add_special_tokens=True,  # add the special tokens [CLS], [SEP], [PAD]
                                          do_lower_case=True,       # lowercase everything
                                          max_length=150,           # fixed sequence length
                                          pad_to_max_length=True)   # pad to that length

# For the theory behind BERT, see: https://space.bilibili.com/474347248/channel/seriesdetail?sid=856305

In [39]:
# Example 1: a single sentence

tokenizer.encode_plus("The Chinese New Year is coming.",
                      add_special_tokens=True,
                      max_length=15,
                      padding='max_length',   # replaces the deprecated pad_to_max_length=True
                      truncation=True,        # truncate explicitly to max_length
                      return_attention_mask=True,
                      return_token_type_ids=True)

{'input_ids': [101, 1996, 2822, 2047, 2095, 2003, 2746, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}

In [40]:
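# Example 2: a sentence pair

tokenizer.encode_plus("The Chinese New Year", "it is coming",
                      add_special_tokens=True,
                      max_length=15,
                      padding='max_length',   # replaces the deprecated pad_to_max_length=True
                      truncation=True,        # truncate explicitly to max_length
                      return_attention_mask=True,
                      return_token_type_ids=True)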

{'input_ids': [101, 1996, 2822, 2047, 2095, 102, 2009, 2003, 2746, 102, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]}
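
The two outputs show how the three inputs are laid out: `[CLS]` (101) opens the sequence, `[SEP]` (102) closes each sentence, `token_type_ids` marks the second sentence with 1, and `attention_mask` is 1 for real tokens and 0 for padding. The sketch below rebuilds both outputs from the word ids alone. It is an illustration of the layout, not the tokenizer's implementation; `build_bert_inputs` is a hypothetical helper and it does not truncate.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids seen in the outputs above

def build_bert_inputs(ids_a, ids_b=None, max_length=15):
    """Assemble padded BERT inputs from already-tokenized id lists (no truncation)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)        # segment 0: first sentence
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # segment 1: second sentence and its [SEP]
    attention_mask = [1] * len(input_ids)         # real tokens
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

single = build_bert_inputs([1996, 2822, 2047, 2095, 2003, 2746, 1012])
pair = build_bert_inputs([1996, 2822, 2047, 2095], [2009, 2003, 2746])
print(single["input_ids"])
print(pair["token_type_ids"])
```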

In [41]:
# Following the examples above, define a function that encodes one input text

def bert_encode(text):
    text = text.numpy().decode('utf-8')  # tensor -> bytes -> str
    encode_result = tokenizer.encode_plus(text,
                                          add_special_tokens=True,
                                          max_length=150,
                                          padding='max_length',  # replaces the deprecated pad_to_max_length=True
                                          truncation=True,       # truncate explicitly to max_length
                                          return_attention_mask=True,
                                          return_token_type_ids=True)
    input_ids = encode_result['input_ids']
    token_type_ids = encode_result['token_type_ids']
    attention_mask = encode_result['attention_mask']
    
    return input_ids, token_type_ids, attention_mask

In [42]:
# Inspect one example from the GloVe-encoded training set (encoded_train)

for text, label in encoded_train.take(1):
    print(text)
    print(label)

tf.Tensor(
[50285 84571 82065 62217 84683 87904 53684 63366 85729 56017 52083 75692
 43839 91715 73352 49854 25771 78155 89161 82951 86212 68764 50285 78310
 67848 85729 61078 42793 91632 52083 57439 85886 61078 82951 58903 90473
 73734 52739 50285 87904 93060 79769 93824 50285 87904 70219 82065 65457
 59224 76117 64602 60352 84565 73093 83216 31704 53097 66730 88206 84565
 17887 75648 53097 52207 61078 66501 62051 72727 30739 23735 53808 58916
 70526 45151 77997 86390 91881 84738 57222 91715 84571 69561 68764 43384
 83216 90182 12640 52083 43384 87904 69365 84571 89982 91220 89077 57728
 86551 14879 63514 55000 69365 92990 89161 90962 89207 50285 79623 89851
 93060 89207 43839 91715 93060 67563 74675 14879 90473 92980 71481 54205
 53768     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0], shape=(150,), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)


In [43]:
# Encode every review in train_dataset with the BERT tokenizer

bert_encode_train = [bert_encode(text) for text, label in train_dataset]  # each item: (input_ids, token_type_ids, attention_mask)
bert_encode_label = [label for text, label in train_dataset]              # labels

bert_encode_train = np.array(bert_encode_train)                                      # list -> array of shape (N, 3, 150)
bert_encode_label = tf.keras.utils.to_categorical(bert_encode_label, num_classes=2)  # integer labels -> one-hot vectors

In [44]:
# Example

tf.keras.utils.to_categorical(tf.constant([0, 1, 1, 0]), num_classes=2)

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.]], dtype=float32)
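
One-hot encoding turns each integer label into a row with a single 1 at the label's position. An equivalent NumPy sketch, indexing rows of an identity matrix (this is an alternative way to get the same array, not how `to_categorical` is implemented):

```python
import numpy as np

labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype=np.float32)[labels]

# argmax over the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(one_hot)
```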

In [45]:
# Split the encoded training data into train and validation sets

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(bert_encode_train, 
                                                  bert_encode_label, 
                                                  test_size=0.2, 
                                                  random_state=666)

In [46]:
X_train.shape

(20000, 3, 150)

In [47]:
y_train.shape

(20000, 2)

In [48]:
# Split X_train and X_val into their three components

train_inputs_ids, train_token_type_ids, train_attention_masks = np.split(X_train, 3, axis=1)  # split along axis 1

val_inputs_ids, val_token_type_ids, val_attention_masks = np.split(X_val, 3, axis=1)

In [49]:
train_inputs_ids.shape

(20000, 1, 150)

In [50]:
train_token_type_ids.shape

(20000, 1, 150)

In [51]:
train_attention_masks.shape

(20000, 1, 150)

In [52]:
# Drop the singleton axis (axis 1)

train_inputs_ids = train_inputs_ids.squeeze(axis=1)
train_token_type_ids = train_token_type_ids.squeeze(axis=1)
train_attention_masks = train_attention_masks.squeeze(axis=1)

In [53]:
# Drop the singleton axis (axis 1)

val_inputs_ids = val_inputs_ids.squeeze(axis=1)
val_token_type_ids = val_token_type_ids.squeeze(axis=1)
val_attention_masks = val_attention_masks.squeeze(axis=1)

In [54]:
train_inputs_ids.shape

(20000, 150)

In [55]:
train_token_type_ids.shape

(20000, 150)

In [56]:
train_attention_masks.shape

(20000, 150)
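
The split-then-squeeze sequence takes an array of shape `(N, 3, L)` apart into three arrays of shape `(N, L)`. A small sketch on toy data shows the shapes and an equivalent shortcut that indexes the middle axis directly; the array `X` and its sizes are made up.

```python
import numpy as np

n, seq_len = 4, 6
X = np.arange(n * 3 * seq_len).reshape(n, 3, seq_len)  # toy array of shape (N, 3, L)

# np.split yields three (N, 1, L) arrays; squeezing axis 1 gives (N, L).
ids, type_ids, masks = (part.squeeze(axis=1) for part in np.split(X, 3, axis=1))

# Equivalent shortcut: index the middle axis directly.
ids_direct = X[:, 0, :]

# A bare squeeze() also drops the batch axis when N == 1.
single_row = np.zeros((1, 1, seq_len))
print(ids.shape, single_row.squeeze().shape, single_row.squeeze(axis=1).shape)
```

Naming the axis in `squeeze` keeps the batch dimension intact even for a single-example batch.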

In [57]:
# Build the training and validation batches

def combine_dataset(input_ids, token_type_ids, attention_mask, label):
    data_format = {'input_ids' : input_ids,
                   'token_type_ids' :token_type_ids,
                   'attention_mask' : attention_mask}
    
    return data_format, label

In [58]:
# Training batches

train_ds = tf.data.Dataset.from_tensor_slices((train_inputs_ids,
                                               train_token_type_ids,
                                               train_attention_masks,
                                               y_train)).map(combine_dataset).shuffle(100).batch(16)

In [59]:
# Validation batches

val_ds = tf.data.Dataset.from_tensor_slices((val_inputs_ids,
                                             val_token_type_ids,
                                             val_attention_masks,
                                             y_val)).map(combine_dataset).shuffle(100).batch(16)

In [60]:
# Load the sequence classification model from Hugging Face

from transformers import TFBertForSequenceClassification

bert_model = TFBertForSequenceClassification.from_pretrained(bert_name)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [61]:
# Define the optimizer

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)


# Define the loss function

loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
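
The classifier head outputs two logits per review and the labels are one-hot, so binary cross-entropy here treats each logit as an independent yes/no prediction and averages the two losses. A softmax-based categorical cross-entropy is the more common pairing for one-hot labels. The pure-Python sketch below contrasts the two formulas on one example; it illustrates the math only, the function names are made up, and it is not the Keras implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_ce_from_logits(logits, one_hot):
    # Each logit is an independent yes/no prediction; the losses are averaged.
    losses = []
    for z, y in zip(logits, one_hot):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

def categorical_ce_from_logits(logits, one_hot):
    # Softmax couples the logits into one distribution over the classes.
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return -sum(y * (z - log_norm) for z, y in zip(logits, one_hot))

logits, label = [2.0, -1.0], [1, 0]
bce = binary_ce_from_logits(logits, label)
cce = categorical_ce_from_logits(logits, label)
print(round(bce, 4), round(cce, 4))
```

Both losses shrink as the logit of the true class grows relative to the other, so either can train the classifier; they differ in scale and in how the two logits interact.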

In [62]:
# Compile the model

bert_model.compile(optimizer=optimizer,
                   loss=loss,
                   metrics=['accuracy'])

In [63]:
bert_model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [64]:
history = bert_model.fit(train_ds,
                         epochs=1,
                         validation_data=val_ds)

In [65]:
# Encode the test set for evaluation

bert_test = [bert_encode(text) for text, label in test_dataset]
bert_test_label = [label for text, label in test_dataset]

bert_test = np.array(bert_test)
bert_test_label = tf.keras.utils.to_categorical(bert_test_label, num_classes=2)

test_inputs_ids, test_token_type_ids, test_attention_masks = np.split(bert_test, 3, axis=1)  # split along axis 1

test_inputs_ids = test_inputs_ids.squeeze(axis=1)
test_token_type_ids = test_token_type_ids.squeeze(axis=1)
test_attention_masks = test_attention_masks.squeeze(axis=1)

test_ds = tf.data.Dataset.from_tensor_slices((test_inputs_ids,
                                              test_token_type_ids,
                                              test_attention_masks,
                                              bert_test_label)).map(combine_dataset).shuffle(100).batch(16)

In [66]:
# Evaluate on the test set; returns [loss, accuracy]

bert_model.evaluate(test_ds)



[0.25527384877204895, 0.8922799825668335]