# BERT情感分析

![jupyter](./imgs/bert_classification.png)

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

import warnings
warnings.filterwarnings('ignore')

## 1. 创建数据集

In [2]:
train_df = pd.read_csv('./dataset/train.csv')
val_df = pd.read_csv('./dataset/val.csv')
test_df = pd.read_csv('./dataset/test.csv')
val_df.head()

Unnamed: 0,text,label
0,在 韩 红 基 金 会 0 0 0 0 年 的 审 计 报 告 上 可 看 到 其 0 0 ...,1
1,感 谢 有 你 们 向 所 有 奋 战 在 一 线 的 医 护 人 员 致 敬 期 待 大 ...,2
2,0 0 省 份 一 省 包 一 市 支 援 湖 北 嘿 你 在 干 嘛 呢 何 老 师 的 ...,2
3,普 通 感 冒 以 后 也 别 吃 抗 生 素 了 烧 吃 退 烧 药 布 洛 芬 啥 的 ...,1
4,爷 爷 你 好 我 是 武 汉 儿 童 医 院 的 护 士 你 带 着 小 孩 来 武 汉 ...,1


##  Hugging Face Transformers 

Transformers提供了NLP领域大量state-of-art的预训练语言模型结构的模型和调用框架。  
到目前为止，transformers 提供了超过100种语言的，32种预训练语言模型，简单，强大，高性能，是新手入门的不二选择。   

![jupyter](./imgs/berts.png)

### BERT 输入格式
![jupyter](./imgs/bert_inputs.png)

### BERT 文本分类输入
![jupyter](./imgs/bert_for_classification.png)

### 使用TFBertForSequenceClassification进行文本分类
https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification   

BERT 具有两种输出  
1. pooler output，对应的[CLS]的输出   
2. sequence output，对应的是序列中的所有字的最后一层hidden输出last_hidden_state。   

BERT主要可以处理两种，
- 一种任务是分类/回归任务（使用的是pooler output）
- 一种是序列任务（sequence output）。  
TFBertForSequenceClassification，即使用pooler output接softmax进行分类任务。  
![jupyter](./imgs/bert_outputs.jpeg)

In [3]:
from transformers import BertTokenizer

# 定义中文base bert的tokenzier
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

In [4]:
test_sentence = '写在年末冬初孩子流感的第五天，我们仍然没有忘记热情拥抱这2020年的第一天。'

# 使用中文base bert的tokenzier将文本转化为对应bert的输入
bert_input = tokenizer.encode_plus(
                        test_sentence,                      
                        add_special_tokens = True,  # 标记是否添加[CLS], [SEP]特殊字符
                        max_length = 50, #  最长序列长度
                        pad_to_max_length = True, # 标记是否添加[PAD]到最长长度
                        truncation=True,  # 标记是否截断
                        return_attention_mask = True, # 添加注意力掩码，使注意力计算不关注pad的数据
                        )

for k, v in bert_input.items():
    print(k)
    print(v)

input_ids
[101, 1091, 1762, 2399, 3314, 1100, 1159, 2111, 2094, 3837, 2697, 4638, 5018, 758, 1921, 8024, 2769, 812, 793, 4197, 3766, 3300, 2563, 6381, 4178, 2658, 2881, 2849, 6821, 8439, 2399, 4638, 5018, 671, 1921, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
token_type_ids
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [5]:
def convert_sample_to_feature(text, max_length):
    return tokenizer.encode_plus(text, 
                                 add_special_tokens=True, 
                                 max_length=max_length, 
                                 padding='max_length',    
                                 truncation=True,
                                 return_attention_mask = True,
                                )

# 将输入映射成TFBertForSequenceClassification的格式
def map_sample_to_dict(input_ids, token_type_ids, attention_masks, label):
    return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label


# 创建TF数据集
def build_dataset(df, max_length):
    # 准备列表，以便我们可以从列表中构建最终的TensorFlow数据集
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    # 将输入数据转化为BERT输入
    for _, row in df.iterrows():
        text, label = row["text"], row["label"]
        bert_input = convert_sample_to_feature(text, max_length)  # 对文本进行转换成BERT输入
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    dataset = tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list))
    dataset = dataset.map(map_sample_to_dict)
    return dataset

In [6]:
BATCH_SIZE = 32   # bert模型较复杂参数较多，batch size一般不大
MAX_SEQ_LEN = 240  # 最长序列长度
NUM_LABELS = 3  # 标签数量
BUFFER_SIZE = len(train_df)

# 创建数据集
train_dataset = build_dataset(train_df, MAX_SEQ_LEN).shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(BUFFER_SIZE)
val_dataset = build_dataset(val_df, MAX_SEQ_LEN).batch(BATCH_SIZE)
test_dataset = build_dataset(test_df, MAX_SEQ_LEN).batch(BATCH_SIZE)

## 2. 构建BERT分类模型

In [7]:
from transformers import TFBertForSequenceClassification

# 使用TF版本的中文base bert分类模型
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese',  # base中文bert
                                                        num_labels=NUM_LABELS  # 指定输出的类别数
                                                       )

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# BERT学习率一般较小, 使用Adam优化器 3e-5, 3e-6
LR = 3e-6

# BERT参数量大，拟合能力较强，在这个数据集上不需要太多迭代
EPOCHS = 5

# 同样早停等待次数也设置小一些
PATIENCE = 1

# 常用Adam优化器
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)

# 这里的标签值并不是one-hot的，所以loss需要SparseCategoricalCrossentropy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) # from_logits为True会用softmax将y_pred转化为概率，结果更稳定
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[metric])

In [9]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                            patience=PATIENCE,
                                            restore_best_weights=True)

bert_history = model.fit(train_dataset,
                         epochs=EPOCHS,
                         callbacks=[callback],
                         validation_data=val_dataset)

Epoch 1/5
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [10]:
# BERT模型保存
save_model_path = "./bert/bert_classification"
model.save_pretrained(save_model_path, saved_model=True)

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: ./bert/bert_classification/saved_model/1/assets


## 3. 模型评估

In [11]:
# 结果包括loss和logits, 取出模型预测Logits
output = model.predict(test_dataset)
output



TFSequenceClassifierOutput(loss=None, logits=array([[-2.827365  ,  0.62048835,  2.1867077 ],
       [-2.3731406 ,  0.91308105,  1.669932  ],
       [ 1.601606  ,  0.28248912, -1.983878  ],
       ...,
       [-2.8215024 ,  2.8636997 , -0.22697781],
       [ 1.6240071 ,  0.98691046, -2.6499858 ],
       [-2.706423  , -0.49490097,  3.3001475 ]], dtype=float32), hidden_states=None, attentions=None)

In [12]:
preds = np.argmax(output.logits, axis=-1)

preds[:10]

array([2, 2, 0, 1, 1, 1, 1, 1, 1, 1])

In [13]:
from sklearn.metrics import classification_report
test_label = test_df['label']
result = classification_report(test_label, preds)
print(result)

              precision    recall  f1-score   support

           0       0.68      0.66      0.67      1796
           1       0.78      0.81      0.79      5651
           2       0.74      0.68      0.71      2551

    accuracy                           0.75      9998
   macro avg       0.73      0.72      0.72      9998
weighted avg       0.75      0.75      0.75      9998



## 4. 模型预测

In [14]:
# 加载保存好的模型
save_model_path = "./bert/bert_classification"
saved_model = TFBertForSequenceClassification.from_pretrained(save_model_path, 
                                                              num_labels=NUM_LABELS)
saved_model.summary()

Some layers from the model checkpoint at ./bert/bert_classification were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./bert/bert_classification.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  102267648 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  2307      
Total params: 102,269,955
Trainable params: 102,269,955
Non-trainable params: 0
_________________________________________________________________


In [15]:
predict_sentences = [
    "因为疫情被困家里2个月了，好压抑啊，感觉自己抑郁了！",
    "我国又一个新冠病毒疫苗获批紧急使用。",
    "我们在一起，打赢这场仗，抗击新馆疫情，我们在行动！"]

# 调用中文bert base模型的tokenzier
predict_inputs = tokenizer(predict_sentences,
                           padding=True,
                           max_length=MAX_SEQ_LEN, 
                           return_tensors="tf")
# 直接call保存好的bert model
output = saved_model(predict_inputs)

# 取出模型预测结果的logits
predict_logits = output.logits.numpy()

predict_logits

array([[ 0.436507  , -0.08783254, -0.18646817],
       [-1.7852176 ,  0.88021284,  1.3777912 ],
       [-2.0149825 , -0.03028507,  2.2247462 ]], dtype=float32)

In [16]:
# 取出分数最高的标签
predict_results = np.argmax(predict_logits, axis=1)
# 还原标签
predict_labels = [label - 1 for label in predict_results] 
predict_labels

[-1, 1, 1]

In [17]:
# 格式化预测结果
for text, label in zip(predict_sentences, predict_labels):
    print(f'文本: {text}\n预测标签: {label}')

文本: 因为疫情被困家里2个月了，好压抑啊，感觉自己抑郁了！
预测标签: -1
文本: 我国又一个新冠病毒疫苗获批紧急使用。
预测标签: 1
文本: 我们在一起，打赢这场仗，抗击新馆疫情，我们在行动！
预测标签: 1


## 5. 模型优化

BERT是一种预训练语言模型，参数量较大，训练较慢。    
可以将BERT作为embedding层，固定其参数，只做前向运算，再接其他特征抽取层进行特征抽取。

#### 将BERT作为embedding层  
![jupyter](./imgs/bert_embedding.png)

#### BERT接特征抽取层
![jupyter](./imgs/bert_embedding2.png)

#### 使用BERT的序列输出
![jupyter](./imgs/bert_token_classification.jpeg)

In [18]:
from transformers import TFBertModel

bert_model = TFBertModel.from_pretrained('bert-base-chinese')  # 初始化中文bert base model

input_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), name='input_ids', dtype='int32')
token_type_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), name='token_type_ids', dtype='int32')  # 定义bert model输入
attention_masks = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), name='attention_mask', dtype='int32') 
embedding_layer = bert_model(input_ids, attention_masks)[0]  # 取出BERT另一种输出last_hidden_state，然后特征抽取器
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100,  # 使用双向LSTM进行特征抽取
                                                       return_sequences=True,
                                                       dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)  # 进行max pooling
X = tf.keras.layers.BatchNormalization()(X) 
X = tf.keras.layers.Dense(256, activation='relu')(X)
X = tf.keras.layers.Dropout(0.5)(X)
y = tf.keras.layers.Dense(3, activation='softmax', name='outputs')(X)  # 3 labels due to three sentiment classes

model = tf.keras.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs = y)

for layer in model.layers[:3]:  # 将BERT相关层权重冻结，不可训练，只做前向运算
     layer.trainable = False

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.




In [19]:
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 240)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 240)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     TFBaseModelOutputWit 102267648   input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 240, 200)     695200      tf_bert_model[0][0]   