<a href="https://colab.research.google.com/github/LydiaTai/covid-bert/blob/main/CT_BERT_Huggingface_(GPU_training).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="450px" src="https://github.com/digitalepidemiologylab/covid-twitter-bert/raw/master/images/COVID-Twitter-BERT-medium.png">

# Finetuning COVID-Twitter-BERT using Huggingface
In this notebook we will finetune CT-BERT for sentiment classification using the transformers library by Huggingface.

Learn more about this library [here](https://huggingface.co/transformers/).

## Before proceeding
Create a copy of this notebook by going to "File - Save a Copy in Drive"


# Install transformers and import libraries

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl

In [2]:
# Only the names used in this notebook are imported
from transformers import (
   AutoConfig,
   AutoTokenizer,
   TFAutoModelForSequenceClassification
)
import tensorflow as tf
import tensorflow_datasets as tfds
import json

# Choose a Model from the Huggingface Library

In [3]:
# Choose model
# @markdown >The default model is <i><b>COVID-Twitter-BERT</b></i>. You can, however, choose <i><b>BERT Base</b></i> or <i><b>BERT Large</b></i> to compare these models with <i><b>COVID-Twitter-BERT</b></i>. All three models are initialised with a random classification layer. If you go directly to the Predict cell after compiling the model, the prediction will still run, but the output will be random. The training steps below finetune the model for the specific task. <br /><br />
model_name = 'digitalepidemiologylab/covid-twitter-bert' #@param ["digitalepidemiologylab/covid-twitter-bert", "bert-large-uncased", "bert-base-uncased"]

# Initialise tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
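
Later cells ask this tokenizer to pad and truncate every text to one fixed length (`padding='max_length'`, `truncation=True`). The sketch below imitates that step in plain Python so the shape of the result is easy to see. The helper `pad_or_truncate` and the token IDs are made up for illustration; a real tokenizer also keeps its special tokens in place when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # cut off anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)         # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding


# A short input is padded; the mask tells the model which positions to ignore.
short_ids = [11, 12, 13]  # hypothetical token ids
input_ids, attention_mask = pad_or_truncate(short_ids, max_length=6)
print(input_ids)       # [11, 12, 13, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0]

# A long input is truncated and needs no padding.
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(long_ids)   # [1, 2, 3, 4, 5, 6]
print(long_mask)  # [1, 1, 1, 1, 1, 1]
```

Because every row ends up with the same length, a shorter maximum length means smaller tensors per example, which is why it trades off against batch size.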



# Download the SST-2 Dataset and Prepare for Finetuning
You can skip this step if you are using the already finetuned model

In [10]:
# Parameters
#@markdown >Batch size and sequence length need to be set to prepare the data. The batch size depends on available memory. On a Colab GPU, limit the batch size to 8 and the sequence length to 96. Reducing the input length (max_seq_length) also lets you increase the batch size. For a dataset like SST-2 with lots of short sentences, this will likely benefit training.
max_seq_length = 96 #@param {type: "integer"}
train_batch_size = 8 #@param {type: "integer"}
eval_batch_size = 8 #@param {type: "integer"}

#@markdown >The SST-2 dataset has around 68,000 examples, and we do not need them all to train a decent model. To cut down training time, reduce this to a percentage of the entire set.
use_percentage_of_data = 5 #@param {type: "slider", min: 1, max: 100}

# Get dataset sizes
glue_builder = tfds.builder('glue/sst2')  # Note: the task is set to sst2 here
glue_builder.download_and_prepare()
num_train_examples = glue_builder.info.splits['train'].num_examples
num_dev_examples = glue_builder.info.splits['validation'].num_examples
num_labels = glue_builder.info.features['label'].num_classes

# Load the datasets
train_data = glue_builder.as_dataset(split='train')
dev_data = glue_builder.as_dataset(split='validation')

# Convert a raw dataset into model input features
def convert_dataset_to_features(dataset, tokenizer, max_length, num_examples=None):
    texts = []
    labels = []

    # Extract texts and labels from the dataset; take(-1) keeps every example
    for example in dataset.take(-1 if num_examples is None else num_examples):
        text = example['sentence'].numpy().decode('utf-8')
        label = example['label'].numpy()
        texts.append(text)
        labels.append(label)

    # Tokenize the texts, padding and truncating to max_length
    encoded = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

    # Build the feature dataset
    dataset = tf.data.Dataset.from_tensor_slices({
        'input_ids': encoded['input_ids'],
        'attention_mask': encoded['attention_mask'],
        'labels': tf.convert_to_tensor(labels, dtype=tf.int32)
    })

    return dataset

# Compute how many examples to use
num_train_to_use = int(num_train_examples * (use_percentage_of_data/100))
num_dev_to_use = int(num_dev_examples * (use_percentage_of_data/100))

# Convert the datasets to features
train_dataset = convert_dataset_to_features(
    train_data, tokenizer, max_length=max_seq_length, num_examples=num_train_to_use
)
train_dataset = train_dataset.shuffle(100).batch(train_batch_size)

dev_dataset = convert_dataset_to_features(
    dev_data, tokenizer, max_length=max_seq_length, num_examples=num_dev_to_use
)
dev_dataset = dev_dataset.batch(eval_batch_size)

# Map the labels for printing
label_mapping = {i: glue_builder.info.features['label'].int2str(i) for i in range(num_labels)}

print(f'\n\nThe dataset is downloaded. The entire dataset has {num_train_examples + num_dev_examples} examples of which you are using {use_percentage_of_data}%. This will result in a train dataset with {num_train_to_use} examples and a validation dataset with {num_dev_to_use} examples.')



The dataset is downloaded. The entire dataset has 68221 examples of which you are using 5%. This will result in a train dataset with 3367 examples and a validation dataset with 43 examples.
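
The numbers in this message can be reproduced by hand. The sketch below repeats the arithmetic, assuming the standard SST-2 split sizes of 67,349 training and 872 validation examples; these are consistent with the total of 68,221 printed above, but they are hard-coded here rather than read from `glue_builder`.

```python
import math

# Assumed split sizes (standard SST-2); the notebook reads them from glue_builder.
num_train_examples = 67349
num_dev_examples = 872
use_percentage_of_data = 5
train_batch_size = 8
eval_batch_size = 8

# int() truncates, so fractional examples are dropped.
num_train_to_use = int(num_train_examples * (use_percentage_of_data / 100))
num_dev_to_use = int(num_dev_examples * (use_percentage_of_data / 100))
print(num_train_to_use, num_dev_to_use)  # 3367 43

# Batches needed to see every selected example once (the last batch is partial).
train_batches = math.ceil(num_train_to_use / train_batch_size)
dev_batches = math.ceil(num_dev_to_use / eval_batch_size)
print(train_batches, dev_batches)  # 421 6

# Full batches only, which is what an int(...) division by the batch size yields.
full_train_batches = num_train_to_use // train_batch_size
full_dev_batches = num_dev_to_use // eval_batch_size
print(full_train_batches, full_dev_batches)  # 420 5
```

With only 43 validation examples, each one moves the validation accuracy by more than two percentage points, so a larger percentage gives a steadier estimate.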


# Compile the Model, Train it on the SST-2 Task and Save the Result
You can skip this step if you are using the already finetuned model

In [12]:
#@markdown >The default learning rate of 2e-5 will be fine in most cases
learning_rate = 2e-5 #@param {type: "number"}

#@markdown > Typically this type of model is finetuned for 3 epochs. This can be increased for small datasets and decreased for large datasets.
num_epochs = 1  #@param {type: "integer"}

# Initialise a Model for Sequence Classification with 2 labels
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

# Optimizer and loss
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Metrics and callbacks
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
checkpoint_path = './checkpoints/checkpoint.{epoch:02d}'
callbacks = [tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True)]

# Compute some variables
train_steps_per_epoch = int(num_train_examples * (use_percentage_of_data/100) / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples * (use_percentage_of_data/100) / eval_batch_size)


# Compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

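# Train the model
# Note: train_dataset is finite (it has no .repeat()). With num_epochs > 1 and
# steps_per_epoch set, Keras may run out of batches after the first epoch; in that
# case add .repeat() to train_dataset or drop the step arguments below.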
history = model.fit(train_dataset,
  epochs=num_epochs,
  steps_per_epoch=train_steps_per_epoch,
  validation_data=dev_dataset,
  validation_steps=dev_steps_per_epoch,
  callbacks=callbacks)

# Print some information about the training
print(f'\nThe training has finished training after {num_epochs} epochs.')
print('\nThe history contains the accuracy and loss at every epoch:')
print(json.dumps(history.history, indent=4))

print('\nThe checkpoint callback has generated a checkpoint after every epoch (loss being the training loss, val_loss is the validation loss):')
!ls -lha ./checkpoints/

print('\nWe will now save the finetuned model and the corresponding config file on your Colab disk.')
model.save_pretrained('./huggingface_model/')

print('\nTensorflow model and config-file is saved in ./huggingface_model/')
!ls -lha ./huggingface_model/

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at digitalepidemiologylab/covid-twitter-bert and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



The training has finished training after 1 epochs.

The history contains the accuracy and loss at every epoch:
{
    "loss": [
        0.35910943150520325
    ],
    "accuracy": [
        0.8422619104385376
    ],
    "val_loss": [
        0.16689275205135345
    ],
    "val_accuracy": [
        0.949999988079071
    ]
}
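
The `loss` and `val_loss` entries are averages of a per-example sparse categorical cross-entropy computed from logits. As a rough guide to reading them, the sketch below computes that quantity for a single example in plain Python. The function name and the logits are chosen for illustration; this is not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    peak = max(logits)
    shifted = [x - peak for x in logits]                       # for numerical stability
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[label] - log_sum_exp)


# Equal logits mean a 50/50 guess between two classes: the loss is ln(2).
uniform_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=1)
print(round(uniform_loss, 4))  # 0.6931

# A confident prediction is cheap when right and expensive when wrong.
confident_right = sparse_categorical_crossentropy_from_logits([-3.0, 3.0], label=1)
confident_wrong = sparse_categorical_crossentropy_from_logits([-3.0, 3.0], label=0)
print(confident_right < uniform_loss < confident_wrong)  # True
```

For a two-class task, a mean loss below ln(2) ≈ 0.693 indicates the model is doing better than an uninformed 50/50 guess, which is the case for both values in the history above.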

The checkpoint callback has generated a checkpoint after every epoch (loss being the training loss, val_loss is the validation loss):
total 3.8G
drwxr-xr-x 2 root root 4.0K May 13 05:53 .
drwxr-xr-x 1 root root 4.0K May 13 05:43 ..
-rw-r--r-- 1 root root   83 May 13 05:53 checkpoint
-rw-r--r-- 1 root root 3.8G May 13 05:53 checkpoint.01.data-00000-of-00001
-rw-r--r-- 1 root root  73K May 13 05:53 checkpoint.01.index
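
The file name `checkpoint.01` comes from the `{epoch:02d}` placeholder in `checkpoint_path`, which is filled in with the epoch number using Python's format syntax. A small sketch of the resulting names, using epoch numbers picked for illustration:

```python
checkpoint_path = './checkpoints/checkpoint.{epoch:02d}'

# 02d pads the epoch number with zeros to at least two digits.
names = [checkpoint_path.format(epoch=epoch) for epoch in (1, 2, 10)]
print(names)
# ['./checkpoints/checkpoint.01', './checkpoints/checkpoint.02', './checkpoints/checkpoint.10']
```

Since each epoch gets its own name, earlier checkpoints are not overwritten. At the 3.8G per checkpoint shown in the listing, disk usage grows quickly with the number of epochs.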

We will now save the finetuned model and the corresponding config file on your Colab disk.

Tensorflow model and config-file is saved in ./huggingface_model/
total 1.3G
drwxr-xr-x 2 root root 4.0K May 13 05:53 .
drwxr-xr-x 1 root root 4.0K May 13 05:53 .

# Predict
Let's run some inference with the trained model

In [13]:
# Small function only used for formatting the output
def format_prediction(preds, label_mapping, label_name):
    preds = tf.nn.softmax(preds, axis=1)
    formatted_preds = []
    for pred in preds.numpy():
        # convert to Python types and sort
        pred = {label: float(probability) for label, probability in zip(label_mapping.values(), pred)}
        pred = {k: v for k, v in sorted(pred.items(), key=lambda item: item[1], reverse=True)}
        formatted_preds.append({label_name: list(pred.keys())[0], f'{label_name}_probabilities': pred})
    return formatted_preds

In [14]:
#@markdown >Please input text that the model can try to classify
input_text = 'I want to get vaccines.'  #@param {type: "string"}

# Tokenize the input
input_ids = tf.constant(tokenizer.encode(input_text, add_special_tokens=True))[None, :]

# Run predictions
preds = model(input_ids)

# format logits
formatted_preds = format_prediction(preds[0], label_mapping, 'sentiment')

print(f'\nLabel Mapping:{json.dumps(label_mapping, indent=4)}')
print(f'\nLogits: {preds}')
print(f'\nProbabilities:{json.dumps(formatted_preds, indent=4)}')


Label Mapping:{
    "0": "negative",
    "1": "positive"
}

Logits: TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-0.30152217, -0.21324764]], dtype=float32)>, hidden_states=None, attentions=None)

Probabilities:[
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.5220543146133423,
            "negative": 0.4779457151889801
        }
    }
]
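
The probabilities above follow directly from the printed logits. The sketch below redoes the softmax and the label lookup in plain Python, mirroring what `format_prediction` does with TensorFlow; the `softmax` helper is written here for illustration.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


label_mapping = {0: 'negative', 1: 'positive'}
logits = [-0.30152217, -0.21324764]  # copied from the output above

# Pair each label with its probability, then sort so the most likely label is first.
probabilities = dict(zip(label_mapping.values(), softmax(logits)))
ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
predicted_label = ranked[0][0]
print(predicted_label, round(ranked[0][1], 4))  # positive 0.5221
```

The two logits differ by less than 0.1, so the probabilities stay close to 50/50 even though the predicted label is "positive".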


##### Copyright 2020 Per Egil Kummervold and Martin Müller

In [15]:
import tensorflow as tf
import json

# Assumes these functions and variables are already defined
# tokenizer = ...
# model = ...
# label_mapping = ...
# format_prediction = ...

def analyze_texts(texts):
    """Run sentiment analysis on several texts and rank them by positive probability, descending."""
    results = []

    for text in texts:
        # Tokenize the input text
        input_ids = tf.constant(tokenizer.encode(text, add_special_tokens=True))[None, :]

        # Run the model
        preds = model(input_ids)

        # Format the prediction
        formatted_preds = format_prediction(preds[0], label_mapping, 'sentiment')

        # Extract the probability of the positive class
        try:
            # Read the positive probability from the nested structure
            positive_probability = formatted_preds[0]['sentiment_probabilities']['positive']
        except (IndexError, KeyError, TypeError):
            print(f"Warning: could not parse the prediction for text '{text[:30]}...'")
            positive_probability = 0.0

        # Store the result
        results.append({
            'text': text,
            'positive_probability': positive_probability,
            'all_predictions': formatted_preds
        })

    # Sort by positive probability, highest first
    sorted_results = sorted(results, key=lambda x: x['positive_probability'], reverse=True)

    return sorted_results

# Example usage
sample_texts = [
    "Vaccines are a great achievement for public health!",
    "I'm not sure about getting vaccinated.",
    "Getting vaccinated is the best way to protect ourselves.",
    "Vaccines might have some side effects, but they're generally safe."
]

# Analyse and rank the texts
ranked_texts = analyze_texts(sample_texts)

# Print the ranked results
print("\nTexts ranked by positive sentiment (descending):")
for i, result in enumerate(ranked_texts, 1):
    print(f"\nRank {i}:")
    print(f"Text: {result['text']}")
    print(f"Positive Probability: {result['positive_probability']:.4f}")
    print(f"All Predictions: {json.dumps(result['all_predictions'], indent=4)}")


Texts ranked by positive sentiment (descending):

Rank 1:
Text: Vaccines are a great achievement for public health!
Positive Probability: 0.9675
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.9674563407897949,
            "negative": 0.0325436070561409
        }
    }
]

Rank 2:
Text: Vaccines might have some side effects, but they're generally safe.
Positive Probability: 0.8315
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.8315386176109314,
            "negative": 0.16846135258674622
        }
    }
]

Rank 3:
Text: Getting vaccinated is the best way to protect ourselves.
Positive Probability: 0.7910
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.7910466194152832,
            "negative": 0.20895333588123322
        }
    }
]

Rank 4:
Text: I'm not sure about gettin

In [9]:
# Reuses analyze_texts, tokenizer, model, label_mapping and format_prediction
# as defined in the cells above.

# Example usage
sample_texts = [
    "This pandemic situation is really getting better!",
    "This pandemic situation is really getting worse!",
    "I want to get vaccinated.",
    "I am not sure if I want to get vaccinated.",
    "I am not sure if I want to get vaccinated or not.",
]

# Analyse and rank the texts
ranked_texts = analyze_texts(sample_texts)

# Print the ranked results
print("\nTexts ranked by positive sentiment (descending):")
for i, result in enumerate(ranked_texts, 1):
    print(f"\nRank {i}:")
    print(f"Text: {result['text']}")
    print(f"Positive Probability: {result['positive_probability']:.4f}")
    print(f"All Predictions: {json.dumps(result['all_predictions'], indent=4)}")


Texts ranked by positive sentiment (descending):

Rank 1:
Text: This pandemic situation is really getting better!
Positive Probability: 0.9007
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.9007272720336914,
            "negative": 0.0992727056145668
        }
    }
]

Rank 2:
Text: I want to get vaccinated.
Positive Probability: 0.2977
All Predictions: [
    {
        "sentiment": "negative",
        "sentiment_probabilities": {
            "negative": 0.7023324370384216,
            "positive": 0.297667533159256
        }
    }
]

Rank 3:
Text: This pandemic situation is really getting worse!
Positive Probability: 0.0715
All Predictions: [
    {
        "sentiment": "negative",
        "sentiment_probabilities": {
            "negative": 0.9285420775413513,
            "positive": 0.07145794481039047
        }
    }
]

Rank 4:
Text: I am not sure if I want to get vaccinated.
Positive Probability: 0.0559
All Pr
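
A high rank does not imply a positive label: in the output above, "I want to get vaccinated." is ranked second but is still classified as negative. The sketch below replays the ranking with the four probabilities visible in that output (the fifth text is cut off and left out), and derives the label with a 0.5 cut-off, which for two classes matches picking the more probable class.

```python
# Probabilities copied from the ranked output above (rounded to 4 decimals).
results = [
    {'text': 'This pandemic situation is really getting better!', 'positive_probability': 0.9007},
    {'text': 'This pandemic situation is really getting worse!', 'positive_probability': 0.0715},
    {'text': 'I want to get vaccinated.', 'positive_probability': 0.2977},
    {'text': 'I am not sure if I want to get vaccinated.', 'positive_probability': 0.0559},
]

# Same sort as analyze_texts: highest positive probability first.
ranked = sorted(results, key=lambda r: r['positive_probability'], reverse=True)

# With two classes, "positive" wins exactly when its probability exceeds 0.5.
labels = ['positive' if r['positive_probability'] > 0.5 else 'negative' for r in ranked]

for rank, (r, label) in enumerate(zip(ranked, labels), 1):
    print(rank, label, r['text'])
```

Only the first text crosses the 0.5 line, so the ranking is best read as an ordering of relative positivity rather than as a list of positive texts.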

In [14]:
# Reuses analyze_texts, tokenizer, model, label_mapping and format_prediction
# as defined in the cells above.

# Example usage
sample_texts = [
    "This pandemic situation is really getting better!",
    "This pandemic situation is really getting worse!",
    "I want to get vaccinated.",
    "I am not sure if I want to get vaccinated.",
    "I am not sure if I want to get vaccinated or not.",
    "This pandemic situation is really getting better!😍"
]

# Analyse and rank the texts
ranked_texts = analyze_texts(sample_texts)

# Print the ranked results
print("\nTexts ranked by positive sentiment (descending):")
for i, result in enumerate(ranked_texts, 1):
    print(f"\nRank {i}:")
    print(f"Text: {result['text']}")
    print(f"Positive Probability: {result['positive_probability']:.4f}")
    print(f"All Predictions: {json.dumps(result['all_predictions'], indent=4)}")


Texts ranked by positive sentiment (descending):

Rank 1:
Text: This pandemic situation is really getting better!
Positive Probability: 0.9807
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.9806835055351257,
            "negative": 0.019316459074616432
        }
    }
]

Rank 2:
Text: This pandemic situation is really getting better!😍
Positive Probability: 0.9591
All Predictions: [
    {
        "sentiment": "positive",
        "sentiment_probabilities": {
            "positive": 0.9590954780578613,
            "negative": 0.04090452939271927
        }
    }
]

Rank 3:
Text: I want to get vaccinated.
Positive Probability: 0.4858
All Predictions: [
    {
        "sentiment": "negative",
        "sentiment_probabilities": {
            "negative": 0.5141684412956238,
            "positive": 0.4858315885066986
        }
    }
]

Rank 4:
Text: I am not sure if I want to get vaccinated.
Positive Probability: 0.1100
A

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json
import os

import tensorflow as tf
import transformers
from transformers import AutoTokenizer

# Save path
save_path = "/content/drive/MyDrive/saved_model/"
os.makedirs(save_path, exist_ok=True)

# Save the model weights and config
model.save_pretrained(save_path)

# Save the tokenizer (the same one used during training)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(save_path)

# Save the training parameters (all key settings)
training_args = {
    "model_name": model_name,  # name of the pretrained model
    "num_labels": num_labels,  # number of classes in the classification task
    "max_seq_length": max_seq_length,  # maximum sequence length
    "train_batch_size": train_batch_size,  # training batch size
    "eval_batch_size": eval_batch_size,  # evaluation batch size
    "use_percentage_of_data": use_percentage_of_data,  # percentage of the data used
    "learning_rate": learning_rate,  # learning rate
    "num_epochs": num_epochs,  # number of training epochs
    "num_train_examples": num_train_examples,  # total number of training examples
    "num_dev_examples": num_dev_examples,  # total number of validation examples
    "train_steps_per_epoch": train_steps_per_epoch,  # training steps per epoch
    "dev_steps_per_epoch": dev_steps_per_epoch,  # validation steps per epoch
    "optimizer": optimizer.get_config(),  # optimizer configuration
    "loss": loss.__class__.__name__,  # name of the loss function
    "metrics": [m.name for m in metrics],  # list of evaluation metrics
}

# Write the training parameters to a file
args_file = os.path.join(save_path, "training_args.json")
with open(args_file, "w") as f:
    # default=str guards against values in the optimizer config that are not plain JSON types
    json.dump(training_args, f, indent=4, default=str)

print(f"Training parameters saved to: {args_file}")

# Write requirements.txt with the library versions of the current session
requirements_file = os.path.join(save_path, "requirements.txt")
with open(requirements_file, "w") as f:
    f.write(f"transformers=={transformers.__version__}\n")
    f.write(f"tensorflow=={tf.__version__}\n")
    # Add other dependencies here (if any)

print(f"Version information saved to: {requirements_file}")
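
A saved `training_args.json` is only useful if it can be read back in a later session, for example to recover `max_seq_length` before tokenizing new text. The sketch below shows the round trip with a small subset of the settings. A temporary directory stands in for the Drive folder, and the values are the defaults used in this notebook.

```python
import json
import os
import tempfile

# A subset of the settings, with the notebook's default values.
training_args = {
    "model_name": "digitalepidemiologylab/covid-twitter-bert",
    "max_seq_length": 96,
    "learning_rate": 2e-5,
    "num_epochs": 1,
}

# The temporary directory is a stand-in for the Drive save path.
with tempfile.TemporaryDirectory() as save_path:
    args_file = os.path.join(save_path, "training_args.json")

    with open(args_file, "w") as f:
        json.dump(training_args, f, indent=4)

    # In a new session only this read step would be needed.
    with open(args_file) as f:
        restored = json.load(f)

print(restored == training_args)   # True
print(restored["max_seq_length"])  # 96
```

Plain numbers, strings and lists survive the round trip unchanged, which is why the settings are stored as simple values rather than as live objects.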

In [22]:
model.save_pretrained('/content/drive/MyDrive/huggingface_model/')

In [1]:
tokenizer.save_pretrained('/content/drive/MyDrive/huggingface_model/')

NameError: name 'tokenizer' is not defined