## Using BERT for Like Prediction with Text Only Input

## Data Preprocessing

### Read File

In [1]:
import pandas as pd
import numpy as np
train = pd.read_csv('../../raw_data/intern_homework_train_dataset.csv')
test = pd.read_csv('../../raw_data/intern_homework_public_test_dataset.csv')
train.head()

Unnamed: 0,title,created_at,like_count_1h,like_count_2h,like_count_3h,like_count_4h,like_count_5h,like_count_6h,comment_count_1h,comment_count_2h,comment_count_3h,comment_count_4h,comment_count_5h,comment_count_6h,forum_id,author_id,forum_stats,like_count_24h
0,我的排骨湯,2022-10-05 14:20:21 UTC,12,15,15,15,16,18,10,10,10,10,10,10,598518,428921,0.7,26
1,#請益 婚禮穿搭,2022-10-05 14:28:13 UTC,0,0,3,4,4,4,2,5,8,9,9,9,399302,650840,63.9,11
2,無謂的啦啦隊,2022-10-06 07:18:22 UTC,3,7,8,11,12,14,1,1,2,3,3,3,650776,717288,19.2,19
3,文學理論 課本,2022-09-20 11:39:14 UTC,2,7,11,24,26,26,2,2,8,32,38,63,471023,173889,7.9,29
4,一般課程,2022-09-05 10:18:24 UTC,3,7,7,10,10,11,15,26,35,38,48,49,230184,594332,36.2,16


In [2]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(train, random_state=777, train_size=0.9)
len(train), len(valid)

(45000, 5000)

### Feature Selection

Select the label

In [3]:
train_label = train['like_count_24h']
valid_label = valid['like_count_24h']
test_label = test['like_count_24h']
train_label

35094     15
13209      8
10950     15
40900      6
27226    266
        ... 
26695    224
36785     10
40535     11
15931     10
47919     34
Name: like_count_24h, Length: 45000, dtype: int64

In [4]:
train_label = train_label.tolist()
valid_label = valid_label.tolist()
test_label = test_label.tolist()
train_label[0]

15

In [5]:
train_label = list(map(float, train_label))
valid_label = list(map(float, valid_label))
test_label = list(map(float, test_label))
train_label[0]

15.0

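Because the BERT model is loaded with `num_labels=1` and therefore trained as a regressor, the labels are cast to float rather than kept as integers.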
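To make the float cast concrete: Hugging Face sequence-classification models loaded with `num_labels=1` are trained with a mean squared error loss, which needs float targets. Below is a minimal pure-Python sketch of that loss; `mse_loss` and the sample values are illustrative and not part of the notebook.

```python
def mse_loss(predictions, labels):
    """Mean squared error between two equal-length lists of floats."""
    assert len(predictions) == len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical model outputs next to float labels like the ones produced above
sample_labels = [15.0, 8.0]
sample_predictions = [14.0, 9.5]
sample_loss = mse_loss(sample_predictions, sample_labels)
print(sample_loss)  # (1.0**2 + 1.5**2) / 2 = 1.625
```

Because the error is squared, a handful of posts with very large like counts can dominate this loss.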

Select the input features

The author ID, forum ID, and forum stats are dropped: I believe BERT cannot make sense of these features, so they would only add noise to the model's predictions.

In [6]:
# List the column names to remove, then delete them with drop()
drop_columns = ['author_id', 'like_count_24h', 'forum_id', 'forum_stats']

train_input = train.drop(drop_columns, axis=1)
valid_input = valid.drop(drop_columns, axis=1)
test_input = test.drop(drop_columns, axis=1)

### Data Transformations

Process the `created_at` feature

In [7]:
# Split the post's publication time into weekday and hour
def split_date(df, date_column):

    # Parse the date column as timezone-aware (UTC) datetimes
    df[date_column] = pd.to_datetime(df[date_column], utc=True)
    
    # Add weekday (0 = Monday) and hour columns
    df['weekday'] = df[date_column].dt.weekday
    df['hour'] = df[date_column].dt.hour

    # Drop the original date column
    df = df.drop(date_column, axis=1)

    # Return the processed dataframe
    return df

In [8]:
train_input = split_date(train_input, 'created_at')
valid_input = split_date(valid_input, 'created_at')
test_input = split_date(test_input, 'created_at')

# Show the processed dataset
train_input.head()

Unnamed: 0,title,like_count_1h,like_count_2h,like_count_3h,like_count_4h,like_count_5h,like_count_6h,comment_count_1h,comment_count_2h,comment_count_3h,comment_count_4h,comment_count_5h,comment_count_6h,weekday,hour
35094,111台聯A3台綜A10轉學考上榜心得,1,5,7,8,9,10,2,2,6,7,7,7,5,12
13209,［世界志工社｜第二次招生說明會預告］,7,7,7,7,7,7,21,22,22,22,22,22,6,13
10950,「攜手」-我們在身份轉換中的震盪,3,3,3,3,3,4,5,5,5,5,5,5,5,12
40900,怎麼抓住小一女生的心,0,0,0,0,0,0,1,4,4,4,4,5,1,12
27226,楊丞琳以前家境清寒吃不起海鮮,2,3,5,5,5,5,0,0,1,1,1,1,1,16
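The weekday and hour above are computed in UTC. As a rough illustration of what the extraction does for a single timestamp, here is a standard-library sketch; `weekday_and_hour` and its `utc_offset_hours` parameter are hypothetical helpers, and the UTC+8 call only shows how the hour would shift if one assumed readers are in Taiwan, which the notebook itself does not do.

```python
from datetime import datetime, timedelta, timezone

def weekday_and_hour(created_at, utc_offset_hours=0):
    """Return (weekday, hour) for a 'YYYY-MM-DD HH:MM:SS UTC' string.

    weekday follows the pandas convention: 0 = Monday ... 6 = Sunday.
    """
    parsed = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S UTC")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Shift to a fixed-offset zone; 0 keeps the value in UTC
    local = parsed.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.weekday(), local.hour

# First row of the raw training file shown earlier
utc_parts = weekday_and_hour("2022-10-05 14:20:21 UTC")
# Assumption for illustration only: interpret the same instant at UTC+8
taipei_parts = weekday_and_hour("2022-10-05 14:20:21 UTC", utc_offset_hours=8)
print(utc_parts, taipei_parts)
```

For posts published late in the UTC day, the local weekday can also differ from the UTC weekday.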


In [9]:
train_input.columns

Index(['title', 'like_count_1h', 'like_count_2h', 'like_count_3h',
       'like_count_4h', 'like_count_5h', 'like_count_6h', 'comment_count_1h',
       'comment_count_2h', 'comment_count_3h', 'comment_count_4h',
       'comment_count_5h', 'comment_count_6h', 'weekday', 'hour'],
      dtype='object')

Define a custom dict that maps `weekday` to the Chinese name of the day

In [10]:
weekday_dict = {
    0: '星期一',
    1: '星期二',
    2: '星期三',
    3: '星期四',
    4: '星期五',
    5: '星期六',
    6: '星期日'
}

# Map weekday to the Chinese name of the day
train_input['weekday'] = train_input['weekday'].map(weekday_dict)
valid_input['weekday'] = valid_input['weekday'].map(weekday_dict)
test_input['weekday'] = test_input['weekday'].map(weekday_dict)

Merge the numeric and text inputs into a single passage to use as the BERT input

In [11]:
# Combine the features of one dataframe row into a single passage of text
def transfrom_to_text(df):

    passage = ""

    # combine title and post time
    passage += "這篇文章標題是 {} ，在{}{}點發佈。".format(df["title"], df["weekday"], df["hour"])

    # combine the like counts
    passage += "文章在發佈後的第一小時累積到的愛心數有 {}，在第二小時累積到的愛心數有 {}，".format(df["like_count_1h"], df["like_count_2h"])
    passage += "第三小時累積到的愛心數有 {}，在第四小時累積到的愛心數有 {}，".format(df["like_count_3h"], df["like_count_4h"])
    passage += "第五小時累積到的愛心數有 {}，在第六小時累積到的愛心數有 {}。".format(df["like_count_5h"], df["like_count_6h"])

    # combine the comment counts
    passage += "文章在發佈後的第一小時累積到的留言數有 {}，在第二小時累積到的留言數有 {}，".format(df["comment_count_1h"], df["comment_count_2h"])
    passage += "第三小時累積到的留言數有 {}，在第四小時累積到的留言數有 {}，".format(df["comment_count_3h"], df["comment_count_4h"])
    passage += "第五小時累積到的留言數有 {}，在第六小時累積到的留言數有 {}。".format(df["comment_count_5h"], df["comment_count_6h"])

    
    return passage

In [12]:
train_text = train_input.apply(transfrom_to_text, axis=1).tolist()
valid_text = valid_input.apply(transfrom_to_text, axis=1).tolist()
test_text = test_input.apply(transfrom_to_text, axis=1).tolist()
train_text[0]

'這篇文章標題是 111台聯A3台綜A10轉學考上榜心得 ，在星期六12點發佈。文章在發佈後的第一小時累積到的愛心數有 1，在第二小時累積到的愛心數有 5，第三小時累積到的愛心數有 7，在第四小時累積到的愛心數有 8，第五小時累積到的愛心數有 9，在第六小時累積到的愛心數有 10。文章在發佈後的第一小時累積到的留言數有 2，在第二小時累積到的留言數有 2，第三小時累積到的留言數有 6，在第四小時累積到的留言數有 7，第五小時累積到的留言數有 7，在第六小時累積到的留言數有 7。'

In [13]:
print("Train data: {} rows".format(len(train_text)))
print("Valid data: {} rows".format(len(valid_text)))
print("Test data: {} rows".format(len(test_text)))

Train data: 45000 rows
Valid data: 5000 rows
Test data: 10000 rows


### Tokenization

In [14]:
from transformers import BertTokenizer
import torch

In [15]:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

In [16]:
train_encodings = tokenizer(train_text, truncation=True, padding=True)
valid_encodings = tokenizer(valid_text, truncation=True, padding=True)
test_encodings = tokenizer(test_text, truncation=True, padding=True)
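In these calls, `padding=True` pads every sequence to the longest one in the call and `truncation=True` cuts sequences at the maximum length. The sketch below imitates that behaviour on toy token ids in pure Python; `pad_and_truncate`, the toy ids, and `pad_id=0` are illustrative assumptions, and unlike the real tokenizer this simplified version does not keep the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (made-up ids)
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
encoded = pad_and_truncate(toy_batch, max_length=6)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Since each split is tokenized in a single call, every example in a split is padded to that split's longest passage.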

In [17]:
def add_label(encodings, label):
  encodings.update({'labels': label})

add_label(train_encodings, train_label)
add_label(valid_encodings, valid_label)
add_label(test_encodings, test_label)

In [18]:
train_encodings['labels'][0]

15.0

In [19]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

## Fine-tuning

In [20]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, default_data_collator
import torch

model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=1)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [21]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = Dataset(train_encodings)
valid_dataset = Dataset(valid_encodings)
test_dataset = Dataset(test_encodings)

In [22]:
batch_size = 16
args = TrainingArguments(
    output_dir="./results",
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    save_total_limit=1,
    # With load_best_model_at_end=False, metric_for_best_model is not used to
    # pick a checkpoint; selecting by MAPE (lower is better) would also need
    # greater_is_better=False
    load_best_model_at_end=False,
    metric_for_best_model="MAPE",
    weight_decay=0.01,
    eval_accumulation_steps=1,
)

In [23]:
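import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def compute_metrics(p):
    predictions, labels = p

    # Flatten the (N, 1) regression outputs so they line up with the (N,) labels
    predictions = np.asarray(predictions).reshape(-1)

    # Evaluation metric: mean absolute percentage error
    MAPE = mean_absolute_percentage_error(labels, predictions)

    result = {'MAPE': MAPE}

    return result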
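For reference, MAPE averages the absolute error divided by the absolute label. The pure-Python sketch below follows that definition, including the small epsilon scikit-learn uses to avoid dividing by zero; the function name `mape` and the toy values are illustrative only.

```python
import sys

def mape(labels, predictions):
    """Mean absolute percentage error, as a fraction (0.15 means 15%)."""
    eps = sys.float_info.epsilon  # guards against division by zero
    ratios = [abs(y - p) / max(abs(y), eps) for y, p in zip(labels, predictions)]
    return sum(ratios) / len(ratios)

toy_labels = [10.0, 20.0, 100.0]
toy_predictions = [12.0, 15.0, 100.0]
toy_mape = mape(toy_labels, toy_predictions)
print(toy_mape)  # (0.2 + 0.25 + 0.0) / 3

# A label of 0 makes the ratio explode, so posts with zero likes dominate the mean
zero_label_mape = mape([0.0], [1.0])
```

The metric is relative, so missing a 10-like post by 2 likes costs as much as missing a 100-like post by 20.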

In [24]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [25]:
trainer.train()

***** Running training *****
  Num examples = 45000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 4221
  Number of trainable parameters = 102268417
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: hankystyle. Use `wandb login --relogin` to force relogin

***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1407
Configuration saved in ./results/checkpoint-1407/config.json
Model weights saved in ./results/checkpoint-1407/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1407/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1407/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-2814
Configuration saved in ./results/checkpoint-2814/config.json
Model weights saved in ./results/checkpoint-2814/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-2814/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2814/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-4221
Configuration sa

TrainOutput(global_step=4221, training_loss=32057.68372423596, metrics={'train_runtime': 2374.3914, 'train_samples_per_second': 56.857, 'train_steps_per_second': 1.778, 'total_flos': 2.98309758363e+16, 'train_loss': 32057.68372423596, 'epoch': 3.0})
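The step counts in the log can be reproduced with a little arithmetic. The sketch below assumes the total train batch size of 32 comes from the per-device batch size of 16 running on two devices, which the log suggests but does not state.

```python
import math

num_examples = 45000
total_train_batch_size = 32  # as reported in the training log
num_epochs = 3

# The last, partially filled batch of an epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
```

These values line up with `Total optimization steps = 4221` and the per-epoch checkpoints `checkpoint-1407`, `checkpoint-2814`, and `checkpoint-4221`.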

## Evaluation

In [26]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32


{'eval_loss': 38338.796875,
 'eval_MAPE': 0.6166888475418091,
 'eval_runtime': 19.6004,
 'eval_samples_per_second': 255.096,
 'eval_steps_per_second': 8.01,
 'epoch': 3.0}
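The loss values are easier to read as a root mean squared error, since a model loaded with `num_labels=1` is trained with a mean squared error loss on the raw like counts. A short sketch using the numbers reported above:

```python
import math

eval_loss = 38338.796875        # from trainer.evaluate()
train_loss = 32057.68372423596  # from TrainOutput, averaged over training

# Square roots bring the losses back to the unit of the label (likes)
eval_rmse = math.sqrt(eval_loss)
train_rmse = math.sqrt(train_loss)

print(round(eval_rmse, 1), round(train_rmse, 1))
```

An RMSE of nearly 200 likes next to a MAPE of about 0.62 suggests that a small number of very popular posts account for much of the squared error.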

Save Model

In [27]:
trainer.save_model('saved_model/bert-base-dcard')

Saving model checkpoint to saved_model/bert-base-dcard
Configuration saved in saved_model/bert-base-dcard/config.json
Model weights saved in saved_model/bert-base-dcard/pytorch_model.bin
tokenizer config file saved in saved_model/bert-base-dcard/tokenizer_config.json
Special tokens file saved in saved_model/bert-base-dcard/special_tokens_map.json


## Prediction

In [1]:
import pandas as pd
import numpy as np
private_test = pd.read_csv('../../raw_data/intern_homework_private_test_dataset.csv')

In [2]:
# List the column names to remove, then delete them with drop()
drop_columns = ['author_id', 'forum_id', 'forum_stats']
private_test_input = private_test.drop(drop_columns, axis=1)

In [3]:
# Split the post's publication time into weekday and hour
def split_date(df, date_column):

    # Parse the date column as timezone-aware (UTC) datetimes
    df[date_column] = pd.to_datetime(df[date_column], utc=True)
    
    # Add weekday (0 = Monday) and hour columns
    df['weekday'] = df[date_column].dt.weekday
    df['hour'] = df[date_column].dt.hour

    # Drop the original date column
    df = df.drop(date_column, axis=1)

    # Return the processed dataframe
    return df

In [4]:
private_test_input = split_date(private_test_input, 'created_at')

weekday_dict = {
    0: '星期一',
    1: '星期二',
    2: '星期三',
    3: '星期四',
    4: '星期五',
    5: '星期六',
    6: '星期日'
}

# Map weekday to the Chinese name of the day
private_test_input['weekday'] = private_test_input['weekday'].map(weekday_dict)

In [5]:
# Combine the features of one dataframe row into a single passage of text
def transfrom_to_text(df):

    passage = ""

    # combine title and post time
    passage += "這篇文章標題是 {} ，在{}{}點發佈。".format(df["title"], df["weekday"], df["hour"])

    # combine the like counts
    passage += "文章在發佈後的第一小時累積到的愛心數有 {}，在第二小時累積到的愛心數有 {}，".format(df["like_count_1h"], df["like_count_2h"])
    passage += "第三小時累積到的愛心數有 {}，在第四小時累積到的愛心數有 {}，".format(df["like_count_3h"], df["like_count_4h"])
    passage += "第五小時累積到的愛心數有 {}，在第六小時累積到的愛心數有 {}。".format(df["like_count_5h"], df["like_count_6h"])

    # combine the comment counts
    passage += "文章在發佈後的第一小時累積到的留言數有 {}，在第二小時累積到的留言數有 {}，".format(df["comment_count_1h"], df["comment_count_2h"])
    passage += "第三小時累積到的留言數有 {}，在第四小時累積到的留言數有 {}，".format(df["comment_count_3h"], df["comment_count_4h"])
    passage += "第五小時累積到的留言數有 {}，在第六小時累積到的留言數有 {}。".format(df["comment_count_5h"], df["comment_count_6h"])

    
    return passage

In [6]:
private_test_text = private_test_input.apply(transfrom_to_text, axis=1).tolist()
private_test_text[0]

'這篇文章標題是 #心情 頂樓風好大 ，在星期四0點發佈。文章在發佈後的第一小時累積到的愛心數有 6，在第二小時累積到的愛心數有 9，第三小時累積到的愛心數有 10，在第四小時累積到的愛心數有 12，第五小時累積到的愛心數有 12，在第六小時累積到的愛心數有 16。文章在發佈後的第一小時累積到的留言數有 4，在第二小時累積到的留言數有 8，第三小時累積到的留言數有 9，在第四小時累積到的留言數有 11，第五小時累積到的留言數有 12，在第六小時累積到的留言數有 12。'

In [7]:
print("Prediction data: {} rows".format(len(private_test_text)))

Prediction data: 10000 rows


In [8]:
from transformers import BertTokenizer, AutoModelForSequenceClassification, Trainer, default_data_collator
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("saved_model/bert-base-dcard", num_labels=1)
private_test_encodings = tokenizer(private_test_text, truncation=True, padding=True)

In [9]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

private_test_dataset = Dataset(private_test_encodings)

In [10]:
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def compute_metrics(p):
    predictions, labels = p

    # Evaluation metric: mean absolute percentage error
    MAPE = mean_absolute_percentage_error(labels, predictions)

    result = {'MAPE': MAPE}

    return result

In [11]:
trainer = Trainer(
    model,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [13]:
predictions, labels, metrics = trainer.predict(private_test_dataset)

***** Running Prediction *****
  Num examples = 10000
  Batch size = 16


In [19]:
metrics

{'test_runtime': 42.2636,
 'test_samples_per_second': 236.61,
 'test_steps_per_second': 14.788}

In [16]:
# Flatten the (N, 1) prediction array, then truncate each value to an int
predictions = [int(p) for p in predictions.reshape(-1)]
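Note that `int()` truncates toward zero and leaves negative outputs negative. As a possible alternative, not what this notebook does, the sketch below rounds to the nearest integer and clips at zero, since a like count cannot be negative; `to_like_counts` and the sample values are hypothetical.

```python
def to_like_counts(raw_predictions):
    """Round each regression output to the nearest int and clip at zero."""
    return [max(0, round(float(p))) for p in raw_predictions]

raw = [2.9, -1.2, 15.4]                # made-up regression outputs
truncated = [int(p) for p in raw]      # behaviour of the conversion above
rounded = to_like_counts(raw)          # alternative post-processing

print(truncated, rounded)
```

Whether rounding helps depends on the metric; under MAPE the effect is largest for posts with small like counts.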

Save Prediction

In [18]:
df = pd.DataFrame(predictions, columns=["like_count_24h"])
df.to_csv("result(bert).csv", index=False)