# **Bidirectional Encoder Representations from Transformers**

This notebook implements Bidirectional Encoder Representations from Transformers(BERT), one of famous Transformer architectures, to create a text classification model as a simple baseline for T-Brain competition.

## BERT

In [1]:
%%capture
#!pip install transformers

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification

PRETRAINED_MODEL_NAME = "bert-base-chinese"
NUM_LABELS = 2
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

## T-Brain Simple Baseline - BERT

### Preparing

> First, please download the [dataset](https://gitlab.com/kvnkuol/bestline/-/archive/jc/bestline-jc.zip?path=JC/data) as a .zip file from GitLab prepared by J.C. as well as the official data from T-Brain. The following code will extract the .zip file and assign the path of data.

In [3]:
%%capture
#!unzip -o bestline-jc-JC-data.zip

RAW_DATA_PATH = 'JC_tbrain_train_final_0701.csv'    # T-Brain
DATA_PATH = 'bestline-jc-JC-data/JC/data'           # J.C.

### Preprocessing

> Loading data with `codecs` and `BeautifulSoup`.

In [4]:
import os
import pandas as pd
import codecs
from bs4 import BeautifulSoup

In [5]:
def load_data(crawled_data_path, original_data_path):
    
    raw_df = pd.read_csv(original_data_path) # Data provided by T-Brain

    news = []   # News crawled by J.C.
    labels = [] # Labels, AML related or not.

    for file in sorted(os.listdir(crawled_data_path)):
        # Get labels. Hint: Empty 'name' contains two characters '[]'.
        news_ID = int(file.split('_')[0])
        if len(raw_df.loc[news_ID-1, 'name']) > 2:
            labels.append(1)
        else:
            labels.append(0)

        # Get news content.
        f = codecs.open(DATA_PATH + '/' + file, 'r', 'utf-8')
        content = BeautifulSoup(f.read()).get_text()
        news.append(content)

    return news, labels

In [6]:
news, labels = load_data(crawled_data_path=DATA_PATH, original_data_path=RAW_DATA_PATH)

In [7]:
print(news[0])

量化交易追求絕對報酬有效對抗牛熊市近年來投資市場波動越來越明顯，追求低波動、絕對報酬的量化交易備受注目。專家表示，採用量化交易策略投資台股，不管是處於多頭或是空頭市場，績效及波動度均可領跑大盤，甚至比國內投資台股的股票型基金及ETF的波動率還低，表現也更為穩定。大數據時代來臨，風行歐美50年的量化交易儼然成為顯學，台灣亦開始重視此一趨勢發展，也因此，中華機率統計學會及台北科技大學管理學院攜手主辦，並由元大期貨、摩根亞太量化交易等公司擔任協辦單位，今(7/5)日舉辦「時間序列與量化交易研討會」，就目前熱門的量化交易、智能投資等相關議題進行研討。越來越多的基金公司重視量化交易，全球規模較大的避險基金多採行量化交易，包括橋水基金(BridgewaterAssociates)、AQR資產管理公司、曼氏集團(ManGroup)、文藝復興科技(RenaissanceTechnologies)等全球知名避險基金。摩根亞太集團董事長張堯勇指出，避險基金規模約為5兆美元，採取量化交易的基金規模約1兆美元，比重佔了20%，代表量化交易的操作績效好，才會有那麼高的比重。量化交易的操作績效不亞於價值投資及技術投資，被譽為數學天才、最賺錢的基金經理人，文藝復興對沖基金創始人詹姆斯·西蒙斯(JamesSimons)所管理的大獎章(Medallion)基金，便是典型的量化交易，績效表現優異，不僅勝過索羅斯的量子基金，也打敗了股神巴菲特的價值投資。近年來台灣也逐漸重視量化交易，摩根亞太量化交易公司今年開始將量化交易引進台股投資，是國內首家推出量化交易策略的公司，初期對象鎖定法人機構，進行私募投資。張堯勇表示，數學不只是一門學科，更是一項扭轉乾坤、轉敗為勝的競爭利器，量化交易就是將數學運用在股市投資，透過複雜、精密的推理計算，打敗股市。他指出，目前全球利率水平50位於年來新低，美國十年期公債殖利率只有2%，日本及一些歐洲國家甚至是負利率。歷史極低利率帶導致無風險及低風險工具的投資報酬率太低（例如：銀行存款、政府公債及投資等級公司債），無法對抗通貨膨脹及支付負債，所以資產配置必須增加較高風險的投資，如股票、高收益公司債或新興市場債券，但是這些投資工具波動很大，而且很容易產生虧損！他表示，長期來說，股市絕對是好的投資，報酬率也不差，但是如何選股是一門學問。目前全球60個主要股市交易所，總市值70兆美元，

#### Data Preprocessing

In [8]:
from torch import nn
import torch
from torch.utils import data
from gensim.models import Word2Vec

In [9]:
# Creating funcions, '__init__', '__getitem__' and '__len__', for dataloader.
class TBrainDataset(data.Dataset):
    """
    Expected data shape like:(data_num, data_len)
    Data can be a list of numpy array or a list of lists
    input data shape : (data_num, seq_len, feature_dim)
    
    __len__ will return the number of data
    """
    def __init__(self, X, y, tokenizer):
        self.data = X
        self.label = y
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        if self.label is None: 
            article = self.data[idx]
            label_tensor = None
        else:
            article = self.data[idx]
            label_tensor = torch.tensor(self.label[idx])

        # 建立句子的 BERT tokens
        word_pieces = ["[CLS]"]
        tokens_article = self.tokenizer.tokenize(article)
        word_pieces += tokens_article
        len_article = len(word_pieces)
        
        # 將整個 token 序列轉換成索引序列
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids)

        return (tokens_tensor, label_tensor)

    def __len__(self):
        return len(self.data)


"""
# Data augmentation
def augment(x, y, n):
    # n = Number of copies.
    for i, label in enumerate(y):
        if label == 1:
            for j in range(n):
                x = torch.cat([x, x[i].unsqueeze(0)], dim=0)
                y = torch.cat([y, y[i].unsqueeze(0)], dim=0)
    return x, y
"""

'\n# Data augmentation\ndef augment(x, y, n):\n    # n = Number of copies.\n    for i, label in enumerate(y):\n        if label == 1:\n            for j in range(n):\n                x = torch.cat([x, x[i].unsqueeze(0)], dim=0)\n                y = torch.cat([y, y[i].unsqueeze(0)], dim=0)\n    return x, y\n'

In [10]:
from torch.nn.utils.rnn import pad_sequence

def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    
    # 測試集有 labels
    if samples[0][1] is not None:
        label_ids = torch.stack([s[1] for s in samples])
    else:
        label_ids = None
    
    # zero pad 到同一序列長度
    tokens_tensors = pad_sequence(tokens_tensors, 
                                  batch_first=True)
    
    # attention masks，將 tokens_tensors 裡頭不為 zero padding
    # 的位置設為 1 讓 BERT 只關注這些位置的 tokens
    masks_tensors = torch.zeros(tokens_tensors.shape, 
                                dtype=torch.long)
    masks_tensors = masks_tensors.masked_fill(
        tokens_tensors != 0, 1)
    
    return tokens_tensors, masks_tensors, label_ids

In [11]:
%%capture
# Ref: HW4 in the course ML2020 by Hung-yi Lee in NTU.

# Hyper-parameters for creating datatloader
batch_size = 1    #128
article_len = 500   # Length of article

# Preprocessing
train_x = news
y = labels

# Deviding data to training data and validation data
X_train, X_val, X_test, y_train, y_val, y_test = train_x[:3000], train_x[3000:4000], train_x[4000:], y[:3000], y[3000:4000], y[4000:]
#X_train, y_train = augment(X_train, y_train, 4)

# Create dataset for dataloader.
train_dataset = TBrainDataset(X=X_train, y=y_train, tokenizer=tokenizer)
val_dataset = TBrainDataset(X=X_val, y=y_val, tokenizer=tokenizer)
test_dataset = TBrainDataset(X=X_test, y=y_test, tokenizer=tokenizer)

# Transforming dataset to batch of tensors
train_loader = torch.utils.data.DataLoader(dataset = train_dataset, batch_size = batch_size, collate_fn=create_mini_batch, shuffle = True, num_workers = 8)
val_loader = torch.utils.data.DataLoader(dataset = val_dataset, batch_size = batch_size, collate_fn=create_mini_batch, shuffle = False, num_workers = 8)

In [12]:
# Get information from dataset
train_aml=0
for i in train_dataset.label:
    if i == 1:
        train_aml+=1

print("Number of training data: {}".format(len(train_dataset.label)))
print("Number of AML related news: {}".format(train_aml))
print("Percent of AML related news: {:.2f}%".format((train_aml)/len(train_dataset.label)*100))

val_aml=0
for i in val_dataset.label:
    if i == 1:
        val_aml+=1

print("\nNumber of validation data: {}".format(len(val_dataset.label)))
print("Number of AML related news: {}".format(val_aml))
print("Percent of AML related news: {:.2f}%".format((val_aml)/len(val_dataset.label)*100))

test_aml=0
for i in test_dataset.label:
    if i == 1:
        test_aml+=1

print("\nNumber of Testing data: {}".format(len(test_dataset.label)))
print("Number of AML related news: {}".format(test_aml))
print("Percent of AML related news: {:.2f}%".format((test_aml)/len(test_dataset.label)*100))

Number of training data: 3000
Number of AML related news: 211
Percent of AML related news: 7.03%

Number of validation data: 1000
Number of AML related news: 65
Percent of AML related news: 6.50%

Number of Testing data: 647
Number of AML related news: 52
Percent of AML related news: 8.04%


### Training

In [13]:
# Ref: HW4 in the course ML2020 by Hung-yi Lee in NTU.
import os
import torch
import argparse
import numpy as np
from torch import nn
import torch.optim as optim
import torch.nn.functional as F
from gensim.models import word2vec
from sklearn.model_selection import train_test_split

In [14]:
# Ref: HW4 in the course ML2020 by Hung-yi Lee in NTU.

# train.py
# 這個 block 是用來訓練模型的

def evaluation(outputs, labels):
    # outputs => probability (float)
    # labels => labels
    outputs[outputs>=0.5] = 1 # 大於等於 0.5 為正面
    outputs[outputs<0.5] = 0 # 小於 0.5 為負面

    # Confusion matrix
    tn, fp = 0, 0   # True Negtive, False Positive
    fn, tp = 0, 0   # False Negtive, True Positive
    for i in range(len(outputs)):
        if outputs[i]==1:
            if outputs[i].item() == labels[i].item():
                tp += 1
            else:
                fp += 1
        else:
            if outputs[i].item() == labels[i].item():
                tn += 1
            else:
                fn += 1

    # Precision
    if tp+fp == 0:
        prec = 0
    else:
        prec = tp/(tp+fp)

    # Recall
    if tp+fn == 0:
        rec = 0
    else:
        rec = tp/(tp+fn)
    
    # F1 Score
    if prec+rec == 0:
        f1 = 0
    else:
        f1 = 2 * (prec*rec/(prec+rec))

    # Number of correct predictions
    correct = torch.sum(torch.eq(outputs, labels)).item()

    return correct, f1

def training(batch_size, n_epoch, lr, model_dir, train, valid, model, device):

    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))

    model.train() # 將 model 的模式設為 train，這樣 optimizer 就可以更新 model 的參數
    
    criterion = nn.BCELoss() # 定義損失函數，這裡我們使用 binary cross entropy loss
    t_batch = len(train) 
    v_batch = len(valid) 
    optimizer = optim.Adam(model.parameters(), lr=lr) # 將模型的參數給 optimizer，並給予適當的 learning rate

    total_loss, total_acc, best_f1 = 0, 0, 0
    train_loss, train_f1, val_loss, val_f1 = [], [], [], []

    for epoch in range(n_epoch):
        total_loss, total_acc, total_f1 = 0, 0, 0

        # Training
        for i, (tokens_tensors, masks_tensors, label_ids) in enumerate(train):
            tokens_tensors = tokens_tensors.to(device, dtype=torch.long)
            masks_tensors = masks_tensors.to(device, dtype=torch.long)
            label_ids = label_ids.to(device, dtype=torch.float)

            #inputs = inputs.to(device, dtype=torch.long) # device 為 "cuda"，將 inputs 轉成 torch.cuda.LongTensor
            #labels = labels.to(device, dtype=torch.float) # device為 "cuda"，將 labels 轉成 torch.cuda.FloatTensor，因為等等要餵進 criterion，所以型態要是 float
            optimizer.zero_grad() # 由於 loss.backward() 的 gradient 會累加，所以每次餵完一個 batch 後需要歸零
            out = model(input_ids=tokens_tensors, labels=label_ids) # 將 input 餵給模型
            #outputs = outputs.squeeze() # 去掉最外面的 dimension，好讓 outputs 可以餵進 criterion()
            loss, outputs = out[:2]

            #loss = criterion(outputs, labels) # 計算此時模型的 training loss
            loss.backward() # 算 loss 的 gradient
            optimizer.step() # 更新訓練模型的參數

            correct, f1 = evaluation(outputs, labels) # 計算此時模型的 training accuracy

            
            total_acc += (correct / batch_size)
            total_loss += loss.item()
            total_f1 += f1

            #print('[ Epoch{}: {}/{} ] loss:{:.3f} acc:{:.3f} '.format(epoch+1, i+1, t_batch, loss.item(), correct*100/batch_size), end='\r')
        
        train_loss.append(total_loss/t_batch)
        train_f1.append(total_f1/t_batch)
        print('\nEpoch: {}'.format(epoch+1))
        print('Train | Loss:{:.5f} Acc: {:.3f} F1 Score: {:.3f}'.format(total_loss/t_batch, total_acc/t_batch*100, total_f1/t_batch))

        # Validation
        model.eval() # 將 model 的模式設為 eval，這樣 model 的參數就會固定住
        with torch.no_grad():
            total_loss, total_acc, total_f1 = 0, 0, 0
            for i, (tokens_tensors, masks_tensors, label_ids) in enumerate(valid):
                #inputs = inputs.to(device, dtype=torch.long) # device 為 "cuda"，將 inputs 轉成 torch.cuda.LongTensor
                #labels = labels.to(device, dtype=torch.float) # device 為 "cuda"，將 labels 轉成 torch.cuda.FloatTensor，因為等等要餵進 criterion，所以型態要是 float
                tokens_tensors = tokens_tensors.to(device, dtype=torch.long)
                masks_tensors = masks_tensors.to(device, dtype=torch.long)
                label_ids = label_ids.to(device, dtype=torch.float)
                
                outputs = model(input_ids=tokens_tensors, attention_mask=masks_tensors, labels=label_ids) # 將 input 餵給模型
                #outputs = outputs.squeeze() # 去掉最外面的 dimension，好讓 outputs 可以餵進 criterion()
                loss = criterion(outputs, labels) # 計算此時模型的 validation loss
                correct, f1 = evaluation(outputs, labels) # 計算此時模型的 validation accuracy

                total_acc += (correct / batch_size)
                total_loss += loss.item()
                total_f1 += f1

            val_loss.append(total_loss/v_batch)
            val_f1.append(total_f1/v_batch)
            print("Valid | Loss:{:.5f} Acc: {:.3f} F1 Score: {:.3f}".format(total_loss/v_batch, total_acc/v_batch*100, total_f1/v_batch))

            # 如果 validation 的結果優於之前所有的結果，就把當下的模型存下來以備之後做預測時使用
            if total_f1 > best_f1:
                best_f1 = total_f1
                #torch.save(model, "{}/val_acc_{:.3f}.model".format(model_dir,total_acc/v_batch*100))
                torch.save(model, "{}/ckpt.model".format(model_dir))
                print('saving model with f1 {:.3f}'.format(total_f1/v_batch))

        print('-----------------------------------------------')
        model.train() # 將 model 的模式設為 train，這樣 optimizer 就可以更新 model 的參數（因為剛剛轉成 eval 模式）

    return train_loss, train_f1, val_loss, val_f1

In [15]:
# Ref: HW4 in the course ML2020 by Hung-yi Lee in NTU.

# Checking if gpu is availible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Defining hyper-parameters
epoch = 50
lr = 0.001

fix_embedding = True # Fixing embedding during training
model_dir = os.getcwd() # Model directory for checkpoint model

# Creating model
NUM_LABELS
model = BertForSequenceClassification.from_pretrained(PRETRAINED_MODEL_NAME) #, num_labels=3)
#model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=150, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
model = model.to(device)

# Starting training
#train_loss, train_f1, val_loss, val_f1 = training(batch_size, epoch, lr, model_dir, train_loader, val_loader, model, device)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
inputs = tokenizer(news[0], return_tensors="pt")
labels = torch.tensor[1]

In [16]:
train_loss, train_f1, val_loss, val_f1 = training(batch_size, epoch, lr, model_dir, train_loader, val_loader, model, device)


start training, parameter total:102269186, trainable:102269186



RuntimeError: ignored

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.title("Training")
plt.plot(train_loss, label='train_loss')
plt.plot(train_f1, label='train_f1')
plt.legend(loc='best')
plt.show()

In [None]:
plt.title("Validation")
plt.plot(val_loss, label='val_loss')
plt.plot(val_f1, label='val_f1')
plt.legend(loc='best')
plt.show()

### Testing

In [None]:
test_loader = torch.utils.data.DataLoader(dataset = test_dataset, batch_size = len(test_dataset), shuffle = False, num_workers = 8)
test_size = len(test_dataset)
test_model = torch.load("{}/ckpt.model".format(model_dir))

In [None]:
print(test_size)

In [None]:
# Testing
test_model.eval() # 將 model 的模式設為 eval，這樣 model 的參數就會固定住
criterion = nn.BCELoss() # 定義損失函數，這裡我們使用 binary cross entropy loss
with torch.no_grad():
    total_loss, total_acc, total_f1 = 0, 0, 0
    for i, (inputs, labels) in enumerate(test_loader):
        inputs = inputs.to(device, dtype=torch.long) # device 為 "cuda"，將 inputs 轉成 torch.cuda.LongTensor
        labels = labels.to(device, dtype=torch.float) # device 為 "cuda"，將 labels 轉成 torch.cuda.FloatTensor，因為等等要餵進 criterion，所以型態要是 float
        outputs = test_model(inputs) # 將 input 餵給模型

        #outputs = outputs.squeeze() # 去掉最外面的 dimension，好讓 outputs 可以餵進 criterion()
        loss = criterion(outputs, labels) # 計算此時模型的 testing loss
        correct, f1 = evaluation(outputs, labels) # 計算此時模型的 testing accuracy

        if correct==1:
            print(loss, labels)

        total_acc += (correct / test_size)
        total_loss += loss.item()
        total_f1 += f1

    val_loss.append(total_loss/test_size)
    val_f1.append(total_f1/test_size)
    print("Valid | Loss:{:.5f} Acc: {:.3f}% F1 Score: {:.3f}".format(total_loss/test_size, total_acc/test_size*100, total_f1/test_size))

## Acknowledge

## References

### Courses
* [Machine Learning (2020,Spring)](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html) by Hung-yi Lee in NTU.

### Blogs
* [進入 NLP 世界的最佳橋樑：寫給所有人的自然語言處理與深度學習入門指南](https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html) by LeeMeng