## Assignment 1. Neural Text Classification
## CS310 Natural Language Processing

**Total points**: 50

You should roughly follow the structure of the notebook. Add additional cells if you feel they are needed.

You can (and you should) re-use the code from Lab 2. 

Make sure your code is readable and well-structured.

### 0. Import Necessary Libraries

In [1]:
import torch
import json

train_data_path = 'train.jsonl'
test_data_path = 'test.jsonl'

### 1. Data Processing

In [2]:
# read data
train_data = []
test_data = []

with open(train_data_path, 'r') as f:
    for line in f:
        record = json.loads(line)
        train_data.append((record['sentence'], record['label'][0]))

with open(test_data_path, 'r') as f:
    for line in f:
        record = json.loads(line)
        test_data.append((record['sentence'], record['label'][0]))
        
len(train_data), len(test_data)

(12677, 651)

In [3]:
print(train_data[0:3])

[('卖油条小刘说：我说', 0), ('保姆小张说：干啥子嘛？', 0), ('卖油条小刘说：你看你往星空看月朦胧，鸟朦胧', 1)]


#### Create Dataset

In [4]:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # Fetch one sample by index and return it as a (text, label) tuple
        return self.data[idx]

In [5]:
train_dataset = MyDataset(train_data)
test_dataset = MyDataset(test_data)

In [6]:
train_dataset[:3]

[('卖油条小刘说：我说', 0), ('保姆小张说：干啥子嘛？', 0), ('卖油条小刘说：你看你往星空看月朦胧，鸟朦胧', 1)]

#### Single Character as a Token, Discard Others 

In [7]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def chinese_tokenizer(text): # use unicode range \u4e00 - \u9fa5
    return [char for char in text if '\u4e00' <= char <= '\u9fa5'] 

def yield_tokens(data_iter):
    for text, _ in data_iter:
        tokens = chinese_tokenizer(text)  # tokenize the sentence
        yield tokens

In [8]:
count = 0
print(train_dataset[0])
print(train_dataset[1])

for tokens in yield_tokens(train_dataset): # Use a new iterator
    print(tokens)
    count += 1
    if count > 7:
        break

('卖油条小刘说：我说', 0)
('保姆小张说：干啥子嘛？', 0)
['卖', '油', '条', '小', '刘', '说', '我', '说']
['保', '姆', '小', '张', '说', '干', '啥', '子', '嘛']
['卖', '油', '条', '小', '刘', '说', '你', '看', '你', '往', '星', '空', '看', '月', '朦', '胧', '鸟', '朦', '胧']
['卖', '油', '条', '小', '刘', '说', '咱', '是', '不', '是', '歇', '一', '下', '这', '双', '疲', '惫', '的', '双', '腿']
['卖', '油', '条', '小', '刘', '说', '快', '把', '我', '累', '死', '了']
['卖', '油', '条', '小', '刘', '说', '我', '说', '亲', '爱', '的', '大', '姐', '你', '贵', '姓', '啊']
['保', '姆', '小', '张', '说', '我', '免', '贵', '姓', '张', '我', '叫', '张', '凤', '姑']
['卖', '油', '条', '小', '刘', '说', '凤', '姑']


#### Tokenizer with Chinese/English/Number/Punctuation

In [9]:
import re

def improved_chinese_tokenizer(text):
    tokens = []

    digit = re.compile(r'\d+')
    english = re.compile(r'[a-zA-Z]+')
    punctuation = re.compile(r'[。？，！：；“”]')  # inside [...] a '|' is a literal character, so it is not used as a separator

    # Chinese characters
    chinese_tokens = [char for char in text if '\u4e00' <= char <= '\u9fa5']
    tokens.extend(chinese_tokens)

    for match in digit.finditer(text):
        tokens.append(match.group())

    for match in english.finditer(text):
        tokens.append(match.group())

    for match in punctuation.finditer(text):
        tokens.append(match.group())

    return tokens

# Example usage:
text = "卖油条小刘说：我说123块钱 你好！this is an example of text.哇。"
tokens = improved_chinese_tokenizer(text)
print(tokens)

['卖', '油', '条', '小', '刘', '说', '我', '说', '块', '钱', '你', '好', '哇', '123', 'this', 'is', 'an', 'example', 'of', 'text', '：', '！', '。']
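
Note that the tokenizer above returns tokens grouped by type (Chinese characters first, then digits, English words, and punctuation), so the original order is lost. That does not matter for the bag-of-tokens model used in this notebook, but it would for any order-sensitive model. A minimal order-preserving sketch, assuming one combined pattern is acceptable; `TOKEN_PATTERN` and `ordered_chinese_tokenizer` are hypothetical names, not part of the assignment code:

```python
import re

# Hypothetical single-pass variant: the alternatives are tried at each position,
# so tokens come out in the order they appear in the text.
TOKEN_PATTERN = re.compile(r'[\u4e00-\u9fa5]|\d+|[a-zA-Z]+|[。？，！：；“”]')

def ordered_chinese_tokenizer(text):
    # Characters that match no alternative (spaces, ASCII '.') are skipped.
    return TOKEN_PATTERN.findall(text)

ordered_tokens = ordered_chinese_tokenizer("卖油条小刘说：我说123块钱 你好！this is an example of text.哇。")
print(ordered_tokens)
```

The result contains the same 23 tokens as the grouped version, only interleaved in reading order.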


#### With Consecutive Digits, English Words, and Punctuation

In [10]:
def improved_yield_tokens(data_iter):
    for text, _ in data_iter:
        tokens = improved_chinese_tokenizer(text)
        yield tokens

In [11]:
count = 0
print(train_data[0])
print(train_data[1])

for tokens in improved_yield_tokens(iter(train_data)): # Use a new iterator
    print(tokens)
    count += 1
    if count > 7:
        break

('卖油条小刘说：我说', 0)
('保姆小张说：干啥子嘛？', 0)
['卖', '油', '条', '小', '刘', '说', '我', '说', '：']
['保', '姆', '小', '张', '说', '干', '啥', '子', '嘛', '：', '？']
['卖', '油', '条', '小', '刘', '说', '你', '看', '你', '往', '星', '空', '看', '月', '朦', '胧', '鸟', '朦', '胧', '：', '，']
['卖', '油', '条', '小', '刘', '说', '咱', '是', '不', '是', '歇', '一', '下', '这', '双', '疲', '惫', '的', '双', '腿', '：', '，', '？']
['卖', '油', '条', '小', '刘', '说', '快', '把', '我', '累', '死', '了', '：']
['卖', '油', '条', '小', '刘', '说', '我', '说', '亲', '爱', '的', '大', '姐', '你', '贵', '姓', '啊', '：', '？']
['保', '姆', '小', '张', '说', '我', '免', '贵', '姓', '张', '我', '叫', '张', '凤', '姑', '：']
['卖', '油', '条', '小', '刘', '说', '凤', '姑', '：']


#### Build Vocabulary

In [12]:
vocab1 = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
vocab1.set_default_index(vocab1["<unk>"])

vocab = build_vocab_from_iterator(improved_yield_tokens(train_dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print("Vocabulary size:", len(vocab1))
print(vocab1(['你', '好', '世','界', '!']))

print("Vocabulary size:", len(vocab))
print(vocab(['你', '好', '世','界', '！','。','@']))
print(vocab(['卖', '油', '条','鸡', '！','。','@','123']))

Vocabulary size: 2687
[3, 31, 402, 496, 0]
Vocabulary size: 2806
[4, 34, 407, 501, 69, 301, 0]
[473, 460, 282, 894, 69, 301, 0, 0]


In [13]:
text_pipeline = lambda x: vocab(improved_chinese_tokenizer(x))
label_pipeline = lambda x: int(x)

In [14]:
# Test text_pipeline()
tokens = text_pipeline('你好世界！哈哈。')
print(tokens)

# Test label_pipeline()
lbl = label_pipeline('1')
print(lbl)

[4, 34, 407, 501, 360, 360, 69, 301]
1


### Data Batch

Define the `collate_batch` function, which will be used to process the "raw" data batch.

In [15]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# The operator 'aten::_embedding_bag' is not currently implemented for the MPS device.

def collate_batch(batch):
    label_list, token_ids_list, offsets = [], [], [0]
    for _text, _label in batch:
        label_list.append(label_pipeline(_label))
        token_ids = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        token_ids_list.append(token_ids)
        offsets.append(token_ids.size(0))  # record the number of tokens in each sample

    labels = torch.tensor(label_list, dtype=torch.int64)
    # The cumulative sum of the lengths gives each sample's start position in the concatenated tensor
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    token_ids = torch.cat(token_ids_list)  # one flat tensor holding the token IDs of all samples

    return labels.to(device), token_ids.to(device), offsets.to(device)

In [16]:
train_iter = train_dataset
# Use collate_batch to generate the dataloader
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

In [17]:
# Test the dataloader
for i, (labels, token_ids, offsets) in enumerate(dataloader):
    print(f"batch {i} label: {labels}")
    print(f"batch {i} text: {token_ids}")
    print(f"batch {i} offsets: {offsets}")
    if i == 0:
        break

# What does offsets mean?
print('Number of tokens: ', token_ids.size(0))
print('Number of examples in one batch: ', labels.size(0))
print('Example 1: ', token_ids[offsets[0]:offsets[1]])
print('Example 7: ', token_ids[offsets[6]:offsets[7]])

batch 0 label: tensor([0, 0, 1, 0, 0, 1, 0, 0])
batch 0 text: tensor([ 473,  460,  282,   23,  423,    1,    3,    1,    2,   73,   83,   23,
         113,    1,   98,  483,   46,   59,    2,   33,  473,  460,  282,   23,
         423,    1,    4,   55,    4,  305,  760,  869,   55,  494, 2131, 2210,
        1214, 2131, 2210,    2,    6,  473,  460,  282,   23,  423,    1,   71,
           7,    5,    7,  906,   18,   75,   10,  875, 2181, 2484,    8,  875,
        1130,    2,    6,   33,  473,  460,  282,   23,  423,    1,  187,   86,
           3,  610,  183,    9,    2,  473,  460,  282,   23,  423,    1,    3,
           1,  308,  164,    8,   32,  141,    4,  687,  453,   21,    2,   33,
          73,   83,   23,  113,    1,    3,  819,  687,  453,  113,    3,  149,
         113, 1183,  221,    2,  473,  460,  282,   23,  423,    1, 1183,  221,
           2])
batch 0 offsets: tensor([  0,   9,  20,  41,  64,  77,  96, 112])
Number of tokens:  121
Number of examples in one batch:  
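
To answer the question in the cell above: each entry of `offsets` is the position in the flat `token_ids` tensor where one example starts. The values printed for batch 0 can be reproduced from the token counts of its eight sentences. A plain-Python sketch (the `lengths` list is read off the tokenizer output shown earlier; `bounds` and `spans` are names introduced here for illustration):

```python
from itertools import accumulate

# Token counts of the 8 examples in batch 0 (characters plus punctuation tokens).
lengths = [9, 11, 21, 23, 13, 19, 16, 9]

# Start position of each example inside the concatenated token tensor.
offsets = [0] + list(accumulate(lengths[:-1]))
print(offsets)  # [0, 9, 20, 41, 64, 77, 96, 112]

# Example i spans bounds[i] .. bounds[i + 1]; the last one runs to the end.
bounds = offsets + [sum(lengths)]
spans = [(bounds[i], bounds[i + 1]) for i in range(len(lengths))]
print(spans[0], spans[-1])  # (0, 9) (112, 121)
```

Because `offsets` has only eight entries, the last example has to be sliced as `token_ids[offsets[7]:]` rather than with a ninth offset.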

### 2. Build the Model

In [18]:
import torch.nn.init as init

In [19]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class, hidden_dim1, hidden_dim2):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.hidden_layers = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim1),
            nn.ReLU(),
            nn.Linear(hidden_dim1, hidden_dim2),
            nn.ReLU()
        )
        self.fc = nn.Linear(hidden_dim2, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        for layer in self.hidden_layers:
            if isinstance(layer, nn.Linear):
                layer.weight.data.uniform_(-initrange, initrange)
                layer.bias.data.zero_()
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        x = self.hidden_layers(embedded)
        return self.fc(x)
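
The `nn.EmbeddingBag` layer in the model above looks up every token embedding of an example and averages them (its default mode is `'mean'`), using `offsets` to find the example boundaries. A small sketch with toy sizes, chosen here only for illustration, that compares the layer with a manual lookup-and-average:

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(10, 4, mode='mean')  # toy sizes: 10 tokens, 4 dimensions

# Two examples flattened into one tensor: [1, 2, 3] and [4, 5].
token_ids = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])

with torch.no_grad():
    pooled = bag(token_ids, offsets)  # one averaged vector per example
    # Same thing by hand: look up each row of the weight matrix, then average per example.
    manual = torch.stack([
        bag.weight[token_ids[0:3]].mean(dim=0),
        bag.weight[token_ids[3:5]].mean(dim=0),
    ])

print(pooled.shape, torch.allclose(pooled, manual, atol=1e-6))
```

Since the embeddings are averaged, token order has no effect on the model's prediction.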


In [20]:
# Build the model
train_iter = iter(train_dataset)
num_class = len(set([label for (_, label) in train_iter])) # binary classification here
vocab_size = len(vocab)
num_class, vocab_size

(2, 2806)

In [21]:
# embedding size
emsize =  64

hidden_dim1 = 16
hidden_dim2 = 8
# dropout_prob = 0


model = TextClassificationModel(vocab_size, emsize, num_class, hidden_dim1, hidden_dim2).to(device)

In [22]:
# Test the model
model.eval()
with torch.no_grad():
    for i, (labels, token_ids, offsets) in enumerate(dataloader):
        output = model(token_ids, offsets)
        # print(f"batch {i} output: {output}")
        if i == 0:
            break

# Examine the output
print('output size:', output.size())
print('output:', output)
print(model)

output size: torch.Size([8, 2])
output: tensor([[ 0.1136, -0.1126],
        [ 0.1427, -0.2067],
        [ 0.0157, -0.0216],
        [ 0.1239, -0.1515],
        [-0.1071, -0.2682],
        [ 0.0871, -0.1255],
        [ 0.0756, -0.1000],
        [ 0.0686, -0.0870]])
TextClassificationModel(
  (embedding): EmbeddingBag(2806, 64, mode='mean')
  (hidden_layers): Sequential(
    (0): Linear(in_features=64, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
  )
  (fc): Linear(in_features=8, out_features=2, bias=True)
)


### 3. Train and Evaluate

#### Define train and evaluate

In [23]:
import time
from sklearn.metrics import precision_recall_fscore_support

def train(model, dataloader, optimizer, criterion, epoch: int): # criterion: loss function
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 200 # print log every 200 batches
    start_time = time.time()

    for idx, (labels, token_ids, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        output = model(token_ids, offsets)
        try:
            loss = criterion(output, labels)
        except Exception:
            print('Error in loss calculation')
            print('output: ', output.size())
            print('labels: ', labels.size())
            # print('token_ids: ', token_ids)
            # print('offsets: ', offsets)
            raise
        loss.backward()
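        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # clip to prevent exploding gradients: keep the gradient norm at or below the threshold (0.1 here)
        # If the norm exceeds the threshold, the gradients are scaled down proportionally so that the norm stays within it. This keeps overly large parameter updates from hurting training.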
        optimizer.step() # update the parameters using the gradients

        total_acc += (output.argmax(1) == labels).sum().item()
        total_count += labels.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(model, dataloader, criterion):
    model.eval()
    total_correct, total_count = 0, 0
    y_labels = []
    y_preds = []

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            output = model(text, offsets)
            predictions = output.argmax(1)

            total_correct += (predictions == label).sum().item()
            total_count += label.size(0)

            y_labels.extend(label.tolist())
            y_preds.extend(predictions.tolist())

    accuracy = total_correct / total_count
    precision, recall, f1Score, _ = precision_recall_fscore_support(y_labels, y_preds, average='weighted')

    return accuracy, precision, recall, f1Score
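
The gradient clipping step in `train` rescales all gradients by a common factor whenever their overall L2 norm exceeds the threshold. A plain-Python sketch of that rule on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption made here for numerical safety, not a statement about PyTorch internals:

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale up
    return [g * scale for g in grads], total_norm

# A large gradient is shrunk to the threshold.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4))  # 5.0 0.1

# A gradient already below the threshold is left unchanged.
small, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.1)
print(small)  # [0.03, 0.04]
```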




### Hyper-parameters, loss, optimizer, and learning-rate scheduler

In [24]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 8  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # decay lr
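
With `step_size` 1 and `gamma=0.1`, every call to `scheduler.step()` multiplies the learning rate by 0.1. The training loop in this notebook calls it only in epochs where validation accuracy falls below the best value so far. A small sketch of the decay itself, using a dummy parameter in place of the model; `demo_optimizer` and `demo_scheduler` are names introduced here for illustration:

```python
import torch

# A single dummy parameter stands in for the model weights.
param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = torch.optim.SGD([param], lr=5)
demo_scheduler = torch.optim.lr_scheduler.StepLR(demo_optimizer, step_size=1, gamma=0.1)

lrs = [demo_optimizer.param_groups[0]["lr"]]
for _ in range(2):
    demo_optimizer.step()   # step the optimizer before the scheduler
    demo_scheduler.step()   # each call multiplies the learning rate by gamma
    lrs.append(demo_optimizer.param_groups[0]["lr"])

print(lrs)  # roughly 5, 0.5, 0.05
```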

Test `criterion`, i.e., the loss function

In [25]:
# First, obtain some output and labels
model.eval()
with torch.no_grad():
    for i, (labels, token_ids, offsets) in enumerate(dataloader):
        output = model(token_ids, offsets)
        # print(f"batch {i} output: {output}")
        if i == 0:
            break

loss = criterion(output, labels)
print('loss:', loss)

# keep multiple losses for all samples in a batch, not just the mean
criterion2 = torch.nn.CrossEntropyLoss(reduction='none') 

loss2 = criterion2(output, labels)
print('loss non-reduced:', loss2)
print('mean of loss non-reduced:', torch.mean(loss2))

# Manually calculate the loss
probs = torch.exp(output[0,:]) / torch.exp(output[0,:]).sum()
loss3 = -torch.log(probs[labels[0]])
print('loss manually computed:', loss3)

loss: tensor(0.6307)
loss non-reduced: tensor([0.5865, 0.5337, 0.7120, 0.5649, 0.6159, 0.8050, 0.6092, 0.6184])
mean of loss non-reduced: tensor(0.6307)
loss manually computed: tensor(0.5865)
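
The manual check above covers only the first example. It can be extended to the whole batch in plain Python, using the logits printed when the model was tested and the labels of batch 0. The sketch below uses the log-sum-exp form of cross-entropy, which is mathematically the same as `-log(softmax)`; the logits are the rounded values from the printout, so the results agree with the outputs above to about three decimals:

```python
import math

# Logits and labels of batch 0, copied from the printed outputs above.
logits = [
    [0.1136, -0.1126], [0.1427, -0.2067], [0.0157, -0.0216], [0.1239, -0.1515],
    [-0.1071, -0.2682], [0.0871, -0.1255], [0.0756, -0.1000], [0.0686, -0.0870],
]
labels = [0, 0, 1, 0, 0, 1, 0, 0]

def cross_entropy(row, label):
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
    return log_sum_exp - row[label]

losses = [cross_entropy(row, y) for row, y in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)
print([round(l, 3) for l in losses], round(mean_loss, 3))
```

The mean of the per-example losses matches the value returned by the default `CrossEntropyLoss`, which averages over the batch.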


#### Prepare train, valid, and test data

In [26]:
# Prepare train, valid, and test data
train_iter = MyDataset(train_data)
test_iter = MyDataset(test_data)

train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)


num_train = int(len(train_dataset) * 0.95)

split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
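
The split sizes and the number of training batches follow directly from the numbers above. A short arithmetic check, assuming the `DataLoader` keeps the last partial batch (its default behaviour):

```python
import math

num_examples = 12677  # len(train_data), printed earlier
batch_size = 8

num_train = int(num_examples * 0.95)
num_valid = num_examples - num_train
train_batches = math.ceil(num_train / batch_size)  # the last batch may be smaller

print(num_train, num_valid, train_batches)  # 12043 634 1506
```

The 1506 batches agree with the batch count shown in the training log.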

### Main Training Loop

In [27]:
# Run the training loop
total_accu = None
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()

    train(model, train_dataloader, optimizer, criterion, epoch)
    # accu_val = evaluate(model, valid_dataloader, criterion)
    accu_val, precision, recall, f1 = evaluate(model, valid_dataloader, criterion)

    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
        # print('')
    else:
        total_accu = accu_val

    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

| epoch   1 |   200/ 1506 batches | accuracy    0.711
| epoch   1 |   400/ 1506 batches | accuracy    0.676
| epoch   1 |   600/ 1506 batches | accuracy    0.708
| epoch   1 |   800/ 1506 batches | accuracy    0.699
| epoch   1 |  1000/ 1506 batches | accuracy    0.702
| epoch   1 |  1200/ 1506 batches | accuracy    0.716
| epoch   1 |  1400/ 1506 batches | accuracy    0.685
-----------------------------------------------------------
| end of epoch   1 | time:  1.25s | valid accuracy    0.472 
-----------------------------------------------------------
| epoch   2 |   200/ 1506 batches | accuracy    0.695
| epoch   2 |   400/ 1506 batches | accuracy    0.714
| epoch   2 |   600/ 1506 batches | accuracy    0.688
| epoch   2 |   800/ 1506 batches | accuracy    0.715
| epoch   2 |  1000/ 1506 batches | accuracy    0.669
| epoch   2 |  1200/ 1506 batches | accuracy    0.708
| epoch   2 |  1400/ 1506 batches | accuracy    0.693
-----------------------------------------------------------
| e
| epoch   3 |   200/ 1506 batches | accuracy    0.692
| epoch   3 |   400/ 1506 batches | accuracy    0.684
| epoch   3 |   600/ 1506 batches | accuracy    0.680
| epoch   3 |   800/ 1506 batches | accuracy    0.691
| epoch   3 |  1000/ 1506 batches | accuracy    0.701
| epoch   3 |  1200/ 1506 batches | accuracy    0.677
| epoch   3 |  1400/ 1506 batches | accuracy    0.696
-----------------------------------------------------------
| end of epoch   3 | time:  1.28s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   4 |   200/ 1506 batches | accuracy    0.690
| epoch   4 |   400/ 1506 batches | accuracy    0.691
| epoch   4 |   600/ 1506 batches | accuracy    0.661
| epoch   4 |   800/ 1506 batches | accuracy    0.693
| epoch   4 |  1000/ 1506 batches | accuracy    0.716
| epoch   4 |  1200/ 1506 batches | accuracy    0.701
| epoch   4 |  1400/ 1506 batches | accuracy    0.693
-----------------------------------------------------------
| end of epoch   4 | time:  1.23s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   5 |   200/ 1506 batches | accuracy    0.692
| epoch   5 |   400/ 1506 batches | accuracy    0.664
| epoch   5 |   600/ 1506 batches | accuracy    0.696
| epoch   5 |   800/ 1506 batches | accuracy    0.705
| epoch   5 |  1000/ 1506 batches | accuracy    0.700
| epoch   5 |  1200/ 1506 batches | accuracy    0.696
| epoch   5 |  1400/ 1506 batches | accuracy    0.690
-----------------------------------------------------------
| end of epoch   5 | time:  1.21s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   6 |   200/ 1506 batches | accuracy    0.698
| epoch   6 |   400/ 1506 batches | accuracy    0.698
| epoch   6 |   600/ 1506 batches | accuracy    0.693
| epoch   6 |   800/ 1506 batches | accuracy    0.694
| epoch   6 |  1000/ 1506 batches | accuracy    0.711
| epoch   6 |  1200/ 1506 batches | accuracy    0.681
| epoch   6 |  1400/ 1506 batches | accuracy    0.686
-----------------------------------------------------------
| end of epoch   6 | time:  1.21s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   7 |   200/ 1506 batches | accuracy    0.689
| epoch   7 |   400/ 1506 batches | accuracy    0.677
| epoch   7 |   600/ 1506 batches | accuracy    0.684
| epoch   7 |   800/ 1506 batches | accuracy    0.688
| epoch   7 |  1000/ 1506 batches | accuracy    0.693
| epoch   7 |  1200/ 1506 batches | accuracy    0.685
| epoch   7 |  1400/ 1506 batches | accuracy    0.693
-----------------------------------------------------------
| end of epoch   7 | time:  1.34s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   8 |   200/ 1506 batches | accuracy    0.705
| epoch   8 |   400/ 1506 batches | accuracy    0.711
| epoch   8 |   600/ 1506 batches | accuracy    0.685
| epoch   8 |   800/ 1506 batches | accuracy    0.686
| epoch   8 |  1000/ 1506 batches | accuracy    0.671
| epoch   8 |  1200/ 1506 batches | accuracy    0.682
| epoch   8 |  1400/ 1506 batches | accuracy    0.693
-----------------------------------------------------------
| end of epoch   8 | time:  1.30s | valid accuracy    0.692 
-----------------------------------------------------------
| epoch   9 |   200/ 1506 batches | accuracy    0.694
| epoch   9 |   400/ 1506 batches | accuracy    0.701
| epoch   9 |   600/ 1506 batches | accuracy    0.692
| epoch   9 |   800/ 1506 batches | accuracy    0.684
| epoch   9 |  1000/ 1506 batches | accuracy    0.692
| epoch   9 |  1200/ 1506 batches | accuracy    0.710
| epoch   9 |  1400/ 1506 batches | accuracy    0.689
-----------------------------------------------------------
| end of epoch   9 | time:  1.22s | valid accuracy    0.644 
-----------------------------------------------------------
| epoch  10 |   200/ 1506 batches | accuracy    0.702
| epoch  10 |   400/ 1506 batches | accuracy    0.698
| epoch  10 |   600/ 1506 batches | accuracy    0.714
| epoch  10 |   800/ 1506 batches | accuracy    0.732
| epoch  10 |  1000/ 1506 batches | accuracy    0.723
| epoch  10 |  1200/ 1506 batches | accuracy    0.703
| epoch  10 |  1400/ 1506 batches | accuracy    0.701
-----------------------------------------------------------
| e

In [28]:
# Save the model
torch.save(model.state_dict(), "A1_text_classification_model.pth")

### Evaluate with Test Data

In [29]:
# accu_test = evaluate(model, test_dataloader, criterion)
accu_val, precision, recall, f1_score = evaluate(model, test_dataloader, criterion)
print("test accuracy {:8.3f}, precision {:8.3f}, recall {:8.3f}, f1_score {:8.3f}".format(accu_val, precision, recall, f1_score))


test accuracy    0.696, precision    0.673, recall    0.696, f1_score    0.682
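
In the result above, recall equals accuracy (both 0.696). That is expected with `average='weighted'`: each class's recall is weighted by its share of the true labels, so the weighted sum reduces to the fraction of correct predictions. A plain-Python sketch with made-up labels; `weighted_recall` is a hypothetical helper written for illustration, not the scikit-learn implementation:

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class frequencies (supports) as weights."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum(support[c] / n * (hits[c] / support[c]) for c in support)

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, weighted_recall(y_true, y_pred))
```

Weighted recall therefore adds no information beyond accuracy. Per-class or macro-averaged scores would show how the minority class is handled.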


### 4. Explore Word Segmentation

In [30]:
import jieba

In [31]:
len(train_data), len(test_data)

(12677, 651)

In [32]:
train_dataset = MyDataset(train_data)
test_dataset = MyDataset(test_data)
train_dataset[:3]

[('卖油条小刘说：我说', 0), ('保姆小张说：干啥子嘛？', 0), ('卖油条小刘说：你看你往星空看月朦胧，鸟朦胧', 1)]

In [33]:
seg_list = jieba.lcut("我是南方科技大学的计算机系学生。")  # accurate mode is the default
print(seg_list)

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/zs/c32hzq1j6t54vw2q044ggrsr0000gn/T/jieba.cache
Loading model cost 0.387 seconds.
Prefix dict has been built successfully.


['我', '是', '南方', '科技', '大学', '的', '计算机系', '学生', '。']


In [34]:
def jieba_tokenizer(text):
    return jieba.lcut(text)


def yield_tokens(data_iter):
    for text, _ in data_iter:
        tokens = jieba_tokenizer(text)  # tokenize the sentence
        yield tokens

In [35]:
count = 0
print(train_data[0])
print(train_data[1])

for tokens in yield_tokens(iter(train_data)): # Use a new iterator
    print(tokens)
    count += 1
    if count > 7:
        break

('卖油条小刘说：我说', 0)
('保姆小张说：干啥子嘛？', 0)
['卖', '油条', '小', '刘说', '：', '我', '说']
['保姆', '小张', '说', '：', '干', '啥子', '嘛', '？']
['卖', '油条', '小', '刘说', '：', '你', '看', '你', '往', '星空', '看', '月', '朦胧', '，', '鸟', '朦胧']
['卖', '油条', '小', '刘说', '：', '咱', '是不是', '歇', '一下', '这', '双', '，', '疲惫', '的', '双腿', '？']
['卖', '油条', '小', '刘说', '：', '快', '把', '我', '累死', '了']
['卖', '油条', '小', '刘说', '：', '我', '说', '亲爱', '的', '大姐', '你', '贵姓', '啊', '？']
['保姆', '小张', '说', '：', '我免', '贵姓', '张', '我', '叫', '张凤姑']
['卖', '油条', '小', '刘说', '：', '凤姑']


#### Build Vocabulary

In [36]:
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print("Vocabulary size:", len(vocab))
print(vocab(['你', '好', '世界','界', '！','。','@']))
print(vocab(['卖', '油', '条','鸡', '！','。','@','123']))

Vocabulary size: 13847
[5, 48, 515, 0, 43, 153, 0]
[385, 3516, 2129, 6008, 43, 153, 0, 0]


In [37]:
text_pipeline = lambda x: vocab(jieba_tokenizer(x))
label_pipeline = lambda x: int(x)

In [38]:
# Test text_pipeline()
tokens = text_pipeline('你好世界！哈哈。我是卖油条的')
print(tokens)

# Test label_pipeline()
lbl = label_pipeline('1')
print(lbl)

[561, 515, 43, 938, 153, 3, 12, 385, 536, 6]
1


### Data Batch

Reuse the `collate_batch` function defined earlier to process the "raw" data batch. It looks up the global `text_pipeline` when it is called, so it now uses the jieba-based pipeline.

In [39]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# The operator 'aten::_embedding_bag' is not currently implemented for the MPS device.

# Use collate_batch to generate the dataloader
train_dataset = MyDataset(train_data)

train_iter = train_dataset
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

In [40]:
# Build the model
train_iter = iter(train_dataset)
num_class = len(set([label for (_, label) in train_iter])) # binary classification here
vocab_size = len(vocab)
num_class, vocab_size

(2, 13847)

In [41]:

hidden_dim1 = 16
hidden_dim2 = 8

# embedding size
emsize =  64

model1 = TextClassificationModel(vocab_size, emsize, num_class, hidden_dim1, hidden_dim2).to(device)
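
Switching to word-level tokens grows the vocabulary from 2806 to 13847 entries, and almost all of the model's parameters sit in the embedding table. A quick count for the architecture defined above (weights plus biases); `count_parameters` is a hypothetical helper added for illustration:

```python
def count_parameters(vocab_size, embed_dim=64, hidden_dim1=16, hidden_dim2=8, num_class=2):
    """Weights plus biases of the embedding, the two hidden layers, and the output layer."""
    embedding = vocab_size * embed_dim
    hidden1 = embed_dim * hidden_dim1 + hidden_dim1
    hidden2 = hidden_dim1 * hidden_dim2 + hidden_dim2
    output = hidden_dim2 * num_class + num_class
    return embedding + hidden1 + hidden2 + output

print(count_parameters(2806))   # character-level vocabulary: 180778
print(count_parameters(13847))  # jieba vocabulary: 887402
```

The jieba model has roughly five times as many parameters to fit from the same 12677 training sentences.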

### Hyper-parameters, loss, optimizer, and learning-rate scheduler

In [42]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 8  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model1.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # decay lr

In [43]:
# Prepare train, valid, and test data
train_iter = MyDataset(train_data)
test_iter = MyDataset(test_data)

train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

In [44]:
# Run the training loop
total_accu = None
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()

    train(model1, train_dataloader, optimizer, criterion, epoch)
    # accu_val = evaluate(model1, valid_dataloader, criterion)
    accu_val, precision, recall, f1 = evaluate(model1, valid_dataloader, criterion)

    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
        print('')
    else:
        total_accu = accu_val

    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

| epoch   1 |   200/ 1506 batches | accuracy    0.700
| epoch   1 |   400/ 1506 batches | accuracy    0.703
| epoch   1 |   600/ 1506 batches | accuracy    0.702
| epoch   1 |   800/ 1506 batches | accuracy    0.667
| epoch   1 |  1000/ 1506 batches | accuracy    0.701
| epoch   1 |  1200/ 1506 batches | accuracy    0.700
| epoch   1 |  1400/ 1506 batches | accuracy    0.679
-----------------------------------------------------------
| end of epoch   1 | time:  2.74s | valid accuracy    0.333 
-----------------------------------------------------------
| epoch   2 |   200/ 1506 batches | accuracy    0.682
| epoch   2 |   400/ 1506 batches | accuracy    0.693
| epoch   2 |   600/ 1506 batches | accuracy    0.685
| epoch   2 |   800/ 1506 batches | accuracy    0.694
| epoch   2 |  1000/ 1506 batches | accuracy    0.731
| epoch   2 |  1200/ 1506 batches | accuracy    0.681
| epoch   2 |  1400/ 1506 batches | accuracy    0.661
-----------------------------------------------------------
| end of epoch   2 | time:  2.81s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   3 |   200/ 1506 batches | accuracy    0.662
| epoch   3 |   400/ 1506 batches | accuracy    0.713
| epoch   3 |   600/ 1506 batches | accuracy    0.699
| epoch   3 |   800/ 1506 batches | accuracy    0.662
| epoch   3 |  1000/ 1506 batches | accuracy    0.705
| epoch   3 |  1200/ 1506 batches | accuracy    0.698
| epoch   3 |  1400/ 1506 batches | accuracy    0.710

-----------------------------------------------------------
| end of epoch   3 | time:  2.79s | valid accuracy    0.621 
-----------------------------------------------------------
| epoch   4 |   200/ 1506 batches | accuracy    0.711
| epoch   4 |   400/ 1506 batches | accuracy    0.701
| epoch   4 |   600/ 1506 batches | accuracy    0.713
| epoch   4 |   800/ 1506 batches | accuracy    0.717
| epoch   4 |  1000/ 1506 batches | accurac
-----------------------------------------------------------
| end of epoch   4 | time:  2.70s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   5 |   200/ 1506 batches | accuracy    0.701
| epoch   5 |   400/ 1506 batches | accuracy    0.712
| epoch   5 |   600/ 1506 batches | accuracy    0.715
| epoch   5 |   800/ 1506 batches | accuracy    0.702
| epoch   5 |  1000/ 1506 batches | accuracy    0.695
| epoch   5 |  1200/ 1506 batches | accuracy    0.699
| epoch   5 |  1400/ 1506 batches | accuracy    0.733
-----------------------------------------------------------
| end of epoch   5 | time:  2.76s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   6 |   200/ 1506 batches | accuracy    0.697
| epoch   6 |   400/ 1506 batches | accuracy    0.725
| epoch   6 |   600/ 1506 batches | accuracy    0.728
| epoch   6 |   800/ 1506 batches | accuracy    0.711
| epoch   6 |  1000/ 1506 batches | accuracy    0.689
| epoch   6 |  1200/ 1506 batches | accuracy    0.714
| epoch   6 |  1400/ 1506 batches | accuracy    0.710
-----------------------------------------------------------
| end of epoch   6 | time:  2.70s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   7 |   200/ 1506 batches | accuracy    0.727
| epoch   7 |   400/ 1506 batches | accuracy    0.706
| epoch   7 |   600/ 1506 batches | accuracy    0.711
| epoch   7 |   800/ 1506 batches | accuracy    0.703
| epoch   7 |  1000/ 1506 batches | accuracy    0.704
| epoch   7 |  1200/ 1506 batches | accuracy    0.713
| epoch   7 |  1400/ 1506 batches | accuracy    0.704
-----------------------------------------------------------
| end of epoch   7 | time:  2.73s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   8 |   200/ 1506 batches | accuracy    0.700
| epoch   8 |   400/ 1506 batches | accuracy    0.712
| epoch   8 |   600/ 1506 batches | accuracy    0.719
| epoch   8 |   800/ 1506 batches | accuracy    0.713
| epoch   8 |  1000/ 1506 batches | accuracy    0.732
| epoch   8 |  1200/ 1506 batches | accuracy    0.704
| epoch   8 |  1400/ 1506 batches | accuracy    0.710
-----------------------------------------------------------
| end of epoch   8 | time:  2.77s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch   9 |   200/ 1506 batches | accuracy    0.693
| epoch   9 |   400/ 1506 batches | accuracy    0.706
| epoch   9 |   600/ 1506 batches | accuracy    0.734
| epoch   9 |   800/ 1506 batches | accuracy    0.738
| epoch   9 |  1000/ 1506 batches | accuracy    0.701
| epoch   9 |  1200/ 1506 batches | accuracy    0.701
| epoch   9 |  1400/ 1506 batches | accuracy    0.699
-----------------------------------------------------------
| end of epoch   9 | time:  2.75s | valid accuracy    0.713 
-----------------------------------------------------------
| epoch  10 |   200/ 1506 batches | accuracy    0.713
| epoch  10 |   400/ 1506 batches | accuracy    0.718
| epoch  10 |   600/ 1506 batches | accuracy    0.714
| epoch  10 |   800/ 1506 batches | accuracy    0.707
| epoch  10 |  1000/ 1506 batches | accuracy

In [45]:
# Save the model
torch.save(model1.state_dict(), "A1_jieba_text_classification_model.pth")

In [46]:
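accu_val, precision, recall, f1 = evaluate(model1, test_dataloader, criterion)
# Print the f1 value returned above. The original cell printed the stale `f1_score`
# variable left over from In [29], so the f1_score in the recorded output below (0.682)
# belongs to the character-level model, not to this one.
print("test accuracy {:8.3f}, precision {:8.3f}, recall {:8.3f}, f1_score {:8.3f}".format(accu_val, precision, recall, f1))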


test accuracy    0.716, precision    0.694, recall    0.716, f1_score    0.682
