# <b>AASD 4014 - Advanced Mathematical Concepts for Deep Learning Group Project 1</b>

### <b>Members:</b> 1. Saksham Prakash (101410709) 2. Sik Yin Sun (101409665)

### <b>Data - Chinese Sentiment</b>
<font size=3>Dataset link:</font>
https://huggingface.co/datasets/sepidmnorozy/Chinese_sentiment


Importing necessary libraries

In [89]:
from datasets import load_dataset
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import evaluate
from text2vec import SentenceModel

Loading the Chinese Sentiment dataset

In [90]:
dataset = load_dataset("sepidmnorozy/Chinese_sentiment")

Found cached dataset csv (/home/saksham/.cache/huggingface/datasets/sepidmnorozy___csv/sepidmnorozy--Chinese_sentiment-0f90f54b0dfdaa37/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


Creating dataframes from the train, test and validation splits of the dataset

In [91]:
df_train = pd.DataFrame(dataset["train"])
df_test = pd.DataFrame(dataset["test"])
df_valid = pd.DataFrame(dataset["validation"])

In [92]:
df = pd.concat([df_train, df_test, df_valid], ignore_index=True)

In [93]:
df.head()

| | label | text |
|---|---|---|
| 0 | 1 | 酒店 房間 算 不錯 ， 大 床 很 舒適 。 晚上 睡的 很香 ， 酒店 環境 不到 四星... |
| 1 | 1 | 對於 這本 鯉 , 我是 針對 於 它 每期 的 主題 而 買 的 ， 第一次 買 “ 鯉 ... |
| 2 | 0 | 晚上 朋友 帶 我們 去 吃 海鮮 ， 結果 半夜 兩點 兩人 都 上吐下瀉 ， 請 服務 ... |
| 3 | 1 | 教育 兩個 字 看似 簡單 ， 簡單 到 日常 的 威權 教育 。 於是 孩子 成了 父母 ... |
| 4 | 1 | 比較 老 的 飯店 了 , 房間 只能 用 還 比較 乾淨 整潔 來 形容 , 但 離 五星... |


In [94]:
df.shape

(19835, 2)

In [95]:
df.isnull().sum()

label    0
text     0
dtype: int64

In [96]:
df['label'].value_counts()

1    11914
0     7921
Name: label, dtype: int64

In [97]:
df['label'].value_counts() / len(df)

1    0.600655
0    0.399345
Name: label, dtype: float64
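Because the two classes are not balanced, a useful reference point for any accuracy reported later is the majority-class baseline: the accuracy of a classifier that always predicts the more frequent label. The sketch below computes it from the label counts printed above, over the combined data; the variable names are illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 11914, 0: 7921}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a classifier that always predicts the most frequent label.
baseline_accuracy = label_counts[majority_label] / total

print(f"majority label: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

A fine-tuned model should score well above this value to show that it has learned something beyond the label distribution.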

<font size=3>A brief exploration shows no null values in the label or text columns. Positive reviews (label 1) outnumber negative reviews (label 0), at roughly 60% to 40% of the data.</font>

Creating a BERT tokenizer specific to the Chinese language

In [98]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

Examining the size of the vocabulary of the Chinese BERT tokenizer

In [99]:
vocab = tokenizer.vocab
len(vocab)

21128

Printing out the first few tokens in the vocabulary

In [100]:
print("token: index")
# Print the first five entries in the vocabulary's iteration order
for token in list(vocab)[:5]:
    print(token, ": ", vocab[token])

token: index
##荷 :  18849
墜 :  1871
飽 :  7616
20cm :  13309
##蜢 :  19116


In [101]:
df['text'][0]

'酒店 房間 算 不錯 ， 大 床 很 舒適 。 晚上 睡的 很香 ， 酒店 環境 不到 四星 標準 ， 但 又 比 三星 高 。 地點 很好 ， 價錢 適中 。'

In [102]:
tokenizer(df['text'][0], padding="max_length", truncation=True)

{'input_ids': [101, 6983, 2421, 2791, 7279, 5050, 679, 7097, 8024, 1920, 2414, 2523, 5653, 6900, 511, 3241, 677, 4717, 4638, 2523, 7676, 8024, 6983, 2421, 4472, 1862, 679, 1168, 1724, 3215, 3560, 3976, 8024, 852, 1348, 3683, 676, 3215, 7770, 511, 1765, 7953, 2523, 1962, 8024, 1019, 7092, 6900, 704, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

<font size=3>The bert-base-chinese tokenizer has a vocabulary of 21,128 tokens. Some of these tokens start with ##. This is the WordPiece convention: when a word is not in the vocabulary as a whole, it is split into smaller pieces that are, and the ## prefix marks a piece that continues a word rather than starting one. We also compared the text before and after tokenization: with padding="max_length", the token IDs are followed by zeros (the padding ID) up to the maximum length. For context, the Chinese sentence above is a hotel review: the customer found the bed comfortable, the location good, and the price reasonable.

source: https://huggingface.co/course/chapter6/6?fw=pt</font>
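To make the role of the ## prefix concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. It is a simplified illustration, not the tokenizer's actual implementation; the function name and the toy vocabulary are assumptions made for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word is treated as unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, much smaller than the real one.
toy_vocab = {"play", "##ing", "##ed", "20", "##cm"}

for word in ["playing", "20cm", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

In this toy vocabulary "20cm" is split into two pieces, whereas the real bert-base-chinese vocabulary printed earlier happens to contain "20cm" as a single token.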

In [103]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

Tokenizing the entire dataset using the tokenize_function defined above

In [17]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/12348 [00:00<?, ? examples/s]

Map:   0%|          | 0/4896 [00:00<?, ? examples/s]

Map:   0%|          | 0/2591 [00:00<?, ? examples/s]

Removing the original text column, renaming the label column to "labels", and setting the format of the tokenized dataset to PyTorch tensors

In [18]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

Separating the tokenized dataset into training, validation, and test datasets

In [19]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

Creating data loaders for the training, validation, and test datasets

In [20]:
train_dataloader = DataLoader(train_dataset, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)

Instantiating a pre-trained Chinese BERT model for sequence classification

In [21]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Defining an optimizer for the model; we chose AdamW, as shown in class

In [22]:
optimizer = AdamW(model.parameters(), lr=5e-5)

In [23]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
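With name="linear" and num_warmup_steps=0, the scheduler lowers the learning rate in a straight line from its initial value (5e-5 here) to zero at the final training step. The sketch below re-creates that rule in plain Python as an illustration; it is not the library's code, the helper name is hypothetical, and the figure of 12,348 training examples assumes that the first Map line in the tokenization output above is the training split.

```python
import math


def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


# 12,348 training examples in batches of 8, for 3 epochs.
steps_per_epoch = math.ceil(12348 / 8)
num_training_steps = 3 * steps_per_epoch
print("training steps:", num_training_steps)

for step in (0, num_training_steps // 2, num_training_steps):
    lr = linear_schedule_lr(step, 5e-5, 0, num_training_steps)
    print(f"step {step}: lr = {lr:.2e}")
```

The step count computed this way matches the total shown by the training progress bar (4632).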

Moving the model to the GPU if one is available

In [24]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Training with a tqdm progress bar

In [30]:
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/4632 [00:00<?, ?it/s]

Evaluation on "eval_dataloader" with accuracy metric

In [32]:
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

pred= tensor([0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0') ground_truth= tensor([0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0')
pred= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([0, 1, 1, 0, 1, 1, 1, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 1, 1, 1], device='cuda:0')
pred= tensor([0, 1, 0, 0, 1, 1, 0, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 1, 0, 1], device='cuda:0')
pred= tensor([0, 1, 0, 1, 0, 1, 1, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 1, 0, 1, 1, 1], device='cuda:0')
pred= tensor([1, 0, 0, 0, 0, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 0, 0, 0, 1, 1], device='cuda:0')
pred= tensor([1, 1, 1, 0, 0, 1, 1, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 0, 1, 1, 1], device='cuda:0')
pred= tensor([0, 1, 0, 0, 1, 0, 0, 0], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 0, 0, 0], device='cuda:0')
pred= tensor([1, 1, 0, 1, 0, 0, 1, 1], d

{'accuracy': 0.9644924739482825}
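Conceptually, the accuracy metric counts how many predictions match their labels across all batches and divides by the number of examples. As a small illustration, the sketch below does this by hand for the first three batches printed above; it covers only those 24 examples, so its result differs from the full validation accuracy.

```python
# First three (prediction, ground truth) batches copied from the output above.
batches = [
    ([0, 1, 1, 1, 0, 0, 1, 0], [0, 1, 1, 1, 0, 0, 1, 0]),
    ([1, 0, 0, 1, 1, 0, 1, 1], [1, 0, 0, 1, 1, 0, 1, 1]),
    ([0, 1, 1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1, 1, 1]),
]

correct = 0
total = 0
for predictions, references in batches:
    # Count the positions where the predicted label equals the true label.
    correct += sum(p == r for p, r in zip(predictions, references))
    total += len(references)

accuracy = correct / total
print(f"{correct}/{total} correct, accuracy = {accuracy:.4f}")
```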

Evaluation on "test_dataloader" with accuracy metric

In [35]:
metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

pred= tensor([1, 1, 1, 0, 0, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 0, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 1, 1, 1, 1, 1, 0], device='cuda:0') ground_truth= tensor([1, 1, 1, 1, 1, 0, 0, 0], device='cuda:0')
pred= tensor([1, 1, 1, 0, 1, 0, 0, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 1, 0, 0, 1], device='cuda:0')
pred= tensor([0, 1, 1, 0, 1, 1, 0, 0], device='cuda:0') ground_truth= tensor([0, 0, 1, 0, 1, 1, 0, 0], device='cuda:0')
pred= tensor([0, 1, 1, 0, 0, 1, 0, 0], device='cuda:0') ground_truth= tensor([0, 1, 1, 0, 0, 1, 0, 0], device='cuda:0')
pred= tensor([0, 0, 0, 0, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([0, 0, 0, 0, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 1, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 1, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 1, 0, 1, 0, 0, 0, 1], d

{'accuracy': 0.9634395424836601}

# Trying shibing624/text2vec-base-chinese

In [80]:
text2vec_model = SentenceModel('shibing624/text2vec-base-chinese')

2023-03-28 06:26:19.323 | DEBUG    | text2vec.sentence_model:__init__:74 - Use device: cuda


In [82]:
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)


No sentence-transformers model found with name /home/saksham/.cache/torch/sentence_transformers/shibing624_text2vec-base-chinese. Creating a new one with MEAN pooling.


Sentence embeddings:
[[-4.4367404e-04 -2.9734752e-01  8.5790193e-01 ... -5.2770102e-01
  -1.4315654e-01 -1.0007865e-01]
 [ 6.5362066e-01 -7.6667517e-02  9.5962423e-01 ... -6.0122490e-01
  -1.6792037e-03  2.1457729e-01]]
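Sentence embeddings like these are usually compared with cosine similarity, where paraphrases are expected to score close to 1. The sketch below shows the calculation in plain Python; the two short vectors are made-up stand-ins for real embeddings, which are much longer, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional vectors standing in for real sentence embeddings.
embedding_a = [0.0, -0.3, 0.9, -0.5]
embedding_b = [0.6, -0.1, 1.0, -0.6]

similarity = cosine_similarity(embedding_a, embedding_b)
print(f"cosine similarity: {similarity:.3f}")
```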


In [81]:
embeddings_train = text2vec_model.encode(df_train['text'].tolist())
embeddings_test = text2vec_model.encode(df_test['text'].tolist())
embeddings_valid = text2vec_model.encode(df_valid['text'].tolist())

In [71]:
# Pair each sentence embedding with its label
train_dataset = TensorDataset(torch.as_tensor(embeddings_train), torch.tensor(df_train['label'].tolist()))
test_dataset = TensorDataset(torch.as_tensor(embeddings_test), torch.tensor(df_test['label'].tolist()))
valid_dataset = TensorDataset(torch.as_tensor(embeddings_valid), torch.tensor(df_valid['label'].tolist()))

train_dataloader = DataLoader(train_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)
valid_dataloader = DataLoader(valid_dataset, batch_size=8)

In [74]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

In [75]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
# Build the optimizer and scheduler after the new model so they update its parameters
optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [76]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [77]:
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        inputs, labels = batch
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/4632 [00:00<?, ?it/s]

RuntimeError: The expanded size of the tensor (768) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [8, 768].  Tensor sizes: [1, 512]
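The error above is consistent with the sentence embeddings being passed to BERT as if they were token IDs: each embedding has 768 values, so the model sees a 768-token sequence, which is longer than the 512 positions it supports. Sentence embeddings are already the output of a model, so one common alternative is to train a small classifier directly on them. The sketch below fits a logistic-regression head with plain gradient descent on made-up 3-dimensional vectors; the function names, toy data, and hyperparameters are illustrative assumptions, not part of the original notebook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_head(embeddings, labels, lr=0.5, epochs=200):
    """Fit the weights and bias of a logistic-regression head."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Full-batch gradient descent step on the mean loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Made-up 3-dimensional vectors standing in for 768-dimensional embeddings.
toy_embeddings = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.1], [0.1, 0.9, 0.4], [0.2, 0.8, 0.2]]
toy_labels = [1, 1, 0, 0]

w, b = train_linear_head(toy_embeddings, toy_labels)
toy_predictions = [predict(x, w, b) for x in toy_embeddings]
print(toy_predictions)
```

In the notebook's setting, the same idea would take the arrays returned by text2vec_model.encode as inputs and the dataset labels as targets, with a single linear layer in place of the full BERT classifier.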

# Trying liam168/c2-roberta-base-finetuned-dianping-chinese

In [111]:
model = AutoModelForSequenceClassification.from_pretrained("liam168/c2-roberta-base-finetuned-dianping-chinese", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("liam168/c2-roberta-base-finetuned-dianping-chinese")

In [114]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='longest', truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [115]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

train_dataloader = DataLoader(train_dataset, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)

In [116]:
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [117]:
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# Evaluate on the validation split
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

  0%|          | 0/4632 [00:00<?, ?it/s]

RuntimeError: The size of tensor a (851) must match the size of tensor b (512) at non-singleton dimension 1
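This error suggests that some reviews tokenize to more than 512 tokens (851 in the failing batch), which exceeds the 512 position embeddings shown in the model summary. A likely cause, stated here as an assumption, is that this checkpoint's tokenizer does not define a maximum length, so truncation=True has no limit to truncate to; passing max_length=512 to the tokenizer call alongside truncation=True would be the usual remedy. The plain-Python sketch below illustrates the intended truncate-then-pad behavior on made-up token IDs; the helper name and sample IDs are hypothetical.

```python
def truncate_and_pad(batch_ids, max_length=512, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left.

    A real tokenizer also keeps the closing special token when it truncates;
    this sketch only slices the list.
    """
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]


# Made-up token IDs: one short review and one with 851 tokens.
batch = [[101, 6983, 2421, 102], [101] + [2523] * 849 + [102]]
padded = truncate_and_pad(batch)

print([len(ids) for ids in batch])
print([len(ids) for ids in padded])
```

After truncation every sequence fits within the model's 512 positions, so the size mismatch in the embedding layer would no longer occur.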