# <b>AASD 4014 - Advanced Mathematical Concepts for Deep Learning: Group Project 1</b>

<b>Members:</b> 
1. Saksham Prakash (101410709) 
2. Sik Yin Sun (101409665)

## Background and Motivation
<b> Why Chinese? </b>

Because the two team members come from different ethnic backgrounds, we wanted to explore how NLP models perform on our native languages. Saksham's native language is Hindi and Sik's is Chinese; we chose Chinese because we found larger community support and more resources for it.

<b> Why Text Classification?</b>

Initially, we explored text generation by fine-tuning a transformer-based model on Chinese text. We tried Andrej Karpathy's nanoGPT repository but could not get a Chinese tokenizer/encoder running without errors. We then looked at text generation models, as well as a few question-answering datasets and models, on Hugging Face. Given the available time and resources, we settled on the simpler task of Chinese text classification for now.

<b> Why Hugging Face? </b>

Learning the Hugging Face transformers libraries felt essential: they have a large community and let working professionals quickly try different AI solutions for the problem at hand.

## Problem Statement

The Applied AI Solutions Development program has so far trained models exclusively on English data, and the team is interested in how well machine learning and deep learning work in other languages. For this project we chose sentiment analysis in Chinese, using a Hugging Face dataset of comments on topics such as hotels, seafood, and restaurants. We use existing models from Hugging Face: bert-base-chinese (https://huggingface.co/bert-base-chinese) and two RoBERTa-based models (https://huggingface.co/liam168/c2-roberta-base-finetuned-dianping-chinese, https://huggingface.co/Jiabo/Roberta_Chinese_sentiment). Each model is fine-tuned on the dataset, and we compare their performance to assess the efficacy of transformers on Chinese text. Accuracy is the evaluation metric.

## Contents

| SNo. | Contents | Link |
| -------- | -------- | -------- |
| 1 | Installing and Importing libraries | [Jump To Cell](#lib)
| 2 | Data - Chinese Sentiment | [Jump To Cell](#data)
| 3.1 | Trying bert-base-chinese | [Jump To Cell](#bert)
| 3.2 | Trying liam168/c2-roberta-base-finetuned-dianping-chinese | [Jump To Cell](#liam168)
| 3.3 | Trying Jiabo/Roberta_Chinese_sentiment | [Jump To Cell](#jiabo)
| 4 | Results | [Jump To Cell](#results)


## <b> 1. Installing and Importing libraries </b>
<a id='lib'></a>

In [None]:
!pip install transformers datasets evaluate

In [2]:
from datasets import load_dataset
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import evaluate

## <b>2. Data - Chinese Sentiment</b>
<font size=3>Dataset link:</font>
https://huggingface.co/datasets/sepidmnorozy/Chinese_sentiment
<a id='data'></a>

Loading the Chinese Sentiment dataset

In [3]:
dataset = load_dataset("sepidmnorozy/Chinese_sentiment")

Downloading and preparing dataset csv/sepidmnorozy--Chinese_sentiment to /root/.cache/huggingface/datasets/sepidmnorozy___csv/sepidmnorozy--Chinese_sentiment-0f90f54b0dfdaa37/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/sepidmnorozy___csv/sepidmnorozy--Chinese_sentiment-0f90f54b0dfdaa37/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.



Creating dataframes from the train, test and validation splits of the dataset

In [None]:
df_train = pd.DataFrame(dataset["train"])
df_test = pd.DataFrame(dataset["test"])
df_valid = pd.DataFrame(dataset["validation"])

In [None]:
df = pd.concat([df_train, df_test, df_valid], ignore_index=True)

In [None]:
df.head()

| | label | text |
| -------- | -------- | -------- |
| 0 | 1 | 酒店 房間 算 不錯 ， 大 床 很 舒適 。 晚上 睡的 很香 ， 酒店 環境 不到 四星... |
| 1 | 1 | 對於 這本 鯉 , 我是 針對 於 它 每期 的 主題 而 買 的 ， 第一次 買 “ 鯉 ... |
| 2 | 0 | 晚上 朋友 帶 我們 去 吃 海鮮 ， 結果 半夜 兩點 兩人 都 上吐下瀉 ， 請 服務 ... |
| 3 | 1 | 教育 兩個 字 看似 簡單 ， 簡單 到 日常 的 威權 教育 。 於是 孩子 成了 父母 ... |
| 4 | 1 | 比較 老 的 飯店 了 , 房間 只能 用 還 比較 乾淨 整潔 來 形容 , 但 離 五星... |


In [None]:
df.shape

(19835, 2)

In [None]:
df.isnull().sum()

label    0
text     0
dtype: int64

In [None]:
df['label'].value_counts()

1    11914
0     7921
Name: label, dtype: int64

In [None]:
df['label'].value_counts() / len(df)

1    0.600655
0    0.399345
Name: label, dtype: float64

<font size=3>A brief exploration shows no null values in the label or text columns. The classes are imbalanced: about 60% of the reviews are positive (label 1) and about 40% are negative (label 0). Next, we create a BERT tokenizer specific to the Chinese language.</font>
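
Because accuracy is the evaluation metric and the classes are imbalanced, a useful reference point is the accuracy of a classifier that always predicts the majority class. The sketch below computes that baseline from the label counts printed above. The variable names are ours, and the counts cover the combined train, test, and validation data rather than a single split, so the figure is only an approximate floor for the per-split accuracies.

```python
# Label counts copied from the value_counts() output above
# (combined train + test + validation data).
label_counts = {1: 11914, 0: 7921}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = label_counts[majority_label] / total

print(f"total examples: {total}")
print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any fine-tuned model should score clearly above this baseline for its accuracy to be meaningful.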

## <b> 3.1 Trying bert-base-chinese </b>
<a id='bert'></a>

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

Examining the size of the vocabulary of the Chinese BERT tokenizer

In [None]:
vocab = tokenizer.vocab
len(vocab)

21128

Printing out the first few tokens in the vocabulary

In [None]:
print("token: index")
# Show the first five entries of the vocabulary dict (token -> index)
for token in list(vocab)[:5]:
    print(token, ": ", vocab[token])

token: index
##荷 :  18849
墜 :  1871
飽 :  7616
20cm :  13309
##蜢 :  19116


In [None]:
df['text'][0]

'酒店 房間 算 不錯 ， 大 床 很 舒適 。 晚上 睡的 很香 ， 酒店 環境 不到 四星 標準 ， 但 又 比 三星 高 。 地點 很好 ， 價錢 適中 。'

In [None]:
tokenizer(df['text'][0], padding="max_length", truncation=True)

{'input_ids': [101, 6983, 2421, 2791, 7279, 5050, 679, 7097, 8024, 1920, 2414, 2523, 5653, 6900, 511, 3241, 677, 4717, 4638, 2523, 7676, 8024, 6983, 2421, 4472, 1862, 679, 1168, 1724, 3215, 3560, 3976, 8024, 852, 1348, 3683, 676, 3215, 7770, 511, 1765, 7953, 2523, 1962, 8024, 1019, 7092, 6900, 704, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

<font size=3>The bert-base-chinese tokenizer has a vocabulary of 21,128 tokens. Looking inside it, we noticed some tokens that start with "##". This prefix comes from WordPiece: it marks a token that continues a word rather than starting one, so a word that is missing from the vocabulary can be split into a first piece followed by one or more "##" pieces. We also compared the text before and after tokenization. For context, the Chinese sentence above is a hotel review: the customer found the room comfortable, the location good, and the price reasonable.

source: https://huggingface.co/course/chapter6/6?fw=pt</font>
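
To make the role of the "##" prefix concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. It is an illustration only: `wordpiece_tokenize` and `toy_vocab` are hypothetical names and a made-up vocabulary, not the bert-base-chinese tokenizer or its real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"20", "##cm", "play", "##ing", "##ed"}

print(wordpiece_tokenize("20cm", toy_vocab))     # ['20', '##cm']
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

In this sketch a "##" token can only appear after a word-initial piece, which is why entries such as "##荷" sit in the vocabulary next to their unprefixed forms.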

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

Tokenizing the entire dataset using the tokenize_function defined above

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/12348 [00:00<?, ? examples/s]

Map:   0%|          | 0/4896 [00:00<?, ? examples/s]

Map:   0%|          | 0/2591 [00:00<?, ? examples/s]
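
The `padding="max_length", truncation=True` arguments in tokenize_function give every review the same length, which is why the tokenized example earlier ends in a long run of zeros. The sketch below mimics that behaviour with plain lists. It is a simplification under stated assumptions: a maximum length of 512 (matching the model's 512 position embeddings), a pad id of 0 (matching `padding_idx=0`), and plain cut-off truncation, whereas the real tokenizer also keeps the closing special token when it truncates. `pad_or_truncate` is a hypothetical helper, not a library function.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    ids = list(token_ids)[:max_length]      # simplified truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad      # pad on the right with pad_id
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


# Shortened illustration: a few ids from the example above plus [SEP] (102).
example_ids = [101, 6983, 2421, 2791, 7279, 102]
input_ids, attention_mask = pad_or_truncate(example_ids)

print(len(input_ids), sum(attention_mask))
```

Padding every example to the full length is simple but costly for short reviews; padding each batch to its longest member is a common alternative when training time matters.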

Removing the original text column from the tokenized dataset, renaming the label column to "labels", and setting the format of the tokenized dataset to PyTorch tensors

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

Separating the tokenized dataset into training, validation, and test datasets

In [None]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

Creating data loaders for the training, validation, and test datasets

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)

Instantiating a pre-trained Chinese BERT model for sequence classification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

Downloading pytorch_model.bin:   0%|          | 0.00/412M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Defining an optimizer for the model. We chose AdamW, as shown in class.

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
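
As a rough illustration of what the "linear" schedule with zero warmup steps does, the sketch below recomputes the step count and the learning rate at a few points in training. It assumes the train split has 12,348 examples (the first count in the Map output) and that the schedule decays linearly from the base rate to zero; `linear_lr` is our own helper written for this illustration, not the library implementation.

```python
import math

train_examples = 12348   # assumed size of the train split (from the Map output)
batch_size = 8
num_epochs = 3
base_lr = 5e-5

steps_per_epoch = math.ceil(train_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, total_steps, lr, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return lr * remaining / max(1, total_steps - warmup_steps)


print(f"steps per epoch: {steps_per_epoch}, total steps: {num_training_steps}")
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, num_training_steps, base_lr))
```

The total matches the 4632 steps shown by the training progress bar.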

Moving the model to the GPU if one is available

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element
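
The shapes in the printed embedding block are enough to count its parameters by hand. The sketch below does that arithmetic using only the numbers shown in the output above; it covers the embedding block alone, not the encoder layers or the classification head.

```python
# Shapes copied from the BertEmbeddings printout above.
embedding_shapes = {
    "word_embeddings": (21128, 768),
    "position_embeddings": (512, 768),
    "token_type_embeddings": (2, 768),
}
# LayerNorm with elementwise_affine=True has a weight and a bias per feature.
layer_norm_params = 2 * 768

param_counts = {name: rows * cols for name, (rows, cols) in embedding_shapes.items()}
total_embedding_params = sum(param_counts.values()) + layer_norm_params

for name, count in param_counts.items():
    print(f"{name}: {count:,}")
print(f"LayerNorm: {layer_norm_params:,}")
print(f"embedding block total: {total_embedding_params:,}")
```

Most of the block is the word embedding table, whose size scales directly with the 21,128-token vocabulary.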

Training with a tqdm progress bar

In [None]:
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/4632 [00:00<?, ?it/s]

Evaluation on "eval_dataloader" with accuracy metric

In [None]:
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

pred= tensor([0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0') ground_truth= tensor([0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0')
pred= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([0, 1, 1, 0, 1, 1, 1, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 1, 1, 1], device='cuda:0')
pred= tensor([0, 1, 0, 0, 1, 1, 0, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 1, 0, 1], device='cuda:0')
pred= tensor([0, 1, 0, 1, 0, 1, 1, 1], device='cuda:0') ground_truth= tensor([0, 1, 0, 1, 0, 1, 1, 1], device='cuda:0')
pred= tensor([1, 0, 0, 0, 0, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 0, 0, 0, 1, 1], device='cuda:0')
pred= tensor([1, 1, 1, 0, 0, 1, 1, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 0, 1, 1, 1], device='cuda:0')
pred= tensor([0, 1, 0, 0, 1, 0, 0, 0], device='cuda:0') ground_truth= tensor([0, 1, 0, 0, 1, 0, 0, 0], device='cuda:0')
pred= tensor([1, 1, 0, 1, 0, 0, 1, 1], d

{'accuracy': 0.9644924739482825}
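
The accuracy metric is accumulated batch by batch and reduces to the fraction of predictions that match the references. The sketch below reproduces that bookkeeping in plain Python on the first three batches printed above; it illustrates the arithmetic and is not the evaluate library's code.

```python
# (predictions, references) for the first three batches printed above.
batches = [
    ([0, 1, 1, 1, 0, 0, 1, 0], [0, 1, 1, 1, 0, 0, 1, 0]),
    ([1, 0, 0, 1, 1, 0, 1, 1], [1, 0, 0, 1, 1, 0, 1, 1]),
    ([0, 1, 1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1, 1, 1]),
]

correct = 0
total = 0
for predictions, references in batches:
    # Count positions where the predicted label equals the true label.
    correct += sum(int(p == r) for p, r in zip(predictions, references))
    total += len(references)

accuracy = correct / total
print(f"{correct}/{total} correct, accuracy = {accuracy:.4f}")
```

Only the third batch contains a mismatch, so 23 of these 24 predictions are correct.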

Evaluation on "test_dataloader" with accuracy metric

In [None]:
metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

pred= tensor([1, 1, 1, 0, 0, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 0, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 1, 1, 1, 1, 1, 0], device='cuda:0') ground_truth= tensor([1, 1, 1, 1, 1, 0, 0, 0], device='cuda:0')
pred= tensor([1, 1, 1, 0, 1, 0, 0, 1], device='cuda:0') ground_truth= tensor([1, 1, 1, 0, 1, 0, 0, 1], device='cuda:0')
pred= tensor([0, 1, 1, 0, 1, 1, 0, 0], device='cuda:0') ground_truth= tensor([0, 0, 1, 0, 1, 1, 0, 0], device='cuda:0')
pred= tensor([0, 1, 1, 0, 0, 1, 0, 0], device='cuda:0') ground_truth= tensor([0, 1, 1, 0, 0, 1, 0, 0], device='cuda:0')
pred= tensor([0, 0, 0, 0, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([0, 0, 0, 0, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 1, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 1, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0') ground_truth= tensor([1, 0, 0, 1, 1, 0, 1, 1], device='cuda:0')
pred= tensor([1, 1, 0, 1, 0, 0, 0, 1], d

{'accuracy': 0.9634395424836601}

## <b> 3.2 Trying liam168/c2-roberta-base-finetuned-dianping-chinese </b>
<a id='liam168'></a>

In [4]:
model = AutoModelForSequenceClassification.from_pretrained("liam168/c2-roberta-base-finetuned-dianping-chinese", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("liam168/c2-roberta-base-finetuned-dianping-chinese", model_max_length=512)

Downloading (…)lve/main/config.json:   0%|          | 0.00/854 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/409M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/377 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/269k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/12348 [00:00<?, ? examples/s]

Map:   0%|          | 0/4896 [00:00<?, ? examples/s]

Map:   0%|          | 0/2591 [00:00<?, ? examples/s]

In [6]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

train_dataloader = DataLoader(train_dataset, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)

In [7]:
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [9]:
progress_bar = tqdm(range(num_training_steps))
model.train()

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        # Print loss at every 100 steps
        if step % 100 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}] - Step [{step}/{len(train_dataloader)}] - Loss: {loss.item()}")

        progress_bar.update(1)

  0%|          | 0/4632 [00:00<?, ?it/s]

Epoch [1/3] - Step [0/1544] - Loss: 0.10450977087020874
Epoch [1/3] - Step [100/1544] - Loss: 0.09194397181272507
Epoch [1/3] - Step [200/1544] - Loss: 0.2944766879081726
Epoch [1/3] - Step [300/1544] - Loss: 0.03351861611008644
Epoch [1/3] - Step [400/1544] - Loss: 0.04639328643679619
Epoch [1/3] - Step [500/1544] - Loss: 0.09038589149713516
Epoch [1/3] - Step [600/1544] - Loss: 0.21144528687000275
Epoch [1/3] - Step [700/1544] - Loss: 0.03603355586528778
Epoch [1/3] - Step [800/1544] - Loss: 0.7431256771087646
Epoch [1/3] - Step [900/1544] - Loss: 0.21767641603946686
Epoch [1/3] - Step [1000/1544] - Loss: 0.1747177541255951
Epoch [1/3] - Step [1100/1544] - Loss: 0.16029979288578033
Epoch [1/3] - Step [1200/1544] - Loss: 0.039638299494981766
Epoch [1/3] - Step [1300/1544] - Loss: 0.07835166156291962
Epoch [1/3] - Step [1400/1544] - Loss: 0.3723095655441284
Epoch [1/3] - Step [1500/1544] - Loss: 0.10038720816373825
Epoch [2/3] - Step [0/1544] - Loss: 0.0427623875439167
Epoch [2/3] - St
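
The loss printed every 100 steps is the loss of a single batch of eight reviews, so it jumps around (from about 0.03 to 0.74 within the first epoch). A moving average makes the trend easier to read. The sketch below is one simple way to smooth such a log; `moving_average` is a hypothetical helper, the window size of 5 is an arbitrary choice, and the values are the epoch 1 losses from the log above rounded to four decimals.

```python
from collections import deque


def moving_average(values, window=5):
    """Average each value with up to window-1 values that precede it."""
    recent = deque(maxlen=window)
    averaged = []
    for value in values:
        recent.append(value)
        averaged.append(sum(recent) / len(recent))
    return averaged


# Epoch 1 losses from the log above, rounded to four decimals.
logged_losses = [
    0.1045, 0.0919, 0.2945, 0.0335, 0.0464, 0.0904, 0.2114, 0.0360,
    0.7431, 0.2177, 0.1747, 0.1603, 0.0396, 0.0784, 0.3723, 0.1004,
]

smoothed = moving_average(logged_losses, window=5)
for step, (raw, avg) in enumerate(zip(logged_losses, smoothed)):
    print(f"step {step * 100:>4}: loss={raw:.4f}  moving_avg={avg:.4f}")
```

Inside the training loop the same idea would be applied to `loss.item()` at every step rather than to the sampled log.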

In [10]:
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    # print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

{'accuracy': 0.9679660362794288}

In [11]:
metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    # print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

{'accuracy': 0.9677287581699346}

## <b> 3.3 Trying Jiabo/Roberta_Chinese_sentiment </b>
<a id='jiabo'></a>

Note: the cell below loads "voidful/albert_chinese_small_sentiment" (a 19.0M ALBERT checkpoint), not "Jiabo/Roberta_Chinese_sentiment", so the results reported for this section were obtained with the ALBERT model.

In [12]:
model = AutoModelForSequenceClassification.from_pretrained("voidful/albert_chinese_small_sentiment", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("voidful/albert_chinese_small_sentiment", model_max_length=512)

Downloading (…)lve/main/config.json:   0%|          | 0.00/965 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/19.0M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/379 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [13]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/12348 [00:00<?, ? examples/s]

Map:   0%|          | 0/4896 [00:00<?, ? examples/s]

Map:   0%|          | 0/2591 [00:00<?, ? examples/s]

In [14]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

train_dataloader = DataLoader(train_dataset, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)

In [15]:

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


  0%|          | 0/4632 [00:00<?, ?it/s]

In [20]:
metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    # print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

{'accuracy': 0.9378618294094944}

In [19]:
metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    # print("pred=",predictions,"ground_truth=",batch["labels"])

metric.compute()

{'accuracy': 0.9377042483660131}

## <b> 4. Results </b>
<a id='results'></a>

| SNo. | Model | Eval Accuracy | Test Accuracy | Training Time | Comments |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 3.1 | bert-base-chinese | 0.964 | 0.963 | 2 h | Largest |
| 3.2 | liam168/c2-roberta-base-finetuned-dianping-chinese | 0.968 | 0.968 | 58 min | Best overall |
| 3.3 | Jiabo/Roberta_Chinese_sentiment | 0.938 | 0.938 | 11 min | Most efficient |
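
The test accuracies of rows 3.1 and 3.2 differ by less than half a percentage point, so it helps to look at how much sampling noise to expect. The sketch below computes a normal-approximation 95% interval for each test accuracy. It rests on two assumptions: that the test split holds the 4,896 examples shown in the Map output, and that test examples are independent. The helper `accuracy_interval` is ours.

```python
import math


def accuracy_interval(accuracy, n_examples, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * se, accuracy + z * se


n_test = 4896  # assumed size of the test split (from the Map output)

# Test accuracies copied from the metric.compute() outputs above.
test_accuracy = {
    "3.1": 0.9634395424836601,
    "3.2": 0.9677287581699346,
    "3.3": 0.9377042483660131,
}

intervals = {row: accuracy_interval(acc, n_test) for row, acc in test_accuracy.items()}
for row, (low, high) in intervals.items():
    print(f"{row}: {test_accuracy[row]:.4f}  95% interval [{low:.4f}, {high:.4f}]")
```

Under these assumptions the intervals for 3.1 and 3.2 overlap, so the gap between them may not be meaningful on its own, while the interval for 3.3 lies clearly below both.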


GPU used

In [27]:
!nvidia-smi

Tue Mar 28 13:13:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    29W /  70W |   2045MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces