# Load Data

1. Download the dataset, which includes the following files:

    ```
    dataset/
    ├── data_identification.csv
    ├── emotion.csv
    ├── sampleSubmission.csv
    └── tweets_DM.json
    ```

    Description of each file:

    - `data_identification.csv`: Assigns each "tweet_id" to the train or test split.

    - `emotion.csv`: Assigns each "tweet_id" an emotion label.

    - `sampleSubmission.csv`: Demonstrates the format of `submission.csv`.

    - `tweets_DM.json`: The primary dataset, containing the tweets.


In [None]:
!kaggle competitions download -c dm-2024-isa-5810-lab-2-homework -p dataset/
!unzip dataset/dm-2024-isa-5810-lab-2-homework.zip -d dataset/

2. Load `tweets_DM.json` line by line into a list of dictionaries, then extract the `"tweet"` field of each record into a `DataFrame`.

In [4]:
import json
import pandas as pd

# Raw data: one JSON object per line; the train/test split is applied later
with open("dataset/tweets_DM.json") as f:
    tweets = [json.loads(data) for data in f]
    tweets = [tweet["_source"]["tweet"] for tweet in tweets]

df_tweets = pd.DataFrame(tweets)
df_tweets.head()

Unnamed: 0,hashtags,tweet_id,text
0,[Snapchat],0x376b20,"People who post ""add me on #Snapchat"" must be ..."
1,"[freepress, TrumpLegacy, CNN]",0x2d5350,"@brianklaas As we see, Trump is dangerous to #..."
2,[bibleverse],0x28b412,"Confident of your obedience, I write to you, k..."
3,[],0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>
4,[],0x2de201,"""Trust is not the same as faith. A friend is s..."
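The loader above reads one JSON object per line and keeps only the nested `"tweet"` entry. A minimal sketch of that shape follows, using two made-up records in memory; the IDs and texts are hypothetical, and only the `_source` → `tweet` nesting and the three field names come from the cell and its output.

```python
import io
import json

# Two hypothetical records in the JSON Lines layout the loader expects.
raw = io.StringIO(
    '{"_source": {"tweet": {"hashtags": ["a"], "tweet_id": "0x000001", "text": "hello #a <LH>"}}}\n'
    '{"_source": {"tweet": {"hashtags": [], "tweet_id": "0x000002", "text": "plain text"}}}\n'
)

# Parse each line separately, then keep the nested "tweet" dictionary.
records = [json.loads(line)["_source"]["tweet"] for line in raw]

print(sorted(records[0]))      # field names of one record
print(records[1]["tweet_id"])
```

Each resulting dictionary becomes one row when the list is passed to `pd.DataFrame`.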


3. Use `data_identification.csv` to determine whether each "tweet_id" belongs to the train or test set.

In [4]:
import csv

train_id = set()
test_id = set()
with open("dataset/data_identification.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[1] == "train":
            train_id.add(row[0])
        elif row[1] == "test":  # ignores a header row, if the file has one
            test_id.add(row[0])

4. Use `emotion.csv` to look up the emotion label of each "tweet_id".

In [5]:
import csv

emotion = {}
with open("dataset/emotion.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        emotion[row[0]] = row[1]
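Steps 3 and 4 can also be sketched with `csv.DictReader`, which consumes the header row and exposes columns by name. The file contents below are made up, and the header names `tweet_id`, `identification`, and `emotion` are assumptions about the real files rather than something the cells above confirm.

```python
import csv
import io

# Hypothetical file contents; the header names are assumptions.
identification_csv = io.StringIO(
    "tweet_id,identification\n0x000001,train\n0x000002,test\n0x000003,train\n"
)
emotion_csv = io.StringIO("tweet_id,emotion\n0x000001,joy\n0x000003,anger\n")

# Group tweet IDs by split; DictReader skips the header automatically.
split_ids = {"train": set(), "test": set()}
for row in csv.DictReader(identification_csv):
    split_ids[row["identification"]].add(row["tweet_id"])

# Map each labelled tweet ID to its emotion.
emotion_by_id = {row["tweet_id"]: row["emotion"] for row in csv.DictReader(emotion_csv)}

print(split_ids["train"], emotion_by_id)
```

Reading columns by name avoids accidentally treating the header line as data, at the cost of depending on the exact header spelling.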

5. Build `train_df` and `test_df`, then encode the emotion labels with `LabelEncoder`.

In [None]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

# Extract the train data and test data
train_df = df_tweets[df_tweets["tweet_id"].isin(train_id)].reset_index(drop=True)
test_df = df_tweets[df_tweets["tweet_id"].isin(test_id)].reset_index(drop=True)
# Add the emotion column to the train data
# Look up each tweet's emotion by ID (vectorized; an unmatched ID yields NaN)
train_df["emotion"] = train_df["tweet_id"].map(emotion)
train_df["label"] = labelencoder.fit_transform(train_df["emotion"])
train_df.head(10)
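`LabelEncoder` assigns integer codes in sorted order of the class names. The pure-Python sketch below mirrors that behaviour so the mapping is easy to inspect. Only five of the class names appear in the outputs of this notebook; `disgust`, `surprise`, and `trust` are assumptions made to complete a set of eight.

```python
# Assumed full label set; five names come from the notebook output, three are guesses.
emotion_names = [
    "joy", "anger", "sadness", "fear", "anticipation",
    "disgust", "surprise", "trust",
]

# Sorted unique class names, indexed from 0, as LabelEncoder does.
classes = sorted(set(emotion_names))
to_label = {name: idx for idx, name in enumerate(classes)}
to_name = {idx: name for name, idx in to_label.items()}

print(to_label)
```

Under this assumption the codes agree with the rows shown later (`anger` is 0, `joy` is 4, `sadness` is 5). With the fitted encoder itself, `labelencoder.inverse_transform` performs the reverse lookup.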

In [9]:
train_df = train_df.drop_duplicates(subset=["text"]).reset_index(drop=True)
train_df["text"].duplicated().sum()

0

In [10]:
# Free memory
del df_tweets
del tweets
del emotion
del train_id
del test_id

In [11]:
train_df.loc[:, "length"] = train_df["text"].apply(lambda x: len(x.split()))
train_df.loc[:, "<LH>"] = train_df["text"].apply(lambda x: x.count("<LH>"))
train_df.loc[:, "@"] = train_df["text"].apply(lambda x: x.count("@"))
train_df.loc[:, "#"] = train_df["text"].apply(lambda x: x.count("#"))
train_df.loc[:, "trash rate"] = (
    train_df["<LH>"] + train_df["@"] + train_df["#"]
) / train_df["length"]

test_df.loc[:, "length"] = test_df["text"].apply(lambda x: len(x.split()))
test_df.loc[:, "<LH>"] = test_df["text"].apply(lambda x: x.count("<LH>"))
test_df.loc[:, "@"] = test_df["text"].apply(lambda x: x.count("@"))
test_df.loc[:, "#"] = test_df["text"].apply(lambda x: x.count("#"))
test_df.loc[:, "trash rate"] = (
    test_df["<LH>"] + test_df["@"] + test_df["#"]
) / test_df["length"]
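The five columns added above can be read as a single per-tweet computation. The sketch below expresses it as one function; the name `noise_features` is hypothetical, and the guard against empty text is an addition that the cell above does not have.

```python
def noise_features(text):
    """Return the word count, marker counts, and trash rate for one tweet."""
    length = len(text.split())  # whitespace-separated tokens
    counts = {marker: text.count(marker) for marker in ("<LH>", "@", "#")}
    # Share of markers relative to the token count; 0.0 for empty text (added guard).
    rate = sum(counts.values()) / length if length else 0.0
    return {"length": length, **counts, "trash rate": rate}


row = noise_features("Now ISSA is stalking Tasha 😂😂😂 <LH>")
print(row)
```

For this tweet the result is 7 tokens, one `<LH>`, and a trash rate of 1/7, which is the 0.142857 shown for the same text in the `train_df` output.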

In [12]:
trash_rate = 2 / 3
lh_max = 4

lh_rate_count = train_df[train_df["trash rate"] >= trash_rate].shape[0]
lh_count = train_df[train_df["<LH>"] > lh_max].shape[0]

print(f"Number of rows with (trash rate >= {trash_rate:.2f}): {lh_rate_count}")
print(f"Number of rows with (<LH> > {lh_max}): {lh_count}")


# Select the rows with trash rate >= 2/3 or <LH> >= 5
filtered_df = train_df[
    (train_df["trash rate"] >= trash_rate) | (train_df["<LH>"] > lh_max)
]

# Convert the filtered dataframe to a list of dictionaries
filtered_texts = filtered_df.to_dict(orient="records")

# Save the list of dictionaries to a JSON file
with open("filtered_texts.json", "w") as outfile:
    json.dump(filtered_texts, outfile, ensure_ascii=False, indent=4)

Number of rows with (trash rate >= 0.67): 38505
Number of rows with (<LH> > 4): 39817


In [13]:
trash_rate = 2 / 3
lh_max = 4

lh_rate_count = test_df[test_df["trash rate"] >= trash_rate].shape[0]
lh_count = test_df[test_df["<LH>"] > lh_max].shape[0]

print(f"Number of rows with (trash rate >= {trash_rate:.2f}): {lh_rate_count}")
print(f"Number of rows with (<LH> > {lh_max}): {lh_count}")


# Select the rows with trash rate >= 2/3 or <LH> >= 5
filtered_df = test_df[
    (test_df["trash rate"] >= trash_rate) | (test_df["<LH>"] > lh_max)
]

# Convert the filtered dataframe to a list of dictionaries
filtered_texts = filtered_df.to_dict(orient="records")

# Save the list of dictionaries to a JSON file
with open("filtered_texts_test.json", "w") as outfile:
    json.dump(filtered_texts, outfile, ensure_ascii=False, indent=4)

Number of rows with (trash rate >= 0.67): 548
Number of rows with (<LH> > 4): 0


In [14]:
# Remove rows with trash rate >= 2/3 or <LH> >= 5
train_df = train_df[
    (train_df["trash rate"] < trash_rate) & (train_df["<LH>"] <= lh_max)
].copy()  # copy so later column assignments do not act on a slice

In [15]:
print("train data")
lh_rate_count = train_df[train_df["trash rate"] >= trash_rate].shape[0]
lh_count = train_df[train_df["<LH>"] > lh_max].shape[0]

print(f"Number of rows with (trash rate >= {trash_rate:.2f}): {lh_rate_count}")
print(f"Number of rows with (<LH> > {lh_max}): {lh_count}")

train data
Number of rows with (trash rate >= 0.67): 0
Number of rows with (<LH> > 4): 0


In [16]:
train_df.head(10)

Unnamed: 0,hashtags,tweet_id,text,emotion,label,length,<LH>,@,#,trash rate
0,[Snapchat],0x376b20,"People who post ""add me on #Snapchat"" must be ...",anticipation,1,14,1,0,1,0.142857
1,"[freepress, TrumpLegacy, CNN]",0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",sadness,5,18,2,1,3,0.333333
2,[],0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,fear,3,7,1,0,0,0.142857
3,"[authentic, LaughOutLoud]",0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,joy,4,15,1,2,2,0.333333
4,[],0x2c91a8,Still waiting on those supplies Liscus. <LH>,anticipation,1,7,1,0,0,0.142857
5,[],0x368e95,Love knows no gender. 😢😭 <LH>,joy,4,6,1,0,0,0.166667
6,[LeagueCup],0x249c0c,@DStvNgCare @DStvNg More highlights are being ...,sadness,5,17,1,2,1,0.235294
7,"[SSM, gender, diversity]",0x359db9,The #SSM debate; <LH> (a manufactured fantasy ...,anticipation,1,22,1,0,3,0.181818
8,[],0x23b037,I love suffering 🙃🙃 I love when valium does no...,joy,4,27,1,0,0,0.037037
9,[Pissed],0x1fde89,Can someone tell my why my feeds scroll back t...,anger,0,21,0,0,1,0.047619


In [17]:
test_df.head(10)

Unnamed: 0,hashtags,tweet_id,text,length,<LH>,@,#,trash rate
0,[bibleverse],0x28b412,"Confident of your obedience, I write to you, k...",24,2,0,1,0.125
1,[],0x2de201,"""Trust is not the same as faith. A friend is s...",25,2,0,0,0.08
2,"[materialism, money, possessions]",0x218443,When do you have enough ? When are you satisfi...,23,1,0,3,0.173913
3,"[GodsPlan, GodsWork]",0x2939d5,"God woke you up, now chase the day #GodsPlan #...",11,1,0,2,0.272727
4,[],0x26289a,"In these tough times, who do YOU turn to as yo...",15,1,0,0,0.066667
5,[],0x31c6e0,Turns out you can recognise people by their un...,10,1,0,0,0.1
6,[sheltered],0x32edee,"I like how Hayvens mommy, daddy, and the keybo...",22,1,0,1,0.090909
7,[notamused],0x3714ee,I just love it when every single one of my son...,24,1,0,1,0.083333
8,[CelebrityBigBrother],0x235628,@JulieChen when can we expect a season of #Cel...,15,1,1,1,0.2
9,[],0x283024,Tbh. Regret hurts more than stepping on a LEGO...,10,1,0,0,0.1


# Data Preprocessing: Data cleaning

## Preprocess Function

In [18]:
import re
import contractions  # for expanding contractions
import emoji
from spellchecker import SpellChecker

spell = SpellChecker()


def preprocess_tweet(text):
    # text = contractions.fix(text)  # expand contractions
    # text = re.sub(r"http\S+", "[URL]", text)  # remove URL
    # replace .com addresses (with or without http/https) with <URL>
    text = re.sub(r"(https?://)?[\w.-]+\.com(\.\w+)?", "<URL>", text)
    # text = re.sub(r"@\S+", "", text)  # remove mentions
    # text = re.sub(r"#\S+", "", text)  # remove hashtag
    # text = re.sub(r"\+\S+", "", text)  # remove phone number
    text = re.sub(r"<LH>", "<mask>", text)  # replace <LH> with <mask>
    # text = emoji.replace_emoji(text, replace="")  # remove emojis
    # text = re.sub(r"(\W)\1+", r"\1", text)  # remove repeating characters
    # text = re.sub(r"[^\w\s]", "", text)  # remove punctuations
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into one space

    return text
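The active URL pattern in `preprocess_tweet` only matches hosts ending in `.com`, and it stops at the host name. The sketch below shows what it does on three made-up strings, next to a broader pattern. `BROADER` is an assumed alternative for comparison, not something this notebook uses.

```python
import re

ORIGINAL = r"(https?://)?[\w.-]+\.com(\.\w+)?"
# Assumed alternative: any http(s):// or www. run up to the next whitespace.
BROADER = r"https?://\S+|www\.\S+"

samples = [
    "see http://example.com/path?x=1 now",
    "see https://t.co/abc now",
    "mail user@gmail.com now",
]

original_out = [re.sub(ORIGINAL, "<URL>", s) for s in samples]
broader_out = [re.sub(BROADER, "<URL>", s) for s in samples]

for before, a, b in zip(samples, original_out, broader_out):
    print(f"{before!r}\n  original: {a!r}\n  broader:  {b!r}")
```

The original pattern leaves the path after `example.com`, skips the `t.co` link, and rewrites the domain of the e-mail address. Whether that matters depends on how common such cases are in the tweets.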

### Example

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Twitter/twhin-bert-base")

text = "Don't miss this amazing offer! Check it out now: http://example.com #AmazingOffer 😊😊😊 @username123 <LH> Save $100 today! 🎉🎉🎉 #DealOfTheDay Contact us at +123456789. <LH>"

text = preprocess_tweet(text)
print(text)
token_text = tokenizer.tokenize(text)
print(token_text)
print(type(token_text))

Don't miss this amazing offer! Check it out now: <URL> #AmazingOffer 😊😊😊 @username123 <mask> Save $100 today! 🎉🎉🎉 #DealOfTheDay Contact us at +123456789. <mask>
['▁Don', "'", 't', '▁miss', '▁this', '▁amazing', '▁offer', '!', '▁Check', '▁it', '▁out', '▁now', ':', '▁<', 'URL', '>', '▁#', 'A', 'maz', 'ing', 'Off', 'er', '▁', '😊', '😊', '😊', '▁@', 'user', 'name', '123', ' <mask>', '▁Save', '▁$100', '▁today', '!', '▁', '🎉', '🎉', '🎉', '▁#', 'De', 'al', 'Of', 'The', 'Day', '▁Contact', '▁us', '▁at', '▁+', '1234', '56', '789', '.', ' <mask>']
<class 'list'>


## Preprocessing

In [20]:
from multiprocessing import Pool
from tqdm import tqdm
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"


def preprocess_with_progress(data, func, num_workers=4):
    with Pool(num_workers) as pool:
        # Wrap the iterator in tqdm to show a progress bar
        results = list(tqdm(pool.imap(func, data), total=len(data)))
    return results


# Apply to the training and test datasets
train_df["text"] = preprocess_with_progress(train_df["text"], preprocess_tweet)
test_df["text"] = preprocess_with_progress(test_df["text"], preprocess_tweet)

100%|██████████| 1375950/1375950 [00:23<00:00, 59486.74it/s]
100%|██████████| 411972/411972 [00:07<00:00, 57964.82it/s]


# Data Used for Training

In [21]:
train_df.head(10)

Unnamed: 0,hashtags,tweet_id,text,emotion,label,length,<LH>,@,#,trash rate
0,[Snapchat],0x376b20,"People who post ""add me on #Snapchat"" must be ...",anticipation,1,14,1,0,1,0.142857
1,"[freepress, TrumpLegacy, CNN]",0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",sadness,5,18,2,1,3,0.333333
2,[],0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <mask>,fear,3,7,1,0,0,0.142857
3,"[authentic, LaughOutLoud]",0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,joy,4,15,1,2,2,0.333333
4,[],0x2c91a8,Still waiting on those supplies Liscus. <mask>,anticipation,1,7,1,0,0,0.142857
5,[],0x368e95,Love knows no gender. 😢😭 <mask>,joy,4,6,1,0,0,0.166667
6,[LeagueCup],0x249c0c,@DStvNgCare @DStvNg More highlights are being ...,sadness,5,17,1,2,1,0.235294
7,"[SSM, gender, diversity]",0x359db9,The #SSM debate; <mask> (a manufactured fantas...,anticipation,1,22,1,0,3,0.181818
8,[],0x23b037,I love suffering 🙃🙃 I love when valium does no...,joy,4,27,1,0,0,0.037037
9,[Pissed],0x1fde89,Can someone tell my why my feeds scroll back t...,anger,0,21,0,0,1,0.047619


In [22]:
test_df.head(10)

Unnamed: 0,hashtags,tweet_id,text,length,<LH>,@,#,trash rate
0,[bibleverse],0x28b412,"Confident of your obedience, I write to you, k...",24,2,0,1,0.125
1,[],0x2de201,"""Trust is not the same as faith. A friend is s...",25,2,0,0,0.08
2,"[materialism, money, possessions]",0x218443,When do you have enough ? When are you satisfi...,23,1,0,3,0.173913
3,"[GodsPlan, GodsWork]",0x2939d5,"God woke you up, now chase the day #GodsPlan #...",11,1,0,2,0.272727
4,[],0x26289a,"In these tough times, who do YOU turn to as yo...",15,1,0,0,0.066667
5,[],0x31c6e0,Turns out you can recognise people by their un...,10,1,0,0,0.1
6,[sheltered],0x32edee,"I like how Hayvens mommy, daddy, and the keybo...",22,1,0,1,0.090909
7,[notamused],0x3714ee,I just love it when every single one of my son...,24,1,0,1,0.083333
8,[CelebrityBigBrother],0x235628,@JulieChen when can we expect a season of #Cel...,15,1,1,1,0.2
9,[],0x283024,Tbh. Regret hurts more than stepping on a LEGO...,10,1,0,0,0.1


# Model Training

## Hyperparameter

In [32]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_name = "Twitter/twhin-bert-base"

# Hyperparameters
# gamma = 0.95
train_batch_size = 256
val_batch_size = 256
dropout_rate = 0.1
lr = 2e-5
epochs = 8
val_split = 0.1

## Dataset

In [33]:
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split


class TweetDataset(Dataset):
    def __init__(self, df):
        super().__init__()
        self.id = df["tweet_id"].tolist()
        self.text = df["text"].tolist()
        if "label" in df.columns:
            self.label = df["label"].tolist()
        else:
            self.label = None

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        item = {"id": self.id[idx], "text": self.text[idx]}
        if self.label is not None:
            item["label"] = self.label[idx]
        return item


# Split the train data into training and validation sets
train_data, val_data = train_test_split(train_df, test_size=val_split, random_state=42)

# Create datasets
ds_train = TweetDataset(train_data)
ds_val = TweetDataset(val_data)

## Dataloader

In [34]:
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(model_name)


# Tokenize a batch of texts and move the inputs and labels to the device
def collate_fn(batch):
    inputs = tokenizer(
        [data["text"] for data in batch],
        padding=True,
        max_length=128,
        truncation=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    labels = torch.tensor([data["label"] for data in batch], dtype=torch.long)
    labels = labels.to(device)
    return inputs, labels


dl_train = DataLoader(
    ds_train,
    batch_size=train_batch_size,
    shuffle=True,
    collate_fn=collate_fn,
)
dl_val = DataLoader(
    ds_val,
    batch_size=val_batch_size,
    shuffle=False,  # validation order does not need to be shuffled
    collate_fn=collate_fn,
)
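The number of batches per epoch follows from the row count, the validation split, and the batch size. The sketch below redoes that arithmetic. The row count is taken from the preprocessing progress bar, and it assumes `train_test_split` rounds the validation share up to a whole row.

```python
import math

n_rows = 1_375_950  # rows in train_df after filtering (from the progress bar)
val_split = 0.1
batch_size = 256

n_val = math.ceil(n_rows * val_split)  # assumed rounding of the validation share
n_train = n_rows - n_val

# The last, partially filled batch still counts as a batch.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(train_batches, val_batches)
```

These counts correspond to `len(dl_train)` and `len(dl_val)` and agree with the 4838 and 538 steps shown in the training progress bars.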

## Model Definition

In [35]:
class TweetEmotionClassifier(torch.nn.Module):
    def __init__(self, model_name, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = torch.nn.Dropout(p=dropout)
        self.linear = torch.nn.Linear(self.bert.config.hidden_size, 8)

    def forward(self, **kwargs):
        output = self.bert(**kwargs)
        cls_output = output.last_hidden_state[:, 0, :]
        cls_output = self.dropout(cls_output)
        logits = self.linear(cls_output)

        return logits

    def extract_features(self, **kwargs):
        output = self.bert(**kwargs)
        cls_output = output.last_hidden_state[:, 0, :]
        return cls_output


model = TweetEmotionClassifier(model_name, dropout=dropout_rate)

Some weights of BertModel were not initialized from the model checkpoint at Twitter/twhin-bert-base and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
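The classifier head keeps only the hidden state at position 0 (the `[CLS]` position) and maps it to eight logits. A small sketch of the tensor shapes involved follows, using random numbers in place of the encoder output. The hidden size of 768 and the toy batch and sequence sizes are assumptions for illustration.

```python
import torch

batch_size, seq_len, hidden_size, num_classes = 4, 16, 768, 8

# Stand-in for output.last_hidden_state: [batch, tokens, hidden].
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

cls_output = last_hidden_state[:, 0, :]  # first token of each sequence: [batch, hidden]
dropout = torch.nn.Dropout(p=0.1)
linear = torch.nn.Linear(hidden_size, num_classes)

logits = linear(dropout(cls_output))  # [batch, num_classes]
predicted = logits.argmax(dim=-1)     # one class index per tweet

print(cls_output.shape, logits.shape, predicted.shape)
```

`extract_features` returns the `cls_output` stage of this pipeline, before dropout and the linear layer.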


## Optimizer and Loss Function

In [None]:
from torch.optim import AdamW
from torchmetrics.classification import MulticlassAccuracy, MulticlassF1Score
from torch.optim.lr_scheduler import ExponentialLR

class_counts = train_df["label"].value_counts()
class_weights = 1.0 / class_counts
class_weights /= class_weights.sum()
class_weights = torch.tensor(class_weights.sort_index().values, dtype=torch.float32)
print(class_weights)

optimizer = AdamW(model.parameters(), lr=lr)
# scheduler = ExponentialLR(
#     optimizer,
#     # gamma=gamma,
# )

criteria = torch.nn.CrossEntropyLoss()  # weight=class_weights).to(device)

num_classes = len(class_counts)  # number of distinct labels, as an int
acc = MulticlassAccuracy(num_classes=num_classes).to(device)
f1 = MulticlassF1Score(num_classes=num_classes).to(device)

tensor([0.2930, 0.0497, 0.0844, 0.1835, 0.0242, 0.0609, 0.2464, 0.0580])
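The printed weights are normalized inverse class frequencies: rarer classes receive larger weights, and the weights sum to 1. Note that the cell computes them but leaves the `weight=` argument of the loss commented out. The sketch below repeats the computation in plain Python on made-up counts.

```python
# Hypothetical class counts, keyed by label index.
counts = {0: 100, 1: 400, 2: 500}

# Invert each count, then normalize so the weights sum to 1.
inverse = {label: 1.0 / n for label, n in counts.items()}
total = sum(inverse.values())
weights = [inverse[label] / total for label in sorted(inverse)]

print([round(w, 4) for w in weights])
```

Sorting by label index keeps the weight order aligned with the logit order, which is what `sort_index()` does in the cell above.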


In [37]:
# from transformers import get_linear_schedule_with_warmup

# # Warmup
# total_steps = len(dl_train) * epochs

# warmup_ratio = 0.1
# warmup_steps = int(total_steps * warmup_ratio)

# scheduler = get_linear_schedule_with_warmup(
#     optimizer,
#     num_warmup_steps=warmup_steps,
#     num_training_steps=total_steps,
# )
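The commented-out scheduler ramps the learning rate linearly from 0 to its base value over the warmup steps, then decays it linearly to 0. A sketch of that multiplier as a plain function follows. The function name is hypothetical, and the step counts reuse the 4838 training batches and 8 epochs from this notebook.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp up
    # Linear decay from 1 at the end of warmup to 0 at the last step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 4838 * 8                 # batches per epoch * epochs
warmup_steps = int(total_steps * 0.1)  # warmup_ratio = 0.1

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, round(2e-5 * linear_warmup_factor(step, warmup_steps, total_steps), 8))
```

A schedule like this advances once per batch, so `scheduler.step()` would belong inside the batch loop rather than at the end of each epoch.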

## Training and Evaluation

In [38]:
from tqdm import tqdm
import os

saved_dic = f"./twhin-bert_b{train_batch_size}_Wpre_v1_&_rm_LH"
if not os.path.exists(saved_dic):
    os.makedirs(saved_dic)

best_model = {"ep": -1, "loss": float("inf")}
model.to(device)
for ep in range(epochs):
    model.train()
    bar = tqdm(dl_train, desc=f"Training Epoch [{ep + 1}/{epochs}]")
    train_loss = 0
    for inputs, labels in bar:
        # inputs = inputs.to(device)
        # labels = labels.to(device)

        optimizer.zero_grad()
        logits = model(**inputs)
        loss = criteria(logits, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        bar.set_postfix(
            loss=train_loss / (bar.n + 1), lr=optimizer.param_groups[0]["lr"]
        )

    # scheduler.step()

    model.eval()
    bar = tqdm(dl_val, desc=f"Validation Epoch [{ep + 1}/{epochs}]")
    val_loss = 0
    acc.reset()
    f1.reset()
    with torch.no_grad():
        for inputs, labels in bar:
            # inputs = inputs.to(device)
            # labels = labels.to(device)

            logits = model(**inputs)
            loss = criteria(logits, labels)

            val_loss += loss.item()
            bar.set_postfix(loss=val_loss / (bar.n + 1))

            acc.update(logits, labels)
            f1.update(logits, labels)

    # Track the best epoch by summed validation loss (ep is 0-based, matching the checkpoint names)
    if val_loss < best_model["loss"]:
        best_model["ep"] = ep
        best_model["loss"] = val_loss

    print(f"Accuracy: {acc.compute():.4f}")
    print(f"F1 Score: {f1.compute():.4f}")
    torch.save(model, f"{saved_dic}/ep{ep}.ckpt")

print(f"Best model is at epoch {best_model['ep']} with loss {best_model['loss']}")

Training Epoch [1/8]: 100%|██████████| 4838/4838 [45:02<00:00,  1.79it/s, loss=1.04, lr=2e-5]
Validation Epoch [1/8]: 100%|██████████| 538/538 [01:48<00:00,  4.95it/s, loss=0.936]


Accuracy: 0.6613
F1 Score: 0.6613


Training Epoch [2/8]: 100%|██████████| 4838/4838 [44:59<00:00,  1.79it/s, loss=0.896, lr=2e-5]
Validation Epoch [2/8]: 100%|██████████| 538/538 [01:47<00:00,  4.99it/s, loss=0.914]


Accuracy: 0.6690
F1 Score: 0.6690


Training Epoch [3/8]: 100%|██████████| 4838/4838 [44:58<00:00,  1.79it/s, loss=0.824, lr=2e-5]
Validation Epoch [3/8]: 100%|██████████| 538/538 [01:47<00:00,  5.02it/s, loss=0.906]


Accuracy: 0.6771
F1 Score: 0.6771


Training Epoch [4/8]: 100%|██████████| 4838/4838 [45:03<00:00,  1.79it/s, loss=0.761, lr=2e-5] 
Validation Epoch [4/8]: 100%|██████████| 538/538 [01:47<00:00,  5.00it/s, loss=0.914]


Accuracy: 0.6784
F1 Score: 0.6784


Training Epoch [5/8]: 100%|██████████| 4838/4838 [45:02<00:00,  1.79it/s, loss=0.7, lr=2e-5]  
Validation Epoch [5/8]: 100%|██████████| 538/538 [01:47<00:00,  5.01it/s, loss=0.938]


Accuracy: 0.6768
F1 Score: 0.6768


Training Epoch [6/8]: 100%|██████████| 4838/4838 [45:00<00:00,  1.79it/s, loss=0.644, lr=2e-5]
Validation Epoch [6/8]: 100%|██████████| 538/538 [01:47<00:00,  5.00it/s, loss=0.978]


Accuracy: 0.6746
F1 Score: 0.6746


Training Epoch [7/8]: 100%|██████████| 4838/4838 [45:01<00:00,  1.79it/s, loss=0.588, lr=2e-5]
Validation Epoch [7/8]: 100%|██████████| 538/538 [01:47<00:00,  5.00it/s, loss=1.01]


Accuracy: 0.6707
F1 Score: 0.6707


Training Epoch [8/8]: 100%|██████████| 4838/4838 [44:59<00:00,  1.79it/s, loss=0.536, lr=2e-5]
Validation Epoch [8/8]: 100%|██████████| 538/538 [01:48<00:00,  4.96it/s, loss=1.06]


Accuracy: 0.6663
F1 Score: 0.6663
Best model is at epoch 2 with loss 487.21711856126785
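The final line reports a 0-based epoch index and a loss summed over all validation batches, so it is not directly comparable to the per-batch averages in the progress bars. The sketch below converts both, using only numbers printed above.

```python
summed_val_loss = 487.21711856126785  # printed by the training cell
num_val_batches = 538                 # from the validation progress bar
best_ep_zero_based = 2                # best_model["ep"]

mean_val_loss = summed_val_loss / num_val_batches
best_epoch_one_based = best_ep_zero_based + 1
checkpoint_name = f"ep{best_ep_zero_based}.ckpt"  # file name used by torch.save above

print(
    f"Epoch {best_epoch_one_based}: mean validation loss "
    f"{mean_val_loss:.4f} ({checkpoint_name})"
)
```

The mean comes to about 0.906, which is the validation loss shown for epoch 3 of 8, so the checkpoint to reuse is `ep2.ckpt`.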
