# DAT341 Applied Machine Learning

## Assignment 3: Stance classification

### Part 1: Solve the Basic Problem

First, let's import the datasets. There are two of them: one is used for training and the other for testing. We also print the first few rows of each to see the format of the data.

In [2]:
import numpy as np
import pandas as pd

train_file="a3_train_final.tsv"
test_file="a3_test.tsv"

df_train=pd.read_csv(train_file,sep="\t",header=None,names=["label","text"])
df_test=pd.read_csv(test_file,sep="\t",header=None,names=["label","text"] )

print("Train sample")
print(df_train.head())
print("Test sample")
print(df_test.head())

Train sample
     label                                               text
0      0/0  A woman's autonomous right to decide for her o...
1     1/-1   And yet ignorance is an enemy, even to its owner
2      0/0  Anti vaxxer??? So if I have had a tetanus shot...
3      0/0  Are you planning on getting the vaccine when i...
4  0/0/0/0  Benefits outweigh the risks. Yes, the benefits...
Test sample
   label                                               text
0      0  Extremely rare is only good if it doesn't happ...
1      1  I have two parents in their 70s. Both had the ...
2      0  Not getting vaccinated is still more dangerous...
3      1  The average life expectancy of a human is 74 y...
4      1  Trust the science is a dumb saying. Science is...


Before we preprocess the data, we first need to handle the multi-annotation problem: two or more annotators may have labelled the same comment differently. To simplify the situation, we take the average of the annotation values for each comment (counting a -1 annotation as 0.5) and round it to the nearest integer.

In [5]:
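def resolve_labels(label_str):
    # Split an annotation string such as "1/-1" into integer labels.
    labels = list(map(int, label_str.split('/')))
    # Count a -1 annotation as 0.5 so it pulls the average toward neither class.
    labels = [0.5 if l == -1 else l for l in labels]
    # Average and round; an average of exactly 0.5 resolves to 0.
    return round(np.mean(labels))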

df_train["label"] = df_train["label"].astype(str).apply(resolve_labels)
df_train = df_train.dropna(subset=["label"]).reset_index(drop=True)

print(df_train["label"].value_counts()) 

label
0    5889
1    5707
Name: count, dtype: int64
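
As a minimal sketch of how the averaging rule behaves on individual label strings, the snippet below re-implements the rule from the cell above in plain Python. The example strings are made up for illustration and are not taken from the dataset. One detail worth knowing is that Python's `round` sends an exact half to the nearest even integer, so an average of exactly 0.5 resolves to 0.

```python
def resolve_labels(label_str):
    """Average the annotations, counting -1 as 0.5, then round to an integer."""
    labels = [0.5 if l == -1 else l for l in map(int, label_str.split("/"))]
    return round(sum(labels) / len(labels))

# Hypothetical annotation strings in the same "a/b/c" format as the training file.
examples = ["0/0", "1/-1", "0/1", "1/1/0", "-1"]
resolved = {s: resolve_labels(s) for s in examples}

for label_str, label in resolved.items():
    print(f"{label_str:>6} -> {label}")
```

With this rule an even split such as `0/1` ends up in class 0, which is a design choice rather than a neutral outcome.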


The number of comments labelled '0' is similar to the number labelled '1', so the classes are roughly balanced and we can move on to the text of the comments. As the dataset shows, and as one would expect from comments on YouTube or other platforms, the text may contain stray symbols and uninformative words. We therefore clean it by converting to lowercase, removing punctuation, removing stopwords, and collapsing extra whitespace. Stemming or lemmatization could be applied as a further step, but is not done here.

In [8]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

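def preprocess_text(text):
    # Lowercase so that "Vaccine" and "vaccine" map to the same token.
    text = text.lower()
    # Remove everything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # split() with no argument also collapses repeated whitespace.
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return " ".join(words)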

df_train["clean_text"] = df_train["text"].apply(preprocess_text)
df_test["clean_text"] = df_test["text"].apply(preprocess_text)

print("\nProcessed Text Sample:")
print(df_train.head())



Processed Text Sample:
   label                                               text  \
0      0  A woman's autonomous right to decide for her o...   
1      1   And yet ignorance is an enemy, even to its owner   
2      0  Anti vaxxer??? So if I have had a tetanus shot...   
3      0  Are you planning on getting the vaccine when i...   
4      0  Benefits outweigh the risks. Yes, the benefits...   

                                          clean_text  
0  womans autonomous right decide body dont auton...  
1                     yet ignorance enemy even owner  
2  anti vaxxer tetanus shot typhoid fever hepatit...  
3  planning getting vaccine available nope hopefu...  
4  benefits outweigh risks yes benefits getting s...  


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xiach\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
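
To see what the cleaning steps do to a single comment, here is a small self-contained sketch. It uses a tiny hand-written stopword set as a stand-in for NLTK's English list (an assumption made so the snippet runs without any download), and the input sentence is invented.

```python
import re

# Hypothetical miniature stopword set, standing in for NLTK's English list.
STOP_WORDS = {"a", "to", "for", "her", "i", "don't"}

def preprocess_text(text, stop_words=STOP_WORDS):
    text = text.lower()
    # Punctuation is removed first, so apostrophes inside words disappear too.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = preprocess_text("A woman's right to decide; I don't agree!")
print(cleaned)
```

Because punctuation is stripped before the stopword filter runs, a contraction like "don't" becomes "dont" and no longer matches a stopword entry that is stored with its apostrophe. This would explain why "dont" survives in the processed sample above.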


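Next, we convert the cleaned text into TF-IDF features. The vectorizer is fitted on the training set only and then applied to the test set, with the vocabulary capped at 5000 terms.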

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)

X_train = vectorizer.fit_transform(df_train["clean_text"])
y_train = df_train["label"]

X_test = vectorizer.transform(df_test["clean_text"])
y_test = df_test["label"]

print("\nCompleted!!!!!!!")



Completed!!!!!!!
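
For intuition about what the vectorizer computes, the sketch below builds TF-IDF vectors by hand for two invented documents. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row, which is my understanding of scikit-learn's default settings. Tokenisation is simplified to a whitespace split.

```python
import math
from collections import Counter

def tfidf_fit_transform(docs):
    """Tiny TF-IDF: term counts times smoothed idf, then L2-normalised rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Invented two-document corpus, only for illustration.
docs = ["vaccine safe", "vaccine risk risk"]
vocab, rows = tfidf_fit_transform(docs)

print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the first document the rarer word "safe" receives a larger weight than "vaccine", which occurs in both documents.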


Logistic Regression and Support Vector Machine

In this task, we choose Logistic Regression and Support Vector Machine as classification models because both are well suited to high-dimensional sparse data such as TF-IDF vectors. Logistic Regression is an efficient linear classifier with low computational cost, and its learned weights are easy to interpret for a binary classification task. A linear SVM looks for the decision boundary with the largest margin between the two classes, which tends to generalize well in high-dimensional spaces.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

svm_model = SVC(kernel="linear")
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

Print the classification reports for the two models.

In [16]:
print("\nLogistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))

print("\nSVM Performance:")
print(classification_report(y_test, y_pred_svm))


Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       267
           1       0.85      0.77      0.81       267

    accuracy                           0.81       534
   macro avg       0.82      0.81      0.81       534
weighted avg       0.82      0.81      0.81       534


SVM Performance:
              precision    recall  f1-score   support

           0       0.80      0.86      0.83       267
           1       0.85      0.79      0.82       267

    accuracy                           0.83       534
   macro avg       0.83      0.83      0.83       534
weighted avg       0.83      0.83      0.83       534
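
The columns in these reports can be reproduced by hand from counts of correct and incorrect predictions. The following sketch computes precision, recall and F1 for one class on a small invented list of labels. The function name `binary_report` is hypothetical and is not part of scikit-learn.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels: two true positives, one false positive, one false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = binary_report(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision asks how many of the comments predicted as class 1 really are class 1, while recall asks how many of the true class 1 comments were found.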



According to the results, the SVM performs slightly better: it reaches 0.83 accuracy on the 534 test comments, compared with 0.81 for Logistic Regression. Thus, we choose the SVM as our model for this part.

### Part 2: Try out the powerful modern text representation model -- BERT.

First, import the libraries.

In [20]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from transformers import get_scheduler
import pandas as pd
import numpy as np

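Import the dataset and resolve the labels as before, with one difference: here the -1 annotations are dropped before averaging instead of being counted as 0.5, and a comment with no remaining annotations is removed.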

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

train_file = "a3_train_final.tsv"
test_file = "a3_test.tsv"

df_train = pd.read_csv(train_file, sep="\t", header=None, names=["label", "text"])
df_test = pd.read_csv(test_file, sep="\t", header=None, names=["label", "text"])

def resolve_labels(label_str):
    labels = list(map(int, label_str.split('/')))
    labels = [l for l in labels if l != -1]  
    return round(np.mean(labels)) if labels else None  

df_train["label"] = df_train["label"].astype(str).apply(resolve_labels)
df_train = df_train.dropna()  
df_test["label"] = df_test["label"].astype(int)

print(df_train["label"].value_counts())

Using device: cpu
label
0    5889
1    5707
Name: count, dtype: int64
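
The label rule in this cell is not quite the one used in Part 1, so a side-by-side comparison may help. The sketch below re-implements both rules in plain Python and applies them to a few invented annotation strings. The function names `resolve_half` and `resolve_drop` are hypothetical.

```python
def resolve_half(label_str):
    """Part 1 rule: count -1 as 0.5, average, round."""
    labels = [0.5 if l == -1 else l for l in map(int, label_str.split("/"))]
    return round(sum(labels) / len(labels))

def resolve_drop(label_str):
    """Part 2 rule: ignore -1, average the rest, round; None if nothing is left."""
    labels = [l for l in map(int, label_str.split("/")) if l != -1]
    return round(sum(labels) / len(labels)) if labels else None

# Invented annotation strings, only for illustration.
examples = ["1/-1", "0/1/-1", "-1/-1/1", "-1"]
comparison = [(s, resolve_half(s), resolve_drop(s)) for s in examples]

for label_str, half, drop in comparison:
    print(f"{label_str:>8}  half={half}  drop={drop}")
```

On these examples the two rules disagree only when every annotation is -1: the Part 1 rule assigns class 0, while the Part 2 rule returns `None` and the row is dropped. If the training file contains no such rows, both rules would give the same class counts.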


Tokenize the comments, build the data loaders, initialize the BERT classifier, and fine-tune it for three epochs.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class VaccineDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = int(self.labels[idx])  
        encoding = self.tokenizer(text, padding="max_length", truncation=True, max_length=self.max_len, return_tensors="pt")
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(label, dtype=torch.long)
        }

train_dataset = VaccineDataset(df_train["text"].tolist(), df_train["label"].tolist(), tokenizer)
test_dataset = VaccineDataset(df_test["text"].tolist(), df_test["label"].tolist(), tokenizer)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True) 
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)

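optimizer = AdamW(model.parameters(), lr=2e-5)

epochs = 3
# The scheduler is stepped once per batch, so the total number of steps
# is the number of batches per epoch times the number of epochs.
num_training_steps = len(train_loader) * epochs
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
# No separate loss function is needed: the model returns the loss itself
# when labels are passed to it.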

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step() 
        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader)}")
    torch.cuda.empty_cache()  

model.save_pretrained("bert_vaccine_classifier")
tokenizer.save_pretrained("bert_vaccine_classifier")

print("Complete!!!!!!")
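
To make the learning-rate schedule in the training cell more tangible, here is a sketch of a linear decay with optional warmup, written in plain Python. The function `linear_lr` is hypothetical and only mirrors my understanding of what the `"linear"` schedule does. The step count assumes that no training rows were dropped, so that the training set has the 5889 + 5707 rows printed earlier.

```python
import math

def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

n_train = 5889 + 5707                      # class counts printed earlier
steps_per_epoch = math.ceil(n_train / 8)   # batch_size=8, last batch partial
total_steps = steps_per_epoch * 3          # three epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, total_steps))
```

Under these assumptions the rate starts at 2e-5, is halved by the middle of training, and reaches zero at the final step.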

Next, let's evaluate the model on the test dataset.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

def evaluate_model(model, data_loader):
    model.eval()
    predictions, true_labels = [], []

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
            labels = labels.cpu().numpy()

            predictions.extend(preds)
            true_labels.extend(labels)

    print("Classification Report:")
    print(classification_report(true_labels, predictions))
    print(f"Accuracy: {accuracy_score(true_labels, predictions):.4f}")

evaluate_model(model, test_loader)
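
The evaluation function turns the model's raw outputs (logits) into class predictions by taking the index of the largest value in each row. The sketch below shows the same idea in plain Python on made-up logits, together with a softmax that converts them into probabilities. The numbers are invented and do not come from the trained model.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(batch_logits):
    """Index of the largest logit in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in batch_logits]

# Hypothetical logits for three comments and two classes.
batch_logits = [[1.2, -0.3], [-0.8, 0.4], [0.1, 0.1]]
preds = predict(batch_logits)
probs = [softmax(row) for row in batch_logits]

for row, p, pred in zip(batch_logits, probs, preds):
    print(row, [round(x, 3) for x in p], pred)
```

Softmax does not change which entry is largest, so taking the argmax of the logits gives the same prediction as taking the argmax of the probabilities.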