# Training a Sentiment Classifier

In [1]:
import random
import tqdm

## Feature Extraction

In [2]:
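def extract_features(x: str) -> dict[str, float]:
    # Bag-of-words features: map each space-separated token to its count in x.
    features = {}
    x_split = x.split(' ')
    for word in x_split:  # use a separate name so the argument x is not shadowed
        features[word] = features.get(word, 0) + 1
    return features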
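
As a quick sanity check, the sketch below applies the same whitespace-based counting to a made-up sentence. The helper name `count_tokens` and the sentence are illustrative assumptions, not part of the notebook or the dataset. Each distinct token, including punctuation, becomes a feature whose value is its count.

```python
def count_tokens(text: str) -> dict[str, float]:
    """Illustrative stand-in that mirrors extract_features: count space-separated tokens."""
    counts: dict[str, float] = {}
    for token in text.split(' '):
        counts[token] = counts.get(token, 0) + 1
    return counts

# Made-up input; "good" appears twice, so its feature value is 2.
example = count_tokens("a good , good movie")
print(example)  # {'a': 1, 'good': 2, ',': 1, 'movie': 1}
```

Tokens are case-sensitive here, so `The` and `the` would be counted as different features.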

In [3]:
# Global weight table: feature name -> weight (features not present count as 0).
feature_weights = {}

## Data Reading

In [4]:
def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    with open(filename, 'r') as f:
        for line in f:
            # Each line has the form "<label> ||| <text>"; split on the first separator only.
            label, text = line.strip().split(" ||| ", 1)
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data

In [5]:
x_train, y_train = read_xy_data("/home/peterchen/Study/Advanced-NLP/data/sst-sentiment-text-threeclass/train.txt")
x_dev, y_dev = read_xy_data("/home/peterchen/Study/Advanced-NLP/data/sst-sentiment-text-threeclass/dev.txt")

In [6]:
print(x_train[0])
print(y_train[0])

The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
1


## Inference Code

In [8]:
def run_classifier(features: dict[str, float]) -> int:
    score = 0
    for feat_name, feat_value in features.items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0
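
The decision rule above is the sign of a sparse dot product between the feature counts and the weights. The sketch below works through one case by hand; the names `score_features`, `sign_label`, and the toy weights are hypothetical values chosen for illustration, not learned weights from this notebook.

```python
def score_features(features: dict[str, float], weights: dict[str, float]) -> float:
    """Sparse dot product; features without a weight contribute 0."""
    return sum(value * weights.get(name, 0) for name, value in features.items())

def sign_label(score: float) -> int:
    """Map a score to a label: 1 if positive, -1 if negative, 0 if exactly zero."""
    return 1 if score > 0 else (-1 if score < 0 else 0)

# Hypothetical weights and features, for illustration only.
toy_weights = {"good": 2.0, "bad": -3.0}
toy_features = {"good": 2, "bad": 1, "movie": 1}

# 2 * 2.0 + 1 * (-3.0) + 1 * 0 = 1.0
toy_score = score_features(toy_features, toy_weights)
print(toy_score, sign_label(toy_score))  # 1.0 1
```

Because a label of 0 requires the score to be exactly zero, this rule predicts the neutral class only when the contributions cancel out or none of the features has a weight.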

## Training Code

In [10]:
NUM_EPOCHS = 5
for epoch in range(1, NUM_EPOCHS+1):
    data_ids = list(range(len(x_train)))
    random.shuffle(data_ids)  # shuffle the example indices in place
    for data_id in tqdm.tqdm(data_ids, desc=f"Epoch {epoch}"):
        x = x_train[data_id]
        y = y_train[data_id]
        if y == 0:
            # Skip examples whose label y is 0 (neutral)
            continue
        features = extract_features(x)
        predicted_y = run_classifier(features)
        if predicted_y != y:  # the prediction is wrong, so the weights need updating
            for feature in features:
                # Update rule: weight = old weight + true label * feature value
                feature_weights[feature] = feature_weights.get(feature, 0) + y * features[feature]

Epoch 1: 100%|██████████| 8544/8544 [00:00<00:00, 171142.12it/s]
Epoch 2: 100%|██████████| 8544/8544 [00:00<00:00, 175893.22it/s]
Epoch 3: 100%|██████████| 8544/8544 [00:00<00:00, 176886.45it/s]
Epoch 4: 100%|██████████| 8544/8544 [00:00<00:00, 180693.07it/s]
Epoch 5: 100%|██████████| 8544/8544 [00:00<00:00, 182026.65it/s]
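
The mistake-driven update in the loop above matches the classic perceptron rule: weights change only when the prediction is wrong, and each active feature moves in the direction of the true label. The sketch below traces that rule on a single made-up example. `perceptron_update` and the toy features are assumptions for illustration, and the function takes the weights as an argument instead of using the notebook's global `feature_weights`.

```python
def perceptron_update(weights: dict[str, float], features: dict[str, float], y: int) -> int:
    """Predict with the current weights; on a mistake, add y * feature value to each weight."""
    score = sum(value * weights.get(name, 0) for name, value in features.items())
    predicted = 1 if score > 0 else (-1 if score < 0 else 0)
    if predicted != y:
        for name, value in features.items():
            weights[name] = weights.get(name, 0) + y * value
    return predicted

toy_weights: dict[str, float] = {}
toy_features = {"dull": 1, "plot": 2}  # made-up negative example

# First pass: all weights are 0, so the score is 0 and the prediction 0 is wrong.
first = perceptron_update(toy_weights, toy_features, -1)
# Second pass: the score is 1 * (-1) + 2 * (-2) = -5, so the prediction is -1 and nothing changes.
second = perceptron_update(toy_weights, toy_features, -1)
print(first, second, toy_weights)  # 0 -1 {'dull': -1, 'plot': -2}
```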


## Evaluation Code

In [12]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(extract_features(x))
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [13]:
label_count = {}
for y in y_dev:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{1: 444, 0: 229, -1: 428}
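
The label counts above give a simple reference point for accuracy on the dev set: the score of a classifier that always predicts the most frequent label. The sketch below computes it from the printed counts; the variable names are illustrative.

```python
# Counts copied from the output above.
dev_label_count = {1: 444, 0: 229, -1: 428}

total = sum(dev_label_count.values())
majority_label = max(dev_label_count, key=dev_label_count.get)
majority_baseline = dev_label_count[majority_label] / total

print(f"{total} examples; always predicting {majority_label} gives {majority_baseline:.4f}")
```

With these counts the dev set has 1101 examples and the majority-class baseline is about 0.4033. Neutral examples make up 229 of the 1101, and since training skips them and the classifier outputs 0 only on an exactly zero score, most of them are probably counted as errors.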


In [15]:
train_accuracy = calculate_accuracy(x_train, y_train)
dev_accuracy = calculate_accuracy(x_dev, y_dev)
print(f"Train accuracy: {train_accuracy}")
print(f"Dev accuracy: {dev_accuracy}")

Train accuracy: 0.7927200374531835
Dev accuracy: 0.5803814713896458


## Error Analysis

In [16]:
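def find_errors(x_data: list[str], y_data: list[int]) -> None:
    # Print five randomly chosen misclassified examples (the same one may be drawn twice).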
    error_ids = []
    y_preds = []
    for i, (x,y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(extract_features(x)))
        if y != y_preds[-1]:
            error_ids.append(i)
    for _ in range(5):
        my_id = random.choice(error_ids)
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f"{x}\ntrue label: {y}\npredicted label: {y_pred}\n")

In [17]:
find_errors(x_dev, y_dev)

The format gets used best ... to capture the dizzying heights achieved by motocross and BMX riders , whose balletic hotdogging occasionally ends in bone-crushing screwups .
true label: 1
predicted label: 0

On the heels of The Ring comes a similarly morose and humorless horror movie that , although flawed , is to be commended for its straight-ahead approach to creepiness .
true label: 1
predicted label: -1

It deserves to be seen by anyone with even a passing interest in the events shaping the world beyond their own horizons .
true label: 1
predicted label: -1

It 's hard to imagine Alan Arkin being better than he is in this performance .
true label: 1
predicted label: -1

The film 's essentially over by the meet-cute .
true label: 0
predicted label: -1
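
One way to dig into errors like these is to list the features that contribute most to the score. The sketch below does this for a hypothetical weight table: `top_contributions` is an illustrative helper, and the weights are invented for the example, not the values learned in this notebook.

```python
def top_contributions(
    features: dict[str, float], weights: dict[str, float], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k features with the largest absolute contribution (value * weight)."""
    contributions = {name: value * weights.get(name, 0) for name, value in features.items()}
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)[:k]

# Invented weights and features, for illustration only.
toy_weights = {"hard": -4, "better": 3, "imagine": 1}
toy_features = {"hard": 1, "to": 1, "imagine": 1, "better": 1}

print(top_contributions(toy_features, toy_weights))
# [('hard', -4), ('better', 3), ('imagine', 1)]
```

In the notebook, the same idea could be applied by passing `extract_features(x)` and `feature_weights` for a misclassified sentence, which would show whether a single strongly weighted word is driving the wrong label.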

