# Build a Rule-based Sentiment Classifier

<img src="/home/peterchen/Study/Advanced-NLP/01-simpleclassifier/image.png" width="50%" style="margin: 0 auto;">

In [1]:
def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split()
    
    # Count the number of "good words" and "bad words"
    good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
    bad_words = ['hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry']
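    # Feature names must match the keys of feature_weights;
    # a name with no matching weight contributes 0 to the score.
    for x_word in x_split:
        if x_word in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if x_word in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1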
    # Always set the value of the bias key to 1
    features['bias'] = 1
    return features

feature_weights = {"good_word_count": 1.0, "bad_word_count": -1.0, "bias": 0.5}

## Data Reading

In [2]:
def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
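    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line has the form "<label> ||| <text>"; split only on the first separator
            label, text = line.strip().split(' ||| ', 1)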
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data
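
The reader above expects one example per line in the form `label ||| text`. A minimal sketch of that parsing step on in-memory data follows. The helper name `parse_xy_lines` and the three sample lines are hypothetical and are not taken from the SST files.

```python
import io

def parse_xy_lines(lines):
    """Parse 'label ||| text' lines into parallel lists of texts and integer labels."""
    x_data, y_data = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        label, text = line.split(" ||| ", 1)
        x_data.append(text)
        y_data.append(int(label))
    return x_data, y_data

# Hypothetical sample in the same format as the data files.
sample = io.StringIO(
    "1 ||| A great , moving film .\n"
    "-1 ||| A terrible bore .\n"
    "0 ||| It opens on Friday .\n"
)
xs, ys = parse_xy_lines(sample)
print(xs)
print(ys)
```

Because the function accepts any iterable of lines, the same logic can be exercised without touching the file system.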

In [3]:
x_train, y_train = read_xy_data("/home/peterchen/Study/Advanced-NLP/data/sst-sentiment-text-threeclass/train.txt")
x_test, y_test = read_xy_data("/home/peterchen/Study/Advanced-NLP/data/sst-sentiment-text-threeclass/dev.txt")

In [4]:
print(x_train[0])
print(y_train[0])

The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
1


## Run the Classifier and Calculate Accuracy

In [6]:
def run_classifier(x: str) -> int:
    score = 0
    for feat_name, feat_value in extract_features(x).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0
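
The classifier computes a weighted sum of feature values and maps its sign to a label. The sketch below walks through that rule on a few hand-written feature dictionaries. The helper names `score_features` and `label_from_score` and the example dictionaries are hypothetical, and the weights are copied from `feature_weights` above.

```python
# Weights copied from the notebook's feature_weights.
feature_weights = {"good_word_count": 1.0, "bad_word_count": -1.0, "bias": 0.5}

def score_features(features: dict[str, float]) -> float:
    # Sum value * weight; a feature name with no weight contributes 0.
    return sum(value * feature_weights.get(name, 0) for name, value in features.items())

def label_from_score(score: float) -> int:
    # Positive score -> 1, negative score -> -1, exactly zero -> 0.
    return 1 if score > 0 else -1 if score < 0 else 0

# Hypothetical feature dictionaries, written by hand for illustration.
examples = {
    "one good word": {"good_word_count": 1, "bias": 1},
    "one bad word": {"bad_word_count": 1, "bias": 1},
    "no sentiment words": {"bias": 1},
    "unmatched feature name": {"good_words": 1, "bias": 1},
}
for name, feats in examples.items():
    s = score_features(feats)
    print(f"{name}: score={s}, label={label_from_score(s)}")
```

Two things stand out. With integer word counts and a bias weight of 0.5, the score is never exactly zero, so label 0 is never predicted. A feature whose name has no entry in `feature_weights` is silently ignored, leaving only the bias to decide the label.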

In [7]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(x)
        total_number += 1
        if y_pred == y:
            correct_number += 1
    return correct_number / total_number

In [8]:
label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{1: 444, 0: 229, -1: 428}


In [9]:
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f"Train accuracy: {train_accuracy}")
print(f"Dev/test accuracy: {test_accuracy}")

Train accuracy: 0.4225187265917603
Dev/test accuracy: 0.4032697547683924
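
The recorded dev accuracy coincides with the share of label 1 in the dev set, which is what a classifier that predicts 1 for every input would score. One plausible reading is that only the bias feature contributed to the score in that run. A minimal sketch of the majority-class baseline, using the label counts printed above:

```python
# Dev-set label counts as printed by the label_count cell.
label_count = {1: 444, 0: 229, -1: 428}

total = sum(label_count.values())
# The majority label is the one with the highest count.
majority_label, majority_count = max(label_count.items(), key=lambda kv: kv[1])
baseline = majority_count / total

print(f"majority label: {majority_label}")
print(f"examples: {total}")
print(f"majority-class accuracy: {baseline}")
```

Comparing a classifier against this baseline is a quick check that its features are actually influencing predictions.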


## Error Analysis

In [11]:
import random

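def find_errors(x_data: list[str], y_data: list[int]) -> None:
    # Print five randomly chosen misclassified examples (repeats are possible)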
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(x))
        if y != y_preds[-1]:
            error_ids.append(i)
    for _ in range(5):
        my_id = random.choice(error_ids)
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f"{x}\ntrue label: {y}\npredicted label: {y_pred}\n")

In [12]:
find_errors(x_train, y_train)

Snipes is both a snore and utter tripe .
true label: -1
predicted label: 1

Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .
true label: -1
predicted label: 1

There is no psychology here , and no real narrative logic -- just a series of carefully choreographed atrocities , which become strangely impersonal and abstract .
true label: -1
predicted label: 1

If I have to choose between gorgeous animation and a lame story -LRB- like , say , Treasure Planet -RRB- or so-so animation and an exciting , clever story with a batch of appealing characters , I 'll take the latter every time .
true label: 0
predicted label: 1

Ozpetek joins the ranks of those gay filmmakers who have used the emigre experience to explore same-sex culture in ways that elude the more nationally settled .
true label: 0
predicted label: 1
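
Random samples show individual mistakes, while a tally of (true label, predicted label) pairs shows which kind of mistake dominates. The sketch below applies such a tally to the five sampled errors above. The helper name `count_error_types` is hypothetical, and on real data the full prediction list would take the place of the hand-copied labels.

```python
from collections import Counter

def count_error_types(y_true, y_pred):
    """Count misclassifications grouped by (true label, predicted label)."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Labels copied from the five sampled errors printed above.
y_true = [-1, -1, -1, 0, 0]
y_pred = [1, 1, 1, 1, 1]

errors = count_error_types(y_true, y_pred)
for (true_label, pred_label), n in errors.most_common():
    print(f"true {true_label} -> predicted {pred_label}: {n}")
```

In this sample every error is a prediction of 1, so the tally points at over-prediction of the positive class rather than at any single sentence.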

