### **Building a Rule-based Sentiment Classifier**
This notebook builds a rule-based sentiment classifier. It takes a text X and returns the label 1 if the sentiment is positive, -1 if it is negative, and 0 if it is neutral. You can test the accuracy of the classifier on the Stanford Sentiment Treebank by running the notebook all the way to the end.

The classifier chooses a label by computing the dot product of feature_weights and extract_features(X). The rule-based `run_classifier` below returns 1 if the score is greater than 1, -1 if it is less than -1, and 0 otherwise. The bag-of-words version later in the notebook thresholds at zero instead: 1 for a positive score, -1 for a negative score, and 0 if the score is exactly zero.
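
As a minimal sketch of this scoring step, the snippet below computes the sparse dot product by hand for the weights and the example feature dictionary used in this notebook. The helper names `score_features` and `label_from_score`, and the `threshold` parameter, are hypothetical and exist only for illustration.

```python
def score_features(features: dict[str, float], weights: dict[str, float]) -> float:
    # Sparse dot product: features missing from `weights` contribute 0.
    return sum(value * weights.get(name, 0.0) for name, value in features.items())


def label_from_score(score: float, threshold: float = 0.0) -> int:
    # `threshold` is an illustrative parameter, not part of the notebook's code.
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0


weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}
features = {'good_word_count': 2, 'bad_word_count': 1, 'bias': 1}

score = score_features(features, weights)  # 2 * 1.0 + 1 * -1.0 + 1 * 0.5 = 1.5
print(score, label_from_score(score))
```

Because the bias weight is 0.5, a text with no matched words scores 0.5. A zero threshold would label such a text positive, whereas the cutoffs of 1 and -1 in the `run_classifier` cell send it to the neutral class.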


In [1]:
def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split(' ')
    
    good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
    bad_words = ['hate', 'bad', 'terrible', 'disappointing', 'sad', 'lost', 'angry']
    
    # Count how many tokens appear in each word list
    for word in x_split:
        if word in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if word in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1
            
    features['bias'] = 1
    
    return features
    
feature_weights = {'good_word_count':1.0,'bad_word_count': -1.0, 'bias':0.5 }
            
            

In [2]:
extract_features("I love to play football because it is nice and sad")

{'good_word_count': 2, 'bad_word_count': 1, 'bias': 1}

> From the example above, we can see that the function correctly counts the good and bad words in the input `x`.

### **Reading the Data**

In [3]:
def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    with open(filename, 'r') as f:
        for line in f:
            label, text = line.strip().split(' ||| ')
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data

x_train, y_train = read_xy_data('./data/train.txt')
x_test, y_test = read_xy_data('./data/test.txt') 

print(x_train[0])
print(y_train[0])

The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
1
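
The `split(' ||| ')` call in `read_xy_data` implies that each line of the data files has the form `label ||| text`. A minimal sketch of parsing one such line follows. `parse_line` is a hypothetical helper, the sample line is made up, and the `maxsplit=1` argument is an added safeguard (an assumption, not part of the original cell) so that a text containing the separator itself would not break the unpacking.

```python
def parse_line(line: str) -> tuple[str, int]:
    # Split on the first ' ||| ' only, so the text may itself contain the separator.
    label, text = line.strip().split(' ||| ', maxsplit=1)
    return text, int(label)


# Made-up sample line in the assumed `label ||| text` format.
text, label = parse_line("1 ||| a made-up example sentence .\n")
print(label, text)
```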


### **Run the Classifier**

In [5]:
def run_classifier(x: str) -> int:
    score = 0

    for feat_name, feat_value in extract_features(x).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
        
    if score > 1:
        return 1
    elif score < -1:
        return -1
    else:
        return 0

In [6]:
run_classifier("I hate to play football because it is nice")

0

### **Calculate Accuracy**


In [7]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data, strict=True):
        y_pred = run_classifier(x)
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [8]:
label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{0: 389, 1: 909, -1: 912}


In [9]:
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')

Train accuracy: 0.21594101123595505
Dev/test accuracy: 0.19411764705882353
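
To put these numbers in context, it can help to compare them with trivial baselines computed from the test-set label counts printed above. The sketch below uses only those counts, and the variable names are illustrative.

```python
label_count = {0: 389, 1: 909, -1: 912}  # test-set counts printed earlier
total = sum(label_count.values())

# Accuracy of always predicting the most frequent label.
majority_label = max(label_count, key=label_count.get)
majority_accuracy = label_count[majority_label] / total

# Accuracy of always predicting neutral (0).
neutral_accuracy = label_count[0] / total

print(f'Majority baseline ({majority_label}): {majority_accuracy:.3f}')
print(f'Always-neutral baseline: {neutral_accuracy:.3f}')
```

The majority baseline comes out at roughly 0.413 and the always-neutral baseline at roughly 0.176. The reported test accuracy of about 0.19 is close to the always-neutral figure, which is consistent with the rule-based classifier returning 0 for most inputs.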


### **Error Analysis**

In [10]:
import random
def find_errors(x_data, y_data):
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(x))
        if y != y_preds[-1]:
            error_ids.append(i)
    # Sample up to 5 distinct errors; avoids an IndexError when there are none
    for my_id in random.sample(error_ids, min(5, len(error_ids))):
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')

In [11]:
find_errors(x_train, y_train)

Hard-core slasher aficionados will find things to like ... but overall the Halloween series has lost its edge .
true label: -1
predicted label: 0

A thoroughly awful movie -- dumb , narratively chaotic , visually sloppy ... a weird amalgam of ` The Thing ' and a geriatric ` Scream . '
true label: -1
predicted label: 0

This toothless Dog , already on cable , loses all bite on the big screen .
true label: -1
predicted label: 0

The Cat 's Meow marks a return to form for director Peter Bogdanovich ...
true label: 1
predicted label: 0

Off the Hook is overlong and not well-acted , but credit writer-producer-director Adam Watstein with finishing it at all .
true label: -1
predicted label: 0



### **Simple BoW Evaluation**

In [12]:
import random
import tqdm

In [13]:
def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split(' ')
    # Count occurrences of each token (named `token` to avoid shadowing the argument `x`)
    for token in x_split:
        features[token] = features.get(token, 0) + 1.0
    return features

In [14]:
# Example
extract_features("I am a good boy")

{'I': 1.0, 'am': 1.0, 'a': 1.0, 'good': 1.0, 'boy': 1.0}
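
The same bag-of-words counts can also be produced with `collections.Counter` from the standard library. This is only an alternative sketch, and `extract_bow` is a hypothetical name rather than a function from this notebook. The example input repeats a word to show that counts accumulate.

```python
from collections import Counter


def extract_bow(x: str) -> dict[str, float]:
    # Count each space-separated token; values are floats to match the notebook's features.
    return {token: float(count) for token, count in Counter(x.split(' ')).items()}


print(extract_bow("good good boy"))
```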

In [15]:
feature_weights = {}

### **Inference Code**
The classifier scores a feature dictionary against the current weights and thresholds the score at zero.

In [16]:
def run_classifier(features: dict[str, float]) -> int:
    score = 0
    for feat_name, feat_value in features.items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0    

In [18]:
# Example
run_classifier({'I': 1.0, 'am': 1.0, 'a': 1.0, 'good': 1.0, 'boy': 1.0})

0

### **Training Code**
> Learn the weights of the classifier.

In [20]:
NUM_EPOCHS = 5
for epoch in range(1, NUM_EPOCHS+1):
    # Shuffle the order of the data
    data_ids = list(range(len(x_train)))
    random.shuffle(data_ids)
    # Run over all data points
    for data_id in tqdm.tqdm(data_ids, desc=f'Epoch {epoch}'):
        x = x_train[data_id]
        y = y_train[data_id]
        # We will skip neutral examples
        if y == 0:    
            continue
        # Make a prediction
        features = extract_features(x)
        predicted_y = run_classifier(features)
        # Update the weights if the prediction is wrong
        if predicted_y != y:
            for feature in features:
                feature_weights[feature] = feature_weights.get(feature, 0) + y * features[feature]

Epoch 1: 100%|██████████| 8544/8544 [00:00<00:00, 21190.18it/s]
Epoch 2: 100%|██████████| 8544/8544 [00:00<00:00, 71243.81it/s]
Epoch 3: 100%|██████████| 8544/8544 [00:00<00:00, 47761.52it/s]
Epoch 4: 100%|██████████| 8544/8544 [00:00<00:00, 52450.59it/s]
Epoch 5: 100%|██████████| 8544/8544 [00:00<00:00, 49996.70it/s]
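
The update in the loop above resembles the classic perceptron rule: when a prediction is wrong, each active feature's weight moves by `y * feature_value` toward the true label. Below is a minimal sketch of a single update on a toy sentence. It uses local copies so it does not touch the notebook's `feature_weights`, and the names `toy_weights`, `toy_features`, and `toy_score` are illustrative.

```python
toy_weights: dict[str, float] = {}
toy_features = {'a': 1.0, 'great': 1.0, 'movie': 1.0}
y = 1  # true label: positive


def toy_score(features: dict[str, float], weights: dict[str, float]) -> float:
    # Sparse dot product between the features and the current weights.
    return sum(v * weights.get(k, 0.0) for k, v in features.items())


# Before any update all weights are 0, so the score is 0 and the prediction is neutral.
before = toy_score(toy_features, toy_weights)

# The prediction (0) differs from y (1), so every active feature is nudged by y * value.
for name, value in toy_features.items():
    toy_weights[name] = toy_weights.get(name, 0.0) + y * value

after = toy_score(toy_features, toy_weights)
print(before, after)
```

After one update the score for the same sentence is positive, so the example would now be classified correctly. Weights for words shared with other sentences keep being pushed in both directions as training continues.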


### **Evaluation Code**
Accuracy is the fraction of examples whose predicted label matches the true label:

In [21]:
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(extract_features(x))
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

In [23]:
label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

{0: 389, 1: 909, -1: 912}


In [25]:
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')

Train accuracy: 0.7560861423220974
Dev/test accuracy: 0.6135746606334842


### **Error Analysis**
An important part of improving any system is figuring out where it goes wrong. The following function prints a random sample of misclassified examples, which may help you improve the classifier. Feel free to write more sophisticated methods for error analysis as well.

In [26]:
def find_errors(x_data, y_data):
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(extract_features(x)))
        if y != y_preds[-1]:
            error_ids.append(i)
    # Sample up to 5 distinct errors; avoids an IndexError when there are none
    for my_id in random.sample(error_ids, min(5, len(error_ids))):
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')
find_errors(x_test, y_test)

You wo n't believe much of it , but you will laugh at the audacity , at the who 's who casting and the sheer insanity of it all .
true label: 0
predicted label: 1

Strong setup and ambitious goals fade as the film descends into unsophisticated scare tactics and B-film thuggery .
true label: -1
predicted label: 1

What will , most likely , turn out to be the most repellent movie of 2002 .
true label: -1
predicted label: 1

For all its highfalutin title and corkscrew narrative , the movie turns out to be not much more than a shaggy human tale .
true label: -1
predicted label: 1

With a story inspired by the tumultuous surroundings of Los Angeles , where feelings of marginalization loom for every dreamer with a burst bubble , The Dogwalker has a few characters and ideas , but it never manages to put them on the same path .
true label: -1
predicted label: 1
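
One slightly more systematic option is to tally how often each true label is predicted as each label. The sketch below is one possible approach and is not part of the original notebook. `confusion_counts` is a hypothetical helper that accepts any prediction function, and the toy predictor and data are made up purely for the demonstration.

```python
from collections import Counter
from typing import Callable


def confusion_counts(
    x_data: list[str],
    y_data: list[int],
    predict: Callable[[str], int],
) -> Counter:
    # Count (true label, predicted label) pairs over the dataset.
    counts: Counter = Counter()
    for x, y in zip(x_data, y_data):
        counts[(y, predict(x))] += 1
    return counts


# Toy demonstration with a made-up predictor that always answers 1.
toy_x = ['fine film', 'dull film', 'a film']
toy_y = [1, -1, 0]
counts = confusion_counts(toy_x, toy_y, lambda text: 1)
for (true_label, predicted_label), n in sorted(counts.items()):
    print(f'true {true_label:>2} -> predicted {predicted_label:>2}: {n}')
```

In this notebook the call could look like `confusion_counts(x_test, y_test, lambda x: run_classifier(extract_features(x)))`, which would show, for example, how many negative reviews are predicted as positive.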

