<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/5.classification/HW5_FeatureExploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/5.classification/HW5_FeatureExploration.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

# Feature engineering for text classification

This notebook explores feature engineering for text classification.  Your task is to create two new feature functions (like `political_dictionary_feature` and `unigram_feature` below), and include them in the list of feature functions that is passed to `build_features`.  What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative!  You are free to read in any other external resources you like (dictionaries, document metadata, etc.)

You are free to use any of the following datasets for this exercise, or to use your own (if you have your own labeled data with at least 500 examples from at least two classes, I would encourage you to use it!).  If you use your own data, just be sure to format it like the examples below; each directory has a `train.tsv`, `dev.tsv` and `test.tsv` file, where each file is tab-separated (label in the first column and text in the second column).

* [Sentiment Analysis](https://ai.stanford.edu/~amaas/data/sentiment/) (Positive/Negative)
* [Congressional Speech](https://www.cs.cornell.edu/home/llee/data/convote.html) (Democrat/Republican)
* Library of Congress Subject Classification ([21 categories](https://en.wikipedia.org/wiki/Library_of_Congress_Classification))
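
If you bring your own data, the file layout described above (label in the first column, text in the second, tab-separated, no header row) can be checked with a short sketch like the one below. The sample rows and the `read_tsv` helper are hypothetical and use only the standard library; the notebook itself reads files with `pandas` further down.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the expected layout: label<TAB>text, one example per line.
sample_tsv = (
    "pos\tan absolute delight from start to finish\n"
    "neg\tdull , overlong and forgettable\n"
    "pos\tsharp writing and a great cast\n"
)

def read_tsv(handle):
    """Return (texts, labels) from a tab-separated stream with the label first."""
    texts, labels = [], []
    # QUOTE_NONE keeps quotation marks inside the text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        labels.append(row[0])
        texts.append(row[1])
    return texts, labels

texts, labels = read_tsv(io.StringIO(sample_tsv))
print(Counter(labels))  # number of examples per class
```

Counting labels this way is also a quick check that your data has at least two classes and enough examples in each.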

For whichever dataset you pick, download the data first using the code below.


In [None]:
# get LMRD data
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/train.tsv -O lmrd_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/dev.tsv -O lmrd_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/test.tsv -O lmrd_test.tsv

In [1]:
# get Convote data
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/train.tsv -O convote_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/dev.tsv -O convote_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/test.tsv -O convote_test.tsv

--2025-09-26 16:42:11--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4660140 (4.4M) [text/plain]
Saving to: ‘convote_train.tsv’


2025-09-26 16:42:12 (126 MB/s) - ‘convote_train.tsv’ saved [4660140/4660140]

--2025-09-26 16:42:12--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 351382 (343K) [text/plain]
Saving to: ‘convote_dev.tsv’


2025-09-26 16:42:12

In [None]:
# get LoC data
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/train.tsv -O loc_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/dev.tsv -O loc_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/test.tsv -O loc_test.tsv

In [2]:
import operator
import sys
from collections import Counter

import nltk
from nltk import word_tokenize
from sklearn import linear_model, preprocessing

nltk.download("punkt")
nltk.download("punkt_tab")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import sparse

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## Part 1: Loading data

**Q1: Briefly describe your data (including the categories you're predicting).**  If you're using your own data, tell us about it; if you're using one of the datasets above, tell us something that shows you've looked at the data. How many examples are in each category?

In [3]:
def read_data(filename):
    df = pd.read_csv(filename, names=["label", "text"], sep="\t")
    return df.text.to_list(), df.label.to_list()

In [4]:
# Change this to the name of the dataset you will be using ("lmrd", "convote" or "loc").
# The files downloaded above are named <name>_train.tsv, <name>_dev.tsv and <name>_test.tsv
data = "convote"

x_train, y_train = read_data("%s_train.tsv" % data)
x_dev, y_dev = read_data("%s_dev.tsv" % data)

In [33]:
from collections import Counter

print(x_train[0])
print(y_train[0] + "\n")
print(x_dev[0])
print(y_dev[0] + "\n")
print("Train: ", Counter(y_train))
print("Dev: ", Counter(y_dev))

mr. speaker , i rise in opposition to the rules package that we have before us today .  it is outrageous that my republican colleagues have placed before us a rules package that at best lacks integrity , and at worst is completely unethical .  as the highest body of elected officials in our country , we should be held to the highest ethical standards .  but instead , my republican colleagues have opted to put before us a rules package that actually lowers our ethics standards , so that they may promote their own agenda , at whatever cost .  this rules package makes it far more difficult for ethics investigations to take place .  by requiring a majority of the ethics committee before an investigation can even begin , we are in great danger of diminishing the integrity of our great institution .  with this new rule , the majority party can effectively block any ethics investigation of a member of their party .  this is an abuse of power .  and it 's not just democrats who oppose this pla

The data are transcribed congressional debate speeches. Each line contains a party label ("R" or "D") and the text of a speech, separated by a tab. The training data contain 1373 passages labelled R and 1350 passages labelled D; the dev data contain 130 passages labelled R and 127 labelled D, as shown above.

## Part 2: Features

Here, you will hand-engineer some features for your classifier.

In [5]:
## HELPER FUNCTIONS ##

def majority_class(y_train, y_dev):
    label_counts = Counter(y_train)
    majority = label_counts.most_common(1)[0][0]

    correct = 0.
    for label in y_dev:
        if label == majority:
            correct += 1

    print("%s\t%.3f" % (majority, correct/len(y_dev)))
    return correct / len(y_dev)

def build_features(x_train, feature_functions):
    data = []
    for doc in x_train:
        feats = {}
        tokens = doc.split(" ")

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

# This helper function converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab = {}
    idx = 0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat] = idx
                idx += 1

    return feature_vocab

# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature
# values will be 0 for most documents (note: why?), and we don't want to save them all in
# memory.

def features_to_ids(data, feature_vocab):
    new_data = sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx, feature_vocab[f]] = doc[f]
    return new_data

We'll start with two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.
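
As a rough illustration of what a feature function returns, and of why the helper above stores features in a sparse matrix, here is a small sketch on three made-up documents. The documents and the `toy_unigram_feature` name are assumptions for illustration; the logic mirrors `unigram_feature` and `create_vocab`.

```python
# Three hypothetical documents, already split into tokens.
docs = [
    "we must cut spending now".split(" "),
    "the economy is growing".split(" "),
    "spending on programs must grow".split(" "),
]

def toy_unigram_feature(tokens):
    # One binary feature per distinct word in the document.
    return {"UNIGRAM_%s" % word: 1 for word in tokens}

feature_dicts = [toy_unigram_feature(tokens) for tokens in docs]

# Assign each feature name a column id in order of first appearance.
vocab = {}
for feats in feature_dicts:
    for name in feats:
        vocab.setdefault(name, len(vocab))

# Each document only touches the columns for words it contains.
nonzero = sum(len(feats) for feats in feature_dicts)
total = len(docs) * len(vocab)
print(feature_dicts[1])
print("%d of %d cells are non-zero" % (nonzero, total))
```

Even in this tiny case most cells are zero, and the proportion of zeros grows quickly as the vocabulary grows, since any one document uses only a small share of all words.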

In [6]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)

# EDIT TO FIT YOUR DATASET (this already sort of fits the convote data, so I
# just added to it)
dem_dictionary = set(["republican","cut", "opposition", "programs", "spending"])
repub_dictionary = set(["growth","economy", "budget", "business"])

def political_dictionary_feature(tokens):
    feats = {}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"] = 1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"] = 1
    return feats

In [7]:
def unigram_feature(tokens):
    feats = {}
    for word in tokens:
        feats["UNIGRAM_%s" % word] = 1
    return feats

**Q2**: **Add first new feature function here.**  Describe your feature and why you think it will help.

In [19]:
# Finds adjacent word pairs such as "tax_cut" or "illegal_aliens"
def bigram_feature(tokens):
    feats = {}
    for i in range(len(tokens) - 1):
        feats["BIGRAM_%s_%s" % (tokens[i], tokens[i + 1])] = 1
    return feats

**Q3**: **Add second new feature function here.** Describe your feature and why you think it will help.

In [20]:
def document_length_feature(tokens):
    feats={}
    feats["document_length"] = len(tokens)
    return feats

We use the `build_features` helper function to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).
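
To make the warning about clobbered keys concrete, here is a minimal sketch. All three feature functions and the `merge` helper are hypothetical; `merge` repeats the `feats.update(...)` loop that `build_features` uses for a single document.

```python
def count_feature(tokens):
    return {"length": len(tokens)}

def char_feature(tokens):
    # Illustrative bug: reuses the key "length", so it overwrites count_feature.
    return {"length": sum(len(t) for t in tokens)}

def char_feature_fixed(tokens):
    # A distinct prefix keeps the two features separate.
    return {"CHAR_length": sum(len(t) for t in tokens)}

def merge(tokens, feature_functions):
    feats = {}
    for function in feature_functions:
        feats.update(function(tokens))
    return feats

tokens = "mr. speaker , i rise".split(" ")
clobbered = merge(tokens, [count_feature, char_feature])
safe = merge(tokens, [count_feature, char_feature_fixed])
print(clobbered)
print(safe)
```

In the clobbered case the token count is silently lost, which is why giving each feature class its own prefix (as `UNIGRAM_` does) is a reasonable habit.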

In [23]:
# This function trains a model and returns the fitted classifier, the feature vocabulary,
# and the predicted and true labels for the dev data
def evaluate(x_train, x_dev, y_train, y_dev, feature_functions):
    x_train_feat = build_features(x_train, feature_functions)
    x_dev_feat = build_features(x_dev, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab = create_vocab(x_train_feat)

    x_train_ids = features_to_ids(x_train_feat, feature_vocab)
    x_dev_ids = features_to_ids(x_dev_feat, feature_vocab)

    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(x_train_ids, y_train)
    predictions = logreg.predict(x_dev_ids)
    #return (predictions, y_dev)
    return logreg, feature_vocab, predictions, y_dev

In [16]:
def print_weights(clf, vocab, n=10):
    reverse_vocab = [None]*len(clf.coef_[0])
    for k in vocab:
        reverse_vocab[vocab[k]] = k

    if len(clf.classes_) == 2:

        weights=clf.coef_[0]
        for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
            print("%.3f\t%s" % (weight, feature))

        print()

        for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
            print("%.3f\t%s" % (weight, feature))

    else:
        for i, cat in enumerate(clf.classes_):

            weights=clf.coef_[i]

            for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
                print("%s\t%.3f\t%s" % (cat, weight, feature))
            print()

In [17]:
majority_class(y_train,y_dev)

R	0.506


0.5058365758754864

Explore the impact of different feature functions by evaluating them below:

In [24]:
features = [unigram_feature]
#clf, vocab = pipeline(x_train, x_dev, y_train, y_dev, features)
clf, vocab, predictions, y_dev = evaluate(x_train, x_dev, y_train, y_dev, features)

If you want to print the coefficients for any of the models you train, you can do so like this.

In [25]:
print_weights(clf, vocab)

-1.489	UNIGRAM_leader
-1.307	UNIGRAM_objection
-1.193	UNIGRAM_cuts
-1.137	UNIGRAM_present
-1.091	UNIGRAM_republican
-0.967	UNIGRAM_quorum
-0.941	UNIGRAM_remainder
-0.936	UNIGRAM_although
-0.930	UNIGRAM_request
-0.909	UNIGRAM_democratic

1.067	UNIGRAM_respond
0.941	UNIGRAM_taxes
0.937	UNIGRAM_understanding
0.919	UNIGRAM_bringing
0.904	UNIGRAM_call
0.901	UNIGRAM_immediate
0.899	UNIGRAM_absolutely
0.873	UNIGRAM_30
0.866	UNIGRAM_detained
0.844	UNIGRAM_yielding


## Part 3: Analysis

**Q4**: Implement a function that returns the parametric confidence interval bounds for a binomial estimator of the model accuracy. It should return a tuple of floats `(lower_bound, upper_bound)`.
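
One common parametric choice is the normal-approximation (Wald) interval, $\hat{p} \pm z \sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p}$ is the observed accuracy and $n$ the number of evaluated examples. The sketch below is one possible reading of the question, not necessarily the intended solution; the name `wald_interval` is hypothetical, and it treats the value 0.95 as a confidence level.

```python
import math
from statistics import NormalDist

def wald_interval(predictions, targets, confidence_level=0.95):
    """Normal-approximation confidence interval for accuracy."""
    n = len(targets)
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    accuracy = correct / n

    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)

    # Accuracy cannot leave [0, 1], so clip the bounds.
    return (max(0.0, accuracy - half_width), min(1.0, accuracy + half_width))

# Hypothetical labels: 80 of 100 predictions match the targets.
toy_targets = ["R"] * 100
toy_predictions = ["R"] * 80 + ["D"] * 20
lower, upper = wald_interval(toy_predictions, toy_targets)
print("%.3f to %.3f" % (lower, upper))
```

The approximation is usually considered reasonable when both the number of correct and incorrect predictions are not too small; other intervals (for example Wilson) behave better near 0 or 1.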

In [None]:
def binomial_test(predictions, targets, significance_level=0.95):
    # YOUR CODE HERE
    upper_bound = 0.0
    lower_bound = 0.0
    return (lower_bound, upper_bound)

**Q5**: Plot the performance for models trained with different combinations of features, including your two custom features. Some combinations you might try (but feel free to pick your own!):
1. Just the dictionary features
2. Just the unigram features
3. Just your custom features
4. Unigram features + custom features

Make a bar plot with confidence intervals. Does incorporating your features result in a statistically significant change in performance?
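
A bar plot with confidence intervals can be sketched as follows, assuming you already have one accuracy and one `(lower_bound, upper_bound)` pair per model. The helper names and the numbers are placeholders for illustration only, not results from this notebook. `matplotlib` expects error bars as distances from the bar height, so the bounds are converted first.

```python
def to_yerr(accuracies, intervals):
    """Convert (lower, upper) bounds into distances below and above each bar."""
    below = [acc - lo for acc, (lo, hi) in zip(accuracies, intervals)]
    above = [hi - acc for acc, (lo, hi) in zip(accuracies, intervals)]
    return [below, above]

def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def plot_accuracies(names, accuracies, intervals):
    # Imported here so the helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(names, accuracies, yerr=to_yerr(accuracies, intervals), capsize=4)
    ax.set_ylabel("Dev accuracy")
    ax.set_ylim(0, 1)
    return fig

# Placeholder values, chosen only to show the mechanics.
names = ["model A", "model B"]
accuracies = [0.60, 0.70]
intervals = [(0.54, 0.66), (0.64, 0.76)]
print(intervals_overlap(intervals[0], intervals[1]))
```

Calling `plot_accuracies(names, accuracies, intervals)` in the notebook should draw the bars. Non-overlapping intervals are strong evidence of a real difference; overlapping intervals are a conservative check and do not by themselves rule one out.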

In [None]:
# example of how to train a classifier with the dictionary and unigram features
features = [political_dictionary_feature, unigram_feature]
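# evaluate returns the classifier and vocabulary as well as the predictions and targets
clf, vocab, predictions, targets = evaluate(x_train, x_dev, y_train, y_dev, features)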
lower_bound, upper_bound = binomial_test(predictions, targets)

# YOUR CODE TO EVALUATE MULTIPLE MODELS HERE

# YOUR CODE TO GENERATE PLOT HERE

---

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.

1.  Download a copy of the notebook file: `File > Download > Download .ipynb`.

2.  Print the notebook as PDF (via your browser, or tools like [nbconvert](https://nbconvert.readthedocs.io/en/latest/)).