<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/5.classification/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/5.classification/Classification.ipynb)

This notebook explores feature engineering for text classification, training a regularized logistic regression model for the binary classification task of predicting a movie's genre from its plot summary.

In [11]:
import json
import operator
from collections import Counter

import nltk
import pandas as pd
from scipy import sparse
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from tqdm import tqdm

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [2]:
!wget https://github.com/dbamman/anlp25/raw/refs/heads/main/data/movie.metadata.tsv
!wget https://github.com/dbamman/anlp25/raw/refs/heads/main/data/plot_summaries.txt


Read the movie metadata and return a dictionary mapping each genre to the set of movies tagged with that genre; print out the top 30 genres by frequency.

In [3]:
def read_metadata(metadata_filename):
    metadata = pd.read_csv(metadata_filename, sep="\t", names=["movie_id", "_", "title", "year", "box_office", "_1", "_2", "_3", "json_genres"])

    # JSON genre looks like: {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic thriller", "/m/09blyk": "Psychological thriller"}
    # Convert to ["Thriller", "Erotic thriller", "Psychological thriller"]
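    # pandas reads a missing field as NaN (a float), not None, so check for a string before parsing
    metadata.loc[:, "genres"] = metadata.json_genres.apply(lambda x: list(json.loads(x).values()) if isinstance(x, str) else [])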
    # Explode so that each row has one genre
    movie_genres_metadata = metadata[["movie_id", "genres"]].set_index("movie_id").explode("genres")

    # map from genre -> set(movie_ids)
    genre_movie_map = {genre: set(movie_ids) for genre, movie_ids in movie_genres_metadata.groupby("genres").groups.items()}

    # print out the 30 most frequent genres
    genre_counts = sorted([(genre, len(movie_ids)) for genre, movie_ids in genre_movie_map.items()], key=lambda x: x[1], reverse=True)
    for genre, count in genre_counts[:30]:
        print(count, "\t", genre)

    return genre_movie_map

In [4]:
metadata = read_metadata("movie.metadata.tsv")

31890 	 Drama
15609 	 Comedy
9747 	 Romance Film
8661 	 Black-and-white
8363 	 Thriller
8223 	 Action
7155 	 World cinema
6883 	 Short Film
6629 	 Indie
6601 	 Crime Fiction
4963 	 Horror
4935 	 Documentary
4728 	 Silent film
4692 	 Adventure
4556 	 Action/Adventure
4146 	 Family Film
3981 	 Musical
3677 	 Comedy film
3345 	 Romantic drama
3052 	 Mystery
3005 	 Animation
2855 	 Science Fiction
2630 	 Romantic comedy
2616 	 Fantasy
2531 	 War film
2156 	 Western
2108 	 Crime Thriller
2087 	 Japanese Movies
1742 	 Period piece
1678 	 Comedy-drama
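
As a small illustration of the parsing step inside `read_metadata`, the sketch below applies the same `json.loads(...).values()` call to the example genre field quoted in the code comment. The helper name `parse_genres` is hypothetical and is not part of the notebook.

```python
import json

# Example genre field, copied from the comment in read_metadata.
raw = '{"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic thriller", "/m/09blyk": "Psychological thriller"}'

def parse_genres(json_genres):
    """Hypothetical helper: return the genre names stored in one JSON genre field."""
    # The ids are the keys; the human-readable genre names are the values.
    return list(json.loads(json_genres).values())

genres = parse_genres(raw)
print(genres)
```

A movie with no genre tags would carry an empty JSON object, which this helper turns into an empty list.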


Subset the metadata to the movies tagged with exactly one of the two genres selected for binary classification; movies tagged with both genres are dropped.

In [5]:
def filter_for_genres(metadata, genre_a, genre_b):
    # get all movies in genre_a and not in genre_b
    # remember, our metadata is of type dict[str, set]
    # so we can use set difference
    genre_a_movies = metadata[genre_a] - metadata[genre_b]

    # get all movies in genre_b but not in genre_a
    genre_b_movies = metadata[genre_b] - metadata[genre_a]

    labels = {
        a_movie: 1 for a_movie in genre_a_movies
    }
    labels.update({
        b_movie: 0 for b_movie in genre_b_movies
    })

    return labels

In [6]:
genre1 = "Romantic comedy"
genre2 = "Science Fiction"

labels = filter_for_genres(metadata, genre1, genre2)
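
To make the set-difference logic concrete, here is a minimal sketch on a toy genre map. The movie ids and the `toy_` names are made up for illustration.

```python
# Toy genre -> set(movie_ids) map; movie 3 carries both genres.
toy_metadata = {
    "Romantic comedy": {1, 2, 3},
    "Science Fiction": {3, 4},
}

# Keep only the movies that belong to one genre and not the other.
only_a = toy_metadata["Romantic comedy"] - toy_metadata["Science Fiction"]
only_b = toy_metadata["Science Fiction"] - toy_metadata["Romantic comedy"]

toy_labels = {movie: 1 for movie in only_a}
toy_labels.update({movie: 0 for movie in only_b})

print(sorted(toy_labels.items()))
```

Movie 3 is absent from the result because it is ambiguous for this binary task.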

Read the movie summary data and tokenize the summaries of the movies that received a label above. Return a list of tokenized summaries and a corresponding list of their binary labels.

In [7]:
def read_data(plot_filename, labels):

    summaries = pd.read_csv(plot_filename, sep="\t", names=["movie_id", "summary"])
    # insert label
    summaries.loc[:, "label"] = summaries.movie_id.apply(lambda x: labels[x] if x in labels else -1)
    summaries = summaries.set_index("movie_id")

    def tokenize_and_process(text):
        return nltk.word_tokenize(text.lower())

    X = []
    Y = []
    for summary in tqdm(summaries.itertuples(), total=len(summaries)):
        if summary.Index not in labels:
            continue
        X.append(tokenize_and_process(summary.summary))
        Y.append(labels[summary.Index])

    return X, Y

In [12]:
X, Y = read_data("plot_summaries.txt", labels)

100%|██████████| 42303/42303 [00:11<00:00, 3820.56it/s]


Split the data into training and validation sets (hold out 20% of the data for evaluation).

In [13]:
trainX, devX, trainY, devY = train_test_split(X, Y, test_size=0.2, random_state=0)

In [14]:
def build_features(dataX, feature_functions):
    """ Featurize the data according to the list of parameter feature_functions.

    Each feature_fn takes in a list of tokens and outputs a dictionary mapping strings to numbers.
    """

    data = []
    for tokens in dataX:
        feats = {}

        for feature_fn in feature_functions:
            feats.update(feature_fn(tokens))

        data.append(feats)
    return data

In [15]:
def features_to_ids(data, feature_vocab):
    """
    Convert a list of feature dictionaries (one per document, mapping feature
    names to values) to a sparse matrix that we can fit in a scikit-learn model.
    This is important because almost all feature values will be 0 for most
    documents (note: why?), and we don't want to save them all in memory.
    """
    new_data = sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx, doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx, feature_vocab[f]] = doc[f]
    return new_data
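
The sketch below uses plain Python to show what a sparse representation has to store for a toy data set: only the nonzero (row, column, value) entries. The documents and vocabulary are invented for illustration, and the code does not reproduce how `scipy.sparse` lays out its data internally.

```python
toy_docs = [
    {"UNIGRAM_love": 1, "UNIGRAM_wedding": 1},
    {"UNIGRAM_alien": 1},
    {"UNIGRAM_love": 1, "UNIGRAM_planet": 1},
]
toy_vocab = {"UNIGRAM_love": 0, "UNIGRAM_wedding": 1, "UNIGRAM_alien": 2, "UNIGRAM_planet": 3}

# One (row, column, value) triple per nonzero entry of the document-feature matrix.
triples = [
    (row, toy_vocab[feat], value)
    for row, doc in enumerate(toy_docs)
    for feat, value in doc.items()
    if feat in toy_vocab
]

# Fraction of matrix cells that are nonzero.
density = len(triples) / (len(toy_docs) * len(toy_vocab))
print(len(triples), "nonzero entries; density = %.3f" % density)
```

With a real vocabulary of up to 100,000 features and summaries of a few hundred tokens, the density would be far lower than in this toy case.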

In [16]:
def create_vocab(data, top_n=None):
    """
    This helper function assigns a unique numerical id to each feature name.
    top_n limits the vocabulary to the n most frequent features observed in the training data
    (in terms of the number of documents that contain each feature).
    """
    counts = Counter()
    for doc in data:
        for feat in doc:
            counts[feat] += 1

    feature_vocab = {}

    for idx, (k, v) in enumerate(counts.most_common(top_n)):
        feature_vocab[k] = idx

    return feature_vocab
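
As a quick check of how document frequency and `top_n` interact, the following sketch counts features over three toy documents and keeps the two most frequent. The documents are invented, and the dictionary comprehension is just a compact stand-in for the loop in `create_vocab`.

```python
from collections import Counter

toy_docs = [
    {"UNIGRAM_the": 1, "UNIGRAM_love": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_alien": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_love": 1},
]

# Each document contributes at most 1 to a feature's count, so this is document frequency.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(doc.keys())

# Keep the 2 most frequent features and number them in order of frequency.
toy_vocab = {feat: idx for idx, (feat, _) in enumerate(doc_freq.most_common(2))}
print(toy_vocab)
```

`UNIGRAM_alien` appears in only one document, so it falls outside the top 2 and would be ignored by `features_to_ids`.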

In [17]:
def pipeline(trainX, devX, trainY, devY, feature_functions):

    """ This function evaluates a list of feature functions on the training/dev data arguments """

    trainX_feat = build_features(trainX, feature_functions)
    devX_feat = build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data.
    feature_vocab = create_vocab(trainX_feat, top_n=100000)

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    devX_ids = features_to_ids(devX_feat, feature_vocab)

    clf = linear_model.LogisticRegression(C=100, solver='lbfgs', penalty='l2', max_iter=10000)
    clf.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % clf.score(devX_ids, devY))

    return clf, feature_vocab
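
The `C=100` argument in `pipeline` is the inverse of the regularization strength, so a large `C` means a weak L2 penalty. The sketch below evaluates one common way of writing the L2-penalized logistic regression objective, assuming labels coded as -1/+1 and no intercept. The function name and toy data are hypothetical, and this is not scikit-learn's actual implementation.

```python
import math

def l2_logistic_objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum_i log(1 + exp(-y_i * w.x_i)), with y_i in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_term = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        data_term += math.log1p(math.exp(-margin))
    return penalty + C * data_term

# Toy data: two one-hot documents with opposite labels.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
w = [1.0, -1.0]

weak_fit = l2_logistic_objective(w, X, y, C=1)
strong_fit = l2_logistic_objective(w, X, y, C=100)
print(weak_fit, strong_fit)
```

Because `C` multiplies the data term, raising it makes fitting the training labels matter more relative to keeping the weights small.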

Let's create a simple dictionary-based feature: this feature value is set to 1 for a document whenever any word present in that dictionary appears in that document (and 0 otherwise).

In [18]:
comedy_dictionary = set(["comedy", "love", "date"])
scifi_dictionary = set(["science", "ship", "alien"])

def dictionary_feature(tokens):
    feats = {}
    for word in tokens:
        if word in comedy_dictionary:
            feats["word_in_comedy_dictionary"]=1
        if word in scifi_dictionary:
            feats["word_in_scifi_dictionary"]=1
    return feats

In [19]:
features = [dictionary_feature]
clf, vocab = pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.735
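
To see what the classifier actually receives from this feature, the sketch below applies an equivalent version of `dictionary_feature` (rewritten with set intersection) to two made-up token lists.

```python
comedy_dictionary = {"comedy", "love", "date"}
scifi_dictionary = {"science", "ship", "alien"}

def dictionary_feature(tokens):
    # Same output as the notebook's function, written with set intersection.
    feats = {}
    token_set = set(tokens)
    if token_set & comedy_dictionary:
        feats["word_in_comedy_dictionary"] = 1
    if token_set & scifi_dictionary:
        feats["word_in_scifi_dictionary"] = 1
    return feats

mixed = dictionary_feature("an alien falls in love on a ship".split())
empty = dictionary_feature("a quiet walk home".split())
print(mixed)
print(empty)
```

A summary that contains none of the six dictionary words yields an empty feature dictionary, so all such summaries look identical to the model and receive the same prediction.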


Is this accuracy good or bad?  We need to contextualize this performance against some baseline. A simple one to use is a *majority class* classifier: for every document in the development data, let's just predict whatever class appears most frequently in the training data.

In [20]:
def majority_class(trainY, devY):
    labelCounts = Counter()
    for label in trainY:
        labelCounts[label] += 1
    majority_class = labelCounts.most_common(1)[0][0]

    correct=0.
    for label in devY:
        if label == majority_class:
            correct += 1

    print("%s\t%.3f" % (majority_class, correct/len(devY)))

In [21]:
majority_class(trainY, devY)

0	0.523
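
The same baseline can be sketched in a few lines on toy labels. The label lists below are invented and unrelated to the movie data.

```python
from collections import Counter

toy_trainY = [0, 0, 1, 0, 1]
toy_devY = [0, 1, 1, 0]

# The majority class is chosen from the training labels only.
majority = Counter(toy_trainY).most_common(1)[0][0]

# Accuracy of always predicting that class on the held-out labels.
accuracy = sum(label == majority for label in toy_devY) / len(toy_devY)
print(majority, accuracy)
```

Any feature set worth keeping should beat this number on the same held-out split.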


In [22]:
def unigram_feature(tokens):
    feats = {}
    for word in tokens:
        feats["UNIGRAM_%s" % word] = 1
    return feats

In [23]:
features = [unigram_feature]
clf, vocab = pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.959


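Let's print out the 10 features with the strongest weights for each class. The most negative weights point toward label 0 (science fiction) and the most positive toward label 1 (romantic comedy).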

In [24]:
def print_weights(clf, vocab, n=10):
    weights = clf.coef_[0]
    reverse_vocab = [None] * len(weights)
    for k in vocab:
        reverse_vocab[vocab[k]] = k

    for feature, weight in sorted(zip(reverse_vocab, weights), key=lambda x: x[1])[:n]:
        print("%.3f\t%s" % (weight, feature))

    print()

    for feature, weight in list(sorted(zip(reverse_vocab, weights), key=lambda x: x[1], reverse=True))[:n]:
        print("%.3f\t%s" % (weight, feature))

In [25]:
print_weights(clf, vocab, n=10)

-3.660	UNIGRAM_earth
-3.237	UNIGRAM_scientist
-2.691	UNIGRAM_alien
-2.658	UNIGRAM_dr.
-2.444	UNIGRAM_planet
-2.419	UNIGRAM_world
-2.198	UNIGRAM_space
-1.964	UNIGRAM_mysterious
-1.867	UNIGRAM_future
-1.853	UNIGRAM_kill

3.354	UNIGRAM_love
2.576	UNIGRAM_relationship
2.322	UNIGRAM_marriage
2.058	UNIGRAM_wedding
1.921	UNIGRAM_money
1.843	UNIGRAM_marry
1.787	UNIGRAM_she
1.776	UNIGRAM_married
1.655	UNIGRAM_men
1.544	UNIGRAM_sex
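
To connect the sign of a weight to a prediction, the sketch below pushes two of the weights printed above through the logistic function. The intercept value of 0.0 and the helper name `prob_romantic_comedy` are assumptions made for illustration; the fitted intercept is not shown in the notebook output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two weights copied from the output above; the intercept is a hypothetical 0.0.
weights = {"UNIGRAM_love": 3.354, "UNIGRAM_earth": -3.660}
intercept = 0.0

def prob_romantic_comedy(features):
    """Probability of label 1 (romantic comedy) under the toy two-weight model."""
    score = intercept + sum(weights.get(f, 0.0) * v for f, v in features.items())
    return sigmoid(score)

p_love = prob_romantic_comedy({"UNIGRAM_love": 1})
p_earth = prob_romantic_comedy({"UNIGRAM_earth": 1})
print(round(p_love, 3), round(p_earth, 3))
```

A positive weight raises the probability of label 1, and a negative weight lowers it, which is why the two lists separate so cleanly by genre.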


In [44]:
def your_awesome_feature(tokens):
    feats = {}
    for word in tokens:
        feats["UNIGRAM_%s" % word] = 1
    return feats
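
The placeholder above reuses the unigram feature. One direction to experiment with, offered only as a sketch, is a bigram feature that records adjacent word pairs. The name `bigram_feature` and the `BIGRAM_` prefix are hypothetical, and no accuracy is claimed for it here.

```python
def bigram_feature(tokens):
    """Binary indicator for each pair of adjacent tokens in the document."""
    feats = {}
    # zip stops at the shorter sequence, so a one-token document yields no bigrams.
    for first, second in zip(tokens, tokens[1:]):
        feats["BIGRAM_%s_%s" % (first, second)] = 1
    return feats

example = bigram_feature(["fall", "in", "love"])
print(example)
```

Since `build_features` merges the dictionaries from every function in the list, this could be passed alongside the unigram feature, for example `features = [unigram_feature, bigram_feature]`.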


In [45]:
features=[your_awesome_feature]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.959


(LogisticRegression(C=100, max_iter=10000),
 {'UNIGRAM_.': 0,
  'UNIGRAM_a': 1,
  'UNIGRAM_,': 2,
  'UNIGRAM_the': 3,
  'UNIGRAM_to': 4,
  'UNIGRAM_and': 5,
  'UNIGRAM_of': 6,
  'UNIGRAM_in': 7,
  'UNIGRAM_is': 8,
  'UNIGRAM_with': 9,
  'UNIGRAM_his': 10,
  "UNIGRAM_'s": 11,
  'UNIGRAM_that': 12,
  'UNIGRAM_he': 13,
  'UNIGRAM_for': 14,
  'UNIGRAM_by': 15,
  'UNIGRAM_on': 16,
  'UNIGRAM_as': 17,
  'UNIGRAM_who': 18,
  'UNIGRAM_an': 19,
  'UNIGRAM_but': 20,
  'UNIGRAM_her': 21,
  'UNIGRAM_from': 22,
  'UNIGRAM_has': 23,
  'UNIGRAM_him': 24,
  'UNIGRAM_at': 25,
  'UNIGRAM_they': 26,
  'UNIGRAM_when': 27,
  'UNIGRAM_are': 28,
  'UNIGRAM_it': 29,
  'UNIGRAM_their': 30,
  'UNIGRAM_after': 31,
  'UNIGRAM_she': 32,
  'UNIGRAM_into': 33,
  'UNIGRAM_out': 34,
  'UNIGRAM_be': 35,
  'UNIGRAM_up': 36,
  "UNIGRAM_''": 37,
  'UNIGRAM_``': 38,
  'UNIGRAM_one': 39,
  'UNIGRAM_not': 40,
  'UNIGRAM_them': 41,
  'UNIGRAM_while': 42,
  'UNIGRAM_which': 43,
  'UNIGRAM_have': 44,
  'UNIGRAM_about': 45,
  'U