[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/1.words/EvaluateTokenizationForSentiment.ipynb)

# The impact of tokenization on downstream tasks

Tokenization can have a big impact on downstream model performance. Here, we look at different methods for tokenization and stemming/lemmatization and evaluate how they affect the performance on a simple binary sentiment classification task.

We use a train/dev dataset of 1000 reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

Each tokenization method is evaluated on the same learning algorithm ($l_2$-regularized logistic regression); the only difference is the tokenization process.

For more, see: http://sentiment.christopherpotts.net/tokenizing.html.
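
As a reminder of what the classifier optimizes, here is one common way to write the $l_2$-regularized logistic regression objective. This is a sketch under assumptions not stated above: labels are coded as $y_i \in \{-1, +1\}$, $x_i$ is the bag-of-words vector for review $i$, and $C$ is the inverse regularization strength (the parameterization scikit-learn uses).

$$
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i\,(w^\top x_i + b)\right)\right)
$$

Smaller values of $C$ mean stronger regularization. Because this objective is held fixed across runs, any change in accuracy comes from how the tokenizer defines the features $x_i$.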

In [1]:
# download code and data
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/1.words/happyfuntokenizing.py

!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/sentiment.1000.train.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/sentiment.1000.dev.txt

--2025-09-02 23:44:14--  https://raw.githubusercontent.com/dbamman/anlp25/main/1.words/happyfuntokenizing.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7510 (7.3K) [text/plain]
Saving to: ‘happyfuntokenizing.py’


2025-09-02 23:44:14 (63.1 MB/s) - ‘happyfuntokenizing.py’ saved [7510/7510]

--2025-09-02 23:44:14--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/sentiment.1000.train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1320314 (1.3M) [text/plain]
Saving to: ‘sentiment.1000.train.txt’


2025-09-02 23:4

In [2]:
# make sure dependencies are installed
!pip install nltk
!pip install spacy
!pip install scikit-learn



In [3]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

import spacy
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model


from happyfuntokenizing import Tokenizer as potts

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## Setting up evaluation
We'll set up a class that we can use to test different tokenization methods.

In [4]:
class TokenizationTest():

    def __init__(self, train_file, dev_file):
        self.train_file = train_file
        self.dev_file = dev_file
        self.count_vectorizer = CountVectorizer(
            max_features=10_000,
            analyzer=lambda x: x,
            lowercase=False,
            strip_accents=None,
            binary=True
        )
        self.label_encoder = LabelEncoder()

    def read_data(self, filename, tokenizer):
        tokenized_text = []
        labels = []

        with open(filename, encoding="utf-8") as file:
            for line in file:
                cols = line.rstrip().split("\t")
                label = cols[0]
                text = cols[1]
                tokens = list(tokenizer(text))
                tokenized_text.append(tokens)
                labels.append(label)
        return tokenized_text, labels

    def evaluate(self, tokenizer):
        train_tokens, train_labels = self.read_data(self.train_file, tokenizer)
        dev_tokens, dev_labels = self.read_data(self.dev_file, tokenizer)

        X_train = self.count_vectorizer.fit_transform(train_tokens)
        X_dev = self.count_vectorizer.transform(dev_tokens)

        self.label_encoder.fit(train_labels)
        Y_train = self.label_encoder.transform(train_labels)
        Y_dev = self.label_encoder.transform(dev_labels)

        model = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2')
        model.fit(X_train, Y_train)
        print("Function '%s' Accuracy: %.3f" % (tokenizer.__name__, model.score(X_dev, Y_dev)))

## Setting up tokenizers

Now let's set up our tokenizers. Each tokenizer should take as input a string and output a list of strings. We'll try six different tokenization methods.

1. Splitting on whitespace with `str.split()`
2. Splitting on whitespace, then stemming with the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/)
3. Using [`nltk.word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html)
4. Using the [`spacy` tokenizer](https://spacy.io/usage/linguistic-features#how-tokenizer-works)
5. Using the [`spacy` tokenizer](https://spacy.io/usage/linguistic-features#how-tokenizer-works) with [lemmatization](https://spacy.io/api/lemmatizer)
6. Using the [Potts tokenizer](http://sentiment.christopherpotts.net/tokenizing.html) (implemented for you in `happyfuntokenizing.py`)

Note: evaluating the spaCy tokenizers might take ~1 minute.
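
To make the string-in, list-of-strings-out interface concrete, here is a small sketch of two functions that satisfy it. The regex tokenizer is a simplified stand-in written for illustration (the name `tokenize_with_regex` and its pattern are assumptions, and it is not how NLTK, spaCy, or the Potts tokenizer work internally); it only shows why splitting on whitespace alone leaves punctuation glued to words.

```python
import re

# Hypothetical pattern: runs of word characters, or single non-space symbols.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_with_whitespace(data: str) -> list[str]:
    """Split on whitespace only; punctuation stays attached to words."""
    return data.split()

def tokenize_with_regex(data: str) -> list[str]:
    """Separate punctuation marks from words with a simple regex."""
    return TOKEN_PATTERN.findall(data)

text = "Great movie, loved it!"
whitespace_tokens = tokenize_with_whitespace(text)
regex_tokens = tokenize_with_regex(text)

print(whitespace_tokens)  # ['Great', 'movie,', 'loved', 'it!']
print(regex_tokens)       # ['Great', 'movie', ',', 'loved', 'it', '!']
```

With whitespace splitting, `movie,` and `movie` become different features, which spreads the counts for one word across several vocabulary entries.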

In [5]:
# load NLTK porter stemmer
stemmer = PorterStemmer()
def tokenize_with_porter(data):
    return [
        stemmer.stem(word) for word in data.split()
    ]

In [6]:
# spaCy lemmatization needs the tagger, so disable only the components we don't use
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def tokenize_with_spacy(data):
    spacy_tokens = nlp(data)
    return [token.text for token in spacy_tokens]

def tokenize_with_spacy_lemma(data):
    spacy_tokens = nlp(data)
    return [token.lemma_ for token in spacy_tokens]

In [7]:
# load Potts sentiment tokenizer
potts_tokenizer = potts()
def tokenize_with_potts(data):
    return list(potts_tokenizer.tokenize(data))

## Testing the tokenizers

In [8]:
tester = TokenizationTest("sentiment.1000.train.txt", "sentiment.1000.dev.txt")

In [9]:
tester.evaluate(str.split)

Function 'split' Accuracy: 0.858


In [10]:
tester.evaluate(tokenize_with_porter)

Function 'tokenize_with_porter' Accuracy: 0.866


In [11]:
tester.evaluate(nltk.word_tokenize)

Function 'word_tokenize' Accuracy: 0.874


In [12]:
tester.evaluate(tokenize_with_spacy)

Function 'tokenize_with_spacy' Accuracy: 0.872


In [13]:
tester.evaluate(tokenize_with_spacy_lemma)

Function 'tokenize_with_spacy_lemma' Accuracy: 0.872


In [14]:
tester.evaluate(tokenize_with_potts)

Function 'tokenize_with_potts' Accuracy: 0.883


## Extra

Inspect the output of some of these tokenizers. How do different tokenizers handle some of the issues we talked about in lecture (e.g., punctuation, emoticons, casing)?

In [15]:
tokenize_with_potts("I love INFO 256!")  # modify this to test different tokenizers / different strings

['i', 'love', 'info', '256', '!']
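
One way to do this kind of inspection is to run several tokenizers on the same string and compare the results side by side. The helper below is a hypothetical sketch (`compare_tokenizers` and `lowercase_split` are not part of the notebook); it uses only built-in functions so it runs on its own, but any of the tokenizer functions defined above could be passed in the same way.

```python
from typing import Callable

def lowercase_split(data: str) -> list[str]:
    """Lowercase the text, then split on whitespace."""
    return data.lower().split()

def compare_tokenizers(text: str, tokenizers: list[Callable]) -> dict[str, list[str]]:
    """Map each tokenizer's function name to the tokens it produces for `text`."""
    return {tok.__name__: list(tok(text)) for tok in tokenizers}

results = compare_tokenizers("I LOVED it :) !!!", [str.split, lowercase_split])
for name, tokens in results.items():
    print(f"{name:>16}: {tokens}")
```

In this example, whitespace splitting keeps the emoticon `:)` intact, but `LOVED` and `loved` would count as separate vocabulary items unless the text is lowercased first.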

The Potts tokenizer was designed with web text in mind, with special hand-crafted rules for emoticons, HTML tags, and hashtags. Can you approach the performance of the Potts tokenizer (>0.88) by combining some of the other methods we test?

In [16]:
def my_tokenizer(data: str) -> list[str]:
    """Tokenize the `data` string into a list of strings."""
    # your code here
    return ["implement", "me"]

tester.evaluate(my_tokenizer)

Function 'my_tokenizer' Accuracy: 0.500
