<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/6.tests/Bootstrap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/6.tests/Bootstrap.ipynb)

This notebook explores the use of the bootstrap to create confidence intervals for any statistic of interest that is estimated from data.  

For the classification model that you developed, use the bootstrap to put 95% confidence intervals around your measure of validity.

In [1]:
import sys
from collections import Counter
from math import sqrt
from random import choices

import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize
from scipy import sparse
from scipy.stats import norm
from sklearn import linear_model, preprocessing

nltk.download("punkt")
# the tokenizer tables resource is named "punkt_tab" (with an underscore)
nltk.download("punkt_tab")

In [None]:
# get LMRD data
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/train.tsv -O lmrd_train.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/dev.tsv -O lmrd_dev.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/lmrd/test.tsv -O lmrd_test.tsv

In [2]:
# get Convote data
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/train.tsv -O convote_train.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/dev.tsv -O convote_dev.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/convote/test.tsv -O convote_test.tsv


In [None]:
# get LoC data
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/train.tsv -O loc_train.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/dev.tsv -O loc_dev.tsv
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/loc/test.tsv -O loc_test.tsv

In [3]:
def read_data(filename):
    df = pd.read_csv(filename, names=["label", "text"], sep="\t")
    return df.text.to_list(), df.label.to_list()

In [4]:
# Change this to the prefix of the dataset you will be using ("lmrd", "convote" or "loc").
# The files <prefix>_train.tsv and <prefix>_dev.tsv must have been downloaded above.
data = "convote"

x_train, y_train = read_data("%s_train.tsv" % data)
x_dev, y_dev = read_data("%s_dev.tsv" % data)

In [5]:
## HELPER FUNCTIONS ##

def build_features(x_train, feature_functions):
    data = []
    for doc in x_train:
        feats = {}
        tokens = doc.split(" ")

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

# This helper function converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab = {}
    idx = 0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat] = idx
                idx += 1

    return feature_vocab

# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature
# values will be 0 for most documents (note: why?), and we don't want to save them all in
# memory.

def features_to_ids(data, feature_vocab):
    new_data = sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx, feature_vocab[f]] = doc[f]
    return new_data
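
The sparsity that motivates the sparse matrix above can be seen on a toy example. The sketch below is illustrative only: the three documents are made up, and it uses plain dictionaries rather than `scipy.sparse`, storing only the nonzero cells as `(row, column)` keys.

```python
# Toy documents (hypothetical, not taken from the course data).
docs = [
    "the bill cuts spending",
    "the economy needs growth",
    "we oppose the bill",
]

# Binary unigram features, one dict per document.
feats = [{"UNIGRAM_%s" % tok: 1 for tok in doc.split(" ")} for doc in docs]

# Give each distinct feature name an integer id, in order of first appearance.
vocab = {}
for doc_feats in feats:
    for name in doc_feats:
        vocab.setdefault(name, len(vocab))

# Keep only the nonzero cells, keyed by (row, column).
nonzero = {
    (row, vocab[name]): value
    for row, doc_feats in enumerate(feats)
    for name, value in doc_feats.items()
}

total_cells = len(docs) * len(vocab)
density = len(nonzero) / total_cells
print("vocab size: %d, nonzero cells: %d of %d (%.1f%%)"
      % (len(vocab), len(nonzero), total_cells, 100 * density))
```

Even with three short documents, fewer than half of the cells are nonzero. On a real corpus the vocabulary grows much faster than the length of any single document, so the nonzero fraction becomes far smaller.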

In [6]:
# This function trains a model and returns the predicted and true labels for the dev data
def evaluate(x_train, x_dev, y_train, y_dev, feature_functions):
    x_train_feat = build_features(x_train, feature_functions)
    x_dev_feat = build_features(x_dev, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab = create_vocab(x_train_feat)

    x_train_ids = features_to_ids(x_train_feat, feature_vocab)
    x_dev_ids = features_to_ids(x_dev_feat, feature_vocab)

    label_encoder = preprocessing.LabelEncoder()
    label_encoder.fit(y_train)

    y_train = label_encoder.transform(y_train)
    y_dev = label_encoder.transform(y_dev)

    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(x_train_ids, y_train)
    predictions = logreg.predict(x_dev_ids)
    return (predictions, y_dev)


## Metrics

In [7]:
def accuracy(targets, predictions):
    correct = [
        int(pred == target) for pred, target in zip(predictions, targets)
    ]

    return sum(correct) / len(correct)

In [8]:
def F1(targets, predictions):

    true_positives = 0
    pred_positives = 0
    relevant = 0

    for pred, target in zip(predictions, targets):
        if pred == 1 and pred == target:
            true_positives += 1
        if target == 1:
            relevant += 1
        if pred == 1:
            pred_positives += 1

    precision = true_positives / pred_positives if pred_positives > 0 else 0
    recall = true_positives / relevant if relevant > 0 else 0
    f = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return f
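
As a sanity check on the two metrics above, here is a small worked example on made-up labels. The helper names `confusion_counts` and `f1_from_counts` are hypothetical and not part of the notebook; the sketch assumes, like `F1` above, that the positive class is encoded as 1. It also uses the equivalent form F1 = 2TP / (2TP + FP + FN).

```python
def confusion_counts(targets, predictions, positive=1):
    """Count true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for target, pred in zip(targets, predictions):
        if pred == positive and target == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif target == positive:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of the confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Made-up labels: 8 examples, 4 of them positive.
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 0, 0, 1, 0, 0, 0]

tp, fp, fn = confusion_counts(toy_targets, toy_predictions)
precision = tp / (tp + fp)   # 2 / 3
recall = tp / (tp + fn)      # 2 / 4
f1 = f1_from_counts(tp, fp, fn)
acc = sum(t == p for t, p in zip(toy_targets, toy_predictions)) / len(toy_targets)
print("precision=%.3f recall=%.3f F1=%.3f accuracy=%.3f" % (precision, recall, f1, acc))

# Swapping the two arguments swaps precision and recall but leaves F1 unchanged.
f1_swapped = f1_from_counts(*confusion_counts(toy_predictions, toy_targets))
```

Here TP = 2, FP = 1 and FN = 2, so F1 = 4/7 ≈ 0.571 while accuracy is 5/8 = 0.625.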

## Model

Specify features for model and train logistic regression

In [23]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)

# EDIT TO FIT YOUR DATASET
# I added back the terms I used in HW5 for this feature
dem_dictionary = set(["republican","cut", "opposition", "programs", "spending"])
repub_dictionary = set(["growth","economy", "budget", "business"])

def political_dictionary_feature(tokens):
    feats = {}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"] = 1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"] = 1
    return feats

def unigram_feature(tokens):
    feats = {}
    for word in tokens:
        feats["UNIGRAM_%s" % word] = 1
    return feats

# below are custom features from HW5, to undergo testing
def bigram_feature(tokens):
    feats = {}
    for i in range(len(tokens) - 1):
        feats["BIGRAM_%s_%s" % (tokens[i], tokens[i + 1])] = 1
    return feats

def document_length_feature(tokens):
    feats={}
    feats["document_length"] = len(tokens)
    return feats
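
To see what these feature functions produce and how `build_features` merges them, here is a minimal sketch on a made-up three-token document. The two functions are compact re-implementations written for this illustration; they are assumed to yield the same keys as the versions above.

```python
def unigram_feature(tokens):
    # One binary feature per distinct token.
    return {"UNIGRAM_%s" % word: 1 for word in tokens}


def bigram_feature(tokens):
    # One binary feature per distinct pair of adjacent tokens.
    return {"BIGRAM_%s_%s" % (a, b): 1 for a, b in zip(tokens, tokens[1:])}


tokens = "cut spending now".split(" ")

# Merge the dictionaries the same way build_features does, with dict.update.
feats = {}
for function in (unigram_feature, bigram_feature):
    feats.update(function(tokens))

for name in sorted(feats):
    print(name, feats[name])
```

Because the feature names carry distinct prefixes (`UNIGRAM_`, `BIGRAM_`), `update` never overwrites one function's features with another's.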

In [24]:
features = [political_dictionary_feature, unigram_feature, bigram_feature, document_length_feature]
predictions, targets = evaluate(x_train, x_dev, y_train, y_dev, features)

First, let's compute a parametric confidence interval for accuracy. The normal approximation is justified here by the central limit theorem, because accuracy is the mean of many 0/1 (incorrect/correct) outcomes.

In [25]:
def binomial_cis(predictions, targets, confidence_level=0.95):
    correct = [int(prediction == target) for prediction, target in zip(predictions, targets)]

    success_rate = np.mean(correct)

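    # two-tailed interval: (1 - confidence_level) / 2 of the probability mass goes in each tail
    # (despite its name, critical_value holds that tail probability; z_alpha is the critical value)
    critical_value = (1 - confidence_level) / 2
    # ppf finds z such that P(Z < z) = critical_value for a standard normal Z;
    # negating it gives the positive quantile
    z_alpha = -1 * norm.ppf(critical_value)
    print("Critical value: %.3f\tz_alpha: %.3f" % (critical_value, z_alpha))

    # the standard error is the square root of (the variance/sample size)
    # the variance of a single correct/incorrect (Bernoulli) outcome is p*(1-p)
    standard_error = sqrt((success_rate * (1-success_rate)) / len(correct))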

    lower = success_rate - z_alpha * standard_error
    upper = success_rate + z_alpha * standard_error

    return lower, upper


In [26]:
binomial_cis(predictions, targets, confidence_level=0.95)

Critical value: 0.025	z_alpha: 1.960


(np.float64(0.6608194031545993), np.float64(0.7710872116313929))
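
The arithmetic behind this interval, p ± z · sqrt(p(1 − p)/n), can be traced by hand. The sketch below uses hypothetical counts (150 correct out of 200, not the notebook's dev set) and the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist

n = 200          # hypothetical number of evaluated examples
n_correct = 150  # hypothetical number of correct predictions
confidence_level = 0.95

p_hat = n_correct / n                        # observed accuracy: 0.75
tail = (1 - confidence_level) / 2            # 0.025 in each tail
z = NormalDist().inv_cdf(1 - tail)           # upper-tail quantile, about 1.96
standard_error = sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - z * standard_error
upper = p_hat + z * standard_error
print("z = %.3f, SE = %.4f, interval = [%.3f, %.3f]" % (z, standard_error, lower, upper))
# prints: z = 1.960, SE = 0.0306, interval = [0.690, 0.810]
```

Since the standard error shrinks with the square root of n, quadrupling the number of evaluated examples would only halve the width of the interval.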

Here we'll use the bootstrap to create confidence intervals at a specified confidence level for any function `metric(truth, predictions)` where *truth* is an array of true labels for a set of data points, and *predictions* is an array of predicted labels for those same points.  This `bootstrap` function returns a tuple of (lower, median, upper), where *lower* is the lower confidence bound, *upper* is the upper confidence bound, and *median* is the median value of the metric among the bootstrap resamples.

In [27]:
def bootstrap(predictions, targets, metric, B=10000, confidence_level=0.95):
    critical_value = (1 - confidence_level) / 2
    lower_sig = 100 * critical_value
    upper_sig = 100 * (1 - critical_value)
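    # pair each true label with its prediction so that they are resampled together
    data = []
    for g, p in zip(targets, predictions):
        data.append([g, p])

    values = []

    for b in range(B):
        # draw len(data) pairs with replacement
        choice = choices(data, k=len(data))
        choice = np.array(choice)
        # column 0 holds the true labels, column 1 the predictions
        value = metric(choice[:,0], choice[:,1])

        values.append(value)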

    percentiles = np.percentile(values, [lower_sig, 50, upper_sig])

    lower = percentiles[0]
    median = percentiles[1]
    upper = percentiles[2]

    return lower, median, upper
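
For experimenting outside the notebook, the same percentile bootstrap can be sketched with the standard library alone. This is an illustrative variant, not the implementation above: the names `bootstrap_ci`, `percentile` and `toy_accuracy` are hypothetical, it resamples indices rather than label pairs, it takes a `seed` so runs are repeatable, and it uses a nearest-rank percentile instead of the linear interpolation that `np.percentile` applies by default.

```python
import random
from statistics import median


def toy_accuracy(targets, predictions):
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an already sorted list."""
    rank = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[rank]


def bootstrap_ci(targets, predictions, metric, B=2000, confidence_level=0.95, seed=0):
    rng = random.Random(seed)  # private generator, so the result is reproducible
    n = len(targets)
    values = []
    for _ in range(B):
        # Resample n example indices with replacement, keeping each
        # true label paired with its own prediction.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([targets[i] for i in idx],
                             [predictions[i] for i in idx]))
    values.sort()
    tail = 100 * (1 - confidence_level) / 2
    return percentile(values, tail), median(values), percentile(values, 100 - tail)


# Made-up data: 100 examples, 80 of them predicted correctly.
toy_targets = [1] * 50 + [0] * 50
toy_predictions = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10

lower, mid, upper = bootstrap_ci(toy_targets, toy_predictions, toy_accuracy)
print("%.3f, 95%% bootstrap interval: [%.3f, %.3f]" % (mid, lower, upper))
```

Fixing the seed makes the interval reproducible; changing it shows how much the bounds move from resampling noise alone, which is one way to judge whether `B` is large enough.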


We can use that bootstrap implementation to generate confidence intervals for accuracy and F1 score for the predictions made above.

In [28]:
confidence_level = 0.95
# bootstrap expects (predictions, targets), in that order
lower, median, upper = bootstrap(predictions, targets, accuracy, B=10000, confidence_level=confidence_level)
print("%.3f, %s%% Bootstrap confidence interval: [%.3f, %.3f]" % (median, confidence_level*100, lower, upper))
# For reference, the output with only political_dictionary_feature is:
# 0.541, 95.0% Bootstrap confidence interval: [0.479, 0.603]
# The interval below, [0.658, 0.770], lies entirely above that one,
# which suggests a real accuracy gain from the added features.

0.716, 95.0% Bootstrap confidence interval: [0.658, 0.770]


In [29]:
confidence_level = 0.95
# bootstrap expects (predictions, targets), in that order
lower, median, upper = bootstrap(predictions, targets, F1, B=10000, confidence_level=confidence_level)
print("%.3f, %s%% Bootstrap confidence interval: [%.3f, %.3f]" % (median, confidence_level*100, lower, upper))
# For reference, the output of the previous model is:
# 0.659, 95.0% Bootstrap confidence interval: [0.599, 0.715]
# The median of 0.735 below lies just above that model's upper bound (0.715),
# but the two intervals overlap, so the evidence for an F1 gain is weaker
# than it is for accuracy.

0.735, 95.0% Bootstrap confidence interval: [0.672, 0.790]
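
Comparing two separately computed intervals, as in the comments above, is only a rough check. When both models are scored on the same dev examples, one more direct option is to bootstrap the difference itself. The sketch below is an assumption-laden illustration rather than part of the assignment: `paired_bootstrap_diff` is a hypothetical helper, and the per-example correctness vectors are made up.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, B=2000, confidence_level=0.95, seed=0):
    """Percentile interval for accuracy(B) - accuracy(A) on the same examples."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(B):
        # Resample example indices once and score both models on the same
        # resample, so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    tail = (1 - confidence_level) / 2
    lower = diffs[round(tail * (B - 1))]
    upper = diffs[round((1 - tail) * (B - 1))]
    return lower, upper


# Hypothetical per-example correctness (1 = correct) on 200 shared dev items:
# model A is right on the first 110 items, model B on the first 170.
correct_a = [1] * 110 + [0] * 90
correct_b = [1] * 170 + [0] * 30

diff_lower, diff_upper = paired_bootstrap_diff(correct_a, correct_b)
print("accuracy difference, 95%% interval: [%.3f, %.3f]" % (diff_lower, diff_upper))
```

If the interval for the difference excludes 0, the resamples consistently favor one model. Because both models are scored on identical resamples, this check is usually more sensitive than asking whether two separate intervals overlap.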
