# Zero Shot Classification for Detecting Book Titles

This notebook uses a natural language inference (NLI) model as a zero-shot classifier to find comments that mention book titles.
It doesn't work perfectly, but its predictions are useful as a starting point for active learning.

It requires around 2 hours to classify 2 million comments on an RTX 5000 using a MiniLM based NLI model.

# Load the Data

In [1]:
from pathlib import Path
import pandas as pd
import xxhash

Read in all Hacker News comments and stories from 2021, which [can be downloaded from Kaggle](https://www.kaggle.com/datasets/edwardjross/hackernews-2021-comments-and-stories) (extracted from the BigQuery dataset).

In [2]:
df = pd.read_parquet('../data/01_raw/hackernews2021.parquet').set_index('id')

In [3]:
df

| id | title | url | text | dead | by | score | time | timestamp | type | parent | descendants | ranking | deleted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27405131 | | | `They didn&#x27;t say they <i>weren&#x27;t</i> ...` | | chrisseaton | | 1622901869 | 2021-06-05 14:04:29+00:00 | comment | 27405089.0 | | | |
| 27814313 | | | `Check out <a href="https:&#x2F;&#x2F;www.remno...` | | noyesno | | 1626119705 | 2021-07-12 19:55:05+00:00 | comment | 27812726.0 | | | |
| 28626089 | | | `Like a million-dollars pixel but with letters....` | | alainchabat | | 1632381114 | 2021-09-23 07:11:54+00:00 | comment | 28626017.0 | | | |
| 27143346 | | | `Not the question...` | | SigmundA | | 1620920426 | 2021-05-13 15:40:26+00:00 | comment | 27143231.0 | | | |
| 29053108 | | | `There’s the Unorganized Militia of the United ...` | | User23 | | 1635636573 | 2021-10-30 23:29:33+00:00 | comment | 29052087.0 | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27367848 | | | `Housing supply isn’t something that can’t chan...` | | JCM9 | | 1622636746 | 2021-06-02 12:25:46+00:00 | comment | 27367172.0 | | | |
| 28052800 | | | `Final Fantasy XIV has been experiencing consta...` | | amyjess | | 1628017217 | 2021-08-03 19:00:17+00:00 | comment | 28050798.0 | | | |
| 28052805 | | | `How did you resolve it?` | | 8ytecoder | | 1628017238 | 2021-08-03 19:00:38+00:00 | comment | 28049375.0 | | | |
| 26704924 | | | `This hasn&#x27;t been my experience being vega...` | | pacomerh | | 1617657938 | 2021-04-05 21:25:38+00:00 | comment | 26704794.0 | | | |


# Split the Data

The data will be split deterministically by the root story.
This allows using features about the comment thread.

## Finding the root

For each comment the root can be found by walking up the parents recursively.

In [4]:
parent_dict = df['parent'].fillna(df.index.to_series()).to_dict()

root_dict = {}

for item, parent in parent_dict.items():
    while parent in parent_dict:
        grandparent = parent_dict[parent]
        if parent == grandparent:
            break
        parent = grandparent
    root_dict[item] = parent
    
df['root'] = df.index.map(root_dict)
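
As a quick sanity check of the loop above, here is a minimal sketch on a made-up thread. The ids are hypothetical: story 1 has a comment 2, which has a reply 3, while comment 5 replies to an item 4 that is missing from the data. The `find_roots` wrapper is only for illustration and is not part of the notebook.

```python
# Toy parent mapping; the ids are made up for illustration.
# A story is its own parent, mirroring the fillna step above.
parent_dict = {1: 1, 2: 1, 3: 2, 5: 4}

def find_roots(parent_dict):
    root_dict = {}
    for item, parent in parent_dict.items():
        # Walk up until we reach a story (its own parent)
        # or an id that is not in the data.
        while parent in parent_dict:
            grandparent = parent_dict[parent]
            if parent == grandparent:
                break
            parent = grandparent
        root_dict[item] = parent
    return root_dict

roots = find_roots(parent_dict)
print(roots)  # {1: 1, 2: 1, 3: 1, 5: 4}
```

Note that comment 5 gets the missing item 4 as its root: when an ancestor is absent from the data, the walk stops at the oldest id it can reach.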

## Deterministic Splitting

The hash of the root id with a fixed salt gives a deterministic random split.
Choose a 50% training set.

In [5]:
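def bucket(s, salt='hnbooks'):
    # Salted hash of the root id, reduced to a bucket in 0..99
    return xxhash.xxh32_intdigest(str(s)+salt) % 100

# Use a separate name so the function `bucket` is not overwritten
buckets = df['root'].apply(bucket)

df['bucket'] = buckets

df['train'] = buckets < 50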
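
To illustrate why a salted hash of the root gives a stable split, here is a small sketch that swaps in `hashlib.md5` from the standard library for xxhash, so it runs without extra dependencies. The helper name `bucket_of` and the root ids are hypothetical, and the bucket values will differ from the xxhash ones; only the property matters.

```python
import hashlib

def bucket_of(root_id, salt='hnbooks'):
    # Stand-in for xxhash: any stable hash gives the same property.
    digest = hashlib.md5((str(root_id) + salt).encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % 100

roots = [101.0, 101.0, 202.0, 303.0]  # hypothetical root ids
buckets = [bucket_of(r) for r in roots]
is_train = [b < 50 for b in buckets]

# Comments sharing a root always share a bucket, so a whole thread
# ends up on the same side of the split.
print(buckets[0] == buckets[1])
```

Python's built-in `hash()` would not work here for strings, because it is randomised per process unless `PYTHONHASHSEED` is fixed. Changing the salt produces a different, equally deterministic split.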

# Clean the text

Hacker News comments contain a small subset of HTML; remove some of the markup.

In [6]:
import re
import html

def clean(text):
    text = html.unescape(text)
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    text = re.sub('<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()
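
As an illustration, here is what `clean` does to a made-up comment in the Hacker News markup style. The comment text and URL are invented, and the function is repeated so the snippet runs on its own.

```python
import html
import re

def clean(text):
    text = html.unescape(text)
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    text = re.sub('<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()

# Invented example: entities, italics, a paragraph break and a link.
raw = ('I didn&#x27;t like <i>Dune</i>.<p>See '
       '<a href="https:&#x2F;&#x2F;example.com&#x2F;review" rel="nofollow">my review</a>')
cleaned = clean(raw)
print(cleaned)
```

The entities are decoded first, so the link pattern sees a plain `https://` URL, and each link is replaced by its target rather than its anchor text.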

Create a sample of training data with the clean text.

In [7]:
sample = (
    df
    .query('train & deleted.isna() & dead.isna()')
    .rename(columns={'text': 'comment_text'})
    .assign(text=lambda _: (
        _['title'].fillna('') + '\n' + _['comment_text'].fillna('')
    ).map(clean)
           )
).copy()

# Zero-Shot Classification

Following [Joe Davison's article](https://joeddav.github.io/blog/2020/05/29/ZSL.html) we can use NLI to perform zero-shot inference.

Instead of BART we'll use [nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768), one of the [SentenceTransformers pre-trained cross-encoders](https://www.sbert.net/docs/pretrained_cross-encoders.html#nli). It is built on [MiniLM](https://github.com/microsoft/unilm/tree/master/minilm) (I'm not sure which checkpoint, but likely one of [nreimers](https://huggingface.co/nreimers)') and trained on the SNLI and MultiNLI datasets.

I don't use the [ZeroShotClassificationPipeline](https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/pipelines#transformers.ZeroShotClassificationPipeline) because it was raising errors on long texts.

In [8]:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cuda:0'

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'cross-encoder/nli-MiniLM2-L6-H768'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

I tried a few different hypotheses (such as "This contains the title of a book" or "The book is referenced by title"), but after quickly checking a few examples the one used below seemed to work relatively well.

In [10]:
def zero_shot_predict_book(examples, device=device):
    hypothesis = "This comment mentions a book by title."

    # Encode each (comment, hypothesis) pair; long inputs are truncated
    tokens = tokenizer(text=[(ex, hypothesis) for ex in examples],
                       return_tensors='pt',
                       truncation=True, padding=True).to(device)

    with torch.no_grad():
        preds = model(**tokens)

    logits = preds.logits
    # Label order: contradiction, entailment, neutral
    # Ignore neutral and renormalise over the other two
    entail_contradiction_logits = logits[:,[0,1]]
    probs = entail_contradiction_logits.softmax(dim=1)
    # Probability of entailment
    return probs[:,1].to('cpu')
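
The scoring step can be spelled out without the model. The sketch below redoes the last three lines of the function in plain Python on hypothetical logits, to show that the neutral logit is simply discarded and the score is a two-way softmax over contradiction and entailment. The helper name `entailment_prob` and the logit values are made up.

```python
import math

def entailment_prob(logits):
    """logits is [contradiction, entailment, neutral]; neutral is dropped."""
    contradiction, entailment, _neutral = logits
    # Two-way softmax, shifted by the max for numerical stability.
    m = max(contradiction, entailment)
    e_c = math.exp(contradiction - m)
    e_e = math.exp(entailment - m)
    return e_e / (e_c + e_e)

# Equal contradiction and entailment logits give 0.5,
# however large the neutral logit is.
p_a = entailment_prob([2.0, 2.0, 5.0])
# Entailment four logits above contradiction.
p_b = entailment_prob([-1.0, 3.0, 0.0])
print(p_a, p_b)
```

A two-way softmax is the same as a sigmoid of the difference between the entailment and contradiction logits, so a comment the model considers mostly neutral can still receive a confident score in either direction.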

Testing it on a few examples, it over-indexes on the word "book" and misses more subtle references, but it picks something up.

In [11]:
zero_shot_predict_book(["I really liked that book", # False
                        "Dune is a great movie", # False
                        "I recommend The Structure and Interpretation of Computer Programs", # True
                        "This makes me think of 'tracer bullets' from Pragmatic Programmer", # True
                       ])

tensor([0.9915, 0.0016, 0.9544, 0.3877])

Now we can run this over the whole sample in batches.

In [12]:
batch_size = 8

In [13]:
from tqdm.auto import tqdm

In [18]:
def minibatch(seq, size):
    return [seq[i:i+size] for i in range(0, len(seq), size)]
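
A tiny usage example of `minibatch`, repeated here so it runs standalone: the last batch is shorter when the length is not a multiple of the batch size, and the number of batches is the length divided by the size, rounded up.

```python
def minibatch(seq, size):
    return [seq[i:i+size] for i in range(0, len(seq), size)]

batches = minibatch(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```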

This takes about 2 hours on an RTX 6000.

In [19]:
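import numpy as np

# Assumption: `sample_indices` is not defined in the cells shown here;
# taking every id in `sample` is a guess at the intended value.
sample_indices = sample.index.to_list()

preds = []

for batch in tqdm(minibatch(sample_indices, batch_size)):
    examples = sample.loc[batch].text.to_list()
    preds.append(zero_shot_predict_book(examples).numpy())

preds = np.concatenate(preds)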

  0%|          | 0/243496 [00:00<?, ?it/s]

Save the output.

In [20]:
(
    pd.Series(preds, index=pd.Index(sample_indices, name='id'), name='prob')
    .to_frame()
    .to_parquet('../data/02_intermediate/zero_shot_contains_book_title_predictions.parquet',
                compression='gzip', engine='pyarrow')
)