# Work of Art Detection

OntoNotes contains a Work of Art category, which is described as "Titles of books, songs, etc."
Finding Hacker News comments that contain a "Work of Art" is a good heuristic for finding comments that may contain the title of a book.

# Load the Data

In [2]:
from pathlib import Path
import pandas as pd
import xxhash

Read in all Hacker News comments and stories from 2021, which [can be downloaded from Kaggle](https://www.kaggle.com/datasets/edwardjross/hackernews-2021-comments-and-stories) (extracted from the BigQuery dataset).

In [3]:
df = pd.read_parquet('../data/01_raw/hackernews2021.parquet').set_index('id')

# Split the Data

The data will be split deterministically by the root story.
Every comment in a thread then lands in the same split, which allows using features about the comment thread.

## Finding the root

For each comment the root can be found by walking up the parents recursively.

In [4]:
parent_dict = df['parent'].fillna(df.index.to_series()).to_dict()

root_dict = {}

for item, parent in parent_dict.items():
    while parent in parent_dict:
        grandparent = parent_dict[parent]
        if parent == grandparent:
            break
        parent = grandparent
    root_dict[item] = parent
    
df['root'] = df.index.map(root_dict)
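
As a small illustration of the walk above, here is a sketch on a toy parent map. The ids and the helper name `find_root` are made up for this example; a story points at itself (mirroring the `fillna` step), and item 7 has a parent that is missing from the data, so its walk stops at the last id it can see.

```python
# Toy parent map with hypothetical ids.
# Items 1 and 5 are stories, so they are their own parent.
# Item 7's parent (99) is not in the data at all.
toy_parents = {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 5, 7: 99}


def find_root(item, parents):
    """Follow parent links until reaching a self-parented or unknown item."""
    current = parents[item]
    while current in parents and parents[current] != current:
        current = parents[current]
    return current


toy_roots = {item: find_root(item, toy_parents) for item in toy_parents}
print(toy_roots)
# {1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 99}
```

The nested comments 3 and 4 resolve to story 1, while item 7 resolves to the unseen id 99, which is the same behaviour as the loop above when an ancestor falls outside the dataset.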

## Deterministic Splitting

The hash of the root id with a fixed salt gives a deterministic random split.
Choose a 50% training set.

In [5]:
def bucket(s, salt='hnbooks'):
    return xxhash.xxh32_intdigest(str(s)+salt) % 100

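# Assign straight to the column so the name `bucket` keeps referring to the
# function and the cell can be re-run safely.
df['bucket'] = df['root'].apply(bucket)

df['train'] = df['bucket'] < 50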
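
The idea behind the salted hash can be sketched without any third-party dependency. The snippet below swaps in `hashlib.md5` from the standard library for xxhash, so the bucket numbers will differ from the ones computed above; `stable_bucket` and the comment and root ids are hypothetical names and values used only for illustration. Python's built-in `hash` is not a good substitute here, because string hashes change between interpreter runs unless `PYTHONHASHSEED` is fixed.

```python
import hashlib


def stable_bucket(root_id, salt='hnbooks', n_buckets=100):
    """Map an id to a bucket in [0, n_buckets) using a salted hash.

    Uses MD5 as a stand-in for xxhash; only stability matters here,
    not cryptographic strength.
    """
    digest = hashlib.md5((str(root_id) + salt).encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % n_buckets


# Hypothetical comment id -> root id pairs covering two threads.
comment_roots = {101: 100, 102: 100, 103: 100, 201: 200, 202: 200}

comment_buckets = {c: stable_bucket(r) for c, r in comment_roots.items()}
is_train = {c: b < 50 for c, b in comment_buckets.items()}

# Comments sharing a root always share a bucket, and hence a split.
print(comment_buckets)
print(is_train)
```

Changing the salt reshuffles the assignment, while keeping it fixed reproduces the same split on every run and on every machine.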

# Clean the text

Hacker News comments contain a subset of HTML, so strip some of the markup and unescape HTML entities.

In [6]:
import re
import html

def clean(text):
    text = html.unescape(text)
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    text = re.sub('<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()
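
To see what the cleaning does, here is a sketch that runs it on a made-up comment. The function is repeated so the snippet runs on its own, and the comment text and URL are invented for illustration.

```python
import html
import re


def clean(text):
    # Same steps as the cell above, repeated so this snippet is self-contained.
    text = html.unescape(text)
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    text = re.sub('<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()


# Hypothetical raw comment using the markup handled above.
raw = (
    'I loved <i>Dune</i>.<p>See '
    '<a href="https:&#x2F;&#x2F;example.com&#x2F;dune" rel="nofollow">link</a>'
    ' &amp; it&#x27;s great'
)

cleaned = clean(raw)
print(repr(cleaned))
# 'I loved Dune.\n\nSee https://example.com/dune & it\'s great'
```

Entities are unescaped first, italics are dropped, paragraph tags become blank lines, and each link is replaced by its `href` rather than its link text.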

Create a sample of training data with the clean text.

In [7]:
sample = (
    df
    .query('train & deleted.isna() & dead.isna()')
    .rename(columns={'text': 'comment_text'})
    .assign(text=lambda d: (
        d['title'].fillna('') + '\n' + d['comment_text'].fillna('')
    ).map(clean))
).copy()

# Work of Art Detection

We'll use spaCy's transformer model to extract all the comments that have a `WORK_OF_ART` entity.

This takes around 6 hours on an RTX 6000.

In [8]:
import spacy
spacy.require_gpu()
nlp = spacy.load('en_core_web_trf')

In [9]:
from tqdm.auto import tqdm

In [None]:
%%time

sample_items = sample

pred_true = []
for idx, doc in tqdm(zip(sample_items.index, nlp.pipe(sample_items.text)),
                     total=len(sample_items)):
    for ent in doc.ents:
        if ent.label_ == 'WORK_OF_ART':
            pred_true.append(idx)
            break

  0%|          | 0/1947961 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (586 > 512). Running this sequence through the model will result in indexing errors


In [None]:
len(pred_true) / len(sample_items), len(pred_true)

In [None]:
sample_items['work_of_art'] = False
sample_items.loc[pred_true, 'work_of_art'] = True

sample_items[['work_of_art']].to_parquet('../data/02_intermediate/work_of_art_predictions.parquet',
                                          compression='gzip', engine='pyarrow')