# Preparing data for spaCy Textcat

We will take the zero-shot classifications and distill them into a spaCy textcat classifier.
To do this, we first need to convert the data into `.spacy` files for training.

## Load in the data

Load the data, along with the zero-shot annotations.

In [1]:
import pandas as pd

In [2]:
df_zero_shot = pd.read_parquet('../data/02_intermediate/zero_shot_contains_book_title_predictions.parquet')

In [3]:
df = (
    pd
    .read_parquet('../data/01_raw/hackernews2021.parquet')
    .set_index('id')
    .merge(df_zero_shot, left_index=True, right_index=True)
)

## Get clean text

Extract the text for classification from the HTML title and text.

In [4]:
import re
import html

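# Regex substitutions that turn Hacker News HTML into plain text.
CLEAN_PATTERNS = {
    re.compile("<p>"): "\n\n",  # paragraph breaks
    re.compile("<i>"): "",  # drop italics markup
    re.compile("</i>"): "",
    re.compile('<a href="([^"]*)"[^>]*>.*?</a>'): r"\1",  # keep only the link target
    re.compile("<pre><code>(.*?)</code></pre>", flags=re.DOTALL): r"\1",  # unwrap code blocks
}


def clean_text(s):
    """Apply each substitution in order, then decode HTML entities."""
    for pattern, sub in CLEAN_PATTERNS.items():
        s = pattern.sub(sub, s)
    return html.unescape(s)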
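
As a quick check of what these substitutions do, the sketch below runs them on a made-up comment. The sample string is invented for illustration and is not taken from the dataset; the patterns are repeated only so the snippet runs on its own.

```python
import html
import re

# Same substitutions as in the cell above, repeated so this snippet is self-contained.
CLEAN_PATTERNS = {
    re.compile("<p>"): "\n\n",
    re.compile("<i>"): "",
    re.compile("</i>"): "",
    re.compile('<a href="([^"]*)"[^>]*>.*?</a>'): r"\1",
    re.compile("<pre><code>((?:.|\n)*?)</code></pre>"): r"\1",
}


def clean_text(s):
    for pattern, replacement in CLEAN_PATTERNS.items():
        s = pattern.sub(replacement, s)
    return html.unescape(s)


# Hypothetical comment in Hacker News style HTML (invented for this example).
sample = (
    "I liked <i>Dune</i>.<p>See "
    '<a href="https://example.com/dune" rel="nofollow">https://example.com/d...</a>'
    " &amp; enjoy."
)
cleaned = clean_text(sample)
print(cleaned)
```

The italics tags disappear, the paragraph tag becomes a blank line, the link collapses to its `href`, and `&amp;` is decoded to `&`.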


In [5]:
import numpy as np

df = df.assign(clean_text = 
        lambda _: (
            np.where(_.title.isna(), "", _.title)
            + np.where(~_.title.isna() & ~_.text.isna(), "<p>", "")
            + _.text.fillna("")
        ).apply(clean_text)
)

## Convert to spaCy

Convert the data into spaCy's binary format for training.

In [6]:
import spacy
from spacy.tokens import Doc, DocBin


We need a `Language` object to create each `Doc` (it also tokenizes the text).

In [7]:
nlp = spacy.blank("en")

Register an `id` extension so that each `Doc` records which comment it came from.

In [8]:
Doc.set_extension("id", default=None)


Convert the rows into `Doc` objects.
Note that using `nlp.pipe` was much faster than calling `make_doc` on each row.

The classifier expects exactly one correct class per document, so we use `BOOK` and `NOT_BOOK` as the classes.

In [9]:

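from tqdm.auto import tqdm

# Zero-shot probability above which a comment is labelled BOOK.
threshold = 0.8


def rows_to_tuples(df):
    """Yield (text, context) pairs in the shape nlp.pipe(as_tuples=True) expects."""
    for id, row in tqdm(df.iterrows(), total=len(df)):
        # bool() turns the NumPy boolean into a plain Python bool.
        is_book = bool(row.prob > threshold)
        yield (row.clean_text, {"id": id, "cats": {"BOOK": is_book, "NOT_BOOK": not is_book}})


def tuples_to_docs(tuples):
    """Tokenize each text and attach its id and category labels."""
    for doc, context in nlp.pipe(tuples, as_tuples=True):
        doc._.id = context['id']
        doc.cats = context['cats']
        yield doc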
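
To make the labelling rule concrete, here is a minimal sketch of how a zero-shot probability maps to the two mutually exclusive classes. The helper name `make_cats` is hypothetical and only isolates the thresholding step from the generator above.

```python
def make_cats(prob, threshold=0.8):
    """Map a zero-shot probability to mutually exclusive textcat labels."""
    is_book = bool(prob > threshold)
    return {"BOOK": is_book, "NOT_BOOK": not is_book}


high = make_cats(0.93)  # above the threshold
low = make_cats(0.41)   # below the threshold
print(high)
print(low)
```

Exactly one of the two labels is true for any probability, which is what an exclusive-class text classifier needs.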


Randomly assign each row to the train or dev split.

In [10]:
prob_train = 0.8

df['train'] = np.random.choice([True, False],
                                size=len(df),
                                p=[prob_train, 1-prob_train])
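
As written, the split shows no seed, so rerunning the notebook would likely give a different partition each time. If a reproducible split is wanted, one option is a seeded NumPy generator, sketched below. The seed value and `n_rows` are arbitrary stand-ins, not values from the original notebook.

```python
import numpy as np

prob_train = 0.8
n_rows = 10_000  # stand-in for len(df)

# A seeded generator gives the same mask on every run (the seed itself is arbitrary).
rng = np.random.default_rng(seed=42)
is_train = rng.random(n_rows) < prob_train

# Re-creating the generator with the same seed reproduces the same mask.
is_train_again = np.random.default_rng(seed=42).random(n_rows) < prob_train

print(is_train.mean())  # fraction of rows assigned to train, close to prob_train
```

In the notebook this mask could be assigned with `df['train'] = is_train` once `n_rows` is replaced by `len(df)`.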

Take a random sample of 100,000 rows for training.

In [11]:

limit = 100_000
sample = df.query('train').sample(n=limit)


In [12]:
docs = tuples_to_docs(rows_to_tuples(sample))

Unfortunately, the `DocBin` constructor consumes the whole generator and keeps every doc in memory, so it can't be used to stream larger-than-memory datasets.

In [13]:

db = DocBin(docs=docs)

100%|██████████| 100000/100000 [02:20<00:00, 710.73it/s]
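
One possible workaround for the memory limitation noted above, which is not part of the original notebook, is to cut the stream into fixed-size chunks and write one `.spacy` file per chunk. spaCy's corpus reader can load a directory of `.spacy` files. The helper name `chunked`, the chunk size, and the file names in the closing comment are all hypothetical.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in stream; in the notebook this would be the `docs` generator.
stream = (f"doc-{i}" for i in range(7))
chunks = list(chunked(stream, size=3))
print(chunks)

# With real docs, each chunk could go into its own file, for example:
#   for i, chunk in enumerate(chunked(docs, 10_000)):
#       DocBin(docs=chunk).to_disk(f"train/part_{i:04d}.spacy")  # hypothetical paths
```

Only one chunk of docs is held in memory at a time, at the cost of producing several output files instead of one.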


In [14]:
db.to_disk('../data/02_intermediate/book_zero_shot_train.spacy')
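
A caveat worth checking: as far as I know, `DocBin` does not store custom extension values unless it is created with `store_user_data=True`, so the `id` set earlier is probably not written to the file above (the `cats` labels are). The sketch below is a small round trip on a single made-up doc, assuming spaCy v3; the text and id are invented for illustration.

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
if not Doc.has_extension("id"):
    Doc.set_extension("id", default=None)

doc = nlp.make_doc("I just finished reading Dune.")
doc._.id = 12345  # made-up comment id
doc.cats = {"BOOK": True, "NOT_BOOK": False}

# store_user_data=True asks DocBin to keep extension values such as doc._.id.
data = DocBin(docs=[doc], store_user_data=True).to_bytes()

# Read the docs back, again opting in to user data.
restored = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))[0]

print(restored.cats, restored._.id)
```

Training only needs the text and `cats`, so the default is fine for the files written here; the flag matters only if the ids are needed later when reading the `.spacy` files back.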

Save the development set too.

In [15]:
sample = df.query('~train').sample(n=limit)
docs = tuples_to_docs(rows_to_tuples(sample))
db = DocBin(docs=docs)
db.to_disk('../data/02_intermediate/book_zero_shot_dev.spacy')

100%|██████████| 100000/100000 [02:23<00:00, 698.33it/s]


To train, run the following in a terminal at the project root:

```
spacy train config/spacy_book_classifier.cfg \
    --paths.train data/02_intermediate/book_zero_shot_train.spacy \
    --paths.dev data/02_intermediate/book_zero_shot_dev.spacy \
    --output ./data/06_models/bookcat_zero_shot
```