### Data Prep

Split the dataset into train, dev, and test sets, and convert the raw JSON data into the spaCy DocBin format.

Adapted from 
https://catherinebreslin.medium.com/text-classification-with-spacy-3-0-d945e2e8fc44


In [1]:
from spacy.tokens import DocBin
import spacy

import json
import random

In [17]:
def extract_lines(infile, outfile, row_indices):
    """Copy the lines of infile selected by row_indices to outfile.

    Lines are written in the order given by row_indices.
    """

    with open(infile) as f:
        lines = f.readlines()

    # pick out the requested rows, preserving the order of row_indices
    outlines = [lines[i] for i in row_indices]

    with open(outfile, "w") as f:
        f.writelines(outlines)


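def convert(
    infile,
    outfile,
    categories=("POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL"),
):
    """Convert the JSON records in infile (one per line) to the DocBin format.

    Only records whose category is in categories are kept. Each doc gets a
    flag for every category: 1 for its own category and 0 for the rest.
    """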
    nlp = spacy.blank("en")
    db = DocBin()
    
    with open(infile) as f:
        lines = f.readlines()
    
    for line in lines:
        record = json.loads(line)
        if record["category"] in categories:
            doc = nlp.make_doc(record["headline"])

            # set default values for all category flags to 0
            doc.cats = {category: 0 for category in categories}
            # then flag the category this headline belongs to
            doc.cats[record["category"]] = 1
            db.add(doc)
    
    print(f"found {len(db)} rows")
    db.to_disk(outfile)

#### 1. Data exploration

The dataset is a JSON Lines file: each line holds one JSON object. Read the lines into a list.

Note: the notebook expects the raw data file to be stored in a folder named `data` in the same directory as the notebook.

In [4]:
with open("data/News_Category_Dataset_v3.json", "r") as f1:
    lines = f1.readlines()
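
Since every line is parsed separately with `json.loads`, a single record can be inspected by parsing one line. The following is a minimal sketch on a made-up two-record sample rather than the real file; the `category` and `headline` keys come from this notebook, while the sample values are hypothetical.

```python
import io
import json

# Hypothetical two-record sample in the same one-object-per-line layout.
sample = io.StringIO(
    '{"category": "TRAVEL", "headline": "Ten Quiet Beaches"}\n'
    '{"category": "POLITICS", "headline": "Senate Passes Budget"}\n'
)

# Parse each non-empty line into its own dict.
records = [json.loads(line) for line in sample if line.strip()]

first = records[0]
print(sorted(first.keys()))
print(first["category"], "|", first["headline"])
```

Running the same two `print` calls on `json.loads(lines[0])` would show which other fields the real records carry.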

In [6]:
# get the number of rows
rows = len(lines)
print(rows)

209527


In [8]:
# get unique categories

cat_list = []
for line in lines:
    line = json.loads(line)
    cat_list.append(line["category"])

categories = set(cat_list)

In [9]:
categories

{'ARTS',
 'ARTS & CULTURE',
 'BLACK VOICES',
 'BUSINESS',
 'COLLEGE',
 'COMEDY',
 'CRIME',
 'CULTURE & ARTS',
 'DIVORCE',
 'EDUCATION',
 'ENTERTAINMENT',
 'ENVIRONMENT',
 'FIFTY',
 'FOOD & DRINK',
 'GOOD NEWS',
 'GREEN',
 'HEALTHY LIVING',
 'HOME & LIVING',
 'IMPACT',
 'LATINO VOICES',
 'MEDIA',
 'MONEY',
 'PARENTING',
 'PARENTS',
 'POLITICS',
 'QUEER VOICES',
 'RELIGION',
 'SCIENCE',
 'SPORTS',
 'STYLE',
 'STYLE & BEAUTY',
 'TASTE',
 'TECH',
 'THE WORLDPOST',
 'TRAVEL',
 'U.S. NEWS',
 'WEDDINGS',
 'WEIRD NEWS',
 'WELLNESS',
 'WOMEN',
 'WORLD NEWS',
 'WORLDPOST'}

In [10]:
# count the number of categories
len(categories)

42
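
The set above shows which labels exist but not how many headlines each one has. A sketch of counting rows per category with `collections.Counter` follows. It runs on a small hypothetical sample so that it is self-contained; in the notebook the same expression could be applied to `lines`.

```python
import json
from collections import Counter

# Hypothetical sample lines; the real notebook would use `lines` instead.
sample_lines = [
    '{"category": "POLITICS", "headline": "a"}',
    '{"category": "TRAVEL", "headline": "b"}',
    '{"category": "POLITICS", "headline": "c"}',
]

# Count how often each category label occurs.
category_counts = Counter(json.loads(line)["category"] for line in sample_lines)

# List the categories from most to least frequent.
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

A strongly skewed count would suggest that some categories will be much harder to learn than others.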

#### 2. Split into train, dev, and test datasets

Split the file into train, dev, and test datasets. spaCy uses the dev dataset as a validation set during training.

In [12]:
# first generate a list of indices for all the rows. Shuffle the indices
# to get a random sampling

indices = list(range(rows))

# set the random seed to ensure repeatability
random.seed(42)
random.shuffle(indices)

In [13]:
# get the indices for the various sets
# the dataset is quite large, so only train on 100k rows to reduce compute time
# (the rows beyond index 140000 of the shuffled list are left unused)

# test: 20k rows
# dev: 20k rows
# train: 100k rows

test_indices = indices[0:20000]
dev_indices = indices[20000:40000]
train_indices = indices[40000:140000]
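
Because the three slices are cut from one shuffled list, they should never share an index. The sketch below wraps the same slicing in a hypothetical helper, `split_indices`, and checks that property on a scaled-down row count. The helper name, its parameters, and the use of a local `random.Random` instance are assumptions made for illustration, not part of the notebook.

```python
import random


def split_indices(n_rows, n_test, n_dev, n_train, seed=42):
    """Shuffle range(n_rows) and cut it into test, dev, and train slices."""
    if n_test + n_dev + n_train > n_rows:
        raise ValueError("split sizes exceed the number of rows")

    # A local generator keeps the shuffle repeatable without touching
    # the global random state.
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    dev_start = n_test
    train_start = n_test + n_dev
    test = indices[:dev_start]
    dev = indices[dev_start:train_start]
    train = indices[train_start:train_start + n_train]
    return test, dev, train


# Scaled-down, hypothetical sizes so the example runs instantly.
test_idx, dev_idx, train_idx = split_indices(200, 20, 20, 100)

# No index may appear in more than one split.
overlap = (
    set(test_idx) & set(dev_idx)
    | set(test_idx) & set(train_idx)
    | set(dev_idx) & set(train_idx)
)
print(len(test_idx), len(dev_idx), len(train_idx), len(overlap))
```

The same overlap check can be run directly on `test_indices`, `dev_indices`, and `train_indices`.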

In [14]:
# set the filenames

infile_name = "data/News_Category_Dataset_v3.json"
test_json = "data/test.json"
dev_json = "data/dev.json"
train_json = "data/train.json"

test_filename = "data/test.spacy"
dev_filename = "data/dev.spacy"
train_filename = "data/train.spacy"

In [15]:
# call extract_lines to extract the lines from the main dataset

extract_lines(infile_name, test_json, test_indices)
extract_lines(infile_name, dev_json, dev_indices)
extract_lines(infile_name, train_json, train_indices)

In [18]:
# convert the JSON format to the spaCy DocBin format
convert(test_json, test_filename, categories)
convert(dev_json, dev_filename, categories)
convert(train_json, train_filename, categories)

found 20000 rows
found 20000 rows
found 100000 rows
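
To confirm that the category flags survive serialization, a `.spacy` file can be loaded back and its docs inspected. The sketch below builds a tiny DocBin from two hypothetical headlines, writes it to a temporary file, and reads it again; the sample texts, labels, and file name are made up for illustration, and it assumes spaCy 3 is installed.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
labels = ["POLITICS", "TRAVEL"]  # hypothetical label subset

# Build a small DocBin the same way convert() does.
db = DocBin()
samples = [("Senate Passes Budget", "POLITICS"), ("Ten Quiet Beaches", "TRAVEL")]
for text, label in samples:
    doc = nlp.make_doc(text)
    doc.cats = {name: 1.0 if name == label else 0.0 for name in labels}
    db.add(doc)

# Round-trip through disk and rebuild the docs with the pipeline's vocab.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.spacy"
    db.to_disk(path)
    loaded = DocBin().from_disk(path)
    docs = list(loaded.get_docs(nlp.vocab))

for doc in docs:
    print(doc.text, doc.cats)
```

Pointing `from_disk` at `data/train.spacy` instead would allow a spot check of the converted training set.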
