# Step 1: Build the Dataset

The first thing to do is ensure that modules are auto-reloaded at runtime to allow for development in other files.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

We then set the Snorkel database location and start a session connected to it.  By default, we use a PostgreSQL database backend; the database can be created with `createdb DB_NAME` once PostgreSQL is installed.  Note that Snorkel does *not* currently support parallel database processing with a SQLite backend.

In [2]:
# Setting Snorkel DB location
import os
import sys

import random
import numpy as np

# For networked PostgreSQL
postgres_location = 'postgresql://jdufault:123@localhost:5432'
postgres_db_name = 'es_locs_1M'
# Join with '/' explicitly: os.path.join would use the OS path separator, which is wrong for a URL on Windows
os.environ['SNORKELDB'] = '{0}/{1}'.format(postgres_location, postgres_db_name)

# For local PostgreSQL
#os.environ['SNORKELDB'] = 'postgres:///es_locs_small'

# Adding path above for utils
sys.path.append('../utils')

# For SQLite
#db_location = '.'
#db_name = "es_locs_small.db"
#os.environ['SNORKELDB'] = '{0}:///{1}/{2}'.format("sqlite", db_location, db_name)

# Start Snorkel session
from snorkel import SnorkelSession
session = SnorkelSession()

# Setting parallelism
parallelism = 32

# Setting random seed
seed = 1701
random.seed(seed)
np.random.seed(seed)
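
The connection strings used in the cell above all follow the same URL pattern. As a minimal sketch, the helper below assembles them for the networked PostgreSQL and SQLite cases. `build_snorkel_db_url` is a hypothetical name introduced here for illustration and is not part of Snorkel or `dataset_utils`; the user name and password in the usage lines are placeholders.

```python
def build_snorkel_db_url(backend, db_name, host=None, port=None,
                         user=None, password=None, directory='.'):
    """Hypothetical helper: build a database URL in the formats used above."""
    if backend == 'sqlite':
        # Same layout as the commented-out SQLite line: sqlite:///<dir>/<file>
        return 'sqlite:///{0}/{1}'.format(directory, db_name)
    if backend == 'postgresql':
        if host is None:
            raise ValueError('host is required for a networked PostgreSQL URL')
        auth = ''
        if user:
            auth = user if password is None else '{0}:{1}'.format(user, password)
            auth += '@'
        netloc = host if port is None else '{0}:{1}'.format(host, port)
        return 'postgresql://{0}{1}/{2}'.format(auth, netloc, db_name)
    raise ValueError('unsupported backend: {0}'.format(backend))


# Placeholder credentials, for illustration only
pg_url = build_snorkel_db_url('postgresql', 'es_locs_1M', host='localhost',
                              port=5432, user='user', password='secret')
sqlite_url = build_snorkel_db_url('sqlite', 'es_locs_small.db')
```

The resulting string would then be assigned to `os.environ['SNORKELDB']` before the `SnorkelSession` is created, as in the cell above.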


We now set up the document preprocessor that reads raw data into the Snorkel database.  Three data source options are supported: JSONL files from the MEMEX project (option: `memex_jsons`), a raw TSV file of extractions from the MEMEX project, `content.tsv` (option: `content.tsv`), and TSV files in a format similar to `content.tsv` drawn from an Elasticsearch index of the data (option: `es`).  `data_source` selects one of these options, `data_loc` sets the location of the data, and `max_docs` controls the number of documents read by the preprocessor.  For the MEMEX JSON source, `data_loc` should be a directory; in all other cases it should be a TSV file.
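
The rule relating the data source option to the kind of location it expects can be checked before any data is read. The sketch below is an illustration only: `check_data_source` and `VALID_SOURCES` are hypothetical names, and `set_preprocessor` may or may not perform a similar check internally.

```python
import os

# The three options named in the text above
VALID_SOURCES = ('content.tsv', 'memex_jsons', 'es')


def check_data_source(data_source, data_loc):
    """Hypothetical sanity check: directory for MEMEX JSON, file otherwise."""
    if data_source not in VALID_SOURCES:
        raise ValueError('unknown data source: {0}'.format(data_source))
    if data_source == 'memex_jsons':
        if not os.path.isdir(data_loc):
            raise ValueError('memex_jsons expects a directory: {0}'.format(data_loc))
    elif not os.path.isfile(data_loc):
        raise ValueError('{0} expects a TSV file: {1}'.format(data_source, data_loc))
    return True
```

Calling `check_data_source(data_source, data_loc)` before building the preprocessor would surface a mistyped path early rather than during ingestion.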

In [17]:
from dataset_utils import set_preprocessor, combine_dedupe, parallel_parse_html

# Set data source: options are 'content.tsv', 'memex_jsons', 'es'
data_source = 'es'

# Setting max number of docs to ingest
max_docs = 100

# Setting location of data source

# For ES:
data_loc = '/dfs/scratch0/jdunnmon/data/memex-data/tsvs/price/test/shard/parsed/output_all_shard_00.tsv'

# Optional: add tsv with additional documents to create combined tsv without duplicates
#data_loc = combine_dedupe(data_loc, 'output_location.tsv', 'combined.tsv')

# If memex_raw_content is a content_field, uses term as a regex in raw data in addition to getting title and body
term = r'(\d\d\d?.*?hours?|\d\d\d?.*?half|\d\d\d?.*?minutes?)'

# Optional: parse html and save to a separate tsv
# html_col = 8
# parallel_parse_html(data_loc, term=term, threads=parallelism, col=html_col)

# Doc length in characters, remove to have no max
max_doc_length=1500

# Setting preprocessor
doc_preprocessor = set_preprocessor(data_source, data_loc, max_docs=max_docs, verbose=False, clean_docs=False,
                                    content_fields=['raw_content', 'url'], term=term, max_doc_length=max_doc_length)

Using single-threaded loader
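
To see what the `term` pattern set in the cell above picks up, it can be tried on a few short strings with Python's `re` module. The sample strings are invented for illustration, and `find_durations` is a hypothetical helper; how `dataset_utils` applies the pattern internally is not shown in this notebook.

```python
import re

# Same pattern as in the cell above
term = r'(\d\d\d?.*?hours?|\d\d\d?.*?half|\d\d\d?.*?minutes?)'


def find_durations(text):
    """Return every substring matched by the term pattern."""
    return re.findall(term, text)


# A two- or three-digit number followed (lazily) by a time word is matched
print(find_durations('60 minutes'))            # ['60 minutes']

# The match starts at the first two-digit number, not the one nearest the time word
print(find_durations('200 for 60 minutes'))    # ['200 for 60 minutes']

# The "hours?" alternative is tried first, so it can span past "minutes"
print(find_durations('30 minutes or 1 hour'))  # ['30 minutes or 1 hour']

# Single-digit numbers are not matched, since the pattern needs at least two digits
print(find_durations('1 hour'))                # []
```

Because each alternative uses a lazy `.*?` anchored at the first two- or three-digit number, a match can include text between a number and a time word that may be unrelated to it.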


Now we execute the preprocessor and parse the resulting documents.  The degree of parallelism is set with the `parallelism` argument.  Note that we use the spaCy parser rather than CoreNLP, as it tends to give better results.

In [18]:
from snorkel.parser import CorpusParser
from snorkel.parser.spacy_parser import Spacy

# Applying corpus parser
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(list(doc_preprocessor), parallelism=parallelism, verbose=False)

Title Alexis - Entertainer in Bellingham, WA - NaughtyReviews <|> #mobile_menu display: none!important nav li a padding-left: 20px <|> Naughtyreviews is now called Tempted. We have been working hard to enhance the site. <|> mini-panel-elegance2_individual <|> Message me. I would like to find someone who appreciates being pampered and is fun to hang out with. Whats up? Im Alexis. I am a funny, sensual, down to earth female seeking a nonjudgmental, no drama, no baggage guy. <|> /mini-panel-elegance2_individual <|> /panel-pane organization node profile <|> Naughty Reviews - Dating with benefits <|> NaughtyReviews offers a unique dating experience that allows you to have the perfect date every time. Thanks to our revolutionary 360 degree ratings system guys and girls can rate each other to keep things safe and predictable. Start to rate and get rated today to increase your member level and gain access to the most exclusive dating opportunities. NaughtyReviews makes it easy to find the hott

Running UDF...
CPU times: user 244 ms, sys: 1.11 s, total: 1.35 s
Wall time: 11.4 s


We then check the number of parsed documents and sentences in the database.

In [19]:
from snorkel.models import Document, Sentence

# Printing number of docs/sentences
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 100
Sentences: 1450


Next, we separate the documents into train, dev, and test sets.

In [23]:
from dataset_utils import create_test_train_splits

# Getting all documents parsed by Snorkel
docs = session.query(Document).order_by(Document.name).all()

# Creating train, test, dev splits
%time train_docs, dev_docs, test_docs, train_sents, dev_sents, test_sents = create_test_train_splits(docs, 'price', gold_dict=None, dev_frac=0.3, test_frac=0.3, hand_label=True)

Train: 40 Docs, 563 Sentences
Dev: 30 Docs, 427 Sentences
Test: 30 Docs, 460 Sentences
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 2.72 ms
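
With `dev_frac=0.3` and `test_frac=0.3`, 100 documents divide into 40 train, 30 dev, and 30 test documents, which matches the counts printed above. The sketch below shows one way such a fractional split can be done. It is an assumption-laden illustration: `split_docs` is a hypothetical function, and the actual logic of `create_test_train_splits` (including what `hand_label` and `gold_dict` do) is not shown in this notebook.

```python
import random


def split_docs(docs, dev_frac=0.3, test_frac=0.3, seed=1701):
    """Hypothetical sketch: shuffle with a fixed seed, then slice by fraction."""
    docs = list(docs)
    rng = random.Random(seed)  # local generator, leaves global random state alone
    rng.shuffle(docs)
    n_dev = int(round(len(docs) * dev_frac))
    n_test = int(round(len(docs) * test_frac))
    dev = docs[:n_dev]
    test = docs[n_dev:n_dev + n_test]
    train = docs[n_dev + n_test:]  # whatever remains goes to train
    return train, dev, test


# Stand-in document names; real code would pass the queried Document objects
names = ['doc_{0:03d}'.format(i) for i in range(100)]
train, dev, test = split_docs(names)
print(len(train), len(dev), len(test))  # 40 30 30
```

Using a seeded local generator keeps the split reproducible across runs without depending on the order of earlier calls to `random`.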


We then create the candidate extractor.

In [24]:
from snorkel.candidates import Ngrams
from snorkel.candidates import CandidateExtractor
from dataset_utils import create_candidate_class, price_match
from snorkel.matchers import Union, LambdaFunctionMatcher

# Setting extraction type -- should be a subfield in your data source extractions field!
extraction_type = 'price'

# Creating candidate class
candidate_class, candidate_class_name = create_candidate_class(extraction_type)

# Defining ngrams for candidates
price_ngrams = Ngrams(n_max=4)

# Define matchers
price_matcher = LambdaFunctionMatcher(func=price_match)

# Create candidate extractor (only one matcher is used here, so no Union is needed)
cand_extractor = CandidateExtractor(candidate_class, [price_ngrams], [price_matcher])
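
Conceptually, the extractor enumerates every contiguous span of up to `n_max` tokens in a sentence and keeps the spans that the matcher accepts. The sketch below illustrates that idea in plain Python on whitespace tokens. It is not Snorkel's implementation: `ngrams` and `looks_like_price` are hypothetical names, and `looks_like_price` is only a stand-in, since the real `price_match` from `dataset_utils` is not shown here and operates on Snorkel span objects rather than strings.

```python
import re


def ngrams(tokens, n_max=4):
    """Yield (start, end, text) for every contiguous span of 1..n_max tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            yield start, end, ' '.join(tokens[start:end])


def looks_like_price(text):
    """Hypothetical stand-in matcher: a dollar sign followed by digits."""
    return re.fullmatch(r'\$\d+', text) is not None


tokens = 'Rates start at $200 per hour'.split()
all_spans = list(ngrams(tokens, n_max=4))
candidates = [span for span in all_spans if looks_like_price(span[2])]
print(len(all_spans))  # 18
print(candidates)      # [(3, 4, '$200')]
```

The number of spans grows roughly linearly with sentence length for a fixed `n_max`, so the matcher is what keeps the candidate set small.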

Finally, we apply the candidate extractor to each split (train, dev, test).

In [25]:
# Applying candidate extractor to each split
for k, sents in enumerate([train_sents, dev_sents, test_sents]):
    %time cand_extractor.apply(sents, split=k, parallelism=parallelism)
    print("Number of candidates:", session.query(candidate_class).filter(candidate_class.split == k).count())

Clearing existing...
Running UDF...
CPU times: user 232 ms, sys: 1.01 s, total: 1.24 s
Wall time: 4.37 s
Number of candidates: 5
Clearing existing...
Running UDF...
CPU times: user 220 ms, sys: 948 ms, total: 1.17 s
Wall time: 4.21 s
Number of candidates: 0
Clearing existing...
Running UDF...
CPU times: user 240 ms, sys: 964 ms, total: 1.2 s
Wall time: 4.41 s
Number of candidates: 13


In [None]:
from snorkel.viewer import SentenceNgramViewer
cands_dev = session.query(candidate_class).filter(candidate_class.split == 1).order_by(candidate_class.id).all()
sv = SentenceNgramViewer(cands_dev, session, annotator_name='gold')
sv

In [30]:
from snorkel.viewer import SentenceNgramViewer
# Split 2 is the test set; the dev split (1) yielded no candidates in the run above
cands_test = session.query(candidate_class).filter(candidate_class.split == 2).order_by(candidate_class.id).all()
sv = SentenceNgramViewer(cands_test, session, annotator_name='gold')
sv
