# Step 1: Build the Dataset

First, we enable the `autoreload` extension so that edits to imported modules (such as the utility files used below) take effect without restarting the kernel.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

We then set the Snorkel database location and start a session connected to it.  By default, we use a PostgreSQL database backend; the database can be created using `createdb DB_NAME` once PostgreSQL is installed.  Note that Snorkel does *not* currently support parallel database processing with a SQLite backend.

In [16]:
# Setting Snorkel DB location
import os
import sys

import random
import numpy as np

# For PostgreSQL
postgres_location = 'postgresql://jdunnmon:123@localhost:5432'

postgres_db_name = 'es_locs_small'
# Join with '/' explicitly: this is a URL, not a filesystem path
os.environ['SNORKELDB'] = postgres_location + '/' + postgres_db_name

# Adding path above for utils
sys.path.append('..')

# For SQLite
#db_location = '.'
#db_name = "es_locs_small.db"
#os.environ['SNORKELDB'] = '{0}:///{1}/{2}'.format("sqlite", db_location, db_name)

# Start Snorkel session
from snorkel import SnorkelSession
session = SnorkelSession()

# Setting random seed
seed = 1701
random.seed(seed)
np.random.seed(seed)
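
The two connection-string patterns used in the cell above (PostgreSQL and the commented-out SQLite variant) can be collected in one small helper. The following is only a sketch: `build_snorkel_db_url` is a hypothetical name that is not part of Snorkel or `dataset_utils`, and `USER` and `PASSWORD` are placeholders for real credentials.

```python
def build_snorkel_db_url(backend, db_name, user=None, password=None,
                         host='localhost', port=5432, db_dir='.'):
    """Return a connection string for the given backend (hypothetical helper)."""
    if backend == 'postgresql':
        # Credentials are optional; include them only when a user is given
        auth = ''
        if user:
            auth = user if password is None else '{0}:{1}'.format(user, password)
            auth += '@'
        return 'postgresql://{0}{1}:{2}/{3}'.format(auth, host, port, db_name)
    if backend == 'sqlite':
        # Same pattern as the commented-out SQLite lines in the cell above
        return 'sqlite:///{0}/{1}'.format(db_dir, db_name)
    raise ValueError('Unsupported backend: {0}'.format(backend))


# Placeholder credentials; substitute your own
pg_url = build_snorkel_db_url('postgresql', 'es_locs_small',
                              user='USER', password='PASSWORD')
sqlite_url = build_snorkel_db_url('sqlite', 'es_locs_small.db')
print(pg_url)
print(sqlite_url)
```

As in the cell above, the resulting string would be assigned to `os.environ['SNORKELDB']` before the Snorkel session is created.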

We now set up the document preprocessor that reads raw data into the Snorkel database.  Three data source options are available: JSONL files from the MEMEX project (option: `memex_jsons`), a raw TSV file of extractions from the MEMEX project, `content.tsv` (option: `content.tsv`), and TSV files in a format similar to `content.tsv` drawn from an Elasticsearch index of the data (option: `es`).  `data_source` selects the option, `data_loc` sets the location of the data, and `max_docs` controls the number of documents read by the preprocessor.  For the MEMEX JSON source, `data_loc` should be a directory, while in all other cases it should be a TSV file.

In [6]:
from dataset_utils import set_preprocessor

# Set data source: options are 'content.tsv', 'memex_jsons', 'es'
data_source = 'es'

# Setting max number of docs to ingest
max_docs = 1000

# Setting location of data source

# For ES:
data_loc = '/dfs/scratch1/jdunnmon/data/memex-data/es/output_location.tsv'

# Setting preprocessor
doc_preprocessor = set_preprocessor(data_source,data_loc,
                                    max_docs=max_docs,verbose=False,clean_docs=True,content_field='extracted_text')
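
Because the expected type of `data_loc` depends on `data_source` (a directory for `memex_jsons`, a single TSV file otherwise), a quick check before building the preprocessor can catch a mismatch early. The sketch below is an illustration only: `check_data_source` is a hypothetical helper that does not exist in `dataset_utils`, and the temporary file it is exercised on contains dummy content.

```python
import os
import tempfile

VALID_SOURCES = ('content.tsv', 'memex_jsons', 'es')


def check_data_source(data_source, data_loc):
    """Raise ValueError if data_loc does not fit the chosen data_source."""
    if data_source not in VALID_SOURCES:
        raise ValueError('Unknown data_source: {0!r}'.format(data_source))
    if data_source == 'memex_jsons':
        # MEMEX JSONL input is expected to be a directory of files
        if not os.path.isdir(data_loc):
            raise ValueError('Expected a directory: {0}'.format(data_loc))
    elif not os.path.isfile(data_loc):
        # 'content.tsv' and 'es' are expected to be single TSV files
        raise ValueError('Expected a TSV file: {0}'.format(data_loc))
    return True


# Demonstration on a temporary directory holding one dummy TSV file
with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, 'output_location.tsv')
    with open(tsv_path, 'w') as f:
        f.write('dummy\tcontent\n')
    es_ok = check_data_source('es', tsv_path)
    memex_ok = check_data_source('memex_jsons', tmp_dir)
    try:
        check_data_source('es', tmp_dir)  # directory passed where a file is needed
        mismatch_caught = False
    except ValueError:
        mismatch_caught = True
```

In the notebook, such a check would be called as `check_data_source(data_source, data_loc)` just before `set_preprocessor`.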

Now we run the preprocessor and parse its output into the database.  The number of parallel processes is set with the `parallelism` argument.  Note that we use the spaCy parser rather than CoreNLP, as it tends to give better results.

In [17]:
from snorkel.parser import CorpusParser
from snorkel.parser.spacy_parser import Spacy

# Applying corpus parser
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(list(doc_preprocessor), parallelism=8, verbose=False)

Clearing existing...
Running UDF...
CPU times: user 1.21 s, sys: 320 ms, total: 1.53 s
Wall time: 10.3 s


We then check the number of parsed documents and sentences in the database.

In [18]:
from snorkel.models import Document, Sentence

# Printing number of docs/sentences
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 1000
Sentences: 7337


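Next, we separate the documents into train, dev, and test sets.  With `dev_frac=0.01` and `test_frac=0.01`, 1% of the documents go to each of the dev and test sets, and the rest are used for training.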

In [19]:
from dataset_utils import create_test_train_splits

# Getting all documents parsed by Snorkel
docs = session.query(Document).order_by(Document.name).all()

# Creating train, test, dev splits
%time train_docs, dev_docs, test_docs, train_sents, dev_sents, test_sents = create_test_train_splits(docs, 'location', gold_dict=None, dev_frac=0.01, test_frac=0.01,)

Train: 980 Docs, 7133 Sentences
Dev: 10 Docs, 78 Sentences
Test: 10 Docs, 126 Sentences
CPU times: user 2.02 s, sys: 268 ms, total: 2.29 s
Wall time: 3.37 s
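
To make the effect of `dev_frac` and `test_frac` concrete, here is a minimal sketch of a fraction-based split. It is not the implementation of `create_test_train_splits`, which may select documents differently; `split_docs` and the document names are hypothetical, and the default seed simply reuses the value set earlier in the notebook.

```python
import random


def split_docs(docs, dev_frac=0.01, test_frac=0.01, seed=1701):
    """Shuffle docs reproducibly and cut them into train/dev/test lists."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # local RNG; global random state untouched
    n_dev = int(round(dev_frac * len(docs)))
    n_test = int(round(test_frac * len(docs)))
    dev = docs[:n_dev]
    test = docs[n_dev:n_dev + n_test]
    train = docs[n_dev + n_test:]
    return train, dev, test


# Made-up document names standing in for the 1000 parsed documents
doc_names = ['doc_{0:04d}'.format(i) for i in range(1000)]
train, dev, test = split_docs(doc_names)
print(len(train), len(dev), len(test))
```

For 1000 documents this yields 980, 10, and 10 documents, the same document counts reported in the output above.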


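Next, we create the candidate extractor.  Candidates are n-grams of up to three tokens (`n_max=3`) that are accepted by the location matcher.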

In [20]:
from snorkel.candidates import Ngrams
from snorkel.candidates import CandidateExtractor
from dataset_utils import create_candidate_class
from snorkel_utils import get_location_matcher, get_candidate_filter, CandidateExtractorFilter, LocationMatcher

# Setting extraction type -- should be a subfield in your data source extractions field!
extraction_type = 'location'

# Creating candidate class
candidate_class, candidate_class_name  = create_candidate_class(extraction_type)

# Defining ngrams for candidates
location_ngrams   = Ngrams(n_max=3)

# Defining matcher for candidate extractor
location_matcher = LocationMatcher(longest_match_only=True)
cand_extractor    = CandidateExtractor(candidate_class, [location_ngrams], [location_matcher])

# For more complex matching/filtering behavior:
#location_matcher  = get_location_matcher()
#candidate_filter =  get_candidate_filter()
#cand_extractor = CandidateExtractorFilter(LocationExtraction,[location_ngrams],[location_matcher],candidate_filter=candidate_filter)
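
The interplay of `n_max=3` and `longest_match_only=True` can be illustrated without Snorkel. The sketch below enumerates all n-grams of up to three tokens and keeps only matches that are not nested inside a longer match. It is an assumption-laden illustration: the dictionary lookup is a stand-in for whatever `LocationMatcher` actually does, the nested-match rule is one plausible reading of `longest_match_only`, and the function names and location set are hypothetical.

```python
def ngrams(tokens, n_max=3):
    """Yield (start, end, text) for every n-gram with 1 <= n <= n_max."""
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            end = start + n
            if end <= len(tokens):
                yield start, end, ' '.join(tokens[start:end])


def match_locations(tokens, locations, n_max=3, longest_match_only=True):
    """Return dictionary-matched n-grams, optionally dropping nested matches."""
    matches = [(s, e, text) for s, e, text in ngrams(tokens, n_max)
               if text.lower() in locations]
    if longest_match_only:
        # Drop a match when a strictly longer match fully contains it
        matches = [m for m in matches
                   if not any(o[0] <= m[0] and m[1] <= o[1]
                              and (o[1] - o[0]) > (m[1] - m[0])
                              for o in matches)]
    return [text for _, _, text in matches]


# Toy dictionary and sentence, made up for illustration
locations = {'new york', 'new york city', 'york', 'san jose'}
tokens = 'Flights from New York City to San Jose'.split()
all_matches = match_locations(tokens, locations, longest_match_only=False)
longest = match_locations(tokens, locations)
print(all_matches)
print(longest)
```

Without the longest-match rule, the toy sentence produces overlapping candidates such as "New York" and "York" alongside "New York City"; with it, only "New York City" and "San Jose" remain.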

We apply the candidate extractor to each split: train (split 0), dev (split 1), and test (split 2).

In [21]:
# Applying candidate extractor to each split
for k, sents in enumerate([train_sents, dev_sents, test_sents]):
    %time cand_extractor.apply(sents, split=k, parallelism=8)
    print("Number of candidates:", session.query(candidate_class).filter(candidate_class.split == k).count())

Clearing existing...
Running UDF...
CPU times: user 1.5 s, sys: 596 ms, total: 2.1 s
Wall time: 6.53 s
Number of candidates: 1708
Clearing existing...
Running UDF...
CPU times: user 68 ms, sys: 248 ms, total: 316 ms
Wall time: 3.35 s
Number of candidates: 22
Clearing existing...
Running UDF...
CPU times: user 80 ms, sys: 268 ms, total: 348 ms
Wall time: 3.43 s
Number of candidates: 24


We add gold labels to the candidates in the dev (split 1) and test (split 2) sets.

In [22]:
from dataset_utils import get_gold_labels_from_meta

# Adding dev gold labels using dictionary
%time missed_dev = get_gold_labels_from_meta(session, candidate_class, extraction_type, 1, annotator='gold', gold_dict=None)

# Adding test gold labels using dictionary
%time missed_test = get_gold_labels_from_meta(session, candidate_class, extraction_type, 2, annotator='gold', gold_dict=None)

Loading 22 candidate labels

AnnotatorLabels created: 22
CPU times: user 420 ms, sys: 144 ms, total: 564 ms
Wall time: 517 ms
Loading 24 candidate labels

AnnotatorLabels created: 24
CPU times: user 284 ms, sys: 8 ms, total: 292 ms
Wall time: 319 ms


In [23]:
# Checking percent of gold labels that are positive
from dataset_utils import check_gold_perc
perc_pos = check_gold_perc(session)

Percent Positive: 0.20
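
The reported value is a fraction between 0 and 1. As a rough sketch of the underlying arithmetic, assuming gold labels are stored as +1 (positive) and -1 (negative), the figure could be computed as below. `fraction_positive` and the toy labels are hypothetical and do not reproduce the internals of `check_gold_perc`.

```python
def fraction_positive(labels):
    """Return the fraction of labels equal to +1 (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(1 for y in labels if y == 1) / len(labels)


# Made-up labels: 2 positives out of 10
toy_labels = [1, -1, -1, 1, -1, -1, -1, -1, -1, -1]
frac = fraction_positive(toy_labels)
print('Percent Positive: {0:.2f}'.format(frac))
```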


In [24]:
from dataset_utils import remove_gold_labels
# Remove gold labels if you want -- uncomment!
#remove_gold_labels(session)