# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part I: Preprocessing

In this tutorial, we will walk through the process of using `Snorkel` to identify mentions of spouses in a corpus of news articles. The tutorial is broken up into 5 notebooks, each covering a step in the pipeline:
1. Preprocessing
2. Candidate Extraction
3. Annotating Evaluation Data
4. Featurization & Training
5. Evaluation

In this notebook, we preprocess several documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We also use [CoreNLP](http://stanfordnlp.github.io/CoreNLP/) to extract standard linguistic features from each context, which will be useful downstream.

All of this preprocessed input data is saved to a database. A connection string can be specified by setting the `SNORKELDB` environment variable. If no database is specified, Snorkel creates a SQLite database at `./snorkel.db` by default, so no setup is needed here!
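To make the fallback concrete, the following minimal sketch mirrors the behaviour described above: use `SNORKELDB` if it is set and non-empty, otherwise fall back to a SQLite file in the working directory. The helper name `resolve_conn_string` is hypothetical and not part of Snorkel's API, the exact form of the default URL is an assumption, and the Postgres URL is only a made-up example value.

```python
import os

def resolve_conn_string(environ=None):
    """Return the database URL a session would use (illustrative helper, not Snorkel API)."""
    environ = os.environ if environ is None else environ
    value = environ.get('SNORKELDB', '')
    if value:
        return value
    # Assumed default: a SQLite file named snorkel.db in the current directory.
    return 'sqlite:///' + os.path.join(os.getcwd(), 'snorkel.db')

# No variable set: falls back to the local SQLite file.
print(resolve_conn_string({}))
# Variable set: the given connection string is used as-is (example value only).
print(resolve_conn_string({'SNORKELDB': 'postgres://localhost:5432/snorkel'}))
```

If you do set `SNORKELDB`, it is safest to do so before running `from snorkel import SnorkelSession`, so that the value is in place whenever the connection is first configured.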

### Initializing a `SnorkelSession`

First, we initialize a `SnorkelSession`, which will enable us to save intermediate results.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from snorkel import SnorkelSession
session = SnorkelSession()

## Loading the `Corpus`

Next, we load and pre-process the corpus, storing it for convenience in a `Corpus` object.

### Unarchive the Data

In [None]:
import os
os.system('cd data; tar -xzvf data.tar.gz')
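The shell call above depends on a `tar` binary being available and reports failure only through its return value. As a hedged alternative, the same step can be sketched with Python's standard `tarfile` module. The helper name `unarchive` is hypothetical, and the demonstration builds a tiny throwaway archive rather than touching `data/data.tar.gz`.

```python
import os
import tarfile
import tempfile

def unarchive(archive_path, dest_dir):
    """Extract a gzip-compressed tarball into dest_dir and return the member names."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Demonstration on a throwaway archive containing one made-up TSV file.
work = tempfile.mkdtemp()
src = os.path.join(work, 'articles-demo.tsv')
with open(src, 'w') as f:
    f.write('doc1\tSome text.\n')
archive = os.path.join(work, 'demo.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(src, arcname='articles-demo.tsv')

out_dir = os.path.join(work, 'out')
os.mkdir(out_dir)
extracted = unarchive(archive, out_dir)
print(extracted)
```

For this tutorial the equivalent call would be `unarchive('data/data.tar.gz', 'data')`. Unlike `os.system`, a missing or corrupt archive raises an exception instead of returning a non-zero status.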

### Configuring a `DocParser`

We'll start by creating a `TSVDocParser` to read in the documents, which are stored in a tab-separated value format as pairs of document names and text.

In [None]:
from snorkel.parser import TSVDocParser
doc_parser = TSVDocParser(path='data/articles-train.tsv')
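To illustrate the input format, here is a minimal sketch of reading name/text pairs from a tab-separated file using only the standard library. It is not how `TSVDocParser` is implemented; the function `read_tsv_docs` and the sample rows are hypothetical.

```python
import io
import os
import tempfile

def read_tsv_docs(path):
    """Yield (name, text) pairs from a file with one 'name<TAB>text' row per line."""
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the text are preserved.
            name, text = line.split('\t', 1)
            yield name, text

# Demonstration with made-up rows written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'sample.tsv')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'doc-1\tAlice married Bob in 2010.\n')
    f.write(u'doc-2\tCarol attended the ceremony.\n')

docs = list(read_tsv_docs(path))
print(docs)
```

Each row becomes one document: the first column is its name and the remainder of the line is its raw text.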

### Creating a `SentenceParser`

Next, we'll use an NLP preprocessing tool to split the `Document` objects into sentences and tokens, and to annotate these sentences with part-of-speech tags, dependency parse structure, lemmatized word forms, etc. Here we use the default `SentenceParser` class.

In [None]:
from snorkel.parser import SentenceParser

sent_parser = SentenceParser()
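As a rough picture of what this step produces, the sketch below splits a document into sentences and word tokens with simple regular expressions. This is a deliberately naive stand-in, not CoreNLP and not Snorkel's `SentenceParser`: it yields no part-of-speech tags, lemmas, or dependency parses, and the function name `naive_parse` is hypothetical.

```python
import re

def naive_parse(text):
    """Split text into sentences, and each sentence into word and punctuation tokens."""
    # Break after ., ! or ? when followed by whitespace (naive: fails on abbreviations).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return [{'text': s, 'words': re.findall(r'\w+|[^\w\s]', s)} for s in sentences]

parsed = naive_parse('Alice married Bob in 2010. They live in Paris!')
for sent in parsed:
    print(sent['text'], sent['words'])
```

A real NLP pipeline handles the cases this sketch gets wrong, such as abbreviations like "Dr." or "U.S.", and attaches per-token annotations alongside the words.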

### Pre-processing & loading the `Corpus`

Finally, we'll put this all together using a `CorpusParser` object, which will execute the parsers and store the results as a `Corpus`:

In [None]:
from snorkel.parser import CorpusParser

cp = CorpusParser(doc_parser, sent_parser)
%time corpus = cp.parse_corpus(session, 'News Training')

Note that the printed stats are available from the `Corpus` object and can be printed again via the `corpus.stats()` method!

In [None]:
doc = corpus.documents[0]
doc

In [None]:
sent = doc.sentences[0]
print unicode(sent)
print unicode(sent.words)
print sent.pos_tags

### Saving the `Corpus`
We then persist the parsed corpus in Snorkel's database backend:

In [None]:
session.add(corpus)
session.commit()

### Repeating for development and test corpora
We will rerun the same operations for the other two News corpora: development and test. All we do is change the path that the `TSVDocParser` uses.

In [None]:
for name, path in [('News Development', 'data/articles-dev.tsv'),
                   ('News Test', 'data/articles-test.tsv')]:
    doc_parser.path = path
    %time corpus = cp.parse_corpus(session, name)
    session.commit()

Next, in Part II, we will look at how to extract `Candidate` relations from our saved `Corpus`.

In [None]:
## This cell is just for speeding up automatic testing. You can safely ignore it!
import os
if 'CI' in os.environ:
    from snorkel.models import Corpus
    import random
    for corpus_name in ['News Training']:
        corpus = session.query(Corpus).filter(Corpus.name == corpus_name).one()
        docs = set(corpus.documents)
        for doc in docs:
            if random.random() > .10:
                corpus.remove(doc)
    session.commit()