### Snorkel

- This walkthrough shows how to use Snorkel to generate labeled training data. In this example, we want to identify mentions of *spouses* in a corpus of news articles. The corpus contains 50 documents, but only the first 10 are used here to keep run times short.

First, instantiate a SnorkelSession, which automatically manages a connection to a SQLite database; Postgres is mentioned as the alternative database.

In [373]:
from snorkel import SnorkelSession
session = SnorkelSession()

Now load the corpus for preprocessing. The documents are read from a tab-separated values (TSV) file. Note that the objective, again, is to identify mentions of spouses in these documents.

In [374]:
from snorkel.parser import TSVDocPreprocessor
doc_preprocessor = TSVDocPreprocessor('data/articles.tsv', max_docs=10)

### Document Example
**0001_c9843f4e-9c43-4eca-9665-37aef78f5ea3**	The Duke of Cambridge has thrown his support behind an organisation's fight against bullying - and listed an enviable support network. \n \nWilliam wrote down Catherine, Harry, father, grandmother, grandfather and an extra - his dog Lupo - when he joined a Diana Fund trainee session for anti-bullying ambassadors. \n \nFifty youngsters from across the country were set the "high five" task of naming five people they would turn to for help with verbal, physical or cyber abuse. \n \nThe Duke was given a large cardboard hand to fill in and named his immediate family - better known as wife Kate, Prince Harry, the Prince of Wales, the Queen and the Duke of Edinburgh - before signing the palm with his name......"
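
Each line of the TSV file holds a document name and the document text, separated by a tab, as in the example above. A minimal sketch of that split in plain Python, assuming one document per line; `parse_tsv_line` and the sample line are hypothetical and are not part of Snorkel:

```python
def parse_tsv_line(line):
    """Split one TSV line into (document name, document text)."""
    # Split on the first tab only, so any tabs inside the text are preserved
    name, text = line.rstrip("\n").split("\t", 1)
    return name, text


# Hypothetical sample line: a document name, a tab, then the article text
sample = "0001_example\tThe Duke was given a large cardboard hand to fill in.\n"
doc_name, doc_text = parse_tsv_line(sample)
print(doc_name)
print(doc_text)
```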

In [375]:
# Split the documents into sentences and tokens
from snorkel.parser import CorpusParser
corpus_parse = CorpusParser()
%time corpus_parse.apply(doc_preprocessor)

# This step can be run in parallel for faster run times

Clearing existing...
Running UDF...
CPU times: user 343 ms, sys: 27.8 ms, total: 371 ms
Wall time: 3.76 s


In [376]:
# Query the SQLite database to check what was created:
# the number of documents (should be 10 here) and the number of sentences
from snorkel.models import Document, Sentence

print("Total Number of Documents in Corpus: {}\n".format(session.query(Document).count()))
print("Total Number of Sentences: {}".format(session.query(Sentence).count()))

Total Number of Documents in Corpus: 10

Total Number of Sentences: 184


**Tables created in SQL**

**Connect to SQLite and check that the tables were created**

In [377]:
import sqlite3

# Connect to the Snorkel database and query it; report any database error
try:
    db = sqlite3.connect('snorkel.db')
    cursor = db.cursor()
    # Return the first two rows of the sentence table
    cursor.execute("SELECT * FROM sentence LIMIT 2")
    for row in cursor.fetchall():
        print(row)
        print("\n")
except sqlite3.Error as e:
    print(str(e))

(11, 1, 0, u"The Duke of Cambridge has thrown his support behind an organisation's fight against bullying - and listed an enviable support network.", <read-write buffer ptr 0x10f5f15d0, size 279 at 0x10f5f1590>, <read-write buffer ptr 0x10e65d250, size 54 at 0x10e65d210>, <read-write buffer ptr 0x10d33f310, size 267 at 0x10d33f2d0>, <read-write buffer ptr 0x10f5d2f10, size 218 at 0x10f5d2ed0>, <read-write buffer ptr 0x10e14fef0, size 168 at 0x10e14feb0>, <read-write buffer ptr 0x10d7f0520, size 54 at 0x10d7f04e0>, <read-write buffer ptr 0x10ce80d90, size 272 at 0x10ce80d50>, <read-write buffer ptr 0x10f937cd0, size 77 at 0x10f937c90>, <read-write buffer ptr 0x10f937bb0, size 77 at 0x10f937b70>)


(12, 1, 1, u'\\n \\nWilliam wrote down Catherine, Harry, father, grandmother, grandfather and an extra - his dog Lupo - when he joined a Diana Fund trainee session for anti-bullying ambassadors.', <read-write buffer ptr 0x10f6e1930, size 383 at 0x10f6e18f0>, <read-write buffer ptr 0x10f937d60,

**Extract spouse relations**, or what Snorkel calls "candidates", from the corpus (i.e. doc_preprocessor) and determine whether each candidate pair of people should be classified as married or not (yes/no). Note that a candidate is the object we want to make a prediction on.

**Defining a Schema for the candidate**
- Create a binary spouse relation mention that connects two span objects (i.e. sequences of text defined by their start and end positions).

In [378]:
from snorkel.models import candidate_subclass

try:  # create the Spouse candidate class (and its table) if it does not exist yet
    Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
except Exception:  # skip if the Spouse table has already been created
    pass

**Check that the spouse table has been created**

**Table Schema**
- Columns: person1 and person2 created
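
One way to check that a table exists is to query SQLite's `sqlite_master` catalog. The sketch below runs against an in-memory stand-in database so that it is self-contained; the `spouse` table name and its columns are assumptions made for illustration, and with the real data you would connect to `snorkel.db` instead:

```python
import sqlite3

# In-memory stand-in for snorkel.db (assumption: the real table is named "spouse")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spouse (id INTEGER PRIMARY KEY, person1 INTEGER, person2 INTEGER)")

# sqlite_master lists every table in the database
tables = [row[0] for row in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print("spouse" in tables)

# PRAGMA table_info returns one row per column; index 1 holds the column name
columns = [row[1] for row in db.execute("PRAGMA table_info(spouse)")]
print(columns)
db.close()
```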

**Extract candidate spouse relation mentions from the corpus**



In [379]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher

- *ngrams*: An n-gram is a sequence of "N" consecutive words. N-grams are also the basis of n-gram language models, which assign probabilities to sequences of words by estimating the probability of the last word given the previous words.
- In this example, sequences of up to three consecutive words (e.g. "The Duke of") are considered.
- Here, however, no probabilities are assigned: Ngrams only defines the spans to search, and we want to extract candidates of type "Spouse" from spans of up to three words that are tagged as "Person".
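
To make the idea of n-grams concrete, the sketch below enumerates every run of up to three consecutive words in a token list. It is plain Python for illustration, not Snorkel's `Ngrams` implementation; `ngrams_up_to` is a hypothetical helper:

```python
def ngrams_up_to(tokens, n_max):
    """Return every run of 1 to n_max consecutive tokens as a tuple."""
    spans = []
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            # Only keep runs that fit inside the token list
            if start + n <= len(tokens):
                spans.append(tuple(tokens[start:start + n]))
    return spans


tokens = ["The", "Duke", "of", "Cambridge"]
for span_tokens in ngrams_up_to(tokens, n_max=3):
    print(" ".join(span_tokens))
```

For four tokens this yields four 1-grams, three 2-grams and two 3-grams; the full four-word phrase is not included because it is longer than `n_max`.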

In [380]:
ngrams = Ngrams(n_max=3)

- *PersonMatcher*: Looks for all n-grams that match the names of **people** identified by CoreNLP

In [381]:
person_matcher = PersonMatcher(longest_match_only=True)

*CandidateExtractor*: Looks at all n-grams up to 3 words long and keeps those that meet this criterion. The resulting candidates are then stored in the Spouse table of the database.

In [382]:
cand_extractor = CandidateExtractor(Spouse, 
    [ngrams, ngrams], [person_matcher, person_matcher],
    symmetric_relations=False)
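
Conceptually, the extractor pairs up the person mentions found in a sentence. A rough sketch of that pairing, assuming that with `symmetric_relations=False` each unordered pair is kept only once, in order of appearance; the mention list is made up and this is not Snorkel's implementation:

```python
from itertools import combinations

# Hypothetical person mentions found in one sentence, in order of appearance
mentions = ["William", "Kate", "Harry"]

# combinations() yields each pair once, so (A, B) appears but (B, A) does not,
# and a mention is never paired with itself
pairs = list(combinations(mentions, 2))
for person1, person2 in pairs:
    print("{} / {}".format(person1, person2))
```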

In [383]:
# Count the people mentioned in a sentence, so that sentences can be filtered on it.
# Consecutive person tags (a multi-word name) count as one person.
def num_people(sentence):
    active_sequence = False
    count = 0
    for tag in sentence.ner_tags:
        # CoreNLP emits upper-case NER tags such as "PERSON"
        if tag == "PERSON" and not active_sequence:
            active_sequence = True
            count += 1
        elif tag != "PERSON" and active_sequence:
            active_sequence = False
    return count
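
The function counts runs of consecutive person tags, so a multi-token name counts only once. A self-contained sketch of the same logic on a plain list of tags; the tag values and the `count_runs` helper are hypothetical:

```python
def count_runs(tags, label):
    """Count maximal runs of consecutive tags equal to label."""
    count = 0
    in_run = False
    for tag in tags:
        # A new run starts when the label appears after a different tag
        if tag == label and not in_run:
            count += 1
        in_run = (tag == label)
    return count


# Hypothetical tags for a four-token sentence: a two-token name, another word, a one-token name
tags = ["PERSON", "PERSON", "O", "PERSON"]
print(count_runs(tags, "PERSON"))
```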

**Split Data into Train/Dev/Test sets:** (i.e. 80%/10%/10%, split by document in a fixed, non-random order so that the splits stay the same between runs). Note, the train/dev/test groups will be labeled as 0/1/2.

- In this example, we will filter out sentences that mention 5 or more people.

In [442]:
from snorkel.models import Document

docs = session.query(Document).order_by(Document.name).all()
ld = len(docs)

# Initialize the sets of sentences for each split
train_sents = set()
dev_sents = set()
test_sents = set()

# Specify the split boundaries: first 80% train, next 10% dev, last 10% test
splits = (0.8, 0.9)
for i, doc in enumerate(docs):
    for s in doc.sentences:
        # Keep only sentences that mention fewer than 5 people
        if num_people(s) < 5:
            if i < splits[0] * ld:
                train_sents.add(s)
            elif i < splits[1] * ld:
                dev_sents.add(s)
            else:
                test_sents.add(s)
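
The split rule in the loop above can be checked on its own. A minimal sketch, assuming 10 documents as in this example; `assign_split` is a hypothetical helper that mirrors the comparisons in the loop:

```python
def assign_split(index, total, splits=(0.8, 0.9)):
    """Return 0 (train), 1 (dev) or 2 (test) for the document at position index."""
    if index < splits[0] * total:
        return 0
    elif index < splits[1] * total:
        return 1
    return 2


# With 10 documents, positions 0-7 go to train, 8 to dev and 9 to test
assignments = [assign_split(i, 10) for i in range(10)]
print(assignments)
```

With only 10 documents, the dev and test sets each come from a single document, which is why they contain few candidates.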

**Candidate Extractor**

Apply the *cand_extractor* to the training sentences. Again, here we are extracting candidate pairs of "person" mentions in the training set. You can execute this with a parallelism parameter if you are using a database other than SQLite.

In [443]:
%time cand_extractor.apply(train_sents, split=0)

Clearing existing...
Running UDF...

CPU times: user 568 ms, sys: 82.9 ms, total: 651 ms
Wall time: 614 ms


- Now, get the candidates that were just extracted

In [444]:
# Run a query to return the candidates in the training split (split 0)
train_cands = session.query(Spouse).filter(Spouse.split == 0).all()
print("Number of candidates in training set: {}".format(len(train_cands)))

Number of candidates in training set: 80


In [445]:
# try:# try connecting to the snorkel db
#     db = sqlite3.connect('snorkel.db')
#     c = db.cursor()
# except Exception as e:
#     print(str(e))
    
# c.execute("SELECT COUNT('type') FROM Spouse WHERE('type') == 0")
# for row in c.fetchone():
#      print("Number of Candidates: {}".format(row))

**Inspect the Candidates:** Note, the objective is to maximize recall, which is the ratio of the true positives found to all actual positives (true positives plus false negatives).
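
Recall is the number of true positives divided by the number of actual positives (true positives plus false negatives). A small worked sketch with made-up counts; `recall` is a hypothetical helper, not a Snorkel function:

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0
    return float(true_positives) / actual_positives


# Made-up counts: 8 true spouse pairs were extracted as candidates, 2 were missed
print(recall(8, 2))
```

At the candidate extraction stage, a missed pair can never be recovered later, which is why recall matters more than precision here.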

In [446]:
# from snorkel.viewer import SentenceNgramViewer

# # NOTE: This if-then statement is only to avoid opening the viewer during automated testing of this notebook
# # You should ignore this!
# import os
# if 'CI' not in os.environ:
#     sv = SentenceNgramViewer(train_cands[:300], session)
# else:
#     sv = None

- Note: Candidates are tuples of Context-type objects

In [447]:
# There should be 80 candidates
for c in train_cands:
    print(c.person1)
print(len(train_cands))

Span("Andres", sentence=61, chars=[62,67], words=[14,14])
Span("Enoch", sentence=86, chars=[136,140], words=[25,25])
Span("Enoch", sentence=86, chars=[136,140], words=[25,25])
Span("Enoch", sentence=86, chars=[136,140], words=[25,25])
Span("Nelson Johnson", sentence=86, chars=[364,377], words=[70,71])
Span("Nelson Johnson", sentence=86, chars=[364,377], words=[70,71])
Span("Johnson", sentence=86, chars=[150,156], words=[29,29])
Span("Steven Zaillian", sentence=77, chars=[61,75], words=[14,15])
Span("Steven Zaillian", sentence=77, chars=[61,75], words=[14,15])
Span("Steven Zaillian", sentence=77, chars=[61,75], words=[14,15])
Span("Frank Lucas", sentence=77, chars=[225,235], words=[46,47])
Span("Frank Lucas", sentence=77, chars=[225,235], words=[46,47])
Span("Ridley Scott", sentence=77, chars=[285,296], words=[57,58])
Span("Serzh Sargsyan", sentence=112, chars=[51,64], words=[11,12])
Span("Denzel Washington", sentence=78, chars=[49,65], words=[9,10])
Span("Tony Camonte", sentence=83, ch

**The hierarchy of context objects in Snorkel is:**
- Documents
- Sentences
- Spans
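
The hierarchy can be pictured as a chain of parent links: a span points to its sentence, and a sentence points to its document. A toy sketch of that chain, using a hypothetical stand-in `Context` class rather than Snorkel's own models:

```python
class Context(object):
    """Minimal stand-in for a context object that knows its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def get_parent(self):
        return self.parent


# Hypothetical document -> sentence -> span chain
document = Context("Document 0004")
sentence = Context("Sentence 4", parent=document)
span = Context("Scott Cooper", parent=sentence)

print(span.get_parent().name)               # the sentence that contains the span
print(span.get_parent().get_parent().name)  # the document that contains the sentence
```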

In [448]:
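# c is the last candidate from the loop above
span1 = c.get_contexts()[0]
span2 = c.get_contexts()[1]

# The parent of a span's sentence is the document
print(span1.get_parent().get_parent())

print("\n")

# The parent of a span is its sentence
print(span1.get_parent())

print("\n")

print(span1)
print(span2)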

Document 0004_4233a2bb-4611-4993-9fe4-da863fe62488


Sentence(Document 0004_4233a2bb-4611-4993-9fe4-da863fe62488, 4, u'"Crazy Heart" writer-director Scott Cooper helmed the drama, based on the 2000 book Black Mass: The True Story of an Unholy Alliance Between the FBI and the Irish Mob by Dick Lehr and Gerard O\'Neill, which tells the sordid tale of Whitey Bulger, a merciless South Boston mobster who collaborated with the Feds to bring down his Italian rivals.')


Span("Scott Cooper", sentence=75, chars=[30,41], words=[5,6])
Span("Whitey Bulger", sentence=75, chars=[231,243], words=[46,47])


**Example of getting Span, tokens, and tag**

In [449]:
# span is assumed to be a Span defined earlier in the session (here, "Gerard O'Neill")
print(span.get_span())
print(span.get_attrib_tokens())
print(span.get_attrib_tokens('pos_tags'))  # Proper nouns

Gerard O'Neill
[u'Gerard', u"O'Neill"]
[u'NNP', u'NNP']


**Repeat for development and testing corpora** 

- Repeat the same process as above. First, load in the corpus object, collect the sentence objects, and then run the CandidateExtractor on both the development and testing sets.

In [450]:
%%time
for i, sents in enumerate([dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i+1)
    print(i+1)
    print("Number of candidates: {}".format(session.query(Spouse).filter(Spouse.split == i+1).count()))

Clearing existing...
Running UDF...

1
Number of candidates: 4
Clearing existing...
Running UDF...

2
Number of candidates: 11
CPU times: user 263 ms, sys: 68.1 ms, total: 331 ms
Wall time: 288 ms


### Creating or Loading Evaluation Labels

- Recall that Snorkel is used to train machine learning models for classification problems without hand-labeling the training data. However, a small amount of labeled data is still required to help develop and evaluate the application.

Two small labeled data sets are required:

- A development set, which can be a subset of the training set, and which we use to guide the development process.
- A test set to evaluate the final performance against. **Important:** You should get someone who is not involved in the development of your application to label the test set.

In [451]:
dev_cands = session.query(Spouse).filter(Spouse.split == 1).all()
print("Development Candidates: {}".format(len(dev_cands)))

Development Candidates: 4


In [452]:
test_cands = session.query(Spouse).filter(Spouse.split == 2).all()
print("Testing Candidates: {}".format(len(test_cands)))

Testing Candidates: 11


In [453]:
sv

**External annotations**

In [455]:
from load_external_annotations import load_external_labels
load_external_labels(session, Spouse, annotator_name='gold')

IOError: File gold_labels.tsv does not exist

In [438]:
from snorkel.viewer import SentenceNgramViewer
#if 'CI' not in os.environ:
    #sv = SentenceNgramViewer(dev_cands, session)