In [1]:
from tcre.env import *
from tcre.nb import utils
import pathlib as pl
from IPython.display import display, HTML
utils.FORCE_NB_RELOAD = True

def get_cell(path, name):
    return utils.get_tag_html(path, name, prefix='summary')

def show_cell(path, name):
    return display(HTML(get_cell(path, name)[0]))

def plink(path, use_path_name=False):
    rel = path.relative_to(REPO_DIR)
    name = rel if use_path_name else path.name
    return '<a href="{}">{}</a>'.format(pl.Path('..') / rel, name)

META_DATA_DIR = pl.Path(REPO_DATA_DIR) / 'meta'
SRC_TCRE = pl.Path(REPO_DIR) / 'src' / 'tcre'
SRC_TRAIN = SRC_TCRE / 'exec' / 'v1' / 'train.py'
SRC_MODEL = SRC_TCRE / 'exec' / 'v1' / 'model.py'
NB_PIPELINE = pl.Path(REPO_DIR) / 'pipeline'
NB_ENTREZ_IMPORT = NB_PIPELINE / '01-entrez-import.ipynb'
NB_PMCOA_IMPORT = NB_PIPELINE / '01-pmcoa-import.ipynb'
NB_PRO_IMPORT = NB_PIPELINE / '02-meta-import-pro.ipynb'
NB_CL_IMPORT = NB_PIPELINE / '02-meta-import-cl.ipynb'
NB_META_CT = NB_PIPELINE / '03-meta-cell-types.ipynb'
NB_META_TF = NB_PIPELINE / '03-meta-transcription-factors.ipynb'
NB_META_CK = NB_PIPELINE / '03-meta-cytokines.ipynb'
NB_META_PR = NB_PIPELINE / '03-meta-proteins.ipynb'
NB_PHEN_INF = NB_PIPELINE / 'misc' / 'eda' / 'eda-phenotype.ipynb'
NB_LBLMDL_V2 = NB_PIPELINE / 'misc' / 'modeling' / 'label-model-training-v2.ipynb'
NB_DEV_GRID = NB_PIPELINE / '11-modeling-rnn-strong.ipynb'
NB_TRAIN_MARGINALS = NB_PIPELINE / '08-candidate-lfs.ipynb'
NB_TRAIN_GENMODEL = NB_PIPELINE / '10-modeling-sgm.ipynb'
NB_END_TRAIN = NB_PIPELINE / '11-modeling-rnn-weak.ipynb'
NB_ANALYSIS_SCORES = NB_PIPELINE / '12-analysis-scores.ipynb'

# **NOTE** if the markdown substitution isn't working, make sure the notebook is "Trusted" at the top

## Project Summary

Top-level stats, motivations, and results that are otherwise difficult to compile from individual project files.

Contents:

- [Corpora](#corpora)
- [Controlled Vocabularies](#controlled-vocabularies)
- [Phenotype Inference](#cell-phenotype-inference)
- [Candidate Annotation](#candidate-annotation)
- [Labeling Functions](#labeling-functions)
- [Generative Model](#generative-model)
- [Strong Supervision](#strong-supervision)
- [Weak Supervision](#weak-supervision)
- [Relation Extraction](#relation-extraction)

<h1><a id='corpora'>Corpora</a></h1>

There are two main corpora used:
- A smaller dev corpus built from a targeted Entrez query
- A larger PMC open access corpus filtered for key words/phrases

### Dev Corpus

Notebook: <a href="../pipeline/01-entrez-import.ipynb">01-entrez-import.ipynb</a>

This was collected using the Entrez API:

```python
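from Bio import Entrez
handle = Entrez.esearch(db='pmc', sort='relevance', retmode='xml', term=query)
ids = Entrez.read(handle)['IdList']  # the parsed search record keeps the matching IDs under 'IdList'
Entrez.efetch(db='pmc', rettype='full', retmode='xml', id=ids)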
```
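
For a result set of this size, the search and fetch calls would likely need an explicit result limit and batching. The following is a minimal sketch under that assumption, not the code from the import notebook: `chunked`, `fetch_pmc_articles`, the batch size, and the placeholder email address are all hypothetical, and Biopython is imported lazily so the batching helper has no dependencies of its own.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_pmc_articles(query: str, limit: int = 20000, batch_size: int = 200):
    """Search PMC by relevance and yield one raw XML payload per batch of IDs."""
    from Bio import Entrez  # Biopython; imported here so `chunked` stays dependency-free

    Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
    handle = Entrez.esearch(db="pmc", term=query, sort="relevance",
                            retmode="xml", retmax=limit)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    for batch in chunked(ids, batch_size):
        handle = Entrez.efetch(db="pmc", id=batch, rettype="full", retmode="xml")
        yield handle.read()
        handle.close()
```

E-utilities cap how many IDs a single `esearch` request returns, so a large `limit` may need paging with `retstart` rather than one call.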

Results were sorted by relevance and arbitrarily limited to the first ~20k documents (out of 124,720 as of 2019-03-11).  The query associated with these results was:

In [2]:
show_cell(NB_ENTREZ_IMPORT, 'query')

The collected documents contain a mix of full text and abstract-only articles.  Frequencies: 

In [3]:
show_cell(NB_ENTREZ_IMPORT, 'textct')

### Notes on PMC Queries

To compare results for keyword searches vs MeSH terms, here are a few key result set sizes (as of 2019-07-08):


- Target Documents: ```Humans AND T-cells AND cytokines AND differentiation AND induction```
    - With keywords:
        - PMC Search: *(human) AND ( (t cell) OR (t lymphocyte) ) AND (cytokine) AND ((differentiate) OR (differentiation) OR (differentiated)) AND ((polarization) OR (polarize) OR (induce) OR (induction))*
        - Results: **129,982**
    - With MeSH terms:
        - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND "cell differentiation"[MeSH Terms] AND "transcriptional activation"[MeSH Terms]```
        - Results: **16**
- Target Documents: Humans AND T-cells AND cytokines AND differentiation 
    - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms] AND "cell differentiation"[MeSH Terms]```
    - Results: **1,544**
- Target Documents: Humans AND T-cells AND differentiation 
    - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cell differentiation"[MeSH Terms]```
    - Results: **3,222**
- Target Documents: Humans AND T-cells AND cytokines
    - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "cytokines"[MeSH Terms]```
    - Results: **16,874**
- Target Documents: Humans AND T-cells AND induction
    - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms]) AND "transcriptional activation"[MeSH Terms]```
    - Results: **378**
- Target Documents: Humans AND T-cells
    - PMC Search: ```"humans"[MeSH Terms] AND ("t lymphocytes"[MeSH Terms] OR "t lymphocyte subsets"[MeSH Terms])```
    - Results: **48,059**
    
MeSH Terms Above:

- [Cell Differentiation](https://meshb.nlm.nih.gov/record/ui?ui=D002454)
- [Humans](https://meshb.nlm.nih.gov/record/ui?ui=D006801)
- [T-Lymphocytes](https://meshb.nlm.nih.gov/record/ui?ui=D013601)
- [Transcriptional Activation](https://meshb.nlm.nih.gov/record/ui?ui=D015533)
- [Cytokines](https://meshb.nlm.nih.gov/record/ui?ui=D016207)

### Primary Corpus

Notebook: <a href="../pipeline/01-pmcoa-import.ipynb">01-pmcoa-import.ipynb</a>

This was extracted from the PMC OA subset.  Documents are filtered based on appearance of the following key words/phrases:

In [4]:
show_cell(NB_PMCOA_IMPORT, 'searchterms')

A first pass over all documents records, for each document, indicators for the presence of the strings above along with other metadata.  Results are shown below:

In [5]:
show_cell(NB_PMCOA_IMPORT, 'metainfo')

The search terms are grouped into coarser categorizations (e.g. "has common CD markers" = "cd3" or "cd4" or "cd8")
and frequencies of some relevant ones (out of ~2.5M docs) can be seen here:

In [6]:
show_cell(NB_PMCOA_IMPORT, 'termfreq')

Target documents are then extracted for some combination of the matches above to give resulting datasets like this (first 1k rows):

In [7]:
show_cell(NB_PMCOA_IMPORT, 'extractinfo')

NOTE: I haven't done anything more with this yet.  The plan, though, was to move forward with a corpus filtered by these criteria:
- Must contain the word "Human"
- Must contain the word "T cell" (or nearby variants)
- Must contain one of CD3, CD4 or CD8 

This restricts the 2.5M documents to just ~48k.
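
The three criteria above can be sketched as a simple predicate over document text. This is only an illustration: the regular expressions are assumed interpretations of "Human", "T cell (or nearby variants)", and "CD3, CD4 or CD8", and `keep_document` is a hypothetical name rather than a function from the import notebook.

```python
import re

# Assumed interpretations of the three criteria; the notebook's actual
# search terms may differ.
HUMAN_RE = re.compile(r"\bhumans?\b", re.IGNORECASE)
T_CELL_RE = re.compile(r"\bt[- ]?(?:cell|lymphocyte)", re.IGNORECASE)
CD_MARKER_RE = re.compile(r"\bCD(?:3|4|8)\b", re.IGNORECASE)


def keep_document(text: str) -> bool:
    """Return True only if all three criteria match somewhere in the text."""
    return all(p.search(text) for p in (HUMAN_RE, T_CELL_RE, CD_MARKER_RE))
```

The trailing `\b` on the marker pattern keeps strings like "CD45RA" from counting as a CD4 match.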

<h1><a id='controlled-vocabularies'>Controlled Vocabularies</a></h1>

Ontologies or other sources are used separately to build vocabulary tables for the following entity types:

- Surface Proteins ([Protein Ontology (PRO)](https://paperpile.com/app/p/3f1c6c43-0bca-0760-a9e0-1ff7016060b0))
    - PRO import notebook: <a href="../pipeline/02-meta-import-pro.ipynb">02-meta-import-pro.ipynb</a>
    - Integration notebook: <a href="../pipeline/03-meta-proteins.ipynb">03-meta-proteins.ipynb</a> (filters to immunology marker proteins e.g. CD markers)
    - All nodes imported descend from [PR_000000001](http://purl.obolibrary.org/obo/PR_000000001)
    - When using it for surface marker token matching, any subtree with a node including a label or synonym starting with "CD" is extracted 
    - The import enforces synonym uniqueness by prioritizing on synonym type (e.g. BROAD = lowest priority, EXACT = high priority)
- Cell Types ([Cell Ontology (CL)](https://paperpile.com/app/p/7706c6ce-ba79-067b-86b9-deb6ce967dd6))
    - CL import notebook: <a href="../pipeline/02-meta-import-cl.ipynb">02-meta-import-cl.ipynb</a> 
    - Integration notebook: <a href="../pipeline/03-meta-cell-types.ipynb">03-meta-cell-types.ipynb</a> (incorporates sources beyond CL)
    - All imported terms descend from [CL_0000084](http://purl.obolibrary.org/obo/CL_0000084)
    - Again, synonym uniqueness is enforced by resolving conflicts based on synonym type (and arbitrarily for ties)
- Cytokines ([Cytokine Registry (CKR)](https://www.immport.org/resources/cytokineRegistry))
    - CKR import notebook: <a href="../pipeline/03-meta-cytokines.ipynb">03-meta-cytokines.ipynb</a>
    - In this case, as CKR isn't a very structured ontology, terms and synonyms are imported much like other sources of cytokine synonyms 
- Transcription Factors ([The Human Transcription Factors](https://paperpile.com/app/p/962005d3-58ab-063c-ac54-abebd479a0d9))
    - TF integration notebook: <a href="../pipeline/03-meta-transcription-factors.ipynb">03-meta-transcription-factors.ipynb</a>

In each case:

- Additional sources of aliases for terms (e.g. MyGene) are mapped to the primary source (i.e. those linked above)
- Most sources have a list of manual items appended from tables in <a href="../data/meta/raw">data/meta/raw</a>
- All synonyms are filtered to those with a length greater than 3 or 4 characters (depending on the type)
    - This is also followed by a blacklist of synonyms to exclude (e.g. "OUT", "GENESIS", "IFI", etc.) that are too ambiguous to use for token matching

Resulting tables are all serialized in <a href="../data/meta">data/meta</a>
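
The synonym handling described in this section (type-based conflict resolution, a minimum length, and a blacklist) can be sketched as below. This is a hedged illustration rather than the integration notebooks' code: `build_vocabulary`, the record layout, the `min_len` default, and the relative ranking of NARROW and RELATED are assumptions; only the EXACT-over-BROAD ordering and the blacklist examples come from the text.

```python
from typing import Dict, Iterable, Tuple

# Lower rank wins. EXACT-over-BROAD follows the text; the relative order of
# NARROW and RELATED is an assumption.
TYPE_RANK = {"EXACT": 0, "NARROW": 1, "RELATED": 2, "BROAD": 3}
BLACKLIST = {"OUT", "GENESIS", "IFI"}  # examples quoted in the text


def build_vocabulary(records: Iterable[Tuple[str, str, str]],
                     min_len: int = 3) -> Dict[str, str]:
    """Map each surviving synonym to exactly one term ID.

    `records` holds (synonym, term_id, synonym_type) triples.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for synonym, term_id, syn_type in records:
        if len(synonym) < min_len or synonym.upper() in BLACKLIST:
            continue
        rank = TYPE_RANK[syn_type]
        # Strict comparison: on a tie the first record seen is kept (arbitrary).
        if synonym not in best or rank < best[synonym][0]:
            best[synonym] = (rank, term_id)
    return {synonym: term_id for synonym, (rank, term_id) in best.items()}
```

Keeping the rank alongside the term ID makes the tie-breaking rule explicit, which matters because ties are resolved arbitrarily.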

<h1><a id='cell-phenotype-inference'>Phenotype Inference</a></h1>

This analysis was an aside to the issue discussion here: https://github.com/hammerlab/t-cell-relation-extraction/issues/2

The problem that motivated this work was that currently, all cell type mentions are based on exact token sequence matches via [SpaCy pattern matchers](https://spacy.io/usage/rule-based-matching) (EntityRuler specifically).  To expand this to mentions of T cell types that don't necessarily include a label like "Th1", this analysis was an attempt to see how often cells are instead referenced by surface marker expression.

This is also related to the ontology expansion discussion [here](https://github.com/hammerlab/t-cell-relation-extraction/issues/2#issuecomment-501504614): a precursor to using something like HiExpan is building a set of noun phrases to place in an ontology, and using the JNLPBA-trained tagger won't work for that.  For example, on the 20K document dev corpus the tagger produces about **70K** unique noun phrases that either have a substring like "T cell" or "T-lymphocyte" OR a substring matching any one of the 627 synonyms for T cell types ("Th1", "MAIT", "iTreg", etc.).  There are only **136** T cell terms in CL, meaning that even if the 70K terms from the tagger include cell types not in CL, the set would still have to be highly compressed before applying it to taxonomy expansion.

The essence of the analysis then was to figure out the following things: 

- Which naming modality is more common, ```CD4+CD25+FoxP3+ T cells``` or ```CD4+CD25+FoxP3+ Treg cells```?
- If the modality above with no "Treg" label is very common, can it be shown that the expression markers can be used to cluster the mentions near other mentions where the label (i.e. a ground truth) is present?

The process for doing this in the notebook <a href="../pipeline/misc/eda/eda-phenotype.ipynb">pipeline/misc/eda/eda-phenotype.ipynb</a> includes:

- Pull a list of overlapping tags from both the JNLPBA model and the T cell type token matcher (using 627 aliases)
- Filter to tags that have exactly one token phrase match or a substring like "T cell" (must match case-insensitive regex ```t[- ]?(?:cell|lymphocyte)```)
- Re-tokenize protein strings
    - This is in the separate ```ptkn``` src package (usage example: <a href="../pipeline/experiments/protein-tokenization.ipynb">pipeline/experiments/protein-tokenization.ipynb</a>)
- Visualize noun phrases with a known cell type label
- Visualize an embedding based on marker expression (with and without a known cell type label)

### Naming Modality Frequencies

To answer the ```CD4+CD25+FoxP3+ T cells``` vs ```CD4+CD25+FoxP3+ Treg cells``` question, here are relevant findings:

In [8]:
# This shows the number of unique noun phrases that have zero, one, or more than one substring
# matching a known T cell type synonym
show_cell(NB_PHEN_INF, 'uniqct1')

In [9]:
# This shows the same as the above but also split by how many matches the
# phrase has on '(CD\d+|CCR\d+|CXCR\d+|IL\-\d+|TNF|TGF|IFN)' as a rough indication
# of whether or not surface markers are present (note that CD3, CD4 and CD8 markers are ignored)
show_cell(NB_PHEN_INF, 'uniqct2')

Rows are `match_ct_bin`; the count and percent columns are split by `marker_ct_bin`.

match_ct_bin | count (none) | count (one) | count (2+) | percent (none) | percent (one) | percent (2+)
-------------|--------------|-------------|------------|----------------|---------------|-------------
none | 33843 | 7364 | 2705 | 46.4 | 10.1 | 3.71
one | 18838 | 4350 | 1992 | 25.83 | 5.96 | 2.73
2+ | 3461 | 209 | 177 | 4.75 | 0.29 | 0.24


In [10]:
# Same as above but with total tag counts rather than unique tags
show_cell(NB_PHEN_INF, 'totalct1')

Rows are `match_ct_bin`; the count and percent columns are split by `marker_ct_bin`.

match_ct_bin | count (none) | count (one) | count (2+) | percent (none) | percent (one) | percent (2+)
-------------|--------------|-------------|------------|----------------|---------------|-------------
none | 183794 | 18476 | 4046 | 48.52 | 4.88 | 1.07
one | 148997 | 10671 | 2996 | 39.34 | 2.82 | 0.79
2+ | 9352 | 237 | 212 | 2.47 | 0.06 | 0.06


For a more intuitive sense of what these buckets above mean, here are examples:
    
match_ct_bin | marker_ct_bin | examples | desc   | approx. frequency (from table above)
-------------|---------------|----------|--------|-----------------------------
none | none  | human primary T cells | No type or marker indication | 50%
none | none  | CD4+ T cells | CD3, CD4, and CD8 are generally not informative markers so they are ignored | 50%
none | one   | CD45RbHi T cells | One informative marker, no type indication | 5%
none | 2+    | CX3CR1+CD45RO+CD8+ T cells | No type, several markers | 1.1%
one  | none  | Treg cells | Type only | 38%
one  | one   | CD39- Tregs | Type and single informative marker | 2.8%
one  | 2+    | CD4+CD25HighCD127- Treg cells | Type + several markers | 0.8%
2+   | none  | Th1 and Th17 cells | Tagger lumps multiple cell types together sometimes | 2.5%
2+   | one   | TNF-α producing Th1 and Tc1 cells | Sometimes markers and multiple cells are extracted as long phrases | 0.06%
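
The bucketing behind these tables can be sketched with the marker regex quoted in the notebook comment above. This is an approximation under stated assumptions: `TYPE_SYNONYMS` is a tiny hypothetical stand-in for the 627 aliases, and `bin_phrase` and `count_bin` are illustrative names, so the real notebook logic may differ in detail.

```python
import re

# Marker regex as quoted in the notebook comment; CD3/CD4/CD8 are ignored.
MARKER_RE = re.compile(r'(CD\d+|CCR\d+|CXCR\d+|IL\-\d+|TNF|TGF|IFN)')
IGNORED_MARKERS = {'CD3', 'CD4', 'CD8'}

# Hypothetical stand-in for the full list of T cell type synonyms.
TYPE_SYNONYMS = ['Th1', 'Th17', 'Treg', 'Tc1']
# Longest synonyms first so "Th17" is not counted as a "Th1" match.
TYPE_RE = re.compile('|'.join(
    re.escape(s) for s in sorted(TYPE_SYNONYMS, key=len, reverse=True)))


def count_bin(n: int) -> str:
    """Collapse a count into the none / one / 2+ buckets used in the tables."""
    return 'none' if n == 0 else 'one' if n == 1 else '2+'


def bin_phrase(phrase: str):
    """Return (match_ct_bin, marker_ct_bin) for a noun phrase."""
    match_ct = len(TYPE_RE.findall(phrase))
    marker_ct = sum(m not in IGNORED_MARKERS for m in MARKER_RE.findall(phrase))
    return count_bin(match_ct), count_bin(marker_ct)
```

For instance, `bin_phrase('CD4+CD25HighCD127- Treg cells')` lands in the ("one", "2+") bucket because CD4 is ignored while CD25 and CD127 both count.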

### tl;dr

Together, this implies that of ~70K T cell phrases, about 60% do NOT have a nice, neat label to match on (the "none" row of the unique-phrase table above), and that this percentage stays roughly the same when you further restrict to phrases with at least one informative (i.e. not CD[348]) surface marker.

### Signed Surface Marker Frequencies

The protein tokenizer gives a sign for each marker and normalizes the markers to common names in PRO, CKR or the TF list:

In [11]:
show_cell(NB_PHEN_INF, 'tknex')

Applying this to all mentions with a discernable cell type ("CD4+CD25+FoxP3+ **Treg** cells") gives:

In [12]:
show_cell(NB_PHEN_INF, 'gtplot')

This is a bit noisy but otherwise seems like a good place from which to assume that the association between cell types and relevant markers is strong in this approach (i.e. clustering of the mentions should be meaningful).

### Clustering

The last step of this analysis was to try to visualize how well relating mentions based on surface markers could work.  Here, each marker is associated with a sign (-1, 0, or 1) based on how it appeared in text and then mapped to (-1, 0, 1, 2) in an indicator matrix, with -1 being negative, 0 representing marker absence, 1 being neutral, and 2 being positive.  An embedding of this matrix overlaid with cell type looks like this ([interactive version](https://plot.ly/~eczech/2/mait-nkt-none-tcm-tem-temra-tmem-tn-tfh-th-th0-th1-th17-th2-th22-th9-treg-treg1-/)):

In [13]:
show_cell(NB_PHEN_INF, 'umaptreg')

The overlap between the blue and red dots suggests that the red dots could accurately be associated with the cell type label in the blue dots (Treg in this case).  The bigger clusters represent alternative contexts/conventions like:

- CD4+ CD25+ + (one or more other markers)
- CD4+ CD25+ FoxP3+ + (one or more other markers)
- CD25+ CD127- 
- CD44- CD62L+ Treg (mice)
- IFN- IL-17- Tregs (functional profiling)

Here are some additional mentions for other cell types that cluster closely and would otherwise be difficult to match without appealing to a parse and normalization of the associated markers:

- CD4+CD45RO+CXCR5+ **Tfh** cells [[PMC5519210]()]
    - neighbors:
        - CXCR5+ expressing CD4+CD45RO+ T cells [[PMC4504540]()]
        - memory (CD45RO+) CXCR5+CD4+ T cells [[PMC6409398]()]
        - CXCR5hiBTLAhiCD4+ T cells [[PMC4972135]()]
- VP1-specific CD69+/CD103+ tissue-resident memory (**TRM**) cells [[PMC5056763]()]
    - neighbors:
        - human CD69+CD103+ CD8 T cells [[PMC5461007]()]
        - SIINFEKL-specific tetramer-binding resident memory-like CD103+CD69+CXCR3+ CD8 T cells [[PMC5173246]()]
        - intrahepatic CD45RA−CD69+CD103+ CD8 T cells [[PMC5461007]()]
- Tscm CD4+CD45RA+CD45RO−CD62L+CCR7+CD127+CD27+CD28+CD95+CD122+ T (**Tscm**) cell [[PMC4902324]()]
    - neighbors:
        - CD4+CD45RA+CD45RO–CD62L+CCR7+CD127+CD27+CD28+CD95+CD122+T-Cell Subsets [[PMC4902324]()]
        - CD4+CD45RA+CD45RO−CD62L+CCR7+CD127+CD27+CD28+CD95+CD122+ T cells [[PMC4902324]()]
- **Th2** (IL-4+ IFN-γ−) cells [[PMC2212016]()]
    - neighbors:
        - IFN-γ–IL-4+CD4+ T cells [[PMC3767795]()]
        - IL-4+IFN-γ− NKT2 cells [[PMC2193332]()]
        - IL-4/IFN-γ-producing T cells [[PMC4718521]()]
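
The signed indicator matrix described in the Clustering section can be sketched as follows. This is a minimal illustration assuming each mention has already been tokenized into a marker-to-sign mapping; `encode_mentions` and the input layout are hypothetical, and the embedding step itself is not shown.

```python
from typing import Dict, List, Tuple

# Text sign -> matrix value; 0 in the matrix is reserved for "marker absent".
SIGN_TO_VALUE = {-1: -1, 0: 1, 1: 2}


def encode_mentions(mentions: List[Dict[str, int]]) -> Tuple[List[str], List[List[int]]]:
    """Build a (mentions x markers) indicator matrix from signed markers."""
    vocab = sorted({marker for mention in mentions for marker in mention})
    column = {marker: i for i, marker in enumerate(vocab)}
    matrix = []
    for mention in mentions:
        row = [0] * len(vocab)  # default: marker not mentioned
        for marker, sign in mention.items():
            row[column[marker]] = SIGN_TO_VALUE[sign]
        matrix.append(row)
    return vocab, matrix
```

Shifting neutral and positive signs up by one is what frees 0 to mean absence, so "CD25 mentioned without a sign" and "CD25 not mentioned" stay distinguishable.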

<h1><a id='candidate-annotation'>Candidate Annotation</a></h1>

3 separate rounds of annotation were run, each using a different tool:

1. The dev corpus was annotated using BRAT, primarily
2. Snorkel also provides a Jupyter widget that was used to annotate several hundred candidates
3. [Doccano](https://github.com/chakki-works/doccano) was used for the majority of the non-dev annotation (i.e. validation and test datasets)

The [Annotation Guideline](../docs/annotations/guideline.md) contains the rules followed in all annotation efforts.

<h1><a id='labeling-functions'>Labeling Functions</a></h1>

The process for building labeling functions so far has been:

1. Take notes on text patterns that came up while doing the original dev set annotations
2. Try to account for overlap in those patterns and coalesce those that are very similar (though I didn't worry about this too much since correlations between them should hypothetically not be a problem)
3. Apply these patterns to the dev set in <a href="../pipeline/08-candidate-lfs.ipynb">pipeline/08-candidate-lfs.ipynb</a>
    - I started with a single task (inducing cytokines) and initially translated only a fraction of my notes on patterns into labeling functions (maybe 15 or so rules).  Each of these had very low coverage (at best 4 or 5 candidates out of ~400).
    - Next, I slowly added more rules (up to about 50) and began to create something like a regex templating language (out of necessity as the boilerplate was getting annoying to manage).  Example:
        - ```r'{A}{wc_md}{r_push_v}{wc_md}{B}{wc_md}{r_diff_n}'```
            - "A" and "B" are the candidate positions (A = cytokine, B = cell type)
            - "**wc_md**" is a medium size wildcard placeholder (literally ```[^;]{0,50}```)
            - "**r_push_v**" is a set of words like "driver", "promoter", "regulator" with different parts of speech as well (drives, driving, driven)
            - "**r_diff_n**" is a similar set of words with a sense close to "differentiation" (polarization, induction, generation, etc.)
            - This example corresponds to phrases like this found in annotation:
                - *predominance of [cytokine] drives [cell type] differentiation*
                - *[cytokine] regulates [cell type] differentiation*
        - This part was time consuming primarily due to:
            - Coming up with synonymous words and making sure they are applied to other patterns
            - Periodically needing to consolidate patterns and synonyms 
            - Needing to build the labeling functions dynamically based on configurations (i.e. literally writing a separate function for each pattern would have been hard to change)
4. After reaching reasonable coverage and accuracy (see the ```LF_ind*_txtptn_[pos|neg]_all``` function stats below), I also added LFs for:
    - Distant supervision from iX
        - I was not using every single link in iX and instead filtering to those that appear in >= N distinct publications
        - Currently, there are separate LFs for different cutoffs (e.g. N = 4, 8, 10, 20, etc.)
    - Heuristics
        - Negative heuristics:
            - Is there an entity between the entities in question?
            - Are the entities more than X characters apart?
            - Using LFs from opposing tasks as negative examples (secreted vs inducing cytokines are almost always exclusive)
            - Identifying sentences that propose a hypothesis or a relationship that **might** be true; examples:
                - "To determine if IL-4 induces Th2 differentiation ..." 
                - "In order to assess whether IL-4 induces Th2 differentiation ..."
            - Sentence complexity (in particular, this is good at identifying figure legends): 
                - Number of characters
                - Number of entities
                - Number of punctuation characters
                - Number of parenthetical enclosures
                - Number of newlines
         
        - Positive heuristics:
            - Searching for keywords in a window around the entities (e.g. "express", "differentiate", "induce")
                - There are several LFs for different window sizes
            - Ensuring that the entities are downstream of a verb in the dependency parse tree that is often associated with the relation class.  For example, verbs like "induces", "generates", and "polarizes" are good indicators of an inducing cytokine relation when both the cell type and cytokine in question are semantically connected.
    - Composite Functions
        - Several attempts were made to combine LFs good at identifying negative examples (e.g. sentence complexity measures) with those fairly good at identifying positive examples (e.g. the text pattern regexes extracted during annotation)
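
The regex templating idea from step 3 can be sketched with plain `str.format`. This is a hedged reconstruction, not the code in the labeling function notebook: the word lists are abbreviated guesses, `compile_pattern` and `make_lf` are hypothetical names, and the 1 / 0 label convention (vote / abstain) is assumed.

```python
import re

# Named building blocks; `wc_md` is the medium wildcard quoted in the text,
# the word lists are illustrative and much shorter than real ones would be.
COMPONENTS = {
    'wc_md': r'[^;]{0,50}',
    'r_push_v': r'(?:driv(?:es?|ing|en|ers?)|promot(?:es?|ing|ed|ers?)'
                r'|regulat(?:es?|ing|ed|ors?))',
    'r_diff_n': r'(?:differentiation|polarization|induction|generation)',
}

TEMPLATE = r'{A}{wc_md}{r_push_v}{wc_md}{B}{wc_md}{r_diff_n}'


def compile_pattern(template: str, a: str, b: str):
    """Fill a template with the two entity strings and the shared components."""
    return re.compile(
        template.format(A=re.escape(a), B=re.escape(b), **COMPONENTS),
        re.IGNORECASE)


def make_lf(template: str, label: int):
    """Build a labeling function from configuration instead of writing it by hand."""
    def lf(sentence: str, cytokine: str, cell_type: str) -> int:
        pattern = compile_pattern(template, cytokine, cell_type)
        return label if pattern.search(sentence) else 0  # 0 = abstain (assumed)
    return lf
```

Generating functions from a list of templates this way means a new synonym added to `COMPONENTS` automatically applies to every pattern that uses that component.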
        
The statistics below show the empirical accuracy + coverage on the dev set (split 1):

In [14]:
show_cell(NB_TRAIN_MARGINALS, 'lfstats')

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Empirical F1
LF_indck_comp_imexpresso_nonneg,0,0.0165854,0.0165854,0.0165854,1,1,0,15,0.941176,0.666667
LF_indck_comp_neg_sec,1,0.131707,0.131707,0.130732,0,0,7,128,0.948148,
LF_indck_comp_xor,2,0.257561,0.257561,0.256585,64,68,5,127,0.723485,0.636816
LF_indck_dsup_imexpresso_mp04,3,0.217561,0.217561,0.210732,38,29,6,150,0.843049,0.684685
LF_indck_dsup_imexpresso_mp08,4,0.134634,0.134634,0.12878,16,7,3,112,0.927536,0.761905
LF_indck_dsup_imexpresso_mp12,5,0.12,0.12,0.114146,16,7,3,97,0.918699,0.761905
LF_indck_dsup_imexpresso_mp20,6,0.0956098,0.0956098,0.0858537,16,7,9,66,0.836735,0.666667
LF_indck_heur_closer_ck_to_ct,7,0.347317,0.347317,0.254634,0,0,33,323,0.907303,
LF_indck_heur_closer_ct_to_ck,8,0.253659,0.253659,0.156098,0,0,10,250,0.961538,
LF_indck_heur_closer_ref,9,0.427317,0.427317,0.323902,0,0,41,397,0.906393,

Unnamed: 0,j,Coverage,Overlaps,Conflicts
LF_indck_comp_imexpresso_nonneg,0,0.0117,0.0117,0.0116
LF_indck_comp_neg_sec,1,0.1442,0.1442,0.1438
LF_indck_comp_xor,2,0.167,0.167,0.1662
LF_indck_dsup_imexpresso_mp04,3,0.2039,0.2039,0.1935
LF_indck_dsup_imexpresso_mp08,4,0.1358,0.1358,0.1271
LF_indck_dsup_imexpresso_mp12,5,0.1088,0.1088,0.1008
LF_indck_dsup_imexpresso_mp20,6,0.0905,0.0905,0.0862
LF_indck_heur_closer_ck_to_ct,7,0.2585,0.2585,0.2041
LF_indck_heur_closer_ct_to_ck,8,0.1912,0.1912,0.1484
LF_indck_heur_closer_ref,9,0.3445,0.3445,0.2843

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Empirical F1
LF_secck_comp_imexpresso_nonneg,0,0.0245339,0.0245339,0.0225711,9,13,0,3,0.48,0.580645
LF_secck_comp_neg_ind,1,0.133464,0.133464,0.133464,0,0,0,136,1.0,
LF_secck_comp_xor,2,0.247301,0.247301,0.243376,72,47,0,133,0.813492,0.753927
LF_secck_dsup_imexpresso_mp04,3,0.244357,0.244357,0.244357,54,99,5,91,0.582329,0.509434
LF_secck_dsup_imexpresso_mp08,4,0.154073,0.154073,0.15211,35,98,0,24,0.375796,0.416667
LF_secck_dsup_imexpresso_mp12,5,0.13739,0.13739,0.135427,24,92,0,24,0.342857,0.342857
LF_secck_dsup_imexpresso_mp20,6,0.101079,0.101079,0.0991168,22,57,0,24,0.446602,0.435644
LF_secck_heur_closer_ck_to_ct,7,0.335623,0.335623,0.266928,0,0,13,329,0.961988,
LF_secck_heur_closer_ct_to_ck,8,0.249264,0.249264,0.181551,0,0,4,250,0.984252,
LF_secck_heur_closer_ref,9,0.412169,0.412169,0.334642,0,0,17,403,0.959524,

Unnamed: 0,j,Coverage,Overlaps,Conflicts
LF_secck_comp_imexpresso_nonneg,0,0.02,0.02,0.0186
LF_secck_comp_neg_ind,1,0.025,0.025,0.025
LF_secck_comp_xor,2,0.1649,0.1649,0.1625
LF_secck_dsup_imexpresso_mp04,3,0.2184,0.2184,0.2133
LF_secck_dsup_imexpresso_mp08,4,0.128,0.128,0.1257
LF_secck_dsup_imexpresso_mp12,5,0.1079,0.1079,0.1059
LF_secck_dsup_imexpresso_mp20,6,0.0892,0.0892,0.0878
LF_secck_heur_closer_ck_to_ct,7,0.257,0.257,0.2203
LF_secck_heur_closer_ct_to_ck,8,0.2014,0.2014,0.1692
LF_secck_heur_closer_ref,9,0.3477,0.3477,0.3055

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Empirical F1
LF_indtf_comp_xor,0,0.800638,0.800638,0.787879,15,9,47,431,0.888446,0.348837
LF_indtf_heur_closer_ct_to_tf,1,0.307815,0.307815,0.304625,0,0,4,189,0.979275,
LF_indtf_heur_closer_ref,2,0.385965,0.385965,0.37799,0,0,13,229,0.946281,
LF_indtf_heur_closer_tf_to_ct,3,0.207337,0.207337,0.202552,0,0,9,121,0.930769,
LF_indtf_heur_complex_cand_01,4,0.936204,0.936204,0.92504,0,0,89,498,0.848382,
LF_indtf_heur_complex_cand_02,5,0.708134,0.708134,0.69697,0,0,63,381,0.858108,
LF_indtf_heur_distref,6,0.208931,0.208931,0.197767,0,0,3,128,0.977099,
LF_indtf_heur_distref_10,7,1.0,1.0,0.985646,84,215,23,305,0.620415,0.413793
LF_indtf_heur_distref_15,8,1.0,1.0,0.985646,95,293,12,227,0.513557,0.383838
LF_indtf_heur_distref_20,9,1.0,1.0,0.985646,101,341,6,179,0.446571,0.367942

Unnamed: 0,j,Coverage,Overlaps,Conflicts
LF_indtf_comp_xor,0,0.773,0.773,0.7472
LF_indtf_heur_closer_ct_to_tf,1,0.1633,0.1633,0.1525
LF_indtf_heur_closer_ref,2,0.2569,0.2569,0.2405
LF_indtf_heur_closer_tf_to_ct,3,0.1452,0.1452,0.1343
LF_indtf_heur_complex_cand_01,4,0.8832,0.8832,0.8593
LF_indtf_heur_complex_cand_02,5,0.5269,0.5269,0.5038
LF_indtf_heur_distref,6,0.1749,0.1749,0.1508
LF_indtf_heur_distref_10,7,1.0,1.0,0.9739
LF_indtf_heur_distref_15,8,1.0,1.0,0.9739
LF_indtf_heur_distref_20,9,1.0,1.0,0.9739
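
The accuracy and F1 columns in the labeled tables above are consistent with the usual definitions computed over the candidates each labeling function votes on. The sketch below reproduces them from the TP/FP/FN/TN counts; the function names are hypothetical, and returning `None` when precision or recall is undefined is an assumption meant to mirror the blank F1 cells.

```python
from typing import Optional


def empirical_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of a labeling function's votes that agree with the gold label."""
    return (tp + tn) / (tp + fp + fn + tn)


def empirical_f1(tp: int, fp: int, fn: int) -> Optional[float]:
    """F1 over positive votes; None when precision or recall is undefined."""
    if tp + fp == 0 or tp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)
```

For `LF_indck_comp_xor` (TP=64, FP=68, FN=5, TN=127) these give 191/264 and 128/201, which match the 0.723485 and 0.636816 shown in the first table.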


<h1><a id='generative-model'>Generative Model</a></h1>

The following cell in {{plink(NB_TRAIN_GENMODEL)}} shows the parameters used to train the Snorkel generative model:

In [15]:
show_cell(NB_TRAIN_GENMODEL, 'gmtrain')

Following training of this model, these tables show the resulting learned and empirical accuracies for each labeling function on the validation and test splits:

In [16]:
show_cell(NB_TRAIN_GENMODEL, 'gmstats')

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_indck_comp_imexpresso_nonneg,0,0.0107914,0.0107914,0.00719424,1,0,0,2,1.0,0.269231
LF_indck_comp_neg_sec,1,0.133094,0.133094,0.129496,0,0,1,36,0.972973,0.676633
LF_indck_comp_xor,2,0.158273,0.158273,0.154676,4,3,0,37,0.931818,0.704342
LF_indck_dsup_imexpresso_mp04,3,0.230216,0.230216,0.223022,7,12,0,45,0.8125,0.663228
LF_indck_dsup_imexpresso_mp08,4,0.158273,0.158273,0.147482,3,8,0,33,0.818182,0.651713
LF_indck_dsup_imexpresso_mp12,5,0.129496,0.129496,0.118705,3,6,0,27,0.833333,0.580882
LF_indck_dsup_imexpresso_mp20,6,0.104317,0.104317,0.0971223,3,6,1,19,0.758621,0.527629
LF_indck_heur_closer_ck_to_ct,7,0.208633,0.208633,0.165468,0,0,3,55,0.948276,0.803698
LF_indck_heur_closer_ct_to_ck,8,0.172662,0.172662,0.147482,0,0,1,47,0.979167,0.748876
LF_indck_heur_closer_ref,9,0.294964,0.294964,0.248201,0,0,4,78,0.95122,0.838002

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_secck_comp_imexpresso_nonneg,0,0.0294118,0.0294118,0.0294118,3,3,0,2,0.625,0.225
LF_secck_comp_neg_ind,1,0.0257353,0.0257353,0.0257353,0,0,0,7,1.0,0.257669
LF_secck_comp_xor,2,0.202206,0.202206,0.202206,24,25,0,6,0.545455,0.687973
LF_secck_dsup_imexpresso_mp04,3,0.279412,0.279412,0.275735,16,28,1,31,0.618421,0.578977
LF_secck_dsup_imexpresso_mp08,4,0.158088,0.158088,0.158088,10,23,0,10,0.465116,0.492063
LF_secck_dsup_imexpresso_mp12,5,0.139706,0.139706,0.139706,7,22,0,9,0.421053,0.485163
LF_secck_dsup_imexpresso_mp20,6,0.106618,0.106618,0.106618,7,13,0,9,0.551724,0.486268
LF_secck_heur_closer_ck_to_ct,7,0.253676,0.253676,0.224265,0,0,10,59,0.855072,0.791267
LF_secck_heur_closer_ct_to_ck,8,0.213235,0.213235,0.191176,0,0,3,55,0.948276,0.717726
LF_secck_heur_closer_ref,9,0.360294,0.360294,0.327206,0,0,13,85,0.867347,0.830271

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_indtf_comp_xor,0,0.773333,0.773333,0.742222,1,3,24,146,0.844828,0.808724
LF_indtf_heur_closer_ct_to_tf,1,0.2,0.2,0.182222,0,0,3,42,0.933333,0.743263
LF_indtf_heur_closer_ref,2,0.28,0.28,0.253333,0,0,7,56,0.888889,0.769892
LF_indtf_heur_closer_tf_to_ct,3,0.133333,0.133333,0.12,0,0,4,26,0.866667,0.691525
LF_indtf_heur_complex_cand_01,4,0.88,0.88,0.848889,0,0,37,161,0.813131,0.722349
LF_indtf_heur_complex_cand_02,5,0.515556,0.515556,0.484444,0,0,17,99,0.853448,0.781595
LF_indtf_heur_distref,6,0.195556,0.195556,0.164444,0,0,1,43,0.977273,0.765036
LF_indtf_heur_distref_10,7,1.0,1.0,0.968889,33,71,11,110,0.635556,0.83997
LF_indtf_heur_distref_15,8,1.0,1.0,0.968889,36,108,8,73,0.484444,0.727091
LF_indtf_heur_distref_20,9,1.0,1.0,0.968889,40,122,4,59,0.44,0.650005

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_indck_comp_imexpresso_nonneg,0,0.0242588,0.0242588,0.0242588,0,1,1,7,0.777778,0.125
LF_indck_comp_neg_sec,1,0.140162,0.140162,0.140162,0,0,2,50,0.961538,0.663752
LF_indck_comp_xor,2,0.175202,0.175202,0.175202,8,5,2,50,0.892308,0.698882
LF_indck_dsup_imexpresso_mp04,3,0.247978,0.247978,0.239892,8,14,1,69,0.836957,0.669921
LF_indck_dsup_imexpresso_mp08,4,0.191375,0.191375,0.169811,2,8,1,60,0.873239,0.65264
LF_indck_dsup_imexpresso_mp12,5,0.16442,0.16442,0.145553,2,7,1,51,0.868852,0.588542
LF_indck_dsup_imexpresso_mp20,6,0.118598,0.118598,0.107817,2,7,3,32,0.772727,0.569177
LF_indck_heur_closer_ck_to_ct,7,0.234501,0.234501,0.177898,0,0,3,84,0.965517,0.809131
LF_indck_heur_closer_ct_to_ck,8,0.204852,0.204852,0.15903,0,0,1,75,0.986842,0.74871
LF_indck_heur_closer_ref,9,0.328841,0.328841,0.264151,0,0,3,119,0.97541,0.844015

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_secck_comp_imexpresso_nonneg,0,0.0137363,0.0137363,0.0137363,4,1,0,0,0.8,0.369231
LF_secck_comp_neg_ind,1,0.032967,0.032967,0.032967,0,0,0,12,1.0,0.379747
LF_secck_comp_xor,2,0.181319,0.181319,0.181319,34,20,0,12,0.69697,0.669923
LF_secck_dsup_imexpresso_mp04,3,0.241758,0.241758,0.236264,21,33,0,34,0.625,0.568761
LF_secck_dsup_imexpresso_mp08,4,0.104396,0.104396,0.104396,11,19,0,8,0.5,0.507903
LF_secck_dsup_imexpresso_mp12,5,0.0934066,0.0934066,0.0934066,8,18,0,8,0.470588,0.501553
LF_secck_dsup_imexpresso_mp20,6,0.0934066,0.0934066,0.0906593,7,19,0,8,0.441176,0.50096
LF_secck_heur_closer_ck_to_ct,7,0.252747,0.252747,0.208791,0,0,5,87,0.945652,0.786673
LF_secck_heur_closer_ct_to_ck,8,0.181319,0.181319,0.156593,0,0,3,63,0.954545,0.745797
LF_secck_heur_closer_ref,9,0.337912,0.337912,0.291209,0,0,8,115,0.934959,0.829495

Unnamed: 0,j,Coverage,Overlaps,Conflicts,TP,FP,FN,TN,Empirical Acc.,Learned Acc.
LF_indtf_comp_xor,0,0.796667,0.796667,0.77,7,3,31,198,0.857741,0.806997
LF_indtf_heur_closer_ct_to_tf,1,0.136667,0.136667,0.123333,0,0,4,37,0.902439,0.704527
LF_indtf_heur_closer_ref,2,0.236667,0.236667,0.22,0,0,8,63,0.887324,0.767345
LF_indtf_heur_closer_tf_to_ct,3,0.14,0.14,0.126667,0,0,4,38,0.904762,0.656716
LF_indtf_heur_complex_cand_01,4,0.853333,0.853333,0.826667,0,0,49,207,0.808594,0.732154
LF_indtf_heur_complex_cand_02,5,0.516667,0.516667,0.49,0,0,27,128,0.825806,0.780164
LF_indtf_heur_distref,6,0.18,0.18,0.153333,0,0,4,50,0.925926,0.753515
LF_indtf_heur_distref_10,7,1.0,1.0,0.973333,37,110,21,132,0.563333,0.84691
LF_indtf_heur_distref_15,8,1.0,1.0,0.973333,44,151,14,91,0.45,0.721139
LF_indtf_heur_distref_20,9,1.0,1.0,0.973333,50,178,8,64,0.38,0.636542
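
The `Empirical Acc.` column in the tables above is consistent with the fraction of non-abstaining votes that agree with the gold label, i.e. `(TP + TN) / (TP + FP + FN + TN)`. A minimal sketch that checks this against one row (the function name is mine, not something from the project code):

```python
def empirical_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of labeled (non-abstained) candidates where the LF vote matches gold."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("LF did not vote on any gold-labeled candidate")
    return (tp + tn) / total

# Row LF_indtf_comp_xor from the table above: TP=7, FP=3, FN=31, TN=198
print(round(empirical_accuracy(7, 3, 31, 198), 6))  # 0.857741
```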


<h1><a id='strong-supervision'>Strong Supervision</a></h1>

Rather than writing more advanced heuristics to improve labeling function accuracy, I focused on training supervised models on the dev set.  I searched over the following model configuration space:

1. Entity marker strategy
    - There are several ways to "mark" entities for RNNs applied to relation extraction:
        - TODO: add citations 
        - Enclose the entity spans in special characters
            ```IL-4 induces Th2 cells``` --> ```<< IL-4 >> induces [[ Th2 ]] cells```
        - Enclose off-target entities in different characters
            ```IL-4 does not induce Th1 cells but does induce Th2 cells``` --> ```<< IL-4 >> does not induce || Th1 || cells but does induce [[ Th2 ]] cells```
        - Replace the actual entity tokens with something generic
            ```IL-4 induces Th2 cells``` --> ```CYTOKINE induces PHENOTYPE cells```
2. Embeddings
    - Frozen pre-trained word2vec ([Pyysalo et al. 2013](http://bio.nlplab.org/))
    - Initialized pre-trained word2vec 
    - From scratch
3. Position Feature Embedding
    - Following [Zeng et al. 2014](), relative distances from each of the two entities are added as separate features in a learned embedding of some pre-configured size (generally ~10 dimensional)
4. Regularization
    - Dropout
    - Weight Decay
    - Structural (parameter count)
    
The choices above are incorporated in the script {{plink(SRC_TRAIN, use_path_name=True)}}.  The model itself is defined in {{plink(SRC_MODEL, use_path_name=True)}}.  The notebook used for training is {{plink(NB_DEV_GRID, use_path_name=True)}}.
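
As a rough illustration of the entity marker strategies listed above, here is a minimal sketch that works on token lists and `[start, end)` token spans. The function names and marker strings are assumptions taken from the examples in the list, not the actual implementation in the training script:

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) token indices

def mark_entities(tokens: Sequence[str], e0: Span, e1: Span,
                  off_target: Sequence[Span] = ()) -> List[str]:
    """Wrap e0 in << >>, e1 in [[ ]] and any off-target spans in || ||."""
    opens, closes = {}, {}
    styled = [(e0, ("<<", ">>")), (e1, ("[[", "]]"))]
    styled += [(span, ("||", "||")) for span in off_target]
    for (start, end), (left, right) in styled:
        opens[start] = left
        closes[end - 1] = right
    out = []
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(opens[i])
        out.append(tok)
        if i in closes:
            out.append(closes[i])
    return out

def anonymize_entities(tokens: Sequence[str], e0: Span, e1: Span,
                       e0_type: str = "CYTOKINE",
                       e1_type: str = "PHENOTYPE") -> List[str]:
    """Replace each entity span with a single generic type token."""
    out = list(tokens)
    # Replace right-to-left so earlier indices stay valid
    for (start, end), name in sorted([(e0, e0_type), (e1, e1_type)], reverse=True):
        out[start:end] = [name]
    return out

tokens = "IL-4 does not induce Th1 cells but does induce Th2 cells".split()
print(" ".join(mark_entities(tokens, (0, 1), (9, 10), off_target=[(4, 5)])))
print(" ".join(anonymize_entities("IL-4 induces Th2 cells".split(), (0, 1), (2, 3))))
```

The sketch assumes entity spans do not overlap, which keeps the open/close bookkeeping to two dictionaries.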

### Features

Example features generated are shown below for one combination of entity marking/swapping strategies:

In [17]:
# The "e0_dist" and "e1_dist" sequences show relative token distance to entities and 
# the "text" field shows results after marking
show_cell(NB_LBLMDL_V2, 'features')

Unnamed: 0,id,label,e0_dist,e0_text,e1_dist,e1_text,tags,text
0,12,0.0,"[-35, -34, -33, -32, -31, -30, -29, -28, -27, -26, -26, -25, -24, -24, -23, -22, -21, -20, -19, -18, -17, -16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...",IL-21,"[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 24, 25, 26, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]",Tfh,"[O, O, O, O, O, O, O, O, O, O, O, E:primary:immune_cell_type, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:primary:cytokine, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[Using, human, monocyte-derived, DCs, ,, Schmitt, et, al., compared, <, #, Tfh, #, >, inducing, capacities, of, different, TLR, agonists, and, show, that, TLR4, ,, TLR5, ,, and, TLR7/8, ,, but, not, TLR2, ,, activation, induces, <, @, IL-21, @, >..."
1,526,0.0,"[-62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49, -48, -48, -47, -46, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35, -34, -33, -33, -32, -31, -31, -30, -29, -29, -28, -27, -27, -26, -25, -25, -24, -23, -23, -22, ...",IL-6,"[-15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 18, 19, 20, 20, 21, 22, 22, 23, 24, 24, 25, 26, 26, 27, 28, 28, 29, 30, 30, 31, 32, 32, 33, 34, 34...",EMT,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:primary:immune_cell_type, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, E:secondar...","[APCs, ,, antigen-presenting, cells, (, dendritic, cells, ,, macrophages, ,, and, monocytes, ), ;, <, #, EMT, #, >, ,, epithelial, –, mesenchymal, transition, ;, HSC, ,, hepatic, stellate, cell, ;, |, @, IL-1, @, |, ,, |, @, interleukin-1, @, |, ..."
2,1312,0.0,"[-76, -75, -74, -73, -72, -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -36, -35, -34, -34, -33, -32, -31, -30, ...",transforming growth factor,"[-66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35, -34, -33, -32, -31, -30, -29, -28, -27, -26, -26, -25, -24, -24, -23, -22, -21, -20, ...",T helper,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:secondary:immune_cell_type, O, O...","[BM-MSCs, ,, bone, marrow-derived, mesenchymal, stem, cells, ;, Bregs, ,, regulatory, B, cell, ;, CD, ,, Crohn, ’s, disease, ;, EAE, ,, experimental, autoimmune, encephalomyelitis, ;, GVHD, ,, graft-versus-host, diseases, ;, hUC-MSCs, ,, human, u..."
3,2022,0.0,"[-90, -89, -88, -87, -86, -85, -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -68, -67, -66, -66, -65, -64, -64, -63, -62, -62, -61, -60, -60, -59, -58, -57, -56, -56, -55, -54, -53, -53, -52, -51, -51, -50, ...",IL-1β,"[-38, -37, -36, -35, -34, -33, -32, -31, -30, -29, -28, -27, -26, -25, -24, -23, -22, -21, -20, -19, -18, -17, -16, -16, -15, -14, -14, -13, -12, -12, -11, -10, -10, -9, -8, -8, -7, -6, -5, -4, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7,...",TH2,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:secondary:immune_cell_type, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, E:secondary:cytokine, E:secondary:cytokine, E:secondary:cytokine, O, O, O, O, O, O, E:prima...","[In, cultures, from, 20-, to, 24-week-old, Winnie, mice, ,, there, was, a, highly, significant, increase, (, P<0.01, ), in, the, production, of, |, #, TH1, #, |, (, |, @, IL-12p70, @, |, and, |, @, tumor, necrosis, factor-α, @, |, ), ,, <, #, TH2..."
4,2188,0.0,"[-29, -28, -27, -26, -25, -24, -23, -22, -21, -20, -19, -18, -17, -17, -16, -15, -15, -14, -13, -12, -11, -10, -9, -8, -7, -7, -6, -5, -5, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 14]",IL-6,"[-23, -22, -21, -20, -19, -18, -17, -16, -15, -14, -13, -12, -11, -11, -10, -9, -9, -8, -7, -6, -5, -4, -3, -2, -1, -1, 0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 20]",Th2,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, E:secondary:immune_cell_type, O, O, O, O, O, O, O, O, O, O, O, E:primary:immune_cell_type, O, O, O, O, O, O, O, E:primary:cytokine, O, O, O, O, O, E:secondary:cytokine, O, O, O, O, O, O, E:secondary:cyto...","[HPV-related, lesions, were, found, to, be, characterized, by, weak, or, absent, IFNγ-associated, |, #, Th1, #, |, cell, responses, and, by, an, upregulation, of, <, #, Th2, #, >, cytokines, (, including, <, @, IL-6, @, >, ,, |, @, IL-8, @, |, ,,..."
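
For reference, relative distance sequences like `e0_dist` and `e1_dist` can be computed roughly as follows. This is a simplified sketch: the repeated values in the output above (e.g. `-1, -1, 0, 1, 1`) suggest that marker tokens are handled specially, which is not reproduced here. The clipping helper and its `max_dist` default are assumptions about how such distances are typically turned into embedding indices, not something read from the model code:

```python
from typing import List, Tuple

def relative_distances(n_tokens: int, span: Tuple[int, int]) -> List[int]:
    """Signed token distance to an entity span [start, end); 0 inside the span."""
    start, end = span
    dists = []
    for i in range(n_tokens):
        if i < start:
            dists.append(i - start)    # tokens left of the entity are negative
        elif i >= end:
            dists.append(i - end + 1)  # tokens right of the entity are positive
        else:
            dists.append(0)
    return dists

def to_embedding_index(dist: int, max_dist: int = 50) -> int:
    """Clip a distance to [-max_dist, max_dist] and shift it to a non-negative index."""
    clipped = max(-max_dist, min(max_dist, dist))
    return clipped + max_dist

tokens = "IL-4 induces Th2 cells".split()
e0_dist = relative_distances(len(tokens), (0, 1))  # [0, 1, 2, 3]
e1_dist = relative_distances(len(tokens), (2, 3))  # [-2, -1, 0, 1]
```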


### Results

For the following results, models were trained on the ```dev``` split, validated on the ```val``` split (which also drives early stopping), and tested on the ```test``` split.
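
A minimal sketch of that selection rule, assuming a per-epoch history of validation and test F1 scores (the record layout and the numbers are made up for illustration):

```python
from typing import Dict, List

def select_best_epoch(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick the epoch with the highest validation F1 (first one wins on ties)."""
    return max(history, key=lambda rec: rec["val_f1"])

# Hypothetical training history, not real results
history = [
    {"epoch": 1, "val_f1": 0.41, "test_f1": 0.38},
    {"epoch": 2, "val_f1": 0.55, "test_f1": 0.49},
    {"epoch": 3, "val_f1": 0.52, "test_f1": 0.53},
]
best = select_best_epoch(history)
print(best["epoch"], best["test_f1"])  # 2 0.49
```

The test score is read off at the epoch chosen by validation, so a higher test score at some other epoch (epoch 3 here) is deliberately ignored.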

The grid used for hyperparameters is currently defined as follows:

In [18]:
show_cell(NB_DEV_GRID, 'paramspace')
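
For readers without the rendered cell output, a grid over the configuration dimensions from the list above could be enumerated as in the sketch below. All parameter names and values here are hypothetical placeholders, not the actual grid used in the training notebook:

```python
from itertools import product

# Hypothetical values for each dimension; the real grid lives in the training notebook
param_space = {
    "marker_strategy": ["enclose", "enclose_off_target", "replace"],
    "embeddings": ["w2v_frozen", "w2v_trainable", "scratch"],
    "position_dim": [0, 10],
    "dropout": [0.0, 0.5],
    "weight_decay": [0.0, 1e-3],
}

keys = list(param_space)
configs = [dict(zip(keys, values)) for values in product(*param_space.values())]
print(len(configs))  # 72
print(configs[0])
```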

Label distributions across the tasks (**NOTE**: The balance is about 85/15 and the test/val splits are very small):

In [19]:
show_cell(NB_DEV_GRID, 'lblbalance')

task,split,count (label 0),count (label 1),percent (label 0),percent (label 1)
inducing_cytokine,predict,23,23,0.5,0.5
inducing_cytokine,test,29,29,0.5,0.5
inducing_cytokine,train,145,145,0.5,0.5
inducing_cytokine,val,23,23,0.5,0.5
inducing_transcription_factor,predict,44,44,0.5,0.5
inducing_transcription_factor,test,58,58,0.5,0.5
inducing_transcription_factor,train,107,107,0.5,0.5
inducing_transcription_factor,val,44,44,0.5,0.5
secreted_cytokine,predict,49,49,0.5,0.5
secreted_cytokine,test,64,64,0.5,0.5


Resulting score distributions across the model space:

In [20]:
show_cell(NB_DEV_GRID, 'scoredist')

### tl;dr

The small, manually labeled datasets were used to create models with no weak supervision.  This is the baseline to which I will compare the weakly supervised "Discriminative Model".  These models may also serve as labeling functions, but whether or not they help remains to be seen.  See the section below on the "Discriminative Model" for higher-level tables of performance statistics.

#### BERT

I tried pulling down and running the PyTorch BERT script (using SciBERT) for sequence classification on data from one of the tasks, but it wasn't learning anything (predictions were always negative).  Some further things to try here:

- Figure out how to tune that script
- Compare vanilla BERT to SciBERT for these tasks?
- Compare a classification head trained on the CLS vector against token embeddings extracted as features and fed into an LSTM (my bet is on the latter working much better for RE)

<h1><a id='weak-supervision'>Weak Supervision</a></h1>

The weak supervision model (aka "Discriminative Model") is the final classifier trained on the large set of weak labels from the Snorkel generative model.

The notebook that trains this model (well, currently a grid of models) is {{plink(NB_END_TRAIN, use_path_name=True)}}.

This model is trained on the ```train``` split of the data, evaluated on the ```test``` split, and stopped based on performance on the ```val``` split.  This is the same procedure used in the section above on building supervised models for the ```dev``` split of the data, with the only difference being that the ```train``` split (i.e. candidates with weak labels) is used for training rather than the ```dev``` split (i.e. candidates with gold labels).

A hyperparameter search was used with a grid identical to the one in the section above.
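
One common way to train a classifier on probabilistic weak labels is a noise-aware (soft-label) cross-entropy, where the target is the generative model's marginal rather than a hard 0/1 label. The sketch below illustrates that idea only. Whether the training script uses exactly this loss is an assumption on my part, and the function name is hypothetical:

```python
import math

def noise_aware_loss(marginal: float, predicted: float, eps: float = 1e-7) -> float:
    """Cross-entropy between a soft target (marginal) and a predicted probability."""
    q = min(max(predicted, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(marginal * math.log(q) + (1.0 - marginal) * math.log(1.0 - q))

# A confident weak label (0.9) penalizes a contradicting prediction much more
# than an agreeing one
print(noise_aware_loss(0.9, 0.9) < noise_aware_loss(0.9, 0.1))  # True
```

With hard labels (marginals of exactly 0 or 1) this reduces to the usual binary cross-entropy, so the same training loop could in principle serve both the `dev` and `train` cases.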

### Results

In order to compare the results from weak supervision in context, the score summaries below show F1 and precision scores for the following:

1. Strongly supervised, sentence-level models that take ONLY **labeling function outputs** as features and predict labels only for splits with gold annotations (these do not rely on sequence inputs as shown in the features of the above "Strong Supervision" section -- all features are simply the -1, 0, or 1 outputs of LFs)
2. Strongly supervised RNN models trained on the same data as the sentence-level models
3. The weakly supervised RNN models
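
To make the first model family concrete, here is a toy sketch of a classifier whose only features are LF outputs in {-1, 0, 1}. It uses a hand-rolled logistic regression to stay dependency-free. The data, the number of LFs, and the choice of logistic regression are all illustrative assumptions, not the models actually used in the analysis notebook:

```python
import math
from typing import List, Sequence, Tuple

def train_logreg(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 1000) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Toy data: each row is one candidate, each column one LF vote
# (1 = positive, -1 = negative, 0 = abstain); labels are gold annotations
X = [[1, 1, 0], [1, 0, -1], [0, 1, 1], [-1, -1, 0], [-1, 0, -1], [0, -1, -1]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
print([predict(w, b, x) for x in X])
```

In this toy data the first LF is the most reliable one, which shows up as the largest learned weight.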

All of the following represent the scores at the maximum validation F1 score found when using the ```DEV``` split as training data (or the ```TRAIN``` split in the weakly-supervised case), the ```VAL``` split for validation, and the ```TEST``` split for test data: 

In [21]:
show_cell(NB_ANALYSIS_SCORES, 'f1')

And this plot shows several metrics for the test dataset only, as this should be the most authoritative indication of model performance:

In [22]:
show_cell(NB_ANALYSIS_SCORES, 'test')

**Conclusion**: The sentence-level models are very reliably out-performing the RNN models, weakly-supervised or otherwise, despite the fact that both are utilizing the same information provided by the labeling functions.

<hr>

### Rendering

To render this notebook for GitHub (necessary to substitute variables embedded in markdown):

```bash
cd $REPO_DIR/results
jupyter nbconvert --to notebook --execute summary.ipynb --output summary.render.ipynb
```