# Introduction and Organization

This notebook organizes the notebooks used in the preliminary analysis of Mark2Cure's relationship extraction module.

The 'Data' folder contains data used in the analysis, including any exports that have been manually inspected or altered (such as the quality-checked samples). The 'Export' folder is reserved for script outputs, which may be used as direct (unaltered) inputs for other scripts in the analysis.

### Dependencies

__The notebooks in this repo have the following dependencies__:
- pandas 0.19.2
- bokeh 0.12.4
- matplotlib 2.0.0
- mygene 3.0.0
- networkx 1.11
- nltk 3.2.1
- seaborn 0.7.1

and were generated in Python 3.4.
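
To check a local environment against the versions listed above, a minimal sketch along the following lines may help. It is not part of the original notebooks: the helper name `compare_versions` is hypothetical, and it assumes Python 3.8 or later (for `importlib.metadata`), which is newer than the Python 3.4 used to generate the notebooks.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the Dependencies section of this notebook.
EXPECTED = {
    "pandas": "0.19.2",
    "bokeh": "0.12.4",
    "matplotlib": "2.0.0",
    "mygene": "3.0.0",
    "networkx": "1.11",
    "nltk": "3.2.1",
    "seaborn": "0.7.1",
}


def compare_versions(expected):
    """Map each package name to (expected, installed).

    `installed` is None when the package is not installed.
    This helper is hypothetical and only illustrates the check.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in compare_versions(EXPECTED).items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: expected {wanted}, installed {found} ({status})")
```

A differing version does not necessarily mean a notebook will fail; it only flags where behavior may diverge from the original run.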

## Table of Contents

### Figure 1-  

__Purpose__: To determine the user contribution distribution, accuracy, and aggregation threshold, and to explore NER issues as a source of error.

__Analysis Workflow__: 
<img src="https://docs.google.com/drawings/d/e/2PACX-1vT8UQpjsttyORMnlP_f_1RABnopHINSMzFwWZ7zm8LSX2MRs3PtzvP84bE633olDGnCLEseqEzm12Ec/pub?w=960&amp;h=720">

### [Sampling](0.%20Sample%20for%20QC--correctness.ipynb)
__Modules needed__: pandas, random, m2c_rel_basic, relationship_dictionaries

__Data files used__: ["2017.11.22 RE anns export.txt"](data/2017.11.22 RE anns export.txt)

__Data exported__: ["annresults.txt"](data/annresults.txt), ["all_completed_anns.txt"](data/all_completed_anns.txt), ["all_completed_anns_pmids.txt"](exports/all_completed_anns_pmids.txt), sample_[0](exports/sample_0_for_expert_ann.txt),[1](exports/sample_1_for_expert_ann.txt),[2](exports/sample_2_for_expert_ann.txt),[3](exports/sample_3_for_expert_ann.txt)_for_expert_ann.txt

### [Figures 1a-e](Figure 1a-e. User Contribution Distribution.ipynb) and [Figure 1c](Figure 1c. Accuracy and Threshold Evaluation.ipynb)
__Modules needed__: pandas, matplotlib, random, m2c_rel_basic, relationship_dictionaries

__Data files used__: ["all_completed_anns.txt"](data/all_completed_anns.txt), sample_[0](data/QCd samples/sample_0_for_expert_ann.txt),[1](data/QCd samples/sample_1_for_expert_ann.txt),[2](data/QCd samples/sample_2_for_expert_ann.txt),[3](data/QCd samples/sample_3_for_expert_ann.txt)_for_expert_ann.txt

__Data exported__: ["task_distribution.txt"](exports/task_distribution.txt), ["concept_nontossers.txt"](data/concept_nontossers.txt), ["fig1c_data.txt"](exports/fig1c_data.txt)


### [Supplemental Figure 1](Supp Fig 1. Investigating Concept Annotations.ipynb) & [Mapping](Supp Fig 1. Mapping Dropped Concepts.ipynb)
__Modules needed__: pandas, numpy, random, matplotlib, nltk, m2c_rel_basic, relationship_dictionaries

__Data files used__: ["all_completed_anns.txt"](data/all_completed_anns.txt), ["2017.11.22 RE pubmed concept export.txt"](data/2017.11.22 RE pubmed concept export.txt), ["2017.11.22 pubtator export_parsed_pubtator_anns_from_db_all_anns.txt"](data/2017.11.22 pubtator export_parsed_pubtator_anns_from_db_all_anns.txt)

__Data exported__: ["2017.11.22 pubtator export_parsed_pubtator_anns_from_db_all_anns.txt"](data/2017.11.22 pubtator export_parsed_pubtator_anns_from_db_all_anns.txt),
<br>["concepts_anns_from_db.txt"](data/concepts_anns_from_db.txt),
<br>["dropped_by_pubtator.txt"](data/dropped_by_pubtator.txt),
<br>["REanns_on_concepts_with_identifiers_only.txt"](data/REanns_on_concepts_with_identifiers_only.txt),<br>["identified_broken_anns_vs_dropped_pubtators.txt"](exports/identified_broken_anns_vs_dropped_pubtators.txt),
<br>["tokenized_pmids.txt"](exports/tokenized_pmids.txt)

### Figure 2- 
__Purpose__: Perform a qualitative analysis to find relationships not currently available in the system, to understand other sources of error such as design/training issues, and to identify areas for improvement.

__Analysis Workflow__:
<img src="https://docs.google.com/drawings/d/e/2PACX-1vSeyk5LCL06yg76WvpEEaxxZh1WJhBBTGrGI74xfYNYOmGKZ1Ly-51FPbJzgZW3-Nnd93uNGyhZqPNu/pub?w=967&amp;h=563">

### [Sampling](Sampling for Qualitative Inspection of Annotations.ipynb)

__Modules needed__: pandas, random, matplotlib, m2c_rel_basic, relationship_dictionaries

__Data files used__: ["all_completed_anns.txt"](data/all_completed_anns.txt)

__Data exported__:
["relation_unclear_min_6.txt"](exports/relation_unclear_min_6.txt),
["no_rel_min_6.txt"](exports/no_rel_min_6.txt),
["relates_min_6.txt"](exports/relates_min_6.txt),
["relates_under_6.txt"](exports/relates_under_6.txt),
["unrelated_under_6.txt"](exports/unrelated_under_6.txt)

### [Figure 2](Figure 2. Qualitative Inspection of Relations Annotations.ipynb)

__Modules needed__: pandas, matplotlib, bokeh

__Data files used__: ["relates_QCd.txt"](data/QCd samples/relates_QCd.txt) 

__Data exported__: ["has_relation_broad_categories.html"](exports/has_relation_broad_categories.html)

### [Figure 3 and Supplemental Figure 2](Sampling for Qualitative Inspection of Annotations.ipynb)
__Purpose__: Verify that annotations are correctly marked as unrelated, and inspect 'unrelated' annotations for sources of ambiguity.

__Analysis Workflow__: See Figure 2 analysis workflow

__Modules needed__: pandas, matplotlib, bokeh

__Data files used__: ["no_relations_QCd.txt"](data/QCd samples/no_relations_QCd.txt)

__Data exported__: ["no_relation_broad_categories.html"](exports/no_relation_broad_categories.html)

### [Figure 4](Figure 4 and Supp Fig 3. Concept Distance and Relationship Annotation.ipynb)

__Purpose__: Many relationship extraction algorithms tokenize at the sentence level. The code (1) verifies that relationships of interest exist beyond the sentence level, and (2) checks whether concept distance affects citizen science accuracy.

__Analysis Workflow__:
<img src="https://docs.google.com/drawings/d/e/2PACX-1vQBU6_RRwSdmu6ib3OthwgcR6sVlVJCjN1DFhGsk0sMq4fcJ08lIipG5zvot1OToWtg81ZxP8ZFy_tR/pub?w=960&amp;h=720">

__Modules needed__: pandas, random, matplotlib, m2c_rel_basic, relationship_dictionaries

__Data files used__: ["all_completed_anns.txt"](data/all_completed_anns.txt),
<br>["concept_anns_from_updated_pub_files.txt"](data/concept_anns_from_updated_pub_files.txt),
["tokenized_pmids.txt"](exports/tokenized_pmids.txt), 
<br>["dropped_anns_offsets.txt"](data/dropped_anns_offsets.txt),
sample_[0](exports/sample_0_for_expert_ann.txt),[1](exports/sample_1_for_expert_ann.txt),[2](exports/sample_2_for_expert_ann.txt),[3](exports/sample_3_for_expert_ann.txt)_for_expert_ann.txt

__Data exported__: ["accuracy_for_concept_distance.txt"](exports/accuracy_for_concept_distance.txt),
["accuracy_stats_for_concept_distance.txt"](exports/accuracy_stats_for_concept_distance.txt)
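
The idea of "concept distance" in the purpose statement above can be illustrated with a small sketch. It assumes that distance is measured as the number of sentence boundaries between two concept mentions, given the character offsets at which each sentence starts. The function names and the offset-based representation are hypothetical and are not taken from the notebook, which may define distance differently.

```python
from bisect import bisect_right


def sentence_index(sentence_starts, offset):
    """Return the index of the sentence that contains a character offset.

    `sentence_starts` is a sorted list of the character offsets at which
    each sentence begins (the first entry is normally 0).
    """
    # bisect_right counts the starts that are <= offset; subtracting one
    # gives the index of the sentence the offset falls in.
    return bisect_right(sentence_starts, offset) - 1


def sentence_distance(sentence_starts, offset_a, offset_b):
    """Number of sentence boundaries between two concept mentions."""
    index_a = sentence_index(sentence_starts, offset_a)
    index_b = sentence_index(sentence_starts, offset_b)
    return abs(index_a - index_b)


# Hypothetical abstract with sentences starting at characters 0, 20 and 45.
starts = [0, 20, 45]
print(sentence_distance(starts, 5, 12))  # both mentions in the first sentence
print(sentence_distance(starts, 5, 50))  # first and third sentences
```

A distance of 0 corresponds to a pair that a sentence-level extractor could see; any larger value marks a pair that spans sentences.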

### Figure 5- 

__Purpose__: To compare the gene, disease, drug relationships coming out of Mark2Cure with those mined using an algorithm--in particular SemmedDB gene, disease, drug relationships.

### [Figure 5 mapping and mapping verification](0. SemmedDB M2C mapping.ipynb)

__Analysis Workflow__:
<img src="https://docs.google.com/drawings/d/e/2PACX-1vTZTFN50CY1ovneDC2EWjSLczCG7iF7d4XZZphg9y_YUpRTkeqZZLwy-Jt5xu5KSGr12A2w_u-b7c8u/pub?w=965&amp;h=1261">

__Modules needed__: pandas, matplotlib, mygene, difflib, itertools (optionally SPARQLWrapper, if using Wikidata for mapping)

__Data files used__: ["annresults.txt"](data/annresults.txt), ["all_completed_anns_semmed_triples.tsv"](data/all_completed_anns_semmed_triples.tsv),["2017.11.22 RE pubmed concept export.txt"](data/2017.11.22 RE pubmed concept export.txt), ["MeSH-Descriptor_to_UMLS-CUI.tsv"](data/MeSH-Descriptor_to_UMLS-CUI.tsv)

__Data exported__:
['constrained_semmed_mesh.txt'](exports/constrained_semmed_mesh.txt),
['majority_result.txt'](exports/majority_result.txt),
['semmed_merged.txt'](exports/semmed_merged.txt),
['constrained_semmed.txt'](exports/constrained_semmed.txt)

### [Figure 5](Figure 5. Basic SemmedDB and M2C comparison.ipynb)
__Analysis Workflow__:
<img src="https://docs.google.com/drawings/d/e/2PACX-1vTfzKbWfnLbDJeJ8R2N9Lwk09k7LuOedddV-kKqFD_J7iGSchqCptS2lbK8_3USyq3f6kuhmDjNEBC6/pub?w=985&amp;h=1029">

__Modules needed__: pandas, matplotlib, seaborn

__Data files used__: ['semmed_merged.txt'](exports/semmed_merged.txt), ['constrained_semmed_mesh.txt'](exports/constrained_semmed_mesh.txt), ['majority_result.txt'](exports/majority_result.txt)

__Data exported__: ['M2C_semmed_all_merge.txt'](exports/M2C_semmed_all_merge.txt)

### [Figure 6](Figure 6. Concept Pair appearance comparison between SemmedDB and M2C.ipynb) with [fewer](Figure 6 alternative 1. Concept pair comparison, unfiltered.ipynb) or [more](Figure 6 alternative 2. Concept pair comparison, highly constrained.ipynb) constraints

__Purpose__: To compare the coverage of concept pairs in Mark2Cure and SemmedDB given the limitations imposed by sentence-level tokenization in SemmedDB, and investigate how Mark2Cure might be used to complement SemmedDB based on coverage differences. Note that sample cases for verifying the results of the code have been included in the notebook, but are not included in the schematic.

__Analysis Workflow__: 
<img src="https://docs.google.com/drawings/d/e/2PACX-1vQZdhNbFUiv-GbEBCpPPWb4mwqOrEL8OJaAKu_Wt4KFn4jyf_ChkjFHArnn9MbRyHsKHgfPmW8zyXWY/pub?w=964&amp;h=745">

__Modules needed__: pandas, matplotlib, seaborn, m2c_rel_basic

__Data files used__: sample_[0](exports/sample_0_for_expert_ann.txt),[1](exports/sample_1_for_expert_ann.txt),[2](exports/sample_2_for_expert_ann.txt),[3](exports/sample_3_for_expert_ann.txt)_for_expert_ann.txt, ['annresults.txt'](data/annresults.txt), ['semmed_merged.txt'](exports/semmed_merged.txt)

__Data exported__: N/A