In [2]:
import json, pandas as pd

# Run the CaseOLAP Pipeline

## Server-Wide Steps
Steps 1-5 need to be run only once; their outputs are shared across the whole server. The later steps should be customized for each project.

### 1. Download the documents

In [11]:
!python '01_run_download.py'

Downloading baseline files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ For details, see the download logfile
Finished downloading PubMed baseline files. 3s
Downloading update files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ For details, see the download logfile
Finished downloading PubMed update files.1.3055191040039062
==== Start checking md5 in ./ftp.ncbi.nlm.nih.gov/pubmed/baseline/ ====
100 files checked
200 files checked
300 files checked
400 files checked
500 files checked
600 files checked
700 files checked
800 files checked
900 files checked
1000 files checked
1100 files checked

==== All md5 checks succeeded (1114 files) ====
==== Start checking md5 in ./ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ ====
100 files checked
200 files checked
300 files checked

==== All md5 checks succeeded (314 files) ====
==== Start extracting in ./ftp.ncbi.nlm.nih.gov/pubmed/baseline/ ====
50 files extracted,77.79513263702393 seconds
100 files extracted,155.49333715438843 seconds
150 
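
The md5 check reported in the log above can be sketched roughly as follows. This is a minimal illustration, not the code in `01_run_download.py`: the helper names are hypothetical, and it assumes each `.gz` file sits next to a `.md5` file whose last token is the expected hex digest.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the expected digest (assumed to be the last token in the file)."""
    text = Path(md5_path).read_text().strip()
    return text.replace("=", " ").split()[-1].lower()


def check_directory(directory):
    """Return the names of .gz files whose digest is missing or does not match."""
    failed = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        md5_path = Path(str(gz_path) + ".md5")
        if not md5_path.exists() or md5_of_file(gz_path) != expected_md5(md5_path):
            failed.append(gz_path.name)
    return failed
```

An empty list from `check_directory` corresponds to the "All md5 checks succeeded" message; any file it returns should be downloaded again before extraction.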

### 2. Parse the documents


In [16]:
!python '02_run_parsing.py'

{'title': True, 'PMID': True, 'author': False, 'abstract': True, 'MeSH': True, 'location': True, 'date': True, 'journal': True, 'full_text': True}
Parsing baseline containing 1114 bulk files
Parsing finished. 29999 articles parsed.Total time: 10.836343765258789
Parsing 1/1114 from baseline:pubmed22n0100.xml
Parsing finished. 29999 articles parsed.Total time: 11.532426118850708
Parsing 2/1114 from baseline:pubmed22n0918.xml
Parsing finished. 29999 articles parsed.Total time: 6.053693771362305
Parsing 3/1114 from baseline:pubmed22n0425.xml
Parsing finished. 29999 articles parsed.Total time: 9.248413324356079
Parsing 4/1114 from baseline:pubmed22n0395.xml
Parsing finished. 29999 articles parsed.Total time: 12.04988694190979
Parsing 5/1114 from baseline:pubmed22n1062.xml
Parsing finished. 29999 articles parsed.Total time: 10.014681816101074
Parsing 6/1114 from baseline:pubmed22n0549.xml
Parsing finished. 29999 articles parsed.Total time: 11.252095460891724
Parsing 7/1114 from baseline:pubm
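
To illustrate what parsing one bulk file involves, here is a minimal sketch that pulls a few of the fields enabled in the configuration above (`PMID`, `title`, `abstract`, `MeSH`) out of PubMed XML with the standard library. The function name is hypothetical and the real `02_run_parsing.py` may work quite differently (for example, it also handles dates, journals, and locations).

```python
import xml.etree.ElementTree as ET


def parse_articles(xml_text):
    """Yield one dict per PubmedArticle with PMID, title, abstract, and MeSH."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        title = citation.find("Article/ArticleTitle")
        abstract_parts = citation.findall("Article/Abstract/AbstractText")
        descriptors = citation.findall("MeshHeadingList/MeshHeading/DescriptorName")
        yield {
            "PMID": citation.findtext("PMID", default=""),
            # itertext() keeps text nested inside inline markup such as <i>.
            "title": "".join(title.itertext()) if title is not None else "",
            "abstract": " ".join("".join(p.itertext()) for p in abstract_parts),
            "MeSH": [d.text for d in descriptors],
        }
```

For the full-size bulk files, a streaming parser such as `ET.iterparse` would likely be a better fit than loading each file into memory at once.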

### 3. Map MeSH to PMID (for document-category relationships)

In [7]:
!python '03_run_mesh2pmid.py'

MeSH to PMID mapping is running...
Exporting MeSH to PMID mapping
Finished MeSH to PMID mapping. Time = 17.583333333333332 minutes
MeSH to PMID mapping stat is running.


In [243]:
!python '03_run_mesh2pmid.py'

MeSH to PMID mapping is running...
Exporting MeSH to PMID mapping
Finished MeSH to PMID mapping. Time = 17.166666666666668 minutes
MeSH to PMID mapping stat is running.
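
Conceptually, this step inverts the parsed records so that each MeSH term points to the documents tagged with it. The sketch below shows that inversion on in-memory dictionaries; the function name and the record layout (`PMID` and `MeSH` keys) are assumptions for illustration, not the actual implementation of `03_run_mesh2pmid.py`.

```python
from collections import defaultdict


def build_mesh2pmid(parsed_articles):
    """Invert PMID -> MeSH terms into MeSH term -> sorted list of PMIDs."""
    mesh2pmid = defaultdict(set)
    for article in parsed_articles:
        for term in article.get("MeSH", []):
            mesh2pmid[term].add(article["PMID"])
    return {term: sorted(pmids) for term, pmids in mesh2pmid.items()}


articles = [
    {"PMID": "1", "MeSH": ["Myocardial Ischemia", "Humans"]},
    {"PMID": "2", "MeSH": ["Humans"]},
]
mapping = build_mesh2pmid(articles)
# mapping["Humans"] lists both PMIDs; "Myocardial Ischemia" lists only "1".
```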


### 4 & 5. Index the documents (for document-entity relationships)
Make sure Elasticsearch is properly configured and running first.

4. Initialize the index.
5. Populate the index.


In [114]:
!python '04_run_index_init.py'

Deleted index: pubmed_x 
Response: {'acknowledged': True} 



In [14]:
!python '05_run_index_populate.py'

50000 documents indexed in 0.181 minutes
100000 documents indexed in 0.292 minutes
150000 documents indexed in 0.472 minutes
200000 documents indexed in 0.674 minutes
250000 documents indexed in 0.814 minutes
300000 documents indexed in 1.009 minutes
350000 documents indexed in 1.204 minutes
400000 documents indexed in 1.399 minutes
450000 documents indexed in 1.599 minutes
500000 documents indexed in 1.752 minutes
550000 documents indexed in 1.924 minutes
600000 documents indexed in 2.122 minutes
650000 documents indexed in 2.214 minutes
700000 documents indexed in 2.321 minutes
750000 documents indexed in 2.413 minutes
800000 documents indexed in 2.606 minutes
850000 documents indexed in 2.741 minutes
900000 documents indexed in 2.943 minutes
950000 documents indexed in 3.11 minutes
1000000 documents indexed in 3.319 minutes
1050000 documents indexed in 3.518 minutes
1100000 documents indexed in 3.66 minutes
1150000 documents indexed in 3.838 minutes
1200000 documents indexed in 4.03
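
As a rough picture of what populating the index involves, the sketch below turns parsed records into the action dictionaries that the Elasticsearch Python client's `helpers.bulk` function accepts. The index name `pubmed_x` comes from the log above; the function name and the choice of `title` and `abstract` as indexed fields are assumptions, and the real `05_run_index_populate.py` may index different fields.

```python
def bulk_actions(parsed_articles, index_name="pubmed_x"):
    """Yield one bulk-index action per article, keyed by its PMID."""
    for article in parsed_articles:
        yield {
            "_index": index_name,
            "_id": article["PMID"],
            # Hypothetical field selection; adjust to the index mapping in use.
            "_source": {
                "title": article.get("title", ""),
                "abstract": article.get("abstract", ""),
            },
        }


actions = list(bulk_actions([{"PMID": "1", "title": "t", "abstract": "a"}]))
# With a running cluster, the generator could be passed to
# elasticsearch.helpers.bulk(client, bulk_actions(articles)).
```

Using the PMID as the document `_id` means that re-indexing an updated article overwrites the old copy instead of creating a duplicate.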

## Project-Specific Steps
These steps should be customized for each project.

### 6. Categorize the documents of interest

In [58]:
!python '06_run_textcube.py'

8 categories:  ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO', 'OTH']
Collecting categories' subcategory MeSH terms...
Textcube category to PMID mapping is being created....
Textcube PMID to category mapping is being created....
Textcube category statistics is being created....

Category: Documents
IHD: 235,278
CM: 230,362
ARR: 150,600
VD: 132,444
CHD: 461,353
CCD: 98,623
VOO: 40,896
OTH: 209,154
TOTAL: 1,157,787
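
The category statistics above can be reproduced in miniature with the sketch below, which assumes a category-to-MeSH-terms dictionary and the MeSH-to-PMID mapping from step 3 (both names and layouts are hypothetical). Note that the TOTAL in the log is smaller than the sum of the per-category counts, which is consistent with a document being counted once even when it belongs to several categories; the sketch computes the total the same way.

```python
def build_textcube(category2mesh, mesh2pmid):
    """Map each category to the set of PMIDs tagged with any of its MeSH terms."""
    category2pmids = {}
    for category, terms in category2mesh.items():
        pmids = set()
        for term in terms:
            pmids.update(mesh2pmid.get(term, []))
        category2pmids[category] = pmids
    return category2pmids


def category_statistics(category2pmids):
    """Count documents per category plus the number of distinct documents."""
    stats = {category: len(pmids) for category, pmids in category2pmids.items()}
    stats["TOTAL"] = len(set().union(*category2pmids.values()))
    return stats


cube = build_textcube(
    {"IHD": ["Myocardial Ischemia"], "ARR": ["Arrhythmias, Cardiac"]},
    {"Myocardial Ischemia": ["1", "2"], "Arrhythmias, Cardiac": ["2", "3"]},
)
stats = category_statistics(cube)
```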


### 7. Vary the synonyms' cases
This step makes case-sensitive variations of the synonyms, which increases the chance of finding each synonym in the case-sensitive text.

In [90]:
! python '07_run_vary_synonyms_cases.py'

Running jobs...
Done! 1 progress: 15 / 15
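
One simple way to generate case variations is to toggle the first letter of each word and take every combination. The sketch below does that; it is an illustration under assumptions, not the logic of `07_run_vary_synonyms_cases.py`. In particular, leaving words with internal capitals (such as `eNOS`) untouched is a rule chosen here to avoid producing unlikely spellings.

```python
from itertools import product


def vary_case(synonym):
    """Return all variants obtained by toggling each word's first-letter case."""
    options = []
    for word in synonym.split(" "):
        rest = word[1:]
        # Assumption: only vary words whose remaining letters are lowercase.
        if word and word[0].isalpha() and rest == rest.lower():
            options.append({word[0].lower() + rest, word[0].upper() + rest})
        else:
            options.append({word})
    return sorted({" ".join(combo) for combo in product(*options)})


variants = vary_case("added protein")
# Four variants: every upper/lower combination of the two first letters.
```

The number of variants doubles with each eligible word, so long multi-word synonyms can expand considerably.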


### 8. Count all synonyms in the indexed text
This counts the case-varied synonyms.

In [103]:
! python '08_run_count_synonyms.py'

Synonym count is running .....
Running jobs. 0.0 seconds
10030  synonyms successfully counted! 56.7 seconds
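
The pipeline counts synonyms against the indexed text; as an in-memory stand-in, the sketch below counts case-sensitive, whole-token occurrences with regular expressions. The function name is hypothetical and the matching rules of `08_run_count_synonyms.py` may differ.

```python
import re
from collections import Counter


def count_synonyms(synonyms, documents):
    """Count case-sensitive, whole-token occurrences of each synonym."""
    counts = Counter()
    for synonym in synonyms:
        # Lookarounds prevent matches inside longer alphanumeric tokens.
        pattern = re.compile(r"(?<!\w)" + re.escape(synonym) + r"(?!\w)")
        for text in documents:
            counts[synonym] += len(pattern.findall(text))
    return counts


docs = ["Cx43 and cx43 differ; Cx43 again.", "RyR2 is not RyR21."]
counts = count_synonyms(["Cx43", "RyR2", "eNOS"], docs)
```

Because matching is case-sensitive, `cx43` is not counted toward `Cx43` here, which is why step 7 generates the case variations first.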


### 9. Screen for ambiguous synonyms
Some synonyms will likely be ambiguous, leading to false positives. This step identifies the synonyms presumed to be potentially ambiguous (e.g., short synonyms and synonyms that are single English words).

In [104]:
! python '09_run_screen_synonyms.py'

Running jobs...
Exporting the suspect synonyms to data/remove_these_synonyms.txt
Inspect this file when the process is complete

Synonyms in data/remove_these_synonyms.txt will be removed. Check here.
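
The two screening criteria described above can be sketched as follows. The function name, the length threshold, and the English word list are all assumptions made for illustration; `09_run_screen_synonyms.py` may use different rules and a different dictionary.

```python
def screen_synonyms(synonyms, english_words, min_length=3):
    """Split suspect synonyms into English words and very short strings."""
    english = {word.lower() for word in english_words}
    are_words = [s for s in synonyms if s.lower() in english]
    too_short = [
        s for s in synonyms
        if len(s) < min_length and s.lower() not in english
    ]
    return are_words, too_short


are_words, too_short = screen_synonyms(
    ["Cx43", "ARE", "M", "Added Protein"],
    english_words=["are", "protein"],
)
```

Returning the two groups separately mirrors the two-part layout of the suspect-synonym file, which makes manual review easier.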


### Next steps
- Add/remove synonyms as described in the next block.
- Run steps 10-13.
- Inspect the scores.
- Add/remove synonyms again.
- Repeat the process until you're satisfied.

### Modify the file from step 9 (data/remove_these_synonyms.txt)
Synonyms listed in this file are excluded from the entity counts. Add bad synonyms to the file and remove good synonyms from it. The file contains the case-varied versions: the first part lists synonyms that are English words, and the second part lists synonyms that are very short.
- If you add a synonym, add all of its case-varied versions (e.g., "Added Protein", "added protein", "Added protein", "added Protein").
- If you remove a synonym, remove all of its case-varied versions.

### 10. Get the entity counts
This step assembles the entity counts from the synonym counts, using only the synonyms that were not flagged as bad.

In [110]:
! python '10_run_make_entity_counts.py'
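
In essence, an entity's count is the sum of the counts of its remaining synonyms. The sketch below shows this with hypothetical names and dictionary layouts; the synonym counts are the ones reported for P17302 in the step 13 output, and treating `Connexin-43` as removed is purely for illustration.

```python
def make_entity_counts(entity2synonyms, synonym_counts, removed):
    """Sum synonym counts per entity, skipping synonyms marked for removal."""
    removed = set(removed)
    entity_counts = {}
    for entity, synonyms in entity2synonyms.items():
        total = sum(
            synonym_counts.get(s, 0) for s in synonyms if s not in removed
        )
        if total > 0:
            entity_counts[entity] = total
    return entity_counts


entity2synonyms = {"P17302": ["Cx43", "connexin-43", "Connexin-43"]}
synonym_counts = {"Cx43": 3179, "connexin-43": 809, "Connexin-43": 165}
all_kept = make_entity_counts(entity2synonyms, synonym_counts, removed=[])
one_removed = make_entity_counts(
    entity2synonyms, synonym_counts, removed=["Connexin-43"]
)
```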

### 11. Update the metadata

In [111]:
!python '11_run_metadata_update.py'

Updating the PMID -to-> Entity Count dictionary
Updating the Category -to-> Entity-Containing-PMID dictionary



### 12. Produce CaseOLAP scores for the entities

In [112]:
!python '12_run_caseolap_score.py'

Category: # PMIDs collected
IHD: 3677 PMIDs
CM: 1632 PMIDs
ARR: 339 PMIDs
VD: 447 PMIDs
CHD: 4636 PMIDs
CCD: 590 PMIDs
VOO: 92 PMIDs
OTH: 1957 PMIDs

Category: # Entities Found
IHD 53
CM 48
ARR 28
VD 33
CHD 48
CCD 39
VOO 16
OTH 55
Total entities: 65


### 13. Inspect entity scores
- You may notice that some entities score highly due to false-positive synonyms. In that case, go back to the file from step 9 (data/remove_these_synonyms.txt), add the bad synonyms to the list, and run steps 10-13 again until you are satisfied with the quality of the results.
- Check the files in *results/ranked_entities* and *results/ranked_synonyms* (unless you changed where they're stored).

In [108]:
! python '13_run_inspect_entity_scores.py'

65/117 (Found/Searched) Entities 
55.56%

Format:
# [Rank] [ID] | Score [x.xxxx]
#counts in all categories: synonym


*********************************
******** Ranked Entities ********
*********************************

#1 P17302 | 0.7613 Total CaseOLAP Score
3179: Cx43
809: connexin-43
165: Connexin-43

#2 Q92736 | 0.7398 Total CaseOLAP Score
2680: RyR2
150: ryanodine receptor 2
27: type 2 ryanodine receptor
11: Ryanodine receptor 2
8: Ryanodine Receptor 2
6: Type 2 ryanodine receptor
2: RYR-2
1: cardiac muscle ryanodine receptor

#3 P29474 | 0.7384 Total CaseOLAP Score
4994: eNOS
152: endothelial NOS
90: cNOS
21: constitutive NOS
10: Endothelial NOS
8: NOSIII
2: 1.14.13.39
1: Constitutive NOS
1: nitric oxide synthase, endothelial

#4 P19429 | 0.7103 Total CaseOLAP Score
3363: cardiac troponin I
476: Cardiac troponin I
130: Cardiac Troponin I
76: cardiac Troponin I
1: troponin I, cardiac muscle

#5 P14780 | 0.7102 Total CaseOLAP Score
3444: MMP-9
608: matrix metalloproteinase-9
75: M

NOTE: Be careful about drawing conclusions from proteins that cluster tightly together. They may cluster only because they share synonyms. Proteins may share synonyms because they are similar, but that similarity could have been determined without looking at the scores.