# Run the CaseOLAP Pipeline

## Server-Wide Steps
Steps 1-5 should be performed once; their output is used globally, across the whole server. The later steps should be customized for each project.

### 1. Download the documents

In [1]:
!python '01_run_download.py'

### 2. Parse the documents

In [2]:
!python '02_run_parsing.py'

### 3. Map MeSH to PMID (for document-category relationships)

In [14]:
!python '03_run_mesh2pmid.py'


### 4 & 5. Index the documents (for document-entity relationships)
Make sure Elasticsearch is properly configured and running first.

4. Initialize the index.
5. Populate the index.
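
A quick way to confirm that Elasticsearch is reachable before indexing is to query its root endpoint. This is a minimal sketch, not part of the pipeline: the helper name `elasticsearch_is_up` is hypothetical, and the URL assumes a default local install on port 9200 without authentication. Adjust it to match your server.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=5,
                        opener=urllib.request.urlopen):
    """Return True if the root endpoint answers with JSON that has a 'version' key.

    The URL is an assumption (default local install, no authentication).
    `opener` is injectable so the check can be exercised without a live server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return isinstance(info, dict) and "version" in info
```

Calling `elasticsearch_is_up()` in a notebook cell before steps 4 and 5 returns `False` if the server is down or rejects the request, which is cheaper to find out than a failed indexing run.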


In [3]:
!python '04_run_index_init.py'

In [4]:
!python '05_run_index_populate.py'

## Project-Specific Steps
These steps should be customized for each project.

### 6. Categorize the documents of interest

In [12]:
!python '06_run_textcube.py'

### 7. Vary the synonyms' cases
Makes case-sensitive variations of the synonyms. This increases the chance of finding the synonyms in the case-sensitive text.

In [11]:
! python '07_run_vary_synonyms_cases.py'
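
To illustrate what case variation means here, the sketch below toggles the case of the first letter of each word, which reproduces the four variants of "added protein" listed later in this notebook. The function `vary_case` is hypothetical, and the actual script may generate a different set of variants.

```python
from itertools import product


def vary_case(synonym):
    """Return all variants made by upper- or lower-casing each word's first letter.

    Illustrative only: the real script may apply other rules.
    """
    options = []
    for word in synonym.split():
        lower = word[0].lower() + word[1:]
        upper = word[0].upper() + word[1:]
        # A set removes the duplicate when the first character has no case (e.g. a digit).
        options.append(sorted({lower, upper}))
    return sorted({" ".join(combo) for combo in product(*options)})


print(vary_case("added protein"))
# ['Added Protein', 'Added protein', 'added Protein', 'added protein']
```

Under this scheme the number of variants doubles with each word, so a three-word synonym yields up to eight variants.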

### 8. Count all synonyms in the indexed text
This counts the case-varied synonyms.

In [10]:
! python '08_run_count_synonyms.py'

### 9. Screen for ambiguous synonyms
Some synonyms will likely be ambiguous, leading to false positives. This step identifies the synonyms presumed to be potentially ambiguous (i.e., short synonyms and synonyms that are single English words).

In [9]:
! python '09_run_screen_synonyms.py'
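
The two screening criteria can be sketched as follows. This is only an illustration under stated assumptions: the function `screen_synonyms`, the `min_length` threshold of 4, and the tiny word list are all hypothetical, and the actual script may use a different dictionary and cutoff.

```python
def screen_synonyms(synonyms, english_words, min_length=4):
    """Split out synonyms that look ambiguous.

    Returns (english, short): single-word synonyms found in `english_words`
    (compared in lowercase), and remaining synonyms shorter than `min_length`.
    Both the threshold and the word list are assumptions for illustration.
    """
    english, short = [], []
    for synonym in synonyms:
        if " " not in synonym and synonym.lower() in english_words:
            english.append(synonym)
        elif len(synonym) < min_length:
            short.append(synonym)
    return english, short


# Hypothetical inputs.
words = {"cat", "was", "impact"}
candidates = ["CAT", "Impact", "HSP", "heat shock protein", "AKT1"]
english, short = screen_synonyms(candidates, words)
print(english)  # ['CAT', 'Impact']
print(short)    # ['HSP']
```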

### Next steps
- Add/remove synonyms as described in the next section ("Modify the file from step 9").
- Run steps 10-13.
- Inspect the scores.
- Add/remove synonyms again.
- Repeat the process until you're satisfied.
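
Because steps 10-13 are rerun after every edit to the synonym list, it can be convenient to run them from a single cell. The sketch below is one way to do that, assuming the four scripts sit in the current working directory; the helper name `rerun_scoring` is hypothetical.

```python
import subprocess
import sys

# Script names for steps 10-13, as used in the cells of this notebook.
SCORING_STEPS = [
    "10_run_make_entity_counts.py",
    "11_run_metadata_update.py",
    "12_run_caseolap_score.py",
    "13_run_inspect_entity_scores.py",
]


def rerun_scoring(steps=SCORING_STEPS, runner=subprocess.run):
    """Run each script in order, stopping at the first one that fails.

    `runner` is injectable so the sequence can be checked without executing anything.
    """
    for script in steps:
        # check=True raises CalledProcessError if the script exits with a nonzero status.
        runner([sys.executable, script], check=True)
```

After editing the synonym file, call `rerun_scoring()` and then inspect the results again.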

### Modify the file from step 9 (data/remove_these_synonyms.txt)
This file lists the synonyms to exclude, including their case-varied versions. Add bad synonyms to it and delete good synonyms from it. The first part of the file contains synonyms that are English words. The second part contains synonyms that are very short.
- If you add a synonym, add all of its case-varied versions (e.g., "Added Protein", "added protein", "Added protein", "added Protein").
- If you remove a synonym, remove all of its case-varied versions as well.

### 10. Get the entity counts
Using the synonyms that were not flagged as bad, together with their counts, this step assembles the entity counts.

In [8]:
! python '10_run_make_entity_counts.py'
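
Conceptually, an entity's count in a document is the sum of the counts of its retained synonyms. The sketch below shows that aggregation under assumed data shapes (an entity-to-synonyms mapping and per-synonym, per-document counts). The function `make_entity_counts` and these structures are hypothetical and may not match the files the script actually reads.

```python
from collections import Counter


def make_entity_counts(entity_to_synonyms, synonym_counts, bad_synonyms):
    """Sum per-document synonym counts into per-document entity counts.

    entity_to_synonyms: {entity: [synonym, ...]}
    synonym_counts:     {synonym: {pmid: count}}
    bad_synonyms:       set of synonyms to skip
    All three shapes are assumptions made for illustration.
    """
    entity_counts = {}
    for entity, synonyms in entity_to_synonyms.items():
        total = Counter()
        for synonym in synonyms:
            if synonym in bad_synonyms:
                continue  # excluded via the remove list
            total.update(synonym_counts.get(synonym, {}))
        entity_counts[entity] = dict(total)
    return entity_counts


# Hypothetical inputs.
counts = make_entity_counts(
    {"P1": ["alpha", "Alpha", "A"]},
    {"alpha": {"pmid1": 2}, "Alpha": {"pmid1": 1, "pmid2": 3}, "A": {"pmid3": 50}},
    bad_synonyms={"A"},
)
print(counts)  # {'P1': {'pmid1': 3, 'pmid2': 3}}
```

In this example the ambiguous synonym "A" would have contributed 50 hits on its own, which shows how a single bad synonym can dominate an entity's count.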

### 11. Update the metadata

In [7]:
!python '11_run_metadata_update.py'


### 12. Produce CaseOLAP scores for the entities

In [6]:
!python '12_run_caseolap_score.py'

### 13. Inspect entity scores
- You may notice that some entities score highly due to false-positive synonyms. In that case, go back to the file from step 9 (data/remove_these_synonyms.txt), add the bad synonyms to the list, and run steps 10-13 again until you are satisfied with the quality of the results.
- Check the files in *results/ranked_entities* and *results/ranked_synonyms* (unless you changed where they're stored).

In [5]:
! python '13_run_inspect_entity_scores.py'

NOTE: Be careful about drawing conclusions from proteins that cluster tightly together. They might cluster together only because they share the same synonyms. They may share synonyms because they are similar, but that similarity could have been determined without looking at the scores.
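
One way to check whether a tight cluster is explained by shared synonyms is to group entities whose synonym sets are identical. The sketch below assumes an entity-to-synonyms mapping is available in memory; the function `group_by_shared_synonyms` is hypothetical and is not part of the pipeline.

```python
from collections import defaultdict


def group_by_shared_synonyms(entity_to_synonyms):
    """Return groups of entities whose synonym sets are exactly the same."""
    groups = defaultdict(list)
    for entity, synonyms in entity_to_synonyms.items():
        # frozenset ignores order and duplicates, and is hashable.
        groups[frozenset(synonyms)].append(entity)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical example: P1 and P2 share every synonym.
print(group_by_shared_synonyms({"P1": ["a", "b"], "P2": ["b", "a"], "P3": ["c"]}))
# [['P1', 'P2']]
```

Entities in the same group necessarily receive identical counts, so their scores carry no independent evidence. Entities with partially overlapping synonym sets are not caught by this check.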