This repository has been archived by the owner on Jan 13, 2023. It is now read-only.
Honghan Wu edited this page Feb 18, 2018 · 21 revisions

Welcome to the SemEHR wiki!

Run SemEHR pipeline

A typical SemEHR process contains the following steps:

  • query a database to get the documents for processing
  • NLP processing (e.g., using bio-yodie to annotate UMLS concepts)
  • index contextualised concepts into an Elasticsearch instance
  • do patient-centric indexing to integrate all of a patient's documents and annotations

The easiest way to run the process is to:

  1. (only do this ONCE) initialise SemEHR index using the mapping file.
  2. set up the database view from which SemEHR will pull documents.
  3. edit the process configuration file using this template.
  4. run the script python semehr_processor.py PATH_TO_YOUR_CONFIGURATION
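
Step 4 can also be wrapped in a few lines of Python, for example to fail early when the configuration path is wrong. The sketch below is only an illustration: the helper names build_semehr_command and run_semehr are hypothetical (they are not part of SemEHR), and it assumes the script is launched from the folder that contains semehr_processor.py.

```python
import subprocess
import sys
from pathlib import Path


def build_semehr_command(config_path, script="semehr_processor.py", python=sys.executable):
    """Build the argument list for step 4 (hypothetical helper, not part of SemEHR).

    Raises FileNotFoundError if the configuration file does not exist,
    so a mistyped path is caught before the pipeline starts.
    """
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"configuration file not found: {config}")
    return [python, script, str(config)]


def run_semehr(config_path):
    """Run the SemEHR processor and raise CalledProcessError on a non-zero exit code."""
    # Assumption: the current working directory contains semehr_processor.py.
    subprocess.run(build_semehr_command(config_path), check=True)
```

Calling run_semehr("PATH_TO_YOUR_CONFIGURATION") is then equivalent to running the command in step 4 by hand.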

semehr_process_settings.json explained

  • env - system variables for running SemEHR
    • java_home - path to JRE
    • gcp_home - path to GCP (Gate Cloud Processing toolkit)
    • gate_home - path to Gate
    • yodie_path - path to bio-yodie
    • ukb_home - path to UKB (used by bio-yodie to do PageRank computation for disambiguation)
  • yodie - settings for running bio-yodie NLP pipeline on documents
    • "os" - the type of Operating System; possible values: win, linux
    • "gcp_run_path" - bio-yodie working folder
    • "input_doc_file_path" - (optional) path to a folder containing a text document that lists all document ids to be processed
    • "thread_num" - number of concurrent threads to run bio-yodie
    • "memory" - max memory to run bio-yodie, e.g., 30g or 600m
    • "config_xml_path" - the full path where the bio-yodie configuration file will be stored (the file is generated automatically)
    • "output_file_path" - (optional) path to the folder where the JSON dumps of bio-yodie output will be saved
    • "output_destination" - the output type of bio-yodie; possible values include sql and json. sql - annotations are saved to a database server; json - annotations are saved as dump files in JSON format.
    • "output_dbconn_setting_file" - path to a JSON configuration file for the database that annotations are saved to; check this example.
    • "output_table" - the name of the table to save annotations to when using sql output, e.g., [kconnect_annotations]
    • "output_concept_filter_file" - (optional) path to a text file listing the concept IDs that should be saved, one UMLS CUI per line; all other concepts will be discarded.
    • "input_source" - where to read documents from; possible values include sql and elasticsearch. The system uses a different input handler for running bio-yodie depending on this value. sql - read from a database; elasticsearch - read from the Elasticsearch server specified in the semehr section of this configuration.
    • "input_dbconn_setting_file" - (optional) the input document database configuration; only needed when input_source is sql. Check this example.
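
As a rough illustration of how the env and yodie settings above fit together, the following Python sketch builds a settings dictionary and runs a few sanity checks on it before it is written out as JSON. Everything here is an assumption made for illustration: the paths and values are placeholders, the semehr section is omitted because it is not described on this page, check_yodie_settings is a hypothetical helper (SemEHR does not ship it), and the checks treat the values listed above as the only valid ones. The template linked in step 3 remains the reference for the actual file layout.

```python
import json
import re

# Values documented above; treating them as exhaustive is an assumption.
ALLOWED_VALUES = {
    "os": {"win", "linux"},
    "output_destination": {"sql", "json"},
    "input_source": {"sql", "elasticsearch"},
}


def check_yodie_settings(yodie):
    """Return a list of problems found in a 'yodie' settings dict (hypothetical helper)."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        value = yodie.get(key)
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    # Memory is documented by example only ("30g", "600m").
    if not re.fullmatch(r"\d+[gm]", str(yodie.get("memory", ""))):
        problems.append("memory should look like '30g' or '600m'")
    if yodie.get("input_source") == "sql" and not yodie.get("input_dbconn_setting_file"):
        problems.append("input_dbconn_setting_file is needed when input_source is sql")
    if yodie.get("output_destination") == "sql" and not yodie.get("output_table"):
        problems.append("output_table is needed when output_destination is sql")
    return problems


# Placeholder paths and values: replace them with your own.
settings = {
    "env": {
        "java_home": "/path/to/jre",
        "gcp_home": "/path/to/gcp",
        "gate_home": "/path/to/gate",
        "yodie_path": "/path/to/bio-yodie",
        "ukb_home": "/path/to/ukb",
    },
    "yodie": {
        "os": "linux",
        "gcp_run_path": "/path/to/yodie_working_folder",
        "thread_num": 4,  # assumed to be an integer
        "memory": "30g",
        "config_xml_path": "/path/to/yodie_working_folder/config.xml",
        "output_destination": "sql",
        "output_dbconn_setting_file": "/path/to/output_dbconn.json",
        "output_table": "[kconnect_annotations]",
        "input_source": "sql",
        "input_dbconn_setting_file": "/path/to/input_dbconn.json",
    },
}

print(json.dumps(settings, indent=2))
print(check_yodie_settings(settings["yodie"]))
```

The checks only mirror the constraints stated in the list above (allowed values, and the settings that are needed for sql input or output); they do not verify that any path or database actually exists.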

Useful Links

Troubleshooting

  • If no concepts are indexed for patients, double-check the index mapping to make sure the mappings match those defined in the script.