Merged
13 changes: 10 additions & 3 deletions medcat-service/.gitignore
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
# IDE envs and system folders
.DS_Store
**/.DS_Store
.idea
.vscode
.venv
.ruff_cache
**__pycache__
venv
venv-test
tmp_cat.log
__pycache__
.mypy_cache
*.pyc
.pyc
@@ -21,7 +24,11 @@ models/examples/example-deid-model-pack
tmp/*

# env folder
env/*
.env
*.env
**/.env
**/*.env

# log files
tmp_cat.log
medcat.log
53 changes: 34 additions & 19 deletions medcat-service/README.md
@@ -4,17 +4,17 @@ A REST API wrapper for [MedCAT](https://github.com/CogStack/cogstack-nlp/blob/ma

Feel free to ask questions on the github issue tracker or on our [discourse website](https://discourse.cogstack.org) which is frequently used by our development team!

# API specification
## API specification

The API definition follows the one defined in [CogStack GATE NLP Service](https://github.com/CogStack/gate-nlp-service/). Currently, there are three endpoints defined that consume and return data in JSON format:

- *GET* `/api/info` - displays general information about the MedCAT application,
- *POST* `/api/process` - processes the provided documents and returns back the annotations,
- *POST* `/api/process_bulk` - processes the provided list of documents and returns back the annotations.

The full specification is available in the [OpenAPI](https://github.com/CogStack/gate-nlp-service/tree/devel/api-specs) specification.
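As a minimal sketch of how a client might talk to these endpoints, the payload shapes can be built as below. This is an illustration only: the field names follow the examples in this README, and the `post_json` helper is a hypothetical convenience, not part of the service.

```python
import json
import urllib.request

def build_process_payload(text):
    # /api/process takes a single document under "content"
    return {"content": {"text": text}}

def build_bulk_payload(texts):
    # /api/process_bulk takes a list of documents under "content"
    return {"content": [{"text": t} for t in texts]}

def post_json(url, payload):
    # hypothetical helper; assumes the service is reachable at `url`
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `post_json("http://localhost:5000/api/process", build_process_payload("The patient was diagnosed with leukemia."))` would return the annotation JSON, assuming the service is running locally on port `5000`.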


# Running the application
## Running the application

The application can be run either as a standalone Python application or inside a Docker container (recommended).

@@ -23,12 +23,13 @@ The application can be run either as a standalone Python application or as runni
Please note that prior to running the application a number of requirements need to be installed (see: `requirements.txt`).

Two scripts are provided for starting the application:

- `start_service_debug.sh` - starts the application in the development mode
- `start_service_production.sh` - starts the application in 'production' mode using the `gunicorn` server.

## Running in a Docker container

The recommended way to run the application is to use the provided Docker image. The Docker image can either be downloaded from Docker Hub (`cogstacksystems/medcat-service:latest`) or built manually using the provided `Dockerfile`.
Please note that by default the built Docker image will run the Flask application in 'production' mode using the `start_service_production.sh` script.

To build the Docker image manually:
@@ -38,36 +39,40 @@ To build the Docker image manually:
To run the container using the built image:

```
docker run -it -p 5000:5000 \
    --env-file=envs/env_app --env-file=envs/env_medcat \
    -v <models-local-dir>:/cat/models:ro \
    cogstacksystems/medcat-service:latest
```

By default the MedCAT service will be running on port `5000`. MedCAT models will be mounted from the local directory `<models-local-dir>` into the container at `/cat/models`.

### GPU support

If you have a GPU and wish to use it, edit the `docker/docker-compose.yml` file: use the `cogstacksystems/medcat-service-gpu:latest` image or change the `build:` directive to build `../Dockerfile_gpu`.
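As an illustrative sketch only (the service name and file layout here are assumptions; check the actual `docker/docker-compose.yml` in this repo), a GPU-enabled Compose service using the standard device-reservation syntax might look like:

```
services:
  medcat-service:
    image: cogstacksystems/medcat-service-gpu:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```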

### <span style="color:red">IMPORTANT!</span>

If you wish to run this Docker service manually, use the `docker/docker-compose.yml` file and execute `docker compose up -d` from the `docker` folder.

Alternatively, an example script `./docker/run_example_medmen.sh` is provided to run the Docker container with the MedCAT service. The script downloads an example model (using the `./scripts/download_medmen.sh` script), uses an example environment configuration, then builds and starts the service using the provided Docker Compose file. The service <b><span style="color:red">WON'T WORK</span></b> without the model being present.

All models should be mounted from the `models/` folder.

<br>

### Manual docker start-up steps

```
1. cd ./models/
2. bash ./download_medmen.sh
3. cd ../docker/
4. docker compose up -d
DONE!
```

Or, if you wish to use the above-mentioned script (the sample model is downloaded by the script, so you don't need to do anything else):

```
1. cd ./docker/
2. bash ./run_example_medmen.sh
@@ -77,6 +82,7 @@ Or, if you wish to use the above mentioned script ( the sample model is download
# API Example use

Assuming that the application is running on the `localhost` with the API exposed on port `5000`, one can run:

```
curl -XPOST http://localhost:5000/api/process \
-H 'Content-Type: application/json' \
@@ -100,9 +106,9 @@ and the received result:

Additional DE-ID query sample (make sure you have a de-id model loaded):

curl -XPOST http://localhost:5555/api/process \
 -H 'Content-Type: application/json' \
 -d '{"content":{"text":"Patient Information: Full Name: John Michael Doe \n Gender: Male \n Date of Birth: January 15, 1975 (Age: 49) \n Patient ID: 567890123 \n Address: 1234 Elm Street, Springfield, IL 62701 \n Phone Number: (555) 123-4567 \n Email: johnmdoe@example.com \n Emergency Contact: Jane Doe (Wife) \n Phone: (555) 987-6543 \n Relationship: Spouse"}}'

Make sure you have `DEID_MODE=True` set in `envs/env_medcat`.

@@ -114,7 +120,7 @@ curl -XPOST http://localhost:5000/api/process_bulk \
-d '{"content": [{"text":"The patient was diagnosed with leukemia."}, {"text": "The patient was diagnosed with cancer."}] }'
```

Example bulk result:

```
{
@@ -269,48 +275,56 @@ As the changes from MedCAT introduced dictionary annotation/entity output.

The mode in which annotation entities are output in the JSON response. In older versions this was output as a "list" of dicts, so the output would be:

```
{"annotations": [{"id": "0", "cui" : "C1X..", ..}, {"id":"1", "cui": "...."}]}
```

Newer versions of MedCAT (1.2+) output entities as a dict, where the id of the entity is the key and the rest of the data is the value, so for "dict"
the output is:

```
{"annotations": [{"0": {"cui": "C0027361", "id": 0,.....}, "1": {"cui": "C001111", "id": 1......}}]}
```

This setting can be configured in the `./env/env_medcat` file, using the `ANNOTATIONS_ENTITY_OUTPUT_MODE` variable.
By default, the output of these entities is set to respect the output of the MedCAT package, hence the latter will be used. Please change the above-mentioned env variable and make sure your CogStack-NiFi annotation script is adapted accordingly.
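To illustrate the difference between the two modes, here is a small sketch that flattens a "dict"-mode response into the older "list"-mode shape. The field names follow the examples above; this helper is illustrative and not part of the service.

```python
def to_list_mode(response):
    """Flatten 'dict'-mode annotations (entity id -> entity) into a 'list'-mode list of entities."""
    flat = []
    for ent_map in response["annotations"]:
        # each element maps a stringified entity id to the entity data
        flat.extend(sorted(ent_map.values(), key=lambda e: int(e["id"])))
    return {"annotations": flat}
```

For example, `to_list_mode({"annotations": [{"0": {"cui": "C0027361", "id": 0}}]})` yields `{"annotations": [{"cui": "C0027361", "id": 0}]}`.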
<br>
Please note that the returned NLP annotations will depend on the underlying model used. For evaluation, we can only provide a very basic model trained on [MedMentions](https://github.com/chanzuckerberg/MedMentions). Models utilising [SNOMED CT](https://www.england.nhs.uk/digitaltechnology/digital-primary-care/snomed-ct/) or [UMLS](https://www.nlm.nih.gov/research/umls/index.html) may require applying for licenses from the copyright holders.
<br>
<br>

# Configuration
## Configuration

In the current implementation, configuration for both the MedCAT Service application and the MedCAT NLP library is based on environment variables. These are usually provided in two files in the `env` directory:

- `env_app` - configuration of MedCAT Service app,
- `env_medcat` - configuration of MedCAT library.

Both files allow tailoring MedCAT for specific use-cases. When running MedCAT Service, these variables need to be loaded into the current working environment.

## spaCy models

When using MedCAT for a language other than English, it can be useful to use a different spaCy model. A spaCy model can be included in the MedCAT model pack, but when not using that functionality, models can instead be installed in the Docker image. This is done by setting a build-time variable; see the `SPACY_MODELS` variable in [Dockerfile](Dockerfile) for the default value and usage.

## MedCAT Service
## Service Environment vars

MedCAT Service application settings are defined in the `envs/env_app` file.

The following environment variables are available for tailoring the MedCAT Service `gunicorn` server:

- `SERVER_HOST` - specifies the host address (default: `0.0.0.0`),
- `SERVER_PORT` - the port number used (default: `5000`),
- `SERVER_WORKERS` - the number of workers serving the Flask app working in parallel (default: `1` ; only used in production server).
- `SERVER_WORKER_TIMEOUT` - the max timeout (in sec) for receiving response from worker (default: `300` ; only used with production server).

The following environment variables are available for tailoring the MedCAT Service wrapper:

- `APP_MODEL_NAME` - an informative name of the model used by MedCAT (optional),
- `APP_MODEL_CDB_PATH` - the path to the model's concept database,
- `APP_MODEL_VOCAB_PATH` - the path to the model's vocabulary,
- `APP_MODEL_META_PATH_LIST` - the list of paths to meta-annotation models, each separated by `:` character (optional),
- `APP_BULK_NPROC` - the number of threads used in bulk processing (default: `8`),
- `APP_TRAINING_MODE` - whether to run the application with MedCAT in training mode (default: `False`).
- `APP_MEDCAT_MODEL_PACK` - MedCAT Model Pack path; if this parameter has a value IT WILL BE LOADED FIRST OVER EVERYTHING ELSE (CDB, Vocab, MetaCATs, etc.) declared above.
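Combining the variables above, a minimal `envs/env_app` might look like the following sketch. The values are examples only, and the `cdb.dat` filename is an assumption; check your actual model files.

```
# Illustrative app configuration -- example values, not defaults
SERVER_HOST=0.0.0.0
SERVER_PORT=5000
SERVER_WORKERS=2
SERVER_WORKER_TIMEOUT=300

APP_MODEL_NAME=medmen
APP_MODEL_CDB_PATH=/cat/models/medmen/cdb.dat
APP_MODEL_VOCAB_PATH=/cat/models/medmen/vocab.dat
APP_BULK_NPROC=8
APP_TRAINING_MODE=False
```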

## Performance Tuning
@@ -319,6 +333,7 @@ There's a range of factors that might impact the performance of this service, the
The main settings that can be used to improve performance when querying large amounts of documents are: `SERVER_WORKERS` (the number of Flask web workers that can handle parallel requests) and `APP_BULK_NPROC` (threads for annotation processing).

## MedCAT library
MedCAT parameters are defined in the selected `envs/env_medcat*` file.

For details on available MedCAT parameters please refer to [the official GitHub repository](https://github.com/CogStack/cogstack-nlp/blob/main/medcat-v2/).
3 changes: 2 additions & 1 deletion medcat-service/env/app.env
@@ -13,6 +13,7 @@ APP_MODEL_VOCAB_PATH=/cat/models/medmen/vocab.dat

# optionally, can include multiple models for meta tasks, separated using ':'
APP_MODEL_META_PATH_LIST=/cat/models/medmen/Status
APP_MODEL_REL_PATH_LIST=

# MedCAT Model Pack path
# IMPORTANT: if this parameter has value IT WILL BE LOADED FIRST OVER EVERYTHING ELSE (CDB, Vocab, MetaCATs, etc.) declared above.
@@ -39,4 +40,4 @@ APP_TORCH_THREADS=8

# GPU SETTING
# CAUTION, use only if you are using the GPU docker image.
APP_CUDA_DEVICE_COUNT=1
5 changes: 3 additions & 2 deletions medcat-service/env/app_deid.env
@@ -13,11 +13,12 @@ APP_MODEL_VOCAB_PATH=/cat/models/medmen/vocab.dat

# optionally, can include multiple models for meta tasks, separated using ':'
APP_MODEL_META_PATH_LIST=/cat/models/medmen/Status
APP_MODEL_REL_PATH_LIST=

# MedCAT Model Pack path
# IMPORTANT: if this parameter has value IT WILL BE LOADED FIRST OVER EVERYTHING ELSE (CDB, Vocab, MetaCATs, etc.) declared above.
# Respect the same paths as above : /cat/models/model_pack_name.zip
APP_MEDCAT_MODEL_PACK=/cat/models/de_id_base.zip
APP_MEDCAT_MODEL_PACK=

# optionally, an filter the reported concepts by CUIs
# APP_MODEL_CUI_FILTER_PATH=/cat/models/cui_filter.txt
@@ -39,4 +40,4 @@ APP_TORCH_THREADS=8

# GPU SETTING
# CAUTION, use only if you are using the GPU docker image.
APP_CUDA_DEVICE_COUNT=1
2 changes: 1 addition & 1 deletion medcat-service/env/general.env
@@ -5,4 +5,4 @@
# remove if this causes issues on any other platform
# possible values: amd64, arm64
CPU_ARCHITECTURE=amd64
DOCKER_DEFAULT_PLATFORM=linux/${CPU_ARCHITECTURE:-amd64}
25 changes: 5 additions & 20 deletions medcat-service/env/medcat.env
@@ -1,25 +1,10 @@
TYPE=NOT_UMLS

# IMPORTANT : log level set
# CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10, NOTSET - 0
LOG_LEVEL=40

NESTED_ENTITIES=False

CNTX_SPAN=9
CNTX_SPAN_SHORT=3
MIN_CUI_COUNT=30000
MIN_CUI_COUNT_STRICT=-1
MIN_ACC=0.2
MIN_ACC_TH=0.2

LEARNING_RATE=0.1
ANNEAL=False
KEEP_PUNCT=":|."
MEDCAT_LOG_LEVEL=40

# can be left empty if you want to use the CDB spacy model, otherwise, possible values are : en_core_sci_lg, en_core_sci_md, etc...
# please note that only SpaCy models that are installed on the docker container are usable (take a look at the .Dockerfile and install any other custom models you may want to use)
SPACY_MODEL=
MEDCAT_SPACY_MODEL=

# IMPORTANT:
# Mode in which annotation entities should be outputted in the JSON response,
@@ -30,7 +15,7 @@ SPACY_MODEL=
# Be mindful of this option as it can affect other services that rely directly on the responses of the service
# (the NiFi groovy scripts and annotation ingester are two such services that process the output, and so they might require further customisation)
# POSSIBLE VALUES: [list, dict], if left empty then "dict" is the default.
ANNOTATIONS_ENTITY_OUTPUT_MODE=dict
MEDCAT_ANNOTATIONS_ENTITY_OUTPUT_MODE=dict

DEID_MODE=False
DEID_REDACT=False
MEDCAT_DEID_MODE=False
MEDCAT_DEID_REDACT=False
25 changes: 5 additions & 20 deletions medcat-service/env/medcat_deid.env
@@ -1,25 +1,10 @@
TYPE=NOT_UMLS

# IMPORTANT : log level set
# CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10, NOTSET - 0
LOG_LEVEL=40

NESTED_ENTITIES=False

CNTX_SPAN=9
CNTX_SPAN_SHORT=3
MIN_CUI_COUNT=30000
MIN_CUI_COUNT_STRICT=-1
MIN_ACC=0.2
MIN_ACC_TH=0.2

LEARNING_RATE=0.1
ANNEAL=False
KEEP_PUNCT=:|.
MEDCAT_LOG_LEVEL=40

# can be left empty if you want to use the CDB spacy model, otherwise, possible values are : en_core_sci_lg, en_core_sci_md, etc...
# please note that only SpaCy models that are installed on the docker container are usable (take a look at the .Dockerfile and install any other custom models you may want to use)
SPACY_MODEL=
MEDCAT_SPACY_MODEL=

# IMPORTANT:
# Mode in which annotation entities should be outputted in the JSON response,
@@ -30,7 +15,7 @@ SPACY_MODEL=
# Be mindful of this option as it can affect other services that rely directly on the responses of the service
# (the NiFi groovy scripts and annotation ingester are two such services that process the output, and so they might require further customisation)
# POSSIBLE VALUES: [list, dict], if left empty then "dict" is the default.
ANNOTATIONS_ENTITY_OUTPUT_MODE=dict
MEDCAT_ANNOTATIONS_ENTITY_OUTPUT_MODE=dict

DEID_MODE=True
DEID_REDACT=True
MEDCAT_DEID_MODE=True
MEDCAT_DEID_REDACT=True
26 changes: 22 additions & 4 deletions medcat-service/export_env_vars.sh
@@ -1,4 +1,10 @@
#!/bin/bash
#!/usr/bin/env bash

# Enable strict mode (without -e to avoid exit-on-error)
set -uo pipefail


echo "🔧 Running $(basename "${BASH_SOURCE[0]}")..."

set -a

@@ -10,8 +16,20 @@ env_files=("env/general.env"
)


for env_file in "${env_files[@]}"; do
    if [ -f "$env_file" ]; then
        echo "✅ Sourcing $env_file"
        # shellcheck disable=SC1090
        source "$env_file"
    else
        echo "⚠️ Skipping missing env file: $env_file"
    fi
done

# Disable auto-export
set +a

# Restore safe defaults for interactive/dev shell
set +u
set +o pipefail