![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# DocMapperModel

In this notebook, we will examine the `DocMapperModel` annotator.

This annotator maps document-type strings to values from a predefined dictionary, without using any machine learning or deep learning model.


**📖 Learning Objectives:**

1. Understand how to map document-level strings by using a predefined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb)

Python Documentation: [DocMapperModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/docmapper/index.html#sparknlp_jsl.annotator.chunker.docmapper.DocMapperModel)

Scala Documentation: [DocMapperModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/chunk_classification/resolution/DocMapperModel.html)


## **📜 Background**


Any pretrained `ChunkMapperModel` can be loaded with `DocMapperModel`. As its name suggests, it maps short strings directly from the output of `DocumentAssembler`, so no intermediate annotator is needed to convert the strings to the `Chunk` type that `ChunkMapperModel` expects.
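
To make the idea concrete, here is a minimal pure-Python sketch of dictionary-based mapping. It is not the annotator's implementation: `TOY_DICTIONARY` and `map_document` are hypothetical names used only for illustration, and the entry values are taken from the outputs shown later in this notebook. The point is that the whole document string acts as the lookup key, so no model inference is involved.

```python
# Hypothetical toy dictionary: key -> {relation: [values]}
TOY_DICTIONARY = {
    "coumadin": {
        "action": ["anticoagulant"],
        "treatment": ["cerebrovascular accident", "pulmonary embolism"],
    },
}


def map_document(text, dictionary):
    """Look up the whole document text as one key; return None when it is missing."""
    return dictionary.get(text)


print(map_document("coumadin", TOY_DICTIONARY))
print(map_document("unknown drug", TOY_DICTIONARY))
```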

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [5]:
spark

In [8]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**
- Input: `DOCUMENT`
- Output: `LABEL_DEPENDENCY`

## **🔎 Parameters**


- `setRels` *(List[str])*: Relations to use when mapping the document
- `setLowerCase` *(Boolean)*: Whether to map the documents in lower case (Default: True)
- `setAllowMultiTokenChunk` *(Boolean)*: Whether multi-token strings may be mapped; when `False`, strings made of more than one whitespace-separated token are skipped (Default: True)
- `setMultivaluesRelations` *(Boolean)*: Whether to return all values in a relation together or separately (Default: False)





### `setRels()`

This parameter selects which relation types of the pretrained mapper model are returned.

In [6]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"])


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


We can see the action and treatment mappings since we set `.setRels(["action", "treatment"])`. 

In [9]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                 

This time let's set only "action" and see the results. 

In [10]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action"])


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [11]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+-----------------+--------+------------------------------------------------------------+
|ner_chunk|mapping_result   |relation|all_mappings                                                |
+---------+-----------------+--------+------------------------------------------------------------+
|Dermovate|anti-inflammatory|action  |corticosteroids::: dermatological preparations:::very strong|
|Aspagin  |analgesic        |action  |anti-inflammatory:::antipyretic                             |
+---------+-----------------+--------+------------------------------------------------------------+



As seen above, we have only "action" mappings. 
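
Conceptually, selecting relations behaves like filtering a dictionary entry by key. The sketch below is plain Python under that assumption; `ENTRY` and `select_relations` are hypothetical and are not part of the Spark NLP API.

```python
# Hypothetical entry; values copied from the outputs shown in this notebook
ENTRY = {
    "action": ["anticoagulant"],
    "treatment": ["cerebrovascular accident"],
}


def select_relations(entry, rels):
    """Keep only the requested relations, in the order they were requested."""
    return {rel: entry[rel] for rel in rels if rel in entry}


print(select_relations(ENTRY, ["action", "treatment"]))
print(select_relations(ENTRY, ["action"]))
```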

### `setLowerCase()`

This parameter controls whether matching is done in lower case. <br/>

First, we will use `setLowerCase(False)` and see the result. <br/>

We expect no matches, since our example drug names (Dermovate, Aspagin) are capitalized. Without lowercasing, a drug name must exactly match one of the drug names the model was trained with, so neither example returns a mapping.

In [12]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(False)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [13]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+--------------+--------+------------+
|ner_chunk|mapping_result|relation|all_mappings|
+---------+--------------+--------+------------+
|Dermovate|NONE          |null    |null        |
|Aspagin  |NONE          |null    |null        |
+---------+--------------+--------+------------+



This time we will set `setLowerCase(True)` and see the results. 

In [14]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(True)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [15]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                 

As seen above, the model returned mappings even though the capitalized drug names do not exactly match the drug names in the training data.
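
The behaviour above can be imitated with a small sketch. It assumes the dictionary keys are stored in lower case and that the input is lowercased before lookup when the option is on; both are assumptions made for illustration, and `KEYS` and `lookup` are hypothetical names.

```python
# Hypothetical lowercase keys; values copied from the outputs above
KEYS = {"dermovate": "anti-inflammatory", "aspagin": "analgesic"}


def lookup(text, lower_case=True):
    """Return the mapped value, or "NONE" when the key is not found."""
    key = text.lower() if lower_case else text
    return KEYS.get(key, "NONE")


print(lookup("Dermovate", lower_case=False))
print(lookup("Dermovate", lower_case=True))
```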

### `setAllowMultiTokenChunk()`

If a chunk consists of multiple tokens separated by whitespace, we can filter it out with the `setAllowMultiTokenChunk()` parameter.

In [16]:
test_data = spark.createDataFrame([["Warfarina Lusa"], ["amlodipine"], ["coumadin"], ["Aspagin"], ["metamorfin"]]).toDF("text")

First, we will set `setAllowMultiTokenChunk(False)`. We therefore expect no mapping for the "Warfarina Lusa" chunk, since it is a multi-token chunk.

In [17]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(True) \
    .setAllowMultiTokenChunk(False)



mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [18]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk     |mapping_result          |relation |all_mappings                                                                                                                                                                                                           |
+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Warfarina Lusa|NONE                    |null     |null                                                                                                                                             

As seen above, there is no mapping for "Warfarina Lusa". <br/>

This time we will set `setAllowMultiTokenChunk(True)` and check the results. 

In [19]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(True) \
    .setAllowMultiTokenChunk(True)

mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [20]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk     |mapping_result          |relation |all_mappings                                                                                                                                                                                                           |
+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Warfarina Lusa|anticoagulant           |action   |                                                                                                                                                 

As seen above, our model returned mappings for "Warfarina Lusa". 
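
A minimal sketch of the same gate in plain Python, assuming that "multi-token" means more than one whitespace-separated token. `TOY_KEYS`, `is_multi_token`, and `lookup` are hypothetical helpers, not library functions.

```python
# Hypothetical keys; values copied from the outputs above
TOY_KEYS = {"warfarina lusa": "anticoagulant", "coumadin": "anticoagulant"}


def is_multi_token(text):
    """True when the text has more than one whitespace-separated token."""
    return len(text.split()) > 1


def lookup(text, allow_multi_token=True):
    """Skip multi-token inputs when they are not allowed, else do a plain lookup."""
    if not allow_multi_token and is_multi_token(text):
        return "NONE"
    return TOY_KEYS.get(text.lower(), "NONE")


print(lookup("Warfarina Lusa", allow_multi_token=False))
print(lookup("Warfarina Lusa", allow_multi_token=True))
```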

### `setMultivaluesRelations()`

This parameter decides whether all values in a relation are returned together or separately. The default value is `False`.

A mapper model with `setMultivaluesRelations(True)`: 

In [21]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(True) \
    .setAllowMultiTokenChunk(True) \
    .setMultivaluesRelations(True)

mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [22]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+--------------+------------------------+---------+
|ner_chunk     |mapping_result          |relation |
+--------------+------------------------+---------+
|Warfarina Lusa|anticoagulant           |action   |
|Warfarina Lusa|heart disease           |treatment|
|Warfarina Lusa|cerebrovascular accident|treatment|
|Warfarina Lusa|pulmonary embolism      |treatment|
|Warfarina Lusa|rheumatic heart disease |treatment|
|Warfarina Lusa|heart attack            |treatment|
|Warfarina Lusa|af                      |treatment|
|Warfarina Lusa|embolization            |treatment|
|amlodipine    |NONE                    |null     |
|coumadin      |anticoagulant           |action   |
|coumadin      |cerebrovascular accident|treatment|
|coumadin      |pulmonary embolism      |treatment|
|coumadin      |heart attack            |treatment|
|coumadin      |af                      |treatment|
|coumadin      |embolization            |treatment|
|Aspagin       |analgesic               |action   |
|Aspagin    

A mapper model with `setMultivaluesRelations(False)`: 

In [23]:
#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setLowerCase(True) \
    .setAllowMultiTokenChunk(True) \
    .setMultivaluesRelations(False)

mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


res = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [24]:
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_relations")).show(truncate=False)

+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk     |mapping_result          |relation |all_relations                                                                                                                                                                                                          |
+--------------+------------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Warfarina Lusa|anticoagulant           |action   |                                                                                                                                                 

As seen above, since we set `setMultivaluesRelations(False)`, each relation produces a single result, and the remaining values for that relation appear in the metadata under the **all_relations** column.
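
The two output shapes can be sketched as follows. This is an illustration inferred from the tables in this notebook, where the first value is the result and the remaining values are joined with `:::`; `to_rows` is a hypothetical helper, and leaving `all_relations` empty in the multi-value case is an assumption, since that column is not displayed there.

```python
def to_rows(relation, values, multivalues=False):
    """Turn the values of one relation into output rows."""
    if not values:
        return []
    if multivalues:
        # One row per value
        return [{"result": v, "relation": relation, "all_relations": ""} for v in values]
    # One row: first value as the result, the rest joined with ":::"
    return [{
        "result": values[0],
        "relation": relation,
        "all_relations": ":::".join(values[1:]),
    }]


aspagin_action = ["analgesic", "anti-inflammatory", "antipyretic"]
print(to_rows("action", aspagin_action, multivalues=False))
print(to_rows("action", aspagin_action, multivalues=True))
```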

## Lexical Fuzzy Matching Options in the ChunkMapperModel annotator

There are multiple options to achieve fuzzy matching using the `ChunkMapperModel`: <br/>

**Partial Token NGram Fingerprinting**: Especially useful when two frequent cases combine: there are noisy, non-informative tokens at the beginning or end of the chunk, and the order of the tokens is not strictly relevant, e.g. stomach acute pain --> acute pain stomach ; metformin 100 mg --> metformin.

Parameters that can be used in order to enable that feature: <br/>

- `setEnableTokenFingerprintMatching()` *(Boolean)*: Whether to apply partial token Ngram fingerprint matching; this will create matching keys with partial Ngrams driven by three params: minTokenNgramFingerprint, maxTokenNgramFingerprint, maxTokenNgramDroppingCharsRatio (Default: False)
- `setMinTokenNgramFingerprint()` *(Integer)*: When enableTokenFingerprintMatching is true, the min number of tokens for partial Ngrams in Fingerprint (Default: 2)
- `setMaxTokenNgramFingerprint()` *(Integer)*: When enableTokenFingerprintMatching is true, the max number of tokens for partial Ngrams in Fingerprint (Default: 3)
- `setMaxTokenNgramDroppingCharsRatio()` *(Float)*: When enableTokenFingerprintMatching is true, this value drives the max number of tokens that may be dropped, based on the maximum ratio of chars allowed to be dropped from the full chunk; to use all Ngrams as keys, no matter how short the final chunk is, set this param to 1.0 (Default: 0.0) <br/>
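
One way to picture token Ngram fingerprinting is sketched below. It is a simplified illustration, not the library's algorithm: keys are built from contiguous token Ngrams whose tokens are sorted, so word order stops mattering, and the dropping-ratio parameter is not modelled at all. `token_ngram_keys` is a hypothetical helper.

```python
def token_ngram_keys(text, min_n=2, max_n=3):
    """Build order-insensitive keys from contiguous token Ngrams."""
    tokens = text.lower().split()
    keys = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            # Sorting the tokens makes "stomach acute pain" and
            # "acute pain stomach" produce the same key
            keys.add(" ".join(sorted(tokens[start:start + n])))
    return keys


shared = token_ngram_keys("stomach acute pain") & token_ngram_keys("acute pain stomach")
print(shared)

# Allowing single-token Ngrams exposes "metformin" as a key of "metformin 100 mg"
print("metformin" in token_ngram_keys("metformin 100 mg", min_n=1))
```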


**Char NGram Fingerprinting**: Especially useful in use cases that involve typos or different spacing patterns in chunks, e.g. head ache / ache head --> headache ; metformini / metformoni / metformni --> metformin

Parameters that can be used in order to enable that feature: <br/>

- `setEnableCharFingerprintMatching()` *(Boolean)*: Whether to apply char Ngram fingerprint matching (Default: False)
- `setMinCharNgramFingerprint()` *(Integer)*: When enableCharFingerprintMatching is true, the min number of chars for Ngrams in Fingerprint (Default: 2)
- `setMaxCharNgramFingerprint()` *(Integer)*: When enableCharFingerprintMatching is true, the max number of chars for Ngrams in Fingerprint (Default: 3) <br/>
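
A rough sketch of the character-level idea follows, assuming the fingerprint is the set of char Ngrams of the text with whitespace removed. This is an assumption for illustration only; `char_ngram_fingerprint` and `jaccard` are hypothetical helpers and the library may build its fingerprints differently.

```python
def char_ngram_fingerprint(text, min_n=2, max_n=3):
    """Set of char Ngrams of the lowercased text with whitespace removed."""
    compact = "".join(text.lower().split())
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(compact) - n + 1):
            grams.add(compact[start:start + n])
    return frozenset(grams)


def jaccard(a, b):
    """Share of Ngrams that two fingerprints have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Different spacing, same fingerprint
print(char_ngram_fingerprint("head ache") == char_ngram_fingerprint("headache"))

# A typo keeps most Ngrams intact
print(jaccard(char_ngram_fingerprint("metformini"), char_ngram_fingerprint("metformin")))
```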


**Fuzzy Distance (Slow)**: Especially useful when the mapping can be defined in terms of edit-distance thresholds, using character-based functions such as Levenshtein, Hamming, and LongestCommonSubsequence, or token-based functions such as Cosine and Jaccard.

Parameters that can be used in order to enable that feature: <br/>

- `setEnableFuzzyMatching()` *(Boolean)*: Whether to apply fuzzy matching (Default: False)

- `setFuzzyMatchingDistanceThresholds()` *(Float)*: When enableFuzzyMatching is true, the threshold value for distance

- `setFuzzyMatchingDistances()` *(List[str])*: When enableFuzzyMatching is true, this array contains the distances to calculate; possible values are: levenshtein, longest-common-subsequence, cosine, jaccard (Default: levenshtein)

- `setFuzzyDistanceScalingMode()` *(String)*: When enableFuzzyMatching is true, the scaling mode for Integer Edit Distances; possible values are: left, right, long, short, none (Default: long)
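
To illustrate how an edit distance and a threshold can work together, here is a small sketch. The Levenshtein function is the standard dynamic-programming algorithm; treating the threshold as the maximum allowed distance after dividing by the longer string's length is an assumption for illustration, and `fuzzy_candidates` is a hypothetical helper rather than a library call.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_candidates(text, keys, threshold):
    """Keep keys whose length-scaled distance is at or below the threshold."""
    matches = []
    for key in keys:
        longest = max(len(text), len(key)) or 1
        scaled = levenshtein(text, key) / longest
        if scaled <= threshold:
            matches.append((key, scaled))
    return sorted(matches, key=lambda pair: pair[1])


print(fuzzy_candidates("coumadin 5 mg", ["coumadin", "warfarin"], threshold=0.6))
```

In this sketch a larger threshold admits more distant keys, which also makes unrelated matches more likely.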



The mapping logic runs these options in the order listed above and, within each option, tries the longest keys first as an intuitive way to minimize false positives.



**NOTE**: We will cover Partial Token NGram Fingerprinting and Char NGram Fingerprinting in the `DocMapperApproach()` notebook, since their effects are only visible on a model trained with these features enabled.

### `setEnableFuzzyMatching()`
This parameter is used to decide whether to apply fuzzy matching or not. 

Firstly, we will set `setEnableFuzzyMatching(False)` and see the result.

In [25]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(False) 


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

result_df = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [26]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result, 
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+-----------+------------------------+---------+
|ner_chunk         |fixed_chunk|action_mapping_result   |relation |
+------------------+-----------+------------------------+---------+
|Lusa Warfarina 5mg|null       |NONE                    |null     |
|amlodipine 10     |null       |NONE                    |null     |
|Aspaginaspa       |null       |NONE                    |null     |
|coumadin 5 mg     |null       |NONE                    |null     |
|coumadin          |coumadin   |anticoagulant           |action   |
|coumadin          |coumadin   |cerebrovascular accident|treatment|
|metamorfin        |null       |NONE                    |null     |
+------------------+-----------+------------------------+---------+



Now, we will set `setEnableFuzzyMatching(True)` to see the difference.

In [27]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(True) \
    .setFuzzyMatchingDistanceThresholds(0.6)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

result_df = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [28]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result, 
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------------------+----------------------+---------+
|ner_chunk         |fixed_chunk               |action_mapping_result |relation |
+------------------+--------------------------+----------------------+---------+
|Lusa Warfarina 5mg|pravastatina fg           |lipid modifying agents|action   |
|Lusa Warfarina 5mg|pravastatina fg           |heart disease         |treatment|
|Lusa Warfarina 5mg|warfarin pmcs             |anticoagulant         |action   |
|Lusa Warfarina 5mg|warfarin pmcs             |heart disease         |treatment|
|Lusa Warfarina 5mg|warfarina mk              |anticoagulant         |action   |
|Lusa Warfarina 5mg|warfarina mk              |heart disease         |treatment|
|amlodipine 10     |nisoldipine yd            |hypertension          |treatment|
|amlodipine 10     |boie amlodipine besilate  |antianginal           |action   |
|amlodipine 10     |boie amlodipine besilate  |hypertension          |treatment|
|amlodipine 10     |azathiop

As seen above, for chunks that have no exact match, fuzzy matching returned mappings based on the fuzzy distance.

### `setFuzzyMatchingDistanceThresholds()`
This parameter is used to set the threshold value for the distance when `enableFuzzyMatching` is `True`. 

Let's define `DocMapperModel` with `setFuzzyMatchingDistanceThresholds(0.8)` and see the difference. 

In [29]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(True) \
    .setFuzzyMatchingDistanceThresholds(0.8)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

result_df = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [30]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result, 
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+----------------------------------------+------------------------------------------------------+---------+
|ner_chunk         |fixed_chunk                             |action_mapping_result                                 |relation |
+------------------+----------------------------------------+------------------------------------------------------+---------+
|Lusa Warfarina 5mg|liquid soap pre-op wash (triclosan 1%)  |bactericidal                                          |action   |
|Lusa Warfarina 5mg|liquid soap pre-op wash (triclosan 1%)  |blackheads                                            |treatment|
|Lusa Warfarina 5mg|fluvastatina pharmathen international   |hypocholesterolemic                                   |action   |
|Lusa Warfarina 5mg|fluvastatina pharmathen international   |heterozygous familial hypercholesterolemia            |treatment|
|Lusa Warfarina 5mg|supositorios glicerina micralax infantil|laxative                                          

### `setFuzzyMatchingDistances()`
When `enableFuzzyMatching` is `True`, this parameter accepts an array of the distances to calculate; possible values are: `levenshtein`, `longest-common-subsequence`, `cosine`, `jaccard`.

Now, we will define a `DocMapperModel` with `setFuzzyMatchingDistances(["longest-common-subsequence"])` and see the difference.

In [31]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(True) \
    .setFuzzyMatchingDistanceThresholds(0.8) \
    .setFuzzyMatchingDistances(["longest-common-subsequence"])


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

result_df = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [32]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result, 
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+----------------------------------------+---------------------------------------------+---------+
|ner_chunk         |fixed_chunk                             |action_mapping_result                        |relation |
+------------------+----------------------------------------+---------------------------------------------+---------+
|Lusa Warfarina 5mg|panwarfin                               |anticoagulant                                |action   |
|Lusa Warfarina 5mg|panwarfin                               |heart disease                                |treatment|
|Lusa Warfarina 5mg|warfarin                                |anticoagulant                                |action   |
|Lusa Warfarina 5mg|warfarin                                |heart disease                                |treatment|
|Lusa Warfarina 5mg|cipla-warfarin                          |anticoagulant                                |action   |
|Lusa Warfarina 5mg|cipla-warfarin                      

As seen above, the fuzzy matching results differ from those obtained with the default distance, `levenshtein`.
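
For intuition about why the candidates change, the sketch below computes a longest-common-subsequence distance in plain Python. The formula `len(a) + len(b) - 2 * LCS` is one common definition and is used here as an assumption; the library's exact computation and scaling are not shown in this notebook.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(a, b):
    """Characters that must be deleted from both strings to leave only the LCS."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# "warfarin" is fully contained in the noisy input, so only deletions are needed
print(lcs_length("lusa warfarina 5mg", "warfarin"))
print(lcs_distance("lusa warfarina 5mg", "warfarin"))
```

Because this distance has no substitutions, keys that are contained in the input as a subsequence tend to score well, which fits the `warfarin`-like candidates in the output above.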

### `setFuzzyDistanceScalingMode()`
When `enableFuzzyMatching` is `True`, this parameter is used to decide the scaling mode for Integer Edit Distances; possible values are: `left`, `right`, `long`, `short`, `none`. 

Now, we will define a `DocMapperModel` with `setFuzzyDistanceScalingMode("right")` and see the difference.


In [33]:
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

#drug_action_treatment_mapper 
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(True) \
    .setFuzzyMatchingDistanceThresholds(0.8) \
    .setFuzzyMatchingDistances(["longest-common-subsequence"]) \
    .setFuzzyDistanceScalingMode("right")


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

result_df = mapperPipeline.fit(test_data).transform(test_data)

drug_action_treatment_mapper download started this may take some time.
[OK!]


In [34]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result, 
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------------------------------------+---------------------------------------------+---------+
|ner_chunk         |fixed_chunk                                 |action_mapping_result                        |relation |
+------------------+--------------------------------------------+---------------------------------------------+---------+
|Lusa Warfarina 5mg|warfarina mk                                |anticoagulant                                |action   |
|Lusa Warfarina 5mg|warfarina mk                                |heart disease                                |treatment|
|amlodipine 10     |perindopril tert-butylamine/amlodipine a    |agents acting in the renin-angiotensin system|action   |
|amlodipine 10     |perindopril tert-butylamine/amlodipine a    |heart disease                                |treatment|
|amlodipine 10     |ritemed felodipine                          |antianginal                                  |action   |
|amlodipine 10     |rite

As seen above, the fuzzy matching results differ from those obtained with the default scaling mode, `long`.
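
A possible reading of the scaling modes is sketched below: an integer edit distance is divided by the length of the left string, the right string, the longer one, or the shorter one, or left unscaled. This interpretation, including which string counts as "left" and which as "right", is an assumption; `scale_distance` is a hypothetical helper, not part of the library.

```python
def scale_distance(distance, left, right, mode="long"):
    """Scale an integer edit distance by a string length chosen by `mode`."""
    if mode == "none":
        return float(distance)
    divisors = {
        "left": len(left),
        "right": len(right),
        "long": max(len(left), len(right)),
        "short": min(len(left), len(right)),
    }
    if mode not in divisors:
        raise ValueError(f"unknown scaling mode: {mode}")
    # Guard against empty strings to avoid division by zero
    return distance / max(divisors[mode], 1)


text, key = "lusa warfarina 5mg", "warfarin"
for mode in ["left", "right", "long", "short", "none"]:
    print(mode, scale_distance(10, text, key, mode))
```

Under this reading, the same raw distance can fall below or above a fixed threshold depending on the mode, which would explain why the candidate list changes.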