![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# DocMapperApproach

In this notebook, we will examine how to use `DocMapperApproach` to create a custom mapper model from a given JSON file.

This annotator creates a mapper that maps document-level strings using a pre-defined dictionary, without any machine learning or deep learning model.




**📖 Learning Objectives:**

1. Understand how to create a mapper model by using a pre-defined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb)

Python Documentation: [DocMapperApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/docmapper/index.html#sparknlp_jsl.annotator.chunker.docmapper.DocMapperApproach.name)

Scala Documentation: [DocMapperApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/chunk_classification/resolution/DocMapperApproach.html)


## **📜 Background**


The `DocMapperApproach` loads a `JsonDictionary` that contains the relations to be mapped by the `DocMapperModel`.

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**
- Input: `DOCUMENT`
- Output: `LABEL_DEPENDENCY`

## **🔎 Parameters**


- `setDictionary` *(Str)*: Path to the JsonDictionary file that contains the mappings
- `setRels` *(List[Str])*: Relations to use when mapping the document
- `setLowerCase` *(Boolean)*: Whether to lowercase the documents before mapping (Default: True)
- `setAllowMultiTokenChunk` *(Boolean)*: Whether to map documents that contain more than one token; when False, multi-token documents are skipped (Default: True)
- `setMultivaluesRelations` *(Boolean)*: Whether to return every value of a relation as a separate result (True) or only the first value, with the remaining values in `all_relations` (False) (Default: False)





### `setDictionary()`

This parameter sets the path of the JsonDictionary file that contains the mappings.

Let's create an example JSON file, then create a drug mapper model. This model will match the given drug name (only "metformin" in our example) with the corresponding action and treatment.
The JSON file should have the following format:

In [None]:
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)
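
Before passing the file to `setDictionary`, it can help to check that it follows the layout above. The following is a minimal sketch in plain Python. `validate_mapper_dictionary` is a hypothetical helper written for this notebook, not part of the `johnsnowlabs` library, and it checks only the structure visible in the example above.

```python
def validate_mapper_dictionary(data):
    """Return a list of layout problems (empty list means the layout looks fine).

    Hypothetical helper: it only checks the structure shown in this notebook.
    """
    problems = []
    mappings = data.get("mappings") if isinstance(data, dict) else None
    if not isinstance(mappings, list):
        return ["top-level 'mappings' must be a list"]
    for i, entry in enumerate(mappings):
        if not isinstance(entry, dict):
            problems.append(f"mappings[{i}] must be an object")
            continue
        if not isinstance(entry.get("key"), str):
            problems.append(f"mappings[{i}]: 'key' must be a string")
        relations = entry.get("relations")
        if not isinstance(relations, list):
            problems.append(f"mappings[{i}]: 'relations' must be a list")
            continue
        for j, rel in enumerate(relations):
            where = f"mappings[{i}].relations[{j}]"
            if not isinstance(rel, dict) or not isinstance(rel.get("key"), str):
                problems.append(f"{where}: 'key' must be a string")
                continue
            values = rel.get("values")
            if not (isinstance(values, list) and all(isinstance(v, str) for v in values)):
                problems.append(f"{where}: 'values' must be a list of strings")
    return problems


example = {
    "mappings": [
        {
            "key": "metformin",
            "relations": [
                {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
                {"key": "treatment", "values": ["diabetes", "t2dm"]},
            ],
        }
    ]
}

print(validate_mapper_dictionary(example))
```

An empty list means no layout problems were found; each string in a non-empty list points at the offending entry.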

Let's create a pipeline, pass this JSON file's path through the `setDictionary` parameter, and see it in action.

In [None]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

chunkerMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"])

pipeline = nlp.Pipeline().setStages([document_assembler,
                                     chunkerMapper])

Fit/transform the pipeline with a sample text

In [None]:
test_data = spark.createDataFrame([["metformin"]]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

Checking the mapper results

In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)


In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings          |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic  |action  |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+



As seen above, we mapped the corresponding "action" relation of the given document based on the pre-defined JsonDictionary.

### `setRels()`

This parameter selects the relation types that the mapper model will return.

We will set `.setRels(["action", "treatment"])` so that we can see the action and treatment mappings.

In [None]:
chunkMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"])

pipeline = nlp.Pipeline().setStages([document_assembler,
                                 chunkMapper])

Fit/transform the pipeline with a sample text

In [None]:
test_data = spark.createDataFrame([["metformin"]]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

Checking mapper results

In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)


In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+--------------+---------+----------------------+
|ner_chunk|mapping_result|relation |all_mappings          |
+---------+--------------+---------+----------------------+
|metformin|hypoglycemic  |action   |Drugs Used In Diabetes|
|metformin|diabetes      |treatment|t2dm                  |
+---------+--------------+---------+----------------------+



As seen above, we mapped the corresponding "action" and "treatment" relations of the given document based on the pre-defined JsonDictionary.
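
The effect of `setRels` can be mimicked with an ordinary dictionary lookup. The sketch below is plain Python and only reproduces the behaviour visible in the tables above; `map_document` is a hypothetical function, not the library's implementation.

```python
def map_document(text, dictionary, rels):
    """Return (result, relation, remaining_values) rows for an exact key match.

    Hypothetical illustration: only relations listed in `rels` are returned.
    """
    for entry in dictionary["mappings"]:
        if entry["key"] != text:
            continue
        rows = []
        for rel in entry["relations"]:
            if rel["key"] in rels and rel["values"]:
                first, *rest = rel["values"]
                rows.append((first, rel["key"], rest))
        return rows
    return []  # no key matched


drug_dictionary = {
    "mappings": [
        {
            "key": "metformin",
            "relations": [
                {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
                {"key": "treatment", "values": ["diabetes", "t2dm"]},
            ],
        }
    ]
}

for row in map_document("metformin", drug_dictionary, ["action", "treatment"]):
    print(row)
```

Passing `["action"]` instead of `["action", "treatment"]` leaves only the first row, which mirrors the difference between the two pipelines above.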

### `setLowerCase()`

This parameter controls whether the documents are lowercased before they are matched against the keys. <br/>

First, we will use `setLowerCase(False)` and see the result. <br/>

With this setting we expect no match when the casing of the input documents (e.g. Amlodipine, ASPAGIN) differs from the keys in the dictionary: unless a document matches a key in the JSON exactly, no mapping is returned.

Creating example Json:

In [None]:
data_set= {
  "mappings": [
    {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        },

      {
            "key": "amlodipine",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Calcium Ions Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },

        {
            "key": "coumadin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Coagulation Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },

        {
            "key": "aspagin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Cycooxygenase Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "arthritis"
                    ]
                }
            ]
        },
        {
            "key": "metformin",
            "relations": [
                {
                  "key": "action",
                  "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
                },
                {
                  "key": "treatment",
                  "values" : ["diabetes", "t2dm"]
                }
            ]
        }






  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

We will create a mapper pipeline with `DocMapperApproach()` by setting `setLowerCase(False)`. When this parameter is `False`, the casing of the words is taken into account while matching.

In [None]:
#DocMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(False)


mapperPipeline = nlp.Pipeline().setStages([
      document_assembler,
      docMapper])

Fit/transform the pipeline.

In [None]:
test_data = spark.createDataFrame([["Amlodipine"], ["ASPAGIN"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

Mapping results

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+----------+--------------+--------+------------+
|document  |mapping_result|relation|all_mappings|
+----------+--------------+--------+------------+
|Amlodipine|NONE          |null    |null        |
|ASPAGIN   |NONE          |null    |null        |
+----------+--------------+--------+------------+



As expected, there is no match because the casing of the words differs from the keys.

This time we will set `setLowerCase(True)` and see the results.

In [None]:
#DocMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(True)


mapperPipeline = nlp.Pipeline().setStages([
      document_assembler,
      docMapper])

test_data = spark.createDataFrame([["Amlodipine"], ["ASPAGIN"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+----------+-----------------------+---------+
|document  |mapping_result         |relation |
+----------+-----------------------+---------+
|Amlodipine|Calcium Ions Inhibitor |action   |
|Amlodipine|hypertension           |treatment|
|ASPAGIN   |Cycooxygenase Inhibitor|action   |
|ASPAGIN   |arthritis              |treatment|
+----------+-----------------------+---------+



As seen above, with `setLowerCase(True)` the model mapped the drug names even though their casing differs from the keys in the dictionary.
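
The difference between the two settings can be illustrated with a small case-aware lookup. This is a plain-Python sketch, not the annotator's code. `find_key` is a hypothetical helper, and it assumes that both the document and the keys are lowercased when the option is on, which is consistent with the outputs in this notebook.

```python
def find_key(text, keys, lower_case=True):
    """Return the dictionary key that matches `text`, or None if nothing matches.

    Assumption: with lower_case=True both sides are lowercased before comparing.
    """
    if lower_case:
        lowered = {key.lower(): key for key in keys}
        return lowered.get(text.lower())
    return text if text in keys else None


keys = ["amlodipine", "aspagin"]

for document in ["Amlodipine", "ASPAGIN"]:
    print(document, find_key(document, keys, lower_case=False), find_key(document, keys, lower_case=True))
```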

### `setAllowMultiTokenChunk()`

If a document contains multiple tokens separated by whitespace, we can control whether it is mapped by using the `setAllowMultiTokenChunk()` parameter.

First, we will set `setAllowMultiTokenChunk(False)`. We therefore expect no mapping for "Warfarina Lusa", since it is a multi-token document.

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(False)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data= spark.createDataFrame([["Warfarina Lusa"], ["Aspagin"], ["coumadin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

Checking the results

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+--------------+-----------------------+---------+------------+
|ner_chunk     |mapping_result         |relation |all_mappings|
+--------------+-----------------------+---------+------------+
|Warfarina Lusa|NONE                   |null     |null        |
|Aspagin       |Cycooxygenase Inhibitor|action   |            |
|Aspagin       |arthritis              |treatment|            |
|coumadin      |Coagulation Inhibitor  |action   |            |
|coumadin      |hypertension           |treatment|            |
+--------------+-----------------------+---------+------------+



As seen above, there is no mapping for "Warfarina Lusa". <br/>

This time we will set `setAllowMultiTokenChunk(True)` and check the results.

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data= spark.createDataFrame([["Warfarina Lusa"], ["Aspagin"], ["coumadin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

Checking the results

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+--------------+-----------------------+---------+------------+
|ner_chunk     |mapping_result         |relation |all_mappings|
+--------------+-----------------------+---------+------------+
|Warfarina Lusa|Analgesic              |action   |Antipyretic |
|Warfarina Lusa|diabetes               |treatment|t2dm        |
|Aspagin       |Cycooxygenase Inhibitor|action   |            |
|Aspagin       |arthritis              |treatment|            |
|coumadin      |Coagulation Inhibitor  |action   |            |
|coumadin      |hypertension           |treatment|            |
+--------------+-----------------------+---------+------------+



As seen above, our model returned mappings for "Warfarina Lusa".
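
Conceptually, the option acts as a filter on the number of whitespace-separated tokens. The sketch below is a plain-Python illustration under that assumption; `mappable_documents` is a hypothetical name, not a library function.

```python
def mappable_documents(documents, allow_multi_token=True):
    """Keep the documents that would be considered for mapping.

    Assumption: a document is "multi-token" when splitting on whitespace
    yields more than one piece.
    """
    return [doc for doc in documents if allow_multi_token or len(doc.split()) <= 1]


documents = ["Warfarina Lusa", "Aspagin", "coumadin"]

print(mappable_documents(documents, allow_multi_token=False))
print(mappable_documents(documents, allow_multi_token=True))
```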

### `setMultivaluesRelations()`

This parameter is used to decide to return all values in a relation together or separately. Default value is False.

A mapper model with `setMultivaluesRelations(True)`:

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True)\
      .setMultivaluesRelations(True)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data= spark.createDataFrame([["Warfarina Lusa"], ["Aspagin"], ["coumadin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+--------------+-----------------------+---------+
|ner_chunk     |mapping_result         |relation |
+--------------+-----------------------+---------+
|Warfarina Lusa|Analgesic              |action   |
|Warfarina Lusa|Antipyretic            |action   |
|Warfarina Lusa|diabetes               |treatment|
|Warfarina Lusa|t2dm                   |treatment|
|Aspagin       |Cycooxygenase Inhibitor|action   |
|Aspagin       |arthritis              |treatment|
|coumadin      |Coagulation Inhibitor  |action   |
|coumadin      |hypertension           |treatment|
+--------------+-----------------------+---------+



A mapper model with `setMultivaluesRelations(False)`:

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action", "treatment"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True)\
      .setMultivaluesRelations(False)


mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data= spark.createDataFrame([["Warfarina Lusa"], ["Aspagin"], ["coumadin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_relations")).show(truncate=False)

+--------------+-----------------------+---------+-------------+
|ner_chunk     |mapping_result         |relation |all_relations|
+--------------+-----------------------+---------+-------------+
|Warfarina Lusa|Analgesic              |action   |Antipyretic  |
|Warfarina Lusa|diabetes               |treatment|t2dm         |
|Aspagin       |Cycooxygenase Inhibitor|action   |             |
|Aspagin       |arthritis              |treatment|             |
|coumadin      |Coagulation Inhibitor  |action   |             |
|coumadin      |hypertension           |treatment|             |
+--------------+-----------------------+---------+-------------+



As seen above, with `setMultivaluesRelations(False)` the values of a relation are not returned as separate results. Only the first value appears as the mapping result, and the remaining values are listed in the `all_relations` column.
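
The two output shapes can be sketched in plain Python as follows. `expand_relation` is a hypothetical helper that mirrors the tables above; it is not the annotator's implementation, and it leaves out any other metadata the annotator may produce.

```python
def expand_relation(relation, values, multivalues=False):
    """Build output rows for one relation of a matched key.

    multivalues=True  -> one row per value
    multivalues=False -> one row with the first value; the rest go to all_relations
    """
    if not values:
        return []
    if multivalues:
        return [{"result": value, "relation": relation} for value in values]
    return [{"result": values[0], "relation": relation, "all_relations": values[1:]}]


print(expand_relation("action", ["Analgesic", "Antipyretic"], multivalues=True))
print(expand_relation("action", ["Analgesic", "Antipyretic"], multivalues=False))
```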

## 🔎 Lexical Fuzzy Matching Options in the DocMapperApproach annotator

There are multiple options to achieve fuzzy matching using the `DocMapperApproach`: <br/>

**Partial Token NGram Fingerprinting**: Especially useful for combining two frequent use cases: when there are noisy, non-informative tokens at the beginning or end of the chunk, and when the order of the tokens in the chunk is not relevant, e.g. stomach acute pain --> acute pain stomach ; metformin 100 mg --> metformin.

Parameters that can be used in order to enable that feature: <br/>

- `setEnableTokenFingerprintMatching()` *(Boolean)*: Whether to apply partial token Ngram fingerprint matching; this will create matching keys with partial Ngrams driven by three params: minTokenNgramFingerprint, maxTokenNgramFingerprint, maxTokenNgramDropping (Default: False)

- `setMinTokenNgramFingerprint()` *(Integer)*: When enableTokenFingerprintMatching is true, the min number of tokens for partial Ngrams in Fingerprint (Default: 2)
- `setMaxTokenNgramFingerprint()` *(Integer)*: When enableTokenFingerprintMatching is true, the max number of tokens for partial Ngrams in Fingerprint (Default: 3)
- `setMaxTokenNgramDroppingCharsRatio()` *(Float)*: When enableTokenNgramMatching is true, this value drives the max amount of tokens to allow dropping based on the maximum ratio of chars allowed to be dropped from the full chunk; whenever it is desired for all Ngrams to be used as keys, no matter how short the final chunk is, this param should be set to 1.0 (Default is 0.0) <br/>
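
The idea behind token fingerprints can be sketched in plain Python. This is an illustration only: `token_fingerprint` and `partial_fingerprints` are hypothetical helpers, the way the library actually builds its keys is not shown in this notebook, and the dropping ratio is left out for brevity.

```python
def token_fingerprint(text):
    """Order-insensitive key: lowercase tokens, sorted and joined again."""
    return " ".join(sorted(text.lower().split()))


def partial_fingerprints(text, min_n, max_n):
    """Fingerprints of every contiguous token n-gram with min_n <= n <= max_n."""
    tokens = text.lower().split()
    keys = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            keys.add(" ".join(sorted(tokens[start:start + n])))
    return keys


# Token order no longer matters once both sides are fingerprinted.
print(token_fingerprint("stomach acute pain") == token_fingerprint("acute pain stomach"))

# Partial n-grams let a noisy chunk still produce the clean key.
print("metformin" in partial_fingerprints("metformin 100 mg", min_n=1, max_n=3))
```

In this sketch the single-token key "metformin" only appears when `min_n` is 1, so the minimum n-gram size directly limits how many noisy tokens can be dropped.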


**Char NGram Fingerprinting**: Especially useful in use cases that involve typos or different spacing patterns for chunks, e.g. head ache / ache head --> headache ; metformini / metformoni / metformni --> metformin

Parameters that can be used in order to enable that feature: <br/>

- `setEnableCharFingerprintMatching()` *(Boolean)*: Whether to apply char Ngram fingerprint matching (Default: False)
- `setMinCharNgramFingerprint()` *(Integer)*: When enableCharFingerprintMatching is true, the min number of chars for Ngrams in Fingerprint (Default: 2)
- `setMaxCharNgramFingerprint()` *(Integer)*: When enableCharFingerprintMatching is true, the max number of chars for Ngrams in Fingerprint (Default: 3) <br/>
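
To see why character n-grams tolerate typos and spacing differences, the sketch below compares the n-gram sets of two strings. It is plain Python and uses Jaccard overlap as a stand-in measure; `char_ngrams` and `ngram_similarity` are hypothetical helpers, and the library's actual fingerprint construction may differ.

```python
def char_ngrams(text, min_n=2, max_n=3):
    """Set of character n-grams of the lowercased text with whitespace removed."""
    compact = "".join(text.lower().split())
    return {
        compact[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(compact) - n + 1)
    }


def ngram_similarity(left, right, min_n=2, max_n=3):
    """Jaccard overlap of the two n-gram sets (0.0 when both sets are empty)."""
    a, b = char_ngrams(left, min_n, max_n), char_ngrams(right, min_n, max_n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(ngram_similarity("head ache", "headache"))
print(ngram_similarity("metformini", "metformin"))
print(ngram_similarity("metformini", "coumadin"))
```

The typo "metformini" shares most of its n-grams with "metformin" and almost none with "coumadin", which is what makes the nearest key easy to pick.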


**Fuzzy Distance (Slow)**: Especially useful when the mapping can be defined in terms of distance thresholds, using character-based functions such as Levenshtein, Hamming, and LongestCommonSubsequence, or token-based functions such as Cosine and Jaccard.

Parameters that can be used in order to enable that feature: <br/>

- `setEnableFuzzyMatching()` *(Boolean)*: Whether to apply fuzzy matching (Default: False)

- `setFuzzyMatchingDistanceThresholds()` *(Float)*: When enableFuzzyMatching is true, the threshold value for distance

- `setFuzzyMatchingDistances()` *(List[Str])*: When enableFuzzyMatching is true, this array contains the distances to calculate; possible values are: levenshtein, longest-common-subsequence, cosine, jaccard (Default: levenshtein)

- `setFuzzyDistanceScalingMode()` *(String)*: When enableFuzzyMatching is true, the scaling mode for Integer Edit Distances; possible values are: left, right, long, short, none (Default: long)
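
As a rough illustration of how an integer edit distance can be scaled into a ratio, here is a plain-Python Levenshtein distance with a scaling step. The interpretation of the scaling modes is an assumption made for this sketch (divide by the length of the left string, the right string, the longer one, the shorter one, or not at all); `levenshtein` and `scaled_distance` are hypothetical helpers, not library functions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def scaled_distance(left, right, mode="long"):
    """Edit distance divided by a length chosen by `mode` (assumed semantics)."""
    raw = levenshtein(left, right)
    if mode == "none":
        return float(raw)
    lengths = {
        "left": len(left),
        "right": len(right),
        "long": max(len(left), len(right)),
        "short": min(len(left), len(right)),
    }
    denominator = lengths[mode]
    if denominator == 0:
        return 0.0 if raw == 0 else float("inf")
    return raw / denominator


for mode in ["none", "long", "short"]:
    print(mode, scaled_distance("amlodipine 10", "amlodipine", mode))
```

Under these assumptions, a key is accepted when its scaled distance does not exceed the configured threshold, so a larger threshold admits more distant (and more often wrong) keys.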



The mapping logic runs these options in the order listed above; within each option, keys are tried from longest to shortest as an intuitive way to minimize false positives.



### `setEnableFuzzyMatching()`
This parameter is used to decide whether to apply fuzzy matching or not.

We will create a `DocMapperApproach()` with an example JSON file that has "action" and "treatment" relation types.

In [None]:
data_set= {
  "mappings": [
    {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        },

      {
            "key": "amlodipine",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Calcium Ions Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },

        {
            "key": "coumadin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Coagulation Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },

        {
            "key": "aspagin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Cycooxygenase Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "arthritis"
                    ]
                }
            ]
        },
        {
            "key": "metformin",
            "relations": [
                {
                  "key": "action",
                  "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
                },
                {
                  "key": "treatment",
                  "values" : ["diabetes", "t2dm"]
                }
            ]
        }






  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

Firstly, we will set `setEnableFuzzyMatching(False)` and see the result.

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True) \
      .setEnableFuzzyMatching(False)


mapperPipeline = nlp.Pipeline().setStages([document_assembler,
                                      docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin"], ["coumadin 5 mg"], ["metamorfin"]]).toDF("text")
result_df = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result,
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+-----------+---------------------+---------+
|ner_chunk         |fixed_chunk|action_mapping_result|relation |
+------------------+-----------+---------------------+---------+
|Lusa Warfarina 5mg|null       |NONE                 |null     |
|amlodipine 10     |null       |NONE                 |null     |
|Aspaginaspa       |null       |NONE                 |null     |
|coumadin          |coumadin   |Coagulation Inhibitor|action   |
|coumadin 5 mg     |null       |NONE                 |null     |
|metamorfin        |null       |NONE                 |null     |
+------------------+-----------+---------------------+---------+



Now, we will set `setEnableFuzzyMatching(True)` to see the difference.

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True) \
      .setEnableFuzzyMatching(True)


mapperPipeline = nlp.Pipeline().setStages([document_assembler,
                                      docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin"], ["coumadin 5 mg"], ["metamorfin"]]).toDF("text")
result_df = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result,
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+-----------+----------------------+---------+
|document          |fixed_chunk|action_mapping_result |relation |
+------------------+-----------+----------------------+---------+
|Lusa Warfarina 5mg|null       |NONE                  |null     |
|amlodipine 10     |amlodipine |Calcium Ions Inhibitor|action   |
|Aspaginaspa       |null       |NONE                  |null     |
|coumadin          |coumadin   |Coagulation Inhibitor |action   |
|coumadin 5 mg     |null       |NONE                  |null     |
|metamorfin        |null       |NONE                  |null     |
+------------------+-----------+----------------------+---------+



As seen above, the "amlodipine 10" chunk, which has no exact match in the dictionary, was resolved to "amlodipine" based on the fuzzy distance and mapped accordingly.

### `setFuzzyMatchingDistanceThresholds()`
This parameter is used to set the threshold value for the distance when `enableFuzzyMatching` is `True`.

Let's define `DocMapperApproach` with `setFuzzyMatchingDistanceThresholds(0.8)` and see the difference.





In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True) \
      .setEnableFuzzyMatching(True) \
      .setFuzzyMatchingDistanceThresholds(0.8)


mapperPipeline = nlp.Pipeline().setStages([document_assembler,
                                      docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin"], ["coumadin 5 mg"], ["metamorfin"]]).toDF("text")
result_df = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result,
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|document          |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic              |action   |
|Lusa Warfarina 5mg|aspagin       |Cycooxygenase Inhibitor|action   |
|amlodipine 10     |aspagin       |Cycooxygenase Inhibitor|action   |
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor |action   |
|Aspaginaspa       |Warfarina lusa|Analgesic              |action   |
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |Warfarina lusa|Analgesic              |action   |
|coumadin 5 mg     |aspagin       |Cycooxygenase Inhibitor|action   |
|metamorfin        |aspagin       |Cycooxygenase Inhibitor|action   |
|metamorfin        |

As seen above, after setting the distance threshold to 0.8, many more documents received mappings. The matching also became much looser: for example, "metamorfin" was mapped to "aspagin".
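One way to picture why a larger threshold admits more matches is to compare a normalized edit distance against the threshold. The snippet below is a minimal pure-Python sketch, not the annotator's implementation: the helper names (`levenshtein`, `normalized_distance`, `candidates`) are hypothetical, and dividing by the longer string's length is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Assumption: scale the integer distance by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def candidates(text: str, keys: list, threshold: float) -> list:
    """Keep every key whose normalized distance to `text` is within the threshold."""
    text = text.lower()
    return [k for k in keys if normalized_distance(text, k.lower()) <= threshold]


keys = ["Warfarina lusa", "amlodipine", "coumadin", "aspagin", "metformin"]
for threshold in (0.25, 0.8):
    # A larger threshold can only keep the same keys or add more of them.
    print(threshold, candidates("amlodipine 10", keys, threshold))
```

With a strict threshold only the near-exact key survives; as the threshold grows, keys that share very little with the input start to pass as well, which mirrors the looser matches in the table above.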

### `setFuzzyMatchingDistances()`
When `enableFuzzyMatching` is `True`, this parameter accepts an array of the distance metrics to compute; possible values are: `levenshtein`, `longest-common-subsequence`, `cosine`, `jaccard`.
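To get a feel for how two of these metrics differ, here is a small pure-Python sketch of a longest-common-subsequence length and a Jaccard similarity. It is illustrative only: the function names are hypothetical, and computing Jaccard over character bigrams is an assumption, since the notebook does not state how the annotator tokenizes strings for `jaccard` or `cosine`.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def char_bigrams(s: str) -> set:
    """Set of overlapping two-character substrings (assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the bigram intersection divided by the size of the union."""
    left, right = char_bigrams(a), char_bigrams(b)
    union = left | right
    return len(left & right) / len(union) if union else 1.0


print(lcs_length("metamorfin", "metformin"))          # shared letters kept in order
print(jaccard_similarity("metamorfin", "metformin"))  # overlap of bigram sets
```

A subsequence-based score tolerates inserted or dropped characters, whereas a set-based score such as Jaccard ignores order and repetition, so the two can rank the same candidates differently.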

Now, we will define a `DocMapperApproach` with `setFuzzyMatchingDistances(["longest-common-subsequence"])` and see the difference.

In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True) \
      .setEnableFuzzyMatching(True) \
      .setFuzzyMatchingDistanceThresholds(0.8) \
      .setFuzzyMatchingDistances(["longest-common-subsequence"])


mapperPipeline = nlp.Pipeline().setStages([document_assembler,
                                      docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin"], ["coumadin 5 mg"], ["metamorfin"]]).toDF("text")
result_df = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result,
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic              |action   |
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor |action   |
|Aspaginaspa       |Warfarina lusa|Analgesic              |action   |
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|metamorfin        |metformin     |hypoglycemic           |action   |
+------------------+--------------+-----------------------+---------+



As seen above, the fuzzy matching results changed compared to those obtained with the default distance, `levenshtein`.

### `setFuzzyDistanceScalingMode()`
When `enableFuzzyMatching` is `True`, this parameter is used to decide the scaling mode for Integer Edit Distances; possible values are: `left`, `right`, `long`, `short`, `none`.
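The notebook does not spell out what each scaling mode does. The sketch below shows one plausible reading, in which the mode selects the string length used to turn an integer edit distance into a ratio. This interpretation, and the hypothetical helper `scale_distance`, are assumptions for illustration and are not taken from the library's source.

```python
def scale_distance(distance: int, left: str, right: str, mode: str = "long") -> float:
    """Scale an integer edit distance by a length chosen by `mode` (assumed semantics)."""
    if mode == "none":
        return float(distance)  # leave the raw integer distance untouched
    lengths = {
        "left": len(left),                      # length of the first string
        "right": len(right),                    # length of the second string
        "long": max(len(left), len(right)),     # the longer of the two
        "short": min(len(left), len(right)),    # the shorter of the two
    }
    if mode not in lengths:
        raise ValueError(f"unknown scaling mode: {mode!r}")
    denominator = lengths[mode]
    return distance / denominator if denominator else 0.0


# "amlodipine 10" differs from "amlodipine" by three trailing characters.
for mode in ("left", "right", "long", "short", "none"):
    print(mode, scale_distance(3, "amlodipine 10", "amlodipine", mode))
```

Under this reading, the same raw distance yields a larger ratio when divided by the shorter string, so the choice of mode can push a candidate above or below a fixed threshold.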

Now, we will define a `DocMapperApproach` with `setFuzzyDistanceScalingMode("right")` and see the difference.


In [None]:
docMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) \
      .setLowerCase(True) \
      .setAllowMultiTokenChunk(True) \
      .setEnableFuzzyMatching(True) \
      .setFuzzyMatchingDistanceThresholds(0.8) \
      .setFuzzyMatchingDistances(["longest-common-subsequence"]) \
      .setFuzzyDistanceScalingMode("right")


mapperPipeline = nlp.Pipeline().setStages([document_assembler,
                                      docMapper])


test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10"], ["Aspaginaspa"], ["coumadin"], ["coumadin 5 mg"], ["metamorfin"]]).toDF("text")
result_df = mapperPipeline.fit(test_data).transform(test_data)

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings.result,
                                  result_df.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|null          |NONE                   |null     |
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor |action   |
|Aspaginaspa       |Warfarina lusa|Analgesic              |action   |
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|metamorfin        |metformin     |hypoglycemic           |action   |
+------------------+--------------+-----------------------+---------+



As seen above, the fuzzy matching results changed compared to those obtained with the default scaling mode, `long`: for example, "Lusa Warfarina 5mg" no longer gets a mapping.

### Token Fingerprinting

**Note:** In this part, we use dictionaries of different sizes to check how sensitive `DocMapperApproach` is to the mapping size in terms of speed and efficiency.

In [None]:
data_set_mappings = [
        {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        },
        {
            "key": "amlodipine",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Calcium Ions Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },
        {
            "key": "coumadin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Coagulation Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },
        {
            "key": "aspagin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Cycooxygenase Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "arthritis"
                    ]
                }
            ]
        },
        {
            "key": "metformin",
            "relations": [
                {
                  "key": "action",
                  "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
                },
                {
                  "key": "treatment",
                  "values" : ["diabetes", "t2dm"]
                }
            ]
        }
    ]

In [None]:
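# Extra dictionary entries of different sizes, used to test speed and efficiency
# (note: the "l5000" set below is generated with range(15000), i.e. 15,000 longer keys)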
extra_keys = {
    "s500": [{"key": f"short key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(500)],
    "s5000": [{"key": f"short key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(5000)],
    "l5000": [{"key": f"a bit longer key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(15000)]
}

In [None]:
import json
for c, extra_mappings in extra_keys.items():
    with open(f'mappings_{c}.json', 'w', encoding='utf-8') as f:
        json.dump({'mappings': data_set_mappings + extra_mappings}, f, ensure_ascii=False, indent=4)

Defining sample documents

In [None]:
test_data = spark.createDataFrame([["Lusa Warfarina 5mg"], ["amlodipine 10 MG"], ["Aspaginaspa"], ["coumadin 5 mg"], ["coumadin"], ["metamorfin"]]).toDF("text")

document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

pipeline = nlp.Pipeline().setStages([document_assembler])

cached_df = pipeline.fit(test_data).transform(test_data).cache()
cached_df.selectExpr("explode(document) as chunk").show(truncate=False)

+----------------------------------------------------------+
|chunk                                                     |
+----------------------------------------------------------+
|{document, 0, 17, Lusa Warfarina 5mg, {sentence -> 0}, []}|
|{document, 0, 15, amlodipine 10 MG, {sentence -> 0}, []}  |
|{document, 0, 10, Aspaginaspa, {sentence -> 0}, []}       |
|{document, 0, 12, coumadin 5 mg, {sentence -> 0}, []}     |
|{document, 0, 7, coumadin, {sentence -> 0}, []}           |
|{document, 0, 9, metamorfin, {sentence -> 0}, []}         |
+----------------------------------------------------------+



Creating `DocMapperApproach` with only token fingerprinting:

In [None]:
dm = medical.DocMapperApproach() \
        .setInputCols(["document"]) \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True) \
        .setEnableTokenFingerprintMatching(True) \
        .setMinTokenNgramFingerprint(1) \
        .setMaxTokenNgramFingerprint(3) \
        .setMaxTokenNgramDroppingCharsRatio(0.5)

docMappers = [
    dm.copy().setOutputCol(f"mappings_{c}").setDictionary(f"mappings_{c}.json") \
    for c in extra_keys]

result_df = nlp.Pipeline(stages=docMappers).fit(cached_df).transform(cached_df)
result_df.selectExpr("explode(mappings_s500)").show(truncate=False)


In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings_s500.result,
                                  result_df.mappings_s500.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------+----------------------+---------+
|document          |fixed_chunk   |action_mapping_result |relation |
+------------------+--------------+----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic             |action   |
|Lusa Warfarina 5mg|Warfarina lusa|diabetes              |treatment|
|amlodipine 10 MG  |amlodipine    |Calcium Ions Inhibitor|action   |
|amlodipine 10 MG  |amlodipine    |hypertension          |treatment|
|Aspaginaspa       |null          |NONE                  |null     |
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor |action   |
|coumadin 5 mg     |coumadin      |hypertension          |treatment|
|coumadin          |coumadin      |Coagulation Inhibitor |action   |
|coumadin          |coumadin      |hypertension          |treatment|
|metamorfin        |null          |NONE                  |null     |
+------------------+--------------+----------------------+---------+
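The matches above are consistent with an order-insensitive comparison of token n-grams. The following is a minimal sketch of that idea, assuming a fingerprint built by lowercasing and sorting tokens. The function names are hypothetical, the character-dropping ratio is not modeled, and the annotator's actual algorithm may differ.

```python
def token_fingerprint(tokens) -> str:
    """Lowercase and sort tokens so that word order no longer matters."""
    return " ".join(sorted(t.lower() for t in tokens))


def token_ngram_fingerprints(text: str, min_n: int = 1, max_n: int = 3) -> set:
    """Fingerprints of every contiguous token n-gram with min_n <= n <= max_n."""
    tokens = text.split()
    prints = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            prints.add(token_fingerprint(tokens[start:start + n]))
    return prints


def match_keys(text: str, keys, min_n: int = 1, max_n: int = 3) -> list:
    """Return dictionary keys whose fingerprint equals one of the text's n-gram fingerprints."""
    key_index = {token_fingerprint(k.split()): k for k in keys}
    found = token_ngram_fingerprints(text, min_n, max_n)
    return sorted(key_index[fp] for fp in found if fp in key_index)


keys = ["Warfarina lusa", "amlodipine", "coumadin", "aspagin", "metformin"]
for doc in ["Lusa Warfarina 5mg", "amlodipine 10 MG", "Aspaginaspa", "coumadin 5 mg", "metamorfin"]:
    print(doc, "->", match_keys(doc, keys))
```

In this sketch "Lusa Warfarina 5mg" reaches "Warfarina lusa" because the two-token n-gram has the same sorted tokens, while single-token inputs such as "Aspaginaspa" and "metamorfin" find nothing, since no whole token equals a key.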



Testing the sensitivity to the mapping size:

In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s500)").write.mode("overwrite").save("timing_test")

553 ms ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s5000)").write.mode("overwrite").save("timing_test")

409 ms ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_l5000)").write.mode("overwrite").save("timing_test")

276 ms ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Token and Char Fingerprinting

Creating `DocMapperApproach` with token and char fingerprinting:

In [None]:
dm = medical.DocMapperApproach() \
        .setInputCols(["document"]) \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True) \
        .setEnableTokenFingerprintMatching(True) \
        .setMinTokenNgramFingerprint(1) \
        .setMaxTokenNgramFingerprint(3) \
        .setMaxTokenNgramDroppingCharsRatio(0.5) \
        .setEnableCharFingerprintMatching(True) \
        .setMinCharNgramFingerprint(1) \
        .setMaxCharNgramFingerprint(3)

chunkerMappers = [
    dm.copy().setOutputCol(f"mappings_{c}").setDictionary(f"mappings_{c}.json") \
    for c in extra_keys]

result_df = nlp.Pipeline(stages=chunkerMappers).fit(cached_df).transform(cached_df)
result_df.selectExpr("explode(mappings_s500)").show(truncate=False)


In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings_s500.result,
                                  result_df.mappings_s500.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation ")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic              |action   |
|Lusa Warfarina 5mg|Warfarina lusa|diabetes               |treatment|
|amlodipine 10 MG  |amlodipine    |Calcium Ions Inhibitor |action   |
|amlodipine 10 MG  |amlodipine    |hypertension           |treatment|
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|Aspaginaspa       |aspagin       |arthritis              |treatment|
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |hypertension           |treatment|
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin          |coumadin      |hypertension           |treatment|
|metamorfin        |null          |NONE                   |null     |
+------------------+--------------+-----------------------+---------+
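A character n-gram fingerprint offers one possible explanation for why "Aspaginaspa" now maps to "aspagin" while "metamorfin" still has no match. The sketch below assumes a fingerprint made of the sorted set of distinct character n-grams. The helper `char_ngram_fingerprint` is hypothetical and may not reflect how the annotator computes its fingerprints.

```python
def char_ngram_fingerprint(text: str, n: int = 1) -> str:
    """Sorted, de-duplicated character n-grams of the lowercased alphanumeric text."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


pairs = [("Aspaginaspa", "aspagin"), ("metamorfin", "metformin")]
for document, key in pairs:
    for n in (1, 2):
        same = char_ngram_fingerprint(document, n) == char_ngram_fingerprint(key, n)
        print(f"{document!r} vs {key!r} with n={n}: match={same}")
```

With `n=1`, "Aspaginaspa" and "aspagin" use exactly the same set of letters, so their fingerprints coincide. "metamorfin" contains an "a" that "metformin" lacks, so the two fingerprints differ.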

Testing the sensitivity to the mapping size:

In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s500)").write.mode("overwrite").save("timing_test")

453 ms ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s5000)").write.mode("overwrite").save("timing_test")

237 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_l5000)").write.mode("overwrite").save("timing_test")

222 ms ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
