![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **AssertionLogRegApproach**

This notebook covers the parameters and usage of `AssertionLogRegApproach`, the annotator used to train an `AssertionLogRegModel`.

**📖 Learning Objectives:**

1. Understand how to use AssertionLogRegApproach.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [AssertionLogRegApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#assertionlogreg)

- Python Docs : [AssertionLogRegApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/assertion/assertion_dl_reg/index.html#sparknlp_jsl.annotator.assertion.assertion_dl_reg.AssertionLogRegApproach)

- Scala Docs : [AssertionLogRegApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/assertion/logreg/AssertionLogRegApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb).

## **📜 Background**


`AssertionLogRegApproach` trains an assertion status classifier based on logistic regression. It contains all the methods needed to train an `AssertionLogRegModel`, including `trainWithChunk` and `trainWithStartEnd`.
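As a rough illustration of the underlying idea, binary logistic regression scores a feature vector with a weighted sum and passes the score through a sigmoid to obtain a probability. The sketch below is a minimal pure-Python example; the feature values, weights, and bias are made up for illustration and are not taken from a trained `AssertionLogRegModel`.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical 3-dimensional example (illustrative numbers only).
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [0.2, 0.9, 0.4]

p_positive = predict_proba(features, weights, bias)
predicted_label = "present" if p_positive >= 0.5 else "absent"
print(f"P(present) = {p_positive:.3f} -> {predicted_label}")
```

Training amounts to choosing the weights and bias so that these probabilities agree with the labeled examples as closely as possible.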

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734.json to spark_nlp_for_healthcare_spark_ocr_8734.json


In [4]:
from johnsnowlabs import nlp

nlp.install(force_browser=True)


Downloading license...
Licenses extracted successfully
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [5]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `WORD_EMBEDDINGS`

- Output: `ASSERTION`

## **📂 Training Data**

The initial training data should consist of `text`, `target`, `label`, `start` and `end` columns.

For this example, we will download a short sample of i2b2 assertion data:

In [6]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [7]:
assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("i2b2_assertion_sample_short.csv")

In [8]:
assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows
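The `start` and `end` columns are token positions rather than character positions. As a quick sanity check, the sketch below rebuilds each target from its token span for the three rows shown above. It assumes plain whitespace tokenization and 0-based, inclusive indices, which is consistent with these rows; the helper name `target_from_token_span` is made up for this example.

```python
# Rows copied from the sample output above: (text, target, start, end).
rows = [
    ("She has no history of liver disease , hepatitis .", "liver disease", 5, 6),
    ("1. Undesired fertility .", "undesired fertility", 1, 2),
    ("3) STATUS POST FALL .", "fall", 3, 3),
]

def target_from_token_span(text, start, end):
    """Return the tokens start..end (inclusive, 0-based) joined by spaces."""
    tokens = text.split()  # assumption: whitespace tokenization
    return " ".join(tokens[start:end + 1])

for text, target, start, end in rows:
    span = target_from_token_span(text, start, end)
    # Targets in the sample are lowercased, so compare case-insensitively.
    print(f"{span!r:25} matches target: {span.lower() == target}")
```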



Now, let's look at the distribution of labels in our dataset:

In [9]:
assertion_df.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

+-------+-----+
|label  |count|
+-------+-----+
|present|663  |
|absent |228  |
+-------+-----+
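The counts show that the two classes are imbalanced. A minimal sketch of the arithmetic, using only the counts printed above, gives each class's share and the accuracy a trivial "always predict the majority class" baseline would reach on the full dataset:

```python
label_counts = {"present": 663, "absent": 228}  # counts from the output above

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
majority_label = max(label_counts, key=label_counts.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Always predicting '{majority_label}' would be right "
      f"{shares[majority_label]:.1%} of the time")
```

A trained model should be judged against this baseline rather than against 50%.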



And finally, let's split the dataset into training and test sets.

In [10]:
(trainingData, testData) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 721
Test Dataset Count: 170
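Note that the split sizes only approximate the requested 80/20 ratio (721 and 170 of 891 rows, roughly 80.9% and 19.1%), because rows are assigned to the splits at random. The pure-Python sketch below mimics that behaviour with one independent random draw per row. It illustrates the idea and is not Spark's actual sampling code; the helper name `random_split` is made up, and its counts will differ from the Spark run above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw.

    Weights are normalized so they sum to 1, then each row falls into the
    split whose cumulative-probability interval contains its draw.
    """
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(891)), [0.8, 0.2], seed=100)
print(len(train), len(test))  # sizes are close to, but rarely exactly, 80/20
```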


### Preprocessing Pipeline

In [11]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

tokenizer = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

embeddings_pipeline = nlp.Pipeline(
    stages = [
    document_assembler,
    chunk,
    tokenizer,
    word_embeddings])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


**Doc2Chunk():**

- `setStartColByTokenIndex(True)`: When set to true, the `start` column is interpreted as a token index rather than a character index.

- `setFailOnMissing(False)`: Controls what happens when a specified chunk is missing. If set to true, a missing chunk raises an error; if set to false, processing continues without error.

- `setLowerCase(True)`: When set to true, all chunks are converted to lowercase.
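To make the difference between token and character indices concrete, here is a small pure-Python sketch that converts a token span into character offsets. It assumes whitespace tokenization and inclusive end offsets, and the function name `token_span_to_char_span` is hypothetical; `Doc2Chunk` performs its own lookup internally, which may differ in details.

```python
def token_span_to_char_span(text, start_token, end_token):
    """Convert inclusive 0-based token indices to inclusive character offsets."""
    offsets = []  # (begin, end) character offsets for each token
    cursor = 0
    for token in text.split():
        begin = text.index(token, cursor)
        end = begin + len(token) - 1
        offsets.append((begin, end))
        cursor = end + 1
    return offsets[start_token][0], offsets[end_token][1]

text = "She has no history of liver disease , hepatitis ."
begin, end = token_span_to_char_span(text, 5, 6)
print(begin, end, repr(text[begin:end + 1]))
```

Token indices 5 and 6 select the same text as a character range, but the two numbering schemes are not interchangeable, which is why the flag has to be set explicitly.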

And now we preprocess both the training and the test set.

In [12]:
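# No stage in this pipeline learns from the data, so fit once and reuse the fitted pipeline.
embeddings_model = embeddings_pipeline.fit(trainingData)
trainingData_with_embeddings = embeddings_model.transform(trainingData)
testData_with_embeddings = embeddings_model.transform(testData)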

## **🔎 AssertionLogRegApproach Parameters**


- `label` : Column with the label for each target

- `maxIter` : Maximum number of training iterations (Default: 26)

- `regParam` : Regularization parameter. Regularization controls the complexity of the model and helps prevent overfitting.

- `eNetParam` : Elastic net parameter

- `beforeParam` : Length of the context before the target

- `afterParam` : Length of the context after the target

- `startCol` : Column that contains the token number for the start of the target

- `endCol` : Column that contains the token number for the end of the target
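For intuition on how `regParam` and `eNetParam` interact, the sketch below computes an elastic-net penalty using the convention of Spark ML's `LogisticRegression`, where the elastic-net value mixes an L1 term and an L2 term. That this annotator follows the same convention is an assumption, and the weight values are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

weights = [0.5, -1.0, 2.0]  # hypothetical model coefficients

ridge_only = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.0)
lasso_only = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.5)
print(ridge_only, lasso_only, mixed)
```

Under this convention, a value of 0 gives a pure L2 (ridge) penalty, a value of 1 gives a pure L1 (lasso) penalty, and `regParam` scales the overall strength.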





## **🦾 Model Training**

Next, we define the `AssertionLogRegApproach` annotator. The training dataset must contain a label column.

In [13]:
assertionStatus = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setMaxIter(100)\
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

- `setBefore(11)` and `setAfter(13)`: These two parameters set the size of the context the model takes into account when classifying a target. Here the model considers the 11 tokens before the target and the 13 tokens after it.

- `setStartCol("start")` and `setEndCol("end")`: These two parameters name the columns that hold the start and end token numbers of the target, which tell the model which tokens make up the target.
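The effect of `before` and `after` can be pictured as slicing a window around the target. The sketch below is a pure-Python illustration that assumes whitespace tokenization and inclusive token indices; the helper `context_window` is hypothetical, and how the library pads short contexts or turns the window into features is not shown in this notebook.

```python
def context_window(tokens, start, end, before=11, after=13):
    """Split tokens into (left context, target, right context).

    start/end are inclusive 0-based token indices of the target; the
    contexts are truncated at the boundaries of the text.
    """
    left = tokens[max(0, start - before):start]
    target = tokens[start:end + 1]
    right = tokens[end + 1:end + 1 + after]
    return left, target, right

tokens = "She has no history of liver disease , hepatitis .".split()
left, target, right = context_window(tokens, start=5, end=6)
print("left:  ", left)
print("target:", target)
print("right: ", right)
```

In this short sentence both contexts are smaller than the configured sizes, so the window simply covers the whole text; the negation cue "no" falls in the left context.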

In [14]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
    assertionStatus])

In [15]:
%%time

assertion_model = clinical_assertion_pipeline.fit(trainingData_with_embeddings)

CPU times: user 251 ms, sys: 25.4 ms, total: 276 ms
Wall time: 40.1 s


We can save the trained model using the code below:

In [16]:
assertion_model.stages[-1].write().overwrite().save('assertion_custom_model')

## **📈 Model Testing**

After training the model, it can be used to get predictions on the test set in order to calculate performance metrics.

In [17]:
preds = assertion_model.transform(testData_with_embeddings).select('label','assertion.result')

preds.show()

+-------+---------+
|  label|   result|
+-------+---------+
|present|[present]|
| absent| [absent]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present| [absent]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
+-------+---------+
only showing top 20 rows



In [18]:
from sklearn.metrics import classification_report

preds_df = preds.toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report( preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

      absent       0.93      0.75      0.83        53
     present       0.90      0.97      0.93       117

    accuracy                           0.91       170
   macro avg       0.91      0.86      0.88       170
weighted avg       0.91      0.91      0.90       170
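To see how the figures in the report relate to each other, the sketch below recomputes precision, recall, and F1 from confusion counts. The counts are an assumption: they are one set of values consistent with the supports (53 and 117) and the rounded scores printed above, since the notebook does not show the confusion matrix itself.

```python
# Assumed confusion counts (not printed by the notebook).
absent_as_absent = 40      # true absent, predicted absent
absent_as_present = 13     # true absent, predicted present
present_as_absent = 3      # true present, predicted absent
present_as_present = 114   # true present, predicted present

def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

absent = prf(tp=absent_as_absent, fp=present_as_absent, fn=absent_as_present)
present = prf(tp=present_as_present, fp=absent_as_present, fn=present_as_absent)
total = (absent_as_absent + absent_as_present
         + present_as_absent + present_as_present)
accuracy = (absent_as_absent + present_as_present) / total

for name, scores in (("absent", absent), ("present", present)):
    print(name, [round(s, 2) for s in scores])
print("accuracy", round(accuracy, 2))
```

The lower recall for `absent` means the model misses a noticeable share of negated findings, which the overall accuracy alone does not reveal.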



# **AssertionLogRegModel**

This notebook will cover the different parameters and usages of `AssertionLogRegModel`.

**📖 Learning Objectives:**

1. Understand how to use AssertionLogRegModel.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [AssertionLogRegModel](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#assertionlogreg)

- Python Docs : [AssertionLogRegModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/assertion/assertion_dl_reg/index.html#sparknlp_jsl.annotator.assertion.assertion_dl_reg.AssertionLogRegModel)

- Scala Docs : [AssertionLogRegModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/assertion/logreg/AssertionLogRegModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb).

## **📜 Background**

`AssertionLogRegModel` is the main class of the AssertionLogReg family. It uses logistic regression to assign an assertion status to extracted entities based on the surrounding text. It requires `DOCUMENT`, `CHUNK` and `WORD_EMBEDDINGS` annotations as input, which can be produced by, for example, a DocumentAssembler, a NerConverter and a WordEmbeddingsModel. The result is one assertion status annotation for each recognized entity. The possible values depend on the labels used for training, for example "Negated", "Affirmed" and "Historical"; the model trained above predicts "present" or "absent".

Unlike the DL model, this class does not extend AnnotatorModel. It extends RawAnnotator instead, so the main method of interest is `transform()`.

At the moment there are no pretrained models available for this class. To train your own model, use AssertionLogRegApproach.

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `WORD_EMBEDDINGS`

- Output: `ASSERTION`

## **🔎 AssertionLogRegModel Parameters**

- `afterParam` : Length of the context after the target (Default: 13)

- `beforeParam`: Length of the context before the target (Default: 11)

- `startCol` : Column that contains the token number for the start of the target

- `endCol` : Column that contains the token number for the end of the target

We load the trained assertion model using the code below:

In [19]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
   .setInputCols("document") \
   .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

clinical_assertion = medical.AssertionLogRegModel.load("/content/assertion_custom_model") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])


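# None of these stages needs training data, so fitting on an empty DataFrame simply builds the PipelineModel.
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))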

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [20]:
sample_df = spark.createDataFrame([["Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain"]]).toDF("text")
result = model.transform(sample_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|           assertion|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Patient has a hea...|[{document, 0, 11...|[{document, 0, 83...|[{token, 0, 6, Pa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 12, 21, ...|[{assertion, 12, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [21]:
result.select("assertion").show(truncate=False)


In [None]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata,
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['sentence']").alias("sent_id"),
              F.expr("cols['4']").alias("assertion")).show(truncate=False)