![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# GenericClassifierModel

In this notebook, we will examine the `GenericClassifierModel` annotator.

**📖 Learning Objectives:**

1. Understand how to classify text with a pretrained `GenericClassifierModel` that operates on feature vectors.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

Python Documentation: [GenericClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/generic_classifier/generic_classifier/index.html#sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierModel)

Scala Documentation: [GenericClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/generic_classifier/GenericClassifierModel.html)


## **📜 Background**


`GenericClassifierModel` is a generic single-label classifier that uses pre-generated TensorFlow graphs. It operates on `FEATURE_VECTOR` annotations, so a `FeaturesAssembler` stage must precede it in the pipeline to create its input.


## **🎬 Colab Setup**

In [1]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7139.json to spark_nlp_for_healthcare_spark_ocr_7139.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.3-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.3.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.3-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.3 installed! ✅ Heal the planet with NLP! 


In [4]:
from johnsnowlabs import nlp, medical
import pandas as pd
import pyspark.sql.functions as F

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.3, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**
- Input: `FEATURE_VECTOR`
- Output: `CATEGORY`

## **🔎 Parameters**


- `multiClass` *(Boolean)*: Whether to return all classes or only the one with the highest score (Default: `False`)
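
The effect of this parameter can be pictured with a small plain-Python sketch. It does not call Spark NLP: `select_labels` is a hypothetical helper and the confidence values are made up for illustration. It only mirrors the described behavior, where `False` keeps the highest-scoring class and `True` keeps every class with its confidence.

```python
from typing import Dict, List, Tuple


def select_labels(
    scores: Dict[str, float], multi_class: bool = False
) -> List[Tuple[str, float]]:
    """Hypothetical helper: mimic the described effect of multiClass."""
    if multi_class:
        # Keep every label together with its confidence
        return list(scores.items())
    # Keep only the label with the highest confidence
    best = max(scores, key=scores.get)
    return [(best, scores[best])]


# Made-up confidences, for illustration only
scores = {"Present": 0.70, "Past": 0.20, "Never": 0.06, "None": 0.04}

top_only = select_labels(scores)
all_labels = select_labels(scores, multi_class=True)

print(top_only)
print(all_labels)
```
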




### `multiClass()`

We will compare the output with `multiClass` set to `False` (the default) and to `True`, using a pretrained model that classifies tobacco usage.

In [7]:
sample_texts = [
    [
        "Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes"
    ],
    [
        "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago."
    ],
    [
        "The patient denies any history of smoking or alcohol abuse. She lives with her one daughter."
    ],
    [
        "She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."
    ],
]

df = spark.createDataFrame(
    sample_texts,
).toDF("text")

In [9]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_embeddings = (
    nlp.BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

features_asm = (
    medical.FeaturesAssembler()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("features")
)

generic_classifier = (
    medical.GenericClassifierModel.pretrained(
        "genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli",
        "en",
        "clinical/models",
    )
    .setInputCols(["features"])
    .setOutputCol("class_")
)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier,
    ]
)

results = pipeline.fit(df).transform(df)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli download started this may take some time.
[OK!]


Let's examine the results:

In [12]:
res = results.select(
    F.explode(
        F.arrays_zip(
            results.document.result,
            results.class_.result,
            results.class_.metadata,
        )
    ).alias("col")
).select(
    F.expr("col['1']").alias("prediction"),
    F.expr("col['2']['confidence']").alias("confidence"),
    F.expr("col['0']").alias("sentence"),
)

res.show(truncate=150)

+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                                                                              sentence|
+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   Present| 0.6574545|        Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes|
|      Past|  0.981618|The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol...|
|     Never| 0.9825732|                                                          The patient denies any history of smoking or

Now, we will set `multiClass` to `True` and compare the results.

In [13]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_embeddings = (
    nlp.BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

features_asm = (
    medical.FeaturesAssembler()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("features")
)

generic_classifier = (
    medical.GenericClassifierModel.pretrained(
        "genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli",
        "en",
        "clinical/models",
    )
    .setInputCols(["features"])
    .setOutputCol("class_")
    .setMultiClass(True)
)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier,
    ]
)

results = pipeline.fit(df).transform(df)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli download started this may take some time.
[OK!]


In [16]:
results.select("class_").show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  class_|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{category, 0, 0, Present, {confidence -> 0.6574545}, []}, {category, 0, 0, None, {confidence -> 0.0030241734}, []}, {category, 0, 0, Never, {confidence -> 0.002956534}, []}, {category, 0, 0, Past,...|
|[{category, 0, 0, Present, {confidence -> 0.014629182}, []}, {category, 0, 0, None, {confidence -> 0.0016301902}, []}, {category, 0, 0, Never, {confidence -> 0.0021226832}, []}, {category

In [14]:
res = results.select(
    F.explode(
        F.arrays_zip(
            results.document.result,
            results.class_.result,
            results.class_.metadata,
        )
    ).alias("col")
).select(
    F.expr("col['1']").alias("prediction"),
    F.expr("col['2']['confidence']").alias("confidence"),
    F.expr("col['0']").alias("sentence"),
)

res.show(truncate=150)

+----------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|prediction|  confidence|                                                                                                                                              sentence|
+----------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   Present|   0.6574545|        Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes|
|      None|0.0030241734|                                                                                                                                                  null|
|     Never| 0.002956534|                                                                                          

With `multiClass` set to `True`, each label gets its own row, so the table also shows the confidences of the labels that were not selected.
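
The `null` entries in the `sentence` column appear because `arrays_zip` pairs elements by position: the `document` column holds one annotation per text, while `class_` now holds one annotation per label, and the shorter array is padded with nulls. The following is a plain-Python analogy rather than Spark code, using `itertools.zip_longest` to stand in for that padding. The document string is shortened for readability, and the label order follows the raw `class_` output shown earlier.

```python
from itertools import zip_longest

# One document annotation (text shortened for readability)
documents = ["He uses alcohol and cigarettes"]
# Four category annotations; note that "None" here is a label string
labels = ["Present", "None", "Never", "Past"]

# zip_longest pads the shorter list with None, comparable to the nulls in the table
padded_rows = list(zip_longest(documents, labels))

# Repeating the single document for every label avoids the empty cells
filled_rows = [(documents[0], label) for label in labels]

for row in padded_rows:
    print(row)
for row in filled_rows:
    print(row)
```

In a Spark pipeline, one way to get the second shape would be to select the `text` column next to the exploded `class_` column instead of zipping it with `document.result`; this is a suggestion, not something the notebook itself does.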