![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/GenericClassifierModel.ipynb)

# GenericClassifierModel

In this notebook, we will examine the `GenericClassifierModel` annotator.

**📖 Learning Objectives:**

1. Understand how to classify feature vectors with a pretrained `GenericClassifierModel`.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

Python Documentation: [GenericClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/generic_classifier/generic_classifier/index.html#sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierModel)

Scala Documentation: [GenericClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/generic_classifier/GenericClassifierModel.html)


## **📜 Background**


`GenericClassifierModel` is a generic single-label classifier that uses pre-generated TensorFlow graphs. It operates on `FEATURE_VECTOR` annotations, so a `FeaturesAssembler` stage must come before it in the pipeline to produce its input.


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**
- Input: `FEATURE_VECTOR`
- Output: `CATEGORY`

## **🔎 Parameters**


- `multiClass` *(Boolean)*: Whether to return all classes or only the one with the highest score (Default: `False`)
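
To make the distinction concrete, here is a minimal pure-Python sketch of the idea behind `multiClass`. It is not the library's implementation: the helper `select_predictions`, the label names, and the scores are all hypothetical and serve only to show "top class only" versus "all classes".

```python
from typing import Dict, List, Tuple


def select_predictions(
    scores: Dict[str, float], multi_class: bool = False
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted by score, highest first.

    Hypothetical helper: with multi_class=False only the best pair is kept,
    with multi_class=True every pair is returned.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if multi_class else ranked[:1]


# Made-up scores for illustration only; not real model output.
example_scores = {"label_a": 0.71, "label_b": 0.18, "label_c": 0.08, "label_d": 0.03}

top_only = select_predictions(example_scores)
all_classes = select_predictions(example_scores, multi_class=True)

print(top_only)
print(all_classes)
```

The first call mirrors the default behavior (one prediction), while the second mirrors what to expect conceptually once the parameter is enabled.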




### `multiClass`

We will compare the output with `multiClass` set to `False` (the default) and to `True`, using a pretrained model that classifies tobacco usage.

In [None]:
sample_texts = [
    [
        "Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes"
    ],
    [
        "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago."
    ],
    [
        "The patient denies any history of smoking or alcohol abuse. She lives with her one daughter."
    ],
    [
        "She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."
    ],
]

df = spark.createDataFrame(
    sample_texts,
).toDF("text")

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_embeddings = (
    nlp.BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

features_asm = (
    medical.FeaturesAssembler()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("features")
)

generic_classifier = (
    medical.GenericClassifierModel.pretrained(
        "genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli",
        "en",
        "clinical/models",
    )
    .setInputCols(["features"])
    .setOutputCol("class_")
)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier,
    ]
)

results = pipeline.fit(df).transform(df)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli download started this may take some time.
[OK!]


Let's examine the results:

In [None]:
res = results.select(
    F.explode(
        F.arrays_zip(
            results.document.result,
            results.class_.result,
            results.class_.metadata,
        )
    ).alias("col")
).select(
    F.expr("col['1']").alias("prediction"),
    F.expr("col['2']['confidence']").alias("confidence"),
    F.expr("col['0']").alias("sentence"),
)

res.show(truncate=150)
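
The `arrays_zip` plus `explode` pattern in the cell above can be pictured in plain Python. The sketch below is only an analogy, not Spark code: `zip_and_explode` is a hypothetical helper, and the row values are made up. It pairs the i-th elements of several parallel lists and emits one record per position, padding shorter lists with `None` in the same way `arrays_zip` pads shorter arrays with null.

```python
from itertools import zip_longest
from typing import Any, Dict, List


def zip_and_explode(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Pair up the i-th elements of several lists and emit one dict per position.

    Shorter lists are padded with None (zip_longest's default fill value).
    """
    names = list(columns)
    return [dict(zip(names, values)) for values in zip_longest(*columns.values())]


# Made-up values standing in for one DataFrame row; not real model output.
row = {
    "sentence": ["example note text"],
    "prediction": ["label_a"],
    "metadata": [{"confidence": "0.71"}],
}

exploded = zip_and_explode(row)
for item in exploded:
    # Same three fields the notebook selects: prediction, confidence, sentence.
    print(item["prediction"], item["metadata"]["confidence"], item["sentence"])
```

Each record here corresponds to one row of `res`, with the prediction, its `confidence` metadata entry, and the source text side by side.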

Now we set `multiClass` to `True` and compare the results.

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_embeddings = (
    nlp.BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

features_asm = (
    medical.FeaturesAssembler()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("features")
)

generic_classifier = (
    medical.GenericClassifierModel.pretrained(
        "genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli",
        "en",
        "clinical/models",
    )
    .setInputCols(["features"])
    .setOutputCol("class_")
    .setMultiClass(True)
)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier,
    ]
)

results = pipeline.fit(df).transform(df)

In [None]:
results.select("class_").show(truncate=200)

In [None]:
res = results.select(
    F.explode(
        F.arrays_zip(
            results.document.result,
            results.class_.result,
            results.class_.metadata,
        )
    ).alias("col")
).select(
    F.expr("col['1']").alias("prediction"),
    F.expr("col['2']['confidence']").alias("confidence"),
    F.expr("col['0']").alias("sentence"),
)

res.show(truncate=150)

With `multiClass` set to `True`, the table also shows the confidence scores of the other labels, not only the top prediction.