![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DocumentLogRegClassifier**

This notebook shows how to use `DocumentLogRegClassifier`, an annotator that trains a supervised logistic regression model to classify documents (or text) into predefined categories based on their content.




**📖 Learning Objectives:**

1. Understand how `DocumentLogRegClassifier` works.

2. Become comfortable using the parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [DocumentLogRegClassifier](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentlogregclassifier)

- Python Docs : [DocumentLogRegClassifier](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/document_log_classifier/index.html#sparknlp_jsl.annotator.classification.document_log_classifier.DocumentLogRegClassifierModel)

- Scala Docs : [DocumentLogRegClassifier](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/DocumentLogRegClassifierModel.html)

- For extended usage examples, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.Clinical_Text_Classification_with_Spark_NLP.ipynb#scrollTo=hJCE-sWM9oaK).


## **📜 Background**

`DocumentLogRegClassifier` is designed for text classification tasks: it learns from labeled examples to assign documents (or text) to predefined categories based on their content.

`Logistic Regression` is a statistical model that estimates the probability of a binary (or multi-class) outcome from one or more predictor variables. In text classification, the predictor variables are typically features extracted from the text, such as the presence or absence of certain words, n-grams, or other text-based features.

During training, the `Logistic Regression` model learns a weight (coefficient) for each feature, which determines how much that feature contributes to predicting each class label. During inference, the trained model takes a new text document as input, extracts the relevant features, and computes the probability that the document belongs to each class using the learned weights and a logistic function.
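
The inference step described above can be sketched in plain Python. This is a minimal illustration, not the annotator's actual implementation: the vocabulary, the weights, and the helper names (`featurize`, `softmax`, `predict_proba`) are all hypothetical and chosen by hand, and the multi-class case is assumed to use a softmax over per-class scores.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary and hand-picked weights, for illustration only.
VOCAB = ["colonoscopy", "bladder", "numbness", "fracture"]
CLASSES = ["Gastroenterology", "Urology", "Neurology", "Orthopedic"]

# WEIGHTS[c][j] is the weight of vocabulary word j for class c.
WEIGHTS = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
BIAS = [0.0, 0.0, 0.0, 0.0]


def featurize(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(t.lower() for t in tokens)
    return [counts[w] for w in VOCAB]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens):
    """Score each class as a weighted sum of features, then normalize."""
    x = featurize(tokens)
    scores = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(WEIGHTS, BIAS)
    ]
    return dict(zip(CLASSES, softmax(scores)))


probs = predict_proba("There are no bladder tumors noted".split())
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Here the only vocabulary word in the input is "bladder", so the `Urology` score is the largest and that class receives the highest probability. A trained model would learn its weights from data rather than use fixed values like these.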

## **🎬 Colab Setup**

In [None]:
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CATEGORY`

## **🔎 Parameters**

- `setLabels`: Sets the array of labels so that predictions are output in their original form.

- `setMergeChunks`: Whether to merge all chunks in a document (default: `false`).



## Train a Model

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [None]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header = True)

spark_df.show(10,truncate=100)

+----------------+----------------------------------------------------------------------------------------------------+
|        category|                                                                                                text|
+----------------+----------------------------------------------------------------------------------------------------+
|Gastroenterology| PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active f...|
|Gastroenterology| OPERATION 1. Ivor-Lewis esophagogastrectomy. 2. Feeding jejunostomy. 3. Placement of two right-s...|
|Gastroenterology| PREOPERATIVE DIAGNOSES: 1. Gastroesophageal reflux disease. 2. Chronic dyspepsia. POSTOPERATIVE ...|
|Gastroenterology| PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSES: Rectal bleeding and perirectal abscess. POSTOPER...|
|Gastroenterology| PREOPERATIVE DIAGNOSIS: Right colon tumor. POSTOPERATIVE DIAGNOSES: 1. Right colon cancer. 2. As...|
|Gastroenterology| PREOPERATIVE DIAGNOSI

In [None]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  115|
|       Neurology|  143|
|      Orthopedic|  223|
|Gastroenterology|  157|
+----------------+-----+



In [None]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 42)
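
A weighted random split of this kind assigns each row independently, so the resulting sizes are only approximately 80/20 (the test set evaluated later in this notebook has 102 of the 638 rows, about 16%). The sketch below illustrates the idea in plain Python. It is a conceptual illustration with a hypothetical `random_split` helper, not Spark's implementation, and it will not reproduce the same rows as `randomSplit` for a given seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using a seeded uniform draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share, e.g. [0.8, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


train, test = random_split(list(range(638)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is why the same seed is passed on every run when results need to be comparable.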

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

logreg = medical.DocumentLogRegClassifierApproach()\
    .setInputCols("token")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        logreg
])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [None]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.88      0.88      0.88        25
       Neurology       0.81      0.77      0.79        22
      Orthopedic       0.86      0.89      0.87        35
         Urology       0.90      0.90      0.90        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.86      0.86      0.86       102
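
To make the columns of this report concrete, the sketch below computes per-class precision, recall, F1, and support from two label lists using the standard definitions. The `per_class_metrics` helper and the toy labels are hypothetical and are not taken from the test set above.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the truth but was missed

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted_n = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return metrics


# Hypothetical toy labels, for illustration only.
y_true = ["Urology", "Urology", "Neurology", "Orthopedic", "Orthopedic"]
y_pred = ["Urology", "Neurology", "Neurology", "Orthopedic", "Orthopedic"]

metrics = per_class_metrics(y_true, y_pred)
for label, m in metrics.items():
    print(label, m)
```

Precision answers "of the documents predicted as this class, how many were right?", while recall answers "of the documents that truly belong to this class, how many were found?". Support is the number of true documents per class.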



In [None]:
doclogreg_model.stages

[DocumentAssembler_6f12620a6eda,
 REGEX_TOKENIZER_e3b666d0b8d1,
 TLR_5d4d9581822a]

In [None]:
doclogreg_model.stages[2].write().overwrite().save('DocLogRegClf_model')

## Use the Model with **DocumentLogRegClassifierModel**

The text classifier was trained to distinguish between the following four medical specialties:

- `Gastroenterology`

- `Urology`

- `Neurology`

- `Orthopedic`

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

logreg = medical.DocumentLogRegClassifierModel.load("/content/DocLogRegClf_model")\
    .setInputCols("token")\
    .setOutputCol("prediction")\
    .setMergeChunks(True)\
    .setLabels(['Gastroenterology', 'Urology', 'Neurology', 'Orthopedic'])

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    logreg])

We will put the sample texts into a PySpark DataFrame and then get specialty predictions with `.transform`.

In [None]:
data = spark.createDataFrame([["After administering appropriate antibiotics and MAC anesthesia, the upper extremity was prepped and draped in the usual sterile fashion. The arm was exsanguinated with Esmarch, and the tourniquet inflated to 250 mmHg. A transverse incision was made over the MPJ crease of the thumb. Dissection was carried down to the flexor sheath with care taken to identify and protect the neurovascular bundles. The flexor sheath was opened under direct vision with a scalpel, and then a scissor was used to release the A1 pulley under direct vision on the radial side, from its proximal extent to its distal extent at the junction of the proximal and middle thirds of the proximal phalanx. "],
 ["The patient was placed in the supine position and sterilely prepped and draped in the usual fashion. After 2% lidocaine was instilled, the anterior urethra is normal. The prostatic urethra reveals mild lateral lobe obstruction. There are no bladder tumors noted. IMPRESSION: The patient has some mild benign prostatic hyperplasia. At this point in time, we will continue with conservative observation. "],
 ["Bilateral lower extremity numbness. HX: 21 y/o RHM complained of gradual onset numbness and incoordination of both lower extremities beginning approximately 11/5/96. The symptoms became maximal over a 12-24 hour period and have not changed since. The symptoms consist of tingling in the distal lower extremities approximately half way up the calf bilaterally. He noted decreased coordination of both lower extremities which he thought might be due to uncertainty as to where his feet were being placed in space."],
 ["PROCEDURE: Upper endoscopy. PREOPERATIVE DIAGNOSIS: Dysphagia. POSTOPERATIVE DIAGNOSIS: 1. GERD, biopsied. 2. Distal esophageal reflux-induced stricture, dilated to 18 mm. 3. Otherwise normal upper endoscopy. MEDICATIONS: Fentanyl 125 mcg and Versed 7 mg slow IV push."]
                              ]).toDF("text")

In [None]:
result = clf_Pipeline.fit(data).transform(data)

In [None]:
result.select('prediction.result','text').show(truncate = 150)

+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|            result|                                                                                                                                                  text|
+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Gastroenterology]|After administering appropriate antibiotics and MAC anesthesia, the upper extremity was prepped and draped in the usual sterile fashion. The arm wa...|
|         [Urology]|PROCEDURE: Upper endoscopy. PREOPERATIVE DIAGNOSIS: Dysphagia. POSTOPERATIVE DIAGNOSIS: 1. GERD, biopsied. 2. Distal esophageal reflux-induced stri...|
|       [Neurology]|Bilateral lower extremity numbness. HX: 21 y/o RHM complained of gradual onset numbness and incoordination of both lower