![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/GenericLogRegClassifierModel.ipynb)

# **GenericLogRegClassifierModel**

This notebook covers the parameters and usage of `GenericLogRegClassifierModel`. The annotator is a derivative of `GenericClassifier` that implements multinomial logistic regression.

**📖 Learning Objectives:**

1. Understand how the annotator works. It is a single-layer neural network with the logistic function at the output. The input to the model is a `FEATURE_VECTOR` annotation, and the output is `CATEGORY` annotations with labels and corresponding confidence scores between 0 and 1. Training data requires only a "text" column and a "label" column, and training produces a `GenericLogRegClassifierModel`.

2. Become comfortable using the different parameters of the annotator.
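
To make the first objective concrete, here is a minimal pure-Python sketch of a single-layer classifier with a softmax output (the multi-class form of the logistic function). It is an illustration only, not the library's implementation: the feature values, weights, biases, and the helper names `softmax` and `predict` are all hypothetical.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, biases, labels):
    """Single layer: one linear score per class, followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical 3-dimensional feature vector and hand-picked parameters.
features = [0.2, -1.0, 0.5]
weights = [[0.4, 0.1, -0.3],    # weights for class "False"
           [-0.2, -0.6, 0.8]]   # weights for class "True"
biases = [0.0, 0.1]

label, probs = predict(features, weights, biases, ["False", "True"])
print(label, [round(p, 3) for p in probs])
```

Each probability lies between 0 and 1, which is why the annotator can report a confidence score alongside every predicted label.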


**🔗 Helpful Links:**

- Documentation : [GenericLogRegClassifierModel](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericlogregclassifier)

- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/8.Generic_Classifier.ipynb)

- Python Docs : [GenericLogRegClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/generic_log_reg_classifier/index.html)

- Scala Docs : [GenericLogRegClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/GenericLogRegClassifierModel.html)


## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**


- `LabelColumn`: This parameter sets the name of the column in your input data that contains the labels (categories) for the classification task. The classifier will use this column to learn from the data and make predictions.

- `ModelFile`: This parameter specifies the path to the pre-trained model file for the logistic regression classifier. It should be a protobuf file containing the model graph and trained weights.

- `EpochsNumber`: This parameter sets the number of epochs (iterations) the classifier will go through during the training process. An epoch represents one complete pass through the entire training dataset.

- `BatchSize`: This parameter sets the batch size used during training. The training data is divided into batches, and the model's weights are updated after processing each batch. A larger batch size may speed up training, but it requires more memory.

- `LearningRate`: This parameter sets the learning rate for the optimization algorithm used during training. The learning rate determines how much the model's weights are updated based on the computed gradients. A higher learning rate may lead to faster convergence but risks overshooting the optimal solution.

- `OutputLogsPath`: This parameter specifies the path where the logs related to the training process will be stored. These logs can include information such as training loss, accuracy, and other metrics.

- `Dropout`: Dropout is a regularization technique used to prevent overfitting in neural networks. This parameter sets the dropout rate, which determines the probability that each neuron's output will be temporarily ignored during training.

- `FixImbalance`: Imbalance refers to the situation when some classes have significantly more training examples than others. Setting this parameter to True indicates that the classifier will handle class imbalance during training to help ensure that the model doesn't become biased towards the majority class.

- `ValidationSplit`: This parameter sets the fraction of the training data, a value between 0 and 1, that is held out for validation during training. The remaining data is used for the actual training.
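
The arithmetic behind several of these parameters can be sketched in plain Python. The block below shows how `ValidationSplit`, `BatchSize`, and `EpochsNumber` together determine the number of weight updates, and one common way to compensate for class imbalance (inverse-frequency weights). Both helpers and all numbers are hypothetical illustrations. In particular, how `FixImbalance` rebalances classes internally is not described here, so the weighting scheme is an assumption.

```python
import math
from collections import Counter

def training_schedule(n_rows, validation_split, batch_size, epochs):
    """Return (train_rows, validation_rows, updates_per_epoch, total_updates)."""
    validation_rows = round(n_rows * validation_split)
    train_rows = n_rows - validation_rows
    # One weight update per batch; the last batch may be smaller.
    updates_per_epoch = math.ceil(train_rows / batch_size)
    return train_rows, validation_rows, updates_per_epoch, updates_per_epoch * epochs

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical setup: 1,000 rows, 10% validation, batches of 128, 20 epochs.
schedule = training_schedule(1000, 0.1, 128, 20)
print(schedule)

# Hypothetical label column with a 3:1 imbalance.
class_weights = inverse_frequency_weights(["neg"] * 750 + ["pos"] * 250)
print(class_weights)
```

With these assumed values, 900 rows are used for training, so each epoch performs 8 updates, and the minority class receives a larger weight than the majority class.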

# ✍  Explaining GenericLogRegClassifierModel with an Example

## 📑 Sample Data

In [None]:
# Download the sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

In [None]:
# Negative examples: keep the text that follows the "NEG" marker on each line.
df_neg = pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg["text"] = df_neg.col1.str.split("NEG").str[1]
df_neg["category"] = "neg"
df_neg = df_neg[["text", "category"]]

# Positive examples: pipe-delimited file; column 1 holds the sentence.
df_pos = pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos["category"] = "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos = df_pos[["text", "category"]]

# Combine both sets and encode the label as a boolean (True = "pos").
ade_df = pd.concat([df_neg, df_pos])
ade_df["category"] = ade_df["category"] == "pos"
ade_df.sample(5, random_state=42)

| | text | category |
|---|---|---|
| 11571 | Nineteen months after diagnosis, her mother g... | False |
| 4577 | An infant girl with choanal atresia, athelia, ... | True |
| 5638 | A causal relationship between the administrat... | False |
| 12799 | The absence of eosinophils does not exclude t... | False |
| 427 | While for ribavirin antidepressant effects are... | True |


In [None]:
from pyspark.sql.types import StringType
sample_data = ade_df.sample(n= 100, random_state=1)  # taking 100 rows as sample data
sample_data = sample_data.reset_index(drop=True)
df = spark.createDataFrame(sample_data['text'].tolist(), StringType()).toDF("text")
df.show(5,truncate=60)

+------------------------------------------------------------+
|                                                        text|
+------------------------------------------------------------+
| Multicentric canine lymphoma in a 12-year-old keeshond: ...|
| Pyomyositis is a rare disease, encountered mainly in tro...|
| Both patients subsequently developed markedly elevated E...|
|Bortezomib-induced paralytic ileus is a potential gastroi...|
|However, given the clinically significant result to the i...|
+------------------------------------------------------------+
only showing top 5 rows



In [None]:
sample_data.head()

| | text | category |
|---|---|---|
| 0 | Multicentric canine lymphoma in a 12-year-old... | False |
| 1 | Pyomyositis is a rare disease, encountered ma... | False |
| 2 | Both patients subsequently developed markedly... | False |
| 3 | Bortezomib-induced paralytic ileus is a potent... | True |
| 4 | However, given the clinically significant resu... | True |


## **💻Pipeline**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericClassifierModel.pretrained("generic_logreg_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""],
 ["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text")



embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
generic_logreg_classifier_ade download started this may take some time.
[OK!]
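
The `AVERAGE` pooling strategy used by `SentenceEmbeddings` in the pipeline above can be pictured as an element-wise mean over the token vectors, which yields one fixed-size vector per document. The sketch below is a simplified illustration in pure Python. The helper `average_pool` and the 4-dimensional toy vectors are hypothetical, and real clinical embeddings have far more dimensions.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

# Hypothetical 4-dimensional embeddings for a three-token sentence.
tokens = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0, 0.0]
```

The pooled vector is what `FeaturesAssembler` passes on as the `FEATURE_VECTOR` input of the classifier.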


In [None]:
result = clf_Pipeline.fit(data).transform(df)
# result.select("text", "class.result").show(truncate=100)

In [None]:
result.select("text", "class.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|

In [None]:
preds_df = result.select("text", "class.result").toPandas()
# "class.result" is a list of string labels; convert the first one to a boolean.
preds_df['result'] = preds_df['result'].apply(lambda x: x[0] == 'True')
preds_df


| | text | result |
|---|---|---|
| 0 | Multicentric canine lymphoma in a 12-year-old... | False |
| 1 | Pyomyositis is a rare disease, encountered ma... | False |
| 2 | Both patients subsequently developed markedly... | False |
| 3 | Bortezomib-induced paralytic ileus is a potent... | False |
| 4 | However, given the clinically significant resu... | True |
| ... | ... | ... |
| 95 | Subsequently he responded to the West Berlin ... | False |
| 96 | All three patients likely experienced electro... | False |
| 97 | We report a case of codeine intoxication in th... | False |
| 98 | METHODS: The clinical suspicion of chickenpox... | False |


In [None]:
from sklearn.metrics import classification_report
print(classification_report(sample_data['category'], preds_df['result']))

              precision    recall  f1-score   support

       False       0.86      0.96      0.91        72
        True       0.85      0.61      0.71        28

    accuracy                           0.86       100
   macro avg       0.86      0.78      0.81       100
weighted avg       0.86      0.86      0.85       100
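
To see where the numbers in this report come from, the sketch below recomputes precision, recall, and F1 from confusion counts. The counts are not printed by the notebook. They are inferred from the rounded report and the class supports (28 `True`, 72 `False`), so treat them as an assumption. The helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the rounded report above (assumption), with "True" as
# the positive class: 17 hits, 3 false alarms, 11 misses, 69 correct rejections.
tp, fp, fn, tn = 17, 3, 11, 69

# Metrics for the "True" class.
p_true, r_true, f_true = precision_recall_f1(tp, fp, fn)
# For the "False" class the roles swap: its hits are tn, its false alarms are fn.
p_false, r_false, f_false = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"True : precision={p_true:.2f} recall={r_true:.2f} f1={f_true:.2f}")
print(f"False: precision={p_false:.2f} recall={r_false:.2f} f1={f_false:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Under these counts, the low recall on the `True` class comes from 11 of the 28 positive sentences being predicted as `False`.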

