![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **GenericLogRegClassifierApproach**

This notebook covers the parameters and usage of the `GenericLogRegClassifierApproach` annotator.

**📖 Learning Objectives:**

1. Understand how to use `GenericLogRegClassifierApproach`.

2. Become comfortable using the different parameters of the annotator.




**🔗 Helpful Links:**

- Documentation : [GenericLogRegClassifierApproach]()

- Python Docs : [GenericLogRegClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/generic_log_reg_classifier/index.html)

- Scala Docs : [GenericLogRegClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/GenericLogRegClassifierApproach.html)



## **📜 Background**


`GenericLogRegClassifier` is a derivative of `GenericClassifier` that implements multinomial logistic regression: a single-layer neural network with the logistic function at the output. The model takes `FEATURE_VECTOR` annotations as input and outputs `CATEGORY` annotations, each carrying a label and a confidence score between 0 and 1.
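
To make the "single layer plus logistic output" description concrete, here is a minimal pure-Python sketch of the forward pass. The weights, biases, and the 3-dimensional feature vector are made-up values for illustration; the real model learns its parameters during training and works on the full feature vector.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, biases, labels):
    """One dense layer followed by a sigmoid per output unit.

    weights[k] is the (hypothetical) weight vector for labels[k].
    Returns the best label and a score in (0, 1) for every label.
    """
    scores = {}
    for label, w, b in zip(labels, weights, biases):
        z = sum(wi * xi for wi, xi in zip(w, features)) + b
        scores[label] = sigmoid(z)
    best = max(scores, key=scores.get)
    return best, scores

# Toy feature vector and made-up parameters (assumptions, not trained values)
features = [0.2, -0.1, 0.4]
weights = [[1.0, 0.5, -1.0], [-1.0, -0.5, 1.0]]
biases = [0.1, -0.1]

label, scores = classify(features, weights, biases, ["neg", "pos"])
print(label, scores)
```

The label with the highest score becomes the predicted category, and the score itself plays the role of the confidence value.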


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install the licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

## **🖨️ Input/Output Annotation Types**

- Input: `FEATURE_VECTOR`

- Output: `CATEGORY`

## **🔎 Parameters**


- `inputCols`: The names of the columns containing the input annotations. Accepts either a single string or an array of strings.
- `outputCol`: The name of the generated column, which holds `CATEGORY` annotations. Only one column can be specified here.


All parameters can be set using the corresponding setter method in camel case, for example `.setInputCols()`.

### `inputCols` and `outputCol`

These parameters define the names of the columns containing the `SENTENCE_EMBEDDINGS` or `FeatureVector` annotations needed as input to `GenericLogRegClassifierApproach`, and the name of the new column containing the predicted categories.

Let's define a pipeline to process raw texts into `FeatureVector` annotations:

### Data Preprocessing

In [None]:
# Download the sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

**ADE Negative Dataset**

In [None]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg.head()

Unnamed: 0,col1
0,6460590 NEG Clioquinol intoxication occurring ...
1,"8600337 NEG ""Retinoic acid syndrome"" was preve..."
2,8402502 NEG BACKGROUND: External beam radiatio...
3,"8700794 NEG Although the enuresis ceased, she ..."
4,17662448 NEG A 42-year-old woman had uneventfu...


In [None]:
df_neg['text'] = df_neg.col1.str.split('NEG', n=1).str[1]  # split only at the first NEG marker
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg
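
One thing to watch with this kind of split: a sentence that itself contains the string `NEG` is cut at that point unless the split is limited to the first marker. The sketch below shows the difference with plain Python strings. The sample line is invented for illustration and is not taken from the corpus. In pandas, the equivalent of the second form is `str.split('NEG', n=1)`.

```python
# Hypothetical line in the same "<id> NEG <sentence>" layout as ADE-NEG.txt
line = "1234567 NEG The NEG control group showed no adverse reaction."

# Unlimited split: the sentence is cut again at the second "NEG"
cut_everywhere = line.split("NEG")[1]

# Split limited to the first marker: the whole sentence is preserved
cut_once = line.split("NEG", 1)[1]

print(repr(cut_everywhere))
print(repr(cut_once))
```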


**ADE Positive Dataset**

In [None]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,10030778,Intravenous azithromycin-induced ototoxicity.,ototoxicity,43,54,azithromycin,22,34
1,10048291,"Immobilization, while Paget's bone disease was...",increased calcium-release,960,985,dihydrotachysterol,908,926
2,10048291,Unaccountable severe hypercalcemia in a patien...,hypercalcemia,31,44,dihydrotachysterol,94,112
3,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,naproxen,646,654
4,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,oxaprozin,659,668


In [None]:
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


**Merging Positive and Negative dataset**

In [None]:
ade_df= pd.concat([df_neg, df_pos])
ade_df.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [None]:
ade_df["category"].value_counts()

neg    16695
pos     6821
Name: category, dtype: int64

In [None]:
ade_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23516 entries, 0 to 6820
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      23516 non-null  object
 1   category  23516 non-null  object
dtypes: object(2)
memory usage: 551.2+ KB


We sample roughly 30% of the data to speed up the run. Use the full dataset for better scores.

In [None]:
spark_df = spark.createDataFrame(ade_df).sample(0.3, 3) # limit the data

trainingData, testData = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 5617
Test Dataset Count: 1390
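
The counts above add up to 7,007 rows, which is close to but not exactly 30% of 23,516. That is expected: Spark's `sample` and `randomSplit` decide row by row at random, so the requested fractions are only approximate. The following pure-Python sketch imitates that behaviour. It uses Python's own random generator, so its counts will differ from Spark's, and the helper names are hypothetical.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def random_split(rows, train_weight, seed):
    """Send each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(23516))  # same number of rows as ade_df
sampled = bernoulli_sample(rows, 0.3, seed=3)
train, test = random_split(sampled, 0.8, seed=100)

# Sizes land near 30% and 80/20, but not exactly on them
print(len(sampled), len(train), len(test))
```

Fixing the seed makes the result reproducible, which is why the notebook passes a seed to both calls.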


In [None]:
spark_df.groupBy("category").count().show()

+--------+-----+
|category|count|
+--------+-----+
|     pos| 2068|
|     neg| 4939|
+--------+-----+



In [None]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



In [None]:
spark_df.head(3)

[Row(text=' Clioquinol intoxication occurring in the treatment of acrodermatitis enteropathica with reference to SMON outside of Japan.', category='neg'),
 Row(text=' A 42-year-old woman had uneventful bilateral laser-assisted subepithelial keratectomy (LASEK) to correct myopia.', category='neg'),
 Row(text=' A 16-year-old girl with erosive, polyarticular JRA showed no detectable change in her articular disease following nine exchanges.', category='neg')]

### 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



Now we will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use them to train the classification model.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

    ])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
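
The `AVERAGE` pooling strategy set on `SentenceEmbeddings` can be pictured as an element-wise mean over the token vectors of a document. A minimal sketch, using three made-up 4-dimensional token vectors instead of the 100-dimensional healthcare embeddings:

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Made-up token vectors for illustration only
tokens = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]

sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0, 0.0]
```

Whatever the number of tokens, the pooled vector has the same dimension as a single token vector, which is what lets the classifier use a fixed input size.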


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| "Syndrome malin"-like symptoms probably due to interacti...|     neg|[{sentence_embeddings, 0, 109,  "Syndrome malin"-like sym...|
| 'Bail-out' bivalirudin use in patients with thrombotic c...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin us...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
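# None of the stages in this pipeline is trained on the data, so calling fit() on testData does not leak information
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)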
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| (2) Rehabilitation of a 29-year-old man with a 7-year hi...|     neg|[{sentence_embeddings, 0, 176,  (2) Rehabilitation of a 2...|
| 2-Chlordeoxyadenosine (2-CdA) is an antineoplastic/immun...|     neg|[{sentence_embeddings, 0, 170,  2-Chlordeoxyadenosine (2-...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[-0.07506646, 0.053498756, 0.20358036, 0.22310139, -0.12403396, -0.07148757, 0.07359774, -0.07931808, -0.058168065, ...|
|[[-0.010381512, 0.082957536, 0.10597669, 0.22413144, -0.17930073, -0.038972173, -0.017218085, -0.08750686, -0.0118597...|
|[[0.061766334, -0.050023016, 0.24404901, 0.18501845, -0.14053237, -0.0817295, -0.06483702, -0.1373563, 0.0844157, 0.0...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [None]:
log_folder="ADE_logs_healthcare_100d"
!mkdir -p $log_folder

### GenericLogRegClassifier

In [None]:
!pip install -q tensorflow==2.11.0
!pip install -q tensorflow-addons

In [None]:
#from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = medical.TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs the output of FeaturesAssembler, which collects features from one or more columns (here, a single embeddings column) into one feature vector.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
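
Conceptually, collecting features from several columns means flattening them into one numeric vector. The sketch below illustrates that idea in plain Python. It is an assumption about the general concept, not a description of how `FeaturesAssembler` is implemented, and the extra numeric columns are hypothetical.

```python
def assemble_features(*feature_groups):
    """Flatten several numeric feature groups into one flat vector."""
    vector = []
    for group in feature_groups:
        vector.extend(float(x) for x in group)
    return vector

embedding = [0.12, -0.40, 0.33]  # stand-in for a sentence embedding
extra = [54, 1]                  # hypothetical numeric columns, e.g. an age and a flag

feature_vector = assemble_features(embedding, extra)
print(feature_vector)       # [0.12, -0.4, 0.33, 54.0, 1.0]
print(len(feature_vector))  # 5
```

In this notebook only the `sentence_embeddings` column is assembled, so the feature vector has the same 100 dimensions as the embeddings.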

In [None]:
features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = medical.GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)
    # Optional settings (chain them onto the builder above to enable):
    # .setValidationSplit(0.1)
    # .setFeatureScaling()  possible values are 'zscore', 'minmax' or empty (no scaling)


clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_logreg_graph_builder,
    gen_clf])
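
The commented-out `setFeatureScaling` option mentions `'zscore'` and `'minmax'`. The sketch below shows the textbook definitions of both on a single feature column. Exactly how the library computes them (for example, population versus sample standard deviation) is not stated in this notebook, so treat this as an illustration of the idea only.

```python
import math

def zscore(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values] if std else [0.0] * len(values)

def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span for v in values] if span else [0.0] * len(values)

column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
print(zscore(column))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(minmax(column))
```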


In [None]:
gen_clf.extractParamMap()

{Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='labelColumn', doc='Column with one label per document'): 'category',
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='batchSize', doc='Size for each batch in the optimization process'): 128,
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='epochsN', doc='Number of epochs for the optimization process'): 20,
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='learningRate', doc='Learning rate for the optimization process'): 0.01,
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='dropout', doc='Dropout at the output of each layer'): 0.1,
 Param(parent='GenericLogRegClassifierApproach_2bd7030b2639', name='fixImbalance', doc='A flag indicating whenther to balance the trainig set'): True,
 Param

In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb
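
The build parameters printed above (`input_dim` 100, `output_dim` 2, no hidden layers) describe a very small model. As a rough worked example, the helper below counts weights and biases for a stack of fully connected layers. It assumes every layer has a bias term, which the log does not confirm.

```python
def dense_param_count(input_dim, output_dim, hidden_layers=()):
    """Weights plus biases for a stack of fully connected layers."""
    sizes = [input_dim, *hidden_layers, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Values copied from the "Build params" line of the graph builder output
build_params = {"input_dim": 100, "output_dim": 2, "hidden_layers": []}
print(dense_param_count(**build_params))  # 202
```

Under that assumption the graph holds 100 x 2 weights plus 2 biases, which helps explain why each epoch in the training log takes only a fraction of a second.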


In [None]:
!cat $log_folder/GenericLogRegClassifierApproach*

Training 20 epochs
Epoch 1/20	0.35s	Loss: 26.359146	ACC: 0.70233625
Epoch 2/20	0.09s	Loss: 24.4605	ACC: 0.7182676
Epoch 3/20	0.11s	Loss: 23.55838	ACC: 0.72685003
Epoch 4/20	0.09s	Loss: 22.888666	ACC: 0.73724264
Epoch 5/20	0.09s	Loss: 22.552008	ACC: 0.7384619
Epoch 6/20	0.08s	Loss: 22.318563	ACC: 0.74931645
Epoch 7/20	0.08s	Loss: 21.98907	ACC: 0.74900854
Epoch 8/20	0.09s	Loss: 21.837883	ACC: 0.75292736
Epoch 9/20	0.08s	Loss: 21.780584	ACC: 0.7532463
Epoch 10/20	0.09s	Loss: 21.487045	ACC: 0.7583719
Epoch 11/20	0.07s	Loss: 21.601967	ACC: 0.7576742
Epoch 12/20	0.07s	Loss: 21.501255	ACC: 0.7591528
Epoch 13/20	0.06s	Loss: 21.280378	ACC: 0.7623017
Epoch 14/20	0.07s	Loss: 21.322176	ACC: 0.7605969
Epoch 15/20	0.06s	Loss: 21.277739	ACC: 0.7614375
Epoch 16/20	0.07s	Loss: 21.115847	ACC: 0.764313
Epoch 17/20	0.07s	Loss: 21.265276	ACC: 0.76317847
Epoch 18/20	0.07s	Loss: 21.158234	ACC: 0.76303554
Epoch 19/20	0.05s	Loss: 21.051338	ACC: 0.76387614
Epoch 20/20	0.05s	Loss: 21.203983	ACC: 0.7643616
Traini
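
The log lines follow a regular tab-separated pattern, so they are easy to parse when you want to plot the curve or find the best epoch. A small sketch using only the standard library; the regular expression is written against the lines shown above and may need adjusting if the log format differs in other versions.

```python
import re

LOG_LINE = re.compile(
    r"Epoch (\d+)/(\d+)\s+([\d.]+)s\s+Loss: ([\d.]+)\s+ACC: ([\d.]+)"
)

def parse_training_log(text):
    """Return one dict per epoch line; all other lines are ignored."""
    epochs = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            epoch, _total, seconds, loss, acc = match.groups()
            epochs.append({"epoch": int(epoch), "seconds": float(seconds),
                           "loss": float(loss), "acc": float(acc)})
    return epochs

# Three lines copied from the log above
sample = (
    "Training 20 epochs\n"
    "Epoch 1/20\t0.35s\tLoss: 26.359146\tACC: 0.70233625\n"
    "Epoch 19/20\t0.05s\tLoss: 21.051338\tACC: 0.76387614\n"
    "Epoch 20/20\t0.05s\tLoss: 21.203983\tACC: 0.7643616\n"
)

history = parse_training_log(sample)
best = min(history, key=lambda e: e["loss"])
print(best["epoch"], best["loss"])  # 19 21.051338
```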

In [None]:
preds = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- feature_vector: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |

In [None]:
preds_df = preds.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.77      0.94      0.85       972
         pos       0.72      0.34      0.47       418

    accuracy                           0.76      1390
   macro avg       0.75      0.64      0.66      1390
weighted avg       0.76      0.76      0.73      1390
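
To put the report in context, it helps to compare the accuracy with a majority-class baseline and to see how the macro averages are formed. The sketch below recomputes both from the rounded numbers printed above, so the last digit can differ slightly from what scikit-learn derives from the raw predictions.

```python
# Values copied from the classification report above
report = {
    "neg": {"precision": 0.77, "recall": 0.94, "f1": 0.85, "support": 972},
    "pos": {"precision": 0.72, "recall": 0.34, "f1": 0.47, "support": 418},
}

total = sum(row["support"] for row in report.values())

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(row["support"] for row in report.values()) / total

# Macro average: unweighted mean over the classes
macro_recall = sum(row["recall"] for row in report.values()) / len(report)
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

print(f"majority baseline accuracy: {majority_baseline:.2f}")
print(f"macro recall: {macro_recall:.2f}, macro f1: {macro_f1:.2f}")
```

Always predicting `neg` would already reach about 0.70 accuracy on this test set, so the model's 0.76 is a modest gain. The recall of 0.34 on `pos` shows that roughly two thirds of the positive sentences are still missed.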

