![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb)

# **RCT Binary Classifiers**

📌 To run this notebook yourself, you need to provide your license keys. Run the cell below to upload the license file, or open the file explorer on the left side of the screen and upload it there as `spark_jsl.json`, the file name the setup cell looks for.
Otherwise, you can look at the example outputs included in the notebook.
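
Before starting the session, it can help to confirm that the license file holds every key the setup cells read. The sketch below is a minimal check, assuming the file is named `spark_jsl.json` and that only `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` are needed, since those are the keys this notebook references. The helper name `missing_license_keys` and the placeholder values are hypothetical.

```python
import json
import os
import tempfile

# Keys this notebook reads from the license file (install and start cells).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(path, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the JSON file at `path`."""
    with open(path) as f:
        keys = json.load(f)
    return [k for k in required if k not in keys]


# Usage with a throwaway file holding placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "spark_jsl.json")
    with open(demo_path, "w") as f:
        json.dump({"SECRET": "placeholder", "JSL_VERSION": "placeholder"}, f)
    demo_missing = missing_license_keys(demo_path)

print(demo_missing)
```

In Colab, calling `missing_license_keys('spark_jsl.json')` after the upload and getting an empty list back would suggest the install cells can resolve their `$`-variables.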

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# **MODELS**

* ### *`rct_binary_classifier_use`*
* ### *`rct_binary_classifier_biobert`*
* ### *`bert_sequence_classifier_binary_rct_biobert`*
* ### *`bert_sequence_classifier_rct_biobert`*

**🔎 You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition&edition=Spark+NLP+for+Healthcare).**

## **`bert_sequence_classifier_rct_biobert`**

In [4]:
sample_texts = [
"""Clinical and renal toxicity of amphotericin are reduced when the drug is prepared in fat emulsion. Preparation is simple and cost effective. Its efficacy is similar to that of conventional amphotericin .  """,
"""Insulin-sensitive women on the HC/LF diet lost 13.5 + / - 1.2 % ( p < 0.001 ) of their initial BW , whereas those on the LC/HF diet lost 6.8 + / - 1.2 % ( p < 0.001 ; p < 0.002 between the groups ) .In contrast , among the insulin-resistant women , those on the LC/HF diet lost 13.4 + / - 1.3 % ( p < 0.001 ) of their initial BW as compared with 8.5 + / - 1.4 % ( p < 0.001 ) lost by those on the HC/LF diet ( p < 0.04 between two groups ) .These differences could not be explained by changes in resting metabolic rate , activity , or intake .Overall , changes in Si were associated with the degree of weight loss ( r = -0.57 , p < 0.05 ) .""",
"""Premenopausal African American women have a 2-3 times greater incidence of coronary heart disease ( CHD ) than do white women . The plasma lipid responsiveness to dietary fat , which may be associated with CHD , has not been adequately studied in premenopausal African American or white women .""",
"""Ninety-three disease-free survivors of advanced Hodgkin disease ( 56 men and 37 women ) were studied ( a minimum of 1 year after completion of treatment ) by an interview conducted over the telephone . Standardized measures were used to assess their psychologic , sexual , family , and vocational functioning , including the following tests : the Psychosocial Adjustment to Illness Scale -- Self Report , the Brief Symptom Inventory , the Profile of Mood States , and the Impact of Event Scale .""",
"""This study investigated the effect of contingent electrical stimulation ( CES ) on present pain intensity ( PI ) , pressure pain threshold ( PPT ) , and electromyographic events per hour of sleep ( EMG/h ) on probable bruxers with masticatory myofascial pain ."""
]

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")


tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("class_")


pipeline = Pipeline(
    stages=[
      document_assembler, 
      tokenizer,
      sequenceClassifier  
      ])


df = spark.createDataFrame(sample_texts, StringType()).toDF("text")
results = pipeline.fit(df).transform(df)


bert_sequence_classifier_rct_biobert download started this may take some time.
[OK!]


In [6]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                            results.class_.result,
                                            results.class_.metadata)).alias("col"))\
              .select(F.expr("col['1']").alias("prediction"),
                      F.expr("col['2']").alias("confidence"),
                      F.expr("col['0']").alias("sentence"))
if res.count()>1:
    udf_func = F.udf(lambda x,y:  x["Some("+str(y)+")"])
    res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=100)

+-----------+----------+----------------------------------------------------------------------------------------------------+
| prediction|confidence|                                                                                            sentence|
+-----------+----------+----------------------------------------------------------------------------------------------------+
|CONCLUSIONS| 0.9995792|Clinical and renal toxicity of amphotericin are reduced when the drug is prepared in fat emulsion...|
|    RESULTS| 0.9999662|Insulin-sensitive women on the HC/LF diet lost 13.5 + / - 1.2 % ( p < 0.001 ) of their initial BW...|
| BACKGROUND|0.99916506|Premenopausal African American women have a 2-3 times greater incidence of coronary heart disease...|
|    METHODS|0.99996114|Ninety-three disease-free survivors of advanced Hodgkin disease ( 56 men and 37 women ) were stud...|
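|  OBJECTIVE| 0.9990654|This study investigated the effect of contingent electrical stimulation ( CES ) on present pain i...|
+-----------+----------+----------------------------------------------------------------------------------------------------+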
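
One way to use these section labels is to regroup classified sentences into a structured abstract. The sketch below works on plain `(prediction, sentence)` pairs rather than on a Spark DataFrame. The five labels come from the output above, but the display order in `SECTION_ORDER`, the helper name `group_by_section`, and the sample sentences are assumptions made for illustration.

```python
# Assumed display order; the five labels are the ones seen in the output above.
SECTION_ORDER = ["BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]


def group_by_section(rows):
    """Group (prediction, sentence) pairs by label.

    Known labels come first in SECTION_ORDER; any other label is kept
    and appended afterwards in alphabetical order.
    """
    grouped = {}
    for label, sentence in rows:
        grouped.setdefault(label, []).append(sentence)
    known = [label for label in SECTION_ORDER if label in grouped]
    unknown = sorted(label for label in grouped if label not in SECTION_ORDER)
    return {label: grouped[label] for label in known + unknown}


# Hypothetical, shortened rows used only to show the grouping.
demo_rows = [
    ("RESULTS", "Weight loss differed between the diets."),
    ("BACKGROUND", "Dietary fat response has not been adequately studied."),
    ("RESULTS", "Changes in Si were associated with weight loss."),
]
demo_sections = group_by_section(demo_rows)

for label, sentences in demo_sections.items():
    print(label, len(sentences))
```

With the DataFrame from the cell above, the pairs could presumably be obtained with `res.select("prediction", "sentence").collect()`, since each collected row unpacks like a tuple.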

## **`rct_binary_classifier_biobert`**

In [7]:
sample_texts = [
"""Background:European roundtable meeting recommendations on bathing and cleansing of infants were published in 2009; a second meeting was held to update and expand these recommendations in light of new evidence and the continued need to address uncertainty surrounding this aspect of routine care. Method:The previous roundtable recommendations concerning infant cleansing, bathing, and use of liquid cleansers were critically reviewed and updated and the quality of evidence was evaluated using the Grading of Recommendation Assessment, Development and Evaluation system. New recommendations were developed to provide guidance on diaper care and the use of emollients. A series of recommendations was formulated to characterize the attributes of ideal liquid cleansers, wipes, and emollients. Results:Newborn bathing can be performed without harming the infant, provided basic safety procedures are followed. Water alone or appropriately designed liquid cleansers can be used during bathing without impairing the skin maturation process. The diaper area should be kept clean and dry; from birth, the diaper area may be gently cleansed with cotton balls/squares and water or by using appropriately designed wipes. Appropriately formulated emollients can be used to maintain and enhance skin barrier function. Appropriately formulated baby oils can be applied for physiologic (transitory) skin dryness and in small quantities to the bath. Baby products that are left on should be formulated to buffer and maintain babies' skin surface at approximately pH 5.5, and the formulations and their constituent ingredients should have undergone an extensive program of safety testing. Formulations should be effectively preserved; products containing harsh surfactants, such as sodium lauryl sulfate, should be avoided. Conclusion:Health care professionals can use these recommendations as the basis of their advice to parents.""",
"""Abstract:Over the past decade there has been a significant shift to the use of murine models for investigations into the molecular basis of respiratory diseases, including asthma and chronic obstructive pulmonary disease. These models offer the exciting prospect of dissecting the complex interaction between cytokines, chemokines and growth related peptides in disease pathogenesis. Furthermore, the receptors and the intracellular signalling pathways that are subsequently activated are amenable for study because of the availability of monoclonal antibodies and techniques for targeted gene disruption and gene incorporation for individual mediators, receptors and proteins. However, it is clear that extrapolation from these models to the human condition is not straightforward, as reflected by some recent clinical disappointments. This is not necessarily a problem with the use of mice itself, but results from our continued ignorance of the disease process and how to improve the modelling of complex interactions between different inflammatory mediators that underlie clinical pathology. This review highlights some of the strengths and weaknesses of murine models of respiratory disease."""
]

In [8]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \
        .setInputCols("document") \
        .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_biobert", "en", "clinical/models")\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class_")

biobert_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent, 
        classifier_dl
        ])


df = spark.createDataFrame(sample_texts, StringType()).toDF("text")
results = biobert_clf_pipeline.fit(df).transform(df)


sent_biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
rct_binary_classifier_biobert download started this may take some time.
Approximate size to download 21.4 MB
[OK!]


In [9]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                            results.class_.result,
                                            results.class_.metadata)).alias("col"))\
               .select(F.expr("col['1']").alias("prediction"),
                       F.expr("col['2']").alias("confidence"),
                       F.expr("col['0']").alias("sentence"))
if res.count()>1:
    udf_func = F.udf(lambda x,y:  x[y])
    res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=100)

+----------+----------+----------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                            sentence|
+----------+----------+----------------------------------------------------------------------------------------------------+
|      true|0.96267426|Background:European roundtable meeting recommendations on bathing and cleansing of infants were p...|
|     false| 0.9939513|Abstract:Over the past decade there has been a significant shift to the use of murine models for ...|
+----------+----------+----------------------------------------------------------------------------------------------------+
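
The result cells read the confidence from the metadata map under two different key styles: `Some(<label>)` for `MedicalBertForSequenceClassification` and the bare label for `ClassifierDLModel`. The sketch below is a plain-Python helper that tries both, assuming the metadata behaves like a dictionary whose values can be parsed as floats. The name `lookup_confidence` and the sample scores are hypothetical.

```python
def lookup_confidence(metadata, prediction):
    """Return the confidence stored for `prediction` as a float.

    Tries the bare label first, then the "Some(<label>)" form.
    Returns None when neither key is present.
    """
    for key in (str(prediction), "Some(" + str(prediction) + ")"):
        if key in metadata:
            return float(metadata[key])
    return None


# Illustrative metadata maps; the scores are made up for the example.
classifier_dl_meta = {"true": "0.96", "false": "0.04"}
bert_seq_meta = {"Some(RESULTS)": "0.99", "Some(METHODS)": "0.01"}

print(lookup_confidence(classifier_dl_meta, "true"))
print(lookup_confidence(bert_seq_meta, "RESULTS"))
```

A helper like this could be wrapped in `F.udf` in place of the two separate lambdas, so that the same result cell works for both annotator types.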



## **`rct_binary_classifier_use`**

In [10]:
sample_texts = [
"""Background:European roundtable meeting recommendations on bathing and cleansing of infants were published in 2009; a second meeting was held to update and expand these recommendations in light of new evidence and the continued need to address uncertainty surrounding this aspect of routine care. Method:The previous roundtable recommendations concerning infant cleansing, bathing, and use of liquid cleansers were critically reviewed and updated and the quality of evidence was evaluated using the Grading of Recommendation Assessment, Development and Evaluation system. New recommendations were developed to provide guidance on diaper care and the use of emollients. A series of recommendations was formulated to characterize the attributes of ideal liquid cleansers, wipes, and emollients. Results:Newborn bathing can be performed without harming the infant, provided basic safety procedures are followed. Water alone or appropriately designed liquid cleansers can be used during bathing without impairing the skin maturation process. The diaper area should be kept clean and dry; from birth, the diaper area may be gently cleansed with cotton balls/squares and water or by using appropriately designed wipes. Appropriately formulated emollients can be used to maintain and enhance skin barrier function. Appropriately formulated baby oils can be applied for physiologic (transitory) skin dryness and in small quantities to the bath. Baby products that are left on should be formulated to buffer and maintain babies' skin surface at approximately pH 5.5, and the formulations and their constituent ingredients should have undergone an extensive program of safety testing. Formulations should be effectively preserved; products containing harsh surfactants, such as sodium lauryl sulfate, should be avoided. Conclusion:Health care professionals can use these recommendations as the basis of their advice to parents.""",
"""Abstract:Over the past decade there has been a significant shift to the use of murine models for investigations into the molecular basis of respiratory diseases, including asthma and chronic obstructive pulmonary disease. These models offer the exciting prospect of dissecting the complex interaction between cytokines, chemokines and growth related peptides in disease pathogenesis. Furthermore, the receptors and the intracellular signalling pathways that are subsequently activated are amenable for study because of the availability of monoclonal antibodies and techniques for targeted gene disruption and gene incorporation for individual mediators, receptors and proteins. However, it is clear that extrapolation from these models to the human condition is not straightforward, as reflected by some recent clinical disappointments. This is not necessarily a problem with the use of mice itself, but results from our continued ignorance of the disease process and how to improve the modelling of complex interactions between different inflammatory mediators that underlie clinical pathology. This review highlights some of the strengths and weaknesses of murine models of respiratory disease."""
]

In [11]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
        .setInputCols("document")\
        .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_use", "en", "clinical/models")\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class_")

use_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        use,
        classifier_dl
    ])


df = spark.createDataFrame(sample_texts, StringType()).toDF("text")
results = use_clf_pipeline.fit(df).transform(df)


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
rct_binary_classifier_use download started this may take some time.
Approximate size to download 19.9 MB
[OK!]


In [12]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                            results.class_.result,
                                            results.class_.metadata)).alias("col"))\
              .select(F.expr("col['1']").alias("prediction"),
                      F.expr("col['2']").alias("confidence"),
                      F.expr("col['0']").alias("sentence"))
if res.count()>1:
    udf_func = F.udf(lambda x,y:  x[y])
    res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=100)

+----------+----------+----------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                            sentence|
+----------+----------+----------------------------------------------------------------------------------------------------+
|      true|       1.0|Background:European roundtable meeting recommendations on bathing and cleansing of infants were p...|
|     false|0.99968004|Abstract:Over the past decade there has been a significant shift to the use of murine models for ...|
+----------+----------+----------------------------------------------------------------------------------------------------+



## **`bert_sequence_classifier_binary_rct_biobert`**

In [13]:
sample_texts = [
"""Background:European roundtable meeting recommendations on bathing and cleansing of infants were published in 2009; a second meeting was held to update and expand these recommendations in light of new evidence and the continued need to address uncertainty surrounding this aspect of routine care. Method:The previous roundtable recommendations concerning infant cleansing, bathing, and use of liquid cleansers were critically reviewed and updated and the quality of evidence was evaluated using the Grading of Recommendation Assessment, Development and Evaluation system. New recommendations were developed to provide guidance on diaper care and the use of emollients. A series of recommendations was formulated to characterize the attributes of ideal liquid cleansers, wipes, and emollients. Results:Newborn bathing can be performed without harming the infant, provided basic safety procedures are followed. Water alone or appropriately designed liquid cleansers can be used during bathing without impairing the skin maturation process. The diaper area should be kept clean and dry; from birth, the diaper area may be gently cleansed with cotton balls/squares and water or by using appropriately designed wipes. Appropriately formulated emollients can be used to maintain and enhance skin barrier function. Appropriately formulated baby oils can be applied for physiologic (transitory) skin dryness and in small quantities to the bath. Baby products that are left on should be formulated to buffer and maintain babies' skin surface at approximately pH 5.5, and the formulations and their constituent ingredients should have undergone an extensive program of safety testing. Formulations should be effectively preserved; products containing harsh surfactants, such as sodium lauryl sulfate, should be avoided. Conclusion:Health care professionals can use these recommendations as the basis of their advice to parents.""",
"""Abstract:Over the past decade there has been a significant shift to the use of murine models for investigations into the molecular basis of respiratory diseases, including asthma and chronic obstructive pulmonary disease. These models offer the exciting prospect of dissecting the complex interaction between cytokines, chemokines and growth related peptides in disease pathogenesis. Furthermore, the receptors and the intracellular signalling pathways that are subsequently activated are amenable for study because of the availability of monoclonal antibodies and techniques for targeted gene disruption and gene incorporation for individual mediators, receptors and proteins. However, it is clear that extrapolation from these models to the human condition is not straightforward, as reflected by some recent clinical disappointments. This is not necessarily a problem with the use of mice itself, but results from our continued ignorance of the disease process and how to improve the modelling of complex interactions between different inflammatory mediators that underlie clinical pathology. This review highlights some of the strengths and weaknesses of murine models of respiratory disease."""
]

In [14]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")


tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")


sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_binary_rct_biobert", "en", "clinical/models")\
  .setInputCols(["document",'token'])\
  .setOutputCol("class_")


pipeline = Pipeline(stages=[
                document_assembler, 
                tokenizer,
                sequenceClassifier_loaded   
            ])


df = spark.createDataFrame(sample_texts, StringType()).toDF("text")
results = pipeline.fit(df).transform(df)

bert_sequence_classifier_binary_rct_biobert download started this may take some time.
[OK!]


In [15]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                            results.class_.result,
                                            results.class_.metadata)).alias("col"))\
              .select(F.expr("col['1']").alias("prediction"),
                      F.expr("col['2']").alias("confidence"),
                      F.expr("col['0']").alias("sentence"))
if res.count()>1:
    udf_func = F.udf(lambda x,y:  x["Some("+str(y)+")"])
    res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=100)

+----------+----------+----------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                            sentence|
+----------+----------+----------------------------------------------------------------------------------------------------+
|      True| 0.9999721|Background:European roundtable meeting recommendations on bathing and cleansing of infants were p...|
|     False| 0.9999644|Abstract:Over the past decade there has been a significant shift to the use of murine models for ...|
+----------+----------+----------------------------------------------------------------------------------------------------+
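
The three binary models print their labels with different casing (`true`/`false` versus `True`/`False`), so comparing them needs a normalization step. The sketch below converts the labels to booleans and checks, text by text, whether all models agree. The predictions are copied from the outputs above; the helper name `to_bool` is hypothetical.

```python
def to_bool(label):
    """Convert a 'true'/'false' label of any casing to a Python bool."""
    text = str(label).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError("Unexpected label: " + repr(label))


# Predictions for the two sample texts, as printed in the outputs above.
predictions = {
    "rct_binary_classifier_biobert": ["true", "false"],
    "rct_binary_classifier_use": ["true", "false"],
    "bert_sequence_classifier_binary_rct_biobert": ["True", "False"],
}

normalized = {
    model: [to_bool(label) for label in labels]
    for model, labels in predictions.items()
}

# One entry per sample text: True when every model gave the same answer.
agreement = [len(set(votes)) == 1 for votes in zip(*normalized.values())]
print(agreement)
```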

