![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/MULTICLF_HOC.ipynb)

## **Multilabel Classification For Hallmarks of Cancer**

## **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Expose the license values as Python variables and environment variables,
# so that $PUBLIC_VERSION, $JSL_VERSION and $SECRET resolve in the install commands
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.0.2
Spark NLP_JSL Version : 5.0.2


# **multiclassifierdl_hoc**

In [4]:
text_list = [

"""Ghrelin was identified in the stomach as an endogenous ligand specific for the growth hormone secretagogue receptor ( GHS-R ) . GHS-R is found in various tissues , but its function is unknown . Here we show that GHS-R is found in hepatoma cells . Exposure of these cells to ghrelin caused up-regulation of several insulin-induced activities including tyrosine phosphorylation of insulin receptor substrate-1 ( IRS-1 ) , association of the adapter molecule growth factor receptor-bound protein 2 with IRS-1 , mitogen-activated protein kinase activity , and cell proliferation . Unlike insulin , ghrelin inhibited Akt kinase activity as well as up-regulated gluconeogenesis . These findings raise the possibility that ghrelin modulates insulin activities in humans .""",

"""Clones of mortal chicken fibroblasts and erythroblasts transformed by temperature-sensitive v-src and v-erb B oncoproteins have been developed into immortal cell lines that retain the conditional transformed phenotype . The expressions of two tumor suppressor genes , the retinoblastoma ( Rb ) gene and the p53 gene , were investigated during senescence , crisis , and cell line establishment . In temperature-sensitive ( ts)-v-erb B erythroblasts and ts-v-src fibroblasts ( as well as in v-myc macrophages ) , loss of p53 mRNA or expression of a mutated p53 gene invariably occurred in the early phase of immortalization . In contrast , expression of the Rb gene was unchanged at all stages of immortalization . Inactivation of the original temperature-sensitive oncogene led to loss of the transformed phenotype in fibroblasts and to differentiation in erythroblasts , even in lines that were immortal and lacked p53 . The results demonstrate that the process of immortalization is distinct from cell transformation , probably requiring different mutational events """,

"""The immune system has an important role in tumor appearance and spreading . One of the most efficient subpopulations of cytotoxic cells in the destruction of tumors are NK cells . NK cells are activated and increase their cytotoxic potential and modulate their cytokine production after treatment with IFNgamma , IL-12 , TNFalpha and IL-2 . The investigation of the activity of NK cells was performed on peripheral blood lymphocytes ( PBL ) of 16 healthy controls and of 40 patients with metastatic breast carcinoma . Modulation of NK cells was performed with IL-2 , IL-7 , IL-12 , TNFalpha , monoclonal antibodies ( mAb ) for TNFalpha and TNFalpha receptors type I and II , as well as with sera of healthy controls and patients with breast cancer in different clinical stages . Modulating effect of the applied factors after in vitro treatment of PBL was evaluated by the cytotoxic assay using 51chromium . Our results indicate that IL-2 significantly increased the activity of NK cells of controls and breast cancer patients . The sera of patients with advanced breast cancer significantly reduced NK cell activity . IL-7 , IL-12 and mAb for TNFalpha do not significantly change the activity of NK cells . The presence of anti-TNFalpha mAb did not change the inhibitory effect of the sera of breast cancer patients with advanced disease on the activity of NK cells of controls and patients with breast cancer . Blocking of TNFalpha Rcs with mAbs decrease the reactivity of NK cells for IL-2 . The treatment of breast cancer patients with advanced clinical stage of breast cancer with IL-2 , as an additional therapy , could be advantageous , as NK cells after this treatment increase their cytotoxic activity against tumor cells and can improve therapeutical results """,

"""Many human cellular and tissue compartments are supersaturated with respect to calcium oxyanion salts . In order to prevent the formation of injurious crystals efficient anti-crystallization protective mechanisms must be necessary . We suggest that depletion of such systems , particularly in ageing organisms and under conditions of oxidative stress , plays an important role in degenerative and inflammatory diseases , including cancer .""",

"""BACKGROUND Matrix metalloproteinases ( MMP ) are a gene family of zinc enzymes capable of degrading almost all of the extracellular matrix macromolecules in vivo . Their enzymic activities are believed to be responsible for tumor invasion and metastasis . METHODS In this study , using peroxidase-antiperoxidase method , monospecific antisera against MMP-1 ( tissue collagenase ) , MMP-2 ( type IV collagenase/72-kilodalton [ KD ] gelatinase ) , and MMP-3 ( stromelysin ) were applied to 29 squamous cell carcinomas and normal epithelium of the esophagus to identify cells synthesizing and secreting these enzymes . RESULTS Immunoreactivity of MMP-1 , -2 , and -3 was observed in small cancer nests of the deeply invasive or marginal portion of the tumor . Among the 29 patients studied , the presence of at least one MMP was observed in 17 ( 58.6% ) . All three enzymes were observed in six ( 20.6% ) patients , MMP-2 and -3 in five ( 17.2% ) patients , only MMP-2 in three ( 10.3% ) patients , and MMP-3 alone in three ( 10.3% ) patients . There was a good correlation among histologic stage and tumor invasion , lymph node metastasis , and MMP expression . In particular , expression of MMP-2 and -3 was closely related to lymph node metastasis and vascular invasion . CONCLUSIONS These results suggest that MMP , especially MMP-2 and -3 , play an important role in tumor invasion and metastasis and that analysis of MMP-2 and -3 production is useful for evaluation of malignant potential in esophageal carcinoma. """,

"""A number of cancer chemotherapeutic drugs designed to have cytotoxic actions on tumor cells have recently been shown to also have antiangiogenic activities . Endothelial cell migration and proliferation are key components of tumor angiogenesis , and agents that target the microtubule cytoskeleton can interfere with these processes . In this study , the effect on endothelial cell functions of the microtubule-stabilizing drugs Taxotere and Taxol were evaluated in three in vitro assays : a chemokinetic migration assay , an angiogenesis factor-mediated chemotactic migration assay , and a three-dimensional Matrigel tubule formation assay , using rat fat pad endothelial cells ( RFPECs ) and/or human umbilical vein endothelial cells ( HUVECs ) . Taxotere was active in all three assays at concentrations that were not cytotoxic and did not inhibit endothelial cell proliferation . In the RFPEC chemokinetic migration and in vitro tubule formation assays , the IC50 values were approximately 10(-9) M for both Taxotere and Taxol . HUVEC migration , however , was more sensitive to Taxotere , with an observed IC50 of 10(-12) M in a chemokinetic assay . In a Boyden chamber assay , HUVEC chemotaxis stimulated by either of two angiogenic factors , thymidine phosphorylase or vascular endothelial growth factor , was inhibited by Taxotere with an IC50 of 10(-11) M and was ablated at 10(-9) M. Taxotere was also up to 1000-fold more potent than Taxol in inhibiting either chemokinetic or chemotactic migration . When the microtubule cytoskeleton was visualized using immunofluorescence staining of alpha-tubulin , there were no gross morphological changes observed in HUVECs or RFPECs treated with Taxotere at concentrations that inhibited endothelial cell migration but not proliferation . The effects of Taxotere on migration were associated with a reduction in the reorientation of the cell's centrosome , at concentrations that did not affect gross microtubule morphology or proliferation . 
Reorientation of the centrosome , which acts as the microtubule organizing center , in the intended direction of movement is a critical early step in the stabilization of directed cell migration . These data indicate that endothelial cell migration correlates more closely with changes in microtubule plasticity than with microtubule gross structure . The antiangiogenic activity of Taxotere in vivo was assessed in a Matrigel plug assay . In this assay , the angiogenic response to fibroblast growth factor 2 was inhibited in vivo by Taxotere with an ID50 of 5.4 mg/kg when injected twice weekly over a 14-day period , and angiogenesis was completely blocked in mice that received 10 mg/kg Taxotere . The in vivo data further suggested that Taxotere had selectivity for endothelial cell migration and/or microvessel formation because infiltration of inflammatory cells into the Matrigel plug was much less sensitive to inhibition by Taxotere . In conclusion , Taxotere is a potent and potentially specific inhibitor of endothelial cell migration in vitro and angiogenesis in vitro and in vivo .""",

"""Ataxia-telangiectasia mutated ( ATM ) is a high molecular weight protein serine/threonine kinase that plays a central role in the maintenance of genomic integrity by activating cell cycle checkpoints and promoting repair of DNA double-strand breaks . Little is known about the regulatory mechanisms for ATM expression itself . MicroRNAs are naturally existing regulators that modulate gene expression in a sequence-specific manner . Here , we show that a human microRNA , miR-421 , suppresses ATM expression by targeting the 3'-untranslated region ( 3'UTR ) of ATM transcripts . Ectopic expression of miR-421 resulted in S-phase cell cycle checkpoint changes and an increased sensitivity to ionizing radiation , creating a cellular phenotype similar to that of cells derived from ataxia-telangiectasia ( A-T ) patients . Blocking the interaction between miR-421 and ATM 3'UTR with an antisense morpholino oligonucleotide rescued the defective phenotype caused by miR-421 overexpression , indicating that ATM mediates the effect of miR-421 on cell cycle checkpoint and radiosensitivity . Overexpression of the N-Myc transcription factor , an oncogene frequently amplified in neuroblastoma , induced miR-421 expression , which , in turn , down-regulated ATM expression , establishing a linear signaling pathway that may contribute to N-Myc-induced tumorigenesis in neuroblastoma . Taken together , our findings implicate a previously undescribed regulatory mechanism for ATM expression and ATM-dependent DNA damage response and provide several potential targets for treating neuroblastoma and perhaps A-T .""",

"""PTEN loss of function enhances proliferation , but effects on cellular energy metabolism are less well characterized . We used an inducible PTEN expression vector in a PTEN-null glioma cell line to examine this issue . While proliferation of PTEN-positive cells was insensitive to increases in glucose concentration beyond 2.5mM , PTEN-null cells significantly increased proliferation with increasing glucose concentration across the normal physiologic range to approximately 10mM , coinciding with a shift to glycolysis and "" glucose addiction "" . This demonstrates that the impact of loss of function of PTEN is modified by glucose concentration , and may be relevant to epidemiologic results linking hyperglycemia to cancer risk and cancer mortality ."""
]
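
Python silently concatenates adjacent string literals, so a missing comma between two abstracts merges them into a single list element without raising an error. A minimal sanity check, assuming the list is meant to hold one abstract per element, might look like the sketch below. `check_abstracts` and the miniature lists are illustrative names, not part of the notebook.

```python
def check_abstracts(texts, expected_count):
    """Raise if the list does not hold exactly `expected_count` non-empty strings."""
    if len(texts) != expected_count:
        raise ValueError(f"expected {expected_count} abstracts, found {len(texts)}")
    for i, text in enumerate(texts):
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"abstract {i} is empty or not a string")
    return True

# Hypothetical miniature example: the missing comma after "first" merges two items.
merged = ["first" "second", "third"]
separate = ["first", "second", "third"]
print(len(merged), len(separate))
```

In the notebook this could be called as `check_abstracts(text_list, n)`, with `n` set to the number of abstracts you intend to classify.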

In [5]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings()\
    .setInputCols(["document", "word_embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_hoc", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        multi_classifier_dl
    ])


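# No stage needs training, so fitting on an empty DataFrame only assembles the PipelineModel
pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
# LightPipeline wraps the fitted model for fast inference on small, in-memory inputs
light_model = LightPipeline(pipeline_model)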


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
multiclassifierdl_hoc download started this may take some time.
Approximate size to download 11.2 MB
[OK!]
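
`LightPipeline.annotate` runs the fitted pipeline on a plain Python string and returns a dictionary keyed by output column name, which is convenient for classifying a single abstract without building a DataFrame. The sketch below assumes that return shape and uses a hand-written, abridged dictionary in place of a live call, so it runs without a Spark session. `labels_from_annotation` is a hypothetical helper, not part of Spark NLP.

```python
def labels_from_annotation(annotation, column="class"):
    """Return the sorted, de-duplicated labels from a LightPipeline.annotate() result."""
    return sorted(set(annotation.get(column, [])))

# With a live session this would be: annotation = light_model.annotate(text_list[0])
# Hand-written stand-in with the assumed shape (output column name -> list of strings):
annotation = {
    "document": ["Ghrelin was identified in the stomach (abridged)"],
    "class": ["Sustaining_Proliferative_Signaling"],
}
print(labels_from_annotation(annotation))
```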


In [6]:
df = spark.createDataFrame(pd.DataFrame({"text" : text_list}))

result = pipeline_model.transform(df)
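
Because the model is multilabel, each document's `class.result` value is a list that may hold zero, one, or several hallmark labels. Once predictions are collected to the driver, a per-label tally gives a quick overview. The sketch below is only an illustration: `predicted` is a hand-written stand-in that reuses label names from this notebook's output, and `count_labels` is a hypothetical helper.

```python
from collections import Counter

def count_labels(label_lists):
    """Count how many documents were assigned each label."""
    counts = Counter()
    for labels in label_lists:
        counts.update(set(labels))  # count each label at most once per document
    return counts

# With a live session (assumed usage):
#   predicted = [row[0] for row in result.select("class.result").collect()]
predicted = [
    ["Sustaining_Proliferative_Signaling"],
    ["Genomic_Instability_And_Mutation",
     "Evading_Growth_Suppressors",
     "Enabling_Replicative_Immortality"],
]
for label, n in count_labels(predicted).most_common():
    print(label, n)
```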

In [7]:
# Keep only the document text and the predicted hallmark labels
x = result.select("document.result", "class.result")
df = x.toDF('text', 'class')
df.show(truncate=100)

+----------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+
|                                                                                                text|                                                                                           class|
+----------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+
|[Ghrelin was identified in the stomach as an endogenous ligand specific for the growth hormone se...|                                                            [Sustaining_Proliferative_Signaling]|
|[Clones of mortal chicken fibroblasts and erythroblasts transformed by temperature-sensitive v-sr...|[Genomic_Instability_And_Mutation, Evading_Growth_Suppressors, Enabling_Replicative_Immortality]|
