![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/NER_CELLULAR.ipynb)

# `ner_cellular` **Models**

These pretrained Spark NLP for Healthcare models detect cell type, cell line, DNA, and RNA entities in text.

## 1. Colab Setup

**Import license keys**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels
# and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## 2. Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all JARs the user has access to
spark = jsl.start()

In [None]:
spark

## 3. Select the model and construct the pipeline

In [None]:
MODEL_LIST = ["ner_cellular",
              "ner_cellular_biobert"]

**Create the pipeline**

In [None]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")


# for clinical based model
embeddings_clinical = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

ner_cellular = medical.NerModel.pretrained("ner_cellular", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")



# for biobert based model
embeddings_biobert = nlp.BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

ner_cellular_biobert = medical.NerModel.pretrained("ner_cellular_biobert", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")



ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")



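from pyspark.ml import Pipeline
from pyspark.sql.types import StringType


def run_pipeline(model_name, sample_text):
    """Build the NER pipeline for `model_name` and annotate `sample_text`."""

    if model_name == "ner_cellular":
        # Clinical word embeddings feed the ner_cellular model
        ner_pipeline = Pipeline(stages = [document_assembler,
                                          tokenizer,
                                          embeddings_clinical,
                                          ner_cellular,
                                          ner_converter])

    else:
        # BioBERT embeddings feed the ner_cellular_biobert model
        ner_pipeline = Pipeline(stages = [document_assembler,
                                          tokenizer,
                                          embeddings_biobert,
                                          ner_cellular_biobert,
                                          ner_converter])

    # One row per sample text, in a single column named "text"
    text = spark.createDataFrame(sample_text, StringType()).toDF('text')
    result = ner_pipeline.fit(text).transform(text)

    return result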

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_cellular download started this may take some time.
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_cellular_biobert download started this may take some time.
[OK!]
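
The stages above are wired together purely through column names: each annotator reads the columns named in `setInputCols` and writes the one named in `setOutputCol`. The sketch below illustrates that wiring in plain Python, with no Spark required. `StageSpec` and `find_unresolved_inputs` are hypothetical helpers, not part of Spark NLP, and the column names are copied from the cell above.

```python
from typing import NamedTuple


class StageSpec(NamedTuple):
    """Hypothetical description of one stage: its name and column wiring."""
    name: str
    inputs: tuple
    output: str


# Column wiring of the ner_cellular pipeline, copied from the cell above.
CLINICAL_STAGES = [
    StageSpec("document_assembler", ("text",), "document"),
    StageSpec("tokenizer", ("document",), "token"),
    StageSpec("embeddings_clinical", ("document", "token"), "word_embeddings"),
    StageSpec("ner_cellular", ("document", "token", "word_embeddings"), "ner"),
    StageSpec("ner_converter", ("document", "token", "ner"), "ner_chunk"),
]


def find_unresolved_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs whose input is not produced upstream."""
    available = set(initial_columns)
    problems = []
    for stage in stages:
        for col in stage.inputs:
            if col not in available:
                problems.append((stage.name, col))
        # The stage's output becomes available to every later stage.
        available.add(stage.output)
    return problems


print(find_unresolved_inputs(CLINICAL_STAGES))
```

Because both embedding stages write to `word_embeddings`, the same check passes for the BioBERT variant when `embeddings_biobert` and `ner_cellular_biobert` are substituted.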


## 4. Create example inputs

In [None]:
sample_text = [
"""It remains open whether the growth retarding property of the EBNA2-oestrogen receptor fusion protein in B cell lymphoma lines is due to unphysiologically high expression of the chimeric protein or to interference with a cellular programme driving proliferation in these cell lines. Tissue-specific activity of the gammac chain gene promoter depends upon an Ets binding site and is regulated by GA-binding protein. The gammac chain is a subunit of multiple cytokine receptors (interleukin (IL) -2, IL-4, IL-7, IL-9, and IL-15), the expression of which is restricted to hematopoietic lineages. A defect in gammac leads to the X-linked severe combined immunodeficiency characterized by a block in T cell differentiation. In order to better characterize the human gammac promoter and define the minimal tissue-specific promoter region, progressive 5'-deletion constructs of a segment extending 1053 base pairs upstream of the major transcription start site were generated and tested for promoter activity in various hematopoietic and nonhematopoietic cell types.""",
"""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. """,
"""We have previously shown that NF-AT1 is constitutively active in Jurkat T cells stably transfected with the Tax cDNA, although the underlying molecular mechanism and physiological relevance of this finding remain unclear. In this report, we demonstrate that the active form of NF-AT1 is also present in the nuclei of HTLV-I-transformed T cells that express the Tax protein. Interestingly, the constitutive activation of NF-AT1 in these T cells is associated with its dephosphorylation. Furthermore, the dephosphorylated NF-AT1 can be rapidly rephosphorylated when the cells are incubated with cyclosporin A, an immunosuppressant inhibiting the serine/threonine phosphatase calcineurin. These results suggest that activation of NF-AT1 in Tax-expressing and HTLV-I-transformed T cells results from its dephosphorylation, which in turn may be due to deregulation of calcineurin Expression of NFAT-family proteins in normal human T cells.""",
"""To determine whether different cellular factors were involved in E3 regulation in lymphocytes as compared with HeLa cells, both DNA binding and transfection analysis with the E3 promoter in both cell types were performed. These studies detected two novel domains referred to as L1 and L2 with a variety of lymphoid but not HeLa extracts. Each of these domains possessed strong homology to motifs previously found to bind the cellular factor NF-kappa B. Transfections of E3 constructs linked to the chloramphenicol acetyltransferase gene revealed that mutagenesis of the distal NF-kappa B motif (L2) had minimal effects on promoter expression in HeLa cells, but resulted in dramatic decreases in expression by lymphoid cells. In contrast, mutagenesis of proximal NF-kappa B motif (L1) had minimal effects on gene expression in both HeLa cells and lymphoid cells but resulted in a small, but reproducible, increase in gene expression in lymphoid cells when coupled to the L2 mutation.""",
"""The gp160-induced AP-1 complex is dependent upon protein tyrosine phosphorylation and is protein synthesis-independent. This stimulation can also be abolished by inhibitors of protein kinase C, but it is unaffected by calcium channel blocker or cyclosporine A. This gp160 treatment adversely affects the functional capabilities of T cells: pre-treatment of CD4+ T cells with gp160 for 4 h at 37 degrees C inhibited anti-CD3-induced interleukin-2 secretion. Effects similar to gp160 were seen with anti-CD4 mAb. The aberrant activation of AP-1 by gp160 in CD4 positive T cells could result in up-regulation of cytokines containing AP-1 sites, e.g. interleukin-3 and granulocyte macrophage colony-stimulating factor, and concurrently lead to T cell unresponsiveness by inhibiting interleukin-2 secretion.""",
]

In [None]:
from pyspark.sql.types import StringType

text = spark.createDataFrame(sample_text, StringType()).toDF('text')

text.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|It remains open whether the growth retarding property of the EBNA2-oestrogen receptor fusion prot...|
|Detection of various other intracellular signaling proteins is also described. Genetic characteri...|
|We have previously shown that NF-AT1 is constitutively active in Jurkat T cells stably transfecte...|
|To determine whether different cellular factors were involved in E3 regulation in lymphocytes as ...|
|The gp160-induced AP-1 complex is dependent upon protein tyrosine phosphorylation and is protein ...|
+----------------------------------------------------------------------------------------------------+
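
`show(truncate = 100)` shortens any cell longer than 100 characters to its first 97 characters followed by `...`, which is why every row above ends in an ellipsis. Here is a plain-Python sketch of that rule, for illustration only; `truncate_cell` is a hypothetical helper, not a Spark function.

```python
def truncate_cell(value: str, truncate: int = 100) -> str:
    """Mimic how DataFrame.show() shortens long string cells."""
    if truncate <= 0 or len(value) <= truncate:
        return value
    if truncate < 4:
        # Too short to fit an ellipsis, so the string is simply cut.
        return value[:truncate]
    return value[: truncate - 3] + "..."


# Opening of the first sample text.
first_sentence = (
    "It remains open whether the growth retarding property of the "
    "EBNA2-oestrogen receptor fusion protein in B cell lymphoma lines"
)

print(truncate_cell(first_sentence))
```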



## 5. Use the pipeline to create outputs

In [None]:
import pyspark.sql.functions as F

for model_name in MODEL_LIST:

    result = run_pipeline(model_name, sample_text)

    print(f"\n*******{model_name}********")

    # Zip the parallel chunk arrays, then explode them into one row per chunk
    result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                         result.ner_chunk.begin,
                                         result.ner_chunk.end,
                                         result.ner_chunk.metadata)).alias("cols"))\
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']['entity']").alias("entity")).show()


*******ner_cellular********
+--------------------+-----+---+---------+
|               chunk|begin|end|   entity|
+--------------------+-----+---+---------+
|EBNA2-oestrogen r...|   61| 99|  protein|
|B cell lymphoma l...|  104|124|cell_line|
|    chimeric protein|  177|192|  protein|
|          cell lines|  270|279|cell_line|
|gammac chain gene...|  314|339|      DNA|
|    Ets binding site|  357|372|      DNA|
|  GA-binding protein|  394|411|  protein|
|        gammac chain|  418|429|  protein|
|multiple cytokine...|  447|473|  protein|
|         interleukin|  476|486|  protein|
|                IL-4|  497|500|  protein|
|                IL-7|  503|506|  protein|
|                IL-9|  509|512|  protein|
|               IL-15|  519|523|  protein|
|hematopoietic lin...|  568|589|cell_type|
|              gammac|  604|609|  protein|
|human gammac prom...|  754|774|      DNA|
|minimal tissue-sp...|  791|829|      DNA|
|5'-deletion const...|  844|865|      DNA|
|1053 base pairs u...|  8
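
The `begin` and `end` columns are character offsets into the input text, and `end` is inclusive: the first chunk spans offsets 61 to 99, which is 39 characters. A minimal sketch in plain Python, using the opening of the first sample text and the first three rows of the table above (`chunk_text` is a hypothetical helper):

```python
# Opening of the first sample text, long enough to cover the first three chunks.
text = (
    "It remains open whether the growth retarding property of the "
    "EBNA2-oestrogen receptor fusion protein in B cell lymphoma lines "
    "is due to unphysiologically high expression of the chimeric protein"
)

# (begin, end, entity) copied from the ner_cellular output above.
rows = [(61, 99, "protein"), (104, 124, "cell_line"), (177, 192, "protein")]


def chunk_text(source: str, begin: int, end: int) -> str:
    """Return the chunk for inclusive character offsets [begin, end]."""
    return source[begin : end + 1]


for begin, end, entity in rows:
    print(f"{entity:>9}: {chunk_text(text, begin, end)}")
```

Slicing with `end + 1` recovers the untruncated chunk text that `show()` abbreviates in the table.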

## 6. Visualize results

In [None]:
from sparknlp_display import NerVisualizer

ner_viz = NerVisualizer()

for model_name in MODEL_LIST:

    result = run_pipeline(model_name, sample_text)
    print(f"\n\n******************{model_name}************************\n")

    # Collect once per model instead of re-running the pipeline for every row
    rows = result.collect()
    for row in rows:
        ner_viz.display(result = row, label_col = "ner_chunk")
        print("\n\n")



******************ner_cellular************************

******************ner_cellular_biobert************************