![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/17.Graph_builder_for_DL_models.ipynb)

If you are using the `johnsnowlabs` library, please use the [17.0.Graph_builder_for_DL_models](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/17.0.Graph_builder_for_DL_models.ipynb) notebook instead.

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


In [None]:
!pip install -q tensorflow==2.12.0
!pip install -q tensorflow_addons

In [None]:
import tensorflow
print('Graph Version :', tensorflow.version.GRAPH_DEF_VERSION)
print('TF Version    :', tensorflow.version.VERSION)

Graph Version : 2129
TF Version    : 2.19.0


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# **TF Graph Builder**

The `TFGraphBuilder` annotator creates TensorFlow graphs inside the model training pipeline. It inspects the data and builds a matching graph, provided a suitable version of TensorFlow (>= 2.8) is available. The graph is saved to the configured folder and then loaded by the training approach.

You can use this builder with `MedicalNerApproach`, `RelationExtractionApproach`, `AssertionDLApproach`, and `GenericClassifierApproach`.
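The TensorFlow version requirement can be checked before building any graph. The following is a minimal sketch in plain Python; the helper name `meets_min_version` is hypothetical and not part of Spark NLP, and the default minimum of 2.8 is taken from the description of `TFGraphBuilder` in this notebook.

```python
import re


def meets_min_version(version: str, minimum=(2, 8)) -> bool:
    """Return True if a 'major.minor[.patch]' version string is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (2, 12) >= (2, 8) is True.
    return (major, minor) >= tuple(minimum)


print(meets_min_version("2.12.0"))  # True
print(meets_min_version("2.7.4"))   # False
```

In this notebook, the string to check would be `tensorflow.version.VERSION`, which is printed in an earlier cell.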

**ATTENTION:** Changing the parameters of `TFGraphBuilder` may affect the performance of the model you train.

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "./medical_graphs"

## **NER_DL**

**Create a Medical NER graph.**

In [None]:
med_ner_graph_builder = TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach

**Train the model with `MedicalNerApproach` and let it use the graph generated by the builder**

```python
...
med_ner = MedicalNerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5) \
    .setLr(0.003) \
    .setBatchSize(8) \
    .setRandomSeed(0) \
    .setVerbose(1) \
    .setEvaluationLogExtended(False) \
    .setEnableOutputLogs(False) \
    .setIncludeConfidence(True) \
    .setEarlyStoppingCriterion(0.5) \
    .setEarlyStoppingPatience(2) \
    .setTestDataset(test_data_parquet_path) \
    .setGraphFolder(graph_folder)

medner_pipeline = Pipeline().setStages([
    embeddings,
    med_ner_graph_builder,
    med_ner
])
```

## **AssertionDL**

**Create an Assertion DL graph.**

In [None]:
assertion_graph_builder = TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(100)\
    .setHiddenUnitsNumber(16)

**Train the model with `AssertionDLApproach` and let it use the graph generated by the builder**

```python
...
assertion_status = AssertionDLApproach() \
    .setGraphFolder(graph_folder) \
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setInputCols("sentence", "chunk", "embeddings") \
    .setOutputCol("assertion") \
    .setStartCol("start") \
    .setEndCol("end") \
    .setLabelCol("label") \
    .setLearningRate(0.01) \
    .setDropout(0.15) \
    .setBatchSize(16) \
    .setEpochs(3) \
    .setScopeWindow([9, 15])\
    .setValidationSplit(0.2) \
    .setIncludeConfidence(True)
    
assertion_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        POSTag,
        chunker,
        embeddings,
        assertion_graph_builder,
        assertion_status])
```

## **GenericClassifier**

**Create the Generic Classifier pipeline with a graph builder in it.**

In [None]:
gcf_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("class")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph.pb")\
    .setHiddenLayers([10, 5, 3])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(False)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

```python
...
gcf_approach = GenericClassifierApproach()\
    .setLabelColumn("class")\
    .setInputCols(["feature_vector"])\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph.pb")\
    .setEpochsNumber(5)\
    .setBatchSize(100)\
    .setFeatureScaling("zscore")\
    .setFixImbalance(True)\
    .setLearningRate(0.001)

gcf_pipeline = Pipeline(stages=[
    features_asm,
    gcf_graph_builder,
    gcf_approach])
```

## **RelationExtraction**

**Create RE graph builder**

In [None]:
re_graph_builder = TFGraphBuilder()\
    .setModelName("relation_extraction")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
    .setLabelColumn("target_rel")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("re_graph.pb")\
    .setHiddenLayers([20, 10])\
    .setHiddenAct("sigmoid")\
    .setHiddenActL2(True)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

```python
...
re_approach = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("target_rel")\
    .setEpochsNumber(20)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("from_begin", "from_end", "from_label")\
    .setToEntity("to_begin", "to_end", "to_label")

re_pipeline = Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    re_approach])

```

# **MedicalNerDLGraphChecker**

`MedicalNerDLGraphChecker` processes the dataset to extract the parameters that the graph requires (tokens, labels, embedding dimensions).

In [None]:
embeddings = (WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
            .setInputCols(["splitter", "token"])
            .setOutputCol("embeddings"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [None]:
nerDLGraphChecker = MedicalNerDLGraphChecker()\
    .setInputCols(["splitter", "token"])\
    .setLabelColumn("ner_label")\
    .setEmbeddingsModel(embeddings)

```python
nerTagger = MedicalNerApproach()\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setLabelColumn("ner_label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setBatchSize(8)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setEarlyStoppingCriterion(0.01)\
    .setEarlyStoppingPatience(5)\
    .setUseBestModel(False)\
    #.setTestDataset("./data/test_df.parquet")\
    #.setEnableMemoryOptimizer(True) #>> if you have a limited memory and a large conll file, you can set this True to train batch by batch
    #.setDatasetInfo("NCBI_sample_short dataset") #You can add details regarding the dataset

ner_pipeline = Pipeline(
    stages=[
          nerDLGraphChecker,
          nerTagger
 ])

```
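To make the idea of "extracting graph parameters from the dataset" concrete, here is a small illustrative sketch in plain Python. It is not how `MedicalNerDLGraphChecker` is implemented; the function `summarize_ner_dataset` and the toy data are hypothetical, and the assumption is simply that the graph must accommodate the number of distinct labels, the number of distinct characters, and the embedding dimension.

```python
def summarize_ner_dataset(sentences, embedding_dim):
    """Collect dataset statistics that an NER graph has to accommodate.

    `sentences` is a list of sentences, each a list of (token, label) pairs.
    """
    labels = set()
    chars = set()
    for sentence in sentences:
        for token, label in sentence:
            labels.add(label)
            chars.update(token)  # adds every character of the token
    return {
        "ntags": len(labels),
        "nchars": len(chars),
        "embeddings_dim": embedding_dim,
    }


# Hypothetical toy data, for illustration only.
toy_data = [
    [("Aspirin", "B-DRUG"), ("daily", "O")],
    [("chest", "B-PROBLEM"), ("pain", "I-PROBLEM")],
]
graph_params = summarize_ner_dataset(toy_data, embedding_dim=200)
print(graph_params)
```

The keys mirror the `ner_dl` parameter names (`ntags`, `nchars`, `embeddings_dim`) listed by `tf_graph.print_model_params("ner_dl")`.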

# **Generating Custom Graphs**

In [None]:
from sparknlp_jsl.training import tf_graph

# For sparknlp_jsl versions before 3.2.1, run the line below first
# %tensorflow_version 1.x

In [None]:
tf_graph.get_models()

['ner_dl',
 'generic_classifier',
 'assertion_dl',
 'relation_extraction',
 'logreg_classifier',
 'svm_classifier',
 'fewshot_classifier']

## **NER_DL**

In [None]:
tf_graph.print_model_params("ner_dl")

ner_dl parameters.
Parameter            Required   Default value        Description
ntags                yes        -                    Number of tags.
embeddings_dim       no         200                  Embeddings dimension.
nchars               no         100                  Number of chars.
lstm_size            no         128                  Number of LSTM units.
gpu_device           no         0                    Device for training.
is_medical           no         0                    Build a Medical Ner graph.


In [None]:
tf_graph.build("ner_dl",
               build_params={"embeddings_dim": 200,
                             "nchars": 80,
                             "ntags": 12,
                             "is_medical": 1},
               model_location="./medical_ner_graphs",
               model_filename="auto")
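The parameter table printed above can also be used to sanity-check `build_params` before calling `tf_graph.build`. The sketch below is a hypothetical helper, not part of `sparknlp_jsl`: the spec dictionary is copied by hand from the `ner_dl` table, and rejecting unknown keys is an assumption made here for safety, not documented behavior of the library.

```python
# Parameter spec copied from the print_model_params("ner_dl") output.
# None marks a required parameter that has no default value.
NER_DL_SPEC = {
    "ntags": None,
    "embeddings_dim": 200,
    "nchars": 100,
    "lstm_size": 128,
    "gpu_device": 0,
    "is_medical": 0,
}


def resolve_build_params(build_params, spec=NER_DL_SPEC):
    """Validate build_params against the spec and fill in the defaults."""
    unknown = sorted(set(build_params) - set(spec))
    if unknown:
        raise KeyError(f"Unknown parameters: {unknown}")
    missing = sorted(
        name for name, default in spec.items()
        if default is None and name not in build_params
    )
    if missing:
        raise ValueError(f"Missing required parameters: {missing}")
    # Start from the defaults, then apply the user-supplied overrides.
    resolved = {k: v for k, v in spec.items() if v is not None}
    resolved.update(build_params)
    return resolved


resolved = resolve_build_params(
    {"embeddings_dim": 200, "nchars": 80, "ntags": 12, "is_medical": 1}
)
print(resolved)
```

With the values used in the build call above, the resolved dictionary keeps `nchars` at 80 and falls back to the default `lstm_size` of 128.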

## **AssertionDL**

In [None]:
tf_graph.print_model_params("assertion_dl")


assertion_dl parameters.
Parameter            Required   Default value        Description
max_seq_len          no         250                  Maximum sequence length.
feat_size            no         200                  Feature size.
n_classes            yes        -                    Number of classes.
device               no         /cpu:0               Device for training.
n_hidden             no         34                   Number of hidden units.


In [None]:
tf_graph.build("assertion_dl",
               build_params={"n_classes": 10},
               model_location="./assertion_graph",
               model_filename="auto")

## **GenericClassifier**

In [None]:
tf_graph.print_model_params("generic_classifier")

generic_classifier parameters.
Parameter            Required   Default value        Description
hidden_layers        no         [200]                List of integers indicating the size of each hidden layer. For example: [100, 200, 300].
input_dim            yes        -                    Input dimension.
output_dim           yes        -                    Output dimension.
hidden_act           no         relu                 Activation function of hidden layers: relu, sigmoid, tanh or linear.
hidden_act_l2        no         0                    L2 regularization of hidden layer activations. Boolean (0 or 1).
hidden_weights_l2    no         0                    L2 regularization of hidden layer weights. Boolean (0 or 1).
batch_norm           no         0                    Batch normalization. Boolean (0 or 1).
output_act           no         softmax              Output activation function: softmax, sigmoid or linear.
loss_func            no         cross_entropy        Loss function

In [None]:
tf_graph.build("generic_classifier",
               build_params={"input_dim": 100,
                             "output_dim": 10,
                             "hidden_layers": [300, 200, 100],
                             "hidden_act": "tanh"},
               model_location="generic_graph",
               model_filename="auto")

generic_classifier graph exported to generic_graph/gcl.100.10.pb
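As a rough guide to how `input_dim`, `hidden_layers` and `output_dim` affect model size, the sketch below counts weights and biases for a stack of fully connected layers. It assumes the generated graph is a plain dense network with one bias per unit and ignores batch normalization and any other variables; the actual graph produced by `tf_graph.build` may differ, and the helper name is hypothetical.

```python
def dense_parameter_count(input_dim, hidden_layers, output_dim):
    """Count weights plus biases for consecutive fully connected layers."""
    sizes = [input_dim, *hidden_layers, output_dim]
    # Each layer has n_in * n_out weights and n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Values from the generic_classifier build call above.
print(dense_parameter_count(100, [300, 200, 100], 10))  # 111610
```

Under the same assumption, larger hidden layers or a larger `input_dim` increase the count roughly in proportion to the product of adjacent layer sizes.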


## **RelationExtraction**

In [None]:
tf_graph.print_model_params("relation_extraction")

relation_extraction parameters.
Parameter            Required   Default value        Description
hidden_layers        no         [200]                List of integers indicating the size of each hidden layer. For example: [100, 200, 300].
input_dim            yes        -                    Input dimension.
output_dim           yes        -                    Output dimension.
hidden_act           no         relu                 Activation function of hidden layers: relu, sigmoid, tanh or linear.
hidden_act_l2        no         0                    L2 regularization of hidden layer activations. Boolean (0 or 1).
hidden_weights_l2    no         0                    L2 regularization of hidden layer weights. Boolean (0 or 1).
batch_norm           no         0                    Batch normalization. Boolean (0 or 1).
output_act           no         softmax              Output activation function: softmax, sigmoid or linear.
loss_func            no         cross_entropy        Loss function

In [None]:
tf_graph.build("relation_extraction",
               build_params={"input_dim": 6000,
                             "output_dim": 3,
                             'batch_norm':1,
                             "hidden_layers": [300, 200],
                             "hidden_act": "relu",
                             'hidden_act_l2':1},
               model_location="relation_graph",
               model_filename="re_with_BN.pb")

Instructions for updating:
Colocations handled automatically by placer.


relation_extraction graph exported to relation_graph/re_with_BN.pb
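After a build, it can be useful to confirm which graph files actually exist in a folder before pointing an approach at them. The following is a minimal sketch using only the standard library; `list_graph_files` is a hypothetical helper, and the folder name is the `model_location` used in the build call above.

```python
from pathlib import Path


def list_graph_files(folder):
    """Return the names of the .pb files in `folder`, sorted alphabetically."""
    path = Path(folder)
    if not path.is_dir():
        return []  # folder not created yet, so no graphs to report
    return sorted(p.name for p in path.glob("*.pb"))


print(list_graph_files("relation_graph"))
```

The same call works for the other folders used in this notebook, such as `generic_graph` or `./medical_graphs`.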
