![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/17.0.Graph_builder_for_DL_models.ipynb)

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all jars the user has access to
spark = nlp.start()

👌 Detected license file /content/5.0.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


In [None]:
!pip install -q tensorflow==2.7.0
!pip install tensorflow-addons

In [6]:
import tensorflow
print('Graph Version :', tensorflow.version.GRAPH_DEF_VERSION)
print('TF Version    :', tensorflow.version.VERSION)

Graph Version : 1395
TF Version    : 2.12.0


# TF Graph Builder

The `TFGraphBuilder` annotator creates TensorFlow graphs as part of the model training pipeline. It inspects the data and builds a suitable graph, provided a compatible version of TensorFlow (<= 2.7) is available. The graph is stored in the specified folder and loaded by the training approach.

You can use this builder with `MedicalNerApproach`, `RelationExtractionApproach`, `AssertionDLApproach`, and `GenericClassifierApproach`.

**ATTENTION:** Changing the parameters of `TFGraphBuilder` may affect the performance of the model you train.
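Because the builder depends on the installed TensorFlow version, it can help to check that version before adding `TFGraphBuilder` to a pipeline. The snippet below is a minimal sketch that assumes the `<= 2.7` bound stated above; `tf_version_supported` is a hypothetical helper for illustration, not part of the John Snow Labs API.

```python
def tf_version_supported(version: str, max_supported=(2, 7)) -> bool:
    """Return True if a 'major.minor[.patch]' version is <= max_supported.

    Hypothetical helper: the (2, 7) default is taken from the note above.
    """
    # Compare only the major and minor components as an integer tuple.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= max_supported


# The two TensorFlow versions that appear in this notebook.
print(tf_version_supported("2.7.0"))   # True
print(tf_version_supported("2.12.0"))  # False
```

In a live session the argument would be `tensorflow.version.VERSION`.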

In [7]:
graph_folder = "./medical_graphs"

## **NER_DL**

**Create a Medical NER graph.**

In [8]:
med_ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # set to False when training with NerDLApproach

**Train the model with `MedicalNerApproach` and let it use the graph generated by the builder**

```python
...
med_ner = medical.NerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5) \
    .setLr(0.003) \
    .setBatchSize(8) \
    .setRandomSeed(0) \
    .setVerbose(1) \
    .setEvaluationLogExtended(False) \
    .setEnableOutputLogs(False) \
    .setIncludeConfidence(True) \
    .setEarlyStoppingCriterion(0.5) \
    .setEarlyStoppingPatience(2) \
    .setTestDataset(test_data_parquet_path) \
    .setGraphFolder(graph_folder)

medner_pipeline = nlp.base.Pipeline().setStages([
    embeddings,
    med_ner_graph_builder,
    med_ner    
])
```

## **AssertionDL**

**Create an Assertion DL graph.**

In [9]:
assertion_graph_builder = medical.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(100)\
    .setHiddenUnitsNumber(16)

**Train the model with `AssertionDLApproach` and let it use the graph generated by the builder**

```python
...
assertion_status = medical.AssertionDLApproach() \
    .setGraphFolder(graph_folder) \
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setInputCols("sentence", "chunk", "embeddings") \
    .setOutputCol("assertion") \
    .setStartCol("start") \
    .setEndCol("end") \
    .setLabelCol("label") \
    .setLearningRate(0.01) \
    .setDropout(0.15) \
    .setBatchSize(16) \
    .setEpochs(3) \
    .setScopeWindow([9, 15])\
    .setValidationSplit(0.2) \
    .setIncludeConfidence(True)
    
assertion_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        POSTag,
        chunker,
        embeddings,
        assertion_graph_builder,
        assertion_status])
```

## **GenericClassifier**

**Create a Generic Classifier graph and use the graph builder in the training pipeline.**

In [10]:
gcf_graph_builder = medical.TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("class")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph.pb")\
    .setHiddenLayers([10, 5, 3])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(False)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

```python
...
gcf_approach = medical.GenericClassifierApproach()\
    .setLabelColumn("class")\
    .setInputCols(["feature_vector"])\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph.pb")\
    .setEpochsNumber(5)\
    .setBatchSize(100)\
    .setFeatureScaling("zscore")\
    .setFixImbalance(True)\
    .setLearningRate(0.001)

gcf_pipeline = nlp.Pipeline(stages=[
    features_asm,
    gcf_graph_builder,
    gcf_approach])
```

## **RelationExtraction**

**Create a Relation Extraction graph.**

In [11]:
re_graph_builder = medical.TFGraphBuilder()\
    .setModelName("relation_extraction")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
    .setLabelColumn("target_rel")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("re_graph.pb")\
    .setHiddenLayers([20, 10])\
    .setHiddenAct("sigmoid")\
    .setHiddenActL2(True)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

```python
...
re_approach = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("target_rel")\
    .setEpochsNumber(20)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("from_begin", "from_end", "from_label")\
    .setToEntity("to_begin", "to_end", "to_label")

re_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    re_approach])

```

# Generating Custom Graphs

In [12]:
medical.tf_graph.get_models()

['ner_dl',
 'generic_classifier',
 'assertion_dl',
 'relation_extraction',
 'logreg_classifier',
 'svm_classifier',
 'fewshot_classifier']

## **NER_DL**

In [13]:
medical.tf_graph.print_model_params("ner_dl")

ner_dl parameters.
Parameter            Required   Default value        Description
ntags                yes        -                    Number of tags.
embeddings_dim       no         200                  Embeddings dimension.
nchars               no         100                  Number of chars.
lstm_size            no         128                  Number of LSTM units.
gpu_device           no         0                    Device for training.
is_medical           no         0                    Build a Medical Ner graph.


In [14]:
medical.tf_graph.build("ner_dl",
               build_params={"embeddings_dim": 200,
                             "nchars": 80,
                             "ntags": 12,
                             "is_medical": 1},
               model_location="./medical_ner_graphs",
               model_filename="auto")

Instructions for updating:
non-resource variables are not supported in the long term
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


ner_dl graph exported to ./medical_ner_graphs/blstm_12_200_128_80.pb
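The exported name appears to encode the build parameters: `12` tags, `200` embedding dimensions, `128` LSTM units (the default), and `80` chars. The sketch below reproduces that name so a training script can predict which file `model_filename="auto"` will write. The pattern is an assumption inferred from the single export message above, and `expected_ner_graph_name` is a hypothetical helper, not a library function.

```python
def expected_ner_graph_name(ntags, embeddings_dim=200, lstm_size=128, nchars=100):
    """Guess the file name written by model_filename="auto" for ner_dl.

    Assumption: the pattern blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb
    is inferred from the export message above, not from documentation.
    Defaults mirror the parameter table printed by print_model_params("ner_dl").
    """
    return f"blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb"


# Same parameters as the build call above.
print(expected_ner_graph_name(ntags=12, nchars=80))
# blstm_12_200_128_80.pb
```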


## **AssertionDL**

In [15]:
medical.tf_graph.print_model_params("assertion_dl")


assertion_dl parameters.
Parameter            Required   Default value        Description
max_seq_len          no         250                  Maximum sequence length.
feat_size            no         200                  Feature size.
n_classes            yes        -                    Number of classes.
device               no         /cpu:0               Device for training.
n_hidden             no         34                   Number of hidden units.


In [16]:
medical.tf_graph.build("assertion_dl",
               build_params={"n_classes": 10},
               model_location="./assertion_graph",
               model_filename="auto")

Device mapping: no known devices.


Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API


Device mapping: no known devices.
assertion_dl graph exported to ./assertion_graph/blstm_34_32_30_200_10.pb


## **GenericClassifier**

In [17]:
medical.tf_graph.print_model_params("generic_classifier")

generic_classifier parameters.
Parameter            Required   Default value        Description
hidden_layers        no         [200]                List of integers indicating the size of each hidden layer. For example: [100, 200, 300].
input_dim            yes        -                    Input dimension.
output_dim           yes        -                    Output dimension.
hidden_act           no         relu                 Activation function of hidden layers: relu, sigmoid, tanh or linear.
hidden_act_l2        no         0                    L2 regularization of hidden layer activations. Boolean (0 or 1).
hidden_weights_l2    no         0                    L2 regularization of hidden layer weights. Boolean (0 or 1).
batch_norm           no         0                    Batch normalization. Boolean (0 or 1).
output_act           no         softmax              Output activation function: softmax, sigmoid or linear.
loss_func            no         cross_entropy        Loss function
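The table above lists which parameters are required and which values the activation parameters accept. As a minimal sketch, a `build_params` dictionary could be checked against that table before calling `medical.tf_graph.build`. The helper `check_build_params` is hypothetical and only encodes what the printed table shows; the library may perform its own validation.

```python
# Constraints copied from the parameter table printed above.
REQUIRED = {"input_dim", "output_dim"}
ALLOWED = {
    "hidden_act": {"relu", "sigmoid", "tanh", "linear"},
    "output_act": {"softmax", "sigmoid", "linear"},
}
KNOWN = REQUIRED | set(ALLOWED) | {
    "hidden_layers", "hidden_act_l2", "hidden_weights_l2", "batch_norm", "loss_func",
}


def check_build_params(build_params):
    """Return a list of problems found in build_params (empty if none).

    Hypothetical helper for illustration; not part of the library.
    """
    problems = []
    for name in sorted(REQUIRED - set(build_params)):
        problems.append(f"missing required parameter: {name}")
    for name in sorted(set(build_params) - KNOWN):
        problems.append(f"unknown parameter: {name}")
    for name, allowed in ALLOWED.items():
        if name in build_params and build_params[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems


params = {"input_dim": 100, "output_dim": 10,
          "hidden_layers": [300, 200, 100], "hidden_act": "tanh"}
print(check_build_params(params))  # []
```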

In [18]:
medical.tf_graph.build("generic_classifier",
               build_params={"input_dim": 100,
                             "output_dim": 10,
                             "hidden_layers": [300, 200, 100],
                             "hidden_act": "tanh"},
               model_location="generic_graph",
               model_filename="auto")

generic_classifier graph exported to generic_graph/gcl.100.10.pb


## **RelationExtraction**

In [19]:
medical.tf_graph.print_model_params("relation_extraction")

relation_extraction parameters.
Parameter            Required   Default value        Description
hidden_layers        no         [200]                List of integers indicating the size of each hidden layer. For example: [100, 200, 300].
input_dim            yes        -                    Input dimension.
output_dim           yes        -                    Output dimension.
hidden_act           no         relu                 Activation function of hidden layers: relu, sigmoid, tanh or linear.
hidden_act_l2        no         0                    L2 regularization of hidden layer activations. Boolean (0 or 1).
hidden_weights_l2    no         0                    L2 regularization of hidden layer weights. Boolean (0 or 1).
batch_norm           no         0                    Batch normalization. Boolean (0 or 1).
output_act           no         softmax              Output activation function: softmax, sigmoid or linear.
loss_func            no         cross_entropy        Loss function

In [20]:
medical.tf_graph.build("relation_extraction",
               build_params={"input_dim": 6000,
                             "output_dim": 3,
                             'batch_norm':1,
                             "hidden_layers": [300, 200],
                             "hidden_act": "relu",
                             'hidden_act_l2':1},
               model_location="relation_graph",
               model_filename="re_with_BN.pb")

Instructions for updating:
Colocations handled automatically by placer.


relation_extraction graph exported to relation_graph/re_with_BN.pb
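After the builds above, the exported graphs sit in four different folders. The following sketch lists every `.pb` file found in them, which can be a quick check before pointing an approach at a graph folder or model file. `list_graph_files` is a hypothetical helper that uses only the standard library; the folder names are the ones passed as `model_location` above.

```python
from pathlib import Path


def list_graph_files(*folders):
    """Return the .pb files found directly inside the given folders.

    Folders that do not exist are skipped silently.
    """
    found = []
    for folder in folders:
        path = Path(folder)
        if path.is_dir():
            found.extend(sorted(path.glob("*.pb")))
    return found


# Folder names used as model_location in the build calls above.
for graph in list_graph_files("./medical_ner_graphs", "./assertion_graph",
                              "generic_graph", "relation_graph"):
    print(graph)
```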
