![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/dl-ner/ner_graph_builder.ipynb)

# Building Graphs for NER

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import os
import json
import pandas as pd
import numpy as np


from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F
from pyspark.sql import types as T

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

from sparknlp.training import CoNLL

### Prerequisites for TFNerDLGraphBuilder

This annotator only works in Python because it needs to build a TensorFlow graph. `TFNerDLGraphBuilder` requires these packages:
1. TensorFlow 2.x or 1.15
2. TensorFlow Addons

In [None]:
pip install tensorflow-addons

In addition, we need to set the `GraphFolder` parameter to the location where the graph will be stored. There are three options:
- Local File System: `/home/my_user/ner_graphs/`
- Distributed File System: `hdfs://my_cluster/my_path/ner_graphs` or `dbfs:/my_databricks_path/ner_graphs`
- S3: `s3://my_bucket/my_path/ner_graphs`
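
The three kinds of location above differ only in their URI scheme. As a minimal sketch, assuming we want to know up front whether a path will need the AWS settings, a small helper can classify a graph folder. `storage_kind` is a hypothetical name used for illustration; it is not part of Spark NLP.

```python
from urllib.parse import urlparse


# Hypothetical helper (not a Spark NLP API): classify a graph folder path
# by its URI scheme so we know which storage backend it points to.
def storage_kind(graph_folder: str) -> str:
    scheme = urlparse(graph_folder).scheme
    if scheme == "s3":
        return "s3"
    if scheme in ("hdfs", "dbfs"):
        return "distributed"
    # No scheme (plain path) or anything else is treated as local here.
    return "local"


for path in [
    "/home/my_user/ner_graphs/",
    "hdfs://my_cluster/my_path/ner_graphs",
    "dbfs:/my_databricks_path/ner_graphs",
    "s3://my_bucket/my_path/ner_graphs",
]:
    print(path, "->", storage_kind(path))
```

Only paths classified as `s3` need the credentials and region settings described next.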

When storing the graph on S3, we need to provide the AWS credentials and region when starting the Spark session, as shown below:

In [None]:
spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1") \
    .config("spark.jsl.settings.aws.credentials.access_key_id", "MY_ACCESS_KEY_ID") \
    .config("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") \
    .config("spark.jsl.settings.aws.credentials.session_token", "MY_SESSION_TOKEN") \
    .config("spark.jsl.settings.aws.region", "MY_AWS_REGION") \
    .getOrCreate()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.3.0
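
Hardcoding credentials in a notebook is easy to leak. A minimal sketch of an alternative, assuming the credentials are exported through the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`), collects only the values that are set and maps them to the `spark.jsl.settings.aws.*` keys used in the session builder. `aws_spark_conf` is a hypothetical helper, not a Spark NLP function.

```python
import os

# Spark config keys as used in the session builder; the environment variable
# names are the standard AWS ones and are an assumption about your setup.
AWS_CONF_KEYS = {
    "spark.jsl.settings.aws.credentials.access_key_id": "AWS_ACCESS_KEY_ID",
    "spark.jsl.settings.aws.credentials.secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "spark.jsl.settings.aws.credentials.session_token": "AWS_SESSION_TOKEN",
    "spark.jsl.settings.aws.region": "AWS_DEFAULT_REGION",
}


def aws_spark_conf(env=None):
    """Return {spark_conf_key: value} for every AWS variable that is set."""
    env = os.environ if env is None else env
    return {key: env[var] for key, var in AWS_CONF_KEYS.items() if env.get(var)}


# Usage sketch (commented out because it needs a pyspark installation):
# builder = SparkSession.builder.appName("SparkNLP").master("local[*]")
# for key, value in aws_spark_conf().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

Unset variables are skipped, so the same code works for local runs where no AWS settings are needed.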


Please check how to start a Spark session with Spark NLP for your environment [here](https://github.com/JohnSnowLabs/spark-nlp#usage).

We use a variable to hold the location where the graph will be generated. This example uses a local folder and leaves the S3 path commented out, but any local, HDFS, DBFS or S3 path can be used.

In [None]:
# graph_folder = "s3://my_bucket/my_path/ner_graphs"
graph_folder = "ner_graphs"

### Prepare NER training and test data

In [None]:
conll = CoNLL()

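# For this demo, the training and test sets are both read from the same file
# (eng.testa) and capped at 1000 rows each.
train_data = conll.readDataset(spark=spark, path="./eng.testa").limit(1000)
test_data = conll.readDataset(spark=spark, path="./eng.testa").limit(1000)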

embeddings = WordEmbeddingsModel \
    .pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

test_data_parquet_path = "./tmp/test_data_parquet"

embeddings.transform(test_data).write.mode("overwrite").parquet(test_data_parquet_path)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
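
To make the `label` column more concrete, here is a rough picture of the CoNLL-2003 file layout: one token per line with four whitespace-separated fields (token, POS tag, chunk tag, NER label), blank lines between sentences, and `-DOCSTART-` lines between documents. The parser below is purely illustrative and is not how `sparknlp.training.CoNLL` is implemented; the sample text is a shortened, made-up snippet in that style.

```python
# Illustrative parser for CoNLL-2003 style text. This is NOT the Spark NLP
# implementation, only a picture of the file layout.
def parse_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, label = line.split()
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = parse_conll(sample)
for sentence in sentences:
    print(sentence)
```

Each sentence becomes a list of `(token, label)` pairs, which is the information the NER model is trained on.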


### Pipeline with TFNerDLGraphBuilder

We define `TFNerDLGraphBuilder` to generate the graph and store it in the selected folder.

In [None]:
graph_builder = TFNerDLGraphBuilder()\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setGraphFolder(graph_folder)\
    .setHiddenUnitsNumber(20)
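
With `setGraphFile("auto")`, the graph has to be sized from the data. A rough sketch, assuming the relevant dimensions are the number of distinct labels, the embedding dimension, the number of distinct characters, and the hidden units value set above: the function below computes these from toy data in plain Python. The function name, the dictionary keys, and the exact counting rules are assumptions made for illustration; the real builder reads these values from the Spark columns and may count them differently.

```python
# Rough sketch only: names and counting rules here are assumptions,
# not the TFNerDLGraphBuilder implementation.
def graph_dimensions(sentences, embedding_vectors, hidden_units):
    labels = {label for sent in sentences for _, label in sent}
    chars = {ch for sent in sentences for token, _ in sent for ch in token}
    return {
        "ntags": len(labels),                        # distinct NER labels
        "embeddings_dim": len(embedding_vectors[0]), # width of one word vector
        "nchars": len(chars),                        # distinct characters in tokens
        "lstm_size": hidden_units,                   # value passed to setHiddenUnitsNumber
    }


toy_sentences = [
    [("EU", "B-ORG"), ("rejects", "O")],
    [("Peter", "B-PER")],
]
toy_vectors = [[0.0] * 100]  # stand-in for a 100-dimensional embedding

dims = graph_dimensions(toy_sentences, toy_vectors, hidden_units=20)
print(dims)
```

The point of the sketch is that a graph built for one dataset and embedding size will generally not fit another, which is why the builder runs as a pipeline stage before training.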

Then, we use `NerDLApproach` and let it use the graph generated by the builder.

In [None]:
ner_dl = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5) \
    .setLr(0.003) \
    .setBatchSize(8) \
    .setRandomSeed(0) \
    .setVerbose(1) \
    .setEvaluationLogExtended(False) \
    .setEnableOutputLogs(False) \
    .setIncludeConfidence(True) \
    .setTestDataset(test_data_parquet_path) \
    .setGraphFolder(graph_folder)

Put the pipeline together

In [None]:
ner_pipeline = sparknlp.base.Pipeline().setStages([
    embeddings,
    graph_builder,
    ner_dl
])

Fit the pipeline on the training data

In [None]:
ner_pipeline.fit(train_data)
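
After fitting, it can be useful to confirm that a graph file was actually written. A minimal sketch, assuming a local graph folder and that the generated TensorFlow graphs are saved with a `.pb` extension: `list_graph_files` is a hypothetical helper, and it does not cover S3, HDFS or DBFS locations.

```python
from pathlib import Path


# Hypothetical helper: assumes a local folder and graph files saved as *.pb.
def list_graph_files(graph_folder):
    folder = Path(graph_folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.pb"))


# "ner_graphs" is the local folder used earlier in this notebook.
print(list_graph_files("ner_graphs"))
```

An empty list means either the folder does not exist or no graph file was produced in it.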