![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/dl-ner/ner_graphs_gcp.ipynb)

## Training NER with Graphs in GCP

In Spark NLP you can configure where the TensorFlow (TF) graphs used during NER training are stored. Starting with Spark NLP 5.1.0, this location can be a GCP Storage URI, an Azure Storage URI, or a distributed file system path such as HDFS or Databricks FS (DBFS).

In this notebook, we walk through the steps required to use an external GCP Storage URI as the graph folder when training an NER model.

To do this, we need to configure the Spark session with the required settings for Spark NLP and Spark ML.

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

### Spark NLP Settings

`project_id`: The ID of the GCP project that owns the storage. It is set in `spark.jsl.settings.gcp.project_id`.

To integrate with GCP, we need to set up Application Default Credentials (ADC). The official [GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc) explains how to configure them.

In [None]:
! gcloud auth application-default login

In [None]:
! ls /content/.config/application_default_credentials.json
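
The `ls` check above only confirms that the credentials file exists. As a minimal sketch, the file can also be opened from Python to see which kind of credential it holds. `describe_adc_file` is a hypothetical helper, not part of Spark NLP or the Google SDK, and it assumes the file is a JSON object with a top-level `type` field (typically `authorized_user` after `gcloud auth application-default login`, or `service_account` for a service account key).

```python
import json
from pathlib import Path


def describe_adc_file(path):
    """Return the credential type stored in an ADC JSON file.

    Hypothetical helper for illustration. Returns None when the file
    does not exist or has no top-level "type" field.
    """
    adc_path = Path(path)
    if not adc_path.is_file():
        return None
    with adc_path.open() as fh:
        data = json.load(fh)
    # Assumption: the credential kind is recorded under the "type" key.
    return data.get("type")


# Path used in this notebook on Google Colab
print(describe_adc_file("/content/.config/application_default_credentials.json"))
```

A `None` result suggests the login step did not write the file to the expected location.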

### Spark ML Settings

Spark ML requires the following configuration to read from and write to GCP Storage using ADC:

1. GCP connector: You need to identify your Hadoop version and set the matching dependency in `spark.jars.packages`.
2. ADC credentials: After following the instructions to set up ADC, you will have a JSON file that holds your authentication information. The path to this file is set in `spark.hadoop.google.cloud.auth.service.account.json.keyfile`.
3. Hadoop File System: You also need to set the Hadoop implementation that treats GCP Storage as a file system. This is defined in `spark.hadoop.fs.gs.impl`.
4. Finally, to mitigate conflicts between Spark's dependencies and user dependencies, you must set `spark.driver.userClassPathFirst` to true. You may also need to set `spark.executor.userClassPathFirst` to true.
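
The settings listed above, together with the Spark NLP project setting, map onto a plain dictionary of Spark configuration keys. The sketch below collects them in one place and makes the optional executor flag explicit. `build_gcp_params` is a hypothetical helper written for this notebook, not a Spark NLP API; the default connector coordinate is the one used later in this notebook and may need to change for a different Hadoop version.

```python
def build_gcp_params(
    project_id,
    json_keyfile,
    connector="com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8",
    executor_class_path_first=False,
):
    """Assemble the Spark settings needed for GCP Storage access.

    Hypothetical helper for illustration only.
    """
    if not project_id or not project_id.strip():
        raise ValueError("project_id must be a non-empty string")

    params = {
        # 1. GCP connector matching the Hadoop version
        "spark.jars.packages": connector,
        # 2. ADC credentials file
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": json_keyfile,
        # 3. Hadoop file system implementation for gs:// URIs
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        # 4. Prefer user dependencies over Spark's on the driver
        "spark.driver.userClassPathFirst": "true",
        # Spark NLP setting: the GCP project ID
        "spark.jsl.settings.gcp.project_id": project_id.strip(),
    }
    if executor_class_path_first:
        # Optional: apply the same preference on the executors
        params["spark.executor.userClassPathFirst"] = "true"
    return params
```

The resulting dictionary can be passed as `sparknlp.start(params=...)`, in the same way as the hand-written dictionary in this notebook.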

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL

print("Spark NLP version", sparknlp.version())

Spark NLP version 4.3.1


In [None]:
print("Enter your GCP ProjectId:")
PROJECT_ID = input()

In [None]:
json_keyfile = "/content/.config/application_default_credentials.json"

# GCP Storage configuration
gcp_params = {
    "spark.jars.packages": "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.driver.userClassPathFirst": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": json_keyfile,
    "spark.jsl.settings.gcp.project_id": PROJECT_ID
}

spark = sparknlp.start(params=gcp_params)

print("Apache Spark version: {}".format(spark.version))

In [None]:
# Start (or reuse) the Spark session with the GCP parameters defined above
spark = sparknlp.start(params=gcp_params)

In [None]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'sample_data/test_ner_dataset.txt')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
embeddings = WordEmbeddingsModel.pretrained("glove_100d")
ready_data = embeddings.transform(training_data).cache()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
# External Graph folder on GCP
graphFolder = "gs://my-gcp-bucket/ner/graphs"

ner_tagger = NerDLApproach() \
                .setInputCols("sentence", "token", "embeddings") \
                .setLabelColumn("label") \
                .setOutputCol("ner") \
                .setMinEpochs(1) \
                .setMaxEpochs(30) \
                .setRandomSeed(0) \
                .setVerbose(0) \
                .setDropout(0.8) \
                .setBatchSize(18) \
                .setGraphFolder(graphFolder)
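
Because the graph folder is just a string, a typo in the scheme or bucket name only surfaces once training starts. As a rough sketch, the URI can be checked up front with the standard library. `split_gcs_uri` is a hypothetical helper, not part of Spark NLP, and it validates only the shape of the URI, not whether the bucket exists or is writable.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, prefix).

    Hypothetical helper for illustration. Raises ValueError when the
    scheme is not "gs" or the bucket name is missing.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a GCP Storage URI: {uri!r}")
    # The path keeps a leading slash; drop it to get the object prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_gcs_uri("gs://my-gcp-bucket/ner/graphs")
print(bucket, prefix)  # my-gcp-bucket ner/graphs
```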

In [None]:
ner_tagger.fit(ready_data)

NerDLModel_18c6a5b33e9a