![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/dl-ner/mfa_ner_graphs_s3.ipynb)

# Training NER with Graphs in S3

In Spark NLP you can configure the location of the TF graphs used while training NER models. Starting with Spark NLP 5.1.0, this location can also be a GCP Storage URI, an Azure Storage URI, or a distributed file system path such as HDFS or Databricks FS (DBFS).

In this notebook, we walk through the steps required to use an external S3 URI as the graph folder when training an NER model.

To do this, we need to configure the Spark session with the required settings for Spark NLP and Spark ML.

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL

print("Spark NLP version", sparknlp.version())

Spark NLP version 4.3.1


To configure MFA, we just need to define the required values in the Spark properties, as shown below. For an example of how to get temporary credentials, see [this script](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh).
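The temporary credentials could also be requested from Python rather than from a shell script. The sketch below assumes `boto3` is installed and that long-term AWS credentials are already configured on the machine; `fetch_mfa_credentials` and `params_from_sts_response` are hypothetical helper names, not part of Spark NLP. It maps the `Credentials` section of an STS `get_session_token` response onto the Spark property names used in this notebook.

```python
def fetch_mfa_credentials(mfa_serial, token_code, duration_seconds=3600):
    """Request temporary credentials from AWS STS using an MFA code.

    Assumption: boto3 is installed and long-term credentials are configured.
    """
    import boto3  # imported lazily so the mapping helper works without boto3

    sts = boto3.client("sts")
    return sts.get_session_token(
        DurationSeconds=duration_seconds,
        SerialNumber=mfa_serial,  # ARN of the MFA device
        TokenCode=token_code,     # current code shown by the MFA device
    )


def params_from_sts_response(response, region):
    """Map an STS response to the Spark NLP S3 settings (hypothetical helper)."""
    creds = response["Credentials"]
    return {
        "spark.jsl.settings.aws.credentials.access_key_id": creds["AccessKeyId"],
        "spark.jsl.settings.aws.credentials.secret_access_key": creds["SecretAccessKey"],
        "spark.jsl.settings.aws.credentials.session_token": creds["SessionToken"],
        "spark.jsl.settings.aws.region": region,
    }
```

The resulting dictionary has the same shape as the one passed to `sparknlp.start(params=...)` later in this notebook. The cells that follow collect the same values interactively instead.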

In [None]:
print("Enter your AWS Access Key:")
MY_ACCESS_KEY = input()

In [None]:
print("Enter your AWS Secret Key:")
MY_SECRET_KEY = input()

In [None]:
print("Enter your AWS Session Key:")
MY_SESSION_KEY = input()

In [None]:
print("Enter your AWS Region:")
MY_AWS_REGION = input()

In [None]:
#S3 Storage configuration
s3_params = {
    "spark.jsl.settings.aws.credentials.access_key_id": MY_ACCESS_KEY,
    "spark.jsl.settings.aws.credentials.secret_access_key": MY_SECRET_KEY,
    "spark.jsl.settings.aws.credentials.session_token": MY_SESSION_KEY,
    "spark.jsl.settings.aws.region": MY_AWS_REGION
}
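Typing secrets into notebook cells is easy to get wrong. As an alternative sketch, the same dictionary could be built from the standard AWS environment variables. `ENV_TO_SPARK_PROPERTY` and `s3_params_from_env` are hypothetical names introduced here for illustration. The function fails early if any value is missing.

```python
import os

# Standard AWS environment variable names mapped to the Spark properties above.
ENV_TO_SPARK_PROPERTY = {
    "AWS_ACCESS_KEY_ID": "spark.jsl.settings.aws.credentials.access_key_id",
    "AWS_SECRET_ACCESS_KEY": "spark.jsl.settings.aws.credentials.secret_access_key",
    "AWS_SESSION_TOKEN": "spark.jsl.settings.aws.credentials.session_token",
    "AWS_DEFAULT_REGION": "spark.jsl.settings.aws.region",
}


def s3_params_from_env(env=None):
    """Build the S3 settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way: both are reported as missing.
    missing = [name for name in ENV_TO_SPARK_PROPERTY if not env.get(name)]
    if missing:
        raise ValueError("Missing AWS settings: " + ", ".join(missing))
    return {prop: env[name] for name, prop in ENV_TO_SPARK_PROPERTY.items()}
```

With the four variables exported in the environment, `s3_params_from_env()` would return a dictionary with the same keys as `s3_params`.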

In [None]:
spark = sparknlp.start(params=s3_params)

In [None]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'sample_data/test_ner_dataset.txt')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
embeddings = WordEmbeddingsModel.pretrained("glove_100d")
ready_data = embeddings.transform(training_data).cache()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
# External Graph folder on S3
graphFolder = "s3://my.bucket.com/my/s3/path"

ner_tagger = NerDLApproach() \
                .setInputCols("sentence", "token", "embeddings") \
                .setLabelColumn("label") \
                .setOutputCol("ner") \
                .setMinEpochs(1) \
                .setMaxEpochs(30) \
                .setRandomSeed(0) \
                .setVerbose(0) \
                .setDropout(0.8) \
                .setBatchSize(18) \
                .setGraphFolder(graphFolder)
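The graph folder is only a string parameter until training runs, so a quick sanity check before calling `fit` may catch a malformed URI earlier. The helper below is a hypothetical sketch that uses only the standard library to split an `s3://` URI into its bucket and key prefix. It does not contact AWS or verify that the bucket exists.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key_prefix) for an s3:// URI, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URI, got: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"No bucket name in URI: {uri!r}")
    # Drop leading and trailing slashes so the prefix is a plain key path.
    return parsed.netloc, parsed.path.strip("/")


bucket, prefix = split_s3_uri("s3://my.bucket.com/my/s3/path")
print(bucket, prefix)  # my.bucket.com my/s3/path
```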

In [None]:
ner_tagger.fit(ready_data)

NerDLModel_18c6a5b33e9a