![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/dl-ner/ner_logs.ipynb)


# Exporting Logs to S3 During NER Training

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL

To store training logs in S3, we have two options:
- Define both the S3 path and the AWS credentials when starting the Spark session
- Define the AWS credentials when starting the Spark session and the S3 path at runtime (available since spark-nlp 4.1.0)

In [None]:
print("Enter your AWS Access Key:")
ACCESS_KEY = input()

In [None]:
print("Enter your AWS Secret Key:")
SECRET_KEY = input()

In [None]:
print("Enter your AWS Session Token:")
SESSION_KEY = input()

In [None]:
print("Enter your AWS Region:")
AWS_REGION = input()
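
The cells above read the credentials with `input()`, which echoes them into the notebook output. A minimal sketch of an alternative, assuming you would rather take the values from environment variables and fall back to a hidden prompt: the helper name `read_secret` is hypothetical and is not part of Spark NLP, while `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` are the standard AWS environment variable names.

```python
import os
from getpass import getpass


def read_secret(prompt, env_var):
    """Return the value of env_var if it is set, otherwise prompt without echoing.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    # getpass hides the typed value, so it does not end up in the cell output
    return getpass(prompt)


# Possible usage in place of the input() cells (commented out so that
# importing this block never blocks on a prompt):
# ACCESS_KEY = read_secret("AWS Access Key: ", "AWS_ACCESS_KEY_ID")
# SECRET_KEY = read_secret("AWS Secret Key: ", "AWS_SECRET_ACCESS_KEY")
# SESSION_KEY = read_secret("AWS Session Token: ", "AWS_SESSION_TOKEN")
```

The returned strings can be passed to `s3_params` exactly like the values read with `input()`.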

In [None]:
s3_params = {
    "spark.jsl.settings.aws.credentials.access_key_id": ACCESS_KEY,
    "spark.jsl.settings.aws.credentials.secret_access_key": SECRET_KEY,
    "spark.jsl.settings.aws.credentials.session_token": SESSION_KEY,
    "spark.jsl.settings.aws.region": AWS_REGION
}

spark = sparknlp.start(params=s3_params)

print("Apache Spark version: {}".format(spark.version))

Please check how to start a Spark session with spark-nlp for your environment [here](https://github.com/JohnSnowLabs/spark-nlp#usage).

### Training NER DL

In [None]:
training_data = CoNLL().readDataset(spark, './test_ner_dataset.txt')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith works ...|[{document, 0, 35...|[{document, 0, 35...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
embeddings = WordEmbeddingsModel.pretrained("glove_100d")
ready_data = embeddings.transform(training_data).cache()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


The example below defines the S3 log path at runtime:

In [None]:
ner_tagger = NerDLApproach() \
                .setInputCols("sentence", "token", "embeddings") \
                .setLabelColumn("label") \
                .setOutputCol("ner") \
                .setMaxEpochs(5) \
                .setRandomSeed(0) \
                .setVerbose(2) \
                .setDropout(0.8) \
                .setBatchSize(18) \
                .setEnableOutputLogs(True) \
                .setOutputLogsPath("s3://my_bucket/my_path/ner_logs")

In [None]:
ner_tagger.fit(ready_data)

NerDLModel_4cc29d1aa9e3
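
Once training finishes, it can be useful to confirm that log files actually landed under the configured prefix. The sketch below is one way to do that, assuming a boto3 S3 client is available; `list_log_keys` is a hypothetical helper, and the bucket and prefix are the placeholder values from `setOutputLogsPath` above. The names of the log files themselves are not assumed here.

```python
def list_log_keys(s3_client, bucket, prefix):
    """Return the object keys found under prefix in bucket.

    Hypothetical helper for illustration. s3_client is expected to behave
    like boto3.client("s3"). Only the first page of results (up to 1000
    keys) is read, which should be enough for a quick check.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches the prefix
    return [obj["Key"] for obj in response.get("Contents", [])]


# Possible usage (requires boto3 and valid credentials, so left commented out):
# import boto3
# client = boto3.client(
#     "s3",
#     aws_access_key_id=ACCESS_KEY,
#     aws_secret_access_key=SECRET_KEY,
#     aws_session_token=SESSION_KEY,
#     region_name=AWS_REGION,
# )
# print(list_log_keys(client, "my_bucket", "my_path/ner_logs"))
```

An empty list would suggest that the logs were not written to that prefix, for example because of missing write permissions.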

Before spark-nlp 4.1.0, in addition to the AWS credentials, we also had to set the configuration below on the Spark session:

In [None]:
spark.conf.set("spark.jsl.settings.aws.s3_bucket", "MY_S3_BUCKET")
spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my_path/ner_logs")  # the bucket name is intentionally omitted here

This configuration is still available in 4.1.0, but the path defined in `setOutputLogsPath` takes precedence.
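
To make the relationship between the two styles concrete, the sketch below splits a full S3 URI, as passed to `setOutputLogsPath`, into the two legacy settings. It assumes, based only on the example above, that the legacy `log_folder` is the same path with the bucket name removed; `legacy_log_settings` is a hypothetical helper, not a Spark NLP function.

```python
from urllib.parse import urlparse


def legacy_log_settings(s3_uri):
    """Split a full S3 URI into the two pre-4.1.0 settings.

    Hypothetical helper for illustration. Assumes the legacy log_folder is
    the URI with the bucket name dropped, as in the example above.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected a URI like s3://bucket/path, got: " + s3_uri)
    folder = parsed.path.lstrip("/")
    return {
        "spark.jsl.settings.aws.s3_bucket": parsed.netloc,
        "spark.jsl.settings.annotator.log_folder": "s3://" + folder,
    }


settings = legacy_log_settings("s3://my_bucket/my_path/ner_logs")
for key, value in settings.items():
    print(key, "=", value)
    # With a live session: spark.conf.set(key, value)
```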