![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/prediction/english/Load_Model_From_S3.ipynb)

### Loading Pretrained Models from S3

In [1]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-09-08 14:43:43--  https://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-09-08 14:43:44--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’

Installing PySpark 3.2.1 and Spark NLP 4.1.0
setup Colab for PySpark 3.2.1 

## Defining S3 URI in cache_pretrained 

In this notebook, we walk through the steps required to use an external S3 URI as the `cache_pretrained` folder.

Spark NLP lets you configure the location where pre-trained models are downloaded. Before Spark NLP 4.2.0, you could set a location on the local file system or on a distributed file system (DBFS). Starting with 4.2.0, you can also set an S3 URI. To do this, we need to configure the Spark session with the required settings for Spark NLP and Spark ML.

### Spark NLP Settings

Spark NLP requires the following configuration:
1. `cache_folder`: The S3 URI (using the `s3` or `s3a` prefix) where Spark NLP pre-trained models will be stored. This is defined in the config `spark.jsl.settings.pretrained.cache_folder`
2. S3 Region: Spark NLP needs the AWS region of your bucket in order to upload files to it. This is defined in the config `spark.jsl.settings.aws.region`
3. Spark NLP JAR: Using an S3 URI in `cache_pretrained` needs some custom configuration, so the spark-nlp JAR must be included either as a dependency of your application or during Spark session creation. Since we are using a notebook, we will add it while creating the Spark session in the following config:
   - `spark.jars.packages` for Maven coordinates or `spark.jars` for a FAT JAR
4. We also recommend adding the parameters described for manually creating a Spark session in the requirements section of the [Spark NLP documentation](https://github.com/JohnSnowLabs/spark-nlp#requirements).
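
The first two settings can be collected in a small helper before building the session. This is only a minimal sketch: the function name `spark_nlp_s3_conf` is hypothetical (it is not part of Spark NLP), and the bucket path and region are the placeholder values used later in this notebook. It only checks that the URI uses the `s3` or `s3a` prefix and names a bucket.

```python
from urllib.parse import urlparse


def spark_nlp_s3_conf(cache_folder: str, region: str) -> dict:
    """Return the two Spark NLP S3 settings as a plain config dict.

    Hypothetical helper for illustration; not a Spark NLP API.
    """
    parsed = urlparse(cache_folder)
    # The cache folder must be an S3 URI with the s3 or s3a prefix.
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(
            f"Expected an s3:// or s3a:// URI with a bucket, got: {cache_folder!r}"
        )
    return {
        "spark.jsl.settings.pretrained.cache_folder": cache_folder,
        "spark.jsl.settings.aws.region": region,
    }


# Placeholder bucket and region, matching the values used in this notebook.
nlp_conf = spark_nlp_s3_conf("s3://my_bucket/my/models/", "us-east-1")
print(nlp_conf)
```

Each key/value pair in the returned dict can then be passed to `SparkSession.builder.config(key, value)`.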

### Spark ML Settings

This configuration will depend on your S3 bucket and AWS configuration. In this notebook, a connection through **Temporary Security Credentials** is showcased. **Please contact your administrator to choose the right setup and to obtain the required keys/tokens.**

Spark ML requires the following configuration to load a model from S3 using *Temporary Security Credentials*:

1. Authenticating with S3: This is needed to interact with external S3 buckets, and it requires an access key, a secret key, and a session token. Define the values in these configs:
   - `spark.hadoop.fs.s3a.access.key`
   - `spark.hadoop.fs.s3a.secret.key`
   - `spark.hadoop.fs.s3a.session.token`
2. Credential Provider: You need to define the Hadoop provider that will handle this connection. Since this notebook uses *Temporary Security Credentials*, we need the provider `TemporaryAWSCredentialsProvider` from the `hadoop-aws` package, set in the config below:
   - `spark.hadoop.fs.s3a.aws.credentials.provider`
3. AWS packages: Alongside `hadoop-common` and its dependencies, S3A depends on two JARs: the `hadoop-aws` and `aws-java-sdk` packages. You will need to add these dependencies either to your application or to your Spark session. Since we are using a notebook, we will add these packages while creating the Spark session in the following config:
   - `spark.jars.packages`
4. AWS File System: `S3AFileSystem` must also be set as the file system implementation so that Spark can interact with S3 through the AWS SDK. Define the value in this config:
   - `spark.hadoop.fs.s3a.impl`
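
The credential-related settings listed above can be grouped into a single dict and applied to a session builder in a loop. The sketch below is illustrative only: `s3a_temporary_credentials_conf` and `apply_conf` are hypothetical helper names, and the provider and file system class names are the ones used in this notebook's session cell. The exact keys and classes may differ for other Hadoop versions or authentication methods.

```python
def s3a_temporary_credentials_conf(access_key: str, secret_key: str, session_token: str) -> dict:
    """Build the S3A settings for Temporary Security Credentials (hypothetical helper)."""
    return {
        # 1. Authentication: access key, secret key, and session token
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.session.token": session_token,
        # 2. Credential provider from the hadoop-aws package
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        # 4. File system implementation used for s3a:// URIs
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def apply_conf(builder, conf: dict):
    """Apply every key/value pair to any builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Placeholder values; real credentials should never be hard-coded.
s3a_conf = s3a_temporary_credentials_conf("ACCESS", "SECRET", "TOKEN")
print(sorted(s3a_conf))
```

With PySpark available, `apply_conf(SparkSession.builder, s3a_conf)` would return a builder carrying these settings; item 3 (`spark.jars.packages`) is set separately because it lists Maven coordinates rather than credentials.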

Now, let's look at the Spark session creation below to see how each of these configurations is set for **Temporary Security Credentials**:

In [None]:
print("Enter your AWS Access Key:")
MY_ACCESS_KEY = input()

In [None]:
print("Enter your AWS Secret Key:")
MY_SECRET_KEY = input()

In [None]:
print("Enter your AWS Session Token:")
MY_SESSION_KEY = input()
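
Because `input()` echoes what is typed into the notebook output, a variation worth considering is to read each secret from an environment variable and fall back to a hidden prompt. This is a minimal sketch under stated assumptions: `read_secret` is a hypothetical helper, and the environment variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` are the conventional AWS ones, which your setup may or may not define.

```python
import os
from getpass import getpass


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if set, otherwise prompt without echoing input."""
    value = os.environ.get(env_var)
    if value:
        return value
    # getpass hides the typed value, so it does not end up in the cell output.
    return getpass(prompt)


# Possible usage, mirroring the variable names of the cells above:
# MY_ACCESS_KEY = read_secret("AWS_ACCESS_KEY_ID", "Enter your AWS Access Key: ")
# MY_SECRET_KEY = read_secret("AWS_SECRET_ACCESS_KEY", "Enter your AWS Secret Key: ")
# MY_SESSION_KEY = read_secret("AWS_SESSION_TOKEN", "Enter your AWS Session Token: ")
```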

In [4]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.hadoop.fs.s3a.access.key", MY_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", MY_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.session.token", MY_SESSION_KEY) \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.901") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.jsl.settings.pretrained.cache_folder", "s3://my_bucket/my/models/") \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .getOrCreate()

spark

### Disclaimer:
- Interaction with S3 depends on the Spark/Hadoop/AWS implementations, which are outside our scope. Keep in mind that the configuration requirements or formats could change in other releases. For additional information and details, we recommend checking their up-to-date official documentation, such as [Hadoop-AWS Integration with AWS](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)
- It is important to note that the `hadoop-aws` and `aws-java-sdk` package versions must be compatible with each other; otherwise, it won't work. The example in this notebook uses Hadoop 3.3.1, so you must adjust those versions to match your Hadoop version.

In [5]:
print(f"Hadoop version = {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")

Hadoop version = 3.3.1
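
Once the Hadoop version is known, the value for `spark.jars.packages` can be assembled from the individual versions instead of being edited by hand. The sketch below assumes the `hadoop-aws` version should equal the reported Hadoop version; the `aws-java-sdk` version is passed in explicitly because it has to be looked up for your Hadoop release (1.11.901 is simply the value this notebook uses with Hadoop 3.3.1). `build_spark_packages` is a hypothetical helper name.

```python
def build_spark_packages(spark_nlp_version: str,
                         hadoop_version: str,
                         aws_sdk_version: str,
                         scala_version: str = "2.12") -> str:
    """Join the Maven coordinates used in this notebook into one comma-separated string."""
    return ",".join([
        f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{spark_nlp_version}",
        # hadoop-aws is assumed to match the Hadoop version reported by Spark.
        f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
        # The AWS SDK version must be compatible with the chosen hadoop-aws release.
        f"com.amazonaws:aws-java-sdk:{aws_sdk_version}",
    ])


packages = build_spark_packages("4.1.0", "3.3.1", "1.11.901")
print(packages)
```

Since package resolution happens when the session starts, a changed value would typically only take effect after restarting the runtime and recreating the Spark session.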


In [6]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

In [7]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


In [8]:
pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector
        ])

In [9]:
test_df = spark.createDataFrame([["This is a simple example. This is another sentence"]]).toDF("text")

In [10]:
model = pipeline.fit(test_df)

In [11]:
model.transform(test_df).show(truncate=False)

+--------------------------------------------------+--------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|text                                              |document                                                                                    |sentence                                                                                                                               |
+--------------------------------------------------+--------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|This is a simple example. This is another sentence|[{document, 0, 49, This is a simple example. This is another sentence, {sentence -> 0}, []}]|[{documen

In [12]:
from sparknlp.pretrained import PretrainedPipeline

pipeline_model = PretrainedPipeline('explain_document_ml', lang = 'en')

explain_document_ml download started this may take some time.
Approx size to download 9.2 MB
[OK!]


In [13]:
pipeline_model.transform(test_df).show(truncate=False)

+--------------------------------------------------+--------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------