![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/Import%20External%20SavedModel%20From%20Remote%20Storage.ipynb)

In [None]:
# This is only needed to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# Import External SavedModel From Remote Storage

This feature is available in `Spark NLP 4.2.2` and above, so make sure you have upgraded to a recent Spark NLP release.
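
A minimal sketch of how the version requirement could be checked before attempting a remote load. The helper names `parse_version` and `supports_remote_loading` are hypothetical and not part of Spark NLP; the sketch assumes the version string starts with dotted integers, such as the string returned by `sparknlp.version()`.

```python
import re

# Minimum release that supports loading from remote storage (from the text above).
MIN_VERSION = (4, 2, 2)


def parse_version(text):
    """Turn a string such as '4.2.2' into a tuple of three integers."""
    numbers = [int(part) for part in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as '5.0' with zeros.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def supports_remote_loading(version_text):
    """Return True if the given version is at least MIN_VERSION."""
    return parse_version(version_text) >= MIN_VERSION


# Assumed usage inside the notebook:
# import sparknlp
# assert supports_remote_loading(sparknlp.version()), "Please upgrade Spark NLP"
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where `"4.10.0"` would sort before `"4.2.2"`.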

This feature allows you to load external models (for example, models exported from the transformers library) from various remote locations. These include DBFS, HDFS, and S3.

This example loads an ALBERT model exported from the transformers library. To prepare and export the model properly, see the tutorial for the respective transformer in the [following discussion](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).

## Loading Models from the Databricks File System (DBFS)
First, make sure you have Spark NLP installed on your cluster.

You can load models from a directory on DBFS by providing a path with the `dbfs:/` protocol.

In [None]:
import sparknlp
from sparknlp.annotator import *

spark = sparknlp.start()

albert = AlbertEmbeddings.loadSavedModel(
    "dbfs:/FileStore/tables/johnsnow/albert-base-v2/",
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False) \
    .setDimension(768) \
    .setStorageRef("albert_base_uncased")


If the model is on local file storage, it is advisable to prepend the `file:/` protocol so that the correct path is resolved.

In [None]:
import sparknlp
from sparknlp.annotator import *

spark = sparknlp.start()

albert = AlbertEmbeddings.loadSavedModel(
    "file:/databricks/driver/johnsnow/albert-base-v2/",
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False) \
    .setDimension(768) \
    .setStorageRef("albert_base_uncased")
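
As a rough sketch of the advice on local paths, the helper below adds the `file:` scheme to an absolute path that has no scheme and leaves paths that already carry one (such as `dbfs:/...`) untouched. The name `with_file_scheme` is hypothetical, and POSIX-style paths are assumed.

```python
import posixpath
from urllib.parse import urlparse


def with_file_scheme(path):
    """Prefix an absolute local path with 'file:' unless it already has a scheme."""
    if urlparse(path).scheme:
        # Already qualified, e.g. 'dbfs:/...', 'hdfs://...' or 'file:/...'.
        return path
    if not posixpath.isabs(path):
        raise ValueError(f"Expected an absolute path, got: {path!r}")
    return "file:" + path


# Assumed usage: pass the result as the first argument of loadSavedModel.
model_path = with_file_scheme("/databricks/driver/johnsnow/albert-base-v2/")
```

Rejecting relative paths is a design choice in this sketch: a relative path would be resolved against whatever the current working directory happens to be, which is the ambiguity the explicit scheme is meant to avoid.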


## Loading Models from the Hadoop File System (HDFS)
You can load models from a directory on HDFS by providing a path with the `hdfs://` protocol.

Here, the HDFS endpoint is reachable at `localhost:9000`.

In [None]:
import sparknlp
from sparknlp.annotator import *

spark = sparknlp.start()

albert = AlbertEmbeddings.loadSavedModel(
    "hdfs://localhost:9000/johnsnow/albert-base-v2/",
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False) \
    .setDimension(768) \
    .setStorageRef("albert_base_uncased")
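
To illustrate how the HDFS path above breaks down into an endpoint and a directory, here is a small sketch using only the standard library. The helper `split_hdfs_uri` is hypothetical and not part of Spark NLP; it only checks that the URI names an explicit host and does not contact the cluster.

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri):
    """Split an HDFS URI into (host, port, path); port is None if omitted."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"Not an HDFS URI: {uri!r}")
    if not parsed.hostname:
        raise ValueError(f"No endpoint given in: {uri!r}")
    return parsed.hostname, parsed.port, parsed.path


host, port, path = split_hdfs_uri("hdfs://localhost:9000/johnsnow/albert-base-v2/")
```

A check like this can catch a mistyped endpoint before Spark tries to resolve it.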


## Loading Models from S3
You can load models from a directory on S3 by providing a path with the `s3://` protocol.

You will need to create a custom Spark session with the proper credentials and permissions to access a directory in the S3 bucket. For an example of how to set up access with temporary credentials, see [Load Model From S3 from the SparkNLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/examples/prediction/english/Load_Model_From_S3.ipynb).

In this example, the bucket that will be used is called `johnsnow` and its region is `us-east-1`.
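
For orientation, the sketch below shows how an S3 path of the form used in this notebook decomposes into a bucket name and a key prefix. The helper `split_s3_uri` is hypothetical and uses only the standard library; it does not talk to AWS.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix/' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"No bucket given in: {uri!r}")
    # Object keys do not start with a slash, so strip the leading one.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://johnsnow/models/albert-base-v2/")
```

With the bucket named `johnsnow`, the model directory corresponds to the key prefix `models/albert-base-v2/`.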

### Anonymous Access
If the bucket is publicly accessible, a Spark session with S3 support can be created as follows to load the model from the bucket:

In [None]:
from pyspark.sql import SparkSession
from sparknlp.annotator import *

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark NLP') \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "200M") \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2") \
    .getOrCreate()


albert = AlbertEmbeddings.loadSavedModel(
    "s3://johnsnow/models/albert-base-v2/",
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False) \
    .setDimension(768) \
    .setStorageRef("albert_base_uncased")


### Restricted Access
If the bucket requires credentials, a Spark session with S3 support can be created as follows to load the model from the bucket (taken from the workshop example).

Note that `MY_ACCESS_KEY`, `MY_SECRET_KEY`, `MY_SESSION_KEY` need to be set for this example to work.

In [None]:
print("Enter your AWS Access Key:")
MY_ACCESS_KEY = input()

In [None]:
print("Enter your AWS Secret Key:")
MY_SECRET_KEY = input()

In [None]:
print("Enter your AWS Session Key:")
MY_SESSION_KEY = input()
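
Because `input()` echoes what is typed into the notebook output, one possible alternative is to read the credentials with `getpass` from the standard library, which hides the typed value. The wrapper `read_secret` below is a hypothetical helper; it also rejects empty input so that a missing credential is noticed early.

```python
import getpass


def read_secret(prompt, reader=getpass.getpass):
    """Read a secret without echoing it and fail if nothing was entered."""
    value = reader(prompt).strip()
    if not value:
        raise ValueError(f"No value entered for prompt: {prompt!r}")
    return value


# Assumed interactive usage in the notebook:
# MY_ACCESS_KEY = read_secret("Enter your AWS Access Key: ")
# MY_SECRET_KEY = read_secret("Enter your AWS Secret Key: ")
# MY_SESSION_KEY = read_secret("Enter your AWS Session Key: ")
```

The `reader` parameter exists only so the function can be exercised without a terminal, for example by passing a function that returns a fixed string.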

In [None]:
from pyspark.sql import SparkSession
from sparknlp.annotator import *


spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "8G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.hadoop.fs.s3a.access.key", MY_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", MY_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.session.token", MY_SESSION_KEY) \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .getOrCreate()


albert = AlbertEmbeddings.loadSavedModel(
    "s3://johnsnow/models/albert-base-v2/",
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False) \
    .setDimension(768) \
    .setStorageRef("albert_base_uncased")
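
The credential-related settings from the session above can also be collected in one place, which makes it easier to verify that nothing is missing before the session is created. This is a sketch only: `s3_temporary_credentials_conf` and `apply_conf` are hypothetical helpers, and the configuration keys are copied from the cell above.

```python
def s3_temporary_credentials_conf(access_key, secret_key, session_token, region):
    """Build the Spark settings for temporary AWS credentials as a dict."""
    given = {
        "access_key": access_key,
        "secret_key": secret_key,
        "session_token": session_token,
        "region": region,
    }
    missing = [name for name, value in given.items() if not value]
    if missing:
        raise ValueError("Missing values: " + ", ".join(missing))
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.session.token": session_token,
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.jsl.settings.aws.region": region,
    }


def apply_conf(builder, conf):
    """Apply every key/value pair to a builder that offers .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Assumed usage with the variables defined earlier in this notebook:
# conf = s3_temporary_credentials_conf(
#     MY_ACCESS_KEY, MY_SECRET_KEY, MY_SESSION_KEY, "us-east-1")
# spark = apply_conf(SparkSession.builder.master("local[*]"), conf).getOrCreate()
```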
