## Loading Pretrained Models from GCP

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

## Defining GCP Storage URI in cache_pretrained

In Spark NLP you can configure the location where pre-trained models are downloaded. Starting with Spark NLP 4.2.4, you can set a GCP Storage URI. Since Spark NLP 5.1.0, you can also define an Azure Storage URI or distributed file system (DFS) paths such as HDFS or Databricks FS.

In this notebook, we walk through the steps required to use an external GCP Storage URI as the `cache_pretrained` folder. To do this, we need to configure the Spark session with the required settings for Spark NLP and Spark ML.

### Spark NLP Settings



1. `cache_folder`: The GCP Storage URI (using the `gs://` prefix) where Spark NLP pre-trained models will be stored. This is defined in the config `spark.jsl.settings.pretrained.cache_folder`.
2. `project_id`: The ProjectId of your GCP Storage. This is defined in `spark.jsl.settings.gcp.project_id`.

To integrate with GCP, we need to set up Application Default Credentials (ADC) for GCP. You can check how to configure them in the official [GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc).
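Before starting the Spark session, it can help to confirm that an ADC JSON file actually exists on the machine. The following is a minimal sketch, not part of Spark NLP: `locate_adc_file` is a hypothetical helper, and it assumes the usual ADC conventions (the `GOOGLE_APPLICATION_CREDENTIALS` environment variable and the default gcloud path on Linux/macOS). Any additional locations, such as the Colab path used later in this notebook, are passed in explicitly.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def locate_adc_file(
    env: Mapping[str, str] = os.environ,
    extra_candidates: Iterable[str] = (),
) -> Optional[str]:
    """Return the first existing ADC JSON path, or None if none is found.

    Hypothetical helper for illustration; it only checks that a file exists,
    not that the credentials inside it are valid.
    """
    candidates = []
    # An explicitly configured credentials file is checked first.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(explicit)
    # Default gcloud location on Linux/macOS (assumption: not on Windows).
    candidates.append(
        str(Path.home() / ".config" / "gcloud" / "application_default_credentials.json")
    )
    # Caller-supplied locations, for example a Colab-specific path.
    candidates.extend(extra_candidates)
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# The Colab path below is the one used in this notebook; adjust it elsewhere.
detected_keyfile = locate_adc_file(
    extra_candidates=["/content/.config/application_default_credentials.json"]
)
print(detected_keyfile or "No ADC file found; follow the GCP documentation to create one.")
```

If the helper returns `None`, the Spark session will most likely fail to authenticate against GCP Storage, so it is worth resolving that first.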



### Spark ML Settings

Spark ML requires the following configuration to load a model from GCP using ADC:



1. GCP connector: You need to identify your Hadoop version and set the matching dependency in `spark.jars.packages`.
2. ADC credentials: After following the instructions to set up ADC, you will have a JSON file that holds your authentication information. The path to this file is set in `spark.hadoop.google.cloud.auth.service.account.json.keyfile`.
3. Hadoop File System: You also need to set the Hadoop implementation that works with GCP Storage as a file system. This is defined in `spark.hadoop.fs.gs.impl`.
4. Finally, to mitigate conflicts between Spark's dependencies and user dependencies, you must set `spark.driver.userClassPathFirst` to true. You may also need to set `spark.executor.userClassPathFirst` to true.
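The first item in the list above depends on the Hadoop version of your environment. As a rough sketch, the connector coordinate can be derived from that version. `gcs_connector_coordinate` is a hypothetical helper, and the `hadoop<major>-<version>` naming pattern is inferred from the coordinate used in this notebook (`hadoop3-2.2.8`); check Maven Central to confirm that the combination you need is actually published.

```python
def gcs_connector_coordinate(hadoop_version: str, connector_version: str = "2.2.8") -> str:
    """Build a Maven coordinate for the GCS connector (illustrative only).

    Assumes the "hadoop<major>-<connector_version>" naming pattern seen in
    this notebook; other connector releases may use a different scheme.
    """
    major = hadoop_version.strip().split(".")[0]
    if major not in ("2", "3"):
        raise ValueError(f"Unsupported Hadoop version for this sketch: {hadoop_version!r}")
    return f"com.google.cloud.bigdataoss:gcs-connector:hadoop{major}-{connector_version}"


# In a running PySpark session, the Hadoop version can usually be read through
# the JVM gateway (note that `_jvm` is a private attribute):
#   spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(gcs_connector_coordinate("3.3.4"))
```

The returned string is what would go into `spark.jars.packages`.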



Now, let's look at a simple example of the Spark session creation below to see how to define each configuration with its value:

In [None]:
print("Enter your GCP ProjectId:")
PROJECT_ID = input()

In [None]:
print("Enter cache_folder URI:")
# Example: gs://my-bucket/models
CACHE_FOLDER = input()
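Since the cache folder is typed in by hand, a quick sanity check can catch a wrong prefix (for example `s3://`) before the Spark session is created. This is a minimal sketch using only the standard library; `validate_gcs_uri` is a hypothetical helper, and it checks only the shape of the URI, not whether the bucket exists or is accessible.

```python
from urllib.parse import urlparse


def validate_gcs_uri(uri: str) -> str:
    """Return the trimmed URI if it looks like gs://<bucket>[/path], else raise."""
    cleaned = uri.strip()
    parsed = urlparse(cleaned)
    if parsed.scheme != "gs":
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"Missing bucket name in URI: {uri!r}")
    return cleaned


# Example value taken from the comment in the cell above.
print(validate_gcs_uri("gs://my-bucket/models"))
```

In the notebook, this could be applied as `CACHE_FOLDER = validate_gcs_uri(CACHE_FOLDER)` right after reading the input.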

In [None]:
import sparknlp
import pyspark

json_keyfile = "/content/.config/application_default_credentials.json"

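# GCP Storage configuration
gcp_params = {
    # GCS connector matching the environment's Hadoop version
    "spark.jars.packages": "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8",
    # Hadoop file system implementation for gs:// URIs
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    # Prefer user-supplied jars over Spark's own to mitigate dependency conflicts
    "spark.driver.userClassPathFirst": "true",
    # ADC JSON file used for authentication
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": json_keyfile,
    # Spark NLP settings: GCP project and remote cache folder
    "spark.jsl.settings.gcp.project_id": PROJECT_ID,
    "spark.jsl.settings.pretrained.cache_folder": CACHE_FOLDER,
}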

spark = sparknlp.start(params=gcp_params)

print("Apache Spark version: {}".format(spark.version))

In [None]:
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

In [None]:
sample_text = "This is a sentence. This is another sentence"
data_df = spark.createDataFrame([[sample_text]]).toDF("text").cache()

empty_df = spark.createDataFrame([[""]]).toDF("text")

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

In [None]:
sentence_detector_dl = (
    SentenceDetectorDLModel.pretrained()
    .setInputCols(["document"])
    .setOutputCol("sentence")
)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


In [None]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector_dl, tokenizer])
pipeline_model = pipeline.fit(empty_df)

In [None]:
result = pipeline_model.transform(data_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|This is a sentenc...|[{document, 0, 43...|[{document, 0, 18...|[{token, 0, 3, Th...|
+--------------------+--------------------+--------------------+--------------------+

