![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/prediction/english/Load_Model_From_GCP_Storage.ipynb)

## Loading Pretrained Models from GCP

In [None]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

## Defining GCP Storage URI in cache_pretrained

In this notebook, we walk through the steps required to use an external GCP Storage URI as the `cache_pretrained` folder.

In Spark NLP you can configure where pre-trained models are downloaded. Starting with Spark NLP 4.2.4, this location can be a GCP Storage URI. To use one, configure the Spark session with the settings required by Spark NLP and Spark ML.

### Spark NLP Settings



1. `cache_folder`: The GCP Storage URI (with the `gs://` prefix) where Spark NLP stores pre-trained models. Set it with `spark.jsl.settings.pretrained.cache_folder`.
2. `project_id`: The project ID that owns your GCP Storage bucket. Set it with `spark.jsl.settings.gcp.project_id`.
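
A malformed cache folder value is easy to miss until the first model download fails. As a minimal sketch, the check below verifies that a string has the `gs://<bucket>/<path>` shape before it goes into the Spark configuration. The helper name `is_gcs_uri` is hypothetical and not part of Spark NLP.

```python
from urllib.parse import urlparse


def is_gcs_uri(uri: str) -> bool:
    """Return True when `uri` uses the gs scheme and names a bucket."""
    parsed = urlparse(uri)
    # For gs://my-bucket/models the bucket lands in `netloc`.
    return parsed.scheme == "gs" and bool(parsed.netloc)


# Hypothetical values, for illustration only.
print(is_gcs_uri("gs://my-bucket/models"))
print(is_gcs_uri("/tmp/cache_pretrained"))
```

This only checks the shape of the string. It does not confirm that the bucket exists or that your credentials can write to it.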

To integrate with GCP, we need to set up Application Default Credentials (ADC). The official [GCP documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc) explains how to configure them.
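
Before building the Spark session, it can help to confirm where the ADC JSON file actually lives. The sketch below looks in the places it is commonly found: the path in `GOOGLE_APPLICATION_CREDENTIALS`, the gcloud configuration directory named by `CLOUDSDK_CONFIG`, and `~/.config/gcloud` on Linux and macOS. The function name `find_adc_file` is hypothetical, and the search order is an assumption to check against the GCP documentation for your environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ADC_FILENAME = "application_default_credentials.json"


def find_adc_file(env: Optional[Mapping[str, str]] = None,
                  home: Optional[str] = None) -> Optional[Path]:
    """Return the first existing ADC file candidate, or None."""
    env = os.environ if env is None else env
    home_dir = Path.home() if home is None else Path(home)

    candidates = []
    # 1. Explicit path to a credentials file.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        candidates.append(Path(env["GOOGLE_APPLICATION_CREDENTIALS"]))
    # 2. Custom gcloud configuration directory (assumed layout).
    if env.get("CLOUDSDK_CONFIG"):
        candidates.append(Path(env["CLOUDSDK_CONFIG"]) / ADC_FILENAME)
    # 3. Default gcloud location on Linux and macOS.
    candidates.append(home_dir / ".config" / "gcloud" / ADC_FILENAME)

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


print(find_adc_file())
```

The returned path is the value you would pass to the keyfile setting described in the next section.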



### Spark ML Settings

Spark ML requires the following configuration to load a model from GCP using ADC:



1. GCP connector: Identify your Hadoop version and add the matching connector dependency to `spark.jars.packages`.
2. ADC credentials: After following the instructions to set up ADC, you will have a JSON file that holds your authentication information. Set its path in `spark.hadoop.google.cloud.auth.service.account.json.keyfile`.
3. Hadoop file system: Set the Hadoop implementation that lets GCP Storage act as a file system. This is defined in `spark.hadoop.fs.gs.impl`.
4. Class path precedence: To mitigate conflicts between Spark's dependencies and user dependencies, set `spark.driver.userClassPathFirst` to `true`. You may also need to set `spark.executor.userClassPathFirst` to `true`.
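
The connector dependency in step 1 encodes the Hadoop major version in its version string, for example `hadoop3-2.2.8`. As a rough sketch, the hypothetical helper below builds that Maven coordinate from a Hadoop version string. The default connector release `2.2.8` is taken from this notebook; check Maven Central to see which builds exist for your Hadoop version.

```python
def gcs_connector_coordinate(hadoop_version: str,
                             connector_version: str = "2.2.8") -> str:
    """Build the gcs-connector Maven coordinate for a Hadoop version."""
    major = hadoop_version.strip().split(".")[0]
    if major not in {"2", "3"}:
        raise ValueError(f"Unsupported Hadoop version: {hadoop_version!r}")
    return (
        "com.google.cloud.bigdataoss:gcs-connector:"
        f"hadoop{major}-{connector_version}"
    )


# With an existing session, the Hadoop version can be read through the JVM
# gateway, for example:
#   spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
# Here a hypothetical version string is used instead.
print(gcs_connector_coordinate("3.3.1"))
```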



Now, let's look at a simple example of Spark session creation to see how each of these settings is defined:

In [None]:
import pyspark
from pyspark.sql import SparkSession

#GCP Storage configuration
spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars", "./sparknlp.jar") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.driver.userClassPathFirst", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/content/.config/application_default_credentials.json") \
    .config("spark.jsl.settings.gcp.project_id", "my_project_id") \
    .config("spark.jsl.settings.pretrained.cache_folder", "gs://my-bucket/models") \
    .getOrCreate()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.2.1


Starting with Spark NLP 4.3.0, if you control how the Spark session is created, you can also call `sparknlp.start()` with the `params` argument:

In [None]:
import sparknlp

# Each entry is a Spark configuration key mapped to its value
params = {
    "spark.jars.packages": "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.driver.userClassPathFirst": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/content/.config/application_default_credentials.json",
    "spark.jsl.settings.gcp.project_id": "my_project_id",
    "spark.jsl.settings.pretrained.cache_folder": "gs://my-bucket/models"
}

spark = sparknlp.start(params=params)
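
If several notebooks share the same GCP setup, the settings can be assembled in one place. The sketch below is a hypothetical helper, not part of Spark NLP: it returns the same six configuration keys used in this notebook, with the project ID, cache folder, and keyfile path as arguments. The default connector coordinate is the one shown above and assumes Hadoop 3.

```python
def gcp_spark_params(
    project_id: str,
    cache_folder: str,
    keyfile: str,
    connector: str = "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8",
) -> dict:
    """Collect the Spark settings for a GCP-backed cache_pretrained folder."""
    if not cache_folder.startswith("gs://"):
        raise ValueError("cache_folder must be a gs:// URI")
    return {
        # Spark ML settings
        "spark.jars.packages": connector,
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.driver.userClassPathFirst": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile,
        # Spark NLP settings
        "spark.jsl.settings.gcp.project_id": project_id,
        "spark.jsl.settings.pretrained.cache_folder": cache_folder,
    }


# Placeholder values from this notebook; replace them with your own.
params = gcp_spark_params(
    project_id="my_project_id",
    cache_folder="gs://my-bucket/models",
    keyfile="/content/.config/application_default_credentials.json",
)
for key, value in params.items():
    print(f"{key} = {value}")
```

The resulting dictionary can be passed to `sparknlp.start(params=params)`, or applied to a `SparkSession.builder` by calling `.config(key, value)` for each entry.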

In [None]:
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

In [None]:
sample_text = "This is a sentence. This is another sentence"
data_df = spark.createDataFrame([[sample_text]]).toDF("text").cache()

empty_df = spark.createDataFrame([[""]]).toDF("text")

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

In [None]:
sentence_detector_dl = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


In [None]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector_dl, tokenizer])
pipeline_model = pipeline.fit(empty_df)

In [None]:
result = pipeline_model.transform(data_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|This is a sentenc...|[{document, 0, 43...|[{document, 0, 18...|[{token, 0, 3, Th...|
+--------------------+--------------------+--------------------+--------------------+

