![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/util/Load_Model_from_Azure_Storage.ipynb)

## Loading Pretrained Models from Azure

In [1]:
%env PYSPARK=3.4.0

env: PYSPARK=3.4.0


In [4]:
! pip install --upgrade -q pyspark==$PYSPARK findspark spark_nlp



## Defining Azure Storage URI in cache_pretrained

In Spark NLP you can configure the location where pre-trained models are downloaded. Starting with Spark NLP 5.1.0, this location can be an Azure Storage URI, a GCP Storage URI, or a distributed file system path such as HDFS or Databricks FS (DBFS).

In this notebook, we walk through the steps required to use an external Azure Storage URI as the `cache_pretrained` folder. To do this, we need to configure the Spark session with the settings required by Spark NLP and Spark ML.

### Spark NLP Settings

`cache_folder`: The Azure URI where Spark NLP stores pre-trained models. It is set through the config `spark.jsl.settings.pretrained.cache_folder`.

### Spark ML Settings

Spark ML requires the following configuration to load a model from Azure:


1. Azure connector: Identify your Hadoop version and set the matching dependency in `spark.jars.packages`.
2. Hadoop file system: Set up the Hadoop file system to work with Azure Storage. This is defined through properties under `spark.hadoop.fs.azure`.

To integrate with Azure, we need to define the STORAGE_ACCOUNT and AZURE_ACCOUNT_KEY variables:
1. STORAGE_ACCOUNT: In the Microsoft Azure portal, look under Resources for the entry of type storage account; its name is your storage account.
2. AZURE_ACCOUNT_KEY: See "View account access keys" in the official [Azure documentation](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal).

## Loading Pretrained Models with `pretrained()`

You can define these two properties as variables and set them when creating the Spark session. In addition, we need to define the Azure container where the models are stored.

In [6]:
print("Enter your Storage Account:")
STORAGE_ACCOUNT = input()

In [7]:
print("Enter your Azure Account Key:")
AZURE_ACCOUNT_KEY = input()

In [8]:
print("Enter your Azure Container:")
CONTAINER = input()

Enter your Azure Container:
test
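Because `input()` echoes whatever is typed, the account key entered above stays visible in the notebook. One possible alternative is to read the values from environment variables and fall back to a hidden prompt. The sketch below is only an illustration: the helper `read_setting` and the environment variable name `AZURE_STORAGE_ACCOUNT` are assumptions made for this example, not something this notebook or Spark NLP defines.

```python
import os
from getpass import getpass


def read_setting(name, prompt, env=None):
    """Return env[name] if it is set and non-empty, otherwise ask with a hidden prompt."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if value:
        return value
    # getpass does not echo the typed value, so the secret stays out of the cell output.
    return getpass(prompt).strip()


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"AZURE_STORAGE_ACCOUNT": "mystorageaccount"}
demo_account = read_setting("AZURE_STORAGE_ACCOUNT", "Storage account: ", env=demo_env)
print(demo_account)
```

In a real session the `env` argument would be omitted, so the helper reads from `os.environ` and prompts only when the variable is missing.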


In [9]:
azure_hadoop_config = "spark.hadoop.fs.azure.account.key." + STORAGE_ACCOUNT + ".blob.core.windows.net"
cache_folder = "https://" + STORAGE_ACCOUNT + ".blob.core.windows.net/" + CONTAINER + "/models"

In [10]:
print(cache_folder)

https://sparknlp2641242170.blob.core.windows.net/test/models
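The two strings built above can also come from a small helper that checks the storage account name before anything reaches Spark. This is a minimal sketch, not part of Spark NLP: the function name `azure_cache_settings` and its `folder` parameter are hypothetical. The only check is Azure's rule that storage account names are 3 to 24 characters of lowercase letters and digits.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def azure_cache_settings(storage_account, container, folder="models"):
    """Build the Hadoop account-key property name and the cache folder URI."""
    if not _ACCOUNT_NAME.fullmatch(storage_account):
        raise ValueError(f"Invalid storage account name: {storage_account!r}")
    host = f"{storage_account}.blob.core.windows.net"
    hadoop_key = f"spark.hadoop.fs.azure.account.key.{host}"
    cache_uri = f"https://{host}/{container}/{folder}"
    return hadoop_key, cache_uri


# Placeholder account and container names for illustration.
hadoop_key, cache_uri = azure_cache_settings("mystorageaccount", "test")
print(hadoop_key)
print(cache_uri)
```

Failing early on a malformed account name tends to give a clearer error than the one raised later when the session first tries to reach the storage endpoint.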


In [11]:
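import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
import pyspark
# Pipeline is used in the cells below; import it explicitly instead of relying on the wildcard imports
from pyspark.ml import Pipeline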

In [12]:
hadoop_azure_pkg = "org.apache.hadoop:hadoop-azure:3.3.4"
azure_storage_pkg = "com.microsoft.azure:azure-storage:8.6.6"
azure_identity_pkg = "com.azure:azure-identity:1.9.1"
azure_storage_blob_pkg = "com.azure:azure-storage-blob:12.22.2"
azure_pkgs = hadoop_azure_pkg + "," + azure_storage_pkg + "," + azure_identity_pkg + "," + azure_storage_blob_pkg

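# Azure Storage configuration: connector packages, account key and Spark NLP cache folder
azure_params = {
    "spark.jars.packages": azure_pkgs,
    azure_hadoop_config: AZURE_ACCOUNT_KEY,
    "spark.jsl.settings.pretrained.cache_folder": cache_folder
}

# real_time_output=True reads the JVM output in real time
spark = sparknlp.start(real_time_output=True, params=azure_params)
# Alternative without real-time output:
# spark = sparknlp.start(params=azure_params)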

print("Apache Spark version: {}".format(spark.version))

:: loading settings :: url = jar:file:/usr/local/lib/python3.10/dist-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Apache Spark version: 3.4.0


In [13]:
print(f"Hadoop version = {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")

Hadoop version = 3.3.4


### Disclaimer:
- Interaction with Azure depends on the Spark, Hadoop and Azure implementations, which are outside the scope of Spark NLP. Keep in mind that configuration requirements or formats may change in other releases.
- Note that the versions of the `hadoop-azure`, `azure-storage`, `azure-identity` and `azure-storage-blob` packages must be compatible with each other; otherwise, loading will not work. This notebook uses Spark 3.4.0 and Hadoop 3.3.4, so adjust the package versions to match your own Hadoop version.
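Since `hadoop-azure` is published as part of each Hadoop release, a common starting point is to use the same version number as the Hadoop version in your session. The sketch below assumes that convention. The helper name `hadoop_azure_coordinate` is hypothetical, and the versions of the other Azure packages still have to be checked separately.

```python
import re


def hadoop_azure_coordinate(hadoop_version):
    """Return the hadoop-azure Maven coordinate for a plain Hadoop release version."""
    # Only accept versions of the form major.minor.patch, such as "3.3.4".
    if not re.fullmatch(r"\d+\.\d+\.\d+", hadoop_version):
        raise ValueError(f"Unexpected Hadoop version: {hadoop_version!r}")
    return f"org.apache.hadoop:hadoop-azure:{hadoop_version}"


coordinate = hadoop_azure_coordinate("3.3.4")
print(coordinate)  # org.apache.hadoop:hadoop-azure:3.3.4
```

Because `spark.jars.packages` is read when the session starts, the Hadoop version has to be known beforehand, for example from the value printed earlier in this notebook.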

In [14]:
sample_text = "This is a sentence. This is another sentence"
data_df = spark.createDataFrame([[sample_text]]).toDF("text").cache()

empty_df = spark.createDataFrame([[""]]).toDF("text")

In [15]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

In [16]:
sentence_detector_dl = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ / ]sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ \ ]Download done! Loading the resource.
[OK!]


In [17]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector_dl, tokenizer])
pipeline_model = pipeline.fit(empty_df)

In [18]:
result = pipeline_model.transform(data_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|This is a sentenc...|[{document, 0, 43...|[{document, 0, 18...|[{token, 0, 3, Th...|
+--------------------+--------------------+--------------------+--------------------+



In [19]:
from sparknlp.pretrained import PretrainedPipeline

pipeline_model = PretrainedPipeline('explain_document_ml', lang = 'en')

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[ / ]explain_document_ml download started this may take some time.
Approximate size to download 9 MB
[ — ]Download done! Loading the resource.
[OK!]


In [20]:
pipeline = PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")

albert_xlarge_token_classifier_conll03_pipeline download started this may take some time.
Approx size to download 196.9 MB
[ | ]albert_xlarge_token_classifier_conll03_pipeline download started this may take some time.
Approximate size to download 196.9 MB
[ / ]Download done! Loading the resource.
[OK!]


{'ner_chunk': ['John', 'John Snow Labs'],
 'token': ['My',
  'name',
  'is',
  'John',
  'and',
  'I',
  'work',
  'at',
  'John',
  'Snow',
  'Labs',
  '.'],
 'sentence': ['My name is John and I work at John Snow Labs.']}

## Loading Pretrained Models with `load()`

Here we don't need to set `cache_folder`, so you can omit that configuration when starting the Spark session.
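As a rough sketch of what such a reduced configuration could look like, the helper below builds the session parameters with only the connector packages and the account key. The function name `azure_load_params` and the placeholder account values are assumptions for illustration. The package coordinates are the ones used earlier in this notebook.

```python
def azure_load_params(storage_account, account_key, packages):
    """Session params for load(): connector packages and account key, no cache folder."""
    key_property = (
        "spark.hadoop.fs.azure.account.key."
        + storage_account
        + ".blob.core.windows.net"
    )
    return {
        "spark.jars.packages": ",".join(packages),
        key_property: account_key,
    }


# Placeholder values for illustration; substitute your own account and key.
load_params = azure_load_params(
    "mystorageaccount",
    "<account-key>",
    [
        "org.apache.hadoop:hadoop-azure:3.3.4",
        "com.microsoft.azure:azure-storage:8.6.6",
        "com.azure:azure-identity:1.9.1",
        "com.azure:azure-storage-blob:12.22.2",
    ],
)
print(sorted(load_params))
```

The resulting dictionary would then be passed to `sparknlp.start(params=load_params)` in the same way as `azure_params` above.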

In [21]:
model_path = "wasbs://" + CONTAINER + "@" + STORAGE_ACCOUNT + ".blob.core.windows.net/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663/"

my_sentence_detector_dl = SentenceDetectorDLModel.load(model_path) \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
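The model path above follows the `wasbs://<container>@<account>.blob.core.windows.net/<path>` layout used by the Hadoop Azure connector. A small helper, sketched here under the hypothetical name `wasbs_uri` with placeholder account and container names, can make that layout explicit and avoid doubled slashes:

```python
def wasbs_uri(container, storage_account, path):
    """Build a wasbs:// URI of the form wasbs://<container>@<account>.blob.core.windows.net/<path>."""
    # Drop leading slashes so the host and the path are joined by exactly one "/".
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{path.lstrip('/')}"


uri = wasbs_uri(
    "test",
    "mystorageaccount",
    "/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663/",
)
print(uri)
```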

In [22]:
pipeline = Pipeline(stages=[document_assembler, my_sentence_detector_dl, tokenizer])
pipeline_model = pipeline.fit(empty_df)

In [23]:
result = pipeline_model.transform(data_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|This is a sentenc...|[{document, 0, 43...|[{document, 0, 18...|[{token, 0, 3, Th...|
+--------------------+--------------------+--------------------+--------------------+

