![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/util/Load_Model_from_Azure_Storage.ipynb)

## Loading Pretrained Models from Azure

In [None]:
# Only run this cell when using Spark NLP on Google Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

## Defining Azure Storage URI in cache_pretrained

In Spark NLP you can configure the location where pre-trained models are downloaded. Starting with Spark NLP 5.1.0, this location can be an Azure Storage URI, a GCP Storage URI, or a distributed file system path such as HDFS or Databricks FS (DBFS).

In this notebook, we walk through the steps required to use an external Azure Storage URI as the `cache_pretrained` folder. To do this, we need to configure the Spark session with the required settings for Spark NLP and Spark ML.

### Spark NLP Settings

`cache_folder`: The Azure URI where Spark NLP pre-trained models will be stored. It is set with the config `spark.jsl.settings.pretrained.cache_folder`.

### Spark ML Settings

Spark ML requires the following configuration to load a model from Azure:


1. Azure connector: You need to identify your Hadoop version and set the matching dependency in `spark.jars.packages`.
2. Hadoop File System: You also need to set up the Hadoop file system to work with Azure storage as a file system. This is defined under the `spark.hadoop.fs.azure` config prefix.

To integrate with Azure, we need to define the STORAGE_ACCOUNT and AZURE_ACCOUNT_KEY variables:
1. STORAGE_ACCOUNT: In the Microsoft Azure portal, look under Resources for the entry of type Storage account; its name is your storage account.
2. AZURE_ACCOUNT_KEY: See "View account access keys" in the official [Azure documentation](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal).

Then you can define these two properties as variables and set them during Spark session creation:

In [None]:
print("Enter your Storage Account:")
STORAGE_ACCOUNT = input()

In [None]:
print("Enter your Azure Account Key:")
AZURE_ACCOUNT_KEY = input()
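
If typing the values interactively is inconvenient, for example in a scheduled job, they could instead be read from environment variables. The following is a minimal sketch and not part of Spark NLP: `read_setting` is a hypothetical helper, and the variable names `AZURE_STORAGE_ACCOUNT` and `AZURE_ACCOUNT_KEY` are arbitrary choices for this example. It falls back to a prompt when a variable is missing and uses `getpass` so that the account key is not echoed.

```python
import os
from getpass import getpass


def read_setting(env_name, prompt, secret=False, env=None):
    """Return env[env_name] if it is set and non-empty, otherwise prompt for it.

    `env` defaults to the real process environment; passing a dict makes
    the helper easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    value = env.get(env_name)
    if value:
        return value
    # getpass hides the typed value, which is preferable for secrets
    return getpass(prompt) if secret else input(prompt)


# Demo with a stand-in environment (placeholder values, not real credentials)
demo_env = {
    "AZURE_STORAGE_ACCOUNT": "MY_STORAGE_ACCOUNT",
    "AZURE_ACCOUNT_KEY": "MY_ACCOUNT_KEY",
}
demo_account = read_setting("AZURE_STORAGE_ACCOUNT", "Enter your Storage Account: ", env=demo_env)
demo_key = read_setting("AZURE_ACCOUNT_KEY", "Enter your Azure Account Key: ", secret=True, env=demo_env)
print(demo_account)
```

In the notebook itself you would drop the `env=` argument so the real environment is consulted, and assign the results to `STORAGE_ACCOUNT` and `AZURE_ACCOUNT_KEY`.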

In [3]:
azure_hadoop_config = "spark.hadoop.fs.azure.account.key." + STORAGE_ACCOUNT + ".blob.core.windows.net"
cache_folder = "https://" + STORAGE_ACCOUNT + ".blob.core.windows.net/test/models"

In [4]:
print(azure_hadoop_config)
print(cache_folder)

spark.hadoop.fs.azure.account.key.MY_STORAGE_ACCOUNT.blob.core.windows.net
https://MY_STORAGE_ACCOUNT.blob.core.windows.net/test/models


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
import pyspark
from pyspark.ml import Pipeline  # used explicitly below when building the pipeline

In [None]:
hadoop_azure_pkg = "org.apache.hadoop:hadoop-azure:3.3.4"
azure_storage_pkg = "com.microsoft.azure:azure-storage:8.6.6"
azure_identity_pkg = "com.azure:azure-identity:1.9.1"
azure_storage_blob_pkg = "com.azure:azure-storage-blob:12.22.2"
azure_pkgs = hadoop_azure_pkg + "," + azure_storage_pkg + "," + azure_identity_pkg + "," + azure_storage_blob_pkg

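# Azure Storage configuration
azure_params = {
    "spark.jars.packages": azure_pkgs,
    # Hadoop file system setting: access key for the storage account
    azure_hadoop_config: AZURE_ACCOUNT_KEY,
    "spark.jsl.settings.pretrained.cache_folder": cache_folder
}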

spark = sparknlp.start(params=azure_params)

print("Apache Spark version: {}".format(spark.version))

In [None]:
print(f"Hadoop version = {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")

### Disclaimer: 
- Interaction with Azure depends on the Spark, Hadoop and Azure implementations, which are outside our scope. Keep in mind that the configuration requirements or formats may change in other releases.
- Note that the versions of the `hadoop-azure`, `azure-storage`, `azure-identity` and `azure-storage-blob` packages must be compatible with each other; otherwise, loading will not work. The example in this notebook uses Spark 3.4.0 and Hadoop 3.3.4, so you must adjust those versions to match your Hadoop version.
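
Because `hadoop-azure` is published with the same version numbers as Hadoop itself, one way to keep the connector aligned is to derive its coordinate from the Hadoop version string. The sketch below assumes you already know that version (for example from an earlier session or your cluster documentation). `build_azure_packages` is a hypothetical helper, and the other three coordinates are copied unchanged from this notebook; they have not been checked against other Hadoop releases.

```python
# Coordinates copied from this notebook; adjust them if your setup needs other versions.
DEFAULT_EXTRA_PACKAGES = (
    "com.microsoft.azure:azure-storage:8.6.6",
    "com.azure:azure-identity:1.9.1",
    "com.azure:azure-storage-blob:12.22.2",
)


def build_azure_packages(hadoop_version, extra_packages=DEFAULT_EXTRA_PACKAGES):
    """Build the comma-separated value for `spark.jars.packages`.

    The hadoop-azure coordinate reuses the Hadoop version so that the
    connector matches the Hadoop build that Spark runs on.
    """
    hadoop_azure_pkg = f"org.apache.hadoop:hadoop-azure:{hadoop_version}"
    return ",".join([hadoop_azure_pkg, *extra_packages])


example_pkgs = build_azure_packages("3.3.4")
print(example_pkgs)
```

With `"3.3.4"` this reproduces the `azure_pkgs` string built earlier in the notebook.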

In [None]:
sample_text = "This is a sentence. This is another sentence"
data_df = spark.createDataFrame([[sample_text]]).toDF("text").cache()

empty_df = spark.createDataFrame([[""]]).toDF("text")

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

In [None]:
import time
start_time = time.time()

sentence_detector_dl = SentenceDetectorDLModel \
    .pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
Elapsed time:  24.047518253326416


In [None]:
start_time = time.time()

ner_model = NerDLModel.pretrained()

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time)

ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
Elapsed time:  11.117968082427979
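
The timing code in these cells repeats the same start/stop pattern. As an optional sketch, it could be wrapped in a small context manager; `timed` is a hypothetical helper written for this example, not a Spark NLP utility, and the `time.sleep` call only stands in for a model download.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {"label": label, "elapsed": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the duration even if the block raises
        result["elapsed"] = time.perf_counter() - start
        print(f"{label} elapsed time: {result['elapsed']:.2f} s")


# Stand-in workload; in the notebook this would be a call such as NerDLModel.pretrained()
with timed("demo") as timing:
    time.sleep(0.01)
```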


In [None]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector_dl, tokenizer])
pipeline_model = pipeline.fit(empty_df)

In [None]:
result = pipeline_model.transform(data_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|This is a sentenc...|[{document, 0, 43...|[{document, 0, 18...|[{token, 0, 3, Th...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
from sparknlp.pretrained import PretrainedPipeline

start_time = time.time()
pipeline_model = PretrainedPipeline('explain_document_ml', lang = 'en')
end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time)

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[OK!]
Elapsed time:  31.03494954109192


In [None]:
pipeline = PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en")
pipeline.annotate("My name is John and I work at John Snow Labs.")

albert_xlarge_token_classifier_conll03_pipeline download started this may take some time.
Approx size to download 196.9 MB
[OK!]


{'ner_chunk': ['John', 'John Snow Labs'],
 'token': ['My',
  'name',
  'is',
  'John',
  'and',
  'I',
  'work',
  'at',
  'John',
  'Snow',
  'Labs',
  '.'],
 'sentence': ['My name is John and I work at John Snow Labs.']}