![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/34.0.Model_Download_Helpers.ipynb)


## Colab Setup

📌 To run this notebook yourself, you need to upload your license keys. Run the cells below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/5.0.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


In [5]:
import glob
import string
import numpy as np

# ResourceDownloader

This notebook covers the main methods of the `ResourceDownloader` class and their parameters.

**📖 Learning Objectives:**

1. Understand how to use `ResourceDownloader`.

2. Become comfortable using the different parameters of its methods.




**🔗 Helpful Links:**


- Python Docs : [ResourceDownloader](https://sparknlp.org/api/python/reference/autosummary/sparknlp/pretrained/resource_downloader/index.html#sparknlp.pretrained.resource_downloader.ResourceDownloader)

- Scala Docs : [ResourceDownloader](https://sparknlp.org/api/com/johnsnowlabs/nlp/pretrained/ResourceDownloader.html)

- For extended examples of usage, see the [Spark NLP Workshop repository]().

In [6]:
from sparknlp.pretrained import ResourceDownloader

## showPublicModels

In [7]:
ResourceDownloader.showPublicModels(lang="en", version="2.4.0")

+------------------------+------+---------+
| Model                  | lang | version |
+------------------------+------+---------+
| token_rules            |  en  | 2.1.0   |
| onto_100               |  en  | 2.1.0   |
| onto_300               |  en  | 2.1.0   |
| bert_base_cased        |  en  | 2.2.0   |
| bert_uncased           |  en  | 2.2.0   |
| bert_base_uncased      |  en  | 2.2.0   |
| bert_large_cased       |  en  | 2.2.0   |
| bert_large_uncased     |  en  | 2.2.0   |
| ner_dl_bert            |  en  | 2.2.0   |
| pos_ud_ewt             |  en  | 2.2.2   |
| glove_100d             |  en  | 2.4.0   |
| onto_100               |  en  | 2.4.0   |
| onto_300               |  en  | 2.4.0   |
| ner_dl                 |  en  | 2.4.0   |
| ner_dl_sentence        |  en  | 2.4.0   |
| elmo                   |  en  | 2.4.0   |
| bert_base_cased        |  en  | 2.4.0   |
| bert_base_uncased      |  en  | 2.4.0   |
| bert_large_cased       |  en  | 2.4.0   |
| bert_large_uncased     |  en  

## downloadModel

    Downloads and loads a model with the default downloader. Usually this method does not need to be called directly, as it is called by the `pretrained()`
    method of the annotator. Returns the loaded pretrained annotator or pipeline.


    Parameters:
      reader    : Name of the class to read the model for
      name      : Name of the pretrained model
      language  : Language of the model
      remote_loc: Directory of the Spark NLP Folder, by default None


[**glove_100d**](https://sparknlp.org/2020/01/22/glove_100d.html)



In [8]:
# You can download the model directly.
# parameters are class name, model name, lang, and remote location
ResourceDownloader.downloadModel(nlp.WordEmbeddingsModel, "glove_100d", "en",remote_loc="public/models")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


WORD_EMBEDDINGS_MODEL_48cffc8b9a76

In [9]:
# You can directly download the model and assign it to the corresponding variable

embeddings = ResourceDownloader.downloadModel(nlp.WordEmbeddingsModel, "glove_100d", "en",remote_loc="public/models")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [10]:
# check the cache folder path

path=glob.glob("/root/cache_pretrained/glove_100d*")
path

['/root/cache_pretrained/glove_100d_en_2.4.0_2.4_1579690104032']
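
The cache folder name above appears to encode the model name, language, library version, Spark version, and a timestamp. The following is a minimal sketch for splitting such a name into its parts. The naming pattern is an assumption inferred from the folder names shown in this notebook, and `CachedModel` and `parse_cache_dir` are hypothetical helpers, not part of Spark NLP.

```python
import os
import re
from typing import NamedTuple


class CachedModel(NamedTuple):
    name: str
    lang: str
    lib_version: str
    spark_version: str
    timestamp: int


# Assumed pattern: <name>_<lang>_<library version>_<Spark version>_<timestamp>
_CACHE_DIR = re.compile(
    r"^(?P<name>.+)_(?P<lang>[a-z]{2,3})_(?P<lib>\d+(?:\.\d+)+)"
    r"_(?P<spark>\d+(?:\.\d+)+)_(?P<ts>\d+)$"
)


def parse_cache_dir(path: str) -> CachedModel:
    """Split a cache folder name into its parts (hypothetical helper)."""
    base = os.path.basename(path.rstrip("/"))
    match = _CACHE_DIR.match(base)
    if match is None:
        raise ValueError(f"Unrecognized cache folder name: {base}")
    return CachedModel(
        match["name"], match["lang"], match["lib"], match["spark"], int(match["ts"])
    )


info = parse_cache_dir("/root/cache_pretrained/glove_100d_en_2.4.0_2.4_1579690104032")
print(info.name, info.lang, info.lib_version)  # glove_100d en 2.4.0
```

Parsing the name this way can help when checking which version of a model is already cached before loading it by path.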

In [11]:
# you can load the downloaded model with the corresponding annotator

embeddings = nlp.WordEmbeddingsModel.load("/root/cache_pretrained/glove_100d_en_2.4.0_2.4_1579690104032") \
  .setInputCols("sentence", "token") \
  .setOutputCol("embeddings")

[**nerdl_restaurant_100d**](https://sparknlp.org/2021/12/31/nerdl_restaurant_100d_en.html)


In [12]:
ResourceDownloader.downloadModel(nlp.NerDLModel, "nerdl_restaurant_100d", "en",remote_loc="public/models")

nerdl_restaurant_100d download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


NerDLModel_ffcdc63f298c

In [13]:
path=glob.glob("/root/cache_pretrained/nerdl_restaurant_100d*")
path

['/root/cache_pretrained/nerdl_restaurant_100d_en_3.3.4_3.0_1640949258750']

In [14]:
# you can use absolute path
ner_model = nlp.NerDLModel.load(path[0]) \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

**pipeline**

In [15]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')


embeddings = nlp.WordEmbeddingsModel.load("/root/cache_pretrained/glove_100d_en_2.4.0_2.4_1579690104032") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nerdl = nlp.NerDLModel.load("/root/cache_pretrained/nerdl_restaurant_100d_en_3.3.4_3.0_1640949258750")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        nerdl,
        ner_converter
      ])

text = ["""Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day."""]

data = spark.createDataFrame([text]).toDF("text")

result = nlp_pipeline.fit(data).transform(data)

In [16]:
from sparknlp_display import NerVisualizer

# Collect the rows once instead of on every loop iteration
rows = result.collect()

for row in rows:
  NerVisualizer().display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
  )

## showPublicPipelines

In [17]:
from sparknlp.pretrained import ResourceDownloader

ResourceDownloader.showPublicPipelines(lang="en", version="2.4.0")

+-------------------------------------+------+---------+
| Pipeline                            | lang | version |
+-------------------------------------+------+---------+
| dependency_parse                    |  en  | 2.0.2   |
| check_spelling                      |  en  | 2.1.0   |
| match_datetime                      |  en  | 2.1.0   |
| match_pattern                       |  en  | 2.1.0   |
| clean_pattern                       |  en  | 2.1.0   |
| clean_stop                          |  en  | 2.1.0   |
| match_phrases                       |  en  | 2.1.0   |
| movies_sentiment_analysis           |  en  | 2.1.0   |
| explain_document_ml                 |  en  | 2.1.0   |
| clean_slang                         |  en  | 2.1.0   |
| analyze_sentiment                   |  en  | 2.1.0   |
| explain_document_dl                 |  en  | 2.1.0   |
| explain_document_dl_fast            |  en  | 2.1.0   |
| recognize_entities_dl               |  en  | 2.1.0   |
| recognize_entities_bert      

## downloadPipeline

    Downloads and loads a pipeline with the default downloader.

    Parameters:
        name      :  Name of the pipeline
        language  :  Language of the pipeline
        remote_loc:  Directory of the remote Spark NLP Folder, by default None

In [18]:
explain_document_ml = ResourceDownloader.downloadPipeline("explain_document_ml", language ="en", remote_loc = "public/models"  )

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[OK!]


In [19]:
pipeline = nlp.LightPipeline(explain_document_ml)

text = 'Peter Parker is a nice guy and lives in New York'

result = pipeline.annotate(text)

list(zip(result['token'], result['lemmas'], result['stems'], result['spell']))

[('Peter', 'Peter', 'peter', 'Peter'),
 ('Parker', 'Parker', 'parker', 'Parker'),
 ('is', 'be', 'i', 'is'),
 ('a', 'a', 'a', 'a'),
 ('nice', 'nice', 'nice', 'nice'),
 ('guy', 'guy', 'gui', 'guy'),
 ('and', 'and', 'and', 'and'),
 ('lives', 'life', 'live', 'lives'),
 ('in', 'in', 'in', 'in'),
 ('New', 'New', 'new', 'New'),
 ('York', 'York', 'york', 'York')]
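
Since `annotate` returns a dictionary of parallel lists, the columns can be compared directly. As a small sketch, the snippet below lists the tokens whose lemma differs from the surface form. The values are copied from the output above so that the example runs without a Spark session; the variable name `annotations` is only illustrative.

```python
# Values copied from the output above; in the notebook they would come from
# pipeline.annotate(text)
annotations = {
    "token":  ["Peter", "Parker", "is", "a", "nice", "guy", "and", "lives", "in", "New", "York"],
    "lemmas": ["Peter", "Parker", "be", "a", "nice", "guy", "and", "life", "in", "New", "York"],
}

# Keep only the tokens whose lemma differs from the surface form
changed = [
    (token, lemma)
    for token, lemma in zip(annotations["token"], annotations["lemmas"])
    if token != lemma
]
print(changed)  # [('is', 'be'), ('lives', 'life')]
```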

## downloadModelDirectly

    Downloads a model directly to the cache folder.
    You can copy-paste the S3 URI from the Models Hub and use it to download the model.

For available S3 URIs and models, please see the [Models Hub](https://nlp.johnsnowlabs.com/models).


    Parameters:
        name       : Name of the model or s3 URI
        remote_loc : Directory of the remote Spark NLP Folder, by default "public/models"
        unzip      : Whether to unzip the model, by default True


### with **S3 URI**

In [20]:
# You can download the model directly.
# parameters are model name, remote location, and unzip

# model link:  https://sparknlp.org/2022/06/01/nerdl_conll_elmo_en_3_0.html

s3_uri = "s3://auxdata.johnsnowlabs.com/public/models/nerdl_conll_elmo_en_4.0.0_3.0_1654103884644.zip"

ResourceDownloader.downloadModelDirectly(s3_uri,  "public/models", unzip=True)


### with **model name**

In [21]:
# You can download the model directly.
# parameters are model name, remote location, and unzip

model_name = "public/models/nerdl_restaurant_100d_en_3.3.4_3.0_1640949258750.zip"

ResourceDownloader.downloadModelDirectly(model_name,  "public/models", unzip=True)
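
The two cells above pass the archive in two forms: a full S3 URI and a key relative to the bucket. The sketch below shows how the two forms relate by stripping the scheme and bucket from a URI. `s3_uri_to_key` is a hypothetical helper written for illustration; whether the library performs the same conversion internally is not confirmed here.

```python
from urllib.parse import urlparse


def s3_uri_to_key(uri_or_key: str) -> str:
    """Return the object key for an s3:// URI; pass plain keys through unchanged."""
    parsed = urlparse(uri_or_key)
    if parsed.scheme != "s3":
        # Already a bucket-relative key such as "public/models/<file>.zip"
        return uri_or_key
    # For s3://<bucket>/<key>, the bucket is the netloc and the key is the path
    return parsed.path.lstrip("/")


key = s3_uri_to_key(
    "s3://auxdata.johnsnowlabs.com/public/models/nerdl_conll_elmo_en_4.0.0_3.0_1654103884644.zip"
)
print(key)  # public/models/nerdl_conll_elmo_en_4.0.0_3.0_1654103884644.zip
```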


## clearCache

    Clears the cache entry of a model.

    Parameters:
      name      : Name of the model
      language  : Language of the model
      remote_loc: Directory of the remote Spark NLP Folder, by default None



In [22]:
# ResourceDownloader.clearCache("nerdl_conll_elmo","en","public/models")

# Note: if Amazon S3 throws a credential error, ignore it and check your cache folder; the model will have been deleted.

# InternalResourceDownloader


In [23]:
from sparknlp_jsl.pretrained import InternalResourceDownloader

### showPrivateModels

    Shows the private models available for download.

    Parameters:
        annotator : The annotator to filter by. Defaults to None.
        lang      : The language to filter by. Defaults to None.
        version   : The version to filter by. Defaults to None.



In [24]:
InternalResourceDownloader.showPrivateModels("MedicalNerModel","en", "4.0.0")

+----------------------------------------+------+---------+
| Model                                  | lang | version |
+----------------------------------------+------+---------+
| nerdl_tumour_demo                      |  en  | 1.7.3   |
| nerdl_tumour_demo                      |  en  | 1.8.0   |
| nerdl_tumour_demo                      |  en  | 2.0.2   |
| ner_healthcare                         |  en  | 2.4.4   |
| ner_radiology                          |  en  | 2.7.0   |
| ner_deid_augmented                     |  en  | 2.7.1   |
| ner_deidentify_dl                      |  en  | 2.7.2   |
| ner_events_admission_clinical          |  en  | 2.7.4   |
| ner_clinical                           |  en  | 3.0.0   |
| ner_radiology                          |  en  | 3.0.0   |
| ner_bionlp                             |  en  | 3.0.0   |
| ner_posology                           |  en  | 3.0.0   |
| ner_deid_augmented                     |  en  | 3.0.0   |
| ner_anatomy                           

### showPrivatePipelines

    Shows the private pipelines available for download.

    Parameters:
        lang    : The language to filter by. Defaults to None.
        version : The version to filter by. Defaults to None.



In [25]:
InternalResourceDownloader.showPrivatePipelines("en", "4.0.0")

+--------------------------------------------------------+------+---------+
| Pipeline                                               | lang | version |
+--------------------------------------------------------+------+---------+
| clinical_analysis                                      |  en  | 2.4.0   |
| clinical_ner_assertion                                 |  en  | 2.4.0   |
| clinical_deidentification                              |  en  | 2.4.0   |
| explain_clinical_doc_ade                               |  en  | 2.7.3   |
| recognize_entities_posology                            |  en  | 3.0.0   |
| explain_clinical_doc_carp                              |  en  | 3.0.0   |
| explain_clinical_doc_ade                               |  en  | 3.0.0   |
| explain_clinical_doc_era                               |  en  | 3.0.0   |
| icd10cm_snomed_mapping                                 |  en  | 3.0.2   |
| snomed_icd10cm_mapping                                 |  en  | 3.0.2   |
| icd10cm_um

### downloadModel

    Downloads a model from S3.

    Parameters:
        reader    : The reader class to use to load the model.
        name      : The name of the model to download.
        language  : The language of the model to download.
        remote_loc: The remote location of the model. Defaults to None.


In [26]:
ner_clinical = InternalResourceDownloader.downloadModel(medical.NerModel,"ner_clinical","en",remote_loc="clinical/models")

ner_clinical download started this may take some time.
[OK!]


In [27]:
embeddings = ResourceDownloader.downloadModel(nlp.WordEmbeddingsModel,"embeddings_clinical","en",remote_loc="clinical/models")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [28]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

embeddings = nlp.WordEmbeddingsModel.load("/root/cache_pretrained/embeddings_clinical_en_2.4.0_2.4_1580237286004") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

# or just set Input and Output columns
# embeddings\
#     .setInputCols("sentence", "token") \
#     .setOutputCol("embeddings")

ner_clinical = medical.NerModel.load("/root/cache_pretrained/ner_clinical_en_3.0.0_3.0_1617208419368")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# or just set Input and Output columns
# ner_clinical\
#     .setInputCols(["sentence", "token", "embeddings"])\
#     .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_clinical,
        ner_converter
      ])

text = ["""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes"""]

data = spark.createDataFrame([text]).toDF("text")

result = nlp_pipeline.fit(data).transform(data)

In [29]:
from sparknlp_display import NerVisualizer

# Collect the rows once instead of on every loop iteration
rows = result.collect()

for row in rows:
  NerVisualizer().display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
  )

### Control where the model is downloaded

By setting `cache_folder_path`, you can control where downloaded resources are stored, which makes it easy to access and reuse the downloaded models in later operations or workflows.

In [30]:
# The first argument is the path to the zip file; the second is the remote folder.
InternalResourceDownloader.downloadModelDirectly("clinical/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip",
                                                 "clinical/models",
                                                 unzip=False,
                                                 cache_folder_path="/content/models")

In [31]:
cd models

/content/models


In [32]:
ls

ner_clinical_large_en_2.5.0_2.4_1590021302624.zip
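
Because the download above used `unzip=False`, only the `.zip` archive is stored in the target folder. If you later want a loadable folder, one option is to extract it yourself with the standard library. This is a minimal sketch; `extract_model_archive` is a hypothetical helper, and it assumes the archive holds the model files at its top level.

```python
import os
import zipfile


def extract_model_archive(zip_path: str, target_root: str) -> str:
    """Extract a model zip into <target_root>/<archive name without .zip>."""
    folder_name = os.path.splitext(os.path.basename(zip_path))[0]
    target_dir = os.path.join(target_root, folder_name)
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir


# Example call with the file listed above (left commented so the block runs anywhere):
# model_dir = extract_model_archive(
#     "/content/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip",
#     "/content/models",
# )
```

The returned folder path could then be passed to the matching annotator's `load` method, as done earlier in this notebook for the cached models.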


### returnPrivateModels

    Return private models available for download.

    Parameters:
        annotator : The annotator to filter by. Defaults to None.
        lang      : The language to filter by. Defaults to None.
        version   : The version to filter by. Defaults to None.

In [33]:
ner_models = InternalResourceDownloader.returnPrivateModels("MedicalNerModel","en", "4.0.0")
ner_models

[['nerdl_tumour_demo', 'en', '1.7.3'],
 ['nerdl_tumour_demo', 'en', '1.8.0'],
 ['nerdl_tumour_demo', 'en', '2.0.2'],
 ['ner_healthcare', 'en', '2.4.4'],
 ['ner_radiology', 'en', '2.7.0'],
 ['ner_deid_augmented', 'en', '2.7.1'],
 ['ner_deidentify_dl', 'en', '2.7.2'],
 ['ner_events_admission_clinical', 'en', '2.7.4'],
 ['ner_clinical', 'en', '3.0.0'],
 ['ner_radiology', 'en', '3.0.0'],
 ['ner_bionlp', 'en', '3.0.0'],
 ['ner_posology', 'en', '3.0.0'],
 ['ner_deid_augmented', 'en', '3.0.0'],
 ['ner_anatomy', 'en', '3.0.0'],
 ['ner_risk_factors', 'en', '3.0.0'],
 ['ner_chemprot_clinical', 'en', '3.0.0'],
 ['ner_posology_small', 'en', '3.0.0'],
 ['ner_posology_greedy', 'en', '3.0.0'],
 ['ner_deid_enriched', 'en', '3.0.0'],
 ['ner_drugs_greedy', 'en', '3.0.0'],
 ['jsl_ner_wip_clinical', 'en', '3.0.0'],
 ['ner_posology_large', 'en', '3.0.0'],
 ['jsl_ner_wip_greedy_clinical', 'en', '3.0.0'],
 ['ner_clinical_large', 'en', '3.0.0'],
 ['ner_diseases', 'en', '3.0.0'],
 ['ner_aspect_based_sentiment'
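
Because `returnPrivateModels` returns a plain list of `[name, lang, version]` rows, it can be post-processed with ordinary Python. The sketch below keeps the highest listed version for each model name. It assumes versions are dot-separated integers, as in the output above; `version_key` and `latest_versions` are hypothetical helpers.

```python
from typing import Dict, List


def version_key(version: str) -> tuple:
    """Turn '3.0.0' into (3, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_versions(models: List[List[str]]) -> Dict[str, str]:
    """Map each model name to the highest version listed for it."""
    latest: Dict[str, str] = {}
    for name, _lang, version in models:
        if name not in latest or version_key(version) > version_key(latest[name]):
            latest[name] = version
    return latest


# A few rows copied from the output above
sample = [
    ["nerdl_tumour_demo", "en", "1.7.3"],
    ["nerdl_tumour_demo", "en", "2.0.2"],
    ["ner_radiology", "en", "2.7.0"],
    ["ner_radiology", "en", "3.0.0"],
]
print(latest_versions(sample))
# {'nerdl_tumour_demo': '2.0.2', 'ner_radiology': '3.0.0'}
```

In the notebook, `latest_versions(ner_models)` would apply the same logic to the full list.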

# UpdateModels

## updateCacheModels

    Refreshes all pretrained models located in the cache pretrained folder.
    Checks the existing models in the cache pretrained folder and whether there is a new
    version of each model. If there is a new version, it will be downloaded and
    will overwrite the existing one.

    Parameters:
        cache_folder : Path where the models will be refreshed. i.e ("hdfs:..","file:...")

In [34]:
# clear cache folder
!rm -rf /root/cache_pretrained

In [35]:
from sparknlp.pretrained import ResourceDownloader

# The first argument is the path to the zip file; the second is the remote folder.
ResourceDownloader.downloadModelDirectly("clinical/models/embeddings_clinical_en_2.0.2_2.4_1558454742956.zip", "clinical/models")
ResourceDownloader.downloadModelDirectly("clinical/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip", "clinical/models")

In [36]:
ls ~/cache_pretrained

embeddings_clinical_en_2.0.2_2.4_1558454742956/
ner_clinical_large_en_2.5.0_2.4_1590021302624/


In [37]:
from sparknlp_jsl.updateModels import UpdateModels
UpdateModels.updateCacheModels()

In [38]:
ls ~/cache_pretrained

embeddings_clinical_en_2.0.2_2.4_1558454742956/
embeddings_clinical_en_2.4.0_2.4_1580237286004/
ner_clinical_large_en_2.5.0_2.4_1590021302624/
ner_clinical_large_en_3.0.0_3.0_1617206114650/

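In the listing above, the cache holds both the older and the newer folder for each model after the refresh. When loading by path, you may want the newest one. The sketch below picks, for each model and language, the folder with the largest trailing timestamp. It assumes the folder layout `<name>_<lang>_<library version>_<Spark version>_<timestamp>` seen in this notebook; `newest_per_model` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple


def newest_per_model(folder_names: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Keep, for each (name, lang) pair, the folder with the largest timestamp."""
    newest: Dict[Tuple[str, str], Tuple[int, str]] = {}
    for folder in folder_names:
        # The last four underscore-separated fields are lang, versions, timestamp
        name, lang, _lib, _spark, timestamp = folder.rsplit("_", 4)
        key = (name, lang)
        if key not in newest or int(timestamp) > newest[key][0]:
            newest[key] = (int(timestamp), folder)
    return {key: folder for key, (_ts, folder) in newest.items()}


# Folder names copied from the listing above
folders = [
    "embeddings_clinical_en_2.0.2_2.4_1558454742956",
    "embeddings_clinical_en_2.4.0_2.4_1580237286004",
    "ner_clinical_large_en_2.5.0_2.4_1590021302624",
    "ner_clinical_large_en_3.0.0_3.0_1617206114650",
]
for key, folder in newest_per_model(folders).items():
    print(key, "->", folder)
```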

## updateModels

    Downloads all new pretrained models released within the specified date interval.

    Parameters:
       model_names: A list of names of the models to be downloaded.
       language: The language of the models, with a default value of "en".
       start_date: The starting date used to filter the models, in the format "yyyy-MM-dd".
       end_date: The ending date used to filter the models, in the format "yyyy-MM-dd".
       cache_folder: The path indicating where the models will be downloaded and stored.

In [39]:
UpdateModels.updateModels(start_date = "2021-01-01",
                          end_date = "2023-07-07",
                          model_names=["ner_clinical","ner_jsl"],
                          language="en",
                          remote_loc="clinical/models",
                          cache_folder="/content/jsl_models"
                          )
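
Since `start_date` and `end_date` are passed as strings in the "yyyy-MM-dd" format, it can be useful to validate them before starting a long download. The following is a minimal sketch using only the standard library; `check_date_range` is a hypothetical helper, and `%Y-%m-%d` is the Python equivalent of the documented format.

```python
from datetime import date, datetime
from typing import Tuple


def check_date_range(start_date: str, end_date: str) -> Tuple[date, date]:
    """Parse two 'yyyy-MM-dd' strings and make sure start is not after end."""
    # strptime raises ValueError if a string does not match the format
    start = datetime.strptime(start_date, "%Y-%m-%d").date()
    end = datetime.strptime(end_date, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


start, end = check_date_range("2021-01-01", "2023-07-07")
print((end - start).days, "days in the interval")
```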

In [40]:
ls /content/jsl_models

ner_clinical_en_3.0.0_3.0_1617208419368/  ner_jsl_en_4.2.0_3.0_1666181370373/
