## 📦 Air-Gapped Setup: Spark NLP for Healthcare

This notebook demonstrates how to **download and prepare all necessary libraries and dependencies** for running **Spark NLP for Healthcare** in **air-gapped environments**, where direct internet access is restricted or completely unavailable.

It covers downloading essential resources such as:
- ✅ Spark NLP and Spark NLP for Healthcare **JAR** files  
- ✅ Corresponding **Python wheel (.whl)** packages  

Once downloaded, these files should be **uploaded to an internal storage** location (e.g., an **S3 bucket within your private VPC**) so they can be securely used within your air-gapped cluster.

**Important Notes**:
- This notebook is designed for **Google Colab**. If you are using a different environment, you may need to adjust the code accordingly.
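- **numpy** and **pandas** wheels are built for a specific Python version and platform, so download the builds that match the Python version used in your air-gapped cluster. By default, `pip download` fetches wheels for the interpreter that runs this notebook (CPython 3.11 in the outputs shown here). You can download other builds manually from the [Python Package Index (PyPI)](https://pypi.org/) or pass target options such as `--python-version` to `pip download`.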
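
As a minimal sketch of the second note, the helper below builds a `pip download` command that targets a Python version other than the one running the notebook. The function name `pip_download_cmd`, the example version `3.10`, and the default platform tag are illustrative assumptions, not values taken from this notebook. Adjust them to your cluster.

```python
import shlex

def pip_download_cmd(package, python_version, dest="./python_libs",
                     platform="manylinux2014_x86_64"):
    """Build a `pip download` command for a target interpreter (hypothetical helper)."""
    return [
        "pip", "download", package,
        "--dest", dest,
        "--python-version", python_version,  # assumed cluster Python version, e.g. "3.10"
        "--platform", platform,              # assumed cluster platform tag
        "--implementation", "cp",            # CPython wheels
        # pip requires either this flag or --no-deps when a target
        # platform or Python version is specified.
        "--only-binary=:all:",
    ]

cmd = pip_download_cmd("numpy", "3.10")
print(shlex.join(cmd))
```

In a notebook cell, the printed command can be pasted after a `!` prefix. Because source distributions are excluded, the download fails if no matching wheel exists for the requested target.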


# Define Variables

In [1]:
import json, os
from google.colab import files

# Upload the license file once and store it under a fixed name
if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

Saving License-healthcare-600-airgap.json to License-healthcare-600-airgap.json


In [2]:
license_keys.keys()

dict_keys(['SPARK_NLP_LICENSE', 'SECRET', 'JSL_VERSION', 'PUBLIC_VERSION', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET', 'OCR_VERSION', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'])
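
Before running the download cells, it may help to confirm that the license file contains every key they rely on. The sketch below is one way to do that. The list of required keys is an assumption inferred from the variables the later cells interpolate (`SECRET`, `JSL_VERSION`, `PUBLIC_VERSION`) plus the AWS credentials that the AWS CLI is assumed to read from the environment. `missing_license_keys` is a hypothetical helper.

```python
# Keys assumed to be needed by the download cells in this notebook.
REQUIRED_KEYS = (
    "SECRET",
    "JSL_VERSION",
    "PUBLIC_VERSION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]

# Illustrative dict only; in the notebook, pass the loaded `license_keys`.
example = {"SECRET": "x", "JSL_VERSION": "6.0.0", "PUBLIC_VERSION": "6.0.0"}
print(missing_license_keys(example))
# → ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```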

# Download Jars

In [3]:
!pip install -q awscli

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sphinx 8.2.3 requires docutils<0.22,>=0.20, but you have docutils 0.19 which is incompatible.

In [4]:

# Download Spark NLP assembly jar
!aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-{PUBLIC_VERSION}.jar ./jars/spark-nlp-assembly-{PUBLIC_VERSION}.jar

# Download Spark NLP for Healthcare jar
!aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/{SECRET}/spark-nlp-jsl-{JSL_VERSION}.jar ./jars/spark-nlp-jsl-{JSL_VERSION}.jar

download: s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.0.jar to jars/spark-nlp-assembly-6.0.0.jar
download: s3://pypi.johnsnowlabs.com/6.0.0-86d3e799186208a54b77e542c5f27c0fa7bf2334/spark-nlp-jsl-6.0.0.jar to jars/spark-nlp-jsl-6.0.0.jar
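
A quick sanity check after the download could confirm that both JAR files exist and are not empty. This is a minimal sketch: the file names follow the pattern used in the `aws s3 cp` commands above, while the helper names `expected_jars` and `missing_or_empty` are hypothetical.

```python
from pathlib import Path

def expected_jars(public_version, jsl_version, jar_dir="./jars"):
    """Paths the two download commands are expected to produce."""
    jar_dir = Path(jar_dir)
    return [
        jar_dir / f"spark-nlp-assembly-{public_version}.jar",
        jar_dir / f"spark-nlp-jsl-{jsl_version}.jar",
    ]

def missing_or_empty(paths):
    """Return the paths that do not exist as files or have zero size."""
    return [p for p in paths if not p.is_file() or p.stat().st_size == 0]

# In the notebook, pass PUBLIC_VERSION and JSL_VERSION instead of literals.
problems = missing_or_empty(expected_jars("6.0.0", "6.0.0"))
print("missing or empty:", [str(p) for p in problems])
```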


# Download Python packages

In [5]:
# Download Spark NLP Python package
!pip download spark-nlp=={PUBLIC_VERSION} --dest ./python_libs

Collecting spark-nlp==6.0.0
  Downloading spark_nlp-6.0.0-py2.py3-none-any.whl.metadata (19 kB)
Downloading spark_nlp-6.0.0-py2.py3-none-any.whl (684 kB)
Saved ./python_libs/spark_nlp-6.0.0-py2.py3-none-any.whl
Successfully downloaded spark-nlp


In [6]:
# Download Spark NLP for Healthcare Python package
!pip download spark-nlp-jsl=={JSL_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/{SECRET} --dest ./python_libs

Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/6.0.0-86d3e799186208a54b77e542c5f27c0fa7bf2334
Collecting spark-nlp-jsl==6.0.0
  Downloading https://pypi.johnsnowlabs.com/6.0.0-86d3e799186208a54b77e542c5f27c0fa7bf2334/spark-nlp-jsl/spark_nlp_jsl-6.0.0-py3-none-any.whl (557 kB)
Collecting spark-nlp==6.0.0 (from spark-nlp-jsl==6.0.0)
  File was already downloaded /content/python_libs/spark_nlp-6.0.0-py2.py3-none-any.whl
Collecting cloudpickle (from spark-nlp-jsl==6.0.0)
  Downloading cloudpickle-3.1.1-py3-none-any.whl.metadata (7.1 kB)
Downloading cloudpickle-3.1.1-py3-none-any.whl (20 kB)
Saved ./python_libs/spark_nlp_jsl-6.0.0-py3-none-any.whl
Saved ./python_libs/cloudpickle-3.1.1-py3-none-any.whl
Successfully downloaded spark-nlp-jsl spark-nlp cloudpickle
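
For orientation, the wheels saved in `./python_libs` are meant to be installed later without contacting any package index. The sketch below builds such an offline install command using pip's `--no-index` and `--find-links` options. The helper name `offline_install_cmd` is hypothetical, and the wheel directory path on the air-gapped side is an assumption that depends on where you copy the files.

```python
import shlex

def offline_install_cmd(package, version, wheel_dir="./python_libs"):
    """Build a pip command that installs only from a local wheel directory."""
    return [
        "pip", "install",
        "--no-index",               # never contact PyPI or any other index
        "--find-links", wheel_dir,  # resolve packages from the local wheels
        f"{package}=={version}",
    ]

# In the notebook, JSL_VERSION would replace the literal version.
install_cmd = offline_install_cmd("spark-nlp-jsl", "6.0.0")
print(shlex.join(install_cmd))
```

Because `--no-index` disables every remote lookup, the install succeeds only if all dependencies (here `spark-nlp` and `cloudpickle`) are present in the same directory.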


In [7]:
# Download numpy
!pip download numpy --dest ./python_libs

Collecting numpy
  Downloading numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Downloading numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Saved ./python_libs/numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Successfully downloaded numpy


In [8]:
# Download pandas
!pip download pandas --dest ./python_libs

Collecting pandas
  Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting numpy>=1.23.2 (from pandas)
  File was already downloaded /content/python_libs/numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting python-dateutil>=2.8.2 (from pandas)
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Downloading six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)

Upload the ./jars and ./python_libs folders to your private storage to use them inside the air-gapped cluster.
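
If the private storage is an S3 bucket, the upload could be done with `aws s3 sync`. The sketch below only builds the commands. The bucket name `my-private-bucket` and the prefix `spark-nlp` are placeholders, and `s3_sync_cmd` is a hypothetical helper. It also assumes that the AWS CLI is configured with credentials that can write to your bucket, which may differ from the license credentials used for the downloads.

```python
import shlex
from pathlib import Path

def s3_sync_cmd(local_dir, bucket, prefix):
    """Build an `aws s3 sync` command that mirrors a local folder into a bucket."""
    folder_name = Path(local_dir).name  # "./jars" -> "jars"
    target = f"s3://{bucket}/{prefix.strip('/')}/{folder_name}"
    return ["aws", "s3", "sync", local_dir, target]

# Placeholder bucket and prefix; replace them with your own values.
for folder in ("./jars", "./python_libs"):
    print(shlex.join(s3_sync_cmd(folder, "my-private-bucket", "spark-nlp")))
```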