![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr-airgapped/Download_Libraries_For_Airgapped_Environment.ipynb)

## 📦 Air-Gapped Setup: Spark NLP for Healthcare

This notebook demonstrates how to **download and prepare all necessary libraries and dependencies** for running **Spark NLP for Healthcare** in **air-gapped environments** — where direct internet access is restricted or completely unavailable.

It covers downloading essential resources such as:
- ✅ Spark NLP and Spark NLP for Healthcare **JAR** files  
- ✅ Corresponding **Python wheel (.whl)** packages  

Once downloaded, these files should be **uploaded to an internal storage** location (e.g., an **S3 bucket within your private VPC**) so they can be securely used within your air-gapped cluster.

**Important Notes**:
- This notebook is designed for **Google Colab**. If you are using a different environment, you may need to adjust the code accordingly.
- **numpy** and **pandas** libraries should be downloaded according to the Python version used in your air-gapped cluster. You can download the specific versions of these libraries from the [Python Package Index (PyPI)](https://pypi.org/) or use the `pip download` command to get the appropriate versions.


# Define Variables

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

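# Expose the license key-value pairs as notebook variables
# (the cells below interpolate them, e.g. {PUBLIC_VERSION} and {SECRET})
locals().update(license_keys)
# Export them as environment variables too, so the AWS CLI can read the AWS credentials
os.environ.update(license_keys)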

In [2]:
license_keys.keys()

dict_keys(['SPARK_NLP_LICENSE', 'SECRET', 'JSL_VERSION', 'PUBLIC_VERSION', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET', 'OCR_VERSION', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'])
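Because the download cells below interpolate `{PUBLIC_VERSION}`, `{SECRET}`, and `{JSL_VERSION}` into their commands, it can help to check that the license file actually contains them before going further. The following is a minimal sketch, not part of the original notebook: the helper names `missing_license_keys` and `load_license` are hypothetical, and the list of required keys is an assumption based only on the variables the cells below reference (add the AWS credential keys if your setup depends on them).

```python
import json

# Assumption: these are the keys the download cells below interpolate.
REQUIRED_KEYS = ("SECRET", "JSL_VERSION", "PUBLIC_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]


def load_license(path="spark_jsl.json"):
    """Load the license JSON and fail early if a required key is missing."""
    with open(path) as f:
        license_keys = json.load(f)
    missing = missing_license_keys(license_keys)
    if missing:
        raise KeyError(f"License file {path!r} is missing: {', '.join(missing)}")
    return license_keys
```

Calling `load_license()` in place of the plain `json.load` would surface an incomplete license file immediately, rather than as a failed download later on.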

# Download Jars

In [None]:
# Install awscli
!pip install -q awscli

In [4]:
# Download the appropriate Spark NLP assembly JAR depending on the environment.
# Only one of these JARs is needed:
# - If the machine has a GPU, download the GPU-optimized assembly JAR.
# - If the machine does not have a GPU, download the standard CPU assembly JAR.
#
# Both download commands are shown below for reference, but **you only need to run one of them**
# depending on your environment. Do not use both at the same time.

# Download Spark NLP assembly jar
!aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-{PUBLIC_VERSION}.jar ./jars/spark-nlp-assembly-{PUBLIC_VERSION}.jar

# Download Spark NLP GPU jar
!aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-{PUBLIC_VERSION}.jar ./jars/spark-nlp-gpu-assembly-{PUBLIC_VERSION}.jar


download: s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.3.jar to jars/spark-nlp-assembly-6.0.3.jar
download: s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.3.jar to jars/spark-nlp-gpu-assembly-6.0.3.jar
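Since only one of the two assembly JARs is needed, the choice can also be made explicit in code. The sketch below is illustrative only: the function names and the `use_gpu` flag are hypothetical, while the bucket path and region are copied from the commands above. The flag should reflect whether the air-gapped cluster has GPUs, not whether the machine running this notebook does.

```python
def spark_nlp_jar_name(public_version, use_gpu=False):
    """Return the assembly JAR file name for the CPU or GPU variant."""
    variant = "spark-nlp-gpu-assembly" if use_gpu else "spark-nlp-assembly"
    return f"{variant}-{public_version}.jar"


def spark_nlp_jar_copy_command(public_version, use_gpu=False, dest_dir="./jars"):
    """Build the `aws s3 cp` argument list for the selected assembly JAR."""
    jar = spark_nlp_jar_name(public_version, use_gpu)
    return [
        "aws", "s3", "cp", "--region", "us-east-2",
        f"s3://auxdata.johnsnowlabs.com/public/jars/{jar}",
        f"{dest_dir}/{jar}",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`, for example `subprocess.run(spark_nlp_jar_copy_command(PUBLIC_VERSION, use_gpu=False), check=True)`.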


In [None]:
# Download Spark NLP for Healthcare jar
!aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/{SECRET}/spark-nlp-jsl-{JSL_VERSION}.jar ./jars/spark-nlp-jsl-{JSL_VERSION}.jar

# Download Python packages

In [6]:
# Download Spark NLP Python package
!pip download spark-nlp=={PUBLIC_VERSION} --dest ./python_libs

Collecting spark-nlp==6.0.3
  Downloading spark_nlp-6.0.3-py2.py3-none-any.whl.metadata (19 kB)
Downloading spark_nlp-6.0.3-py2.py3-none-any.whl (713 kB)
Saved ./python_libs/spark_nlp-6.0.3-py2.py3-none-any.whl
Successfully downloaded spark-nlp


In [None]:
# Download Spark NLP for Healthcare Python package
!pip download spark-nlp-jsl=={JSL_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/{SECRET} --dest ./python_libs

In [8]:
# Specify the Python version used by your EMR cluster. The default on EMR clusters is Python 3.9.
PYTHON_VERSION = "3.9"  # e.g., "3.11" if your cluster runs Python 3.11

In [9]:
# Download numpy
!python -m pip download numpy --only-binary=:all: --dest ./python_libs --python-version $PYTHON_VERSION --platform manylinux2014_x86_64 --implementation cp

Collecting numpy
  Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Saved ./python_libs/numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Successfully downloaded numpy
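The `cp39` tag in the saved file name shows that the wheel targets CPython 3.9, matching `PYTHON_VERSION`. A quick way to confirm that every wheel in `./python_libs` fits the target interpreter is to inspect the Python tag in each file name. The sketch below is a simplified check under stated assumptions: the function names are hypothetical, and it only accepts an exact `cpXY` tag or a generic `pyX`/`pyXY` tag, so it may flag wheels that are in fact compatible (for example `abi3` wheels built for an older minor version).

```python
from pathlib import Path


def wheel_python_tags(filename):
    """Return the Python tags in a wheel file name, e.g. ['cp39'] or ['py2', 'py3']."""
    stem = filename[:-len(".whl")]
    # Wheel names end with <python tag>-<abi tag>-<platform tag>.
    python_tag = stem.split("-")[-3]
    return python_tag.split(".")


def is_compatible(filename, python_version):
    """Simplified check: exact CPython tag or a generic pure-Python tag."""
    major, minor = python_version.split(".")
    accepted = {f"cp{major}{minor}", f"py{major}", f"py{major}{minor}"}
    return any(tag in accepted for tag in wheel_python_tags(filename))


def incompatible_wheels(directory, python_version):
    """List wheel file names in `directory` that fail the simplified check."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.whl")
        if not is_compatible(path.name, python_version)
    )
```

Running `incompatible_wheels("./python_libs", PYTHON_VERSION)` after all downloads should ideally return an empty list; any names it reports are worth checking by hand.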


In [10]:
# Download pandas
# Note: pandas is optional for Spark NLP.
# Skip it unless you need it for utility functions,
# DataFrame conversions (e.g., `.toPandas()`), or other purposes.
!python -m pip download pandas --only-binary=:all: --dest ./python_libs --python-version $PYTHON_VERSION --platform manylinux2014_x86_64 --implementation cp

Collecting pandas
  Downloading pandas-2.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.23.2 (from pandas)
  File was already downloaded /content/python_libs/numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting python-dateutil>=2.8.2 (from pandas)
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Downloading six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading pandas-2.3

Upload the ./jars and ./python_libs folders to your private storage to use them inside the air-gapped cluster.
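
Before uploading, it may be useful to record a checksum for each file so the copies in private storage can be verified from inside the air-gapped cluster. This is an optional sketch rather than a step from the original workflow; the helper names `sha256_of` and `build_manifest` are hypothetical, and only the Python standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(*directories):
    """Map each file path under the given directories to its SHA-256 digest."""
    manifest = {}
    for directory in directories:
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file():
                manifest[path.as_posix()] = sha256_of(path)
    return manifest
```

For example, `json.dump(build_manifest("./jars", "./python_libs"), open("manifest.json", "w"), indent=2)` writes a manifest that can be uploaded alongside the two folders and recomputed on the cluster for comparison.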