# Description
## This notebook provides a set of commands to install Spark NLP for offline usage. It contains 6 sections:

0) Initial setup

1) Download all dependencies for Spark NLP

2) Download all dependencies for Spark NLP (enterprise/licensed)

3) Download all dependencies for Spark NLP OCR

4) Download all models/embeddings for offline usage

5) Example of NER

`P.S.: This notebook ran successfully on:
                Distributor ID: Ubuntu
                Description:    Ubuntu 22.04.3 LTS
                Release:        22.04`


## 0) Initial setup

In [1]:
# Load the license keys from the 'spark_jsl.json' file

import json, os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)


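# Expose the keys as notebook variables and as environment variables,
# so that `!` shell commands can expand $SECRET, $JSL_VERSION, etc.
globals().update(license_keys)
os.environ.update(license_keys)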
license_keys.keys()



dict_keys(['SPARK_NLP_LICENSE', 'SECRET', 'JSL_VERSION', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET', 'OCR_VERSION', 'PUBLIC_VERSION', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'])
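Later cells expand these keys inside download URLs, so a missing or empty key only shows up as a failed `wget`. A minimal sketch of an early check follows; `missing_license_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, and the key list simply mirrors the `dict_keys` output above.

```python
# Keys that later cells rely on; copied from the dict_keys output above.
REQUIRED_KEYS = [
    "SPARK_NLP_LICENSE", "SECRET", "JSL_VERSION",
    "SPARK_OCR_LICENSE", "SPARK_OCR_SECRET", "OCR_VERSION",
    "PUBLIC_VERSION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
]

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [key for key in required if not license_keys.get(key)]

# Stand-in dict for illustration; in the notebook, pass the loaded license_keys.
example_keys = {"SECRET": "xxx", "JSL_VERSION": "5.2.0"}
print(missing_license_keys(example_keys, ["SECRET", "JSL_VERSION", "PUBLIC_VERSION"]))
```

An empty list means every key the notebook needs is present.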

In [3]:
! sudo apt-get update -qq

In [5]:
! sudo apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

Scanning processes...                                                           
Scanning candidates...                                                          
Scanning linux images...                                                        

Running kernel seems to be up-to-date.

Restarting services...


In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
!java -version

openjdk version "1.8.0_392"
OpenJDK Runtime Environment (build 1.8.0_392-8u392-ga-1~22.04-b08)
OpenJDK 64-Bit Server VM (build 25.392-b08, mixed mode)
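If the Java version should be checked programmatically rather than by eye, the version line can be parsed. The sketch below is an illustration only: `java_major_version` is a hypothetical helper, and it assumes the usual `version "..."` line format shown in the output above.

```python
import re

def java_major_version(version_output):
    """Extract the major Java version from `java -version` text (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_392" means Java 8. Newer scheme: "11.0.2" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)

print(java_major_version('openjdk version "1.8.0_392"'))
```

In practice the text could come from `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.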


In [3]:
print(os.environ["JAVA_HOME"])

/usr/lib/jvm/java-8-openjdk-amd64


In [4]:
!pip install --upgrade -q pyspark==3.4.1

In [4]:
!pip list | grep spark

pyspark                3.4.1


In [None]:
! curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
! unzip awscliv2.zip  # sudo apt install unzip if not installed
! sudo ./aws/install

## 1) Download all dependencies for Spark NLP

In [8]:
# spark-nlp jar
!sudo wget  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-$PUBLIC_VERSION.jar -P /usr/lib/spark/jars/

--2023-12-30 18:14:16--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.0.jar
Resolving s3.amazonaws.com (s3.amazonaws.com)... 16.182.38.160, 52.216.9.101, 52.216.93.221, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|16.182.38.160|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 722128385 (689M) [application/java-archive]
Saving to: ‘/usr/lib/spark/jars/spark-nlp-assembly-5.2.0.jar’


2023-12-30 18:14:36 (35.5 MB/s) - ‘/usr/lib/spark/jars/spark-nlp-assembly-5.2.0.jar’ saved [722128385/722128385]



In [11]:
# spark-nlp wheel
! pip install spark-nlp==$PUBLIC_VERSION

Defaulting to user installation because normal site-packages is not writeable
Collecting spark-nlp==5.2.0
  Downloading spark_nlp-5.2.0-py2.py3-none-any.whl (548 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 548.5/548.5 KB 7.6 MB/s eta 0:00:00
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.2.0


In [12]:
!pip list | grep spark

pyspark                3.4.1
spark-nlp              5.2.0


## 2) Download all dependencies for Spark NLP (enterprise/licensed)

In [13]:
# spark nlp JSL JAR
!sudo wget https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl-$JSL_VERSION.jar -P /usr/lib/spark/jars/

--2023-12-30 18:38:58--  https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/5.2.0-b05bf6cf9656ef4a0fe918a037481a8ef7245fcb/spark-nlp-jsl-5.2.0.jar
Resolving s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)... 52.92.19.224, 52.92.4.16, 52.218.56.11, ...
Connecting to s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)|52.92.19.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40061499 (38M) [application/octet-stream]
Saving to: ‘/usr/lib/spark/jars/spark-nlp-jsl-5.2.0.jar’


2023-12-30 18:39:00 (19.2 MB/s) - ‘/usr/lib/spark/jars/spark-nlp-jsl-5.2.0.jar’ saved [40061499/40061499]



In [15]:
# spark nlp JSL wheel
! sudo wget https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl -P /usr/lib/spark/jars/

--2023-12-30 18:41:09--  https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/5.2.0-b05bf6cf9656ef4a0fe918a037481a8ef7245fcb/spark-nlp-jsl/spark_nlp_jsl-5.2.0-py3-none-any.whl
Resolving s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)... 52.218.90.59, 52.218.36.218, 52.92.17.144, ...
Connecting to s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)|52.218.90.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 473226 (462K) [application/x-gzip]
Saving to: ‘/usr/lib/spark/jars/spark_nlp_jsl-5.2.0-py3-none-any.whl’


2023-12-30 18:41:10 (1.66 MB/s) - ‘/usr/lib/spark/jars/spark_nlp_jsl-5.2.0-py3-none-any.whl’ saved [473226/473226]



In [16]:
!pip install -q /usr/lib/spark/jars/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl

In [17]:
!pip list | grep spark

pyspark                3.4.1
spark-nlp              5.2.0
spark-nlp-jsl          5.2.0


## 3) Download all dependencies for Spark NLP OCR

In [18]:
# spark ocr JAR
!sudo wget -q https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/$SPARK_OCR_SECRET/jars/spark-ocr-assembly-$OCR_VERSION.jar -P /usr/lib/spark/jars/

#spark ocr wheel
!sudo wget -q https://s3.eu-west-1.amazonaws.com/pypi.johnsnowlabs.com/$SPARK_OCR_SECRET/spark-ocr/spark_ocr-$OCR_VERSION-py3-none-any.whl -P /usr/lib/spark/jars/

In [19]:
!ls -l /usr/lib/spark/jars/

total 1231488
-rw-r--r-- 1 root root 722128385 Dec  8 21:21 spark-nlp-assembly-5.2.0.jar
-rw-r--r-- 1 root root  40061499 Dec 23 21:13 spark-nlp-jsl-5.2.0.jar
-rw-r--r-- 1 root root 457626988 Nov 17 16:50 spark-ocr-assembly-5.1.0.jar
-rw-r--r-- 1 root root    473226 Dec 23 21:13 spark_nlp_jsl-5.2.0-py3-none-any.whl
-rw-r--r-- 1 root root  40736607 Nov 17 16:50 spark_ocr-5.1.0-py3-none-any.whl


In [20]:
!pip install -q /usr/lib/spark/jars/spark_ocr-$OCR_VERSION-py3-none-any.whl

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spark-nlp-jsl 5.2.0 requires spark-nlp==5.2.0, but you have spark-nlp 5.1.2 which is incompatible.

In [None]:
# To fix the version incompatibility, reinstall the pinned spark-nlp version:

# pip install spark-nlp==$PUBLIC_VERSION

In [12]:
#sanity check
!pip list | grep spark

pyspark                3.4.0
spark-nlp              5.2.0
spark-nlp-jsl          5.2.0
spark-ocr              5.1.0
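The same sanity check can be done from Python, which makes a conflict like the one pip reported above easy to detect automatically. This is a sketch under assumptions: `installed_versions` and `version_mismatches` are hypothetical helpers, and the expected versions are stand-in values taken from the outputs in this notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def version_mismatches(installed, expected):
    """Return {package: (installed, expected)} for every package that differs."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }

# Stand-in values reproducing the conflict from the pip error message.
installed = {"spark-nlp": "5.1.2", "spark-nlp-jsl": "5.2.0"}
expected = {"spark-nlp": "5.2.0", "spark-nlp-jsl": "5.2.0"}
print(version_mismatches(installed, expected))
```

In the notebook, `installed_versions(expected)` would replace the stand-in `installed` dict, and `expected` could be built from `PUBLIC_VERSION`, `JSL_VERSION` and `OCR_VERSION`.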


## Installation is complete. Next, download the models using the AWS keys

## 4) Download all models/embeddings for offline usage

In [22]:
!mkdir models

In [None]:
# For example purposes, download only the subset of models for NER and GloVe
!sudo aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/models/ public_models/ --recursive --exclude "*" --include "ner_dl*"

In [None]:
!sudo aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/models/ public_models/ --recursive --exclude "*" --include "glove*"

download: s3://auxdata.johnsnowlabs.com/public/models/glove_6B_100_xx_2.4.0_2.4_1579690037117.zip to public_models/glove_6B_100_xx_2.4.0_2.4_1579690037117.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.0.0_2.4_1553028251278.zip to public_models/glove_100d_en_2.0.0_2.4_1553028251278.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.0.2_2.4_1556534397055.zip to public_models/glove_100d_en_2.0.2_2.4_1556534397055.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip to public_models/glove_100d_en_2.4.0_2.4_1579690104032.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glove_6B_300_xx_2.4.0_2.4_1579698630432.zip to public_models/glove_6B_300_xx_2.4.0_2.4_1579698630432.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glove_6B_300_xx_2.0.2_2.4_1559059806004.zip to public_models/glove_6B_300_xx_2.0.2_2.4_1559059806004.zip
download: s3://auxdata.johnsnowlabs.com/public/models/glov

In [8]:
!sudo wget  -P models/ https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip 
!sudo wget  -P models/ https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip

--2023-12-30 19:42:15--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.168.0, 54.231.226.88, 52.217.229.184, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.168.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152394105 (145M) [application/zip]
Saving to: ‘models/glove_100d_en_2.4.0_2.4_1579690104032.zip’


2023-12-30 19:42:16 (104 MB/s) - ‘models/glove_100d_en_2.4.0_2.4_1579690104032.zip’ saved [152394105/152394105]



In [None]:
# !sudo aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/clinical/models/ clinical_models/ --recursive --exclude "*" --include "embeddings_clinical*"
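The recursive copy fetches several releases of the same model, so it can help to pick one programmatically. The sketch below assumes the file names follow the pattern `name_lang_libversion_sparkversion_timestamp.zip`; that reading is inferred from the names listed above rather than from documentation, and `ModelFile`, `parse_model_filename` and `newest` are hypothetical names.

```python
from typing import NamedTuple

class ModelFile(NamedTuple):
    name: str
    lang: str
    lib_version: str    # assumed: library version the model was built with
    spark_version: str  # assumed: Spark version
    timestamp: int      # assumed: upload time in epoch milliseconds

def parse_model_filename(filename):
    """Split a model archive name into its assumed components."""
    stem = filename[:-4] if filename.endswith(".zip") else filename
    # The model name itself may contain underscores, so split from the right.
    name, lang, lib_version, spark_version, timestamp = stem.rsplit("_", 4)
    return ModelFile(name, lang, lib_version, spark_version, int(timestamp))

def newest(filenames):
    """Return the file name with the largest timestamp."""
    return max(filenames, key=lambda f: parse_model_filename(f).timestamp)

glove_files = [
    "glove_100d_en_2.0.0_2.4_1553028251278.zip",
    "glove_100d_en_2.0.2_2.4_1556534397055.zip",
    "glove_100d_en_2.4.0_2.4_1579690104032.zip",
]
print(newest(glove_files))
```

With the three `glove_100d` archives from the listing, this selects the same file that the `wget` cell downloads.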

## 5) Example of NER

In [9]:
!unzip -q models/ner_dl_en_2.4.3_2.4_1584624950746.zip -d ner_dl_glove/

In [10]:
!unzip -q models/glove_100d_en_2.4.0_2.4_1579690104032.zip -d glove_embeddings/

In [11]:
ner_local_path = 'ner_dl_glove'
embeddings_local_path = 'glove_embeddings'
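On machines where `unzip` is not installed, the same extraction can be done with Python's standard library. This is a minimal sketch: `extract_model` is a hypothetical helper, and the demo archive below is a stand-in created on the fly, not a real model.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_model(zip_path, dest_dir):
    """Extract a model archive into dest_dir and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(path.name for path in dest.iterdir())

# Demo with a throwaway archive so the sketch runs without the real downloads.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "demo_model.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("metadata/part-00000", "{}")
    top_level = extract_model(archive_path, Path(tmp) / "demo_model")

print(top_level)
```

For the models above, the call would be `extract_model('models/ner_dl_en_2.4.3_2.4_1584624950746.zip', 'ner_dl_glove')`.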

In [4]:
import json
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

In [17]:
!ls -l /usr/lib/spark/jars/

total 1231488
-rw-r--r-- 1 root root 722128385 Dec  8 21:21 spark-nlp-assembly-5.2.0.jar
-rw-r--r-- 1 root root  40061499 Dec 23 21:13 spark-nlp-jsl-5.2.0.jar
-rw-r--r-- 1 root root 457626988 Nov 17 16:50 spark-ocr-assembly-5.1.0.jar
-rw-r--r-- 1 root root    473226 Dec 23 21:13 spark_nlp_jsl-5.2.0-py3-none-any.whl
-rw-r--r-- 1 root root  40736607 Nov 17 16:50 spark_ocr-5.1.0-py3-none-any.whl


In [5]:
def start():
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "10G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars", "/usr/lib/spark/jars/spark-nlp-assembly-5.2.0.jar,/usr/lib/spark/jars/spark-nlp-jsl-5.2.0.jar")
    return builder.getOrCreate()

spark = start()

23/12/30 19:40:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
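The `spark.jars` value above hard-codes the 5.2.0 file names. One way to keep it in sync with the license file is to build the string from the version variables. The sketch below assumes the jar names follow the patterns used in the download cells; `spark_jars_config` is a hypothetical helper.

```python
import posixpath

def spark_jars_config(jar_dir, public_version, jsl_version):
    """Build the comma-separated value for the spark.jars setting."""
    jars = [
        f"spark-nlp-assembly-{public_version}.jar",
        f"spark-nlp-jsl-{jsl_version}.jar",
    ]
    return ",".join(posixpath.join(jar_dir, jar) for jar in jars)

print(spark_jars_config("/usr/lib/spark/jars", "5.2.0", "5.2.0"))
```

In the notebook, the arguments would be `PUBLIC_VERSION` and `JSL_VERSION`, and the result would be passed to `.config("spark.jars", ...)`.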


## Sample pipeline

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# The ner_dl model was trained with glove_100d, so the pipeline uses the same embeddings
glove_embeddings = WordEmbeddingsModel.load(embeddings_local_path) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Public general-domain NER model; it emits tags such as PER and LOC
public_ner = NerDLModel.load(ner_local_path) \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

nlpPipeline = Pipeline(stages=[
 documentAssembler, 
 tokenizer,
 glove_embeddings,
 public_ner
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

In [13]:
df = spark.createDataFrame([['Peter Parker lives in New York.']]).toDF("text")

result = pipelineModel.transform(df)

result.select('token.result','ner.result').show(truncate=False)



+----------------------------------------+-------------------------------------+
|result                                  |result                               |
+----------------------------------------+-------------------------------------+
|[Peter, Parker, lives, in, New, York, .]|[B-PER, I-PER, O, O, B-LOC, I-LOC, O]|
+----------------------------------------+-------------------------------------+



                                                                                

In [14]:
light_model = LightPipeline(pipelineModel)

text = 'Peter Parker lives in New York.'

light_result = light_model.annotate(text)

list(zip(light_result['token'], light_result['ner']))

[('Peter', 'B-PER'),
 ('Parker', 'I-PER'),
 ('lives', 'O'),
 ('in', 'O'),
 ('New', 'B-LOC'),
 ('York', 'I-LOC'),
 ('.', 'O')]
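The token-level tags use the B-/I-/O scheme, so adjacent tokens still need to be merged into entity spans. Spark NLP provides the `NerConverter` annotator for this inside a pipeline; the plain-Python sketch below only illustrates the idea on the `LightPipeline` output, and `bio_to_entities` is a hypothetical helper.

```python
def bio_to_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) entity tuples."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (an I- tag without a matching B- is treated as a start).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities

tokens = ["Peter", "Parker", "lives", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_entities(tokens, tags))
```

With the notebook's result, the call would be `bio_to_entities(light_result['token'], light_result['ner'])`.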