![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Installing Spark NLP in offline mode




Spark NLP installation docs: https://nlp.johnsnowlabs.com/docs/en/install#offline

Medium article on air-gapped installation: https://medium.com/spark-nlp/installing-spark-nlp-and-spark-ocr-in-air-gapped-networks-offline-mode-f42a1ee6b7a8

Installation platforms: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/platforms

Spark NLP on PyPI: https://pypi.org/project/spark-nlp/3.4.2/#files

CPU vs GPU benchmark: https://nlp.johnsnowlabs.com/docs/en/CPUvsGPUbenchmark

## install pyspark v3.1.2

In [None]:
!pip -q install pyspark==3.1.2

## license key

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [3]:
license_keys.keys()

dict_keys(['SPARK_NLP_LICENSE', 'SECRET', 'JSL_VERSION', 'PUBLIC_VERSION', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET', 'OCR_VERSION'])
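`locals().update(...)` works at the top level of a notebook but has no reliable effect inside a function, so it can help to check the license file explicitly before exporting its values. The sketch below assumes the key names shown in the output above; `REQUIRED_KEYS` and `load_license` are hypothetical names for illustration, not part of Spark NLP.

```python
import json
import os

# Keys the rest of this notebook reads; copied from the output above.
REQUIRED_KEYS = ("SPARK_NLP_LICENSE", "SECRET", "JSL_VERSION", "PUBLIC_VERSION",
                 "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def load_license(path="spark_jsl.json", required=REQUIRED_KEYS):
    """Read the license JSON, check the required keys, and export them to os.environ."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in required if k not in keys]
    if missing:
        raise KeyError(f"license file {path!r} is missing keys: {missing}")
    # os.environ only accepts strings, so coerce every value.
    os.environ.update({k: str(v) for k, v in keys.items()})
    return keys
```

Calling `license_keys = load_license()` would then replace the `json.load` step above, with a clear error if the uploaded file is not the expected one.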

## download Spark NLP jars from S3

In [None]:
!pip -q install -q awscli

In [None]:
# public jar
!aws  s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-$PUBLIC_VERSION.jar /content/spark-nlp-$PUBLIC_VERSION.jar

# healthcare jar
!aws  s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl-$JSL_VERSION.jar /content/spark-nlp-jsl-$JSL_VERSION.jar

# healthcare  whl
!aws  s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl /content/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl

In [None]:
# Public whl from PyPI: https://pypi.org/project/spark-nlp/#files
# Copy the download link of the whl that matches PUBLIC_VERSION (3.4.4 in this run)

!wget https://files.pythonhosted.org/packages/d3/4a/68da710afc1dec749063313d9d63f22350521f0181cfedecec05ce4dc069/spark_nlp-3.4.4-py2.py3-none-any.whl

## install 

In [None]:
! pip install /content/spark_nlp-$PUBLIC_VERSION-py2.py3-none-any.whl
! pip install /content/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl

## session start

In [15]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from sparknlp.base import LightPipeline

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.2


In [11]:
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars", f"/content/spark-nlp-jsl-{JSL_VERSION}.jar,/content/spark-nlp-{PUBLIC_VERSION}.jar" )

    return builder.getOrCreate()


In [12]:
# SECRET comes from your license keys; start() does not actually use it, because the jars are loaded from local files

spark = start(SECRET)

spark
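Because the session depends on two local jar files, a mistyped version or a failed download may only show up later as a less obvious error. As a minimal sketch, the paths can be checked before the builder runs; `spark_jars_option` is a hypothetical helper, and the file names follow the `aws s3 cp` destinations used earlier.

```python
import os

def spark_jars_option(jsl_version, public_version, jar_dir="/content"):
    """Build the comma-separated value for spark.jars, failing early if a jar is missing."""
    jars = [
        os.path.join(jar_dir, f"spark-nlp-jsl-{jsl_version}.jar"),
        os.path.join(jar_dir, f"spark-nlp-{public_version}.jar"),
    ]
    missing = [p for p in jars if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"jar files not found: {missing}")
    return ",".join(jars)
```

The result could be passed as `.config("spark.jars", spark_jars_option(JSL_VERSION, PUBLIC_VERSION))` in place of the f-string in `start()`.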

## online mode pipeline
Using the resource downloader `.pretrained()`

In [13]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
 
# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper") #decide if we want to return the tags in upper or lower case 

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
[OK!]


In [14]:
# fullAnnotate in LightPipeline

text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl ,  creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , and venous pH 7.27 . 
'''

print (text)

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    
import pandas as pd

df_clinical = pd.DataFrame({'chunks':chunks, 
                            'begin': begin, 
                            'end':end, 
                            'sentence_id':sentence, 
                            'entities':entities})

df_clinical.head(20)


A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl ,  creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , and venous pH 7.27 . 



| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | gestational diabetes mellitus | 40 | 68 | 0 | PROBLEM |
| 1 | subsequent type two diabetes mellitus | 118 | 154 | 0 | PROBLEM |
| 2 | T2DM | 158 | 161 | 0 | PROBLEM |
| 3 | HTG-induced pancreatitis | 187 | 210 | 0 | PROBLEM |
| 4 | an acute hepatitis | 268 | 285 | 0 | PROBLEM |
| 5 | polyuria | 326 | 333 | 0 | PROBLEM |
| 6 | poor appetite | 337 | 349 | 0 | PROBLEM |
| 7 | vomiting | 357 | 364 | 0 | PROBLEM |
| 8 | metformin | 380 | 388 | 1 | TREATMENT |
| 9 | glipizide | 392 | 400 | 1 | TREATMENT |
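The five parallel lists in the cell above can be collapsed into a single list of row dictionaries, which is easier to reuse for the offline pipeline later on. This is only a sketch: `chunks_to_rows` is a hypothetical helper, and the `SimpleNamespace` object stands in for a real annotation, mimicking just the attributes the loop above reads (`result`, `begin`, `end`, `metadata`).

```python
from types import SimpleNamespace

def chunks_to_rows(annotations):
    """Turn ner_chunk annotations into a list of dicts, one per chunk."""
    return [
        {
            "chunks": a.result,
            "begin": a.begin,
            "end": a.end,
            "sentence_id": a.metadata["sentence"],
            "entities": a.metadata["entity"],
        }
        for a in annotations
    ]

# Stand-in for one annotation, using values from the table above.
demo = [SimpleNamespace(result="T2DM", begin=158, end=161,
                        metadata={"entity": "PROBLEM", "sentence": "0"})]
rows = chunks_to_rows(demo)
```

With a real result, `pd.DataFrame(chunks_to_rows(light_result[0]['ner_chunk']))` should give the same columns as `df_clinical`.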


## offline mode pipeline

Manually downloading the models and loading them with `.load()`

### using boto3 for download 

In [None]:
! pip install -q boto3

In [None]:
license_keys.keys()

dict_keys(['SPARK_NLP_LICENSE', 'SECRET', 'JSL_VERSION', 'PUBLIC_VERSION', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET', 'OCR_VERSION'])

In [None]:
import shutil
import boto3

# Add your credentials 
ACCESS_KEY = AWS_ACCESS_KEY_ID
SECRET_KEY = AWS_SECRET_ACCESS_KEY

# Connect
s3 = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
buck_auxdata = s3.Bucket('auxdata.johnsnowlabs.com')

In [None]:
!mkdir /content/zip_files /content/models

**Download the embedding model**

In [None]:
# Download the embedding model 
buck_auxdata.download_file('clinical/models/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip',
'zip_files/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip')

# Unzip
shutil.unpack_archive('zip_files/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip',
'models/embeddings_clinical', 'zip')

**Download the ner_clinical_large model**

In [None]:
# Download the ner_clinical_large model 
buck_auxdata.download_file('clinical/models/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip',
'zip_files/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip')

# Unzip
shutil.unpack_archive('zip_files/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip',
'models/ner_clinical_large', 'zip')
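Both download cells repeat the same unzip step, and re-running them unpacks the archive again. A small wrapper around `shutil.unpack_archive` can skip the work when the target folder is already populated. `unpack_model` is a hypothetical name used here for illustration.

```python
import os
import shutil

def unpack_model(zip_path, target_dir):
    """Unpack a model archive into target_dir unless it is already populated."""
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        return target_dir  # already unpacked; nothing to do
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"model archive not found: {zip_path}")
    shutil.unpack_archive(zip_path, target_dir, "zip")
    return target_dir
```

For example, `unpack_model('zip_files/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip', 'models/ner_clinical_large')` would mirror the cell above.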

**ner pipeline**

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
 
# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings_loaded = WordEmbeddingsModel.load("/content/models/embeddings_clinical")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner_loaded = MedicalNerModel.load("/content/models/ner_clinical_large")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper") #decide if we want to return the tags in upper or lower case 

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings_loaded,
        clinical_ner_loaded,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [None]:
# fullAnnotate in LightPipeline

text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl ,  creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , and venous pH 7.27 . 
'''

print (text)

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    
import pandas as pd

df_clinical = pd.DataFrame({'chunks':chunks, 
                            'begin': begin, 
                            'end':end, 
                            'sentence_id':sentence, 
                            'entities':entities})

df_clinical.head(20)


A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl ,  creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , and venous pH 7.27 . 



| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | gestational diabetes mellitus | 40 | 68 | 0 | PROBLEM |
| 1 | subsequent type two diabetes mellitus | 118 | 154 | 0 | PROBLEM |
| 2 | T2DM | 158 | 161 | 0 | PROBLEM |
| 3 | HTG-induced pancreatitis | 187 | 210 | 0 | PROBLEM |
| 4 | an acute hepatitis | 268 | 285 | 0 | PROBLEM |
| 5 | polyuria | 326 | 333 | 0 | PROBLEM |
| 6 | poor appetite | 337 | 349 | 0 | PROBLEM |
| 7 | vomiting | 357 | 364 | 0 | PROBLEM |
| 8 | metformin | 380 | 388 | 1 | TREATMENT |
| 9 | glipizide | 392 | 400 | 1 | TREATMENT |


## offline mode public models

**Download the embedding model from Model Hub and upload it to the *zip_files* folder**

https://nlp.johnsnowlabs.com/2020/01/22/glove_100d.html

https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip

In [None]:
# Or, if an internet connection is available, download it straight into zip_files
# !wget -P zip_files https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip

Drag and drop the zip file into the *zip_files* folder.

In [None]:
# Unzip
shutil.unpack_archive('zip_files/glove_100d_en_2.4.0_2.4_1579690104032.zip',
'modelhub_files/glove_100d', 'zip')

**Download the ner_dl model from Model Hub and upload it to the *zip_files* folder**

https://nlp.johnsnowlabs.com/2020/03/19/ner_dl_en.html

https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip

Drag and drop the zip file into the *zip_files* folder.

In [None]:
# Unzip
shutil.unpack_archive('zip_files/ner_dl_en_2.4.3_2.4_1584624950746.zip',
'modelhub_files/ner_dl', 'zip')
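The archive names used in this notebook all look like `<model>_<lang>_<version>_<version>_<timestamp>.zip`. Assuming that pattern holds (the meaning of the two version fields is my reading, not something stated here), the target folder name can be derived from the file name instead of being typed by hand. `parse_model_zip` is a hypothetical helper.

```python
import re

# Assumed layout: <name>_<lang>_<library version>_<Spark version>_<timestamp>.zip
_MODEL_ZIP = re.compile(
    r"^(?P<name>.+)_(?P<lang>[a-z]{2,3})_(?P<lib_version>\d+(?:\.\d+)+)"
    r"_(?P<spark_version>\d+(?:\.\d+)+)_(?P<timestamp>\d+)\.zip$"
)

def parse_model_zip(filename):
    """Split a model archive name into its parts; raise ValueError if it does not fit."""
    match = _MODEL_ZIP.match(filename)
    if match is None:
        raise ValueError(f"unexpected model archive name: {filename}")
    return match.groupdict()

info = parse_model_zip("ner_dl_en_2.4.3_2.4_1584624950746.zip")
target_dir = f"modelhub_files/{info['name']}"
```

Here `target_dir` matches the folder used in the unzip cell above.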

**pipeline**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# ner_dl model is trained with glove_100d. So we use the same embeddings in the pipeline
glove_embeddings = WordEmbeddingsModel.load('/content/modelhub_files/glove_100d')\
    .setInputCols(["document", 'token'])\
    .setOutputCol("embeddings")

public_ner = NerDLModel.load("/content/modelhub_files/ner_dl")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    glove_embeddings,
    public_ner,
    ner_converter
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

In [None]:
# annotate in LightPipeline

light_model = LightPipeline(pipelineModel)

light_result = light_model.annotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')

list(zip(light_result['token'], light_result['ner']))

[('Peter', 'B-PER'),
 ('Parker', 'I-PER'),
 ('is', 'O'),
 ('a', 'O'),
 ('nice', 'O'),
 ('persn', 'O'),
 ('and', 'O'),
 ('lives', 'O'),
 ('in', 'O'),
 ('New', 'B-LOC'),
 ('York', 'I-LOC'),
 ('.', 'O'),
 ('Bruce', 'B-PER'),
 ('Wayne', 'I-PER'),
 ('is', 'O'),
 ('also', 'O'),
 ('a', 'O'),
 ('nice', 'O'),
 ('guy', 'O'),
 ('and', 'O'),
 ('lives', 'O'),
 ('in', 'O'),
 ('Gotham', 'B-LOC'),
 ('City', 'I-LOC'),
 ('.', 'O')]
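The `B-`/`I-`/`O` tags above mark where each entity begins and continues. To make the grouping concrete, here is a simplified pure-Python sketch that merges tagged tokens into chunks. It is not how `NerConverter` is implemented (for one thing, it joins tokens with a single space rather than using character offsets), and `bio_to_chunks` is a hypothetical name.

```python
def bio_to_chunks(pairs):
    """Group (token, tag) pairs with B-/I- prefixes into (chunk_text, label) tuples."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in pairs:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current_tokens.append(token)   # continue the open chunk
            continue
        if current_tokens:                 # close whatever chunk was open
            chunks.append((" ".join(current_tokens), current_label))
        if prefix in ("B", "I"):           # start a new chunk (a stray I- is treated as B-)
            current_tokens, current_label = [token], label
        else:                              # "O": no chunk is open
            current_tokens, current_label = [], None
    if current_tokens:                     # flush a chunk that ends the sequence
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

pairs = [("Peter", "B-PER"), ("Parker", "I-PER"), ("lives", "O"), ("in", "O"),
         ("New", "B-LOC"), ("York", "I-LOC"), (".", "O")]
result = bio_to_chunks(pairs)
```

On the notebook's own output, `bio_to_chunks(list(zip(light_result['token'], light_result['ner'])))` should yield the same four person and location chunks that `ner_chunk` reports.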

In [None]:
light_model = LightPipeline(pipelineModel)

light_result = light_model.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')


chunks = []
entities = []

for n in light_result[0]['ner_chunk']:
        
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    
    
import pandas as pd

df = pd.DataFrame({'chunks':chunks, 'entities':entities})

df

| | chunks | entities |
|---|---|---|
| 0 | Peter Parker | PER |
| 1 | New York | LOC |
| 2 | Bruce Wayne | PER |
| 3 | Gotham City | LOC |
