![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Hardcore DL by Spark NLP

In [1]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.3

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.2.2

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
Collecting pyspark==2.4.3
  Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)
Collecting py4j==0.10.7 (from pyspark==2.4.3)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.3-py2.py3-none-any.whl size=215964963 sha256=61e4693bb2e1b0ca2932dd252a243b906471f3d3a2073646b04ff8c88de85e05
  Stored in directory: /root/.cache/pip/w

## Explain Documents with Deep Learning

In [0]:
import sys
import time

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

We start the session with `sparknlp.start()`. Passing `include_ocr=True` also loads the OCR module, which is used later to read a PDF file:

In [3]:
import sparknlp
spark = sparknlp.start(include_ocr=True)

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version


Spark NLP version
2.2.1
Apache Spark version


'2.4.3'
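As a rough illustration of what a helper like `sparknlp.start()` has to set up, the sketch below builds a Spark configuration by hand. This is not the library's actual implementation: the function names `build_session_conf` and `start_session` are hypothetical, and the driver memory, Maven coordinates, and version string are assumptions that may differ from what your installed Spark NLP release uses.

```python
def build_session_conf(include_ocr=False, spark_nlp_version="2.2.2"):
    """Return Spark settings for a Spark NLP session (all values are assumptions)."""
    # Assumed Maven coordinates for the Scala 2.11 build of Spark NLP.
    packages = ["com.johnsnowlabs.nlp:spark-nlp_2.11:" + spark_nlp_version]
    if include_ocr:
        # Assumed coordinates for the separate OCR artifact.
        packages.append("com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:" + spark_nlp_version)
    return {
        "spark.driver.memory": "6G",  # assumed value; tune for your machine
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.jars.packages": ",".join(packages),
    }


def start_session(conf, app_name="Spark NLP", master="local[*]"):
    """Create (or reuse) a SparkSession with the given settings."""
    # Imported lazily so the configuration can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = build_session_conf(include_ocr=True)
for key, value in conf.items():
    print(key, "=", value)
```

Calling `start_session(conf)` would then return a session comparable to the `spark` object above. In practice `sparknlp.start()` is the safer choice, since it pins the settings that match the installed library version.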

In [4]:
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

explain_document_dl download started this may take some time.
Approx size to download 167.3 MB
[OK!]


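We simply pass the text we want to transform to `annotate` and the pipeline does the work. Note that the sample text contains misspellings (`beautful`, `wth`), which lets us see the spell checker at work.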

In [0]:
text = 'He would love to visit many beautful cities wth you. He lives in an amazing country.'
result = pipeline.annotate(text)

The result is a dictionary with one entry per annotator in the pipeline. This single pipeline does many things at once!

In [6]:
list(result.keys())

['entities',
 'stem',
 'checked',
 'lemma',
 'document',
 'pos',
 'token',
 'ner',
 'embeddings',
 'sentence']
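Each key in the result maps to a plain Python list of strings, so ordinary dictionary tools are enough to inspect it. The sketch below counts the items per annotator. The helper name `summarize_annotations` is hypothetical, and `sample_result` is a hand-copied subset of the outputs shown in this notebook rather than a live call to the pipeline.

```python
def summarize_annotations(result):
    """Map each annotator name to the number of items it produced."""
    return {name: len(values) for name, values in sorted(result.items())}


# Hand-copied subset of the annotate() output shown in this notebook.
sample_result = {
    "sentence": [
        "He would love to visit many beautful cities wth you.",
        "He lives in an amazing country.",
    ],
    "lemma": [
        "He", "would", "love", "to", "visit", "many", "beautiful", "city",
        "wth", "you", ".", "He", "life", "in", "an", "amazing", "country", ".",
    ],
}

summary = summarize_annotations(sample_result)
for name, count in summary.items():
    print(f"{name}: {count} item(s)")
```

With the real `result` dictionary, the same call gives a quick check that the token-level annotators (`token`, `lemma`, `pos`, and so on) produced the same number of items.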

In [7]:
result['sentence']

['He would love to visit many beautful cities wth you.',
 'He lives in an amazing country.']

In [8]:
result['lemma']

['He',
 'would',
 'love',
 'to',
 'visit',
 'many',
 'beautiful',
 'city',
 'wth',
 'you',
 '.',
 'He',
 'life',
 'in',
 'an',
 'amazing',
 'country',
 '.']
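To see what the lemmatizer actually changed, it can help to compare its output with the spell-checked tokens position by position. The following is a minimal sketch, assuming both lists are aligned one-to-one as they are in this example. The helper `changed_by_lemmatizer` is hypothetical, and the two lists are copied by hand from the `checked` and `lemma` outputs in this notebook.

```python
def changed_by_lemmatizer(tokens, lemmas):
    """Return (token, lemma) pairs where the lemma differs from the token."""
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must be aligned one-to-one")
    return [(tok, lem) for tok, lem in zip(tokens, lemmas) if tok != lem]


# Values copied from result['checked'] and result['lemma'].
checked = [
    "He", "would", "love", "to", "visit", "many", "beautiful", "cities",
    "wth", "you", ".", "He", "lives", "in", "an", "amazing", "country", ".",
]
lemma = [
    "He", "would", "love", "to", "visit", "many", "beautiful", "city",
    "wth", "you", ".", "He", "life", "in", "an", "amazing", "country", ".",
]

changes = changed_by_lemmatizer(checked, lemma)
print(changes)
```

For this sentence only `cities` and `lives` are changed. The second case maps `lives` to `life`, which is worth a second look because `lives` is used as a verb here.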

In [9]:
list(zip(result['checked'], result['pos']))

[('He', 'PRP'),
 ('would', 'MD'),
 ('love', 'VB'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('many', 'JJ'),
 ('beautiful', 'JJ'),
 ('cities', 'NNS'),
 ('wth', 'NN'),
 ('you', 'PRP'),
 ('.', '.'),
 ('He', 'PRP'),
 ('lives', 'VBZ'),
 ('in', 'IN'),
 ('an', 'DT'),
 ('amazing', 'JJ'),
 ('country', 'NN'),
 ('.', '.')]
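Once tokens are paired with their tags, grouping or filtering by tag is straightforward. The sketch below groups the pairs by POS tag and pulls out the nouns. The helper `group_by_tag` is hypothetical, and `tagged` is copied by hand from the output above.

```python
from collections import defaultdict


def group_by_tag(tagged_tokens):
    """Group tokens by their POS tag, preserving the order of appearance."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


# Copied from list(zip(result['checked'], result['pos'])).
tagged = [
    ("He", "PRP"), ("would", "MD"), ("love", "VB"), ("to", "TO"),
    ("visit", "VB"), ("many", "JJ"), ("beautiful", "JJ"), ("cities", "NNS"),
    ("wth", "NN"), ("you", "PRP"), (".", "."), ("He", "PRP"),
    ("lives", "VBZ"), ("in", "IN"), ("an", "DT"), ("amazing", "JJ"),
    ("country", "NN"), (".", "."),
]

groups = group_by_tag(tagged)
# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith("NN")]
print(groups["JJ"])
print(nouns)
```

The noun list includes `wth`, the misspelling the spell checker left untouched, which shows how an uncorrected typo can propagate into later annotators.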

### Now let's try to use this pipeline to explain a PDF file

In [11]:
from sparknlp.ocr import OcrHelper
# ! wget https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/immortal_text.pdf
data = OcrHelper().createDataset(spark, './../assets/immortal_text.pdf')
data.show()

--2019-09-20 16:32:03--  https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/immortal_text.pdf
Resolving github.com (github.com)... 192.30.253.112
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/immortal_text.pdf [following]
--2019-09-20 16:32:03--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/immortal_text.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 243543 (238K) [application/octet-stream]
Saving to: ‘immortal_text.pdf’


2019-09-20 16:32:04 (4.89 MB/s) - ‘immortal_text.pdf’ saved [243543/243543]

+--------------------+-------+------

The same pipeline can transform the resulting DataFrame. Here we select only the token and POS tag results.

In [12]:
pipeline.transform(data).select("token.result", "pos.result").show()

+--------------------+--------------------+
|              result|              result|
+--------------------+--------------------+
|[would, have, bee...|[MD, VB, VBN, DT,...|
+--------------------+--------------------+

