## Explain Document Deep Learning

This notebook shows some of the annotators available in Spark NLP. We start by importing the required modules and creating a Spark session.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Global DEMO - Spark NLP Enterprise 2.3.4") \
    .master("local[*]") \
    .config("spark.driver.memory","8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.3.4") \
    .config("spark.jars", "#####/spark-nlp-jsl-2.3.4.jar") \
    .getOrCreate()

In [2]:
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *

Next, we load a pretrained pipeline model that contains the following annotators:
Tokenizer, Deep Sentence Detector, Lemmatizer, Stemmer, Part of Speech (POS), and Context Spell Checker.

In [4]:
pipeline = PretrainedPipeline('explain_document_dl_noncontrib')

explain_document_dl_noncontrib download started this may take some time.
Approx size to download 167.4 MB
[OK!]


We simply send the text we want to transform and the pipeline does the work. Here the text comes from a PDF, which we read with `OcrHelper`.

In [9]:
from sparknlp_jsl.ocr import *

ocrh = OcrHelper()
data = ocrh.createDataset(spark, './immortal_text.pdf')
data.show()

+--------------------+-------+------+----------+----------+--------------------+----------------+---------------+--------------------+
|                text|pagenum|method|noiselevel|confidence|           positions|height_dimension|width_dimension|            filename|
+--------------------+-------+------+----------+----------+--------------------+----------------+---------------+--------------------+
|would have been a...|      0|  text|       0.0|       0.0|[[[[w, 1, 14.4, 3...|           383.0|          284.0|file:/C:/Users/sa...|
+--------------------+-------+------+----------+----------+--------------------+----------------+---------------+--------------------+



Below we can see the output of two of the pipeline's columns, `ner` and `checked`.

In [10]:
pipeline.transform(data).select("ner", "checked").show()

+--------------------+--------------------+
|                 ner|             checked|
+--------------------+--------------------+
|[[named_entity, 0...|[[token, 0, 4, wo...|
+--------------------+--------------------+



In [11]:
local_data = data.select("text").first()['text']
local_data

'would have been a liberation, a joy, and a fiesta. \r\nHe sensed that had he been able to choose or \r\ndream his death that night, this is the death he \r\nwould have dreamed or chosen.  \r\nDahlmann firmly grips the knife, which he \r\nmay have no idea how to manage, and steps out \r\ninto the plains.  \r\n \r\n \r\n \r\nThe Aleph  \r\n(1949) \r\n \r\n \r\nThe Immortal \r\n \r\nSolomon saith: There is no new thing upon \r\nthe earth.  So that as Plato had an imagination, \r\nthat all knowledge was but remembrance;  so \r\nSolomon giveth his sentence, that all novelty is \r\nbut oblivion.  \r\nFrancis Bacon: Essays,  LVIII \r\n \r\nIn London, in early June of the year 1929, \r\n'
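
The extracted text keeps the PDF's hard line breaks (`\r\n`) and runs of spaces. If that layout is not wanted downstream, one option is to normalize it before calling `annotate`. The helper below is a minimal sketch in plain Python: `normalize_ocr_text` is a hypothetical name, not part of Spark NLP, and treating two or more consecutive line breaks as a paragraph boundary is an assumption about this particular text.

```python
import re

def normalize_ocr_text(text):
    """Collapse hard line breaks and repeated spaces left by OCR (hypothetical helper)."""
    # Assumption: two or more consecutive line breaks mark a paragraph boundary.
    paragraphs = re.split(r"(?:[ \t]*\r?\n){2,}", text)
    # Inside a paragraph, replace every whitespace run with a single space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)

# A fragment of the OCR output shown above.
sample = 'into the plains.  \r\n \r\n \r\n \r\nThe Aleph  \r\n(1949) \r\n'
print(normalize_ocr_text(sample))
```

Whether this changes the annotators' results is not something this notebook checks; the cell below passes the raw text to the pipeline unchanged.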

In [12]:
result = pipeline.annotate(local_data)
list(zip(result['token'], result['ner']))

[('would', 'O'),
 ('have', 'O'),
 ('been', 'O'),
 ('a', 'O'),
 ('liberation', 'O'),
 (',', 'O'),
 ('a', 'O'),
 ('joy', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('a', 'O'),
 ('fiesta', 'O'),
 ('.', 'O'),
 ('He', 'O'),
 ('sensed', 'O'),
 ('that', 'O'),
 ('had', 'O'),
 ('he', 'O'),
 ('been', 'O'),
 ('able', 'O'),
 ('to', 'O'),
 ('choose', 'O'),
 ('or', 'O'),
 ('dream', 'O'),
 ('his', 'O'),
 ('death', 'O'),
 ('that', 'O'),
 ('night', 'O'),
 (',', 'O'),
 ('this', 'O'),
 ('is', 'O'),
 ('the', 'O'),
 ('death', 'O'),
 ('he', 'O'),
 ('would', 'O'),
 ('have', 'O'),
 ('dreamed', 'O'),
 ('or', 'O'),
 ('chosen', 'O'),
 ('.', 'O'),
 ('Dahlmann', 'I-PER'),
 ('firmly', 'O'),
 ('grips', 'O'),
 ('the', 'O'),
 ('knife', 'O'),
 (',', 'O'),
 ('which', 'O'),
 ('he', 'O'),
 ('may', 'O'),
 ('have', 'O'),
 ('no', 'O'),
 ('idea', 'O'),
 ('how', 'O'),
 ('to', 'O'),
 ('manage', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('steps', 'O'),
 ('out', 'O'),
 ('into', 'O'),
 ('the', 'O'),
 ('plains', 'O'),
 ('.', 'O'),
 ('The', 'O'),
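
Most tokens in the output above are tagged `'O'`, so it can help to keep only the tagged ones and merge neighbours into entity spans. The function below is a rough sketch in plain Python. `extract_entities` is a hypothetical helper, and it assumes an IOB-style scheme in which a tag such as `'I-PER'` carries the entity type after the hyphen, which is inferred from the one tag visible in the output.

```python
def extract_entities(tokens, tags):
    """Group consecutive tagged tokens of the same type into (text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # 'I-PER' -> 'PER'; 'O' means the token is outside any entity.
        ent_type = None if tag == "O" else tag.split("-", 1)[-1]
        # A new span starts when the type changes or a 'B-' tag appears.
        starts_new = ent_type != current_type or tag.startswith("B-")
        if starts_new and current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
        if ent_type is not None:
            current_tokens.append(token)
        current_type = ent_type
    # Flush a span that runs to the end of the input.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# A few (token, tag) pairs copied from the output above.
tokens = ["chosen", ".", "Dahlmann", "firmly"]
tags = ["O", "O", "I-PER", "O"]
print(extract_entities(tokens, tags))
```

With the notebook's own variables this would be called as `extract_entities(result['token'], result['ner'])`.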


In [18]:
ocrh.setAutomaticSizeCorrection(True)
ocrh.setPreferredMethod('image')
ocrh.setFallbackMethod(False)

In [20]:
data = ocrh.createDataset(spark, './numbers.pdf')
data.select('text').show(truncate=False)

+---------------------------------+
|text                             |
+---------------------------------+
|23.18
1.5
1.6
17.6
2.5
22.5
9935
|
+---------------------------------+



In [29]:
ocrh.setScalingFactor(0.9)

In [30]:
data = ocrh.createDataset(spark, './numbers.pdf')
data.select('text').show(truncate=False)

+----------------------------------+
|text                              |
+----------------------------------+
|£3.18
1.5
1.6
17.6
2.5
22.5
223.5
|
+----------------------------------+
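
The two OCR runs on `numbers.pdf` disagree on some lines. A line-by-line comparison makes the differences easy to spot. The snippet below is a small sketch in plain Python. The two lists are copied by hand from the outputs above, with the first value of the second run written as `£3.18`, and `changed_lines` is a hypothetical helper name.

```python
# Lines recognized before and after calling setScalingFactor(0.9),
# copied by hand from the two outputs above.
before = ["23.18", "1.5", "1.6", "17.6", "2.5", "22.5", "9935"]
after = ["£3.18", "1.5", "1.6", "17.6", "2.5", "22.5", "223.5"]

def changed_lines(a, b):
    """Return (index, old, new) for every position where the two runs differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

for index, old, new in changed_lines(before, after):
    print(f"line {index}: {old!r} -> {new!r}")
```

Only the first and last lines differ between the two runs. The notebook does not say which reading matches the source PDF, so the comparison shows where the settings matter, not which setting is correct.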

