![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Entity Recognizer DL by Spark NLP

In [1]:
! pip install -q pyspark==3.1.2 spark-nlp



## Extract entities with Deep Learning

In [2]:
import sys
import time

# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *

### Let's create a Spark Session for our app

In [3]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.3.4
Apache Spark version:  3.1.2


Next, we download the pretrained `recognize_entities_dl` pipeline from the Spark NLP S3 repository.

In [4]:
pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

recognize_entities_dl download started this may take some time.
Approx size to download 160.1 MB
[OK!]


In [5]:
text = "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
result = pipeline.annotate(text)

In [6]:
list(result.keys())

['entities', 'document', 'token', 'ner', 'embeddings', 'sentence']

In [7]:
result['sentence']

['The Mona Lisa is a 16th century oil painting created by Leonardo.',
 "It's held at the Louvre in Paris."]

In [8]:
result['token']

['The',
 'Mona',
 'Lisa',
 'is',
 'a',
 '16th',
 'century',
 'oil',
 'painting',
 'created',
 'by',
 'Leonardo',
 '.',
 "It's",
 'held',
 'at',
 'the',
 'Louvre',
 'in',
 'Paris',
 '.']

In [9]:
list(zip(result['token'], result['ner']))

[('The', 'O'),
 ('Mona', 'B-PER'),
 ('Lisa', 'I-PER'),
 ('is', 'O'),
 ('a', 'O'),
 ('16th', 'O'),
 ('century', 'O'),
 ('oil', 'O'),
 ('painting', 'O'),
 ('created', 'O'),
 ('by', 'O'),
 ('Leonardo', 'B-PER'),
 ('.', 'O'),
 ("It's", 'O'),
 ('held', 'O'),
 ('at', 'O'),
 ('the', 'O'),
 ('Louvre', 'B-LOC'),
 ('in', 'O'),
 ('Paris', 'B-LOC'),
 ('.', 'O')]
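The `ner` column uses IOB tags: `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. The following is a minimal pure-Python sketch of how such tags can be merged into labelled spans. The helper `group_iob` is hypothetical (it is not part of Spark NLP), and the sample lists are copied from the first sentence of the output above.

```python
def group_iob(tokens, tags):
    """Merge parallel token / IOB-tag lists into (text, label) entity spans."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (an I- tag with a different label is treated as a start).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O": outside any entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


# Sample data taken from the notebook output above (first sentence only).
tokens = ['The', 'Mona', 'Lisa', 'is', 'a', '16th', 'century', 'oil',
          'painting', 'created', 'by', 'Leonardo', '.']
tags = ['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O',
        'O', 'O', 'O', 'B-PER', 'O']

labelled = group_iob(tokens, tags)
print(labelled)
```

In the notebook, the same helper could be called as `group_iob(result['token'], result['ner'])` to recover the entity label that `result['entities']` does not include.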

In [10]:
result['entities']

['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']

Let's try a longer document.

In [11]:
text = """
When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. “I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, now the co-founder and CEO of online higher education startup Udacity, in an interview with Recode earlier this week.
A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.
"""
result = pipeline.annotate(text)

In [12]:
result['entities']

['Sebastian Thrun', 'Google', 'CEOs', 'American', 'Thrun', 'Udacity', 'Recode']
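`annotate` returns the entities as plain strings, without their positions in the text. As a rough sketch, character offsets can be recovered by searching for each entity from left to right. This assumes the entities are listed in document order (which matches the output above but is an assumption here), and `locate_entities` is a hypothetical helper, not a Spark NLP function. The sample sentence is made up for illustration.

```python
def locate_entities(text, entities):
    """Return (entity, start, end) tuples, scanning the text left to right."""
    located = []
    cursor = 0  # Search resumes after the previous match so repeated names map to later mentions.
    for entity in entities:
        start = text.find(entity, cursor)
        if start == -1:
            # Fall back to a search from the beginning if the order assumption fails.
            start = text.find(entity)
        if start == -1:
            # Entity string not found verbatim (e.g. whitespace was normalised).
            located.append((entity, None, None))
            continue
        end = start + len(entity)
        located.append((entity, start, end))
        cursor = end
    return located


# Illustrative sample, not taken from the notebook.
sample = "Sebastian Thrun worked at Google. Thrun now runs Udacity."
spans = locate_entities(sample, ["Sebastian Thrun", "Google", "Thrun", "Udacity"])
for entity, start, end in spans:
    print(entity, start, end)
```

Because the search resumes after each match, the second `Thrun` resolves to the later mention rather than to the surname inside `Sebastian Thrun`. In the notebook this could be called as `locate_entities(text, result['entities'])`.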