![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

## Named-entity recognition with Deep Learning

<p><strong>Named-entity recognition</strong> is a well-known technique in information extraction; it is also known as <strong>entity identification</strong>, <strong>entity chunking</strong> and <strong>entity extraction</strong>. Knowing the relevant tags for each article helps in automatically categorizing the articles in defined hierarchies and enables smooth content discovery. This pipeline is based on the <strong>NerDLApproach</strong> annotator with <strong>Char CNN - BiLSTM</strong> and <strong>GloVe Embeddings</strong>, trained on the <strong>OntoNotes</strong> corpus, and supports the identification of 18 entity types.</p><p>The following NER types are supported in this pipeline:</p><table><thead><tr><th>Type</th><th>Description</th></tr></thead><tbody><tr><td><code>PERSON</code></td><td>People, including fictional.</td></tr><tr><td><code>NORP</code></td><td>Nationalities or religious or political groups.</td></tr><tr><td><code>FAC</code></td><td>Buildings, airports, highways, bridges, etc.</td></tr><tr><td><code>ORG</code></td><td>Companies, agencies, institutions, etc.</td></tr><tr><td><code>GPE</code></td><td>Countries, cities, states.</td></tr><tr><td><code>LOC</code></td><td>Non-GPE locations, mountain ranges, bodies of water.</td></tr><tr><td><code>PRODUCT</code></td><td>Objects, vehicles, foods, etc. (Not services.)</td></tr><tr><td><code>EVENT</code></td><td>Named hurricanes, battles, wars, sports events, etc.</td></tr><tr><td><code>WORK_OF_ART</code></td><td>Titles of books, songs, etc.</td></tr><tr><td><code>LAW</code></td><td>Named documents made into laws.</td></tr><tr><td><code>LANGUAGE</code></td><td>Any named language.</td></tr><tr><td><code>DATE</code></td><td>Absolute or relative dates or periods.</td></tr><tr><td><code>TIME</code></td><td>Times smaller than a day.</td></tr><tr><td><code>PERCENT</code></td><td>Percentage, including &ldquo;%&rdquo;.</td></tr><tr><td><code>MONEY</code></td><td>Monetary values, including unit.</td></tr><tr><td><code>QUANTITY</code></td><td>Measurements, as of weight or distance.</td></tr><tr><td><code>ORDINAL</code></td><td>&ldquo;first&rdquo;, &ldquo;second&rdquo;, etc.</td></tr><tr><td><code>CARDINAL</code></td><td>Numerals that do not fall under another type.</td></tr></tbody></table>

In [1]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  2.3.4
Apache Spark version:  2.4.3


In [2]:
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *

Now we load the `onto_recognize_entities_sm` pipeline model, which contains the following annotators:
a tokenizer, GloVe embeddings, and an NER model trained with deep learning.

In [3]:
pipeline = PretrainedPipeline('onto_recognize_entities_sm')

onto_recognize_entities_sm download started this may take some time.
Approx size to download 158 MB
[OK!]


NOTE: We are using `onto_recognize_entities_sm`, which is the smaller version. If you have enough resources, you can use `onto_recognize_entities_lg`, which is a larger pipeline model.

Let's annotate our `text` with the pretrained `pipeline`:

In [4]:
text = '''Barclays misled shareholders and the public about one of the biggest investments in the bank's history, a BBC Panorama investigation has found.
The bank announced in 2008 that Manchester City owner Sheikh Mansour had agreed to invest more than £3bn.
But the BBC found that the money, which helped Barclays avoid a bailout by British taxpayers, actually came from the Abu Dhabi government.
Barclays said the mistake in its accounts was "a drafting error".
Unlike RBS and Lloyds TSB, Barclays narrowly avoided having to request a government bailout late in 2008 after it was rescued by £7bn worth of new investment, most of which came from the Gulf states of Qatar and Abu Dhabi.
The S&P 500's price to earnings multiple is 71% higher than Apple's, and if Apple were simply valued at the same multiple, its share price would be $840, which is 52% higher than its current price.'''

result = pipeline.annotate(text)

We can see the output of each annotator below. This pipeline does many things at once!

In [5]:
list(result.keys())

['entities', 'document', 'token', 'ner', 'embeddings', 'sentence']

In [6]:
result['sentence']

["Barclays misled shareholders and the public about one of the biggest investments in the bank's history, a BBC Panorama investigation has found.",
 'The bank announced in 2008 that Manchester City owner Sheikh Mansour had agreed to invest more than £3bn.',
 'But the BBC found that the money, which helped Barclays avoid a bailout by British taxpayers, actually came from the Abu Dhabi government.',
 'Barclays said the mistake in its accounts was "a drafting error".',
 'Unlike RBS and Lloyds TSB, Barclays narrowly avoided having to request a government bailout late in 2008 after it was rescued by £7bn worth of new investment, most of which came from the Gulf states of Qatar and Abu Dhabi.',
 "The S&P 500's price to earnings multiple is 71% higher than Apple's, and if Apple were simply valued at the same multiple, its share price would be $840, which is 52% higher than its current price."]

In [7]:
result['entities']

['Barclays',
 'about one',
 'BBC Panorama',
 '2008',
 'Manchester City',
 'Sheikh Mansour',
 'more than £3bn',
 'BBC',
 'Barclays',
 'British',
 'Abu Dhabi',
 'Barclays',
 'RBS',
 'Lloyds TSB',
 'Barclays',
 '2008',
 '7bn',
 'Gulf',
 'Qatar',
 'Abu Dhabi',
 'S&P',
 "500's",
 '71%',
 'Apple',
 'Apple',
 '$840',
 '52%']
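The `entities` list above gives the entity strings but not their types. The types are carried by the per-token `ner` tags, which use the `B-`/`I-`/`O` prefixes. Below is a minimal sketch of recovering labeled entities by grouping tokens with their tags. The helper `group_entities` is a hypothetical name, not part of Spark NLP, and the sample lists are a short excerpt of this notebook's `token` and `ner` output, so the block runs without a Spark session. It assumes the two lists are aligned one-to-one.

```python
def group_entities(tokens, tags):
    """Merge aligned tokens and IOB tags into (text, label) chunks.

    Hypothetical helper for illustration; not a Spark NLP API.
    """
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk being built, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        if tag.startswith("I-") and label == current_label:
            # Continuation of the current entity.
            current_tokens.append(token)
            continue
        # Any other tag ends the current entity.
        flush()
        if label is None:
            current_tokens, current_label = [], None
        else:
            # "B-" tag, or an "I-" tag with no matching open entity.
            current_tokens, current_label = [token], label
    flush()
    return chunks


# Excerpt of result['token'] and result['ner'] for the second sentence.
sample_tokens = ["The", "bank", "announced", "in", "2008", "that",
                 "Manchester", "City", "owner", "Sheikh", "Mansour",
                 "had", "agreed", "to", "invest", "more", "than",
                 "£", "3bn", "."]
sample_tags = ["O", "O", "O", "O", "B-DATE", "O",
               "B-GPE", "I-GPE", "O", "B-PERSON", "I-PERSON",
               "O", "O", "O", "O", "B-MONEY", "I-MONEY",
               "I-MONEY", "I-MONEY", "O"]

labeled = group_entities(sample_tokens, sample_tags)
for text_span, label in labeled:
    print(f"{label:8} {text_span}")
```

With a live pipeline, the same helper could be called as `group_entities(result['token'], result['ner'])`. Because the sketch joins tokens with single spaces, a span such as `£3bn` comes back as `£ 3bn`; the pipeline's own `entities` output keeps the original spacing.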

In [8]:
list(zip(result['token'], result['ner']))

[('Barclays', 'B-ORG'),
 ('misled', 'O'),
 ('shareholders', 'O'),
 ('and', 'O'),
 ('the', 'O'),
 ('public', 'O'),
 ('about', 'B-CARDINAL'),
 ('one', 'I-CARDINAL'),
 ('of', 'O'),
 ('the', 'O'),
 ('biggest', 'O'),
 ('investments', 'O'),
 ('in', 'O'),
 ('the', 'O'),
 ('bank', 'O'),
 ("'s", 'O'),
 ('history', 'O'),
 (',', 'O'),
 ('a', 'O'),
 ('BBC', 'B-ORG'),
 ('Panorama', 'I-ORG'),
 ('investigation', 'O'),
 ('has', 'O'),
 ('found', 'O'),
 ('.', 'O'),
 ('The', 'O'),
 ('bank', 'O'),
 ('announced', 'O'),
 ('in', 'O'),
 ('2008', 'B-DATE'),
 ('that', 'O'),
 ('Manchester', 'B-GPE'),
 ('City', 'I-GPE'),
 ('owner', 'O'),
 ('Sheikh', 'B-PERSON'),
 ('Mansour', 'I-PERSON'),
 ('had', 'O'),
 ('agreed', 'O'),
 ('to', 'O'),
 ('invest', 'O'),
 ('more', 'B-MONEY'),
 ('than', 'I-MONEY'),
 ('£', 'I-MONEY'),
 ('3bn', 'I-MONEY'),
 ('.', 'O'),
 ('But', 'O'),
 ('the', 'O'),
 ('BBC', 'B-ORG'),
 ('found', 'O'),
 ('that', 'O'),
 ('the', 'O'),
 ('money', 'O'),
 (',', 'O'),
 ('which', 'O'),
 ('helped', 'O'),
 ('Barc