Named Entity Recognition (NER)
---

<center><img src="https://researchkb.files.wordpress.com/2014/02/ner.png" width="700"/></center>

> Knowledge worker adds value to information.   
> \- Peter Drucker 

---

> Data Scientist adds value to data.  
> \- Brian Spiering


By The End Of This Session You Should Be Able To:
----

- Explain how NER builds on POS tagging
- Describe conceptually how to train NER
- Pick the best system for NER tagging

NER is just token classification
---

NER is a "Strict Type" system for human language
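
To make "token classification" concrete, here is a minimal sketch in which every token receives exactly one label and entity spans are recovered by grouping adjacent labels. The BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity) and the helper name `bio_to_spans` are assumptions for illustration; these slides do not prescribe a labeling scheme.

```python
def bio_to_spans(tokens, labels):
    """Group per-token BIO labels into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Same entity continues: extend the running span.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Any other label closes the running span.
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # Start a new span (a stray I- label is treated as a start).
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["President", "Obama", "visited", "Mount", "Everest", "in", "June"]
labels = ["B-PERSON", "I-PERSON", "O", "B-LOCATION", "I-LOCATION", "O", "B-DATE"]
spans = bio_to_spans(tokens, labels)
# → [('President Obama', 'PERSON'), ('Mount Everest', 'LOCATION'), ('June', 'DATE')]
```

The model only ever predicts one label per token; the spans are a post-processing step.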

NER is difficult
-----

I used to work at Galvanize (the company):

Galvanize also means:

1. shock or excite (someone), typically into taking action.
2. coat (iron or steel) with a protective layer of zinc.

<center><img src="images/ner_fail.png" width="400"/></center>

Named entities can only be nouns
-----

POS comes first; NER comes second

Avoids problems like "Galvanize" (company vs verb) and "Will" (name vs verb)
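
A minimal sketch of the "POS first, NER second" idea: filter tokens by part-of-speech tag before considering them as entities. The tagged sentences are hand-written with Penn Treebank tags (a real pipeline would get them from a POS tagger), and `entity_candidates` is a hypothetical helper name.

```python
# Hand-tagged (token, POS) pairs; these are assumptions, not tagger output.
tagged_sentences = [
    [("I", "PRP"), ("used", "VBD"), ("to", "TO"), ("work", "VB"),
     ("at", "IN"), ("Galvanize", "NNP")],
    [("Good", "JJ"), ("teachers", "NNS"), ("galvanize", "VBP"), ("students", "NNS")],
]


def entity_candidates(tagged_tokens, proper_noun_tags=("NNP", "NNPS")):
    """Keep only tokens whose POS tag marks them as proper nouns."""
    return [token for token, tag in tagged_tokens if tag in proper_noun_tags]


company_sentence = entity_candidates(tagged_sentences[0])  # → ['Galvanize']
verb_sentence = entity_candidates(tagged_sentences[1])     # → []
```

Because the verb "galvanize" is tagged `VBP`, it never reaches the entity classifier.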

Common Named Entity (NE) Types
---

- ORGANIZATION: Georgia-Pacific Corp., WHO
- PERSON: Eddy Bonte, President Obama
- LOCATION: Murray River, Mount Everest
- DATE: June, 2008-06-29
- TIME: two fifty a m, 1:30 p.m.
- MONEY: 175 million Canadian Dollars, GBP 10.40
- GPE (Geo-political entity): South East Asia, Midlothian

Feel free to define your own
-----

- PRESIDENT: Trump, Washington, Lincoln
- COUNTRY: Thailand, Germany, Canada
- PRODUCT: Apple Watch


- POSITION / JOB_TITLE: Product Manager, Data Scientist

NER Workflow
---

<center><img src="images/ie-architecture.png" width="700"/></center>

NER Methods (POS flashback)
---

1. Rule-based, aka make a dictionary
2. Statistical Models, aka use Graphical Models
3. Deep Learning, aka what everyone does now

Rule based NER
-----

Use a combination of lists and regular expressions to identify named entities. 

Examples:

```python
# Map each known name to its entity type (types are strings so the dict is valid Python)
{"Dick": "PERSON",
 "Jane": "PERSON"}
```
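
The lookup list above covers the "lists" half of the approach. The following sketch adds the "regular expressions" half. The two patterns and the function name `rule_based_ner` are illustrative assumptions; a real rule set would be far larger.

```python
import re

PERSON_NAMES = {"Dick", "Jane"}  # the lookup list from the slide

# Illustrative patterns only: ISO-style dates and times like "1:30 p.m."
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\s?[ap]\.m\."),
}


def rule_based_ner(text):
    """Return (surface_text, entity_type) pairs in order of appearance."""
    found = []
    for name in PERSON_NAMES:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), "PERSON"))
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), entity_type))
    # Sort by character offset so results follow the text order.
    return [(surface, entity_type) for _, surface, entity_type in sorted(found)]


entities = rule_based_ner("Jane met Dick at 1:30 p.m. on 2008-06-29.")
# → [('Jane', 'PERSON'), ('Dick', 'PERSON'), ('1:30 p.m.', 'TIME'), ('2008-06-29', 'DATE')]
```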

Gazetteers
-----

![](images/gazetteer.png)

> A gazetteer consists of a set of lists containing names of entities such as cities, organizations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.

Gazetteers
-----

[How to make a gazetteer](http://www.aclweb.org/anthology/P08-1047)

Then use it to train other models
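
One common way to "use it to train other models" is to turn gazetteer membership into features for a statistical classifier. The sketch below assumes that reading; the feature names, the tiny gazetteer, and the function `token_features` are all hypothetical.

```python
def token_features(tokens, index, gazetteer):
    """Build a feature dict for one token, including gazetteer membership flags."""
    token = tokens[index]
    features = {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "previous": tokens[index - 1].lower() if index > 0 else "<START>",
    }
    # One boolean feature per gazetteer list.
    for list_name, entries in gazetteer.items():
        features["in_" + list_name] = token in entries
    return features


example_gazetteer = {"cities": {"Bangkok", "Berlin"}, "weekdays": {"Monday", "Friday"}}
example_tokens = ["Fly", "to", "Bangkok", "on", "Friday"]
bangkok_features = token_features(example_tokens, 2, example_gazetteer)
# → {'lower': 'bangkok', 'is_capitalized': True, 'previous': 'to',
#    'in_cities': True, 'in_weekdays': False}
```

The model can then learn how much to trust each list instead of treating a match as a final answer.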

![](images/extend)

Gazetteers: Pros
-----

- Simplest model (that could possibly work)
- Minimum Viable Product (MVP)
- Works for most cases overall
- Performs nicely within specific, well-understood, static domains

Gazetteers: Cons
-----

- Deterministic
- Brittle
- Labor-intensive to maintain
- Names of people and places are often the same: Washington (state, D.C., or George)
- Many proper names are conjunctions of other proper names.
- Moving to other languages or domains may involve repeating much of the work.
- It’s difficult to model dependencies between names across a document using rules based on regular expressions.
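
The ambiguity and multi-word problems in this list show up even in a tiny example. The sketch below is a greedy longest-match lookup. The gazetteer contents and the names `ambiguous_gazetteer` and `tag_with_gazetteer` are assumptions for illustration.

```python
ambiguous_gazetteer = {
    "PERSON": {"George Washington", "Washington"},
    "LOCATION": {"Washington", "Washington D.C."},
}


def tag_with_gazetteer(tokens, gazetteer, max_len=3):
    """Greedy longest-match lookup; returns (phrase, sorted candidate types)."""
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + length])
            types = sorted(t for t, entries in gazetteer.items() if phrase in entries)
            if types:
                matches.append((phrase, types))
                i += length
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return matches


matches = tag_with_gazetteer("George Washington moved to Washington".split(),
                             ambiguous_gazetteer)
# → [('George Washington', ['PERSON']), ('Washington', ['LOCATION', 'PERSON'])]
```

The second match comes back with two candidate types: the lookup alone cannot decide between them, which is exactly the brittleness described above.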

Student Activity
-------

Reverse dictionary look up

```python
gazetteer = {'names':     {'Danai', 'John', 'Akshay', 'Ting Ting', 'Asmita', 'Lingzhi', 'Ford'},
             'companies': {'Ford', 'Walmart', 'K-Mart', 'Sears', 'JC Penny', 'JC', 'Chase Bank', 'Chase'}}
```

```python
def test_keys_with_item_value():
    assert keys_with_item_value(gazetteer, 'Lambda') == set() # Empty set
    assert keys_with_item_value(gazetteer, 'John') == {'names'}
    assert keys_with_item_value(gazetteer, 'Ford') == {'companies', 'names'}
    return 'tests pass 🙂'
```

```python
def keys_with_item_value(dictionary, item):
    "Return a set of keys that have item in the values"
    return {key for key, value in dictionary.items() if item in value}

test_keys_with_item_value()
```

```
'tests pass 🙂'
```

What is the Big O time complexity? <br> - Your code <br> - This code
-----

Linear with the number of keys.

Constant with the number of values within a key, because each value is a set and set membership is a hash lookup (O(1) on average).

Given the structure of a gazetteer, that is good enough.
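
If the number of keys ever grew large, one alternative would be to precompute an inverted index so that each lookup no longer scans every key. This is a sketch of that trade-off (more memory and a one-time build cost for faster lookups); `build_inverted_index` and `small_gazetteer` are hypothetical names.

```python
from collections import defaultdict


def build_inverted_index(dictionary):
    """Map each item to the set of keys whose value set contains it."""
    index = defaultdict(set)
    for key, values in dictionary.items():
        for item in values:
            index[item].add(key)
    return dict(index)  # plain dict so lookups never insert new entries


small_gazetteer = {"names": {"John", "Ford"}, "companies": {"Ford", "Walmart"}}
inverted = build_inverted_index(small_gazetteer)

ford_keys = inverted.get("Ford", set())      # → {'names', 'companies'}
missing_keys = inverted.get("Lambda", set()) # → set()
```

Building the index is linear in the total number of items; each lookup afterwards is a single hash lookup on average.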

[Source](http://code.activestate.com/recipes/415903-two-dict-classes-which-can-lookup-keys-by-value-an/)

[Source](http://research.ijcaonline.org/volume73/number14/pxc3890066.pdf)

Statistical Models Examples
------

- Hidden Markov model (HMM)
- Conditional random fields (CRFs)
- Viterbi decoding (the algorithm used to find the best tag sequence under an HMM or CRF)
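
Conceptually, training an HMM for NER means counting: how often one tag follows another, and how often each tag emits each word. The sketch below assumes a tiny hand-labeled corpus and uses plain relative frequencies with no smoothing; `train_hmm` and the data are illustrative, not a production recipe.

```python
from collections import Counter, defaultdict


def normalize(table):
    """Turn nested counts into nested relative frequencies."""
    probabilities = {}
    for state, counts in table.items():
        total = sum(counts.values())
        probabilities[state] = {key: count / total for key, count in counts.items()}
    return probabilities


def train_hmm(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequency."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of (lowercased) word
    for sentence in tagged_sentences:
        previous = "<START>"
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word.lower()] += 1
            previous = tag
    return normalize(transitions), normalize(emissions)


training_data = [
    [("Obama", "PERSON"), ("visited", "O"), ("Thailand", "LOCATION")],
    [("Obama", "PERSON"), ("spoke", "O")],
]
transition_probs, emission_probs = train_hmm(training_data)
# transition_probs["<START>"]["PERSON"] → 1.0
# emission_probs["O"]["visited"]        → 0.5
```

This is why human-annotated data matters: every probability comes directly from labeled examples.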

Viterbi
----
<center><img src="https://i.ytimg.com/vi/orRsWGqMOSk/maxresdefault.jpg" width="500"/></center>

A dynamic programming algorithm for finding the most likely sequence of hidden states (called the Viterbi path)
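
A compact sketch of the algorithm on a toy HMM. All probabilities below are made up for illustration, and raw probabilities are multiplied for readability (longer sequences would normally use log probabilities to avoid underflow).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Extend the best path into s from every possible previous state.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability


states = ("PERSON", "O")
start_p = {"PERSON": 0.4, "O": 0.6}
trans_p = {"PERSON": {"PERSON": 0.5, "O": 0.5}, "O": {"PERSON": 0.2, "O": 0.8}}
emit_p = {"PERSON": {"will": 0.5, "smith": 0.5}, "O": {"will": 0.3, "acts": 0.7}}

path, probability = viterbi(["will", "smith", "acts"], states, start_p, trans_p, emit_p)
# path → ['PERSON', 'PERSON', 'O']
```

Note that "will" is tagged PERSON here because the following word "smith" can only be emitted by PERSON in this toy model; the best path is chosen for the whole sequence, not word by word.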

Viterbi to generate Siri speech sounds
-------

<center><img src="https://machinelearning.apple.com/images/journals/siri-voices/viterbi_lattice.png" width="700"/></center>

The selected sound units must:

1. Match the target prosody
2. Wherever possible, be concatenated without audible glitches at the unit boundary

The goal is important but how to get there also matters

Source: https://machinelearning.apple.com/2017/08/06/siri-voices.html

Statistical Models: __Pros__
-----

- "Good enough" for performance and speed
- Often close to human-level performance in well-studied domains
- Transferring to other languages or domains may only involve minimal code changes.
- The classifier can be retrained to incorporate additional text or other features.
- It’s easier to model the context within a sentence and in a document.
- Currently the most common approach, so you'll see them running in production environments

Statistical Models: __Cons__
-----

- The main disadvantage is the need for human-annotated data.
    - Also you may not have enough data
- Deep Learning is better now!

General DL Model for Language Tasks
-----
<br>
<center><img src="http://www.marekrei.com/blog/wp-content/uploads/2016/12/baseline_graph-300x153.png" width="700"/></center>

1) Each word is represented as a low-dimensional dense word embedding (300 dimensions in this example).


2) The embeddings are passed through a bidirectional LSTM.


3) The representations from both directions are concatenated to get a word representation that is conditioned on the whole sentence.


4) The result is passed through a hidden layer.

5) Finally, an output layer, which can be a __softmax or a Conditional Random Field__, predicts a label for each word.

Source: http://www.marekrei.com/blog/attending-to-characters-in-neural-sequence-labeling-models/
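
The data flow above can be traced in a dependency-free sketch. Several simplifications are assumptions made for readability: a plain tanh recurrent cell stands in for the LSTM, the weights are random and untrained, the dimensions are tiny instead of 300, and the output layer is a softmax rather than a CRF. All names are hypothetical.

```python
import math
import random

random.seed(0)
EMBED_DIM, HIDDEN_DIM = 4, 3
LABELS = ["O", "PERSON", "LOCATION"]


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_rnn(inputs, w_in, w_rec):
    """Plain tanh recurrent pass (stand-in for an LSTM); one hidden vector per step."""
    hidden, outputs = [0.0] * HIDDEN_DIM, []
    for x in inputs:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        outputs.append(hidden)
    return outputs


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_sentence(tokens, embeddings, params):
    vectors = [embeddings[t] for t in tokens]                     # word embeddings
    forward = run_rnn(vectors, *params["forward"])                # left-to-right pass
    backward = run_rnn(vectors[::-1], *params["backward"])[::-1]  # right-to-left pass
    distributions = []
    for f, b in zip(forward, backward):
        context = f + b                                           # concatenate both directions
        hidden = [math.tanh(h) for h in matvec(params["hidden"], context)]
        distributions.append(softmax(matvec(params["output"], hidden)))
    return distributions


tokens = ["Obama", "visited", "Thailand"]
embeddings = {t: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for t in tokens}
params = {
    "forward": (random_matrix(HIDDEN_DIM, EMBED_DIM), random_matrix(HIDDEN_DIM, HIDDEN_DIM)),
    "backward": (random_matrix(HIDDEN_DIM, EMBED_DIM), random_matrix(HIDDEN_DIM, HIDDEN_DIM)),
    "hidden": random_matrix(HIDDEN_DIM, 2 * HIDDEN_DIM),
    "output": random_matrix(len(LABELS), HIDDEN_DIM),
}
label_probabilities = tag_sentence(tokens, embeddings, params)
```

Each token ends up with one probability distribution over the labels. Since the weights are untrained, the values are meaningless; only the shapes and the flow of information are the point.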

Deep Learning for NER
-----

<center><img src="https://s3.amazonaws.com/poly-screenshots.angel.co/Project/56/200582/51ca1ecf91f8477a4d8f0a796f6c62c4-original.png" width="800"/></center>

<center><img src="images/sequence_labeling.png" width="700"/></center>

Summary
----

- NER is a more specific version of POS tagging
- Generally more useful in a business context
- Remember to define your own tags
- NER can be done with rules, graphical models, or deep learning
- DL sequence models are currently state-of-the-art across many tasks in NLP

<br>
<br>
----