## Misran Dolan
## Named Entity Recognition Research

### This research was done for a client in the medical industry that needed a tool to extract information from insurance claims, such as treatments, dates, and symptoms. Named Entity Recognition (NER) is a natural language processing technique, implemented by many Python libraries, that locates spans of text and assigns each one to a pre-defined category.

### The first NER demo uses the spaCy library, and the second uses the Stanford NER Tagger.

## spaCy demo

### Faster; returns each multi-word entity as a single value

#### Doesn't detect dates with an "st" ordinal suffix (e.g. "21st")

#### Detects institutions (e.g. ICU, tagged as ORG)
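
One possible workaround for the "st" limitation noted above is to strip ordinal suffixes from day numbers before the text reaches the tagger. The sketch below is an assumption, not part of the original research: `strip_ordinal_suffixes` is a hypothetical helper, and whether it actually improves date detection for a given spaCy model was not tested here.

```python
import re

# Matches a 1- or 2-digit number followed by an English ordinal suffix,
# e.g. "21st", "2nd", "23rd", "4th". Hypothetical helper, not a spaCy API.
_ORDINAL = re.compile(r'\b(\d{1,2})(st|nd|rd|th)\b', re.IGNORECASE)


def strip_ordinal_suffixes(text):
    """Return text with ordinal suffixes removed from day numbers."""
    return _ORDINAL.sub(r'\1', text)


sample = 'John Smith was admitted for a routine surgery on Thursday May 21st 2017.'
print(strip_ordinal_suffixes(sample))
```

The cleaned string could then be passed to the tagger in place of the raw text. Note that this changes character offsets, which matters if entity positions need to be mapped back to the original claim.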

In [6]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm')

article = 'John Smith was admitted for a routine surgery on Thursday May 23rd 2017.  He had complications overnight and was transferred to the ICU.'

document = spacy_nlp(article)

print('Original Sentence: %s' % (article))

for element in document.ents:
    print('Type: %s, Value: %s' % (element.label_, element))
    
# Handles the 'rd' ordinal suffix (23rd), but not 'st'
# Did not detect 'overnight'


Original Sentence: John Smith was admitted for a routine surgery on Thursday May 23rd 2017.  He had complications overnight and was transferred to the ICU.
Type: PERSON, Value: John Smith
Type: DATE, Value: Thursday May 23rd 2017
Type: ORG, Value: ICU


In [7]:
document.ents

(John Smith, Thursday May 23rd 2017, ICU)
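
For the insurance-claim use case, the detected entities would probably need to be collected into a record keyed by entity type. The following is a minimal sketch under that assumption; `group_entities` is a hypothetical helper, and the input pairs are copied by hand from the output above rather than produced by a live spaCy call.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (label, text) pairs into {label: [text, ...]}, keeping order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied by hand from the spaCy output above; with a live pipeline they
# could be built as [(ent.label_, ent.text) for ent in document.ents].
pairs = [
    ('PERSON', 'John Smith'),
    ('DATE', 'Thursday May 23rd 2017'),
    ('ORG', 'ICU'),
]
print(group_entities(pairs))
```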

## Stanford NER Demo

### Has to be used through the NLTK API
### Slower than spaCy, but it tags each token separately, so a multi-word entity is split into one value per token

In [10]:
import nltk
from nltk.tag import StanfordNERTagger
import os
# If NLTK cannot find Java, point JAVAHOME at the Java executable, for example:
#java_path = "C:/Program Files (x86)/Java/jre1.8.0_211/bin/java.exe"
#os.environ['JAVAHOME'] = java_path

print('NLTK Version: %s' % nltk.__version__)

stanford_ner_tagger = StanfordNERTagger(
    '/Users/ladmin/Documents/stanford_ner/classifiers/english.muc.7class.distsim.crf.ser.gz',
    '/Users/ladmin/Documents/stanford_ner/stanford-ner.jar'
)

NLTK Version: 3.4


In [12]:
nltk.download('punkt')
article = 'John Smith was admitted for a routine surgery on Thursday May 21st 2017.  He had complications overnight and was transferred to the ICU.'

results = stanford_ner_tagger.tag(article.split())

print('Original Sentence: %s' % (article))
for result in results:
    tag_value = result[0]
    tag_type = result[1]
    if tag_type != 'O':
        print('Type: %s, Value: %s' % (tag_type, tag_value))

[nltk_data] Downloading package punkt to /Users/ladmin/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original Sentence: John Smith was admitted for a routine surgery on Thursday May 21st 2017.  He had complications overnight and was transferred to the ICU.
Type: PERSON, Value: John
Type: PERSON, Value: Smith
Type: DATE, Value: Thursday
Type: DATE, Value: May
Type: DATE, Value: 21st
Type: DATE, Value: 2017.
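
Because the Stanford tagger labels one token at a time, getting whole entities back requires joining adjacent tokens that share a tag. The sketch below shows one way to do that with `itertools.groupby`; `merge_tagged_tokens` is a hypothetical helper, and the sample list is hand-written to mirror the output above rather than taken from a live tagger call.

```python
from itertools import groupby


def merge_tagged_tokens(tagged, outside='O'):
    """Join runs of adjacent tokens that share a tag into one entity.

    tagged: list of (token, tag) tuples, the shape used in the loop above.
    Returns a list of (tag, text) tuples, skipping the `outside` tag.
    """
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((tag, ' '.join(token for token, _ in run)))
    return entities


# Hand-written sample mirroring the tagger output above (not a live call).
tagged = [
    ('John', 'PERSON'), ('Smith', 'PERSON'), ('was', 'O'), ('admitted', 'O'),
    ('on', 'O'), ('Thursday', 'DATE'), ('May', 'DATE'), ('21st', 'DATE'),
    ('2017.', 'DATE'), ('He', 'O'),
]
for tag, text in merge_tagged_tokens(tagged):
    print('Type: %s, Value: %s' % (tag, text))
```

One limitation of this approach: two distinct entities of the same type that sit directly next to each other are merged into one. The trailing period on `2017.` also survives, because the demo tokenizes with `article.split()`.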


## Other Demos

In [16]:
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
print(article)
for sent in nltk.sent_tokenize(article):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ladmin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/ladmin/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/ladmin/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


John Smith was admitted for a routine surgery on Thursday May 21st 2017.  He had complications overnight and was transferred to the ICU.
PERSON John
PERSON Smith
ORGANIZATION ICU


In [17]:
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

results = stanford_ner_tagger.tag(article.split())

#print('Original Sentence: %s' % (article))
for result in results:
    tag_value = result[0]
    tag_type = result[1]
    if tag_type != 'O':
        print('Type: %s, Value: %s' % (tag_type, tag_value))

Type: DATE, Value: Tuesday
Type: LOCATION, Value: Europe
Type: ORGANIZATION, Value: Asia-Pacific
Type: LOCATION, Value: Japan
Type: PERCENT, Value: 1.7
Type: PERCENT, Value: percent
Type: ORGANIZATION, Value: Nikkei
Type: PERCENT, Value: 3.1
Type: PERCENT, Value: percent
Type: LOCATION, Value: European
Type: LOCATION, Value: Union
Type: PERSON, Value: Theresa
Type: PERSON, Value: May


In [18]:
spacy_nlp = spacy.load('en_core_web_sm')
document = spacy_nlp(article)

#print('Original Sentence: %s' % (article))

for element in document.ents:
    print('Type: %s, Value: %s' % (element.label_, element))

Type: NORP, Value: Asian
Type: DATE, Value: Tuesday
Type: LOC, Value: Europe
Type: ORG, Value: MSCI
Type: LOC, Value: Asia-Pacific
Type: GPE, Value: Japan
Type: PERCENT, Value: 1.7 percent
Type: DATE, Value: week
Type: NORP, Value: Australian
Type: PERCENT, Value: 1.6 percent
Type: GPE, Value: Japan
Type: PERCENT, Value: 3.1 percent
Type: ORG, Value: Apple
Type: MONEY, Value: 1.286
Type: CARDINAL, Value: three
Type: GPE, Value: Nov.1
Type: ORG, Value: European Union
Type: GPE, Value: Brexit
Type: NORP, Value: British
Type: PERSON, Value: Theresa
Type: DATE, Value: May
Type: DATE, Value: Monday
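
The two outputs for the news article can be compared token by token to see where the taggers disagree. The sketch below is illustrative only: the label mapping `SPACY_TO_STANFORD` and both helper functions are assumptions introduced here, and the input pairs are copied by hand from the two outputs above.

```python
# Hypothetical mapping from spaCy labels to the Stanford 7-class labels;
# labels without a counterpart (e.g. NORP, CARDINAL) are left unchanged.
SPACY_TO_STANFORD = {'GPE': 'LOCATION', 'LOC': 'LOCATION', 'ORG': 'ORGANIZATION'}


def token_labels(pairs, mapping=None):
    """Map each whitespace-separated token to its (optionally remapped) label.

    A token that occurs more than once keeps the label of its last occurrence.
    """
    mapping = mapping or {}
    labels = {}
    for label, value in pairs:
        for token in value.split():
            labels[token] = mapping.get(label, label)
    return labels


def disagreements(stanford_pairs, spacy_pairs):
    """Return (token, stanford_label, spacy_label) for tokens tagged differently."""
    stanford = token_labels(stanford_pairs)
    spacy = token_labels(spacy_pairs, SPACY_TO_STANFORD)
    return [(tok, lab, spacy[tok]) for tok, lab in stanford.items()
            if tok in spacy and spacy[tok] != lab]


# Copied by hand from the Stanford output above.
stanford_pairs = [
    ('DATE', 'Tuesday'), ('LOCATION', 'Europe'), ('ORGANIZATION', 'Asia-Pacific'),
    ('LOCATION', 'Japan'), ('PERCENT', '1.7'), ('PERCENT', 'percent'),
    ('ORGANIZATION', 'Nikkei'), ('PERCENT', '3.1'), ('PERCENT', 'percent'),
    ('LOCATION', 'European'), ('LOCATION', 'Union'),
    ('PERSON', 'Theresa'), ('PERSON', 'May'),
]

# Copied by hand from the spaCy output above.
spacy_pairs = [
    ('NORP', 'Asian'), ('DATE', 'Tuesday'), ('LOC', 'Europe'), ('ORG', 'MSCI'),
    ('LOC', 'Asia-Pacific'), ('GPE', 'Japan'), ('PERCENT', '1.7 percent'),
    ('DATE', 'week'), ('NORP', 'Australian'), ('PERCENT', '1.6 percent'),
    ('GPE', 'Japan'), ('PERCENT', '3.1 percent'), ('ORG', 'Apple'),
    ('MONEY', '1.286'), ('CARDINAL', 'three'), ('GPE', 'Nov.1'),
    ('ORG', 'European Union'), ('GPE', 'Brexit'), ('NORP', 'British'),
    ('PERSON', 'Theresa'), ('DATE', 'May'), ('DATE', 'Monday'),
]

for token, stanford_label, spacy_label in disagreements(stanford_pairs, spacy_pairs):
    print('%s: Stanford=%s, spaCy=%s' % (token, stanford_label, spacy_label))
```

On this sample the comparison surfaces "May", which the Stanford tagger labels as part of the person name "Theresa May" while spaCy labels it as a date.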
