# Named Entity Recognition (NER)
- **Named Entity**
    - ORGANIZATION
    - PERSON
    - PRODUCT
    - MONEY
    - GPE
    - ORDINAL
    - LOCATION
    - DATE
    - TIME

In [3]:
import nltk
import pandas as pd
# from nltk import word_tokenize,pos_tag

In [5]:
text = "Apple acquired Zoom in China on Wednesday 6th May 2020. \
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"

**NLTK provides the function nltk.ne_chunk(), which uses a pre-trained classifier to recognize named entities. It takes POS-tagged tokens as input**

In [6]:
# Tokenize the text into words
words = nltk.word_tokenize(text)
words

In [7]:
# Part-of-speech tagging: find the POS tag for each word with the pos_tag() function
pos_tags = nltk.pos_tag(words)
pos_tags

In [8]:
# Pass the list of (word, POS tag) tuples to the ne_chunk() function.
# binary=True only marks each chunk as a named entity (label NE) or not.
# binary=False also assigns an entity type (PERSON, ORGANIZATION, GPE, ...) to each named entity.
chunks = nltk.ne_chunk(pos_tags, binary=False)
for chunk in chunks:
    print(chunk)

## IOB tagging
- **The IOB format (short for inside, outside, beginning) is a tagging format for chunks. IOB tags resemble part-of-speech tags, but they describe where a word sits relative to a chunk**
    - B-{CHUNK_TYPE} – the first word of a chunk (Beginning)
    - I-{CHUNK_TYPE} – the following words Inside the same chunk
    - O – a word Outside any chunk
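
To make the scheme concrete, here is a minimal sketch that groups IOB tags back into entity strings. The helper name `iob_to_entities` and the hand-written `sample` tags are assumptions for illustration, not NLTK output or part of the NLTK API.

```python
def iob_to_entities(tagged):
    """Group (word, iob_tag) pairs into (entity_text, chunk_type) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag opens a new chunk, closing any chunk that is still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # An I- tag of the same type extends the open chunk.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag, treated here as outside) closes the open chunk.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written tags for illustration only (hypothetical, not model output).
sample = [
    ("Apple", "B-ORGANIZATION"),
    ("acquired", "O"),
    ("Zoom", "B-PERSON"),
    ("in", "O"),
    ("United", "B-GPE"),
    ("States", "I-GPE"),
]
print(iob_to_entities(sample))
```

The B- prefix is what allows two adjacent entities of the same type to stay separate, which a plain inside/outside scheme could not express.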

In [9]:
from nltk.chunk import tree2conlltags
iob_tagged = tree2conlltags(chunks)
for chunk in iob_tagged:
    print(chunk)
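
To see what tree2conlltags() does without depending on the pre-trained chunker, the sketch below builds a small tree by hand. The `toy_tree` and its labels are made up for illustration and only stand in for real ne_chunk() output.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# Hand-built stand-in for ne_chunk() output (labels chosen for illustration).
toy_tree = Tree("S", [
    Tree("ORGANIZATION", [("Dow", "NNP"), ("Jones", "NNP")]),
    ("rose", "VBD"),
    ("in", "IN"),
    Tree("GPE", [("China", "NNP")]),
])

# Each leaf becomes a (word, POS tag, IOB tag) triple.
toy_iob = tree2conlltags(toy_tree)
for triple in toy_iob:
    print(triple)
```

Words inside a labelled subtree get B- on the first word and I- on the rest, while top-level tokens get O.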

In [10]:
entities = []
labels = []
for chunk in chunks:
    # Named-entity chunks are subtrees with a label(); plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())
        
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

In [12]:
# Sentence-based version (the whole pipeline in one cell)
entities = []
labels = []

sentences = nltk.sent_tokenize(text)
for sent in sentences:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
        if hasattr(chunk,'label'):
            entities.append(' '.join(c[0] for c in chunk))
            labels.append(chunk.label())
            
entities_labels = list(set(zip(entities,labels)))

entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df
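
One thing to keep in mind about `list(set(zip(entities, labels)))`: a set has no defined order, so the rows of the DataFrame can come out in a different order on each run. If the order of first appearance matters, a possible alternative is sketched below. The `entities_demo` and `labels_demo` lists are hypothetical values, not real chunker output.

```python
# Hypothetical entity/label lists containing one duplicate pair.
entities_demo = ["Apple", "Zoom", "China", "Apple", "Google"]
labels_demo = ["ORG", "PERSON", "GPE", "ORG", "ORG"]

pairs = list(zip(entities_demo, labels_demo))

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_pairs = list(dict.fromkeys(pairs))
print(unique_pairs)
```

The result can be passed to `pd.DataFrame(unique_pairs, columns=["Entities", "Labels"])` in the same way as before.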

## Using spaCy

In [13]:
import spacy 
from spacy import displacy
spacy.__version__  # spaCy 2.x brought significant speed and accuracy improvements

In [17]:
# spacy.explain() returns a short description of what an entity label means

print(spacy.explain("ORG"))
print(spacy.explain("PERSON"))
print(spacy.explain("PRODUCT"))
print(spacy.explain("GPE"))
print(spacy.explain("LOC"))
print(spacy.explain("DATE"))
print(spacy.explain("ORDINAL"))
print(spacy.explain("MONEY"))

In [18]:
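# Download (once) and load a spaCy model
# Download from the command line with: python -m spacy download en_core_web_sm

# Three sizes of pre-trained English model are available: sm, md and lg
nlp = spacy.load("en_core_web_sm")    # en: English, core: vocabulary, syntax, entities, web: written text (blogs, news, comments), sm: small size (12MB)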
#nlp = spacy.load("en_core_web_md")
#nlp = spacy.load("en_core_web_lg")

In [19]:
doc = nlp(text)

entities = []
labels = []
position_start = []
position_end = []

for ent in doc.ents:
    entities.append(ent.text)   # store the entity text rather than the Span object
    labels.append(ent.label_)
    position_start.append(ent.start_char)
    position_end.append(ent.end_char)
    
df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end})
df
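
The start and end positions are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers the entity. A minimal sketch follows; `sample_text` and the hand-written `spans` are assumptions standing in for spaCy's `start_char`, `end_char` and `label_` values.

```python
sample_text = "Apple acquired Zoom in China."

# Hypothetical (start_char, end_char, label) triples written by hand,
# in the same shape as the values collected from doc.ents.
spans = [(0, 5, "ORG"), (15, 19, "ORG"), (23, 28, "GPE")]

# The end offset is exclusive, so text[start:end] is exactly the entity text.
surface_forms = [sample_text[start:end] for start, end, _ in spans]
for (start, end, label), surface in zip(spans, surface_forms):
    print(start, end, label, surface)
```

Offsets like these are handy for highlighting entities in the raw text or for aligning them with annotations from another tool.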

In [20]:
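# displacy.serve() starts a local web server that shows the entities of the doc object in the browser.
# Inside a Jupyter notebook, displacy.render(doc, style="ent") draws the same visualization inline.
displacy.serve(doc, style="ent")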