# What is Named Entity Recognition (NER)?
Named entity recognition (NER) models are models that have been pre-trained on a labeled corpus. The labels they learned act as categories that can be assigned to spans of text.

This can be as simple as identifying whether a span of text represents a person. For instance, high profile figures such as __Abraham Lincoln, FDR, or George Washington__ can be labeled as a __person__.

NER can be extended in many ways, but overall you can think of it as a tool that extracts categories from text without you having to write customized regex functions or rules-based approaches by hand.
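To make the contrast concrete, here is a minimal sketch of the kind of hand-written rule that NER saves you from maintaining. The pattern and the names `NAME_PATTERN` and `find_names_with_regex` are hypothetical, chosen only for illustration; the rule simply treats two or more consecutive capitalized words as a person name.

```python
import re

# Hypothetical hand-written rule: two or more consecutive capitalized words
# are assumed to be a person's name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def find_names_with_regex(text):
    """Return every substring that matches the naive name rule."""
    return NAME_PATTERN.findall(text)

sample = (
    "High profile figures such as Abraham Lincoln, FDR, or "
    "George Washington lived in New York."
)
matches = find_names_with_regex(sample)
print(matches)
# → ['Abraham Lincoln', 'George Washington', 'New York']
```

The rule misses `FDR` (all capitals) and wrongly picks up `New York` (a place, not a person). Each such case would need another rule, which is the maintenance burden a pre-trained NER model is meant to avoid.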

spaCy is an excellent NLP library for NER; documentation can be found [here](https://spacy.io/api/entityrecognizer).

In [9]:
# If the small English spaCy model is not installed on your local machine, uncomment the lines below and execute this cell.
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [2]:
import spacy
from spacy import displacy # great for visualizing entities/tokens
model = spacy.load("en_core_web_sm") # This is a pretrained NLP pipeline

# Some miscellaneous information that is not too relevant to NER...

In [15]:
model.pipe_names # The names of the components in the default pipeline, in processing order.

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [16]:
model.pipeline # The same components as (name, component object) pairs.

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x21a0916bea0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x21a0915abd0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x219fe6a4fa0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x219fe6a4be0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x21a0920dd80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x21a092098c0>)]

# Basic demo of what NER can do.

In [31]:
text = "For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today."

In [24]:
# pass text into model pipeline
processed_text = model(text)

In [25]:
processed_text # as we can see nothing is "out of the ordinary"

For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today.

In [26]:
processed_text.text

'For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today.'

In [27]:
# Each item in .ents is an entity span (possibly several tokens), not a single word.
for ent in processed_text.ents:
    print(ent.text, ent.label_)

Abraham Lincoln PERSON
FDR PERSON
George Washington PERSON
USA GPE
today DATE
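Since the goal is categorization, it is often handy to group the extracted entities by label. The sketch below works on plain `(text, label)` pairs copied from the output above so that it runs without a model; `group_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. With a loaded pipeline the
# same list could be built as:
#   [(ent.text, ent.label_) for ent in processed_text.ents]
entities = [
    ("Abraham Lincoln", "PERSON"),
    ("FDR", "PERSON"),
    ("George Washington", "PERSON"),
    ("USA", "GPE"),
    ("today", "DATE"),
]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying that label."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(entities)
print(grouped["PERSON"])
# → ['Abraham Lincoln', 'FDR', 'George Washington']
```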


In [28]:
displacy.render(processed_text, style="ent", jupyter=True) # https://spacy.io/usage/visualizers
# style accepts 3 different values: "dep" for the dependency parse, "ent" for entities, and "span" for (possibly overlapping) spans stored in doc.spans.
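Outside of a notebook, the same visualization can be produced as an HTML string by passing `jupyter=False`. The sketch below is a minimal example under a few assumptions: it uses a blank English pipeline (tokenizer only, no model download) and marks the entities by hand purely for illustration, whereas a trained pipeline such as `en_core_web_sm` would predict them. The sentence and the output file name are made up.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so nothing needs to be downloaded.
nlp = spacy.blank("en")
doc = nlp("Abraham Lincoln visited the USA.")

# Entities are set by hand here for illustration (token indices, end-exclusive);
# a trained NER component would fill doc.ents automatically.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="GPE")]

# jupyter=False makes render() return the markup instead of displaying it.
html = displacy.render(doc, style="ent", jupyter=False, page=True)

# The string can then be written to disk, e.g. (hypothetical file name):
# from pathlib import Path
# Path("entities.html").write_text(html, encoding="utf-8")
print(html[:80])
```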

In [21]:
spacy.explain("PERSON") # What is this entity?

'People, including fictional'

In [29]:
spacy.explain("GPE") # What is this entity?

'Countries, cities, states'

In [30]:
spacy.explain("DATE") # What is this entity?

'Absolute or relative dates or periods'
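The same lookup works for any label in spaCy's glossary, so a short loop can document several labels at once. This is a small sketch: the list of labels is just a sample of ones an English pipeline may emit, and `descriptions` is a name chosen for illustration.

```python
import spacy

# A sample of entity labels an English spaCy pipeline may produce.
labels = ["PERSON", "GPE", "DATE", "NORP", "ORG", "EVENT", "FAC", "ORDINAL"]

# spacy.explain looks each label up in spaCy's built-in glossary;
# it returns None for strings it does not know.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:8} {description}")
```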

Abraham Lincoln             |  Franklin Delano Roosevelt (FDR) | George Washington
:-------------------------:|:-------------------------:|:-------------------------:
<img src="lincoln.jpg" width="400"/>  |  <img src="FDR.jpg" width="500"> | <img src="GW.jpg" width="500">



# Where in the NLP pipelines is NER typically used?
In industry, I have typically used NER as a categorization tool. Scale-wise, think in terms of millions of documents containing tables, graphics, and, most importantly, thousands of words each.

In essence, NER can be used to tag datasets, whether they contain tweets, journal articles, or webpage content. Tagging reduces text sparsity and adds tremendous value to your dataset(s).

You can check out my [Optical Character Recognition (OCR)](https://www.youtube.com/watch?v=rCgy4d2pyyA) video on how to extract text from unstructured data.

Imagine that we have thousands upon thousands of documents and we need to categorize our data before we can actually work with it!
- This is where NER can help our process.

In [60]:
import glob
import pandas as pd

In [91]:
fileList = glob.glob('data/*.csv')
fileList # In this case we only have 10 documents. But, as we can see, these datasets are unlabeled.

['data\\0.csv',
 'data\\1.csv',
 'data\\2.csv',
 'data\\3.csv',
 'data\\4.csv',
 'data\\5.csv',
 'data\\6.csv',
 'data\\7.csv',
 'data\\8.csv',
 'data\\9.csv']

# Let's get the datasets that only refer to people.
- This helps narrow our search tremendously without lifting much of a finger.
    - We can rely on existing tools to further our search for what we need and want.

In [96]:
# For simplicity, we will only look at the first row of each file.
# If that row mentions a person, we will store the file name so the file can be preprocessed further later on.
# First, print each file to see what the data looks like.
for f in fileList:
    file_read = pd.read_csv(f, header=None)
    print(file_read)
    print("-----")

                                                    0
0   Assuming the Presidency at the depth of the Gr...
1   Born in 1882 at Hyde Park, New York–now a nati...
2   Following the example of his fifth cousin, Pre...
3   In the summer of 1921, when he was 39, disaste...
4   He was elected President in November 1932, to ...
5   By 1935 the Nation had achieved some measure o...
6   In 1936 he was re-elected by a top-heavy margi...
7   Roosevelt had pledged the United States to the...
8   When the Japanese attacked Pearl Harbor on Dec...
9   Feeling that the future peace of the world wou...
10  As the war drew to a close, Roosevelt's health...
11  The Presidential biographies on WhiteHouse.gov...
-----


In [104]:
# Let's see how we can use NER in action.
# Keep the names of the files whose first row contains at least one PERSON entity.
white_list_files = []
for f in fileList:
    file_read = pd.read_csv(f, header=None)
    first_row = file_read.values[0][0]  # text of the first row (first paragraph)
    print(f"Text being read in..... for file {f}")
    print(first_row)
    processed_text = model(first_row)
    print("----- Checking if there are any PERSON entities")
    is_people = False
    for ent in processed_text.ents:
        print(ent.text, ent.label_)
        if ent.label_ == "PERSON":
            is_people = True
    if is_people:
        white_list_files.append(f)
    print("-----")

print("Files that have a PERSON entity labeled in the first paragraph: ")
print(white_list_files)

Text being read in..... for file data\0.csv
Assuming the Presidency at the depth of the Great Depression, Franklin D. Roosevelt helped the American people regain faith in themselves. He brought hope as he promised prompt, vigorous action, and asserted in his Inaugural Address, "the only thing we have to fear is fear itself."
----- Checking if there are any PERSON entities
the Great Depression EVENT
Franklin D. Roosevelt PERSON
American NORP
Inaugural Address ORG
-----
Text being read in..... for file data\1.csv
Abraham Lincoln became the United States' 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863.
----- Checking if there are any PERSON entities
Abraham Lincoln PERSON
the United States GPE
16th ORDINAL
1861 DATE
the Emancipation Proclamation FAC
1863 DATE
-----
Text being read in..... for file data\2.csv
On April 30, 1789, George Washington, standing on the balcony of Federal Hall on Wall Street 
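The filtering loop above can be wrapped into a reusable function. The sketch below is one possible shape, not the notebook's own code: `files_with_label` and `fake_nlp` are hypothetical names, and `fake_nlp` is only a stand-in so the example runs without a downloaded model. In the notebook you would pass the loaded `model` instead.

```python
from types import SimpleNamespace

def files_with_label(texts_by_file, nlp, label="PERSON"):
    """Return the file names whose text has at least one entity with `label`.

    `nlp` can be any callable that returns an object with an `.ents` iterable
    whose items expose `.label_`, such as a loaded spaCy pipeline.
    """
    matches = []
    for name, text in texts_by_file.items():
        doc = nlp(text)
        if any(ent.label_ == label for ent in doc.ents):
            matches.append(name)
    return matches

def fake_nlp(text):
    """Stand-in pipeline (assumption): tags a few hard-coded strings."""
    known = {"Abraham Lincoln": "PERSON", "Pearl Harbor": "GPE"}
    ents = [
        SimpleNamespace(text=name, label_=label)
        for name, label in known.items()
        if name in text
    ]
    return SimpleNamespace(ents=ents)

# Made-up file names mapped to the text of their first row.
texts = {
    "data/a.csv": "Abraham Lincoln became President in 1861.",
    "data/b.csv": "The Japanese attacked Pearl Harbor.",
}
people_files = files_with_label(texts, fake_nlp)
print(people_files)
# → ['data/a.csv']
```

Taking the pipeline as a parameter keeps the function easy to test and makes it simple to swap in a different model later.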

# That's cool! What's next?
- You can check out all sorts of models that have NER built into their architecture. In fact, a great repository is [Hugging Face](https://huggingface.co/), where all sorts of NLP models are stored publicly. They have models for many different industries, so it might be worth checking out.