**Example 1:** Extracting names mentioned in a tweet. This example uses spaCy, an NLP library designed for high performance.

In [1]:
%%capture capt
# the installation will take a while when first run
# verbose output is suppressed using a Jupyter feature called cell magic
# remove the first line and run again to see output
import sys
!{sys.executable} -m pip install spacy
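# use the same interpreter as the pip install above so the model lands in the same environment
!{sys.executable} -m spacy download en_core_web_sm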

# load a library so we can print coloured text
!{sys.executable} -m pip install termcolor

In [2]:
import spacy
from termcolor import colored

# load the English tokenizer, tagger, parser and NER clever stuff
nlp = spacy.load("en_core_web_sm")

# use a tweet as the sample text
tweet = ("I'm in favour of people of talent being able to come to this country... but I'm also "
         "in favour of control Boris Johnson describes how UK immigration would work under a Tory government, "
         "as Home Secretary Priti Patel pledges to cut immigration overall.")
doc = nlp(tweet)

# analyze syntax
print("\nNoun phrases\n", [chunk.text for chunk in doc.noun_chunks])
print("\nVerbs\n", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# display just the names mentioned in the tweet
print("\nPeople mentioned in tweet\n")
for entity in doc.ents:
    if entity.label_ == "PERSON":
        print(colored(entity.text, 'blue'))


Noun phrases
 ['I', 'favour', 'people', 'talent', 'this country', 'I', 'favour', 'control', 'Boris Johnson', 'UK immigration', 'a Tory government', 'immigration']

Verbs
 ['come', 'describe', 'would', 'work', 'pledge', 'cut']

People mentioned in tweet

Boris Johnson
Priti Patel


**Example 2:** Extracting multiple names from a block of text. The first example seems to work, but can multiple names be extracted reliably from unstructured text?

In [3]:
import spacy
import sys
import os
from termcolor import colored

# load the English tokenizer, tagger, parser and NER clever stuff
nlp = spacy.load("en_core_web_sm")

# load a suitable test file, the lyrics to We Didn't Start the Fire by Billy Joel
# (the with block closes the file once it has been read)
with open('Files/lyrics.txt', 'r') as f:
    lyrics = f.read()

# show the lyrics for the Gen Zs looking at this
print(lyrics)

# let's see how many names are identified
doc = nlp(lyrics)
print("\nPeople Billy Joel mentions...\n")
name_count = 0
for entity in doc.ents:
    if entity.label_ == "PERSON":
        name_count += 1
        # splitlines() exposes entities that span a line break
        print(name_count, ' ', colored(entity.text.splitlines(), 'blue'), entity.label_)

Harry Truman, Doris Day, Red China, Johnnie Ray
South Pacific, Walter Winchell, Joe DiMaggio
Joe McCarthy, Richard Nixon, Studebaker, television
North Korea, South Korea, Marilyn Monroe
Rosenbergs, H-bomb, Sugar Ray, Panmunjom
Brando, "The King and I" and "The Catcher in the Rye"
Eisenhower, vaccine, England's got a new queen
Marciano, Liberace, Santayana goodbye
We didn't start the fire
It was always burning
Since the world's been turning
We didn't start the fire
No we didn't light it
But we tried to fight it
Joseph Stalin, Malenkov, Nasser and Prokofiev
Rockefeller, Campanella, Communist Bloc
Roy Cohn, Juan Peron, Toscanini, Dacron
Dien Bien Phu falls, "Rock Around the Clock"
Einstein, James Dean, Brooklyn's got a winning team
Davy Crockett, Peter Pan, Elvis Presley, Disneyland
Bardot, Budapest, Alabama, Krushchev
Princess Grace, "Peyton Place", trouble in the Suez
We didn't start the fire
It was always burning
Since the world's been turning
We didn't start the fire
No we didn't ligh

The output from the above code is not bad, but the trained machine learning model has missed some names and incorrectly identified others. Some of the errors arise where a single name ends one line and another begins the next: the model treats the pair as one name, e.g. Kerouac Sputnik, Beatlemania Ole Miss and Psycho.
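One way to reduce the line-break errors described above is to turn each line break into an explicit separator before the text reaches the model. The following is a minimal sketch using only string handling; the function name `join_lines` is hypothetical, and whether this actually improves spaCy's results on the lyrics has not been verified here.

```python
def join_lines(text):
    """Join non-empty lines with ', ' so a line break reads as a separator."""
    parts = [line.strip().rstrip(",") for line in text.splitlines()]
    return ", ".join(part for part in parts if part)

# two consecutive lines taken from the lyrics
sample = ("Joe McCarthy, Richard Nixon, Studebaker, television\n"
          "North Korea, South Korea, Marilyn Monroe")

cleaned = join_lines(sample)
print(cleaned)
# the cleaned string would then be passed to the model: doc = nlp(cleaned)
```

With the separator in place, "television" and "North Korea" are no longer adjacent words split only by a newline, so there is less reason for the model to read them as one entity.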

Some names are missed completely, such as Eisenhower and Davy Crockett. The takeaway is that this sort of process is not suitable for situations where 100% of the names in a document must be extracted.
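To put a number on how many names are missed, the extracted names can be compared with a hand-made list of the names that should have been found. This is a rough sketch: the helper names `recall` and `missed` are hypothetical, and the `found_names` list is illustrative only, not the model's real output.

```python
def recall(found, expected):
    """Fraction of the expected names that were actually found."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(found)) / len(expected)

def missed(found, expected):
    """Expected names that do not appear in the found list, sorted."""
    return sorted(set(expected) - set(found))

# a small hand-made checklist of names that appear in the lyrics
expected_names = ["Harry Truman", "Doris Day", "Eisenhower", "Davy Crockett"]
# illustrative stand-in for the PERSON entities collected from doc.ents
found_names = ["Harry Truman", "Doris Day"]

score = recall(found_names, expected_names)
gaps = missed(found_names, expected_names)
print(score, gaps)
```

In practice `found_names` would be built from `entity.text` for each `PERSON` entity, and a score below 1.0 confirms that the process cannot be relied on where every name matters.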

**Example 3:** OCR. This example shows how to extract names from images. Both typed and handwritten examples are tested. Pytesseract is a wrapper for Google's Tesseract-OCR Engine, which also needs to be installed.

In [4]:
%%capture capt
# install an OCR library
import sys
!conda install -y -c conda-forge tesseract
!{sys.executable} -m pip install Pillow
!{sys.executable} -m pip install pytesseract

In [5]:
import pytesseract
from PIL import Image

def ocr(filename):
    # open the image and return the text Tesseract recognises in it
    return pytesseract.image_to_string(Image.open(filename))

txt = ocr('Files/image.png')
print(txt)

txt2 = ocr('Files/handwritten.png')
print(txt2)

Company Pool Ladder

 

 

 

 

 

Player Name Skill
Player 1 | Robin Banks Advanced
Player 2 Joe Bloggs Good
Player 3 Frank Sidebottom | Ace
Player 4 | Sally Snider Lucky
Player 5 Mary Marks Advanced
Player 6 Billy Kidson Good
Player 7 Robin Robyns Average
Player 8 Kim Jon Ho Average
Player 9 Aimee Bisset Average
Company Pool Ladder

 

Player 6 Billy kidson Good

 

‘Player 8 | ea Jen te overage


Note how poor the OCR of the handwritten text is. To overcome this you would need to train a Tesseract model, or combine other computer vision and machine learning tools such as OpenCV and TensorFlow. The code for this becomes much more complex.

**Example 4:** Extracting names from the OCR data. spaCy did reasonably well on the earlier examples, so we'll try it again on the typed OCR text.

In [6]:
import spacy

# load the English tokenizer, tagger, parser and NER clever stuff
nlp = spacy.load("en_core_web_sm")

# let's see how many names are identified in the OCR text
doc = nlp(txt)
print("\nPeople found in the OCR text...\n")
name_count = 0
for entity in doc.ents:
    if entity.label_ == "PERSON":
        name_count += 1
        print(name_count, ' ', entity.text.splitlines(), entity.label_)


People found in the OCR text...

1   ['Joe Bloggs Good'] PERSON
2   ['Frank'] PERSON
3   ['| Sally'] PERSON
4   ['Mary Marks'] PERSON
5   ['Billy Kidson Good'] PERSON
6   ['Robin Robyns'] PERSON
7   ['Kim Jon Ho'] PERSON


We can do better than this with some preprocessing. Note that detection accuracy differed slightly between running locally in Docker and running in Azure Notebooks.

In [7]:
import spacy
from spacy import displacy
sp = spacy.load('en_core_web_sm')

s = sp(txt.strip().replace('\n','').replace('|',''))

displacy.render(s, style='ent', jupyter=True)
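Because the OCR output here is a table with a predictable row shape, another form of preprocessing is to parse the rows directly rather than rely on NER. The sketch below is rule-based and does not use spaCy at all. It assumes every data row looks like `Player <n> <name> <skill>`, as in the sample output, and the `SKILLS` set and the function name `names_from_ocr` are hypothetical.

```python
import re

# assumed row shape, based on the sample OCR output: "Player <n> <name> <skill>"
ROW = re.compile(r"^Player\s+\d+\s+(?P<rest>.+)$")

# skill labels seen in the sample table (an assumption, extend as needed)
SKILLS = {"Advanced", "Good", "Ace", "Lucky", "Average"}

def names_from_ocr(text, skills=SKILLS):
    """Return the name column from OCR'd table rows."""
    names = []
    for line in text.splitlines():
        # stray pipes are table borders picked up by the OCR
        line = line.replace("|", " ").strip()
        match = ROW.match(line)
        if not match:
            continue  # skips titles, the header row and blank lines
        words = match.group("rest").split()
        if words and words[-1] in skills:
            words = words[:-1]  # drop the trailing skill label
        if words:
            names.append(" ".join(words))
    return names

sample = """Company Pool Ladder

Player Name Skill
Player 1 | Robin Banks Advanced
Player 2 Joe Bloggs Good
Player 3 Frank Sidebottom | Ace
Player 4 | Sally Snider Lucky
"""

table_names = names_from_ocr(sample)
print(table_names)
```

This only works while the layout stays fixed. For free-form text the NER approach is still needed, and the two can be combined by passing each parsed name through `nlp` as a check.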