# Named Entity Recognition

### Logic for identifying person names and occurrence frequencies from a given Chinese Wikipedia page


A person's name might occur in the same page in complete, incomplete, and alias forms.

Examples: 'Donald Trump' and 'Trump'; and 'Ing-Wen Tsai', '蔡英文', '小英', '英文', '蔡博士', '蔡總統', '蔡女士', and '小英總統'.
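
To make the frequency problem concrete, here is a minimal sketch of what we ultimately want: mentions counted per person rather than per surface form. The `ALIASES` table and the `count_canonical` helper are hypothetical and filled in by hand for illustration; building such a table automatically is the hard part.

```python
from collections import Counter

# Hypothetical, hand-written alias table: surface form -> canonical name.
ALIASES = {
    "蔡英文": "Ing-Wen Tsai",
    "小英": "Ing-Wen Tsai",
    "蔡總統": "Ing-Wen Tsai",
    "Trump": "Donald Trump",
    "Donald Trump": "Donald Trump",
}

def count_canonical(mentions):
    """Count mentions per canonical name, skipping unknown surface forms."""
    counts = Counter()
    for mention in mentions:
        canonical = ALIASES.get(mention)
        if canonical is not None:
            counts[canonical] += 1
    return counts

mentions = ["蔡英文", "小英", "Trump", "蔡總統", "Donald Trump", "政策"]
counts = count_canonical(mentions)
print(counts.most_common())
```

Counting by surface form alone would spread these mentions over five separate entries, which is why the alias question matters for the frequencies.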


#### Entry

How can we improve the precision and recall by filtering 'nouns which are not the name of a person' and 'incorrect names of persons created by segmentation'?

#### Advanced 

How can we find synonyms? For instance, '蔡英文', '小英', '英文', '蔡博士', '蔡總統', '蔡女士', and '小英總統' are 'Ing-Wen Tsai'.

#### Challenge 

How can we distinguish the names of persons from ambiguous results? For instance, 'Mr. Smith is a smith.' in which the second 'smith' is not the name of a person. Another example (in Chinese) is '蔡英文的英文很好', in which the first '蔡英文' is 'Ing-Wen Tsai' but the second '英文' is 'English' which is not the name of a person.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import copy
import datetime
import wikipedia
import wptools
import jieba.posseg as pseg
import jieba.analyse
import pynlpir

## Entry Level

How can we improve the precision and recall by filtering 'nouns which are not the name of a person' and 'incorrect names of persons created by segmentation'?

### Init helper functions

I introduce another Chinese segmentation library, [`pynlpir`](http://pynlpir.readthedocs.io/en/latest/index.html), as it offers the detailed part-of-speech tags that are required to solve the challenges. My anecdotal experience is that [`jieba`](https://github.com/fxsjy/jieba) does a better job of segmentation, but the drawback is that it doesn't give us the POS tags that we need.

The following tags can be referenced [here.](http://pynlpir.readthedocs.io/en/latest/pos_map.html?highlight=mapping)

In [3]:
jieba_POS_lookup = set(["n","nr","ns","nt","nz","eng"])
pynlpir_POS_lookup = set(["noun","personal name","Chinese surname","Chinese given name","Japanese personal name",
                         "transcribed personal name","toponym","other proper noun"])

POS_MAP = {"n":"noun","nr":"personal name","nr1":"Chinese surname","nr2":"Chinese given name","nrf":"transcribed personal name",
           "nrj":"Japanese personal name","nz":"other proper noun","ns":"toponym"
          }

inv_POS_MAP = {v: k for k, v in POS_MAP.items()}

### Init and compile regex

I define several regex patterns to determine if a given sequence of characters should be collapsed and considered as one word. While not 100% accurate, I came up with this logic by looking at the parse results for several sentences and examples. 

Examples of these patterns are:
- surname, proper noun ex: 蔡英文
- surname, noun, noun ex: 蔡博士
- surname, transcribed noun ex: 唐納川普
- given name, noun, noun ex: 小英總統

The results seem to hold up but ultimately, my Chinese is not native so I am limited in that regard. With more domain knowledge more patterns and filters can be added.

In [4]:
surname_proper_noun = r"\b(?P<index_1>\d{1,4})_(?P<pos_tag_1>nr1)(?:\s)(?P<index_2>\d{1,4})_(?P<pos_tag_2>nz|nr2)(?:\s)\b"

surname_noun_noun = r"\b(?P<index_1>\d{1,4})_(?P<pos_tag_1>nr1)(?:\s)(?P<index_2>\d{1,4})_(?P<pos_tag_2>n)(?:\s)(?P<index_3>\d{1,4})?_?(?P<pos_tag_3>n)?\b"

given_name_noun_noun = r"\b(?P<index_1>\d{1,4})_(?P<pos_tag_1>nr2)(?:\s)(?P<index_2>\d{1,4})_(?P<pos_tag_2>n)(?:\s)(?P<index_3>\d{1,4})?_?(?P<pos_tag_3>n)\b"

surname_transcribed_noun = r"\b(?P<index_1>\d{1,4})_(?P<pos_tag_1>nr1)(?:\s)(?P<index_2>\d{1,4})_(?P<pos_tag_2>n)(?:\s)(?P<index_3>\d{1,4})?_?(?P<pos_tag_3>nrf)\b"

cmp_surname_proper_noun = re.compile(surname_proper_noun)
cmp_surname_noun_noun = re.compile(surname_noun_noun)
cmp_given_name_noun_noun = re.compile(given_name_noun_noun)
cmp_surname_transcribed_noun = re.compile(surname_transcribed_noun)

### Correct segmentation and extract person names

The above patterns take precedence and are used to extract the first batch of names from the page and correct
segmentation issues, removing any token that was seen from the final result set to avoid duplication. The text is parsed and tagged with its given part of speech (POS), retaining all tagged as personal nouns.

Finally, the remaining words tagged as person nouns are returned along with the collapsed words.
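
The index-and-tag encoding can be sketched without the segmenter. The example below is a simplified illustration, not the notebook's exact patterns: the tagged tokens are written by hand (they are not real `pynlpir` output) and the single regex only covers the "surname followed by a proper noun or given name" case.

```python
import re

# Hand-tagged tokens standing in for filtered segmenter output (assumed values).
tagged = [("蔡", "nr1"), ("英文", "nz"), ("總統", "n"), ("川普", "nrf")]

# Encode each token as "<index>_<tag>" so a regex can match tag sequences
# while the index lets us recover the original characters afterwards.
condensed = " ".join(f"{i}_{tag}" for i, (_, tag) in enumerate(tagged))

# Simplified pattern: surname (nr1) followed by a proper noun (nz) or given name (nr2).
pattern = re.compile(r"\b(?P<i1>\d+)_nr1 (?P<i2>\d+)_(?:nz|nr2)\b")

names, seen = [], set()
for match in pattern.finditer(condensed):
    indices = [int(match.group("i1")), int(match.group("i2"))]
    names.append("".join(tagged[i][0] for i in indices))
    seen.update(indices)

# Tokens that were not absorbed into a collapsed name are kept as they are.
leftovers = [word for i, (word, _) in enumerate(tagged) if i not in seen]

print(condensed)
print(names, leftovers)
```

Matching on the tag string rather than on the characters keeps the rules independent of any particular name; the indices are only there to map a match back to the text.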

In [5]:
def extract_nouns(page_text):
    """ Parse all the person names from the given page text.
    
    A few regex rules are defined to correct incorrect names created by segmentation.
    These rules are processed first, removing any token that was seen from the final
    result set to avoid duplication. 
    
    Finally, the remaining words tagged as person nouns are returned along with the collapsed
    words.
    
    Args:
        page_text (str): Wikipedia page text.
    
    Returns:
        person_names (list): List of person names parsed from the page.
    """
    
    # Here we initialize the API and filter the pos tags we want to keep
    pynlpir.open()
    pos_tags = pynlpir.segment(page_text, pos_names='child')
    pos_tags = [item for item in pos_tags if item[1] in pynlpir_POS_lookup]
    
    # In order to collapse words, we create a condensed string that we will check
    # against our regex patterns for segmentation correction
    condensed_pos_list = [str(idx) + "_" + inv_POS_MAP[item[1]] \
                         for idx, item in enumerate(pos_tags)]
    condensed_pos_str = " ".join(condensed_pos_list)
    pynlpir.close()
    
    surname_proper_noun_list = list(cmp_surname_proper_noun.finditer(condensed_pos_str))
    surname_noun_noun_list = list(cmp_surname_noun_noun.finditer(condensed_pos_str))
    given_name_noun_noun_list = list(cmp_given_name_noun_noun.finditer(condensed_pos_str))
    surname_transcribed_noun_list = list(cmp_surname_transcribed_noun.finditer(condensed_pos_str))
    
    # Merge all the lists for processing
    match_object_list = surname_proper_noun_list + surname_noun_noun_list \
                        + given_name_noun_noun_list + surname_transcribed_noun_list
    
    person_names, seen_indices = process_match_list(pos_tags, match_object_list)
    
    # We create a set out of the indices we have seen
    # Token positions already used in a collapsed word are skipped
    # We also limit the results to tokens longer than 1 character
    seen_indices = set(seen_indices)
    for idx, item in enumerate(pos_tags):
        if idx not in seen_indices and len(item[0]) > 1:
            person_names.append(item[0])
    
    return person_names

def process_match_list(pos_tags, match_object_list):
    """ Process the stored match objects and extract
    the collapsed words at each index.
    
    Args:
        pos_tags (list): List of word:tag tuples.
        match_object_list (list): List of regex match objects.
    
    Returns:
        collapsed_list(list), seen_indices(list)
    """
    
    collapsed_list = []
    seen_indices = []
    for item in match_object_list:
        result, seen = collapse_words(pos_tags,item)
        collapsed_list.append(result)
        seen_indices = seen_indices + seen
    return collapsed_list, seen_indices
    
def collapse_words(pos_list, match_object):
    """ Use the result of the regex patterns and 
    collapse the words into single units. This attempts
    to correct incorrect names created by segmentation.
    
    Args:
        pos_list (list): List of word:POS tuples.
        match_object: (regex match object)
    
    Returns:
        collapsed(str), seen_indices(list)
    """
    
    seen_indices = []
    collapsed = ""
    group_str = "index_"
    for idx in range(1,4):
        curr = group_str + str(idx)
        try:
            char_position = int(match_object.group(curr))
            character = pos_list[char_position][0]
            collapsed += character
            seen_indices.append(char_position)
        except (IndexError, TypeError):
            # The group is missing from the pattern or did not take part in the match
            pass
    return collapsed, seen_indices

### The English double tap

While the above code can get us most of what we need for Chinese words, English is another issue. We therefore need to do a secondary pass to extract English tokens and get the entity that they represent. We need to use an English Named Entity Recognition library. I have experience with the NLP toolkit [spaCy](https://spacy.io/) so I'll use that. I keep the function modular in case we need to extract entities of a different type later on.
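
As a rough illustration of the first half of that secondary pass, the sketch below pulls Latin-script tokens out of mixed text with a plain regex. This is a stand-in for the `jieba` "eng" flag used in the notebook, not a reproduction of it, and `latin_tokens` is a hypothetical helper name.

```python
import re

def latin_tokens(text, min_len=2):
    """Return runs of ASCII letters that are at least min_len characters long."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if len(tok) >= min_len]

sample = "唐納·川普（Donald John Trump）是第45任美國總統"
print(latin_tokens(sample))  # ['Donald', 'John', 'Trump']
```

Deciding which of these tokens are person names still needs an entity recogniser; the regex only isolates the candidates.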

In [6]:
from collections import Counter
from pprint import pprint
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

def get_eng_tokens(page_text):
    """ Parse all the English tokens from the page text.
    
    Args:
        page_text (str): Wikipedia page text.
    
    Returns:
        eng_tokens (list): List of tokens tagged as English.
    """
    
    # topK refers to the number of tags to return, based on tf-idf
    # Set this to an arbitrarily high number to get all words
    tags = jieba.analyse.extract_tags(page_text, topK=50000)
    filtered_tags = [tag for tag in tags if not tag.isnumeric() and len(tag) >= 2]
    filtered_string = ", ".join(filtered_tags)
    
    token_tags = list(pseg.cut(filtered_string))
    return [token.word for token in token_tags if token.flag == "eng"]

def get_named_entities(page_text, entity="PERSON"):
    """ Filter English tokens by entity, defaults to PERSON.
    
    We want to retain all tokens that are tagged as a given entity.
    
    Args:
        page_text (str): Wikipedia page text.
    """
    
    entity_names = []
    eng_tokens = get_eng_tokens(page_text)
    
    for token in eng_tokens:
        doc = nlp(token)
        for ent in doc.ents:
            if ent.label_ == entity:
                entity_names.append(ent.text)
    
    return entity_names

### Init page url 

In [7]:
# Set wikipedia api language
lang = "zh"
wikipedia.set_lang(lang)

# Here I declare the title of the zh wikipedia page, formatted as https://zh.wikipedia.org/wiki/<page_title>
page_title = "唐納·川普"
page_text = wikipedia.page(page_title).content

# Strip away new lines
page_text = page_text.replace('\n', '')

## Entry Level Results

#### Static test

In [8]:
test_str = "蔡英文', '小英', '英文', '蔡博士', '蔡總統', '蔡女士', and '小英總統, Donald Trump, 唐納·川普"
pynlpir.open()
pos_tags = pynlpir.segment(test_str, pos_names='child')
pos_tags = [item for item in pos_tags if item[1] in pynlpir_POS_lookup]
pynlpir.close()
pos_tags[0:8]

[('蔡', 'Chinese surname'),
 ('英文', 'other proper noun'),
 ('小英', 'Chinese given name'),
 ('英文', 'other proper noun'),
 ('蔡', 'Chinese surname'),
 ('博士', 'noun'),
 ('蔡', 'Chinese surname'),
 ('總', 'noun')]

With the default segmenter, words like 蔡英文 and 蔡總統 are split into smaller pieces. However, we can learn some rules that tell us how to correct these issues, as seen in the results below.

In [9]:
print(extract_nouns(test_str))

['蔡英文', '蔡博士', '蔡總統', '蔡女士', '唐納', '小英總統', '唐納川普', '小英', '英文', 'Donald', 'Trump']


#### Extract English tokens first

In [10]:
eng_tokens = get_eng_tokens(page_text)
eng_token_set = set(eng_tokens)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.911 seconds.
Prefix dict has been built succesfully.


#### Get English names

In [11]:
english_names = get_named_entities(page_text, "PERSON")

#### Create a set of tokens that we don't want to keep

In [12]:
token_blacklist = eng_token_set.difference(set(english_names))

#### Entry Results

In [13]:
page_nouns = extract_nouns(page_text)
word_frequencies = Counter()
for word in page_nouns:
    if word not in token_blacklist:
        word_frequencies[word] += 1
        
pprint(word_frequencies.most_common())

part of speech not recognized: 'gjtgj'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'
part of speech not recognized: 'gms'


[('特朗普', 84),
 ('川普', 48),
 ('移民', 25),
 ('共和', 24),
 ('总统', 13),
 ('政策', 11),
 ('公司', 11),
 ('英文', 10),
 ('穆斯林', 10),
 ('世界', 9),
 ('小姐', 9),
 ('女性', 8),
 ('媒体', 7),
 ('命令', 7),
 ('行政', 6),
 ('对手', 6),
 ('弗雷德', 6),
 ('代表', 6),
 ('政治', 6),
 ('克林顿', 5),
 ('白人', 5),
 ('候选人', 5),
 ('法院', 5),
 ('名人', 5),
 ('利益', 5),
 ('家庭', 4),
 ('子女', 4),
 ('部分', 4),
 ('政府', 4),
 ('共和党', 4),
 ('公民', 4),
 ('柯林', 4),
 ('时报', 4),
 ('皇后', 4),
 ('选票', 3),
 ('Trump', 3),
 ('社交', 3),
 ('言论', 3),
 ('法案', 3),
 ('生意', 3),
 ('委员会', 3),
 ('节目', 3),
 ('同盟', 3),
 ('选民', 3),
 ('关系', 3),
 ('火星', 3),
 ('故事', 3),
 ('事件', 3),
 ('中国', 3),
 ('棒球', 3),
 ('工人', 3),
 ('盟友', 3),
 ('联邦', 3),
 ('希拉蕊', 3),
 ('公寓', 3),
 ('民主党', 3),
 ('詹姆斯·科米', 3),
 ('气候', 3),
 ('历史', 3),
 ('电视', 3),
 ('主席', 3),
 ('人物', 3),
 ('房地产', 2),
 ('婚姻', 2),
 ('选手', 2),
 ('梅拉尼娅', 2),
 ('总体', 2),
 ('父母', 2),
 ('手段', 2),
 ('公职', 2),
 ('方式', 2),
 ('全国', 2),
 ('核心', 2),
 ('人口', 2),
 ('色彩', 2),
 ('官员', 2),
 ('商业', 2),
 ('形象', 2),
 ('庄园', 2),
 ('孩子', 2),
 ('外界', 2),


In [14]:
print("Word count: " + str(len(word_frequencies)))

Word count: 491


#### One last filter

The initial results look messy. We can refine the list further, since we don't need to do any more segmentation correction. Let's define the POS tags we want to keep at the end. Here we will try to remove all nouns that remain after trying to segment a word.

I reason that if a word is segmented and it does not have a POS tag we want to keep then it should be discarded.
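
The rule for short tokens can be sketched on hand-tagged input. This is a simplified illustration under the assumption that each candidate has already been re-segmented on its own; `keep_short_token` is a hypothetical helper and the tag pairs are written by hand rather than produced by `pynlpir`.

```python
NAME_TAGS = {"personal name", "Chinese surname", "Chinese given name",
             "Japanese personal name", "transcribed personal name"}
NOUN_TAGS = NAME_TAGS | {"noun", "toponym", "other proper noun"}

def keep_short_token(segments):
    """Decide whether a short candidate survives, given its re-segmentation.

    segments is a list of (text, tag) pairs: what a segmenter returns
    when it is run on the candidate alone.
    """
    if len(segments) == 1:
        # The candidate stayed whole: keep it only if it is tagged as a name.
        return segments[0][1] in NAME_TAGS
    # The candidate split: keep it only for the "surname + noun" shape.
    return (len(segments) == 2
            and segments[0][1] == "Chinese surname"
            and segments[1][1] in NOUN_TAGS)

print(keep_short_token([("川普", "transcribed personal name")]))       # True
print(keep_short_token([("蔡", "Chinese surname"), ("博士", "noun")]))  # True
print(keep_short_token([("政策", "noun")]))                             # False
```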

In [15]:
final_POS_lookup = set(["personal name","Chinese surname","Chinese given name","Japanese personal name",
                         "transcribed personal name"])

In [16]:
def cleanup_tokens(original_frequencies, english_names):
    """ Restrict the words to those with a POS we want to retain
    and those that do not break into characters upon individual
    segmentation.
    
    Args:
        original_frequencies (Counter): Counter of words.
        english_names (set): Set of English names to retain.
        
    Return:
        word_frequencies (Counter): Filtered words.
    """
    
    pynlpir.open()
    
    # Create a copy to avoid altering the original Counter
    word_frequencies = copy.deepcopy(original_frequencies)
    token_set = set(word_frequencies)
    
    for token in token_set:
        # Keep all of these
        if token in english_names:
            pass
        
        # Let's test to see if the word breaks into smaller characters
        # Most Chinese names are 2 or 3 characters in length.
        elif len(token) <= 3:
            pos_tag = pynlpir.segment(token, pos_names='child')
            
            # The token has been broken into smaller characters, check if it
            # should be discarded. Our patterns take the form of a Surname character
            # followed by some other noun character
            
            if len(pos_tag) > 1:
                if len(pos_tag) == 2 and pos_tag[0][1] == "Chinese surname" and pos_tag[1][1] in pynlpir_POS_lookup:
                    pass
                else:
                    del word_frequencies[token]
            
            # The token remains whole, check the POS
            elif pos_tag[0][1] not in final_POS_lookup:
                del word_frequencies[token]
        
        # Check for transcribed names
        elif len(token) > 3:
            pos_tag = pynlpir.segment(token, pos_names='child')
            if len(pos_tag) == 1 and pos_tag[0][1] not in final_POS_lookup:
                del word_frequencies[token]
    
    pynlpir.close()
    return word_frequencies

#### Revised Entry Results

In [17]:
revised_word_frequencies = cleanup_tokens(word_frequencies, english_names)
pprint(revised_word_frequencies.most_common())

part of speech not recognized: 'gms'


[('川普', 48),
 ('弗雷德', 6),
 ('克林顿', 5),
 ('柯林', 4),
 ('詹姆斯·科米', 3),
 ('Trump', 3),
 ('希拉蕊', 3),
 ('梅拉尼娅', 2),
 ('唐纳德·特朗普', 2),
 ('福坦莫', 2),
 ('康尼', 2),
 ('伊凡娜·特朗普', 2),
 ('科米', 2),
 ('习近平', 2),
 ('希拉里·克林顿', 2),
 ('布希', 2),
 ('川普商', 1),
 ('史密斯', 1),
 ('小丘', 1),
 ('最高法院', 1),
 ('高品', 1),
 ('里斯', 1),
 ('玛拉·梅普尔', 1),
 ('宣人身份', 1),
 ('於理性生命', 1),
 ('迪表', 1),
 ('弗雷德·特朗普', 1),
 ('後行程行政', 1),
 ('福布斯', 1),
 ('Elizabeth', 1),
 ('邱林', 1),
 ('特朗普因', 1),
 ('後TPP主', 1),
 ('後横财', 1),
 ('後事情沙', 1),
 ('弗林', 1),
 ('於Rascals台', 1),
 ('唐納川普', 1),
 ('Tony', 1),
 ('於总统川普', 1),
 ('後公民', 1),
 ('蔡英文', 1),
 ('冷石', 1),
 ('特朗普成', 1),
 ('韦德案', 1),
 ('冠夫', 1),
 ('於人梅拉尼娅·特朗普', 1),
 ('若中步政策', 1),
 ('唐納', 1),
 ('罗西·奥唐奈', 1),
 ('於穆斯林', 1),
 ('川普公', 1),
 ('克·彭斯', 1),
 ('谢尔盖', 1),
 ('霍士新', 1),
 ('後裁判奥斯汀', 1),
 ('布什', 1),
 ('於影片', 1),
 ('乌玛加', 1),
 ('杰西·文图拉', 1),
 ('埃里克', 1),
 ('於表印象', 1),
 ('後G7德', 1),
 ('Vince', 1),
 ('宣情世界', 1),
 ('夫球部分右派', 1),
 ('拉夫罗夫', 1),
 ('迪表川普', 1),
 ('後家', 1),
 ('川普村', 1),
 ('後德理默克尔', 1),
 ('日美核武

In [18]:
print("Word count: " + str(len(revised_word_frequencies)))

Word count: 137


## Advanced Level 

This seems like overkill for this task but I covered them extensively in my MSc thesis on Hate Speech so I'll suggest it anyway: Word Embeddings. Word Embeddings refer to the set of NLP techniques that are used to map objects (most often words or phrases) into dense vector representations.

They enable efficient computation of semantic similarities between words based on their distribution in the underlying language corpus. The core idea is the Distributional Hypothesis, which states that “words that appear in the same contexts share semantic meaning”. In the domain of Word Embeddings this means that a word will share characteristics with the words that are typically its neighbours in a sentence.
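
A toy illustration of that idea, assuming nothing more than whitespace-separated English sentences: we count each word's immediate neighbours and compare the resulting context profiles. Real embeddings learn dense vectors instead of raw counts, so treat this only as a sketch of the intuition; `context_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][words[j]] += 1
    return contexts

corpus = ["the cat sleeps", "the dog sleeps", "the stock market fell"]
contexts = context_counts(corpus)

print(contexts["cat"] == contexts["dog"])     # True: identical neighbours
print(contexts["cat"] == contexts["market"])  # False: different neighbours
```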

At the time of writing, the state-of-the-art method for learning Word Embeddings is [fasttext](https://github.com/facebookresearch/fastText) from Facebook. It offers significant improvements over [word2vec](https://en.wikipedia.org/wiki/Word2vec) because it learns representations for character n-grams rather than only for whole words.

So in this case, if we were to learn Embeddings from Chinese Wikipedia or even rely on a pretrained model, then this would be a no-brainer. Facebook has made our job easy by providing pre-trained word embeddings for more than 250 languages [here](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

The models are quite large so it might not be appropriate to include them here, but I'll add a small sample and show what would need to be done. I honestly can't remember where exactly I got the pretrained embedding that I include here, but I believe it came from this [page](https://sites.google.com/site/rmyeid/projects/polyglot).

A non-fancy solution might be to simply use the Edit Distance between words, [Levenshtein Distance](https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm) for instance.
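
A minimal sketch of that edit-distance idea, written from scratch so that it needs no extra library. The `levenshtein` function below is the standard dynamic-programming formulation; how to turn its scores into alias decisions (for example, which threshold to use) is left open.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("蔡英文", "蔡總統"))    # 2
print(levenshtein("蔡英文", "小英總統"))  # 3
```

Because Chinese names are so short, one or two edits already cover most of the string, so edit distance on its own is likely to be a weak signal here and would probably need to be combined with other evidence.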

#### Import gensim and define helper functions

In [19]:
from gensim.models import KeyedVectors, Word2Vec

In [20]:
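def load_embedding(filename, embedding_type):
    """ Load a fasttext or word2vec embedding.
    
    Args:
        filename (str): Path to the embedding file.
        embedding_type (str): "kv" for vectors stored in the word2vec text format,
            "w2v" for a saved gensim Word2Vec model.
    """
    if embedding_type == "kv":
        return KeyedVectors.load_word2vec_format(filename, binary=False, unicode_errors="ignore")
    if embedding_type == "w2v":
        # Return only the vectors so both branches yield the same kind of object
        return Word2Vec.load(filename).wv
    raise ValueError("Unknown embedding_type: " + str(embedding_type))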

In [21]:
zh_embedding_model = load_embedding("wiki.zh_classical.vec", "kv")

#### Let's explore

We can start with the Chinese word for English, 英文. From the results we can see that the most similar words are other languages, including Latin, French, and Russian.

I should note that the similarity is the cosine similarity between the vector representation of 英文 and every other vector representation stored in the model.

The model is quite small so it does not contain many examples for us to rely on. Learning such a model over Wikipedia data would allow us to infer the top synonyms.
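
For readers who want to see what the similarity score is, here is a small sketch of cosine similarity and a brute-force nearest-neighbour lookup. The three-dimensional `toy_vectors` are invented for illustration and have nothing to do with the values in the real model; `most_similar` is a hypothetical helper, not the gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented toy vectors, for illustration only.
toy_vectors = {
    "英文": [0.9, 0.1, 0.0],
    "法文": [0.8, 0.2, 0.1],
    "交易所": [0.1, 0.9, 0.3],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("英文", toy_vectors)
print([other for other, _ in ranked])
```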

In [22]:
word = "英文"
topn = 10

In [23]:
zh_embedding_model.similar_by_word(word, topn=topn, restrict_vocab=None)

[('拉丁文', 0.8240991234779358),
 ('法文', 0.8182531595230103),
 ('英語', 0.8131850361824036),
 ('書面', 0.773542582988739),
 ('希臘文', 0.7686800956726074),
 ('俄語', 0.7637884616851807),
 ('教本', 0.7624318599700928),
 ('交易所', 0.7621933221817017),
 ('阿拉伯文', 0.7604696750640869),
 ('拉丁', 0.7577136754989624)]

A nice feature of spaCy is that it includes pretrained word vectors by default.

In [24]:
def spacy_top_k_similar(word, k):
    """ Returns the top k most similar entries from a spaCy vocabulary.
    Args
    ----
        word (spacy.tokens.Token): Token whose nearest neighbours we want.

        k (int): Number of results to return.
    """
    # The filters apply to each candidate `w`, not to the query `word`
    queries = [w for w in word.vocab if not
               (w.is_oov or w.is_punct or w.like_num or
                w.is_stop or w.lower_ == "rt")
               and w.has_vector and w.lower_ != word.lower_
               and w.is_lower == word.is_lower and w.prob >= -15]

    # Keep only the k best candidates before recomputing their scores
    by_similarity = sorted(
        queries, key=lambda w: word.similarity(w), reverse=True)[:k]
    cosine_score = [word.similarity(w) for w in by_similarity]
    return by_similarity, cosine_score

In [25]:
english = nlp("english")
similar_words, cosine_vals = spacy_top_k_similar(english[0], 10)

In [26]:
for item in similar_words:
    print(item.lower_)

french
translation
american
language
dictionary
grammar
translate
wikipedia
languages
torrent


## Challenge Level

I think that I touched on this in the logic for correcting the segmentation issues. For Chinese in particular, I think that this can be solved by outlining the common patterns that are used to form names. As most Chinese full names are formed with full characters, if a token that is a common Chinese surname precedes other characters tagged as nouns, then it might indicate that those characters taken together form a name.

While this is by no means an exhaustive list of patterns, I present an example below.
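
Before running the real pipeline, the surname rule can be sketched on a hand-tagged sentence. Everything here is an assumption made for illustration: the tags are assigned by hand, `merge_surname_runs` is a hypothetical helper, and it only handles a surname followed by exactly one name-like token.

```python
SURNAME = "nr1"
NAME_PART_TAGS = {"nz", "nr2", "n"}

def merge_surname_runs(tagged):
    """Merge a surname with the name-like token that directly follows it."""
    merged, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        has_next = i + 1 < len(tagged)
        if tag == SURNAME and has_next and tagged[i + 1][1] in NAME_PART_TAGS:
            merged.append((word + tagged[i + 1][0], "person"))
            i += 2  # both tokens were consumed by the merged name
        else:
            merged.append((word, tag))
            i += 1
    return merged

# Hand-tagged version of 蔡英文的英文很好 (tags are assumptions).
tagged = [("蔡", "nr1"), ("英文", "nz"), ("的", "u"),
          ("英文", "nz"), ("很", "d"), ("好", "a")]
result = merge_surname_runs(tagged)
people = [word for word, tag in result if tag == "person"]
print(people)  # ['蔡英文']
```

The second 英文 keeps its original tag because no surname precedes it, which is exactly the distinction the challenge asks for.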

### Chinese Version

#### Let's test a string and see what POS tags we get

In [27]:
test_str_2 = "蔡英文的英文很好"
pynlpir.open()
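# Note: this segments test_str (the static test string from In [8]) rather than test_str_2,
# so the tags shown below are those of the longer string
pos_tags = pynlpir.segment(test_str, pos_names='child')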
pos_tags = [item for item in pos_tags if item[1] in pynlpir_POS_lookup]
pynlpir.close()
pos_tags

[('蔡', 'Chinese surname'),
 ('英文', 'other proper noun'),
 ('小英', 'Chinese given name'),
 ('英文', 'other proper noun'),
 ('蔡', 'Chinese surname'),
 ('博士', 'noun'),
 ('蔡', 'Chinese surname'),
 ('總', 'noun'),
 ('統', 'noun'),
 ('蔡', 'Chinese surname'),
 ('女士', 'noun'),
 ('小英', 'Chinese given name'),
 ('總', 'noun'),
 ('統', 'noun'),
 ('Donald', 'other proper noun'),
 ('Trump', 'noun'),
 ('唐', 'Chinese surname'),
 ('納', 'noun'),
 ('川普', 'transcribed personal name')]

#### Extract the nouns

In [28]:
test_nouns = extract_nouns(test_str_2)
test_nouns

['蔡英文', '英文']

#### Clean up the tokens and keep the names

In [29]:
test_frequencies = Counter()
for word in test_nouns:
    if word not in token_blacklist:
        test_frequencies[word] += 1

In [30]:
revised_test_frequencies = cleanup_tokens(test_frequencies, english_names)
revised_test_frequencies.most_common()

[('蔡英文', 1)]

### English version

I'm much more familiar with how to do this in English, owing to both the ready availability of models and my experience with spaCy. The first thing we need to do is collapse the noun phrases, which gives us "Mr Smith" as a single token instead of "Mr", "Smith".

Next, we just need to iterate over the `ents` property of the document and return the entities whose label matches PERSON, or whichever label we are interested in.

In [31]:
# Init doc and collapse phrases
doc_1 = nlp("Mr Smith is a smith")
doc_2 = nlp("Mr Smith is a smith")
for _np in list(doc_1.noun_chunks):
    _np.merge(_np.root.tag_, _np.root.lemma_, _np.root.ent_type_)

#### Let's check the tags for both

We see that doc_1 has collapsed the words into one token.

In [32]:
for token in doc_1:
    print(token.tag_)

NNP
VBZ
NN


In [33]:
for token in doc_2:
    print(token.tag_)

NNP
NNP
VBZ
DT
NN


#### Extract entities, Voila

In [34]:
for ent in doc_1.ents:
    if ent.label_ == "PERSON":
        print(ent.text)

Mr Smith


In [35]:
for ent in doc_2.ents:
    if ent.label_ == "PERSON":
        print(ent.text)

Smith
