In [3]:
import nltk
import pandas
import unicodedata

## Read and transform data

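Read the data and normalize the text with Unicode NFKD, which replaces compatibility characters (such as non-breaking spaces) with their plain equivalents. Then join the first 12 texts, which together contain 500 sentences.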

In [4]:
df = pandas.read_csv('news.csv')

In [5]:
sentences = unicodedata.normalize("NFKD", " ".join(df['text'][0:12]) ) # 0:12 - exactly 500 sentences
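
As a rough illustration of what the NFKD step does, the sketch below normalizes a short made-up string (it is not taken from `news.csv`). The final accent-stripping step is an optional extra that this notebook does not perform.

```python
import unicodedata

# Hypothetical sample text: a no-break space (U+00A0), the "fi" ligature
# (U+FB01) and an accented letter (U+00E9).
raw = "price\u00a0of \ufb01sh in Montr\u00e9al"

normalized = unicodedata.normalize("NFKD", raw)

# U+00A0 becomes a plain space and U+FB01 becomes the letters "f" + "i".
# U+00E9 is split into "e" plus a combining acute accent (U+0301),
# so NFKD separates accents from letters but does not remove them.
print(repr(normalized))

# Optional extra step (an assumption, not part of the notebook):
# drop the combining marks to keep only the base letters.
stripped = "".join(ch for ch in normalized if not unicodedata.combining(ch))
print(stripped)
```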

## Sentences

Let's check that the selection meets the task requirement of at least 500 sentences.

In [6]:
sentences_tokens = nltk.sent_tokenize(sentences)

print("Length:"+str(len(sentences_tokens)))

Length:500


## Import from tutorial
The following code is from tutorial 3.

In [14]:
# Extract entities: map the text of each chunk to its label
def extract_entities(ne_chunked):
    data = {}
    for entity in ne_chunked:
        if isinstance(entity, nltk.tree.Tree):
            text = " ".join([word for word, tag in entity.leaves()])
            ent = entity.label()
            data[text] = ent
        else:
            continue
    return data

## POS TAG
Part-of-speech (POS) tagging is the process of labeling each word in a sentence with its part of speech.
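
The tagger returns a list of `(token, tag)` pairs. As a small sketch of how such output can be inspected, the snippet below tallies tags by their Penn Treebank prefix. The pairs are copied by hand from the first tokens of this notebook's tagger output; the `coarse` helper is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# (token, tag) pairs copied by hand from the start of the tagger output.
sample_tagged = [
    ("Canadian", "JJ"), ("pharmacies", "NNS"), ("are", "VBP"),
    ("limiting", "VBG"), ("how", "WRB"), ("much", "JJ"),
    ("medication", "NN"), ("can", "MD"), ("be", "VB"),
    ("dispensed", "VBN"),
]

def coarse(tag):
    """Collapse noun, verb and adjective tags to their shared prefix."""
    for prefix in ("NN", "VB", "JJ"):
        if tag.startswith(prefix):
            return prefix
    return tag

tag_counts = Counter(coarse(tag) for _, tag in sample_tagged)
print(tag_counts.most_common())
```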

In [15]:
tokens = nltk.word_tokenize(sentences)
tagged = nltk.pos_tag(tokens)
tagged

[('Canadian', 'JJ'),
 ('pharmacies', 'NNS'),
 ('are', 'VBP'),
 ('limiting', 'VBG'),
 ('how', 'WRB'),
 ('much', 'JJ'),
 ('medication', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('dispensed', 'VBN'),
 ('to', 'TO'),
 ('try', 'VB'),
 ('to', 'TO'),
 ('prevent', 'VB'),
 ('shortages', 'NNS'),
 (',', ','),
 ('recognizing', 'VBG'),
 ('that', 'IN'),
 ('most', 'JJS'),
 ('active', 'JJ'),
 ('ingredients', 'NNS'),
 ('for', 'IN'),
 ('drugs', 'NNS'),
 ('come', 'VBP'),
 ('from', 'IN'),
 ('India', 'NNP'),
 ('and', 'CC'),
 ('China', 'NNP'),
 ('and', 'CC'),
 ('medical', 'JJ'),
 ('supply', 'NN'),
 ('chains', 'NNS'),
 ('have', 'VBP'),
 ('been', 'VBN'),
 ('disrupted', 'VBN'),
 ('by', 'IN'),
 ('the', 'DT'),
 ('spread', 'NN'),
 ('of', 'IN'),
 ('COVID-19', 'NNP'),
 ('.', '.'),
 ('Provincial', 'NNP'),
 ('regulatory', 'JJ'),
 ('colleges', 'NNS'),
 ('are', 'VBP'),
 ('complying', 'VBG'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('Canadian', 'NNP'),
 ('Pharmacists', 'NNP'),
 ('Association', 'NNP'),
 ('call', 'NN'),
 ('to', 'T

## Named Entity Recognition
Sentences in any language follow a grammar in which each part of speech has its place. Sentences also contain sentence elements such as the subject, verb, and object, which tell us what the sentence is about. A **named entity** is a subject or an object that can be denoted by a proper name. The chunking process looks for patterns in the sentence: technically, the sentence is a tree-like structure, and the process finds subtrees that follow the chosen pattern.
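
To make the tree idea concrete, here is a minimal sketch that builds a chunk tree by hand and walks it the same way `extract_entities` does. The sentence and its tags are invented for the example; no tagger or chunker model is run.

```python
from nltk.tree import Tree

# Hand-built chunk tree (hypothetical example sentence).
# Chunked entities are subtrees; all other tokens are plain (word, tag) tuples.
sentence_tree = Tree("S", [
    Tree("NE", [("Mina", "NNP"), ("Tadrous", "NNP")]),
    ("works", "VBZ"),
    ("in", "IN"),
    Tree("NE", [("Toronto", "NNP")]),
])

found = []
for node in sentence_tree:
    # Only subtrees are entities; tuples are ordinary tokens and are skipped.
    if isinstance(node, Tree):
        words = [word for word, tag in node.leaves()]
        found.append((" ".join(words), node.label()))

print(found)
```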


**The following text is only about the differences between Czech and English and is not important for this task. I was just thinking about how marking sentence elements differs between English and Czech.**

English has a very strict sentence structure compared to Czech. Writing regular expressions that mark sentence elements would therefore be harder for Czech, because the words of a Czech sentence can appear in different positions and still carry the same meaning.

However, word order is sometimes crucial for the meaning of the sentence. Sentence elements can also be ambiguous: an object can easily be mistaken for the subject, for example. This makes Czech quite challenging to understand, even for native speakers.

For example:

**"Manželství vám zachrání kampaň."**<br>
Can be translated into English as:<br>
**"A marriage will be saved by a campaign."**<br>
or:<br>
**"A marriage will save a campaign."**<br>
The two sentences have completely different meanings in English, but in Czech we cannot tell which one is meant without context.

In [17]:
ne_chunked = nltk.ne_chunk(tagged, binary=True)
name_entities_orig = extract_entities(ne_chunked)
name_entities_orig

{'Canadian': 'NE',
 'India': 'NE',
 'China': 'NE',
 'Canadian Pharmacists Association': 'NE',
 'Mina Tadrous': 'NE',
 'Toronto': 'NE',
 'U.S.': 'NE',
 'Tadrous': 'NE',
 'Canada': 'NE',
 'Queen': 'NE',
 'Kingston': 'NE',
 'North America': 'NE',
 'Kas': 'NE',
 'Duffin': 'NE',
 'refillsNew Brunswick': 'NE',
 'Europe': 'NE',
 'Yukon': 'NE',
 'Whitehorse': 'NE',
 'Bethany Church': 'NE',
 'Alaska Highway': 'NE',
 'Elias Dental': 'NE',
 'Paul': 'NE',
 'Senate': 'NE',
 'Washington': 'NE',
 'Republican': 'NE',
 'Senate Majority Leader Mitch': 'NE',
 'McConnell': 'NE',
 'Democratic': 'NE',
 'Senate Minority Leader Chuck Schumer': 'NE',
 'American': 'NE',
 'WATCH': 'NE',
 'Louisiana': 'NE',
 'Treasury': 'NE',
 'Steven Mnuchin': 'NE',
 'Donald Trump': 'NE',
 'New York': 'NE',
 'New York Gov': 'NE',
 'Andrew Cuomo': 'NE',
 'Washington Democrats': 'NE',
 'Congress': 'NE',
 'Democratic House': 'NE',
 'House': 'NE',
 'House Democratic': 'NE',
 'Trump': 'NE',
 'Capitol Hill': 'NE',
 'Tom': 'NE',
 'Demo

### Custom NER

We can define the pattern to follow with a regular expression over POS tags. Each part of speech has its own tag: for example, NN is a singular or mass noun, NNS is a plural noun, VB is a verb in base form, and so on. The full list is available at https://cs.nyu.edu/grishman/jet/guide/PennPOS.html

I tried several regular expressions and settled on the one below. Judging only by inspecting the output, it seemed to be the most precise one.

My regular expression is:<br>
&lt;DT&gt;? - optional determiner<br>
&lt;JJ&gt;* - zero or more adjectives<br>
&lt;NNP|NNPS&gt; - one proper noun, singular (NNP) or plural (NNPS)

In other words, a named entity is a proper noun that may be introduced by a determiner and preceded by any number of adjectives.

The regular expression would differ for a specific use case. For example, the pattern would need to be stricter if false positives could do much more damage than false negatives.
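
The effect of a small change in the pattern can be checked on a hand-tagged fragment. The sketch below compares the pattern used in this notebook with a variant that adds `+` after the proper-noun element; the variant is only an illustration and is not used elsewhere in the notebook. The tags are copied from the POS output above, and `chunk_texts` is a hypothetical helper.

```python
import nltk
from nltk.tree import Tree

# Hand-tagged fragment; tags copied from the POS output in this notebook.
tagged_fragment = [
    ("the", "DT"), ("Canadian", "NNP"), ("Pharmacists", "NNP"),
    ("Association", "NNP"), ("call", "NN"),
]

def chunk_texts(grammar, tagged_tokens):
    """Return the text of every chunk the grammar finds, in sentence order."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, tag in node.leaves())
        for node in tree
        if isinstance(node, Tree)
    ]

# Pattern from the notebook: exactly one proper noun per chunk.
single = chunk_texts("NE: {<DT>?<JJ>*<NNP|NNPS>}", tagged_fragment)
# Illustrative variant: one or more consecutive proper nouns per chunk.
multi = chunk_texts("NE: {<DT>?<JJ>*<NNP|NNPS>+}", tagged_fragment)

print(single)
print(multi)
```

With exactly one proper noun per chunk, a multi-word name is split into several entities, whereas the `+` variant keeps it together. Which behavior is preferable depends on the use case.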

In [18]:
grammar = "NE: {<DT>?<JJ>*<NNP|NNPS>}"
cp = nltk.RegexpParser(grammar)
custom_ne_chunked = cp.parse(tagged)
name_entities_custom = extract_entities(custom_ne_chunked)
name_entities_custom

{'India': 'NE',
 'China': 'NE',
 'COVID-19': 'NE',
 'Provincial': 'NE',
 'the Canadian': 'NE',
 'Pharmacists': 'NE',
 'Association': 'NE',
 'Mina': 'NE',
 'Tadrous': 'NE',
 'Toronto': 'NE',
 'worried Canadians': 'NE',
 'the U.S.': 'NE',
 'Canada': 'NE',
 'Ongoing': 'NE',
 'Dr.': 'NE',
 'Jacalyn': 'NE',
 'Duffin': 'NE',
 'Queen': 'NE',
 'University': 'NE',
 'Kingston': 'NE',
 'Ont.': 'NE',
 'North': 'NE',
 'America': 'NE',
 'Kas': 'NE',
 'Roussy/CBC': 'NE',
 'Brunswick': 'NE',
 'outbreak India': 'NE',
 'Europe': 'NE',
 'Whitehorse': 'NE',
 'Thursday': 'NE',
 '—': 'NE',
 'Sunday': 'NE',
 'Bethany': 'NE',
 'Church': 'NE',
 'the Alaska': 'NE',
 'Highway': 'NE',
 'March': 'NE',
 'Kids': 'NE',
 'Zone': 'NE',
 'Elias': 'NE',
 'Dental': 'NE',
 'the Yukon': 'NE',
 'Mar': 'NE',
 'Paul': 'NE',
 'Tukker/CBC': 'NE',
 'Wednesday': 'NE',
 'Yukon': 'NE',
 'Health': 'NE',
 'The Senate': 'NE',
 'late Wednesday': 'NE',
 'US': 'NE',
 'Washington': 'NE',
 'U.S.': 'NE',
 'Republican Senate': 'NE',
 'Majorit
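
One simple way to compare the two extractors is with set operations on the entity names. The sketch below uses only the first few names from each output above, copied by hand, so the resulting sets are illustrative and do not summarize the full results.

```python
# Names copied by hand from the first rows of the two outputs above.
builtin_entities = {
    "Canadian", "India", "China", "Canadian Pharmacists Association",
    "Mina Tadrous", "Toronto", "U.S.", "Tadrous", "Canada",
}
custom_entities = {
    "India", "China", "COVID-19", "Provincial", "the Canadian",
    "Pharmacists", "Association", "Mina", "Tadrous", "Toronto",
    "worried Canadians", "the U.S.", "Canada",
}

shared = builtin_entities & custom_entities        # found by both
only_builtin = builtin_entities - custom_entities  # found by ne_chunk only
only_custom = custom_entities - builtin_entities   # found by the custom grammar only

print(sorted(shared))
print(sorted(only_builtin))
print(sorted(only_custom))
```

On the full results, the same comparison could be run as `set(name_entities_orig) & set(name_entities_custom)`.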

## Entity Classification
Each entity is classified using the first sentence of its Wikipedia page. I defined another regular expression that finds a noun phrase in that first sentence.

I searched the Internet for a standard regular expression for noun phrases but found many different ones, so I concluded that the right expression also depends on the purpose or use case. I created my own: directly after a verb, it matches any number of determiners, any number of adjectives (now including comparative and superlative forms), and one or more nouns.

For each list of named entities, the loop stops after 20 successful lookups. Lookups that raise an exception are printed as Thing and do not count toward the limit.
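
The role of the verb in front of the noun phrase can be seen on a hand-tagged sentence. The sketch below is only an illustration: the sentence and its tags are invented stand-ins for a Wikipedia summary, and `first_chunk` is a hypothetical helper that mirrors what `extract_noun_phrase` returns.

```python
import nltk
from nltk.tree import Tree

# Hand-tagged sentence (hypothetical stand-in for a Wikipedia first sentence).
tagged_sentence = [
    ("India", "NNP"), ("is", "VBZ"), ("a", "DT"), ("country", "NN"),
    ("in", "IN"), ("South", "NNP"), ("Asia", "NNP"),
]

def first_chunk(grammar, tagged_tokens):
    """Return the text of the first chunk, or None if nothing matches."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    for node in tree:
        if isinstance(node, Tree):
            return " ".join(word for word, tag in node.leaves())
    return None

# <VB.?> outside the braces is context: it must precede the chunk
# but is not part of the chunk itself.
with_verb_context = first_chunk("NP: <VB.?>{<DT>*<JJ.*>*<NN.*>+}", tagged_sentence)
# Same chunk pattern without the verb context, for comparison.
without_context = first_chunk("NP: {<DT>*<JJ.*>*<NN.*>+}", tagged_sentence)

print(with_verb_context)
print(without_context)
```

Without the verb context, the first noun phrase is the entity name itself, which says nothing about its class. Requiring a preceding verb skips to the phrase that follows "is".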

In [20]:
def extract_noun_phrase(ne_chunked):
    for entity in ne_chunked:
        if isinstance(entity, nltk.tree.Tree):
            text = " ".join([word for word, tag in entity.leaves()])
            return text
        else:
            continue
    return "Thing"
    
import wikipedia

In [21]:
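count = 0
limit = 20

# The noun-phrase grammar is the same for every entity, so build the parser once.
grammar = "NP: <VB.?>{<DT>*<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

for name in name_entities_orig:
    try:
        page = wikipedia.page(name)
        # Classify from the first sentence of the page summary.
        sentence = nltk.sent_tokenize(page.summary)[0]
        summary_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        print(name + " ==> " + extract_noun_phrase(cp.parse(summary_tagged)))
        # Only successful lookups count toward the limit.
        count += 1
        if count >= limit:
            break
    except Exception:
        # Any failure (for example a missing or ambiguous page) falls back to Thing.
        print(name + " ==> Thing")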

Canadian ==> people
India ==> a country
China ==> a country
Canadian Pharmacists Association ==> Thing
Mina Tadrous ==> Thing
Toronto ==> the provincial capital
U.S. ==> a country
Tadrous ==> Thing
Canada ==> a country
Queen ==> Thing
Kingston ==> Thing
North America ==> a continent
Kas ==> the brand name
Duffin ==> a surname
refillsNew Brunswick ==> an informal name
Europe ==> a continent
Yukon ==> the Yukon
Whitehorse ==> the capital
Bethany Church ==> Μαρία
Alaska Highway ==> the contiguous United States
Elias Dental ==> the inferior dental nerve
Paul ==> Thing
Senate ==> a deliberative assembly
Washington ==> Thing
Republican ==> Thing
Senate Majority Leader Mitch ==> an American politician
McConnell ==> an American politician


In [22]:
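count = 0
limit = 20

# The noun-phrase grammar is the same for every entity, so build the parser once.
grammar = "NP: <VB.?>{<DT>*<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

for name in name_entities_custom:
    try:
        page = wikipedia.page(name)
        # Classify from the first sentence of the page summary.
        sentence = nltk.sent_tokenize(page.summary)[0]
        summary_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        print(name + " ==> " + extract_noun_phrase(cp.parse(summary_tagged)))
        # Only successful lookups count toward the limit.
        count += 1
        if count >= limit:
            break
    except Exception:
        # Any failure (for example a missing or ambiguous page) falls back to Thing.
        print(name + " ==> Thing")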

India ==> a country
China ==> a country
COVID-19 ==> Thing
Provincial ==> Thing
the Canadian ==> a transcontinental passenger train
Pharmacists ==> health professionals
Association ==> Thing
Mina ==> Thing
Tadrous ==> Thing
Toronto ==> the provincial capital
worried Canadians ==> Thing
the U.S. ==> a country
Canada ==> a country
Ongoing ==> Thing
Dr. ==> an American rapper
Jacalyn ==> a Canadian medical historian
Duffin ==> a surname
Queen ==> Thing
University ==> an institution
Kingston ==> Thing
Ont. ==> Thing
North ==> Thing
America ==> a country
Kas ==> the brand name
Roussy/CBC ==> achievements
Brunswick ==> Thing
outbreak India ==> Thing
Europe ==> a continent
Whitehorse ==> the capital
Thursday ==> the day


## Issues during the implementation
- I was not sure about the regular expressions for the NLTK parser. I tested several expressions and kept the ones that retrieved the most results.
- As noted above, without a concrete use case, the choice of regular expressions came down to my own preferences.

## Future work
- Better entity classification that uses at least the whole first paragraph of the Wikipedia page.
- Data cleansing with stemming and lemmatization to reduce the number of entities found. (I did not have to deal with this because my dataset contains many sentences on a wide range of topics, so named entities do not repeat much within the 500 selected sentences.)
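
As a much simpler cleanup than stemming or lemmatization, duplicate entities that differ only by a leading determiner could be merged. The sketch below is an assumption about one possible cleanup step, not something this notebook does. The names are copied from the custom chunker output above; `normalize_entity` and `LEADING_DETERMINERS` are hypothetical.

```python
# Names copied by hand from the custom chunker output in this notebook.
raw_entities = ["the Yukon", "Yukon", "the U.S.", "U.S.", "The Senate", "Toronto"]

# Hypothetical list of determiners to strip (each followed by a space).
LEADING_DETERMINERS = ("the ", "a ", "an ")

def normalize_entity(name):
    """Strip one leading determiner so 'the Yukon' and 'Yukon' share a key."""
    lowered = name.lower()
    for det in LEADING_DETERMINERS:
        if lowered.startswith(det):
            return name[len(det):]
    return name

merged = sorted({normalize_entity(name) for name in raw_entities})
print(merged)
```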