Generate new examples based on this dataset: 
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

This notebook takes the NER dataset from the link above and creates templates (utterances with placeholders) that a PII synthetic data generator can fill in to create new sentences.
Note that, because of how the dataset is tagged, some generated sentences may look odd. For example:

- The same entity appears multiple times in a sentence: "I travel from Argentina to Argentina"
- Bad grammar, because nouns are not inflected or adapted to their context: "*The statement said no Denmark or India-led troops were killed*" instead of "*The statement said no Danish or Indian led troops were killed*"
- Unrealistic sentences due to change in entities: "Prime minister Lebron James enters the government building in Kuala Lumpur"
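
To make the first failure mode concrete, here is a minimal sketch of how a generator might fill a template. The `fill_template` helper and the `SAMPLE_VALUES` lists are hypothetical illustrations, not the actual PII data generator. Because each placeholder is sampled independently, nothing prevents the same value from being drawn twice.

```python
import random
import re

# Hypothetical sample values; a real generator would draw from much larger lists.
SAMPLE_VALUES = {
    "COUNTRY": ["Argentina", "Denmark", "India"],
    "PERSON": ["David Scott", "Lebron James"],
}

def fill_template(template, values, rng):
    """Replace every [ENTITY] placeholder with an independently sampled value."""
    def pick(match):
        # match.group(1) is the entity name between the square brackets
        return rng.choice(values[match.group(1)])
    return re.sub(r"\[([A-Z]+)\]", pick, template)

rng = random.Random(0)
sentence = fill_template("I travel from [COUNTRY] to [COUNTRY]", SAMPLE_VALUES, rng)
print(sentence)
```

Sampling without replacement per sentence would be one way to avoid the repeated-entity case, at the cost of a slightly more involved generator.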


The notebook additionally introduces two new entities, TITLE and ROLE, to avoid outputs like "UK David Scott called his wife" when the original sentence is "UK Prime Minister Boris Johnson called his wife". This happens because "Prime Minister" is tagged as part of the PER span in the original dataset. The same logic applies to titles such as Mr., Mrs., and Ms.
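
The idea can be sketched on a plain list of `(word, tag)` pairs. This is a simplified illustration, not the notebook's implementation: `TITLE_WORDS`, `ROLE_WORDS`, and `split_title_or_role` are hypothetical names, and only single-token titles and roles are handled.

```python
# Hypothetical, simplified stand-ins for the title and role lists used later on.
TITLE_WORDS = {"Mr.", "Mrs.", "Ms."}
ROLE_WORDS = {"President", "Senator"}

def split_title_or_role(tokens):
    """Retag a leading title/role in a PER span and restart the span at the name.

    tokens is a list of (word, tag) pairs using the dataset's B-/I- scheme.
    The input list is left untouched; a retagged copy is returned.
    """
    fixed = list(tokens)
    for i in range(len(fixed) - 1):
        word, tag = fixed[i]
        next_word, next_tag = fixed[i + 1]
        if tag == "B-per" and next_tag == "I-per":
            if word in TITLE_WORDS:
                new_tag = "B-ttl"
            elif word in ROLE_WORDS:
                new_tag = "B-rol"
            else:
                continue
            fixed[i] = (word, new_tag)
            # The token after the title/role now opens the PER span
            fixed[i + 1] = (next_word, "B-per")
    return fixed

tokens = [("President", "B-per"), ("Mahmoud", "I-per"),
          ("Ahmadinejad", "I-per"), ("said", "O")]
retagged = split_title_or_role(tokens)
print(retagged)
```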

In [1]:
import pandas as pd
import numpy as np

In [2]:
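# First, download ner.csv from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
# Note: error_bad_lines is deprecated since pandas 1.3 and removed in 2.0;
# on newer versions pass on_bad_lines="warn" instead.
ner_dataset = pd.read_csv("ner.csv", encoding="ISO-8859-1", error_bad_lines=False)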

b'Skipping line 281837: expected 25 fields, saw 34\n'


In [3]:
ner_dataset.columns

Index(['Unnamed: 0', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'pos', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word', 'sentence_idx', 'shape',
       'word', 'tag'],
      dtype='object')

In [4]:
len(ner_dataset)

1050795

In [5]:
ner_dataset = ner_dataset.drop_duplicates()
len(ner_dataset)

768960

Example sentence:

In [6]:
ner_dataset[ner_dataset['sentence_idx']==13]

Unnamed: 0.1,Unnamed: 0,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
267,267,iran,'s,new,JJ,lowercase,new,POS,other,'s,...,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,13.0,capitalized,Iran,B-gpe
268,268,'s,new,presid,NNP,capitalized,President,JJ,lowercase,new,...,__start1__,__START1__,wildcard,__START1__,capitalized,Iran,13.0,other,'s,O
269,269,new,presid,mahmoud,NNP,capitalized,Mahmoud,NNP,capitalized,President,...,iran,NNP,capitalized,Iran,other,'s,13.0,lowercase,new,O
270,270,presid,mahmoud,ahmadinejad,NNP,capitalized,Ahmadinejad,NNP,capitalized,Mahmoud,...,'s,POS,other,'s,lowercase,new,13.0,capitalized,President,B-per
271,271,mahmoud,ahmadinejad,said,VBD,lowercase,said,NNP,capitalized,Ahmadinejad,...,new,JJ,lowercase,new,capitalized,President,13.0,capitalized,Mahmoud,I-per
272,272,ahmadinejad,said,tuesday,NNP,capitalized,Tuesday,VBD,lowercase,said,...,presid,NNP,capitalized,President,capitalized,Mahmoud,13.0,capitalized,Ahmadinejad,I-per
273,273,said,tuesday,that,IN,lowercase,that,NNP,capitalized,Tuesday,...,mahmoud,NNP,capitalized,Mahmoud,capitalized,Ahmadinejad,13.0,lowercase,said,O
274,274,tuesday,that,european,JJ,capitalized,European,IN,lowercase,that,...,ahmadinejad,NNP,capitalized,Ahmadinejad,lowercase,said,13.0,capitalized,Tuesday,B-tim
275,275,that,european,incent,NNS,lowercase,incentives,JJ,capitalized,European,...,said,VBD,lowercase,said,capitalized,Tuesday,13.0,lowercase,that,O
276,276,european,incent,aim,VBN,lowercase,aimed,NNS,lowercase,incentives,...,tuesday,NNP,capitalized,Tuesday,lowercase,that,13.0,capitalized,European,B-gpe


### New entities - Title and Role

- **Title**: Mr., Mrs., Professor, Doctor, ...
- **Role**: President, Secretary General, U.N. Secretary, ...

Quick exploratory analysis of frequencies:
- First PER token
- Second PER token
- First and second PER token
- Token before PER together with the first PER token
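
As a small illustration of the counting approach, the sketch below runs the same kind of unigram and bigram counts on a handful of made-up rows. The `rows` data is hypothetical and only mimics the `prev-word`, `word`, and `next-word` columns.

```python
from collections import Counter

# Hypothetical toy rows: (previous word, first PER token, next word).
rows = [
    ("__START1__", "Mr.", "Bush"),
    (",", "Mr.", "Bush"),
    ("Israeli", "Prime", "Minister"),
    ("former", "Prime", "Minister"),
    ("__START1__", "Mr.", "Chavez"),
]

# Transpose the rows into three parallel tuples, one per column
prev_words, first_tokens, next_words = zip(*rows)

first_counts = Counter(first_tokens)
prev_first_counts = Counter(zip(prev_words, first_tokens))
first_second_counts = Counter(zip(first_tokens, next_words))

print(first_counts.most_common(2))
print(prev_first_counts.most_common(2))
print(first_second_counts.most_common(2))
```

Note that `zip` returns a one-shot iterator, so a zipped pair stream can be counted only once unless it is first materialized with `list`.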

In [7]:
# Collect the first token of each PER span (B-per) and its neighbouring words
bper = ner_dataset[ner_dataset['tag']=='B-per']
bper_tokens = bper['word']
prev_bper_token = bper['prev-word']
next_bper_token = bper['next-word']
two_prev_tokens = zip(prev_bper_token, bper_tokens)
two_next_tokens = zip(bper_tokens, next_bper_token)

In [8]:
from collections import Counter
print("20 most common PER token frequencies:")
Counter(bper_tokens).most_common(20)

20 most common PER token frequencies:


[('Mr.', 2261),
 ('President', 1750),
 ('Prime', 567),
 ('Ms.', 137),
 ('Minister', 135),
 ('John', 116),
 ('General', 103),
 ('Saddam', 94),
 ('Senator', 84),
 ('Secretary', 74),
 ('Obama', 63),
 ('Condoleezza', 59),
 ('Kofi', 58),
 ('King', 56),
 ('Mahmoud', 54),
 ('Sunni', 53),
 ('Bush', 51),
 ('Ali', 47),
 ('Osama', 45),
 ('Vice', 44)]

In [9]:
print("20 most common previous and first PER token frequencies:")
Counter(two_prev_tokens).most_common(20)

20 most common previous and first PER token frequencies:


[(('__START1__', 'Mr.'), 1056),
 (('__START1__', 'President'), 307),
 ((',', 'Mr.'), 293),
 (('__START1__', 'Ms.'), 73),
 (('of', 'President'), 71),
 ((',', 'President'), 71),
 (('Venezuelan', 'President'), 70),
 (('U.S.', 'President'), 65),
 (('Israeli', 'Prime'), 57),
 (('State', 'Condoleezza'), 57),
 (('Russian', 'President'), 55),
 (('that', 'Mr.'), 55),
 (('Secretary-General', 'Kofi'), 55),
 (('of', 'Mr.'), 54),
 (('Foreign', 'Minister'), 54),
 (('Palestinian', 'President'), 53),
 (("'s", 'President'), 52),
 (('said', 'Mr.'), 50),
 (('with', 'President'), 48),
 (('former', 'Prime'), 48)]

In [10]:
print("20 most common first and second PER token frequencies:")
Counter(two_next_tokens).most_common(20)

20 most common first and second PER token frequencies:


[(('Prime', 'Minister'), 563),
 (('President', 'Bush'), 354),
 (('Mr.', 'Bush'), 233),
 (('President', 'Mahmoud'), 122),
 (('Mr.', 'Chavez'), 107),
 (('President', 'Hugo'), 95),
 (('Mr.', 'Abbas'), 67),
 (('President', 'Hamid'), 66),
 (('Saddam', 'Hussein'), 62),
 (('Mr.', 'Sharon'), 60),
 (('Condoleezza', 'Rice'), 59),
 (('Kofi', 'Annan'), 58),
 (('President', 'Pervez'), 55),
 (('President', 'Vladimir'), 52),
 (('Mr.', 'Yushchenko'), 48),
 (('Mr.', 'Obama'), 44),
 (('Mr.', 'Annan'), 44),
 (('President', 'Barack'), 43),
 (('Osama', 'bin'), 42),
 (('Mr.', 'Ahmadinejad'), 41)]

In [11]:
# Titles and roles to retag as ttl and rol respectively
TITLES = ['Mr.','Ms.','Mrs.']
ROLES = ['President','General','Senator','Secretary-General','Minister']
BIGRAMS_ROLES = [('Prime','Minister'),('prime','minister'),('U.S.','President'),
                 ('Venezuelan', 'President'),('Vice','President'), ('Foreign', 'Minister'),
                 ('U.S.','Secretary'),('U.N.','Secretary'),('Defence','Secretary')]


In [12]:
# Split titles and roles out of PER spans for the most common cases

def fix_bigram_title(df, row, index, first='Prime', second='Minister', tag='ttl'):
    """Retag a two-token title/role (e.g. 'Prime Minister') that was tagged as part of a PER span."""
    if row['word'] == first and row['next-word'] == second and 'per' in row['tag']:
        # First token of the bigram opens the new span
        df.loc[index,'tag'] = 'B-{}'.format(tag)
    elif row['word'] == second and row['prev-word'] == first and 'per' in row['tag']:
        # Second token of the bigram continues it
        df.loc[index,'tag'] = 'I-{}'.format(tag)
    elif row['tag'] == 'I-per' and row['prev-word'] == second:
        # The token that follows `second` now starts the PER span
        df.loc[index,'tag'] = 'B-per'

def fix_unigram_title(df, prev_row, prev_index, row, index, title='President', tag='ttl'):
    """Retag a single-token title/role that opens a PER span and restart the span at the next token."""
    if prev_row['word'] == title and prev_row['tag'] == 'B-per' and row['tag'] == 'I-per':
        df.loc[prev_index,'tag'] = 'B-{}'.format(tag)
        df.loc[index,'tag'] = 'B-per'

prev_row = None
prev_index = None
for index, row in ner_dataset.iterrows():
    # Handle two-token roles such as 'Prime Minister'
    for bigram in BIGRAMS_ROLES:
        fix_bigram_title(ner_dataset,row,index,bigram[0],bigram[1],'rol')

    if prev_row is not None:
        for title in TITLES:
            fix_unigram_title(df=ner_dataset,prev_row=prev_row,prev_index=prev_index,row=row,index=index,title=title,tag='ttl')
        for role in ROLES:
            fix_unigram_title(ner_dataset,prev_row,prev_index,row,index,role,'rol')

    prev_row = row
    prev_index = index

In [13]:
ner_dataset[ner_dataset['sentence_idx']==13][['sentence_idx','word','tag','prev-word','prev-prev-word','next-word']]

Unnamed: 0,sentence_idx,word,tag,prev-word,prev-prev-word,next-word
267,13.0,Iran,B-gpe,__START1__,__START2__,'s
268,13.0,'s,O,Iran,__START1__,new
269,13.0,new,O,'s,Iran,President
270,13.0,President,B-rol,new,'s,Mahmoud
271,13.0,Mahmoud,B-per,President,new,Ahmadinejad
272,13.0,Ahmadinejad,I-per,Mahmoud,President,said
273,13.0,said,O,Ahmadinejad,Mahmoud,Tuesday
274,13.0,Tuesday,B-tim,said,Ahmadinejad,that
275,13.0,that,O,Tuesday,said,European
276,13.0,European,B-gpe,that,Tuesday,incentives


In [14]:
# keep only relevant columns
dataset = ner_dataset[['sentence_idx','word','tag']]

In [15]:
dataset.to_csv("../../../datasets/ner_with_titles.csv")

### Create templates based on the NER dataset

In [16]:
import re
class SentenceGetter(object):
    
    def __init__(self, dataset):
        self.n_sent = 1
        self.dataset = dataset
        self.empty = False
        agg_func = lambda s: [(w, t) for w,t in zip(s["word"].values.tolist(),
                                                        s["tag"].values.tolist())]
        self.grouped = self.dataset.groupby("sentence_idx").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        # sentence_idx is numeric, so sentences are looked up by number
        try:
            s = self.grouped.loc[self.n_sent]
            self.n_sent += 1
            return s
        except KeyError:
            return None
    
    @staticmethod
    def get_template(grouped, entity_name_replace_dict=None):
        """Turn a list of (word, tag) pairs into a template with [ENTITY] placeholders."""
        TAGS_TO_IGNORE = ['nat','eve','art','tim']
        template = ""
        ents = []
        for token in grouped:
            # Strip square brackets so they cannot be confused with placeholders
            token_text = token[0].replace("[", "").replace("]","")
            token_tag = token[1]
            if token_tag == 'O':
                template += " " + token_text
            elif 'B-' in token_tag and token_tag[2:] not in TAGS_TO_IGNORE:
                # One placeholder per entity span; I- tokens and ignored tags are dropped
                if entity_name_replace_dict:
                    ent = entity_name_replace_dict[token_tag[2:]]
                else:
                    ent = token_tag[2:]
                ents.append(ent)
                template += " [" + ent + "]"
        template = re.sub(r'\s([?,\':.!"](?:|$))+', r'\1', template)
        
        for ent in ents:
            weird = "[{}] [{}]".format(ent,ent)
            template = template.replace(weird,"[{}]".format(ent))
        
        #remove additional weird combinations:
        
        to_replace = {
            "[COUNTRY] [ROLE] [PERSON]": "[ROLE] [PERSON]",
            "[COUNTRY] [ROLE]" : "[ROLE]",
            "[ORGANIZATION] [ROLE] [PERSON]" : "[ORGANIZATION]'s [ROLE] [PERSON]",
            "[COUNTRY] [LOCATION]" : "[LOCATION]",
            "[LOCATION] [COUNTRY]": "[LOCATION]",
            "[PERSON] [COUNTRY]" : "[PERSON]",
            "[PERSON] [LOCATION]" : "[PERSON]",
            "[COUNTRY] [PERSON]" : "[PERSON]",
            "[LOCATION] [PERSON]" : "[PERSON]",
            "The [ORGANIZATION]" : "[ORGANIZATION]",
            "[PERSON] [ORGANIZATION]" : "[PERSON]",
            "of [ORGANIZATION] [PERSON]" : "of [ORGANIZATION], [PERSON]",
            "[ORGANIZATION] [PERSON]" : "[PERSON]",
            "[PERSON] [PERSON]": "[PERSON]",
            "[LOCATION] says" : "[PERSON] says",
            "[LOCATION] said" : "[PERSON] said"
            
            
        }
        
        for weird in to_replace.keys():
            template = template.replace(weird,to_replace[weird])
        
        return template.strip()
    
getter = SentenceGetter(dataset)
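
The duplicate-collapsing step inside `get_template` can be looked at in isolation. The sketch below uses a hypothetical standalone helper, `collapse_repeats`, that mirrors that loop: one pass per collected entity, each pass merging adjacent identical placeholders.

```python
def collapse_repeats(template, entities):
    """Merge adjacent identical placeholders, e.g. "[PERSON] [PERSON]" -> "[PERSON]".

    entities holds one entry per placeholder occurrence, so a run of three
    identical placeholders is fully collapsed after repeated passes.
    """
    for ent in entities:
        doubled = "[{0}] [{0}]".format(ent)
        template = template.replace(doubled, "[{}]".format(ent))
    return template

collapsed = collapse_repeats("[PERSON] [PERSON] met in [LOCATION]",
                             ["PERSON", "PERSON", "LOCATION"])
print(collapsed)
```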

In [17]:
ENTITIES_DICTIONARY = {"per":"PERSON","gpe":"COUNTRY","geo":"LOCATION","org":"ORGANIZATION",'ttl':'TITLE','rol':'ROLE'}

sentences = getter.sentences
print("original:",sentences[12])
print("template:", getter.get_template(sentences[12],entity_name_replace_dict=ENTITIES_DICTIONARY))

original: [('Iran', 'B-gpe'), ("'s", 'O'), ('new', 'O'), ('President', 'B-rol'), ('Mahmoud', 'B-per'), ('Ahmadinejad', 'I-per'), ('said', 'O'), ('Tuesday', 'B-tim'), ('that', 'O'), ('European', 'B-gpe'), ('incentives', 'O'), ('aimed', 'O'), ('at', 'O'), ('persuading', 'O'), ('Iran', 'B-gpe'), ('to', 'O'), ('end', 'O'), ('its', 'O'), ('nuclear', 'O'), ('fuel', 'O'), ('program', 'O'), ('are', 'O'), ('an', 'O'), ('insult', 'O'), ('to', 'O'), ('the', 'O'), ('Iranian', 'B-gpe'), ('nation', 'O'), ('.', 'O')]
template: [COUNTRY]'s new [ROLE] [PERSON] said that [COUNTRY] incentives aimed at persuading [COUNTRY] to end its nuclear fuel program are an insult to the [COUNTRY] nation.
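
A quick sanity check on a generated template is to list the placeholders it contains. The sketch below is not part of the original notebook; it assumes placeholders are always upper-case names in square brackets, as in the output above, and `placeholder_counts` is an illustrative name.

```python
import re
from collections import Counter

template = ("[COUNTRY]'s new [ROLE] [PERSON] said that [COUNTRY] incentives aimed at "
            "persuading [COUNTRY] to end its nuclear fuel program are an insult to "
            "the [COUNTRY] nation.")

# Capture the entity name inside each pair of square brackets
placeholders = re.findall(r"\[([A-Z]+)\]", template)
placeholder_counts = Counter(placeholders)
print(placeholder_counts)
```

A template with many placeholders of the same type, like this one, is more likely to produce the repeated-entity sentences mentioned at the top of the notebook.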


In [18]:
dataset.columns

Index(['sentence_idx', 'word', 'tag'], dtype='object')

In [19]:
new_templates = [SentenceGetter.get_template(sentence, ENTITIES_DICTIONARY) for sentence in sentences]
new_templates[:5]

['Thousands of demonstrators have marched through [LOCATION] to protest the war in [LOCATION] and demand the withdrawal of [COUNTRY] troops from that country.',
 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as" [PERSON] Number One Terrorist" and" Stop the Bombings."',
 'They marched from the Houses of Parliament to a rally in [LOCATION].',
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000.',
 "The protest comes on the eve of the annual conference of [LOCATION]'s ruling [ORGANIZATION] in the southern [COUNTRY] seaside resort of [LOCATION]."]

In [20]:
# save to file

with open("../raw_data/kaggle_based_templates.txt","w+", encoding='utf-8') as f:
    for template in new_templates:
        f.write("%s\n" % template)
        

In [21]:
np.unique(dataset['tag'].values.astype('str'))

array(['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per',
       'B-rol', 'B-tim', 'B-ttl', 'I-art', 'I-eve', 'I-geo', 'I-gpe',
       'I-nat', 'I-org', 'I-per', 'I-rol', 'I-tim', 'O', 'nan'],
      dtype='<U5')

In [22]:
dataset[(dataset['tag'] == 'B-rol')]['word']

270        President
443        President
1861         Foreign
3003           Prime
3009       President
             ...    
1049991      Senator
1050029    President
1050349    President
1050354    President
1050449     Minister
Name: word, Length: 2632, dtype: object