In [1]:
import os
from os import listdir
from os.path import isfile, join
import numpy as np
import pandas as pd
import spacy
import json
import nltk
from nltk.tokenize import word_tokenize
from cleantext import clean
nltk.download('punkt')
import re
import transformers
from transformers import AutoTokenizer
from collections import Counter
import random

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/romainbourgeois/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# This notebook differ from the other one in that it merges product-services with market (somewhat similar and sometime ambiguous entities)


The goal of this notebook id to pre-process the data to a target json format. The sequences must be split between training and testing sets. Sequences must be presented as lists of strings (tokens), each sequence must be associated with its own label.
We are modelling 6 target entities: firm, product/service, amount of money, person, market and location. This model allows for multi-words entities, labels will differentiate between the first word and the remaining words of a multi-word entity. Tokens that do not correspond to an entity will be assigned to a label. In total, 13 labels are targets for this NER model. 

The text preprocessing tasks are the following. First the texts are split around the "\n" tag. Lists are further seperated in sentences using nltk's sentence tokenizer method. These operations were computed so that our algorithm could process our data as sequences of sentences. Also, only the sentences incorporating a verb were kept. 
In addition, '?' '@' and '®' are removed from the dataset. This was important because some entities were followed by these characters. Although one could argue that question marks are very informative, there are extremely rare in financial documents. If we were to keep them, the algorithm would infer the entity type to be linked to the presence of the question mark and is hence not worth the trade off. Some entities were also fully put into uppercase and are hence put back to title format (keeping the first letter only). There is however a little trade-off to the extent to which we can remove upper-case letters except the first one: acronyms. 
English language is also constituted of "contractions" such as 'won't' instead of 'will not'. The major ones were removed in the off chance that it would make training faster. Unicode errors, urls, emails and phone numbers were also removed and replaced by some tags. In fact, the actual content of these informations do not matter for entity predictions and tag replacement will probably enable the algorithm to focus on more important information. Furthermore, I hesitated a lot to lowercase the text. I decided not to because it can be very informative when distinguishing a market from a Name for example. An argument against is that uppercase do not always occurr when it should and the opposite is also true. For example, some "firm" are displayed in complete uppercase. I should have written an appropriate function to fix this but unfortunately I did not.

Other preprocessing tasks could have been used. Unfortunately, I thought about it too late in the project. The first one would have been to replace the acronyms. Articles often define acronyms to longer nominal structures and use this symbols for the rest of the article. This coreference resolution problem could be tackled by writting a function that identifies and replaces these acronyms. Another way to tackle this problem is probably more interesting: add acronym definition to the list of relations and link the relations entities via graph knowledge inferences. It shall be noted that a new acronym entity shall be added to the list as well. Finally, another issue that has not been dealt with is the one where entities are stacked next to each other without punctiation. One solution to deal with this data mining issue would be to delete or seperate symmetric strings even though that could lead to errors with existing symmetric words. One could check if the candidate word actually exists in nltk dictionary for example.

The rest of the notebook formats the data the desired output.


In [2]:
def contractions(phrase):
    phrase = re.sub(r"won\'t", "will not", phrase) # 's could mean possession
    phrase = re.sub(r"won't", "will not", phrase)  
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    phrase = re.sub(r"wont", "will not", phrase)
    phrase = re.sub(r"dont", "do not", phrase)
    phrase = re.sub(r"werent", "were not", phrase)
    phrase = re.sub(r"'m", " am", phrase)

    return phrase

In [3]:
def cleanfunc(t):
    return clean(t,
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=False,  #if YES lowercase targets            # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=True,                  # replace all URLs with a special token
    no_emails=True,                # replace all email addresses with a special token
    no_phone_numbers=True,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuations
    replace_with_punct="",          # instead of removing punctuations you may replace them
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       
)

In [4]:
def rm_uppercase(token,n):
    token=str(token)
    if len(token)>=n:
        if token.isupper()==True:
            token_=token.lower()
            token_=token_.title()
            l=[]
            l.append(token)
            l.append(token_)
            return l

In [5]:
def preprocessText(data):
    for i in data:
        for j in i['annotation']:
            j['text']=j['text'].replace("?","")
            j['text']=j['text'].replace("@","")
            j['text']=j['text'].replace("®","")
            j['text']=contractions(j['text'])
            j['text']=cleanfunc(j['text'])
            if type(rm_uppercase(j['text'],0))==list:
                rm=rm_uppercase(j['text'],0)
                j['text']=j['text'].replace(str(rm[0]),str(rm[1]))
    return data

In [6]:
def tokenize_sents(text):
    split=text.split('\n')
    sents_list=[]
    for ss in split:
        sents=nltk.sent_tokenize(ss)
        for s_ in sents:
            s_=contractions(s_)
            s_=cleanfunc(s_)
            n=nlp(s_)
            d=False
            uppercases=[]
            for token in n:
                if type(rm_uppercase(token,0))==list: 
                    uppercases.append(rm_uppercase(token,0))
                if token.pos_=='VERB':
                    d=True
            if d==True:
                s_=s_.replace('@','')
                s_=s_.replace("®","")
                if len(uppercases)>0:
                    for i in range(len(uppercases)):
                        s_=s_.replace(str(uppercases[i][0]),str(uppercases[i][1]))
                sents_list.append(s_.replace('?','')) 

    return sents_list

In [7]:
def comp_data(data_,label_dict):
    data=[]
    for i in range(len(data_)):
        sents=tokenize_sents(data_[i]['document'])
        for j in sents:
            tokens=[]
            labels=[]
            jj=word_tokenize(j)
            jjj=0
            while jjj<len(jj):
                idx_label={}
                for l in range(len(data_[i]['annotation'])):
                    if word_tokenize(data_[i]['annotation'][l]['text'])[0]==jj[jjj]:
                        idx_label[str(l)]=len(word_tokenize(data_[i]['annotation'][l]['text']))
                sorted_=sorted(idx_label.items(), key=lambda item: item[1],reverse=True)
                global_decision=False
                for s in sorted_:
                    decision=False
                    for ss in range(0,int(float(s[1]))):
                        if s[1]>len(jj)-jjj:
                            break
                        elif word_tokenize(data_[i]['annotation'][int(float(s[0]))]['text'])[ss]==jj[jjj+ss]:    
                            decision=True
                        else:
                            decision=False
                    if decision==True:
                        global_decision=True
                        tokens.append(jj[jjj])
                        labels.append("B-"+label_dict[data_[i]['annotation'][int(float(s[0]))]['label']])
                        for s_ in range(1,s[1]):
                            tokens.append(jj[jjj+s_])
                            labels.append("I-"+label_dict[data_[i]['annotation'][int(float(s[0]))]['label']])
                        jjj=jjj+s[1]
                    else:
                        continue
                if global_decision==False:
                    tokens.append(jj[jjj])
                    labels.append('O')
                    jjj=jjj+1
            data.append([tokens,labels])
    return data

In [8]:
def traintestsplit(datalabels,label_list):
    idx = {x:i for i,x in enumerate(label_list)} 
    tr=random.sample(range(0, len(datalabels)), round(0.85*len(datalabels)))
    tr.sort()
    tt=[]
    for i in range(len(datalabels)):
        if i not in tr:
            tt.append(i)  
    dataset_train=[]
    j=0
    for i in tr:
        d={}
        d['index']=j
        d['tokens']=datalabels[i][0]
        d['label']=[idx[x] for x in datalabels[i][1] if x in idx]
        dataset_train.append(d)
        j=j+1

    dataset_test=[]
    j=0
    for i in tt:
        d={}
        d['index']=j
        d['tokens']=datalabels[i][0]
        d['label']=[idx[x] for x in datalabels[i][1] if x in idx]
        dataset_test.append(d)
        j=j+1

    dtr={}
    dtr['data']=dataset_train
    dtt={}
    dtt['data']=dataset_test

    return dtr, dtt

In [10]:
f = open('NERdata.json')
data = json.load(f)
nlp = spacy.load("en_core_web_trf")
data=preprocessText(data)
label_dict={'FIRM':"ORG",'PRODUCT-SERVICE':'PSM','AMOUNT':'AMNT','PERSON':'PSN', 'MARKET':'PSM','LOCATION':'LOC'}
label_list=['O', 'B-ORG','I-ORG','B-PSM','I-PSM','B-AMNT','I-AMNT','B-PSN','I-PSN','B-LOC','I-LOC']
datalabels=comp_data(data,label_dict)
dtr, dtt=traintestsplit(datalabels,label_list)

In [11]:
countlabels=[]
for i in datalabels:
    for j in i[1]:
        countlabels.append(j)

pd.DataFrame(list(Counter(countlabels).values()), index=list(Counter(countlabels).keys()))

Unnamed: 0,0
B-ORG,901
O,25140
B-PSM,912
I-PSM,873
B-LOC,139
B-PSN,111
I-PSN,123
I-ORG,410
B-AMNT,47
I-AMNT,102


In [12]:
with open("traindataNER_merged.json", "w") as final:
   json.dump(dtr, final)

with open("testdataNER_merged.json", "w") as final:
   json.dump(dtt, final)