TEFE - TimeBankPT Event Frame Extraction

[![Github](https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/FORMAS/TEFE)

[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)](https://hub.docker.com/r/andersonsacramento/tefe)



# DESCRIPTION
TEFE is a closed-domain event extraction system for sentences in Portuguese. It performs event detection (i.e., event trigger identification and classification) and argument role prediction (i.e., argument identification and role classification). The event types are based on the typology of the FrameNet project (BAKER; FILLMORE; LOWE, 1998). The models were trained on an enriched TimeBankPT corpus (COSTA; BRANCO, 2012).


Five trained models are currently available to run in this Colab, selected with the `model_config` values `0`, `100`, `0-0`, `100-0`, and `100-100`. They correspond, respectively, to 514 event types (ET) and 1936 argument roles (AR); 7 ET and 93 AR; 214 ET and 477 AR; 5 ET and 42 AR; and 5 ET and 12 AR.
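
For quick reference, the correspondence between models and label inventories can be written as a lookup table. This is only an illustrative sketch: `MODEL_SIZES` and `describe` are hypothetical names that are not part of TEFE, and the keys follow the values offered by the model selector in the run cells.

```python
# Hypothetical summary table: config key -> (event types, argument roles).
MODEL_SIZES = {
    "0": (514, 1936),
    "100": (7, 93),
    "0-0": (214, 477),
    "100-0": (5, 42),
    "100-100": (5, 12),
}


def describe(config: str) -> str:
    """Return a one-line description of a model configuration."""
    event_types, arg_roles = MODEL_SIZES[config]
    return f"model {config}: {event_types} event types, {arg_roles} argument roles"


print(describe("100"))
```

Smaller inventories mean fewer labels to choose from, so the choice of model is a trade-off between label granularity and coverage.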

## How to cite this work

Peer-reviewed accepted paper:


* Sacramento, A., Souza, M.: Joint Event Extraction with Contextualized Word Embeddings for the Portuguese Language. In: 10th Brazilian Conference on Intelligent Systems, BRACIS, São Paulo, Brazil, from November 29 to December 3, 2021.


# Download and locate BERTimbau Base model and TEFE model files

In [1]:
!pip install gdown



In [2]:
!gdown "https://drive.google.com/uc?id=1lEhJK2gpD8ep7N3KPtFbNPzOB4gQXZX6" --output tefe.zip
!unzip tefe.zip

Downloading...
From: https://drive.google.com/uc?id=1lEhJK2gpD8ep7N3KPtFbNPzOB4gQXZX6
To: /content/tefe.zip
190MB [00:01, 152MB/s]
Archive:  tefe.zip
   creating: models/
  inflating: models/blstmea_0.h5     
  inflating: models/blstme_100_100.h5  
  inflating: models/blstme_100.h5    
  inflating: models/blstmeat2_100_0.h5  
  inflating: models/blstme_0_0.h5    
   creating: res/
  inflating: res/args_by_pos_types_12.json  
  inflating: res/args_by_pos_types_477.json  
  inflating: res/args_by_pos_types_42.json  
  inflating: res/events_by_pos_types_7.json  
  inflating: res/events_by_pos_types_514.json  
  inflating: res/args_by_pos_types_1936.json  
  inflating: res/args_by_pos_types_93.json  
  inflating: res/events_by_pos_types_5.json  
  inflating: res/events_by_pos_types_214.json  


In [3]:
!gdown "https://drive.google.com/uc?id=1qIR2GKpBqB-sOmX0Q5j1EQ6NSugYMCsX" --output bertimbau.zip

Downloading...
From: https://drive.google.com/uc?id=1qIR2GKpBqB-sOmX0Q5j1EQ6NSugYMCsX
To: /content/bertimbau.zip
1.21GB [00:07, 158MB/s]


In [4]:
!mv bertimbau.zip models/
!unzip models/bertimbau.zip -d models/
!rm models/bertimbau.zip

Archive:  models/bertimbau.zip
  inflating: models/BERTimbau/bert_model.ckpt.index  
  inflating: models/BERTimbau/bert_config.json  
  inflating: models/BERTimbau/vocab.txt  
  inflating: models/BERTimbau/bert_model.ckpt.meta  
  inflating: models/BERTimbau/bert_model.ckpt.data-00000-of-00001  


# Load TEFE code

## Install dependencies

In [5]:
!pip install "tensorflow>=2.6.0"
!pip install "keras-bert>=0.88"
!pip install numpy
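
To confirm that the installed packages meet the minimum versions requested above, a check along the following lines may help. It is a sketch only: `version_tuple`, `meets_minimum`, and `check_requirements` are hypothetical helpers, and it assumes the distributions are registered under the names `tensorflow` and `keras-bert`, as in the `pip` commands.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Convert '2.6.0' or '2.15.0.post1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(requirements: dict) -> dict:
    """Map each distribution to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in requirements.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_requirements({"tensorflow": "2.6.0", "keras-bert": "0.88"}))
```

A `None` entry in the report means the package was not found in the current environment.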



## Load functions

In [6]:
import sys
import os
import numpy as np
import re
import json
import glob

from keras_bert import load_vocabulary, load_trained_model_from_checkpoint, Tokenizer, get_checkpoint_paths
import tensorflow as tf
from tensorflow.keras.models import load_model

BERTIMBAU_MODEL_PATH = 'models/BERTimbau/'
EMBEDDING_ID = 'last_hidden'


RUN_CONFIGS = {
        '0':       {'model':        'models/blstmea_0.h5',
                    'events-types': 'res/events_by_pos_types_514.json',
                    'args-types':   'res/args_by_pos_types_1936.json'},
        '100':     {'model':        'models/blstme_100.h5',
                    'events-types': 'res/events_by_pos_types_7.json',
                    'args-types':   'res/args_by_pos_types_93.json'},
        '0-0':     {'model':        'models/blstme_0_0.h5',
                    'events-types': 'res/events_by_pos_types_214.json',
                    'args-types':   'res/args_by_pos_types_477.json'},
        '100-0':   {'model':        'models/blstmeat2_100_0.h5',
                    'events-types': 'res/events_by_pos_types_5.json',
                    'args-types':   'res/args_by_pos_types_42.json'},
        '100-100': {'model':        'models/blstme_100_100.h5',
                    'events-types': 'res/events_by_pos_types_5.json',
                    'args-types':   'res/args_by_pos_types_12.json'}}

DEFAULT_RUN_CONFIG = '100'

def tokenize_and_compose(text):
        tokens = tokenizer.tokenize(text)
        text_tokens = []
        for i, token in enumerate(tokens):
            split_token = token.split("##")
            if len(split_token) > 1:
                token = split_token[1]
                text_tokens[-1] += token
            else:
                text_tokens.append(token)
        if len(text_tokens[1:-1]) == 1:
          return text_tokens[1]
        else:
          return text_tokens[1:-1]


def compose_token_embeddings(sentence, tokenized_text, embeddings):
        tokens_indices_composed = [0] * len(tokenized_text)
        j = -1
        for i, x in enumerate(tokenized_text):
            if x.find('##') == -1:
                j += 1
            tokens_indices_composed[i] = j
        word_embeddings = [0] * len(set(tokens_indices_composed))
        j = 0
        for i, embedding in enumerate(embeddings):
            if j == tokens_indices_composed[i]:
                word_embeddings[j] = embedding
                j += 1
            else:
                word_embeddings[j - 1] += embedding
        return word_embeddings

def extract(text, options={'sum_all_12':True}, seq_len=512, output_layer_num=12):
        features = {k:v for (k,v) in options.items() if v}
        tokens = tokenizer.tokenize(text)
        indices, segments = tokenizer.encode(first = text, max_len = seq_len)
        predicts = model_bert.predict([np.array([indices]), np.array([segments])])[0]
        predicts = predicts[1:len(tokens)-1,:].reshape((len(tokens)-2, output_layer_num, 768))

        for (k,v) in features.items():
            if k == 'sum_all_12':
                features[k] = compose_token_embeddings(text, tokens[1:-1], predicts.sum(axis=1))
            if k == 'sum_last_4':
                features[k] = compose_token_embeddings(text, tokens[1:-1], predicts[:,-4:,:].sum(axis=1))
            if k == 'concat_last_4':
                features[k] = compose_token_embeddings(text, tokens[1:-1], predicts[:,-4:,:].reshape((len(tokens)-2,768*4)))
            if k == 'last_hidden':
                features[k] = compose_token_embeddings(text, tokens[1:-1], predicts[:,-1:,:].reshape((len(tokens)-2, 768)))
        return features



def get_sentence_original_tokens(sentence, tokens):
        token_index = 0
        started = False
        sentence_pos_tokens = []
        i = 0
        while i < len(sentence):
                if sentence[i] != ' ' and not started:
                        start = i
                        started = True
                if sentence[i] == tokens[token_index] and started:
                        sentence_pos_tokens.append(sentence[i])
                        started = False
                        token_index += 1
                elif i<len(sentence) and (sentence[i] == ' ' or tokenize_and_compose(sentence[start:i+1]) == tokens[token_index] ) and started:
                        sentence_pos_tokens.append(sentence[start:i+1])
                        start = i+1
                        started = False
                        token_index += 1
                i += 1
        return sentence_pos_tokens


def get_text_location(text, arg, start_search_at=0):
        text = text.lower()
        arg = arg.lower()
        # Escape the token so regex metacharacters (e.g. parentheses) are matched literally
        pattern = re.compile(r'\b%s\b' % re.escape(arg))
        match = pattern.search(text, start_search_at)
        if match:
                return (match.start(), match.end())
        else:
                return (-1, -1)


def get_args_from_labels(label_args, is_arp=True):
        args = []
        cur_arg = []
        started_arg = False
        fn_normalize_label = lambda cur_label : cur_label[-1] if cur_label[-1] <= len(args_types) else cur_label[-1] - len(args_types)
        for i,label in enumerate(label_args):
                if not started_arg and label != 0 and label <= len(args_types):
                        cur_arg.append((i, label if is_arp else 1))
                        started_arg = True
                elif started_arg and label != 0 and label > len(args_types):
                        last_label = fn_normalize_label(cur_arg[-1])
                        if label-len(args_types) != last_label and is_arp:
                                cur_arg = []
                                started_arg = False
                        else:
                                cur_arg.append((i,label if is_arp else 1))
                elif started_arg and label == 0:
                        args.append(tuple(cur_arg))
                        cur_arg = []
                        started_arg = False
                elif started_arg and  label <= len(args_types):
                        args.append(tuple(cur_arg))
                        cur_arg = []
                        cur_arg.append((i, label if is_arp else 1))
                        started_arg = True
        if cur_arg:
                args.append(tuple(cur_arg))
        return args


def extract_events(text, feature_option, is_pprint=True):
        text_tokens = get_sentence_original_tokens(text, tokenize_and_compose(text))
        features = extract(text, {feature_option:True})[feature_option]
        embeddings = np.array(features).reshape((len(text_tokens), 768))
        sentence_embeddings = np.zeros((1,128,768))
        sentence_embeddings[0,:len(text_tokens)] = embeddings
        predictions = [model.predict([e.reshape((1, 768)), sentence_embeddings]) for e in embeddings]
        positions = list(filter((lambda i: i>= 0 and i < len(text_tokens)), [pos for (pos, (pred_ed, pred_args)) in enumerate(predictions) if np.argmax(pred_ed) != 0]))
        output = []
        if len(positions) > 0:
                start_at = sum([len(token) for token in text_tokens[:positions[0]]])
        for pos in positions:
                loc_start, loc_end = get_text_location(text, text_tokens[pos], start_at)
                start_at = loc_end
                args_preds =  [np.argmax(predictions[pos][1][0,i,:]) for i in range(predictions[pos][1].shape[1]) if i < len(text_tokens)]
                start_arg_search = 0
                args_event = []
                event_type = events_types[str(np.argmax(predictions[pos][0]))]
                for arg_tokens in get_args_from_labels(args_preds):
                        first_arg_token = arg_tokens[0]
                        last_arg_token = arg_tokens[-1]
                        try:
                          # Raw string for the separator avoids the invalid '\s' escape warning
                          pattern = re.compile(r'\b%s\b' % r'\s*'.join([text_tokens[arg_token[0]] for arg_token in arg_tokens]))
                        except re.error:
                          if is_pprint:
                            return json.dumps(output, indent=4)
                          return output
                        match = pattern.search(text, start_arg_search)
                        if match:
                                arg_type = args_types[str(first_arg_token[1])]
                                if str(arg_type['id']) in event_type['args']:
                                        args_event.append({'role':arg_type['name'],
                                                           'text': text[match.start():match.end()],
                                                           'start': match.start(),
                                                           'end': match.end()
                                                           })
                           
                                start_arg_search = match.end()
                output.append({'trigger':{
                        'text': text[loc_start:loc_end],
                        'start': loc_start,
                        'end' : loc_end},
                               'arguments':args_event,
                               'event_type': event_type['name']
                               })
        if is_pprint:
          return json.dumps(output, indent=4)
        return output



def load_bertimbau_model():    
        global tokenizer
        global model_bert
        
        paths = get_checkpoint_paths(BERTIMBAU_MODEL_PATH)

        model_bert = load_trained_model_from_checkpoint(paths.config, paths.checkpoint, seq_len=512, output_layer_num=12)

        token_dict = load_vocabulary(paths.vocab)
        tokenizer = Tokenizer(token_dict)

def load_tefe_model():
        global model
        global events_types
        global args_types

        events_types, args_types = load_events_args_info()
        model = load_model(RUN_CONFIGS[model_config]['model'])
        return model

def load_events_args_info():
        events_types, args_types = {}, {}

        with open(RUN_CONFIGS[model_config]['events-types'], 'r') as read_content:        
                events_types = json.load(read_content)
                
        with open(RUN_CONFIGS[model_config]['args-types'], 'r') as read_content:        
                args_types = json.load(read_content)                

        return events_types, args_types



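def extract_from_files(input_path, output_path):
        # os.path.join tolerates directory paths with or without a trailing separator
        for filepathname in glob.glob(os.path.join(input_path, '*.txt')):
                extractions = []
                # Close the input file deterministically and read it as UTF-8
                with open(filepathname, encoding='utf-8') as infile:
                        for line in infile:
                                line = line.strip()
                                print(line)
                                # is_pprint=False returns a list, so json.dump below does not encode the events twice
                                extractions.append(extract_events(line, EMBEDDING_ID, is_pprint=False))

                filename = os.path.splitext(os.path.basename(filepathname))[0]
                with open(os.path.join(output_path, f'{filename}.json'), 'w', encoding='utf-8') as outfile:
                        json.dump(extractions, outfile)
                print(f'{filename}')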


def extract_events_from(input_path, output_path):
        run_extraction_context(lambda : extract_from_files(input_path, output_path))
        

def extract_events_from_sentence(sentence):
        sentence = sentence.strip()
        run_extraction_context(lambda : print(extract_events(sentence, EMBEDDING_ID)))
        

def run_extraction_context(run_extraction_func):
        # Use the first GPU when one is available; otherwise fall back to the CPU
        device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/cpu:0'
        with tf.device(device):
                load_bertimbau_model()
                load_tefe_model()
                run_extraction_func()
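
The `tokenize_and_compose` function above rebuilds whole words from BERT WordPiece tokens by gluing every `##`-prefixed piece onto the preceding token. The standalone sketch below illustrates that idea on a hand-written token list. `merge_wordpieces` is a hypothetical helper that does not call `keras_bert`, and the sample pieces are made up for illustration rather than taken from the BERTimbau vocabulary.

```python
def merge_wordpieces(pieces):
    """Join '##' continuation pieces onto the preceding token."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


# Hand-written example; a real tokenizer may split the words differently.
pieces = ["[CLS]", "A", "Petro", "##bra", "##s", "aumentou", "[SEP]"]

# Drop the [CLS] and [SEP] markers, as the notebook code does with [1:-1].
print(merge_wordpieces(pieces)[1:-1])  # ['A', 'Petrobras', 'aumentou']
```

`compose_token_embeddings` applies the same grouping to the embedding vectors, summing the vectors of all pieces that belong to one word.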
                        


# RUN

## Extract Events From Sentence

In [7]:
#@title Input the sentence and select the model

sentence = 'A Petrobras aumentou o preço da gasolina para 2,30 reais, disse o presidente.' #@param {type:"string"}
model_config = '100' #@param ["0", "100", "0-0", "100-0", "100-100"]


print(sentence)
print(model_config)
extract_events_from_sentence(sentence)

A Petrobras aumentou o preço da gasolina para 2,30 reais, disse o presidente.
100
[
    {
        "trigger": {
            "text": "aumentou",
            "start": 12,
            "end": 20
        },
        "arguments": [
            {
                "role": "Cause_change_of_position_on_a_scale#Agent",
                "text": "A Petrobras",
                "start": 0,
                "end": 11
            },
            {
                "role": "Cause_change_of_position_on_a_scale#Attribute",
                "text": "o pre\u00e7o da",
                "start": 21,
                "end": 31
            },
            {
                "role": "Cause_change_of_position_on_a_scale#Item",
                "text": "gasolina",
                "start": 32,
                "end": 40
            },
            {
                "role": "Cause_change_of_position_on_a_scale#Value_2",
                "text": "2,30 reais",
                "start": 46,
                "end": 56
            }
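
Each `start`/`end` pair in the output is a character offset into the input sentence, with `end` exclusive, so slicing the sentence recovers the reported text. The sketch below checks this for the trigger and arguments shown above. The `spans` list is copied from that output, and `check_spans` is a hypothetical helper.

```python
sentence = ('A Petrobras aumentou o preço da gasolina para 2,30 reais, '
            'disse o presidente.')

# (text, start, end) triples copied from the output above.
spans = [
    ("aumentou", 12, 20),
    ("A Petrobras", 0, 11),
    ("o preço da", 21, 31),
    ("gasolina", 32, 40),
    ("2,30 reais", 46, 56),
]


def check_spans(sentence, spans):
    """Return the spans whose offsets do not reproduce their text."""
    return [(text, start, end) for text, start, end in spans
            if sentence[start:end] != text]


print(check_spans(sentence, spans))  # [] when every offset matches
```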
     

## Extract Events From Directory

In [None]:
# Mount Google Drive if you want to process files stored in your Drive folders

from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title ## Input and Output directory fields

#@markdown The text files in the input directory are expected to have the format:

#@markdown * all text files end with the extension .txt
#@markdown * sentences are separated by newlines


#@markdown ---
#@markdown ### Enter the directory paths:
input_dir = "/content/drive/MyDrive/input-files/" #@param {type:"string"}
output_dir = "/content/drive/MyDrive/output-files/" #@param {type:"string"}
model_config = '100' #@param ["0", "100", "0-0", "100-0", "100-100"]
#@markdown ---

extract_events_from(input_dir, output_dir)
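
As a rough sketch of the round trip, the snippet below writes sentences one per line into a `.txt` file, which is the input layout described above, and reads a result file back. The helper names, the temporary directory, and the stand-in result content are all hypothetical. `load_extractions` also accepts entries stored as JSON-encoded strings, since that depends on how the output file was written.

```python
import json
import os
import tempfile


def write_sentences(path, sentences):
    """Write one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(s.strip() for s in sentences) + "\n")


def load_extractions(path):
    """Load one result file; decode entries that are JSON-encoded strings."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    return [json.loads(e) if isinstance(e, str) else e for e in entries]


with tempfile.TemporaryDirectory() as workdir:
    txt_path = os.path.join(workdir, "sample.txt")
    write_sentences(txt_path, ["A Petrobras aumentou o preço da gasolina."])

    # Stand-in for a file produced by extract_events_from (made-up content).
    json_path = os.path.join(workdir, "sample.json")
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(['[{"event_type": "example"}]'], handle)

    print(load_extractions(json_path))
```

Each element of the loaded list corresponds to one input line, in the same order as the sentences in the `.txt` file.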