# AUTOMATIC KEYWORD EXTRACTOR - Training experiments
*Germán García García - gggsman@gmail.com*

_________________________________

MSc in Artificial Intelligence

Final Master's project: Applying Deep Learning Techniques to Terminology Extraction in Specific Domains

Universidad Politécnica de Madrid


## Manual

**Description:**

This Jupyter notebook contains the code cells needed to replicate the experiments presented in the Final Master's Project. The main document can be found in the https://github.com/3Gsman/DeepTerminologyExtraction repository. The notebook trains an automatic keyword extractor: a model able to pick keywords from given texts.

**Abstract:**

Automatic terminology extraction, or automatic keyphrase extraction, is a very useful subfield of natural language processing when it comes to synthesizing information from texts in concise terms. In this master's thesis, the problem is treated as a sequence labeling task and solved with supervised deep learning techniques in specific domains, using bidirectional LSTM (long short-term memory) neural networks and contextual word embeddings. A statistical significance study has been carried out to verify that the results presented in this work are significant. The final F1 score on the Inspec dataset is 0.5730, slightly better than the best state-of-the-art result, and the results are less dispersed. Additionally, this work analyzes the effect of varying the number of samples in the training set, provides a program to convert the datasets to the sequence labeling format needed by the final system, and includes a sample program to test the keyphrase extractor on texts given by the user.

**Usage:**

To use this notebook you need a compatible dataset. There are three options. First, alongside this notebook, the [GitHub repository](https://github.com/3Gsman/DeepTerminologyExtraction) of this work contains a folder named *datasets* with all the datasets used during the experimentation. Second, the same repository provides a script that converts any of 20 available keyword extraction datasets to the format needed by this notebook; follow the instructions in the folder *format_dataset*. Third, you can create or adapt your own dataset. The format is similar to CoNLL-03 but with only 3 tags: *B-KEY* for the word that begins a keyword, *I-KEY* for the subsequent words of the keyword, and *O* for a word that is not part of a keyword. If you have any questions, please contact the author, Germán García, at gggsman@gmail.com.
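
As an illustration of the tagging scheme described above, here is a minimal sketch that labels a tokenized sentence with *B-KEY*, *I-KEY* and *O* and prints it with one token and one tag per line. The helper names `bio_tags` and `to_column_format` are hypothetical, and the two-column layout is an assumption inferred from the `columns = {0: 'text', 1: 'ner'}` mapping used in the training cell; the conversion script in *format_dataset* may work differently.

```python
from typing import List, Sequence


def bio_tags(tokens: Sequence[str], keyphrases: Sequence[Sequence[str]]) -> List[str]:
    """Label every token with B-KEY, I-KEY or O (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Match longer keyphrases first so they are not split by shorter ones.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        words = [w.lower() for w in phrase]
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if window_free and lowered[start:start + n] == words:
                tags[start] = "B-KEY"
                for i in range(start + 1, start + n):
                    tags[i] = "I-KEY"
    return tags


def to_column_format(tokens: Sequence[str], tags: Sequence[str]) -> str:
    """One 'token tag' pair per line; a blank line ends the sentence."""
    return "\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags)) + "\n\n"


tokens = ["Deep", "learning", "improves", "keyphrase", "extraction", "."]
tags = bio_tags(tokens, [["deep", "learning"], ["keyphrase", "extraction"]])
print(to_column_format(tokens, tags), end="")
```

Matching is case-insensitive and exact on whole tokens, which is a simplification: real datasets often need stemming or fuzzy matching to align the gold keyphrases with the text.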

Once you have downloaded the dataset, upload it to Google Colab: click the folder-shaped button on the left side of the interface, drag the dataset into the *Files* section, and wait until the upload finishes.

## Editable variables and hyperparameters:

In [None]:
# Set download_model to True if you want to download the model at the end of the execution
download_model = False

In [None]:
# Edit the hyperparameters if you want to change the training behaviour
class hyperparam:
  embedding = 'Bert'
  embedding_path = ''
  dataset_base_path = ''
  dataset = 'dataset'
  output_base_path = 'result/'
  iteration = ''
  gpu = 1
  lr = 0.05
  anneal_factor = 0.5
  patience = 4
  batch_size = 4
  num_epochs = 150
  threads = 12
  param_selection_mode = False
  use_tensorboard = False
  no_dev = False
  use_crf = True
  rnn_layers = 3
  hidden_size = 128
  dropout = 0.3
  word_dropout = 0.05
  locked_dropout = 0.5
  not_in_memory = False
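
To give an intuition for how `lr`, `anneal_factor` and `patience` interact, the sketch below simulates a plateau-based schedule: the learning rate is multiplied by `anneal_factor` once the development score has failed to improve for more than `patience` consecutive epochs. This is a simplified model that follows the semantics of PyTorch's `ReduceLROnPlateau`; that the Flair trainer behaves exactly this way is an assumption, and `simulate_annealing` is a hypothetical helper, not part of the notebook.

```python
from typing import List


def simulate_annealing(dev_scores: List[float], lr: float = 0.05,
                       anneal_factor: float = 0.5, patience: int = 4) -> List[float]:
    """Return the learning rate used in each epoch (simplified model)."""
    best = float("-inf")
    bad_epochs = 0
    used = []
    for score in dev_scores:
        used.append(lr)
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Once the score has stalled for more than `patience` epochs, shrink lr.
        if bad_epochs > patience:
            lr *= anneal_factor
            bad_epochs = 0
    return used


# Two improving epochs followed by six epochs without improvement.
scores = [0.40, 0.45] + [0.44] * 6
print(simulate_annealing(scores))
```

With the default values of this notebook, a lower `patience` or a smaller `anneal_factor` makes the schedule react faster to a plateau, at the risk of shrinking the learning rate before the model has settled.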

## System startup

This section unzips the dataset (expected as *dataset.zip*), installs the libraries the notebook needs, and imports them.

In [None]:
!unzip dataset.zip
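
Before going further it can help to confirm that the unzipped folder has the files the training cell reads. The sketch below is a hypothetical check, not part of the original notebook; the file names `train.txt`, `test.txt` and `dev.txt` and the folder name `dataset` are taken from the training code and the default hyperparameters.

```python
from pathlib import Path
from typing import List


def missing_files(dataset_dir: str, need_dev: bool = True) -> List[str]:
    """List the expected split files that are absent from dataset_dir."""
    expected = ["train.txt", "test.txt"] + (["dev.txt"] if need_dev else [])
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means every expected file was found.
print(missing_files("dataset"))
```

Pass `need_dev=False` when `no_dev` is set to `True` in the hyperparameters, since the development file is not read in that case.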

In [None]:
!pip install flair
!pip install pytorch_transformers

In [None]:
import sys
from typing import List
import argparse
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, CharacterEmbeddings, BertEmbeddings, TransformerXLEmbeddings, ELMoTransformerEmbeddings, ELMoEmbeddings,OpenAIGPTEmbeddings, RoBERTaEmbeddings,XLMEmbeddings, XLNetEmbeddings, OpenAIGPT2Embeddings
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
import flair.datasets
from flair.visual.training_curves import Plotter
from torch.utils.data import DataLoader
from flair.models import SequenceTagger
from google.colab import files
from flair.trainers import ModelTrainer
import torch.optim as optim
from pytorch_transformers import BertTokenizer

## Training

In [None]:
def bs(tokenizer,x,l,r,max_seq_len):
    """Binary search for the largest number of words of sentence x, between l
    and r, whose wordpiece tokenization has at most max_seq_len pieces."""
    if r>=l:
        mid = l + (r - l)//2
        res=verifymid(tokenizer,x,mid,max_seq_len)
        if res==3:
            # mid words fit and mid+1 words do not: mid is the cut point
            return mid
        elif res==2:
            # mid+1 words still fit: search the upper half
            return bs(tokenizer,x,mid+1,r,max_seq_len)
        else:
            # mid words are already too long: search the lower half
            return bs(tokenizer,x,l,mid-1,max_seq_len)

    else:
        print("wrong binary search")
        sys.exit()


def verifymid(tokenizer,x,mid,max_seq_len):
    """Return 1 if the first mid words exceed max_seq_len wordpieces, 2 if the
    first mid+1 words still fit, and 3 if mid is exactly the cut point."""
    if not verifymid_1(tokenizer,x,mid,max_seq_len):
        return 1
    if verifymid_1(tokenizer,x,mid+1,max_seq_len):
        return 2
    return 3


def verifymid_1(tokenizer,x,mid,max_seq_len):
    """Return True if the first mid words of x fit in max_seq_len wordpieces."""
    lw=x.to_tokenized_string().split(" ")
    sent=" ".join(lw[:mid])
    return len(tokenizer.tokenize(sent))<=max_seq_len
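
The idea behind the cell above can be tried without Flair or BERT. The following sketch searches for the longest word prefix whose subword count stays within a limit, using a toy tokenizer as a stand-in for the real wordpiece tokenizer. `longest_fitting_prefix` and `toy_tokenize` are hypothetical names, and the search assumes that adding a word never reduces the number of subword pieces.

```python
from typing import Callable, List


def longest_fitting_prefix(words: List[str],
                           tokenize: Callable[[str], List[str]],
                           max_seq_len: int) -> int:
    """Largest k such that tokenize(" ".join(words[:k])) has at most max_seq_len pieces."""
    lo, hi = 0, len(words)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if len(tokenize(" ".join(words[:mid]))) <= max_seq_len:
            lo = mid       # the first `mid` words fit, try a longer prefix
        else:
            hi = mid - 1   # too long, try a shorter prefix
    return lo


def toy_tokenize(text: str) -> List[str]:
    """Stand-in subword tokenizer: splits every word into chunks of 3 characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


words = "automatic keyphrase extraction with contextual embeddings".split()
k = longest_fitting_prefix(words, toy_tokenize, max_seq_len=8)
print(k, words[:k])
```

Binary search needs only a logarithmic number of tokenizer calls per overlong sentence, which matters because each call re-tokenizes the whole prefix.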

In [None]:
def train(data_path, list_embedding, output, hyperparam):

    # define columns
    columns = {0: 'text', 1: 'ner'}

    if hyperparam.no_dev==True:
        corpus: Corpus = ColumnCorpus(data_path, 
                                      columns, 
                                      train_file='train.txt',
                                      test_file='test.txt',
                                      in_memory=not hyperparam.not_in_memory
                                     )
        
    else:
        corpus: Corpus = ColumnCorpus(data_path,
                                      columns,
                                      train_file='train.txt',
                                      test_file='test.txt',
                                      dev_file='dev.txt',
                                      in_memory=not hyperparam.not_in_memory
                                      )


    # 2. what tag do we want to predict?
    tag_type = 'ner'


    # 3. make the tag dictionary from the corpus
    tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
    print(tag_dictionary.idx2item)
    
    stats=corpus.obtain_statistics()
    print("Original\n",stats)
    

    if hyperparam.embedding=="Bert":

        print("Tokenizer",hyperparam.embedding)
        if hyperparam.embedding_path!="":
            tokenizer = BertTokenizer.from_pretrained(hyperparam.embedding_path)
        else:
            tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
        
        max_seq_len=500   
        print("taking max seq len as ",max_seq_len)
           

        new_train=[]
        for x in corpus.train:

            tokenized_text = tokenizer.tokenize(x.to_plain_string())

            if len(tokenized_text)<=max_seq_len:
                new_train.append(x)
            
            else:          
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)
                
                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    try:
                        new_sent[index].add_tag('ner', x[index].get_tag('ner').value)
                    except:
                        pass

                new_train.append(new_sent)

        new_test=[]
        for x in corpus.test:

            tokenized_text = tokenizer.tokenize(x.to_plain_string())

            if len(tokenized_text)<=max_seq_len:
                new_test.append(x)
            
            else:           
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)

                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    try:
                        new_sent[index].add_tag('ner', x[index].get_tag('ner').value)
                    except:
                        pass

                new_test.append(new_sent)

        new_dev=[]
        for x in corpus.dev:

            tokenized_text = tokenizer.tokenize(x.to_plain_string())

            if len(tokenized_text)<=max_seq_len:
                new_dev.append(x)
            
            else:           
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)

                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    try:
                        new_sent[index].add_tag('ner', x[index].get_tag('ner').value)
                    except:
                        pass

                new_dev.append(new_sent)   

        corpus._train=new_train
        corpus._test=new_test
        corpus._dev=new_dev
        stats=corpus.obtain_statistics()
        print("Modified",stats)  


    elif hyperparam.embedding=="RoBERTa":

        print("Tokenizer",hyperparam.embedding)

        from pytorch_transformers import BertTokenizer
        print("Using Bert tokenizer bert-base-uncased")
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        
        max_seq_len=500   
        print("taking max seq len as ",max_seq_len)
           
        new_train=[]
        for x in corpus.train:

            tokenized_text = tokenizer.tokenize(x.to_plain_string())

            if len(tokenized_text)<=max_seq_len:
                new_train.append(x)
            
            else:            
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)

                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    try:
                        new_sent[index].add_tag('ner', x[index].get_tag('ner').value)
                    except:
                        pass

                new_train.append(new_sent)

        new_test=[]
        for x in corpus.test:
            tokenized_text = tokenizer.tokenize(x.to_plain_string())
            if len(tokenized_text)<=max_seq_len:
                new_test.append(x)
            
            else:            
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)

                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    try:
                        new_sent[index].add_tag('ner', x[index].get_tag('ner').value)
                    except:
                        pass

                new_test.append(new_sent)

        new_dev=[]
        for x in corpus.dev:

            tokenized_text = tokenizer.tokenize(x.to_plain_string())
            
            if len(tokenized_text)<=max_seq_len:
                new_dev.append(x)
            
            else:            
                limit=bs(tokenizer,x,1,max_seq_len,max_seq_len)
                lw=x.to_tokenized_string().split(" ")
                lw=lw[:limit]
                sent=" ".join(lw)
                tokenized_text = tokenizer.tokenize(sent)

                if len(tokenized_text)>max_seq_len:
                    print("wrong binary search 1")
                    sys.exit()

                new_sent=Sentence(sent)
                for index in range(len(new_sent)):
                    new_sent[index].add_tag('ner', x[index].get_tag('ner').value)

                new_dev.append(new_sent)             
        
        corpus._train=new_train
        corpus._test=new_test
        corpus._dev=new_dev
        stats=corpus.obtain_statistics()
        print("Modified",stats)  


    # 4. initialize embeddings
    embedding_types: List[TokenEmbeddings] = list_embedding

    embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

    # 5. initialize sequence tagger

    tagger: SequenceTagger = SequenceTagger(hidden_size=hyperparam.hidden_size,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=hyperparam.use_crf,
                                        rnn_layers=hyperparam.rnn_layers,
                                        dropout=hyperparam.dropout, word_dropout=hyperparam.word_dropout, locked_dropout=hyperparam.locked_dropout
                                        )

    # 6. initialize trainer

    trainer: ModelTrainer = ModelTrainer(tagger, 
                                        corpus,
                                        use_tensorboard=hyperparam.use_tensorboard,
                                        optimizer=optim.SGD
                                        ) 

    # 7. start training
    trainer.train(output,
              learning_rate=hyperparam.lr,
              mini_batch_size=hyperparam.batch_size,
              anneal_factor=hyperparam.anneal_factor,
              patience=hyperparam.patience,
              max_epochs=hyperparam.num_epochs,
              param_selection_mode=hyperparam.param_selection_mode,
              num_workers=hyperparam.threads,
              
              )
    return trainer
    

In [None]:
if hyperparam.embedding=='Bert':
    if hyperparam.embedding_path!="":
        embedding=BertEmbeddings(hyperparam.embedding_path)
    else:
        embedding=BertEmbeddings()

elif hyperparam.embedding=='ELMo':
    if hyperparam.embedding_path!="":
        embedding=ELMoEmbeddings(hyperparam.embedding_path)
    else:
        embedding=ELMoEmbeddings()

elif hyperparam.embedding=='RoBERTa':
    if hyperparam.embedding_path!="":
        embedding=RoBERTaEmbeddings(hyperparam.embedding_path)
    else:
        embedding=RoBERTaEmbeddings()

else:
    # fail early instead of hitting a NameError on `embedding` further down
    raise ValueError("Unsupported embedding: "+hyperparam.embedding)

output=hyperparam.output_base_path+hyperparam.embedding+"_"+hyperparam.embedding_path +"_"+hyperparam.dataset+hyperparam.iteration+"_bs_"+str(hyperparam.batch_size)+ "_lr_"+str(hyperparam.lr)+ '_af_'+str(hyperparam.anneal_factor)+ '_p_'+ str(hyperparam.patience) +\
               "_hsize_"+str(hyperparam.hidden_size)+"_crf_"+str(int(hyperparam.use_crf))+"_lrnn_"+str(hyperparam.rnn_layers)+"_dp_"+str(hyperparam.dropout)+"_wdp_"+str(hyperparam.word_dropout)+"_ldp_"+str(hyperparam.locked_dropout)+"/"
dataset_path=hyperparam.dataset_base_path+hyperparam.dataset+"/"

print(output)
print(dataset_path)

print("\nHyper-Parameters\n")
arguments=vars(hyperparam)
for i in arguments:
    # skip the attributes Python adds to every class (__module__, __dict__, ...)
    if i.startswith('__'):
        continue
    print('{0:25}  {1}'.format(i+":", str(arguments[i])))

trainer=train(dataset_path,[embedding],output,hyperparam)
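
The output folder name built in the cell above encodes the main hyperparameters, so runs with different settings do not overwrite each other. The sketch below reproduces the same naming scheme with f-strings; `run_dir_name` is a hypothetical helper and `SimpleNamespace` merely stands in for the `hyperparam` class.

```python
from types import SimpleNamespace


def run_dir_name(hp) -> str:
    """Build the run folder name from a settings object (hypothetical helper)."""
    return (
        f"{hp.output_base_path}{hp.embedding}_{hp.embedding_path}_{hp.dataset}{hp.iteration}"
        f"_bs_{hp.batch_size}_lr_{hp.lr}_af_{hp.anneal_factor}_p_{hp.patience}"
        f"_hsize_{hp.hidden_size}_crf_{int(hp.use_crf)}_lrnn_{hp.rnn_layers}"
        f"_dp_{hp.dropout}_wdp_{hp.word_dropout}_ldp_{hp.locked_dropout}/"
    )


# Default values copied from the hyperparameter cell of this notebook.
defaults = SimpleNamespace(
    output_base_path="result/", embedding="Bert", embedding_path="",
    dataset="dataset", iteration="", batch_size=4, lr=0.05,
    anneal_factor=0.5, patience=4, hidden_size=128, use_crf=True,
    rnn_layers=3, dropout=0.3, word_dropout=0.05, locked_dropout=0.5,
)
print(run_dir_name(defaults))
```

Note that an `embedding_path` containing slashes would introduce extra directory levels in the name, so a local model path may need to be sanitized first.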

## Results

In [None]:
a = DataLoader(trainer.corpus.dev, batch_size=hyperparam.batch_size, num_workers=hyperparam.threads,)

#now save both train and dev predictions
dev_eval_result, dev_loss = trainer.model.evaluate(trainer.corpus.dev,out_path=output+"dev.tsv")


b = DataLoader(trainer.corpus.train,batch_size=hyperparam.batch_size,num_workers=hyperparam.threads,)

train_eval_result, train_loss = trainer.model.evaluate(trainer.corpus.train,out_path=output+"train.tsv")

In [None]:
print(dev_eval_result.detailed_results)
print(dev_loss)
print(train_eval_result.detailed_results)
print(train_loss)
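
To make the reported scores easier to interpret, here is a minimal sketch of exact-match precision, recall and F1 over keyphrase spans, computed from gold and predicted tag sequences. It illustrates the metric in general; the `spans` and `prf1` helpers are hypothetical, and the evaluation Flair runs internally may differ in details such as how scores are averaged over sentences.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def spans(tags: List[str]) -> Set[Span]:
    """Collect the keyphrase spans encoded by a B-KEY / I-KEY / O sequence."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != "I-KEY":
            found.add((start, i))
            start = None
        if tag == "B-KEY":
            start = i
    return found


def prf1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over spans."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-KEY", "I-KEY", "O", "B-KEY", "O"]
pred = ["B-KEY", "I-KEY", "O", "O", "B-KEY"]
print(prf1(gold, pred))
```

A predicted span counts as correct only when both boundaries match the gold span, so a keyphrase that is one word too long or too short contributes to neither precision nor recall.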

In [None]:
plotter = Plotter()
plotter.plot_training_curves(output+'loss.tsv')

In [None]:
# load the model you trained
model = SequenceTagger.load(output+'final-model.pt')
# create example sentence
sentence = Sentence("Automatic terminology extraction or automatic keyphrase extraction is a very useful subfield of natural language processing when it comes to synthesizing information from texts in concise terms. In this master's thesis, this problem is approached with a sequence labeling approach using supervised deep learning techniques in specific domains, using bidirectional LSTM (Long short-term memory) neural networks and contextual word embeddings. A statistical significance study has been carried out to verify that the results presented in this work are significant. The final result in the F1 score in the Inspec dataset is 0.5730 slightly better than the higher result of the state of the art and it offers less dispersed results. Additionally, in this work the variation of samples in the training set is analyzed, a program to convert the datasets to a sequence labeling format needed for the final system is provided and there is an available sample program to test the keyphrase extractor in texts given by the user.")
# predict tags and print
model.predict(sentence)

print(sentence.to_tagged_string())
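
The tagged string printed above still mixes tokens and tags. A possible post-processing step, sketched below under the assumption that the predictions are available as `(token, tag)` pairs, groups consecutive *B-KEY* / *I-KEY* tokens back into keyphrases. `extract_keyphrases` is a hypothetical helper, not part of the original notebook.

```python
from typing import List, Tuple


def extract_keyphrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join B-KEY / I-KEY runs into keyphrase strings, in order of appearance."""
    phrases, current = [], []
    for token, tag in tagged:
        if tag == "B-KEY":
            if current:  # a new keyphrase starts right after another one
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-KEY" and current:
            current.append(token)
        else:  # an O tag, or an I-KEY with no opening B-KEY, closes the run
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tagged = [("Bidirectional", "B-KEY"), ("LSTM", "I-KEY"), ("networks", "O"),
          ("use", "O"), ("word", "B-KEY"), ("embeddings", "I-KEY")]
print(extract_keyphrases(tagged))
```

With the Flair version used in this notebook, the pairs could presumably be built from the predicted sentence with `[(t.text, t.get_tag('ner').value) for t in sentence]`, mirroring the `get_tag('ner')` calls in the training cell.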

In [None]:
if download_model:
  files.download(output+'best-model.pt') 