## Adapting the dataset to Flair
Flair expects the dataset split into train, test, and dev sets to train a model, so the first step is to reshape the dataset accordingly.
The cells below shuffle, split, and save the dataset to txt files.

The implementation locates the dataset by mounting Google Drive. To change the file path in your Drive, update the DATASET_GDRIVE_PATH variable.
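
As a rough sketch of the Drive-based lookup described above: the helper below returns a Drive path when running inside Google Colab and falls back to a local file otherwise. The value given to DATASET_GDRIVE_PATH, the `/content/drive` mount point, the `LOCAL_DATASET_PATH` name, and the `resolve_dataset_path` helper are all assumptions made for illustration; only `drive.mount` is Colab's own call.

```python
# Hypothetical default location; adjust it to wherever the CSV lives in your Drive.
DATASET_GDRIVE_PATH = '/content/drive/MyDrive/ner_dataset.csv'
# Local fallback, matching the relative path read by the notebook cells.
LOCAL_DATASET_PATH = './ner_dataset.csv'

def resolve_dataset_path(gdrive_path=DATASET_GDRIVE_PATH, local_path=LOCAL_DATASET_PATH):
    """Return the Drive path when running in Colab, otherwise the local path."""
    try:
        from google.colab import drive  # only importable inside Google Colab
    except ImportError:
        return local_path
    drive.mount('/content/drive')  # assumed mount point
    return gdrive_path

print(resolve_dataset_path())
```

The returned path could then be passed to `pd.read_csv` in place of the hard-coded file name.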

In [1]:
pip install pandas numpy




In [2]:
import pandas as pd
import numpy as np
import random

In [3]:
df = pd.read_csv('./ner_dataset.csv', encoding='Latin-1')
df = df.ffill() # Fill NA cells with the value from the row above
df = df.set_index('Sentence #', append=True)
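
To make the fill step concrete, here is a minimal pure-Python sketch of forward filling. It assumes, as the fill in the cell above suggests, that the 'Sentence #' column is populated only on the first token of each sentence. The `forward_fill` helper and the toy column are illustrative and are not part of the notebook.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Toy 'Sentence #' column: only the first token of each sentence is labeled.
sentence_column = ['Sentence: 1', None, None, 'Sentence: 2', None]
print(forward_fill(sentence_column))
```

After the fill, every token row carries its sentence identifier, which is what makes grouping by 'Sentence #' possible later.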

In [4]:
# Shuffles the dataset while keeping the rows of each sentence together.
def shuffle_preserving_sentences(df):
  sentence_groupby = df.groupby('Sentence #') # Group rows by sentence

  sentences_shuffled = list(sentence_groupby.groups.keys()) # List of the keys of each group
  random.shuffle(sentences_shuffled) # Shuffle the keys

  # Build a list with one dataframe per sentence, in shuffled order.
  shuffled_dfs = []
  for sentence_n in sentences_shuffled:
    shuffled_dfs.append(sentence_groupby.get_group(sentence_n))

  return shuffled_dfs
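
The same idea can be sketched without pandas: group consecutive rows by sentence id, then shuffle the groups rather than the rows. The toy rows, the `shuffle_sentences` name, and the `seed` parameter are assumptions for illustration; the notebook's own function works on a DataFrame and does not fix a seed.

```python
import random
from itertools import groupby

# Toy rows (sentence id, word); hypothetical data for illustration only.
rows = [
    ('Sentence: 1', 'Iranian'), ('Sentence: 1', 'officials'),
    ('Sentence: 2', 'They'), ('Sentence: 2', 'left'), ('Sentence: 2', '.'),
    ('Sentence: 3', 'Hello'),
]

def shuffle_sentences(rows, seed=None):
    """Group consecutive rows by sentence id, then shuffle whole groups."""
    groups = [list(group) for _, group in groupby(rows, key=lambda row: row[0])]
    random.Random(seed).shuffle(groups)
    return groups

for sentence in shuffle_sentences(rows, seed=0):
    print([word for _, word in sentence])
```

Passing a seed makes the order repeatable, which may be worth considering if the train/test/dev split needs to be reproduced across runs.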

In [5]:
def write_sentences_to_file(groupby, file):
  # Writes one token per line (space-separated columns) and a blank line after each sentence.
  for _, group in groupby:
    group.to_csv(file, index=False, header=False, sep=' ', encoding='Latin-1', lineterminator='\n')
    file.write('\n')


In [6]:
def split_dataset(df, train_ratio, test_ratio):
  sentences_dfs = shuffle_preserving_sentences(df)

  # Sizes are counted in sentences, not in rows.
  total_size = len(sentences_dfs)
  train_size = int(total_size * train_ratio)
  test_size = int(total_size * test_ratio)

  # Dev receives whatever remains after train and test.
  train_data = pd.concat(sentences_dfs[:train_size])
  test_data = pd.concat(sentences_dfs[train_size:train_size + test_size])
  dev_data = pd.concat(sentences_dfs[train_size + test_size:])

  return train_data, test_data, dev_data


In [7]:
train_ratio = 0.8
test_ratio = 0.1
train_df, test_df, dev_df = split_dataset(df, train_ratio, test_ratio)

with open('./train.txt', 'w', encoding='Latin-1') as f:
  train_df_groupby = train_df.groupby('Sentence #')
  write_sentences_to_file(train_df_groupby, f)

with open('./test.txt', 'w', encoding='Latin-1') as f:
  test_df_groupby = test_df.groupby('Sentence #')
  write_sentences_to_file(test_df_groupby, f)

with open('./dev.txt', 'w', encoding='Latin-1') as f:
  dev_df_groupby = dev_df.groupby('Sentence #')
  write_sentences_to_file(dev_df_groupby, f)
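
The files written above hold one token per line, with space-separated columns and a blank line between sentences. A quick sanity check on that layout might look like the sketch below, which counts sentences and reports lines that do not have exactly three fields. The `check_column_file` helper and the toy file content are hypothetical and are not part of the notebook or of Flair.

```python
import os
import tempfile

def check_column_file(path, n_columns=3, encoding='Latin-1'):
    """Count sentences and collect line numbers that do not have n_columns fields."""
    n_sentences, bad_lines, in_sentence = 0, [], False
    with open(path, encoding=encoding) as f:
        for line_number, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:  # a blank line closes the current sentence
                in_sentence = False
                continue
            if not in_sentence:
                n_sentences += 1
                in_sentence = True
            if len(fields) != n_columns:
                bad_lines.append(line_number)
    return n_sentences, bad_lines

# Toy file in the same layout (hypothetical content, not taken from the dataset).
sample = "They PRP O\nleft VBD O\n\nLondon NNP B-geo\ncalling VBG\n\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='Latin-1') as tmp:
    tmp.write(sample)
result = check_column_file(tmp.name)
os.remove(tmp.name)
print(result)  # (2, [5]): two sentences, line 5 is missing its tag column
```

Running the same function on `./train.txt`, `./test.txt`, and `./dev.txt` would flag rows whose columns were lost or merged during export.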

## Loading the dataset

In [8]:
pip install flair

Note: you may need to restart the kernel to use updated packages.


In [9]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus



In [10]:
columns = {0: 'text', 1: 'pos', 2: 'ner'}
data_folder = './'

corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt',
                              encoding='Latin-1')

2023-07-31 02:03:07,999 Reading data from .
2023-07-31 02:03:08,000 Train: train.txt
2023-07-31 02:03:08,000 Dev: dev.txt
2023-07-31 02:03:08,000 Test: test.txt


In [11]:
print("Train size: " + str(len(corpus.train)))
print("Test size: " + str(len(corpus.test)))
print("Dev size: " + str(len(corpus.dev)))

Train size: 38367
Test size: 4795
Dev size: 4797
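
These sizes follow directly from the slicing in split_dataset. The short sketch below repeats the arithmetic, assuming the total number of sentences is the sum of the three sizes printed above; `split_sizes` is an illustrative helper, not a function from the notebook.

```python
def split_sizes(total, train_ratio, test_ratio):
    """Mirror the slicing in split_dataset: dev receives whatever is left."""
    train_size = int(total * train_ratio)
    test_size = int(total * test_ratio)
    dev_size = total - train_size - test_size
    return train_size, test_size, dev_size

total_sentences = 38367 + 4795 + 4797  # sum of the sizes printed above
print(split_sizes(total_sentences, 0.8, 0.1))  # (38367, 4795, 4797)
```

Because both train and test sizes are truncated by `int`, the dev set absorbs the rounding remainder and ends up two sentences larger than the test set.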


In [12]:
print(corpus.train[0].to_tagged_string('ner'))
print(corpus.train[0].to_tagged_string('pos'))

Sentence[25]: "Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday , after an IAEA surveillance system begins functioning ." → ["Iranian"/gpe, "Wednesday"/tim, "IAEA"/org]
Sentence[25]: "Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday , after an IAEA surveillance system begins functioning ." → ["Iranian"/JJ, "officials"/NNS, "say"/VBP, "they"/PRP, "expect"/VBP, "to"/TO, "get"/VB, "access"/NN, "to"/TO, "sealed"/JJ, "sensitive"/JJ, "parts"/NNS, "of"/IN, "the"/DT, "plant"/NN, "Wednesday"/NNP, ","/,, "after"/IN, "an"/DT, "IAEA"/NNP, "surveillance"/NN, "system"/NN, "begins"/VBZ, "functioning"/VBG, "."/.]


## Training the model


In [13]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Note: you may need to restart the kernel to use updated packages.


In [14]:
import sys
import torch

print(sys.executable)
print(torch.__file__)

print('\nCUDA')
print(torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available(): # get_device_name raises when no CUDA device is present
  print(torch.cuda.get_device_name(torch.cuda.current_device()))

c:\Users\lucas\OneDrive\Área de Trabalho\TM\.venv\Scripts\python.exe
c:\Users\lucas\OneDrive\Área de Trabalho\TM\.venv\lib\site-packages\torch\__init__.py

CUDA
11.8
True
NVIDIA GeForce RTX 3060 Ti


In [15]:
from flair.embeddings import TransformerWordEmbeddings, WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

In [16]:
label_type = 'ner' # named entity recognition

label_dict = corpus.make_label_dictionary(label_type=label_type, add_unk=False)
print("\nLabels Dictionary: " + str(label_dict))

2023-07-31 02:03:28,305 Computing label dictionary. Progress:


38367it [00:00, 64859.77it/s]

2023-07-31 02:03:28,914 Dictionary created for label 'ner' with 9 values: geo (seen 30064 times), tim (seen 16262 times), org (seen 16085 times), per (seen 13615 times), gpe (seen 12680 times), art (seen 325 times), eve (seen 253 times), nat (seen 166 times),  (seen 1 times)

Labels Dictionary: Dictionary with 9 tags: geo, tim, org, per, gpe, art, eve, nat, 
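
The nine labels above are span-level labels. When training starts, the tagger reports 37 tags, which is consistent with expanding each label into S-, B-, E-, and I- variants and adding a single O tag. The sketch below only reproduces that count; `bioes_tags` is an illustrative helper and not Flair's implementation.

```python
def bioes_tags(labels):
    """Expand span labels into S-/B-/E-/I- tags, with 'O' for tokens outside any span."""
    tags = ['O']
    for label in labels:
        tags.extend(f'{prefix}-{label}' for prefix in ('S', 'B', 'E', 'I'))
    return tags

# The nine dictionary entries reported above, including the empty label.
labels = ['geo', 'tim', 'org', 'per', 'gpe', 'art', 'eve', 'nat', '']
tags = bioes_tags(labels)
print(len(tags))  # 37
print(tags[:5])   # ['O', 'S-geo', 'B-geo', 'E-geo', 'I-geo']
```

The empty label, seen only once, still contributes four tags (S-, B-, E-, I-), so removing the row that produces it would reduce the tag set to 33.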





NER model based on transformers, using XLM-RoBERTa (xlm-roberta-large).

In [17]:
embeddings = TransformerWordEmbeddings(model='xlm-roberta-large',
                                       layers="-1",
                                       subtoken_pooling="first",
                                       fine_tune=True,
                                       use_context=True,
                                       )

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type='ner',
                        use_crf=False,
                        use_rnn=False,
                        reproject_embeddings=False,
                        )

trainer = ModelTrainer(tagger, corpus)

trainer.fine_tune('resources/taggers/sota-ner-flert',
                  learning_rate=5.0e-6,
                  mini_batch_size=4,
                  max_epochs=1     
                  )
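
With mini_batch_size=4, the number of mini-batches per pass follows from the split sizes printed earlier. A small sketch of that arithmetic, assuming the last batch of each split is allowed to be partial (`n_batches` is an illustrative helper):

```python
import math

def n_batches(n_sentences, mini_batch_size):
    """Number of mini-batches needed to cover n_sentences once."""
    return math.ceil(n_sentences / mini_batch_size)

mini_batch_size = 4
for split, size in [('train', 38367), ('dev', 4797), ('test', 4795)]:
    print(split, n_batches(size, mini_batch_size))
```

The dev and test counts (1200 and 1199) line up with the totals shown in the evaluation progress bars of the training log.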

2023-07-31 02:03:39,928 SequenceTagger predicts: Dictionary with 37 tags: O, S-geo, B-geo, E-geo, I-geo, S-tim, B-tim, E-tim, I-tim, S-org, B-org, E-org, I-org, S-per, B-per, E-per, I-per, S-gpe, B-gpe, E-gpe, I-gpe, S-art, B-art, E-art, I-art, S-eve, B-eve, E-eve, I-eve, S-nat, B-nat, E-nat, I-nat, S-, B-, E-, I-
2023-07-31 02:03:39,936 ----------------------------------------------------------------------------------------------------
2023-07-31 02:03:39,938 Model: "SequenceTagger(
  (embeddings): TransformerWordEmbeddings(
    (model): XLMRobertaModel(
      (embeddings): XLMRobertaEmbeddings(
        (word_embeddings): Embedding(250003, 1024)
        (position_embeddings): Embedding(514, 1024, padding_idx=1)
        (token_type_embeddings): Embedding(1, 1024)
        (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): XLMRobertaEncoder(
        (layer): ModuleList(
          (0-23): 24 x XLMRo

100%|██████████| 1200/1200 [28:58<00:00,  1.45s/it]

2023-07-31 13:12:53,174 Evaluating as a multi-label problem: False





2023-07-31 13:12:53,274 DEV : loss 0.15668772161006927 - f1-score (micro avg)  0.8163
2023-07-31 13:12:59,911 ----------------------------------------------------------------------------------------------------
2023-07-31 13:12:59,913 Testing using last state of model ...


100%|██████████| 1199/1199 [16:25<00:00,  1.22it/s]

2023-07-31 13:29:25,544 Evaluating as a multi-label problem: False





2023-07-31 13:29:25,610 0.8091	0.8138	0.8115	0.7284
2023-07-31 13:29:25,611 
Results:
- F-score (micro) 0.8115
- F-score (macro) 0.5049
- Accuracy 0.7284

By class:
              precision    recall  f1-score   support

         geo     0.8207    0.9003    0.8587      3813
         tim     0.8281    0.8285    0.8283      1983
         org     0.6970    0.6338    0.6639      2029
         per     0.7691    0.7813    0.7751      1692
         gpe     0.9333    0.8941    0.9133      1596
         art     0.0000    0.0000    0.0000        36
         eve     0.0000    0.0000    0.0000        32
         nat     0.0000    0.0000    0.0000        14

   micro avg     0.8091    0.8138    0.8115     11195
   macro avg     0.5060    0.5048    0.5049     11195
weighted avg     0.8018    0.8138    0.8069     11195

2023-07-31 13:29:25,611 ----------------------------------------------------------------------------------------------------


{'test_score': 0.811489645958584,
 'dev_score_history': [0.816324718003375],
 'train_loss_history': [0.27575176944602453],
 'dev_loss_history': [0.15668772161006927]}
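
The gap between the micro F-score (0.8115) and the macro F-score (0.5049) comes from how the averages are formed: the macro average weights every class equally, so the three rare classes with an f1-score of 0 pull it down. The sketch below recomputes the macro and weighted averages from the per-class rows of the report; the helper names are illustrative.

```python
# (f1-score, support) per class, copied from the report above.
per_class = {
    'geo': (0.8587, 3813),
    'tim': (0.8283, 1983),
    'org': (0.6639, 2029),
    'per': (0.7751, 1692),
    'gpe': (0.9133, 1596),
    'art': (0.0000, 36),
    'eve': (0.0000, 32),
    'nat': (0.0000, 14),
}

def macro_f1(scores):
    """Unweighted mean of the per-class f1-scores."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(scores):
    """Mean of the per-class f1-scores, weighted by class support."""
    total_support = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total_support

print(round(macro_f1(per_class), 4))     # 0.5049
print(round(weighted_f1(per_class), 4))  # 0.8069
```

Since art, eve, and nat together account for only 82 of the 11195 test entities, they barely affect the weighted average but reduce the macro average by three eighths.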