# NER
In this notebook I define tags for the entities we need, train a recognition model on them, and print the recognition results.


This cell checks how the spaCy model for Russian works. The model used at this point is pretrained on a news dataset, so it only recognizes generic entities (person, location, time), which do not apply to our project. That is why I build a recognizer specifically for our tasks below.

This cell also confirms that spaCy responds correctly to known inputs: it tags "Ульяна" as a person and "Москве" as a location.

In [6]:
import spacy
import pandas as pd
import random
import json
from spacy.lang.ru import Russian
from spacy.pipeline import EntityRuler

import ru_core_news_lg

Read the dataset of all commands, then select the command column separately for further work.

In [2]:
import pandas as pd

all_df = pd.read_csv('all_commands.csv')
all_commands = all_df[['command']]  # reuse the loaded frame instead of reading the file twice

In [3]:
all_df

Unnamed: 0.1,Unnamed: 0,command,intent,entity
0,0,я быть в отчаяние чтобы пойти направо,move_ship_by_direction,ship_direction
1,1,я пойти наверх корабль на,move_ship_by_direction,ship_direction
2,2,давать подняться наверх,move_ship_by_direction,ship_direction
3,3,пожалуйста слева слышать,move_ship_by_direction,ship_direction
4,4,пойти я на корабль,move_ship_by_direction,ship_direction
...,...,...,...,...
11109,11109,возвращаться к свой пиратка,pirate_rebirth,none
11110,11110,я хотеть крепость в отдохнуть,pirate_rebirth,none
11111,11111,я хотеть отдохнуть в крепость,pirate_rebirth,none
11112,11112,крепость хотеть отдохнуть в я,pirate_rebirth,none


Convert the command column to a list

In [4]:
all_commands_list = all_commands['command'].tolist()

A function that writes the command list to a JSON file, which is more convenient to work with later

In [7]:
def write_list(a_list):
    with open("all_commands.json", "w", encoding='utf8') as f:
        json.dump(a_list, f, ensure_ascii=False)

write_list(all_commands_list)

Define two functions: one reads a file, the other saves to a file. Both use JSON.

In [8]:
def load_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return (data)

def save_data(file, data):
    with open (file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
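A quick way to check that the two helpers round-trip Cyrillic text unchanged is sketched below. The temporary file name and the sample commands are illustrative assumptions; the helpers are repeated so that the snippet runs on its own.

```python
import json
import os
import tempfile


def save_data(file, data):
    with open(file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


def load_data(file):
    with open(file, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample; the real list comes from all_commands.csv.
sample_commands = ["пойти налево", "подняться наверх"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_commands.json")
    save_data(path, sample_commands)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    restored = load_data(path)

print(restored == sample_commands)
# With ensure_ascii=False the Cyrillic is stored as-is, not as \uXXXX escapes.
print("налево" in raw_text)
```

Without `ensure_ascii=False` the file would still load correctly, but it would be unreadable when opened in an editor.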

Write the functions for processing commands.

In [9]:
# A separate file holds the list of entities and their label. spaCy needs a specific
# data structure for entity/label pairs (a pattern). This function builds that structure
# from the data we have.

def create_training_data(file, label):  # `label` instead of `type`, which shadows the builtin
    data = load_data(file)
    patterns = []
    for item in data:
        pattern = {
                    "label": label,
                    "pattern": item
                    }
        patterns.append(pattern)
    return (patterns)



# This function creates and saves a custom NER model that works with the patterns created above.

def generate_rules(patterns):
    nlp = Russian()
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    nlp.to_disk("jackal_ner_all_entities")
    

    
# This function processes incoming text (looks for entities) with the given model and collects
# what it finds in a list. It returns None when no entities are found.

def test_model(model, text):
    doc = model(text)  # use the model passed in, not the global nlp
    results = []
    entities = []
    for ent in doc.ents:
        entities.append((ent.start_char, ent.end_char, ent.label_))
    if len(entities) > 0:
        results = [text, {"entities": entities}] # the annotation format spaCy expects
        return (results)
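To make the annotation format concrete, here is a minimal pure-Python sketch that produces the same `[text, {"entities": [(start, end, label)]}]` structure. It is a stand-in for illustration only: it matches whole whitespace-separated tokens and is not how spaCy's entity ruler works internally. The `annotate` function and the two sample patterns are hypothetical.

```python
import re

# Hypothetical patterns in the same {"label", "pattern"} shape used above.
patterns = [
    {"label": "DIR", "pattern": "налево"},
    {"label": "TILE", "pattern": "шар"},
]


def annotate(text, patterns):
    """Return [text, {"entities": [(start, end, label), ...]}] or None."""
    lookup = {p["pattern"]: p["label"] for p in patterns}
    entities = []
    for match in re.finditer(r"\S+", text):
        label = lookup.get(match.group())
        if label is not None:
            # Character offsets, end-exclusive, like ent.start_char / ent.end_char.
            entities.append((match.start(), match.end(), label))
    if entities:
        return [text, {"entities": entities}]
    return None


example = annotate("пойти налево на шар", patterns)
print(example)
```

The offsets are positions in the original string, which is why any later change to the text (lowercasing aside) invalidates them.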
                

                                                                            

Create patterns for each entity type and merge them into a single list, so that the final model contains patterns for entities of all types

In [10]:
patterns_dir = create_training_data("NER_dir.json", "DIR")
patterns_tile = create_training_data("NER_tiles.json", "TILE")
pattern_act = create_training_data("NER_act.json", "ACT")
pattern_num = create_training_data("NER_num.json", "NUM")

all_patterns = patterns_dir + patterns_tile + pattern_act + pattern_num
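The effect of merging can be checked on in-memory data. A minimal sketch, assuming two small phrase lists in place of the JSON files; `make_patterns` and the sample phrases are hypothetical:

```python
def make_patterns(phrases, label):
    """Wrap each phrase in the {"label", "pattern"} dictionary the ruler expects."""
    return [{"label": label, "pattern": phrase} for phrase in phrases]


# Hypothetical contents standing in for NER_dir.json and NER_tiles.json.
dir_phrases = ["налево", "направо"]
tile_phrases = ["шар", "крепость"]

all_patterns = make_patterns(dir_phrases, "DIR") + make_patterns(tile_phrases, "TILE")

# One list now carries every label, so a single ruler can assign all of them.
labels = sorted({p["label"] for p in all_patterns})
print(len(all_patterns), labels)
```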

Create the entity recognition model

In [11]:
generate_rules(all_patterns)

# I merged the patterns and built a single model so that it has many labels rather than one
# print (patterns)

The function below builds the list of commands from the JSON file; the code after it creates the annotated training set, 70% of all commands

In [119]:
def get_text(file):
    data = load_data(file)
    text = []
    for item in data:
        text.append(item) 
    return (text)


nlp = spacy.load("jackal_ner_all_entities")
ALL_DATA = []
outsiders = []
text = get_text("all_commands.json")
hits = []
test_size = round(0.7 * len(text))  # size of the training part: 70% of all commands
# A single pass over the commands is enough; the outer while loop was redundant.
for command in text:
    command = command.strip()
    # command = command.replace("\n", " ")
    results = test_model(nlp, command)
    if results is not None:
        ALL_DATA.append(results)
    else:
        # temporary: keep commands without entities to track quality
        outsiders.append(command)

In [120]:
TRAIN_DATA = ALL_DATA[:test_size]

In [121]:
test_data = ALL_DATA[test_size:]
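The split itself is plain list slicing at a cut point. A minimal sketch on made-up data (the ten examples and the names `all_data`, `train_size` are illustrative). Note that in the cells above the cut point is computed from the number of commands, while the slice is applied to `ALL_DATA`, which can be shorter because commands without entities are dropped, so the actual share may differ from 70%.

```python
# Hypothetical stand-in for ALL_DATA: ten annotated examples.
all_data = [[f"command {i}", {"entities": [(0, 7, "ACT")]}] for i in range(10)]

# Compute the cut point from the list that is actually sliced.
train_size = round(0.7 * len(all_data))
train_data = all_data[:train_size]
test_data = all_data[train_size:]

print(len(train_data), len(test_data))
```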

In [122]:
save_data("ML_NER_train_data.json", TRAIN_DATA)

In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split

ALL_DATA = pd.DataFrame(ALL_DATA) # the standard train/test split was not done earlier because JSON files are needed later

In [124]:
X = ALL_DATA[[0]]
y = ALL_DATA[[1]]

In [125]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size = .7)

Save the training dataset

In [68]:
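import pickle

# TRAIN_DATA is a plain list at this point, so it has no to_pickle method; use pickle.dump
with open("ML_NER_train_data.pkl", "wb") as f:
    pickle.dump(TRAIN_DATA, f)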

## Training the model

In [436]:
import random
from spacy.training.example import Example

def train_spacy(data, epochs):
    TRAIN_DATA = data
    nlp = spacy.blank("ru")
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    else:
        ner = nlp.get_pipe("ner")
    
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
            
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.select_pipes(disable=other_pipes):
        optimizer = nlp.initialize()
        for epoch in range(epochs):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update(
                    [example],
                    sgd=optimizer,
                    losses=losses
                )
            print(f"Epoch {epoch}, losses {losses}")
    return nlp

In [437]:
TRAIN_DATA = load_data("ML_NER_train_data.json")
random.shuffle(TRAIN_DATA)



In [128]:
nlp = train_spacy(TRAIN_DATA, 30)
nlp.to_disk("jackal_ner_trained_model")

Epoch 0, losses {'ner': 685.8730080071026}
Epoch 1, losses {'ner': 183.33603484448463}
Epoch 2, losses {'ner': 100.24093842858794}
Epoch 3, losses {'ner': 129.72534167452804}
Epoch 4, losses {'ner': 65.5520702693125}
Epoch 5, losses {'ner': 77.73507472944524}
Epoch 6, losses {'ner': 80.61540234219471}
Epoch 7, losses {'ner': 53.952361240802205}
Epoch 8, losses {'ner': 18.3246503182347}
Epoch 9, losses {'ner': 72.16701022966815}
Epoch 10, losses {'ner': 51.49825774152919}
Epoch 11, losses {'ner': 45.28911730945829}
Epoch 12, losses {'ner': 2.1879180613981437}
Epoch 13, losses {'ner': 45.47298765742108}
Epoch 14, losses {'ner': 32.44536760498236}
Epoch 15, losses {'ner': 35.90443976631265}
Epoch 16, losses {'ner': 6.027575720772911e-06}
Epoch 17, losses {'ner': 21.20078511158035}
Epoch 18, losses {'ner': 26.756651870979084}
Epoch 19, losses {'ner': 27.73650418823898}
Epoch 20, losses {'ner': 74.25410757470138}
Epoch 21, losses {'ner': 18.282319861887448}
Epoch 22, losses {'ner': 18.13120

In [434]:
import re
import spacy
import nltk
nltk.download("stopwords")
#--------#

from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation

#Create lemmatizer and stopwords list
mystem = Mystem() 
russian_stopwords = stopwords.words("russian")



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pabakst/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [495]:
#Preprocess function
def preprocess_text(text):
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    
    text = " ".join(tokens)
    
    return text

def input_for_gui (test):
    arrow = ['стрелка', 'указатель']
    balloon = ['воздушный шар', 'шар']
    barrel = ['бочка']
    cannibal = ['людоед']
    castle = ['крепость']
    castle_girl = ["девушка", "абориген", "аборигенка"]
    croc = ['крокодил']
    gold = ['сундук', 'деньги', 'сокровища',"золотишко", "монета", "золото", "деньга", "монетка", 
            "клад", "сундучок", "мелочь", "сокровище", "сокровищница"]
    horse = ['конь', 'лошадь']
    ice = ['лед', 'лёд']
    labyrinth = [ 'лабиринт', 'джунгли', 'пустыня', 'болото',
        'горы', "тропик", "пустынь", "заросль", "гора", "скала" , "лес", 
         "песок", "дюна","тропический", "леса"]
    plane = ['самолет', "самолёт"]
    trap = ['капкан', 'ловушка']
    cannon = ['пушка']
    field = ['поляна', 'пустышка', "клетка",
         "клеточка","холм"]
    
    
#     test = 'первым пиратом пойду налево и попаду на шар'
    test = preprocess_text(test)
    nlp = spacy.load("jackal_ner_trained_model") # entities are detected by the trained model
    doc = nlp(test)
    dict_ = {}
    for ent in doc.ents:
        dict_[ent.label_] = ent.text
#         print (ent.text, ent.label_)
    print(dict_)
    
    number = 101  # default when no pirate number is recognized
    if "NUM" in dict_.keys():
        if "пер" in dict_["NUM"]:
            number = 1
        elif "втор" in dict_["NUM"]:
            number = 2     
        elif "тр" in dict_["NUM"]:
            number = 3
        
    if "DIR" in dict_.keys():
        if "лев" in dict_["DIR"]:
            direction = 0
        elif "прав" in dict_["DIR"]:
            direction = 1   
        elif "прям" in dict_["DIR"]:
            direction = 2  
        elif "наз" in dict_["DIR"]:
            direction = 3
    
        return number, direction   
            
    elif "DIR" not in dict_.keys()and "TILE" in dict_.keys() :
        if dict_["TILE"] in arrow:
            tile = 1
        elif dict_["TILE"] in balloon:
            tile = 8   
        elif dict_["TILE"] in barrel:
            tile = 9     
        elif dict_["TILE"] in cannibal:
            tile = 10   
        elif dict_["TILE"] in castle:
            tile = 11             
        elif dict_["TILE"] in castle_girl:
            tile = 12  
        elif dict_["TILE"] in croc:
            tile = 13 
        elif dict_["TILE"] in gold:
            tile = 14
        elif dict_["TILE"] in horse:
            tile = 19            
        elif dict_["TILE"] in ice:
            tile = 20            
        elif dict_["TILE"] in labyrinth:
            tile = 21   
        elif dict_["TILE"] in field:
            tile = 25               
        elif dict_["TILE"] in plane:
            tile = 29               
        elif dict_["TILE"] in trap:
            tile = 30   
        elif dict_["TILE"] in cannon:
            tile = 31             
    
        return number, tile  
            
# print(results[1])

In [497]:
input_for_gui("первым пиратом пойду попаду на шар")

{'ACT': 'пират', 'TILE': 'шар'}
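The long `elif` chain in `input_for_gui` can also be expressed as a lookup table. A minimal sketch, assuming only a few of the tile groups (words and codes copied from the function above); `TILE_GROUPS`, `TILE_CODES`, `DIRECTION_STEMS` and `direction_code` are hypothetical names, not part of the notebook:

```python
# Tile codes copied from the elif chain above; only a few groups are shown.
TILE_GROUPS = {
    1: ["стрелка", "указатель"],
    8: ["воздушный шар", "шар"],
    9: ["бочка"],
    11: ["крепость"],
}

# Invert the groups into a word -> code dictionary.
TILE_CODES = {word: code for code, words in TILE_GROUPS.items() for word in words}

# Direction stems and their codes, in the same order as in the function above.
DIRECTION_STEMS = [("лев", 0), ("прав", 1), ("прям", 2), ("наз", 3)]


def direction_code(text):
    """Return the code of the first stem found in text, or None if nothing matches."""
    for stem, code in DIRECTION_STEMS:
        if stem in text:
            return code
    return None


print(TILE_CODES.get("шар"), direction_code("налево"), direction_code("вверх"))
```

Returning `None` for an unknown word makes the "nothing recognized" case explicit, instead of leaving a variable unassigned.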


In [489]:
# test = 'первым пиратом пойду налево и попаду на шар'


# test = preprocess_text(test)
# nlp = spacy.load("jackal_ner_all_entities") # here all entities are found mechanically, by rules
# doc = nlp(test)
# results = test_model(nlp, command)
# for ent in doc.ents:
#     print (ent.text, ent.label_)
# # print(results[1])

## Computing the F-measure

In [352]:
X_test_list = X_test[0].tolist()

In [397]:
nlp = spacy.load("jackal_ner_trained_model")
y_pred = []
for i in X_test_list:
#     test = preprocess_text(i) -- changes the character offsets of words, be careful when comparing by offset
    results = test_model(nlp, i)
    if results is None:
        y_pred.append("NaN")
#         print('NaN')
    else:
        y_pred.append(results[1])
#         print(results[1])
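The warning about offsets can be seen on a tiny example. A minimal sketch, assuming a three-word stopword list in place of NLTK's and no lemmatization; `drop_stopwords` is a hypothetical helper:

```python
# Hypothetical stopword list; the notebook uses NLTK's Russian stopwords.
STOPWORDS = {"я", "на", "и"}


def drop_stopwords(text):
    """Remove stopwords and rejoin the remaining tokens with single spaces."""
    return " ".join(t for t in text.split() if t not in STOPWORDS)


original = "я пойти на шар"
cleaned = drop_stopwords(original)

# The same word starts at a different character index after preprocessing.
start_original = original.find("шар")
start_cleaned = cleaned.find("шар")
print(start_original, start_cleaned)
```

Because the gold annotations store character offsets for the unprocessed text, predictions made on preprocessed text would never match them exactly.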

In [398]:
y_true = y_test[1].tolist() 

In [357]:
# need to compare the values in y_pred and y_true, the ratio of y_pred to y_test

# print(f'type(y_true) = {type(y_true)}, type(y_pred) = {type(y_pred)}')
# len(y_pred) == len(y_true)

### Metrics

In [399]:
print(y_pred[12], y_true[12])

NaN {'entities': [(2, 7, 'TILE')]}


In [400]:
TP, TN, FP, FN = 0, 0, 0, 0

for i in range(0,len(y_true)):
    if y_true[i] == y_pred[i]:
        TP += 1
    elif y_true[i] != "NaN" and y_pred[i] == "NaN":
        FN += 1
    elif y_true[i] == "NaN" and y_pred[i] != "NaN":
        FP += 1     
    elif y_true[i] == "NaN" and y_pred[i] == "NaN":
        TN += 1 
        
print(f'Share of exact matches among all rows (compared by index): {TP/len(y_true)}')

Share of exact matches among all rows (compared by index): 0.9741379310344828


In [401]:
len(y_true)

3016

In [404]:
print(f'Counts (TP, TN, FP, FN) for the NER model: {TP, TN, FP, FN}')

Counts (TP, TN, FP, FN) for the NER model: (2938, 0, 0, 46)
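The printed counts add up to 2938 + 46 = 2984, while `y_true` has 3016 rows. The remaining 32 rows are cases where both sides contain entities but the annotations differ, which none of the branches counts. A minimal sketch of a counter with an explicit bucket for that case, on made-up data (`count_outcomes` and the `mismatch` key are hypothetical names). It also checks the both-missing case first, since in the loop above `"NaN" == "NaN"` would be counted as a TP:

```python
def count_outcomes(y_true, y_pred, missing="NaN"):
    """Count row-level outcomes, keeping differing annotations in their own bucket."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "mismatch": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == missing and pred == missing:
            counts["TN"] += 1
        elif truth == pred:
            counts["TP"] += 1
        elif pred == missing:
            counts["FN"] += 1
        elif truth == missing:
            counts["FP"] += 1
        else:
            counts["mismatch"] += 1  # both found entities, but they differ
    return counts


# Made-up rows: one exact match, one miss, one wrong label.
toy_true = [
    {"entities": [(0, 3, "TILE")]},
    {"entities": [(0, 3, "TILE")]},
    {"entities": [(0, 3, "DIR")]},
]
toy_pred = [
    {"entities": [(0, 3, "TILE")]},
    "NaN",
    {"entities": [(0, 3, "TILE")]},
]

toy_counts = count_outcomes(toy_true, toy_pred)
print(toy_counts)
```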


In [409]:
# Manual F-measure

def f1_score (TP, TN, FP, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = (2 * precision * recall) / (precision + recall)  # avoid shadowing the function name
    
    return f1

In [410]:
print(f1_score(TP, TN, FP, FN))

0.9922323539344816


In [411]:
print(f'F1 score of the NER model: {f1_score(TP, TN, FP, FN)}') # in fact it is this high because there are no TN and FP

F1 score of the NER model: 0.9922323539344816
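With FP = 0 the precision is exactly 1, so the score above is determined by recall alone. The arithmetic, using the counts printed earlier, works out as follows (a worked check of the same formula, not a new metric):

```python
TP, TN, FP, FN = 2938, 0, 0, 46  # counts printed above

precision = TP / (TP + FP)   # 2938 / 2938
recall = TP / (TP + FN)      # 2938 / 2984
f1 = 2 * precision * recall / (precision + recall)

# Equivalent closed form: 2*TP / (2*TP + FP + FN)
f1_closed_form = 2 * TP / (2 * TP + FP + FN)

print(precision, recall, f1, f1_closed_form)
```

Rows where both annotations are present but differ do not enter any of the four counts, so they do not lower this score.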


## A different annotation format for computing more meaningful metrics

### The same data in another format: word + class instead of character indices

#### Exploring whether the model can be fine-tuned:

1) Source: https://www.freecodecamp.org/news/getting-started-with-ner-models-using-huggingface/

In [199]:
# build a raw dataset with two columns: word, label
# # write a loop:
#     # for each command:
#     tokenize it and collect the tokens in a list
    
#     go through the list and process each word: one column for the word, one for the label;
#         label a word if it is in one of the lists from the loaded files (see how the patterns were loaded)
    
#     # 
#     print the dataset
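One way to get from the index-based annotations to word + class rows is to project each character span onto whitespace tokens. A minimal sketch, assuming whitespace tokenization and spans that align with token boundaries; `spans_to_token_labels` is a hypothetical helper:

```python
import re


def spans_to_token_labels(text, entities):
    """Turn (start, end, label) spans into one (word, label) pair per whitespace token."""
    rows = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for start, end, ent_label in entities:
            # A token takes the label of a span that fully covers it.
            if start <= match.start() and match.end() <= end:
                label = ent_label
                break
        rows.append((match.group(), label))
    return rows


rows = spans_to_token_labels(
    "пойти налево на шар",
    [(6, 12, "DIR"), (16, 19, "TILE")],
)
print(rows)
```

In this format predictions and gold labels can be compared token by token, so a partially correct command still contributes correct tokens.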

In [200]:
!pip install datasets
!pip install tokenizers
!pip install transformers

Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting pyarrow>=8.0.0 (from datasets)
  Downloading pyarrow-12.0.0-cp39-cp39-macosx_10_14_x86_64.whl (24.8 MB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp39-cp39-macosx_10_9_x86_64.whl (35 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting responses<0.19 (from datasets)
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (

In [201]:
from datasets import load_dataset

dataset = load_dataset("wikiann", "bn")

Downloading and preparing dataset wikiann/bn to /Users/pabakst/.cache/huggingface/datasets/wikiann/bn/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e...


Dataset wikiann downloaded and prepared to /Users/pabakst/.cache/huggingface/datasets/wikiann/bn/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e. Subsequent calls will reuse this data.



In [202]:
label_names = dataset["train"].features["ner_tags"].feature.names


In [203]:
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

In [279]:
label_names_our = ['O', 'B-ACT', 'B-NUM', 'B-DIR', 'B-TILE']
# label_names_our = ['O', 'B-ACT', 'I-ACT', 'B-NUM', 'I-NUM', 'B-DIR', 'I-DIR', 'B-TILE', 'I-TILE']

In [280]:
label_names_our

['O', 'B-ACT', 'B-NUM', 'B-DIR', 'B-TILE']
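The commented-out label list with `I-` tags matters once an entity spans several tokens, such as the tile name "воздушный шар". A minimal sketch of B-/I- tagging for multi-token phrases, assuming whitespace tokens and exact phrase matches; `bio_tags` is a hypothetical helper:

```python
def bio_tags(tokens, phrases, label):
    """Tag tokens with B-/I- labels for every occurrence of the given phrases."""
    tags = ["O"] * len(tokens)
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            # Skip positions that an earlier (longer) phrase already tagged.
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if tokens[i:i + n] == words and window_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return tags


tokens = "попасть на воздушный шар".split()
# Longer phrases come first, so "воздушный шар" wins over the bare "шар".
tile_phrases = ["воздушный шар", "шар"]
tile_tags = bio_tags(tokens, tile_phrases, "TILE")
print(tile_tags)
```

With B- tags only, the two tokens would be labeled as two separate TILE entities.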

In [None]:
# to be continued from the articles...

In [212]:
# make one long text out of the whole dataset:
all_commands_str = ' '.join(all_commands_list)

In [267]:
all_doc = nlp(all_commands_str)

all_tokens = [token.text for token in all_doc]
# print(all_tokens)

In [268]:
data_DIR = load_data("NER_dir.json")
data_TILE = load_data("NER_tile.json")
data_ACT = load_data("NER_act.json")
data_NUM = load_data("NER_NUM.json")


In [438]:
d = {}
word = []
label = []

# Label every token with a B- tag if it is in one of the entity lists, otherwise 'O'.
# I- tags (for a token that follows another token of the same type) are left out for now.
for current_word in all_tokens:
    word.append(current_word)

    if current_word in data_DIR:
        label.append('B-DIR')
    elif current_word in data_ACT:
        label.append('B-ACT')
    elif current_word in data_NUM:
        label.append('B-NUM')
    elif current_word in data_TILE:
        label.append('B-TILE')
    else:
        label.append('O')


d["word"] = word
d["label"] = label

In [439]:
words_labeled_df = pd.DataFrame(data=d)

In [459]:
words_labeled_df.to_csv('labeled.tsv',
sep='\t', # Tab separator
header=True,
index=False,
index_label = False,
encoding='utf-8'
)

In [460]:
# Convert .tsv file to json format. 
import json
import logging
import sys
def tsv_to_json_format(input_path,output_path,unknown_label):
    try:
        f=open(input_path,'r') # input file
        fp=open(output_path, 'w') # output file
        data_dict={}
        annotations =[]
        label_dict={}
        s=''
        start=0
        for line in f:
            if line[0:len(line)-1]!='.\tO':
                word,entity=line.split('\t')
                s+=word+" "
                entity=entity[:len(entity)-1]
                if entity!=unknown_label:
                    if len(entity) != 1:
                        d={}
                        d['text']=word
                        d['start']=start
                        d['end']=start+len(word)-1  
                        try:
                            label_dict[entity].append(d)
                        except:
                            label_dict[entity]=[]
                            label_dict[entity].append(d) 
                start+=len(word)+1
            else:
                data_dict['content']=s
                s=''
                label_list=[]
                for ents in list(label_dict.keys()):
                    for i in range(len(label_dict[ents])):
                        if(label_dict[ents][i]['text']!=''):
                            l=[ents,label_dict[ents][i]]
                            for j in range(i+1,len(label_dict[ents])): 
                                if(label_dict[ents][i]['text']==label_dict[ents][j]['text']):  
                                    di={}
                                    di['start']=label_dict[ents][j]['start']
                                    di['end']=label_dict[ents][j]['end']
                                    di['text']=label_dict[ents][i]['text']
                                    l.append(di)
                                    label_dict[ents][j]['text']=''
                            label_list.append(l)                          
                            
                for entities in label_list:
                    label={}
                    label['label']=[entities[0]]
                    label['points']=entities[1:]
                    annotations.append(label)
                data_dict['annotation']=annotations
                annotations=[]
                json.dump(data_dict, fp)
                fp.write('\n')
                data_dict={}
                start=0
                label_dict={}
    except Exception as e:
        logging.exception("Could not convert the file" + "\n" + "error = " + str(e))
        return None



In [461]:
tsv_to_json_format("labeled.tsv",'labeled.json','abc')

In [265]:
# words_labeled_df_2 = words_labeled_df.groupby('label').count()

# Japanese tutorial:

In [302]:
# # compilers and development settings
# sudo apt-get update
# sudo apt install -y gcc
# sudo apt-get install -y make

# # install CUDA 11.4.4 (because I use old generation K80 GPU)
# wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_linux.run
# sudo sh cuda_11.4.4_470.82.01_linux.run
# echo -e "export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64" >> ~/.bashrc
# source ~/.bashrc

# # install and upgrade pip
# sudo apt-get install -y python3-pip
# sudo -H pip3 install --upgrade pip

# # install pytorch with GPU accelerated
# # (see https://pytorch.org/get-started/locally/ )
# pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu114

# # install sentencepiece for multi-lingual modeling
# pip3 install omegaconf hydra-core fairseq sentencepiece

# # install huggingface transformer with deepspeed
# sudo apt install python3-mpi4py
# sudo apt-get install ninja-build
# pip3 install transformers[deepspeed] datasets

# # install additional packages
# pip3 install numpy seqeval pandas matplotlib scikit-learn

# # install jupyter if you run code in notebook
# pip3 install jupyter

SyntaxError: invalid syntax (2533609562.py, line 2)

In [304]:
import json
from datasets import Dataset, Features, Sequence, Value, ClassLabel

# load dataset
with open("ner_yap.json") as f:
  json_all = json.load(f)

# create character-based annotated dataset
tokens_list = []
ner_tags_list = []
for json_dat in json_all:
  tokens = list(json_dat["text"])
  ner_tags = ["O"] * len(tokens)
  for ent in json_dat["entities"]:
    for i in range(ent["span"][0], ent["span"][1]):
      # See https://github.com/stockmarkteam/ner-wikipedia-dataset
      if ent["type"] == "人名":  # person
        ner_tags[i] = "PER"
      elif ent["type"] == "法人名":  # organization (corporation general)
        ner_tags[i] = "ORG"
      elif ent["type"] == "政治的組織名":  # organization (political)
        ner_tags[i] = "ORG-P"
      elif ent["type"] == "その他の組織名":  # organization (others)
        ner_tags[i] = "ORG-O"
      elif ent["type"] == "地名":  # location
        ner_tags[i] = "LOC"
      elif ent["type"] == "施設名":  # institution (facility)
        ner_tags[i] = "INS"
      elif ent["type"] == "製品名":  # product
        ner_tags[i] = "PRD"
      elif ent["type"] == "イベント名":  # event
        ner_tags[i] = "EVT"
  tokens_list.append(tokens)
  ner_tags_list.append(ner_tags)

features = Features({
  "tokens": Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  "ner_tags": Sequence(feature=ClassLabel(names=["O", "PER", "ORG", "ORG-P", "ORG-O", "LOC", "INS", "PRD", "EVT"], id=None), length=-1, id=None)
})
ds = Dataset.from_dict(
  {"tokens": tokens_list, "ner_tags": ner_tags_list},
  features=features
)

# generate converter for index(int)-to-tag(string) and tag(string)-to-index(int)
tags = ds.features["ner_tags"].feature
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

# separate dataset into train dataset and validation dataset
ds = ds.train_test_split(test_size=0.1, shuffle=True)

ClassLabel(names=['O', 'PER', 'ORG', 'ORG-P', 'ORG-O', 'LOC', 'INS', 'PRD', 'EVT'], id=None)

In [320]:
import json
from datasets import Dataset, Features, Sequence, Value, ClassLabel


features = Features({
  "tokens": Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  "ner_tags": Sequence(feature=ClassLabel(names=['O', 'ACT', 'NUM', 'DIR', 'TILE'], id=None), length=-1, id=None)
})
# ds = Dataset.from_dict(
#   {"tokens": word, "ner_tags": label},
#   features=features
# )

# generate converter for index(int)-to-tag(string) and tag(string)-to-index(int)
tags = features["ner_tags"].feature
# index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
# tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

# # separate dataset into train dataset and validation dataset
# ds = ds.train_test_split(test_size=0.1, shuffle=True)
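The two converters are simple dictionaries over the tag list. A plain-Python sketch for the project's five tags, assuming the same order as the `ClassLabel` names above (`tag_names` is an illustrative variable):

```python
tag_names = ["O", "ACT", "NUM", "DIR", "TILE"]

# index -> tag and tag -> index, built from the position in the list
index2tag = dict(enumerate(tag_names))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# Encode a tag sequence to integer ids and decode it back.
encoded = [tag2index[t] for t in ["O", "DIR", "O", "TILE"]]
decoded = [index2tag[i] for i in encoded]
print(encoded, decoded)
```

These are the mappings a token-classification config expects as `id2label` and `label2id`.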

In [332]:
from transformers import AutoTokenizer

# load tokenizer of pre-trained XLM-RoBERTa model
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# define function for dataset conversion
def tokenize_and_align_labels(data):
  text = ["".join(t) for t in data["tokens"]]
  # tokenized_inputs = xlmr_tokenizer(text, truncation=True, max_length=512)
  tokenized_inputs = xlmr_tokenizer(text)

  #
  # map label to the new token
  #
  # [example]
  #   org token (data)      : ["松", "崎", "は", "日", "本", "に", "い", "る"]
  #   new token (tokenized_inputs): ["_", "松", "崎", "は", "日本", "に", "いる"]
  labels = []
  for row_idx, label_old in enumerate(data["ner_tags"]):
    # label is initialized as [[], [], [], [], [], [], []]
    label_new = [[] for t in tokenized_inputs.tokens(batch_index=row_idx)]
    # label becomes [[1], [1], [1], [0], [5, 5], [0], [0, 0]]
    for char_idx in range(len(data["tokens"][row_idx])):
      token_idx = tokenized_inputs.char_to_token(row_idx, char_idx)
      if token_idx is not None:
        label_new[token_idx].append(data["ner_tags"][row_idx][char_idx])
        if (tokenized_inputs.tokens(batch_index=row_idx)[token_idx] == "▁") and (data["ner_tags"][row_idx][char_idx] != 0):
          label_new[token_idx+1].append(data["ner_tags"][row_idx][char_idx])
    # label becomes [1, 1, 1, 0, 5, 0, 0]
    label_new = list(map(lambda i : max(i, default=0), label_new))
    # append result
    labels.append(label_new)

  tokenized_inputs["labels"] = labels
  return tokenized_inputs

# run conversion
tokenized_ds = ds.map(
  tokenize_and_align_labels,
  remove_columns=["tokens", "ner_tags"],
  batched=True,
  batch_size=128)
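The core of the alignment step is that each new token takes the largest label id among the characters it covers. A simplified sketch of just that rule, without a tokenizer and without the special handling of the "▁" token; the character labels and token boundaries below are made up, and `align_labels` is a hypothetical helper:

```python
def align_labels(char_labels, token_spans):
    """Give each token the maximum label id among the characters it covers."""
    return [max(char_labels[start:end], default=0) for start, end in token_spans]


# Hypothetical example: 8 characters, label ids 0 = O, 1 = PER, 5 = LOC.
char_labels = [1, 1, 0, 5, 5, 0, 0, 0]
# Hypothetical subword boundaries as (start, end) character offsets, end-exclusive.
token_spans = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 6), (6, 8)]

token_labels = align_labels(char_labels, token_spans)
print(token_labels)
```

Taking the maximum means a token that mixes entity and non-entity characters is labeled as the entity, since `O` has id 0.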

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/xlm-roberta-base/resolve/main/config.json from cache at /Users/pabakst/.cache/huggingface/transformers/87683eb92ea383b0475fecf99970e950a03c9ff5e51648d6eee56fb754612465.dfaaaedc7c1c475302398f09706cbb21e23951b73c6e2b3162c1c8a99bb3b62a
Model config XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_versi

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

storing https://huggingface.co/xlm-roberta-base/resolve/main/sentencepiece.bpe.model in cache at /Users/pabakst/.cache/huggingface/transformers/9df9ae4442348b73950203b63d1b8ed2d18eba68921872aee0c3a9d05b9673c6.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
creating metadata file for /Users/pabakst/.cache/huggingface/transformers/9df9ae4442348b73950203b63d1b8ed2d18eba68921872aee0c3a9d05b9673c6.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
https://huggingface.co/xlm-roberta-base/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /Users/pabakst/.cache/huggingface/transformers/tmpxjwnekv2


Downloading:   0%|          | 0.00/8.68M [00:00<?, ?B/s]

storing https://huggingface.co/xlm-roberta-base/resolve/main/tokenizer.json in cache at /Users/pabakst/.cache/huggingface/transformers/daeda8d936162ca65fe6dd158ecce1d8cb56c17d89b78ab86be1558eaef1d76a.a984cf52fc87644bd4a2165f1e07e0ac880272c1e82d648b4674907056912bd7
creating metadata file for /Users/pabakst/.cache/huggingface/transformers/daeda8d936162ca65fe6dd158ecce1d8cb56c17d89b78ab86be1558eaef1d76a.a984cf52fc87644bd4a2165f1e07e0ac880272c1e82d648b4674907056912bd7
loading file https://huggingface.co/xlm-roberta-base/resolve/main/sentencepiece.bpe.model from cache at /Users/pabakst/.cache/huggingface/transformers/9df9ae4442348b73950203b63d1b8ed2d18eba68921872aee0c3a9d05b9673c6.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
loading file https://huggingface.co/xlm-roberta-base/resolve/main/tokenizer.json from cache at /Users/pabakst/.cache/huggingface/transformers/daeda8d936162ca65fe6dd158ecce1d8cb56c17d89b78ab86be1558eaef1d76a.a984cf52fc87644bd4a2165f1e07e0ac880272c1e82

Map:   0%|          | 0/4808 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]

In [333]:
import torch
from transformers import AutoConfig
from transformers.models.roberta.modeling_roberta import RobertaForTokenClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

xlmr_config = AutoConfig.from_pretrained(
  "xlm-roberta-base",
  num_labels=tags.num_classes,
  id2label=index2tag,
  label2id=tag2index
)
model = (RobertaForTokenClassification
         .from_pretrained("xlm-roberta-base", config=xlmr_config)
         .to(device))

loading configuration file https://huggingface.co/xlm-roberta-base/resolve/main/config.json from cache at /Users/pabakst/.cache/huggingface/transformers/87683eb92ea383b0475fecf99970e950a03c9ff5e51648d6eee56fb754612465.dfaaaedc7c1c475302398f09706cbb21e23951b73c6e2b3162c1c8a99bb3b62a
Model config XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "ACT",
    "2": "NUM",
    "3": "DIR",
    "4": "TILE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "ACT": 1,
    "DIR": 3,
    "NUM": 2,
    "O": 0,
    "TILE": 4
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_p

Downloading:   0%|          | 0.00/1.04G [00:00<?, ?B/s]

storing https://huggingface.co/xlm-roberta-base/resolve/main/pytorch_model.bin in cache at /Users/pabakst/.cache/huggingface/transformers/97d0ea09f8074264957d062ec20ccb79af7b917d091add8261b26874daf51b5d.f42212747c1c27fcebaa0a89e2a83c38c6d3d4340f21922f892b88d882146ac2
creating metadata file for /Users/pabakst/.cache/huggingface/transformers/97d0ea09f8074264957d062ec20ccb79af7b917d091add8261b26874daf51b5d.f42212747c1c27fcebaa0a89e2a83c38c6d3d4340f21922f892b88d882146ac2
loading weights file https://huggingface.co/xlm-roberta-base/resolve/main/pytorch_model.bin from cache at /Users/pabakst/.cache/huggingface/transformers/97d0ea09f8074264957d062ec20ccb79af7b917d091add8261b26874daf51b5d.f42212747c1c27fcebaa0a89e2a83c38c6d3d4340f21922f892b88d882146ac2
Some weights of the model checkpoint at xlm-roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'l

In [415]:
import torch
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel, RobertaPreTrainedModel

class MyCustomizationSampleModel(RobertaPreTrainedModel):
  _keys_to_ignore_on_load_unexpected = [r"pooler"]  # because we don't add pooling
  _keys_to_ignore_on_load_missing = [r"position_ids"]

  def __init__(self, config):
    super().__init__(config)
    
    #
    # The attribute names ("roberta", etc.) matter: pretrained weights are matched by name,
    # so renaming a layer can cause its weights and biases to be skipped when a checkpoint is loaded or saved.
    #

    self.num_labels = config.num_labels
    # hf roberta model
    self.roberta = RobertaModel(config, add_pooling_layer=False)
    # linear for classification
    self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)
    self.linear = torch.nn.Linear(config.hidden_size, self.num_labels)
    # initialize weights (post_init() supersedes the older init_weights() call)
    self.post_init()

  def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
    # build model
    roberta_output = self.roberta(
      input_ids,
      attention_mask=attention_mask,
      token_type_ids=token_type_ids, # this will always be None
      **kwargs
    )
    x = self.dropout(roberta_output[0])
    logits = self.linear(x)
    # calculate loss if labels are provided
    loss = None
    if labels is not None:
      cross_entropy = torch.nn.CrossEntropyLoss()
      loss = cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
    # return result
    return TokenClassifierOutput(
      loss=loss,
      logits=logits,
      hidden_states=roberta_output.hidden_states,
      attentions=roberta_output.attentions
    )
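
The loss step in `forward` flattens the logits to one row per token and relies on `torch.nn.CrossEntropyLoss` skipping positions labelled `-100` (its default `ignore_index`). The sketch below reproduces that arithmetic in plain Python so the masking is visible. The function name `masked_cross_entropy` and the toy inputs are illustrative assumptions, not part of the notebook.

```python
import math

IGNORE_INDEX = -100  # label value the loss skips (special tokens, padding)

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-token score lists, already flattened to
            (batch * seq_len, num_labels)
    labels: list of integer label ids of the same length
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing to the mean
        # log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else float("nan")

# Toy input: three positions, five labels, uniform scores; the middle one is ignored.
flat_logits = [[0.0] * 5, [0.0] * 5, [0.0] * 5]
flat_labels = [1, -100, 3]
loss = masked_cross_entropy(flat_logits, flat_labels)
```

With uniform scores every counted position costs `log(num_labels)`, which is a convenient sanity value for an untrained classification head.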

In [422]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir = "xlm-roberta-ner-try",
  log_level = "error",
  num_train_epochs = 1,
  per_device_train_batch_size = 12,
  per_device_eval_batch_size = 12,
  evaluation_strategy = "epoch",
#   fp16 = True,
  logging_steps = len(tokenized_ds["train"]),
  push_to_hub = False
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [424]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(
  xlmr_tokenizer,
  return_tensors="pt")

In [425]:
import numpy as np
from seqeval.metrics import f1_score

def metrics_func(eval_arg):
  preds = np.argmax(eval_arg.predictions, axis=2)
  batch_size, seq_len = preds.shape
  y_true, y_pred = [], []
  for b in range(batch_size):
    true_label, pred_label = [], []
    for s in range(seq_len):
      if eval_arg.label_ids[b, s] != -100:  # -100 must be ignored
        true_label.append(index2tag[eval_arg.label_ids[b][s]])
        pred_label.append(index2tag[preds[b][s]])
    y_true.append(true_label)
    y_pred.append(pred_label)
  return {"f1": f1_score(y_true, y_pred)}

In [426]:
from transformers import Trainer

trainer = Trainer(
  model = model,
  args = training_args,
  data_collator = data_collator,
  compute_metrics = metrics_func,
  train_dataset = tokenized_ds["train"],
  eval_dataset = tokenized_ds["test"],
  tokenizer = xlmr_tokenizer
)

In [419]:
trainer.train()

***** Running training *****
  Num examples = 4808
  Num Epochs = 7
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2107
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: ERROR Error while calling W&B API: user is not logged in (<Response [401]>)



Problem at: /Users/pabakst/Conda/anaconda3/lib/python3.9/site-packages/transformers/integrations.py 593 setup


CommError: Run initialization has timed out after 60.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
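
The training run above stops because the Weights & Biases integration cannot log in. The log output itself names the switch that turns it off; a minimal sketch of applying it, assuming the variable is set before `TrainingArguments` and `Trainer` are created:

```python
import os

# Disable the automatic Weights & Biases logging, as suggested by the
# Trainer log message. Set this before TrainingArguments/Trainer are built.
os.environ["WANDB_DISABLED"] = "true"
```

Passing `report_to="none"` to `TrainingArguments` should have a similar effect, although which of the two takes precedence may depend on the installed `transformers` version.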

In [427]:
import os
import torch
from transformers import AutoConfig

# save the fine-tuned model locally
os.makedirs("./trained_ner_classifier_try", exist_ok=True)
if hasattr(trainer.model, "module"):
  trainer.model.module.save_pretrained("./trained_ner_classifier_try")
else:
  trainer.model.save_pretrained("./trained_ner_classifier_try")

# load from the saved checkpoint
xlmr_config = AutoConfig.from_pretrained(
  "xlm-roberta-base",
  num_labels=tags.num_classes,
  id2label=index2tag,
  label2id=tag2index
)
model = (RobertaForTokenClassification
         .from_pretrained("./trained_ner_classifier_try", config=xlmr_config)
         .to(device))

In [429]:
from datasets import Dataset
import torch
from torch.utils.data import DataLoader
import pandas as pd

# create dataset for prediction
sample_encoding = xlmr_tokenizer([
  "первым пиратом пойду налево",
], truncation=True, max_length=512)
sample_dataset = Dataset.from_dict(sample_encoding)
sample_dataset = sample_dataset.with_format("torch")

# predict
sample_dataloader = DataLoader(sample_dataset, batch_size=1)
tokens = []
labels = []
for batch in sample_dataloader:
  # predict
  with torch.no_grad():
    output = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
  predicted_label_id = torch.argmax(output.logits, axis=-1).cpu().numpy()
  # create output
  tokens.append(xlmr_tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
  labels.append([index2tag[i] for i in predicted_label_id[0]])

# show the first result
pd.DataFrame([tokens[0], labels[0]], index=["Tokens", "Tags"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,<s>,▁первым,▁,пира,том,▁пойду,▁на,ле,во,</s>
Tags,O,O,O,O,O,O,O,O,O,O
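
The table above is indexed by SentencePiece pieces, not by words, which makes the tags hard to read. A minimal sketch of merging pieces back into words and keeping the tag of each word's first piece follows. The helper `merge_subwords` is hypothetical, and taking the first piece's tag is one possible convention, not something the notebook prescribes.

```python
def merge_subwords(tokens, tags, specials=("<s>", "</s>", "<pad>")):
    """Group SentencePiece pieces into words; keep the first piece's tag."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token in specials:
            continue  # drop special tokens entirely
        if token.startswith("▁") or not words:
            # "▁" marks the start of a new word
            words.append(token[1:] if token.startswith("▁") else token)
            word_tags.append(tag)
        else:
            # continuation piece: glue it onto the current word
            words[-1] += token
    return words, word_tags

# Tokens copied from the prediction output above.
tokens = ["<s>", "▁первым", "▁", "пира", "том", "▁пойду", "▁на", "ле", "во", "</s>"]
tags = ["O"] * len(tokens)
words, word_tags = merge_subwords(tokens, tags)
```

Since every predicted tag here is `O`, the merged view only confirms that the model has not learned anything yet, which is consistent with the training cell having failed.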


# Indian tutorial:

In [270]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Tokenize pre-split words and align the word-level NER tags with the resulting subword tokens
def tokenize_adjust_labels(all_samples_per_split):
  tokenized_samples = tokenizer.batch_encode_plus(all_samples_per_split["tokens"], is_split_into_words=True)
  # tokenized_samples is not a datasets object, so it cannot be passed to the Trainer API directly;
  # dataset.map() adds the new keys (input_ids, adjusted labels) to every train/test/validation split
  total_adjusted_labels = []
  print(len(tokenized_samples["input_ids"]))
  for k in range(len(tokenized_samples["input_ids"])):
    prev_wid = -1
    word_ids_list = tokenized_samples.word_ids(batch_index=k)
    existing_label_ids = all_samples_per_split["ner_tags"][k]
    i = -1
    adjusted_label_ids = []

    for wid in word_ids_list:
      if wid is None:
        # special tokens get -100 so that the loss ignores them
        adjusted_label_ids.append(-100)
      elif wid != prev_wid:
        # first subword of a new word: move on to the next word label
        i = i + 1
        adjusted_label_ids.append(existing_label_ids[i])
        prev_wid = wid
      else:
        # later subwords inherit the label of their word
        adjusted_label_ids.append(existing_label_ids[i])

    total_adjusted_labels.append(adjusted_label_ids)
  tokenized_samples["labels"] = total_adjusted_labels
  return tokenized_samples

tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True)

1000

1000

1000
1000
1000
1000
1000
1000
1000
1000
1000
1000
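
The alignment rule inside `tokenize_adjust_labels` can be checked without a tokenizer by feeding it a hand-written `word_ids` list. This is a minimal sketch: the function `align_labels` and its `label_all_pieces` flag are illustrative assumptions, and unlike the cell above it indexes the labels by word id instead of a running counter.

```python
def align_labels(word_ids, word_labels, ignore_index=-100, label_all_pieces=True):
    """Expand word-level labels to subword level.

    word_ids: one entry per subword token, None for special tokens
    word_labels: one label id per original word
    label_all_pieces: if False, only the first piece of a word is labelled
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)       # special token
        elif wid != prev:
            aligned.append(word_labels[wid])   # first piece of a word
        else:
            # continuation piece: copy the label or mask it out
            aligned.append(word_labels[wid] if label_all_pieces else ignore_index)
        prev = wid
    return aligned

# Three words split into 1, 2 and 3 pieces, wrapped in special tokens.
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
word_labels = [0, 1, 3]
all_pieces = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_pieces=False)
```

Labelling every piece, as the tutorial does, gives long words more weight in the loss; masking continuation pieces keeps one training signal per word.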


In [271]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [273]:
pip install seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     43.6/43.6 kB 326.3 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16165 sha256=d2740a2a6d47645ffb1e3daf342f0b78a5f6c876d6a8c0f5e171c5f8cd4e8d18
  Stored in directory: /Users/pabakst/Library/Caches/pip/wheels/e2/a5/92/2c80d1928733611c2747a9820e1324a6835524d9411510c142
Successfully built seqeval
Installing collected packages: seqeval
Successfully

In [274]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
import numpy as np
from datasets import load_metric
metric = load_metric("seqeval")
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    flattened_results = {
        "overall_precision": results["overall_precision"],
        "overall_recall": results["overall_recall"],
        "overall_f1": results["overall_f1"],
        "overall_accuracy": results["overall_accuracy"],
    }
    for k in results.keys():
      if(k not in flattened_results.keys()):
        flattened_results[k+"_f1"]=results[k]["f1"]

    return flattened_results


In [276]:
flattened_results = {"overall_precision": results["overall_precision"],"overall_recall": results["overall_recall"],"overall_f1": results["overall_f1"],"overall_accuracy": results["overall_accuracy"],}

for k in results.keys():
    if(k not in flattened_results.keys()):
        flattened_results[k+"_f1"]=results[k]["f1"]

TypeError: list indices must be integers or slices, not str
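
Python raises this `TypeError` when a list is indexed with a string, so at the time this cell ran `results` was presumably a list and not the dictionary that `metric.compute` returns inside `compute_metrics`. A minimal sketch of the same flattening with an explicit type check follows. The sample dictionary imitates the usual seqeval layout (per-entity dicts plus `overall_*` scalars), and its numbers are made up purely for illustration.

```python
def flatten_results(results):
    """Keep overall_* scores and add one '<entity>_f1' entry per entity type."""
    if not isinstance(results, dict):
        raise TypeError(f"expected a dict of metrics, got {type(results).__name__}")
    flat = {k: v for k, v in results.items() if k.startswith("overall_")}
    for key, value in results.items():
        if key not in flat:
            flat[key + "_f1"] = value["f1"]
    return flat

# Hypothetical metric output; the values are NOT results from this notebook.
sample_results = {
    "DIR": {"precision": 0.5, "recall": 0.5, "f1": 0.5, "number": 2},
    "NUM": {"precision": 1.0, "recall": 1.0, "f1": 1.0, "number": 1},
    "overall_precision": 0.75,
    "overall_recall": 0.75,
    "overall_f1": 0.75,
    "overall_accuracy": 0.9,
}
flat = flatten_results(sample_results)
```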

In [277]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.15.4-py3-none-any.whl (2.1 MB)
     2.1/2.1 MB 707.0 kB/s eta 0:00:00
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)
     184.3/184.3 kB 544.9 kB/s eta 0:00:00
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.25.1-py2.py3-none-any.whl (206 kB)
     206.7/206.7 kB 584.1 kB/s eta 0:00:00
Collecting doc

In [278]:
import os
import wandb
os.environ["WANDB_API_KEY"]="API KEY GOES HERE"
os.environ["WANDB_ENTITY"]="Suchandra"
os.environ["WANDB_PROJECT"]="finetune_bert_ner"



In [281]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-multilingual-cased", num_labels=len(label_names_our))
training_args = TrainingArguments(
    output_dir="./fine_tune_bert_output",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    logging_steps = 1000,
    report_to="wandb",
    run_name = "ep_10_tokenized_11",
    save_strategy='no'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
wandb.finish()

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at 

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


wandb: ERROR Error while calling W&B API: user is not logged in (<Response [401]>)


Problem at: /Users/pabakst/Conda/anaconda3/lib/python3.9/site-packages/transformers/integrations.py 593 setup


AuthenticationError: The API key you provided is either invalid or missing.  If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)