[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Henya14/deep-learning-ner/blob/main/basic_learning.ipynb)

# Setting up the environment

In [1]:
!pip install torch
!pip install transformers
!git clone https://github.com/Henya14/deep-learning-ner.git
!cp -R ./deep-learning-ner/data ./data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
fatal: destination path 'deep-learning-ner' already exists and is not an empty directory.


In [2]:
!nvidia-smi

Sun Nov 20 07:11:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Loading the data

## Getting the directories storing our data

In [3]:
import os
train_test_devel_data_path = os.path.join("data", "train-devel-test")
train_test_devel_data_dirs = [os.path.join(train_test_devel_data_path, data_dir) for data_dir in os.listdir(train_test_devel_data_path) if os.path.isdir(os.path.join(train_test_devel_data_path, data_dir))]

## Helper functions for loading the data

In [4]:
import re
import pandas as pd
csv_file_pattern = re.compile(r".*_full\.csv$")  # only file names ending in "_full.csv"
def get_csv_files_in_dir(path_to_dir):
    return [f for f in os.listdir(path_to_dir) if csv_file_pattern.match(f)]

In [5]:
def get_train_devel_test_dirs():
    train_devel_test_file_dirs = {}
    for d in train_test_devel_data_dirs:
        file_dirs = [os.path.join(d, genre_dir, "no-morph") for genre_dir in os.listdir(d) if os.path.isdir(os.path.join(d, genre_dir)) and "no-morph" in os.listdir(os.path.join(d, genre_dir))]
        train_devel_test_file_dirs[os.path.basename(d)] = file_dirs
    return train_devel_test_file_dirs

In [6]:
def load_all_csv_files_in_dir(path_to_dir, train_test_devel, genre, save_intermediate_dataframes_to_csv = False):
    data_file_paths = [os.path.join(path_to_dir, cf) for cf in get_csv_files_in_dir(path_to_dir)]
    combined_df = pd.DataFrame()
    for csv_file in data_file_paths:
        print(f"Loading: {csv_file}")
        df = pd.read_csv(csv_file)
        if "sentence_index" in combined_df:
            df["sentence_index"] = df["sentence_index"] + (combined_df["sentence_index"].max() + 1)
        combined_df = pd.concat([combined_df, df])
    return combined_df

## Combine each dataset (train, devel, test) into a single dataframe of its own

In [7]:
def get_dfs():
    dfs = {}
    train_devel_test_dirs = get_train_devel_test_dirs()
    for data_set in train_devel_test_dirs:
        dfs[data_set] = pd.DataFrame()
        for genre_dir in train_devel_test_dirs[data_set]:
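            # The genre name is the parent directory of "no-morph".
            genre = genre_dir.split(os.path.sep)[-2]
            df = load_all_csv_files_in_dir(genre_dir, data_set, genre, True)
            # NOTE: sentence_index is not shifted between genre directories here, so
            # sentences from different genres can end up sharing the same index.
            dfs[data_set] = pd.concat([dfs[data_set], df], ignore_index=True)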
    return dfs
dfs = get_dfs()

Loading: data/train-devel-test/train/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/train/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/train/news/no-morph/news_full.csv
Loading: data/train-devel-test/train/wikipedia/no-morph/wikipedia_full.csv
Loading: data/train-devel-test/devel/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/devel/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/devel/news/no-morph/news_full.csv
Loading: data/train-devel-test/test/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/test/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/test/news/no-morph/news_full.csv
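
`load_all_csv_files_in_dir` shifts `sentence_index` between the files of one directory, but `get_dfs` concatenates the genre frames without such a shift. The sketch below, on toy data, shows how the same shift could be applied to any list of frames and how to check afterwards that no (sentence, position) pair occurs twice. The helper names `concat_with_sentence_offset` and `has_colliding_sentences` are hypothetical, and the toy frames assume that every source numbers its sentences from 0.

```python
import pandas as pd

def concat_with_sentence_offset(frames):
    """Concatenate frames, shifting sentence_index so indices never repeat."""
    shifted, offset = [], 0
    for frame in frames:
        frame = frame.copy()
        frame["sentence_index"] = frame["sentence_index"] + offset
        # The next frame continues after the highest index used so far.
        offset = int(frame["sentence_index"].max()) + 1
        shifted.append(frame)
    return pd.concat(shifted, ignore_index=True)

def has_colliding_sentences(df):
    """True if any (sentence_index, position) pair occurs more than once."""
    keys = ["sentence_index", "position_number_in_sentence"]
    return bool(df.duplicated(subset=keys).any())

# Toy stand-ins for two genre files (made-up data).
fiction = pd.DataFrame({
    "sentence_index": [0, 0, 1],
    "position_number_in_sentence": [0, 1, 0],
    "FORM": ["ELSŐ", "KÖTET", "Vége"],
})
legal = pd.DataFrame({
    "sentence_index": [0, 0],
    "position_number_in_sentence": [0, 1],
    "FORM": ["A", "Bizottság"],
})

naive = pd.concat([fiction, legal], ignore_index=True)
shifted = concat_with_sentence_offset([fiction, legal])
print(has_colliding_sentences(naive))    # plain concat: indices collide
print(has_colliding_sentences(shifted))  # shifted concat: no collisions
print(shifted["sentence_index"].tolist())
```

A check like `has_colliding_sentences(dfs["test"])` would show whether the loaded data is affected.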


# Tokenization

## Loading the tokenizer

In [8]:
from transformers import AutoTokenizer, AutoModel, BertForTokenClassification


tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")


## Tokenization helper functions

We want to tokenize the datasets sentence by sentence, so we first write a function that returns the sentences of a dataset.

In [9]:
def get_sentences(df: pd.DataFrame):
    # Sort so that the words of each sentence come out in their original order.
    copy_df = df.copy()
    copy_df = copy_df.sort_values(["sentence_index", "position_number_in_sentence"])
    sentences = []
    print(f"There are {copy_df['sentence_index'].max()} sentences in the dataset")
    # +1 so that the sentence with the highest index is included as well.
    for i in range(copy_df["sentence_index"].max() + 1):
        form_tag_pairs = copy_df[copy_df["sentence_index"]==i][["position_number_in_sentence", "FORM", "CONLL:NER"]]
        sentences.append({"FORM": form_tag_pairs["FORM"].tolist(),"TAG": form_tag_pairs["CONLL:NER"].tolist()})

    return sentences
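
Filtering the whole frame once per sentence index, as above, scans every row for every sentence. A possible alternative, sketched here on toy data, groups the rows once with `groupby`. `get_sentences_grouped` is a hypothetical name and not part of the notebook; the column names are the ones used above.

```python
import pandas as pd

def get_sentences_grouped(df: pd.DataFrame):
    """Return one {"FORM": [...], "TAG": [...]} dict per sentence_index."""
    ordered = df.sort_values(["sentence_index", "position_number_in_sentence"])
    sentences = []
    # groupby visits each sentence once, in ascending sentence_index order,
    # and keeps the row order inside each group.
    for _, group in ordered.groupby("sentence_index", sort=True):
        sentences.append({
            "FORM": group["FORM"].tolist(),
            "TAG": group["CONLL:NER"].tolist(),
        })
    return sentences

# Made-up rows, deliberately out of order.
toy = pd.DataFrame({
    "sentence_index": [1, 0, 0, 1],
    "position_number_in_sentence": [0, 1, 0, 1],
    "FORM": ["Európai", "Bizottság", "A", "Parlament"],
    "CONLL:NER": ["B-ORG", "B-ORG", "O", "I-ORG"],
})
toy_sentences = get_sentences_grouped(toy)
print(toy_sentences)
```

Unlike a `range` over the maximum index, this variant only yields indices that actually occur in the data.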

In [10]:
get_sentences(dfs["test"])

There are 2514 sentences in the dataset


[{'FORM': ['ELSŐ',
   'A',
   'Tehát',
   'KÖTET',
   'Bizottság',
   'most',
   '–',
   'ezen',
   'az',
   'dolgozom',
   'Európai',
   '–',
   'Parlamentnek',
   'a',
   'és',
   'fejlesztések',
   'a',
   'bevezetésén',
   'Tanácsnak',
   '.',
   'való',
   'megküldéssel',
   'egyidejűleg',
   '–',
   'továbbítja',
   'a',
   'nemzeti',
   'parlamenteknek',
   'az',
   'éves',
   'törvényalkotási',
   'programot',
   ',',
   'valamint',
   'minden',
   'egyéb',
   ',',
   'a',
   'törvényalkotási',
   'tervezésre',
   'vagy',
   'a',
   'politikai',
   'stratégia',
   'kialakítására',
   'vonatkozó',
   'dokumentumot',
   '.'],
  'TAG': ['O',
   'O',
   'O',
   'O',
   'B-ORG',
   'O',
   'O',
   'O',
   'O',
   'O',
   'B-ORG',
   'O',
   'I-ORG',
   'O',
   'O',
   'O',
   'O',
   'O',
   'B-ORG',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
 

### Testing the sentence getter function

In [11]:
test_sentences = get_sentences(dfs["test"])
print(test_sentences)

There are 2514 sentences in the dataset
[{'FORM': ['ELSŐ', 'A', 'Tehát', 'KÖTET', 'Bizottság', 'most', '–', 'ezen', 'az', 'dolgozom', 'Európai', '–', 'Parlamentnek', 'a', 'és', 'fejlesztések', 'a', 'bevezetésén', 'Tanácsnak', '.', 'való', 'megküldéssel', 'egyidejűleg', '–', 'továbbítja', 'a', 'nemzeti', 'parlamenteknek', 'az', 'éves', 'törvényalkotási', 'programot', ',', 'valamint', 'minden', 'egyéb', ',', 'a', 'törvényalkotási', 'tervezésre', 'vagy', 'a', 'politikai', 'stratégia', 'kialakítására', 'vonatkozó', 'dokumentumot', '.'], 'TAG': ['O', 'O', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'FORM': ['ELSŐ', 'A', 'GV', 'FEJEZET', 'tagállamokkal', ':', '.', 'folytatott', 'Megosztanál', 'konzultációt', 'néhányat', 'követően', 'a', 'a', 'reformelképzeléseid', 'Bizottság', 'közül', 'ajá

## Messing around with the tokenizer

Before tokenizing the whole dataset, we tried the tokenizer on a single test sentence to see how it works and what it offers.

In [12]:
tokenized_text = tokenizer(test_sentences[0]["FORM"], padding='max_length', max_length=512, truncation=True, return_tensors="pt", is_split_into_words=True)
print(tokenized_text)

{'input_ids': tensor([[    2,  4933, 17863,  2038,  6738,  4474, 16705,  4978,  2672,  2292,
          3690,  2033, 16913,  4197,  2292, 12038,  2146,  2005,  2045, 10554,
          2005,  5781, 24994,  6459,  2127,  4575,  2563, 29499,  2964, 12284,
          2292, 31305,  2005,  5104, 10369,  3163,  2033,  3140,  2945, 21921,
         31742,  8309,  3576,  2747,  2295,  3176,  3576,  2005,  2945, 21921,
         31742, 11721,  2108,  2174,  2005,  4676, 13259, 20878,  3498, 22797,
          4575,     3,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  
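
The call above pads every sentence to `max_length=512` and truncates longer ones. As a rough illustration of what that means for `input_ids` and `attention_mask`, here is a pure-Python sketch. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation; the ids 2, 3 and 0 for `[CLS]`, `[SEP]` and `[PAD]` are read off the outputs printed in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0  # ids as they appear in the printed output

def pad_or_truncate(token_ids, max_length):
    """Wrap token_ids in [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding  # padded positions are masked out
    return input_ids, attention_mask

ids, mask = pad_or_truncate([4933, 17863, 2038], max_length=8)
print(ids)   # [2, 4933, 17863, 2038, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```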

Here we print the numeric (token id) representation of the sentence and then decode it back into text.

In [13]:
print(tokenized_text["input_ids"].tolist())
print(tokenizer.decode(tokenized_text.input_ids[0]))

[[2, 4933, 17863, 2038, 6738, 4474, 16705, 4978, 2672, 2292, 3690, 2033, 16913, 4197, 2292, 12038, 2146, 2005, 2045, 10554, 2005, 5781, 24994, 6459, 2127, 4575, 2563, 29499, 2964, 12284, 2292, 31305, 2005, 5104, 10369, 3163, 2033, 3140, 2945, 21921, 31742, 8309, 3576, 2747, 2295, 3176, 3576, 2005, 2945, 21921, 31742, 11721, 2108, 2174, 2005, 4676, 13259, 20878, 3498, 22797, 4575, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

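Converting the ids back to tokens shows what the tokenizer added: the special tokens `[CLS]`, `[SEP]` and `[PAD]`, and `##` pieces wherever a word was split into sub-words.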

In [14]:
print(tokenizer.convert_ids_to_tokens(tokenized_text.input_ids[0]))

['[CLS]', 'EL', '##SŐ', 'A', 'Tehát', 'KÖ', '##TET', 'Bizottság', 'most', '–', 'ezen', 'az', 'dolgozom', 'Európai', '–', 'Parlament', '##nek', 'a', 'és', 'fejlesztések', 'a', 'bevezet', '##ésén', 'Tanács', '##nak', '.', 'való', 'megküldés', '##sel', 'egyidejűleg', '–', 'továbbítja', 'a', 'nemzeti', 'parlament', '##eknek', 'az', 'éves', 'törvény', '##alkotás', '##i', 'programot', ',', 'valamint', 'minden', 'egyéb', ',', 'a', 'törvény', '##alkotás', '##i', 'tervezés', '##re', 'vagy', 'a', 'politikai', 'stratégia', 'kialakítására', 'vonatkozó', 'dokumentumot', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
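
The `##` prefix marks a piece that continues the previous word. As a toy illustration of the greedy longest-match-first idea behind WordPiece splitting, here is a sketch with a tiny, made-up vocabulary. `TOY_VOCAB`, `toy_wordpiece` and `toy_tokenize` are hypothetical, and the real tokenizer's vocabulary and text normalisation differ.

```python
TOY_VOCAB = {"Parlament", "##nek", "Tanács", "##nak", "a", "[UNK]"}

def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def toy_tokenize(words):
    """Split every word and wrap the result in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(toy_wordpiece(word))
    return tokens + ["[SEP]"]

print(toy_tokenize(["a", "Parlamentnek", "Tanácsnak"]))
# ['[CLS]', 'a', 'Parlament', '##nek', 'Tanács', '##nak', '[SEP]']
```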

`word_ids()` returns, for each token, the index of the word it came from (`None` for special tokens):

In [15]:
word_ids = tokenized_text.word_ids()

In [16]:
word_ids

[None,
 0,
 0,
 1,
 2,
 3,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 12,
 13,
 14,
 15,
 16,
 17,
 17,
 18,
 18,
 19,
 20,
 21,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 27,
 28,
 29,
 30,
 30,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 38,
 38,
 39,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None
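
With `##`-style pieces, the mapping returned by `word_ids()` can be approximated from the token strings alone, which may help build intuition for the list above. `word_ids_from_tokens` is a hypothetical helper written for this illustration, not how the library computes the mapping, and it would mislabel an input word that literally starts with `##`.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def word_ids_from_tokens(tokens):
    """Approximate word ids: '##' pieces reuse the id of the word before them."""
    word_ids, current = [], -1
    for token in tokens:
        if token in SPECIAL_TOKENS:
            word_ids.append(None)
        elif token.startswith("##"):
            word_ids.append(current)  # continuation of the current word
        else:
            current += 1              # a new word starts here
            word_ids.append(current)
    return word_ids

# First tokens of the example sentence, followed by [SEP] and one [PAD].
tokens = ["[CLS]", "EL", "##SŐ", "A", "Tehát", "KÖ", "##TET", "Bizottság", "[SEP]", "[PAD]"]
print(word_ids_from_tokens(tokens))
# [None, 0, 0, 1, 2, 3, 3, 4, None, None]
```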

# Create the Dataset

## Getting the labels

In [17]:
def get_labels(dfs):
    combined_df = pd.DataFrame()
    for df_key in dfs:
        combined_df = pd.concat([combined_df, dfs[df_key]])
    labels = combined_df["CONLL:NER"].unique()
    return labels
labels = get_labels(dfs)
ids_to_labels = {k: v for k, v in enumerate(sorted(labels)) }
labels_to_ids = {v: k for k, v in enumerate(sorted(labels)) }
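
Sorting the labels before enumerating them makes the mapping deterministic: a tag always gets the same id, regardless of the order in which tags first occur in the data. A small sketch, assuming the nine CoNLL-style tags below. The notebook never prints the full label set, so this list is an assumption, although it agrees with the ids 2, 6 and 8 shown for `B-ORG`, `I-ORG` and `O` in the outputs.

```python
# Assumed tag set (not printed anywhere in the notebook).
assumed_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                  "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

toy_ids_to_labels = dict(enumerate(sorted(assumed_labels)))
toy_labels_to_ids = {label: idx for idx, label in toy_ids_to_labels.items()}

print(toy_labels_to_ids)
# The two mappings are inverses of each other.
assert all(toy_ids_to_labels[i] == label for label, i in toy_labels_to_ids.items())
```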

In [18]:
print(tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"][0]))
print(word_ids)

['[CLS]', 'EL', '##SŐ', 'A', 'Tehát', 'KÖ', '##TET', 'Bizottság', 'most', '–', 'ezen', 'az', 'dolgozom', 'Európai', '–', 'Parlament', '##nek', 'a', 'és', 'fejlesztések', 'a', 'bevezet', '##ésén', 'Tanács', '##nak', '.', 'való', 'megküldés', '##sel', 'egyidejűleg', '–', 'továbbítja', 'a', 'nemzeti', 'parlament', '##eknek', 'az', 'éves', 'törvény', '##alkotás', '##i', 'programot', ',', 'valamint', 'minden', 'egyéb', ',', 'a', 'törvény', '##alkotás', '##i', 'tervezés', '##re', 'vagy', 'a', 'politikai', 'stratégia', 'kialakítására', 'vonatkozó', 'dokumentumot', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',

The label-alignment function below is based on code from: https://towardsdatascience.com/named-entity-recognition-with-bert-in-pytorch-a454405e0b6a

In [19]:
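def align_labels_of_tokenized_sentence(sentence, labels, should_tokenize_sub_words = False):
    # `sentence` is the word_ids() list of a tokenized sentence; `labels` holds one tag per word.
    # -100 is the default ignore_index of nn.CrossEntropyLoss, so these positions do not affect the loss.
    SPECIAL_TOKEN_ID = -100
    label_ids = []
    previous_word_id = None
    for word_id in sentence:
        if word_id is None:
            # Special token ([CLS], [SEP], [PAD])
            label_ids.append(SPECIAL_TOKEN_ID)
        elif word_id != previous_word_id:
            # The first sub-word token of a word gets the word's label
            label_ids.append(labels_to_ids[labels[word_id]])
        else:
            # Later sub-word tokens are labelled only if should_tokenize_sub_words is set
            label_ids.append(labels_to_ids[labels[word_id]] if should_tokenize_sub_words else SPECIAL_TOKEN_ID)
        previous_word_id = word_id
    return label_ids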
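
To see what the alignment does, here is a self-contained toy version run on the first few tokens of the example sentence. It takes the label mapping as an argument instead of reading the global `labels_to_ids`. The function name `align_labels` and the three-entry mapping are assumptions made for this sketch; the ids 2, 6 and 8 match the outputs shown in this notebook.

```python
IGNORE_ID = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels, label_to_id, label_all_pieces=False):
    """Give each token the id of its word's label, or IGNORE_ID."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:                  # [CLS], [SEP], [PAD]
            aligned.append(IGNORE_ID)
        elif word_id != previous or label_all_pieces:
            aligned.append(label_to_id[word_labels[word_id]])
        else:                                # later piece of the same word
            aligned.append(IGNORE_ID)
        previous = word_id
    return aligned

toy_map = {"B-ORG": 2, "I-ORG": 6, "O": 8}
# [CLS] EL ##SŐ A Tehát KÖ ##TET Bizottság [SEP]
toy_word_ids = [None, 0, 0, 1, 2, 3, 3, 4, None]
toy_tags = ["O", "O", "O", "O", "B-ORG"]

print(align_labels(toy_word_ids, toy_tags, toy_map))
# [-100, 8, -100, 8, 8, 8, -100, 2, -100]
print(align_labels(toy_word_ids, toy_tags, toy_map, label_all_pieces=True))
# [-100, 8, 8, 8, 8, 8, 8, 2, -100]
```

Labelling only the first piece of each word means every word counts exactly once in the loss, however many pieces it was split into.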
    
    

In [20]:
print(tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"][0]))

['[CLS]', 'EL', '##SŐ', 'A', 'Tehát', 'KÖ', '##TET', 'Bizottság', 'most', '–', 'ezen', 'az', 'dolgozom', 'Európai', '–', 'Parlament', '##nek', 'a', 'és', 'fejlesztések', 'a', 'bevezet', '##ésén', 'Tanács', '##nak', '.', 'való', 'megküldés', '##sel', 'egyidejűleg', '–', 'továbbítja', 'a', 'nemzeti', 'parlament', '##eknek', 'az', 'éves', 'törvény', '##alkotás', '##i', 'programot', ',', 'valamint', 'minden', 'egyéb', ',', 'a', 'törvény', '##alkotás', '##i', 'tervezés', '##re', 'vagy', 'a', 'politikai', 'stratégia', 'kialakítására', 'vonatkozó', 'dokumentumot', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',

In [21]:
tokenized_text = tokenizer(test_sentences[0]["FORM"], padding='max_length', max_length=512, truncation=True, return_tensors="pt", is_split_into_words=True)
tokenized_text

{'input_ids': tensor([[    2,  4933, 17863,  2038,  6738,  4474, 16705,  4978,  2672,  2292,
          3690,  2033, 16913,  4197,  2292, 12038,  2146,  2005,  2045, 10554,
          2005,  5781, 24994,  6459,  2127,  4575,  2563, 29499,  2964, 12284,
          2292, 31305,  2005,  5104, 10369,  3163,  2033,  3140,  2945, 21921,
         31742,  8309,  3576,  2747,  2295,  3176,  3576,  2005,  2945, 21921,
         31742, 11721,  2108,  2174,  2005,  4676, 13259, 20878,  3498, 22797,
          4575,     3,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

In [22]:
align_labels_of_tokenized_sentence(tokenized_text.word_ids(), test_sentences[0]["TAG"])

[-100,
 8,
 -100,
 8,
 8,
 8,
 -100,
 2,
 8,
 8,
 8,
 8,
 8,
 2,
 8,
 6,
 -100,
 8,
 8,
 8,
 8,
 8,
 -100,
 2,
 -100,
 8,
 8,
 8,
 -100,
 8,
 8,
 8,
 8,
 8,
 8,
 -100,
 8,
 8,
 8,
 -100,
 -100,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 -100,
 -100,
 8,
 -100,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -1

In [23]:
aligned = align_labels_of_tokenized_sentence(tokenized_text.word_ids(), test_sentences[0]["TAG"])
converted = tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"][0])
for i in range(len(aligned)):
    print(f"{converted[i].ljust(15)} {str(aligned[i]).ljust(4)} \t { 'None' if aligned[i] == -100 else ids_to_labels[aligned[i]]} ")

[CLS]           -100 	 None 
EL              8    	 O 
##SŐ            -100 	 None 
A               8    	 O 
Tehát           8    	 O 
KÖ              8    	 O 
##TET           -100 	 None 
Bizottság       2    	 B-ORG 
most            8    	 O 
–               8    	 O 
ezen            8    	 O 
az              8    	 O 
dolgozom        8    	 O 
Európai         2    	 B-ORG 
–               8    	 O 
Parlament       6    	 I-ORG 
##nek           -100 	 None 
a               8    	 O 
és              8    	 O 
fejlesztések    8    	 O 
a               8    	 O 
bevezet         8    	 O 
##ésén          -100 	 None 
Tanács          2    	 B-ORG 
##nak           -100 	 None 
.               8    	 O 
való            8    	 O 
megküldés       8    	 O 
##sel           -100 	 None 
egyidejűleg     8    	 O 
–               8    	 O 
továbbítja      8    	 O 
a               8    	 O 
nemzeti         8    	 O 
parlament       8    	 O 
##eknek         -100 	 None 
az              8    	 O

In [24]:
tokenized_text

{'input_ids': tensor([[    2,  4933, 17863,  2038,  6738,  4474, 16705,  4978,  2672,  2292,
          3690,  2033, 16913,  4197,  2292, 12038,  2146,  2005,  2045, 10554,
          2005,  5781, 24994,  6459,  2127,  4575,  2563, 29499,  2964, 12284,
          2292, 31305,  2005,  5104, 10369,  3163,  2033,  3140,  2945, 21921,
         31742,  8309,  3576,  2747,  2295,  3176,  3576,  2005,  2945, 21921,
         31742, 11721,  2108,  2174,  2005,  4676, 13259, 20878,  3498, 22797,
          4575,     3,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

## Creating the pytorch Dataset class

In [25]:
import torch
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, data_df):
        sentences = get_sentences(data_df)
        self.tokenized_sentences = [tokenizer(sentence["FORM"], padding='max_length', max_length=512, truncation=True, return_tensors="pt", is_split_into_words=True) for sentence in sentences]
        self.labels = [align_labels_of_tokenized_sentence(tokenized_sentences.word_ids(), sentence["TAG"]) for tokenized_sentences, sentence in zip(self.tokenized_sentences, sentences)]

        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, index):
        return self.tokenized_sentences[index], torch.LongTensor(self.labels[index])

In [26]:
a = NERDataset(dfs["test"])

There are 2514 sentences in the dataset


# Training

https://www.kaggle.com/code/angyalfold/hugging-face-bert-with-custom-classifier-pytorch/notebook

https://www.youtube.com/watch?v=MqQ7rqRllIc

https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial

https://towardsdatascience.com/deep-dive-into-the-code-of-bert-model-9f618472353e

In [27]:
from torch import nn
from transformers import BertModel

class NERModel(torch.nn.Module):
    
    def __init__(self, num_labels, loss_fn=nn.CrossEntropyLoss()):
        super(NERModel, self).__init__()
        self.num_labels = num_labels
        self.loss_fn = loss_fn

        self.bert = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")
        self.dropout1 = nn.Dropout(0.1)
        self.linear1 = nn.Linear(in_features=768, out_features=512)
        self.relu1 =  nn.ReLU()
        self.linear2 = nn.Linear(in_features=512, out_features=num_labels)
        self.sigmoid = nn.Sigmoid()

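    def forward(self, input_ids, attention_mask, labels) :
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # x[0] is the last hidden state: one 768-dimensional vector per token
        x = self.dropout1(x[0])
        x = self.linear1(x)
        x = self.relu1(x)
        x = self.linear2(x)
        # NOTE: nn.CrossEntropyLoss applies log-softmax itself and expects raw scores;
        # the sigmoid squashes them into (0, 1) first, which caps how confident the softmax can get.
        x = self.sigmoid(x)
        
        #x =  torch.argmax(x, dim=2)
        loss = self.calculate_loss(x, labels, attention_mask)
        return loss, x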

    def calculate_loss(self, predicted_labels: torch.Tensor, actual_labels: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Code from: https://github.com/abhishekkrthakur/bert-entity-extraction/tree/master
        # True for real tokens, False for padding
        mask = attention_mask.view(-1) == 1
        # Flatten to (batch * sequence_length, num_labels)
        active_logits = predicted_labels.view(-1, self.num_labels)

        # Replace the labels of padded positions with ignore_index so the loss skips them
        actual_labels_with_ignore_index = torch.where(
            mask,
            actual_labels.view(-1),
            torch.tensor(self.loss_fn.ignore_index).type_as(actual_labels),
        )

        loss = self.loss_fn(active_logits, actual_labels_with_ignore_index)
        return loss
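
The effect of `ignore_index` can be reproduced by hand: positions whose label equals the ignore value contribute nothing, and the loss is the mean over the remaining positions (the default `reduction='mean'` behaviour of `nn.CrossEntropyLoss`). A pure-Python sketch with made-up scores; `masked_cross_entropy` is a hypothetical name:

```python
import math

IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def masked_cross_entropy(scores, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-softmax of the true class over non-ignored positions."""
    losses = []
    for row, label in zip(scores, labels):
        if label == ignore_index:
            continue  # padding, special tokens, non-first sub-word pieces
        log_norm = math.log(sum(math.exp(s) for s in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

toy_scores = [[2.0, 0.0, 0.0],   # token labelled 0
              [0.0, 0.0, 0.0],   # ignored position
              [0.0, 1.0, 0.0]]   # token labelled 1
toy_labels = [0, IGNORE_INDEX, 1]

toy_loss = masked_cross_entropy(toy_scores, toy_labels)
print(round(toy_loss, 4))

# Changing the scores of the ignored position leaves the loss untouched.
toy_scores[1] = [9.0, -9.0, 3.0]
assert masked_cross_entropy(toy_scores, toy_labels) == toy_loss
```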

        
        

In [28]:
train_dataset, devel_dataset, test_dataset = NERDataset(dfs["train"][0:2000]), NERDataset(dfs["devel"]), NERDataset(dfs["test"])

There are 128 sentences in the dataset
There are 2813 sentences in the dataset
There are 2514 sentences in the dataset


In [29]:
len(dfs["train"])
from tqdm import tqdm


In [30]:
from torch.utils.data.dataloader import DataLoader

print(f"Cuda available? {torch.cuda.is_available()}")
def loop(model, train_df, devel_df, test_df):
    train_dataset, devel_dataset, test_dataset = NERDataset(train_df), NERDataset(devel_df), NERDataset(test_df)
    batch_size = 16
    epoch_num = 4
    learning_rate = 0.0001
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(devel_dataset, batch_size=batch_size)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    
    is_cuda_available = torch.cuda.is_available()
    device  = "cuda" if is_cuda_available else "cpu"
    if is_cuda_available:
        model.to(device)
    model.bert.requires_grad_(False)
    
   
    for epoch in range(epoch_num):
        model.train()
        total_acc_train = 0
        total_loss_train = 0

        train_count_of_good_predictions=0
        for tokenized_sentence, label in tqdm(train_dataloader):
            train_label: torch.Tensor = label.to(device)
            attention_mask = tokenized_sentence['attention_mask'].squeeze(1).to(device)
            input_ids = tokenized_sentence['input_ids'].squeeze(1).to(device)

            optimizer.zero_grad()
            loss, logits = model(input_ids, attention_mask, train_label)

            
            
            loss.backward()
            optimizer.step()
            # Accuracy bookkeeping for this batch (the model stays in train mode)
            for i in range(logits.shape[0]):
                
                logits_clean = logits[i][train_label[i] != -100]
                label_clean = train_label[i][train_label[i] != -100]
                preds = logits_clean.argmax(dim=1)

                count_of_matches = (preds == label_clean).sum()
                train_count_of_good_predictions += count_of_matches
        
        
        print(f"Epoch: {epoch} Train Accuracy: {train_count_of_good_predictions/len(train_df) * 100} got {train_count_of_good_predictions} matches out of {len(train_df)}")
def main(): 
  dfs = get_dfs()
  labels=get_labels(dfs)
  ids_to_labels = {k: v for k, v in enumerate(sorted(labels)) }
  labels_to_ids = {v: k for k, v in enumerate(sorted(labels)) }


  model = NERModel(len(labels))
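  # NOTE: train_df is the test split here (this is what the recorded run below used); pass dfs["train"] to train on the training split.
  loop(model=model, train_df=dfs["test"], devel_df=dfs["devel"], test_df=dfs["test"])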

main()
            

Cuda available? True
Loading: data/train-devel-test/train/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/train/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/train/news/no-morph/news_full.csv
Loading: data/train-devel-test/train/wikipedia/no-morph/wikipedia_full.csv
Loading: data/train-devel-test/devel/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/devel/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/devel/news/no-morph/news_full.csv
Loading: data/train-devel-test/test/fiction/no-morph/fiction_full.csv
Loading: data/train-devel-test/test/legal/no-morph/legal_full.csv
Loading: data/train-devel-test/test/news/no-morph/news_full.csv


Some weights of the model checkpoint at SZTAKI-HLT/hubert-base-cc were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


There are 2514 sentences in the dataset
There are 2813 sentences in the dataset
There are 2514 sentences in the dataset


100%|██████████| 158/158 [01:32<00:00,  1.71it/s]


Epoch: 0 Train Accuracy: 20.74346160888672 got 13214 matches out of 63702


100%|██████████| 158/158 [01:29<00:00,  1.76it/s]


Epoch: 1 Train Accuracy: 81.5798568725586 got 51968 matches out of 63702


100%|██████████| 158/158 [01:29<00:00,  1.77it/s]


Epoch: 2 Train Accuracy: 94.2984619140625 got 60070 matches out of 63702


100%|██████████| 158/158 [01:29<00:00,  1.77it/s]


Epoch: 3 Train Accuracy: 94.81963348388672 got 60402 matches out of 63702
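
The accuracy printed above is the number of correctly predicted labelled tokens divided by `len(train_df)`, for example 60402 / 63702 ≈ 94.82% in the last epoch. The counting step can be sketched in plain Python as follows; `count_matches` is a hypothetical helper and the scores are made up.

```python
IGNORE_ID = -100

def count_matches(scores, labels, ignore_id=IGNORE_ID):
    """Count labelled positions whose highest-scoring class equals the label."""
    matches = total = 0
    for row, label in zip(scores, labels):
        if label == ignore_id:
            continue  # not a labelled token
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        matches += int(predicted == label)
        total += 1
    return matches, total

toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, IGNORE_ID, 0, 0]
matches, total = count_matches(toy_scores, toy_labels)
print(f"{matches} of {total} labelled tokens correct")

# The last epoch's figure from the log, recomputed:
print(round(60402 / 63702 * 100, 2))  # 94.82
```

Dividing by the number of labelled tokens (`total` here) instead of the number of dataframe rows would keep the figure exact even if truncation at 512 tokens drops some words.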
