# Token Classification
Token classification assigns a class to each token in a text. Named Entity Recognition (NER) is the most common example: each token is labelled as part of a person, an organisation, a location, or none of these. Earlier token classification models typically combined pre-trained word embeddings with a bi-directional RNN (LSTM or GRU) to produce a tag for each token in the input sequence; the models used in this notebook are fine-tuned transformer encoders (BERT and XLM-RoBERTa). The figure below shows the general architecture of a token classification model.

Models:
- English: [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER)
- German: [xlm-roberta-large-finetuned-conll03-german](https://huggingface.co/xlm-roberta-large-finetuned-conll03-german)


# Setup

In [None]:
!pip install torch torchvision torchaudio transformers

# Get max sequence length
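The model configs report the following `max_position_embeddings`:
- dslim/bert-base-NER: 512
- xlm-roberta-large-finetuned-conll03-german: 514

For XLM-RoBERTa, two of the 514 positions are reserved by the padding offset, so both models accept at most 512 tokens per input, including special tokens. Lyrics need to be split into chunks that fit this limit.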

In [2]:
from transformers import AutoModel

models = [
    "dslim/bert-base-NER",
    "xlm-roberta-large-finetuned-conll03-german"
]

for model_name in models:
    model = AutoModel.from_pretrained(model_name)
    print(f"{model_name}: {model.config.max_position_embeddings}")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


dslim/bert-base-NER: 512


Some weights of the model checkpoint at xlm-roberta-large-finetuned-conll03-german were not used when initializing XLMRobertaModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


xlm-roberta-large-finetuned-conll03-german: 514


# Load lyrics

In [8]:
import os
import json

# Folder with the processed lyrics (one JSON file per artist)
processing2_folder_path = os.path.abspath(os.path.join(os.getcwd(), '..', '..', 'data', 'processed', 'processing2'))
file_names = os.listdir(processing2_folder_path)
file_paths = [os.path.join(processing2_folder_path, file) for file in file_names]

# Each file holds the list of songs of one artist
data = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as f:
        songs = json.load(f)
        data.append({
            'artist': songs[0]['artist'],
            'songs': songs
        })

print(f"Loaded {len(data)} artists")

Loaded 14 artists
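
The loading cell implies a particular file layout: each JSON file is a list of song objects, and every song carries at least the keys `artist`, `language` and `lyrics` (these are the keys read later in the notebook). The following is a minimal sketch of that layout, using a temporary folder and made-up sample songs; the artist name and file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample songs; only the key names are taken from the notebook
sample_songs = [
    {'artist': 'Example Artist', 'language': 'en', 'lyrics': 'Elton John lives in London'},
    {'artist': 'Example Artist', 'language': 'de', 'lyrics': 'Angela Merkel lebt in Berlin'},
]

with tempfile.TemporaryDirectory() as folder:
    # Write one file per artist, as the loading cell expects
    path = os.path.join(folder, 'Example Artist.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample_songs, f, ensure_ascii=False)

    # Same loading logic as in the cell above, applied to the temporary folder
    data = []
    for file_name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, file_name), 'r', encoding='utf-8') as f:
            songs = json.load(f)
        data.append({'artist': songs[0]['artist'], 'songs': songs})

print(data[0]['artist'], len(data[0]['songs']))
```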


# Functions

In [9]:
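def text_to_chunks(text, max_chars=512):
    """
    Splits text on whitespace into chunks of at most max_chars characters.
    The limit counts characters, not tokens, so it is only a rough proxy
    for the models' 512-token limit. A single word longer than max_chars
    is kept whole.
    :param text: Text to split
    :param max_chars: Maximum number of characters per chunk
    :return: List of chunks
    """
    chunks = []
    chunk = ""
    for word in text.split():
        if not chunk:
            # First word of a chunk: no leading space
            chunk = word
        elif len(chunk) + len(word) + 1 <= max_chars:
            chunk += f" {word}"
        else:
            chunks.append(chunk)
            chunk = word
    # Skip the trailing chunk if the text was empty
    if chunk:
        chunks.append(chunk)
    return chunks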

def store_to_output(filename, data, subfolder=None):
    """
    Stores data to output folder as JSON file
    :param filename: Name of the file (without extension)
    :param data: Data to store
    :param subfolder: Subfolder to store file in
    :return: None
    """
    # Output folder (optionally a subfolder of data/processed)
    output_folder = os.path.abspath(os.path.join(os.getcwd(), '..', '..', 'data', 'processed'))
    if subfolder:
        output_folder = os.path.join(output_folder, subfolder)
        # Create subfolder if it does not exist
        os.makedirs(output_folder, exist_ok=True)
    full_filepath = os.path.join(output_folder, filename + '.json')

    # Object to JSON
    json_data = json.dumps(data, ensure_ascii=False, indent=4)

    # Write to file encoded as UTF-8
    with open(full_filepath, "w", encoding="utf-8") as file:
        file.write(json_data)
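
Because `text_to_chunks` rebuilds each chunk from the split words, the character offsets a model reports for a chunk cannot be mapped back to the original lyrics. The following is a minimal sketch of an alternative that returns each chunk together with its position in the original text. The function name `text_to_chunks_with_offsets` and the `max_chars` parameter are assumptions for illustration, not part of the notebook.

```python
import re

def text_to_chunks_with_offsets(text, max_chars=512):
    """Split text into (offset, chunk) pairs; every chunk is a slice of text."""
    chunks = []
    chunk_start = None
    chunk_end = None
    for match in re.finditer(r"\S+", text):
        if chunk_start is None:
            # First word of a new chunk
            chunk_start, chunk_end = match.start(), match.end()
        elif match.end() - chunk_start <= max_chars:
            # The word still fits: extend the current chunk up to its end
            chunk_end = match.end()
        else:
            # The word does not fit: close the chunk and start a new one
            chunks.append((chunk_start, text[chunk_start:chunk_end]))
            chunk_start, chunk_end = match.start(), match.end()
    if chunk_start is not None:
        chunks.append((chunk_start, text[chunk_start:chunk_end]))
    return chunks

lyrics = "Elton John lives in London\n" * 40
chunks = text_to_chunks_with_offsets(lyrics, max_chars=100)
for offset, chunk in chunks:
    # Each chunk can be located in the original text via its offset
    assert lyrics[offset:offset + len(chunk)] == chunk
print(len(chunks), "chunks")
```

With this variant, an entity found at `start` within a chunk sits at `offset + start` in the full lyrics.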

# Run Token Classification

In [14]:
from transformers import pipeline

# Init models
en = pipeline("token-classification", model="dslim/bert-base-NER")
de = pipeline("token-classification", model="xlm-roberta-large-finetuned-conll03-german")

#print(en("Elton John lives in London"))
#print(de("Angela Merkel lebt in Berlin"))
print(de("(Blackout Shorty hat ein Blackout)"))

[{'entity': 'I-PER', 'score': 0.9996092, 'index': 4, 'word': '▁Short', 'start': 10, 'end': 15}, {'entity': 'I-PER', 'score': 0.9714063, 'index': 5, 'word': 'y', 'start': 15, 'end': 16}]
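
The output above contains one entry per sub-word piece, not per word: "Shorty" is split into `▁Short` and `y`. The `start` and `end` fields are character offsets into the input string, so the original spelling can be recovered by slicing. The sketch below replays the printed output as plain data, so it runs without loading the model. The pipeline also accepts an `aggregation_strategy` argument that groups pieces for you; this notebook does the grouping manually in a later cell.

```python
text = "(Blackout Shorty hat ein Blackout)"

# Pipeline output copied from the cell above
tokens = [
    {'entity': 'I-PER', 'score': 0.9996092, 'index': 4, 'word': '▁Short', 'start': 10, 'end': 15},
    {'entity': 'I-PER', 'score': 0.9714063, 'index': 5, 'word': 'y', 'start': 15, 'end': 16},
]

# Slice the input with each token's offsets
pieces = [text[token['start']:token['end']] for token in tokens]
print(pieces)  # ['Short', 'y']

# The two pieces are adjacent (end == start), so together they span one word
span = text[tokens[0]['start']:tokens[-1]['end']]
print(span)  # Shorty
```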


In [15]:
# Run token classification on chunks of text and collect the results
def token_classification(text, model):
    chunks = text_to_chunks(text)
    combined_model_results = []
    for chunk in chunks:
        chunk_results = model(chunk)

        for result in chunk_results:
            # Keep only the fields needed downstream (the score is dropped).
            # Note: 'start' and 'end' are character offsets within the chunk,
            # not within the full text.
            combined_model_results.append({
                'entity': result['entity'],
                'word': result['word'],
                'start': result['start'],
                'end': result['end']
            })

    return combined_model_results

overall_progress = 0
for artist in data:
    overall_progress += 1
    sub_progress = 0
    for song in artist['songs']:
        sub_progress += 1
        if song['language'] == 'en':
            song['token-classification'] = {
                'en_bertbase': token_classification(song['lyrics'], en),
                'de_roberta': None
            }
        elif song['language'] == 'de':
            song['token-classification'] = {
                'en_bertbase': None,
                'de_roberta': token_classification(song['lyrics'], de)
            }
        else:
            song['token-classification'] = {
                'en_bertbase': None,
                'de_roberta': None
            }

        print(f"Progress: {sub_progress}/{len(artist['songs'])}")

    print(f"Progress: {overall_progress}/{len(data)}")
    store_to_output(artist['artist'], artist, 'token_classification')


Progress: 1/100
Progress: 2/100
Progress: 3/100
...
Progress: 59/100
...
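
The three-way `if`/`elif`/`else` in the loop above can also be written as a lookup from language code to model. The following is a sketch of that design, not the notebook's own code: `classify_song`, `fake_en` and `fake_de` are hypothetical names, and the two stand-in functions replace the real pipelines so the example runs without downloading any model.

```python
def classify_song(song, pipelines):
    """Build the 'token-classification' entry for one song.

    pipelines maps a language code to a (result_key, callable) pair.
    """
    # Start with None for every model, as in the original loop
    results = {key: None for key, _ in pipelines.values()}
    entry = pipelines.get(song['language'])
    if entry is not None:
        key, model = entry
        results[key] = model(song['lyrics'])
    return results

# Stand-in callables (assumption: the real ones would wrap token_classification)
def fake_en(lyrics):
    return [{'entity': 'B-PER', 'word': 'Elton', 'start': 0, 'end': 5}]

def fake_de(lyrics):
    return [{'entity': 'I-PER', 'word': '▁Short', 'start': 10, 'end': 15}]

pipelines = {
    'en': ('en_bertbase', fake_en),
    'de': ('de_roberta', fake_de),
}

english = classify_song({'language': 'en', 'lyrics': 'Elton John lives in London'}, pipelines)
other = classify_song({'language': 'fr', 'lyrics': 'Bonjour'}, pipelines)
print(english)
print(other)
```

In the notebook, the callables would be something like `lambda lyrics: token_classification(lyrics, en)`. Adding a further language then only requires a new dictionary entry.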

# Combine tokens of the same entity whose start and end offsets are adjacent

In [24]:
def combine_tokens_and_cleanup(tokens):
    # Merge consecutive tokens that share an entity label and are directly
    # adjacent (the previous token ends where the next one starts); these
    # are typically sub-word pieces of the same word
    combined_tokens = []
    for token in tokens:
        last_token = combined_tokens[-1] if combined_tokens else None
        if last_token and last_token['entity'] == token['entity'] and last_token['end'] == token['start']:
            last_token['word'] += token['word']
            last_token['end'] = token['end']
        else:
            # Copy so the input tokens are not modified in place
            combined_tokens.append(dict(token))

    # Strip sub-word markers ('▁' from SentencePiece, '#' from WordPiece) and spaces
    for token in combined_tokens:
        token['word'] = token['word'].replace('▁', '').replace('#', '').replace(' ', '')

    # Filter duplicate words (ignoring case), empty words and words of length 1
    filtered_tokens = []
    seen_words = set()
    for token in combined_tokens:
        word = token['word'].lower()
        if word not in seen_words and len(token['word']) > 1:
            filtered_tokens.append(token)
            seen_words.add(word)

    return filtered_tokens

overall_progress = 0
for artist in data:
    overall_progress += 1
    sub_progress = 0
    for song in artist['songs']:
        sub_progress += 1
        if song['token-classification']['en_bertbase']:
            song['token-classification']['en_bertbase'] = combine_tokens_and_cleanup(song['token-classification']['en_bertbase'])
        if song['token-classification']['de_roberta']:
            song['token-classification']['de_roberta'] = combine_tokens_and_cleanup(song['token-classification']['de_roberta'])

        print(f"Progress: {sub_progress}/{len(artist['songs'])}")

    print(f"Progress: {overall_progress}/{len(data)}")
    store_to_output(artist['artist'], artist, 'token_classification')


Progress: 1/100
Progress: 2/100
Progress: 3/100
...
Progress: 59/100
...
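
`combine_tokens_and_cleanup` merges only pieces that carry the identical label and touch each other, so a multi-word name such as "Elton John" stays split into two entries (the words are separated by a space, and with `B-`/`I-` tagging the labels differ as well). The following is a sketch of a variant that groups by the label without its prefix and tolerates a one-character gap. `group_bio_entities` and `max_gap` are hypothetical names, and the first token list is hand-written for illustration, not actual model output.

```python
def group_bio_entities(text, tokens, max_gap=1):
    """Group B-/I- tagged tokens into (label, surface text) pairs."""
    groups = []
    for token in tokens:
        # 'B-PER' -> prefix 'B', label 'PER'
        prefix, _, label = token['entity'].partition('-')
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix == 'I'
            and last['label'] == label
            and token['start'] - last['end'] <= max_gap
        )
        if continues:
            last['end'] = token['end']
        else:
            groups.append({'label': label, 'start': token['start'], 'end': token['end']})
    # Slice the text so the original spelling and spacing are kept
    return [(group['label'], text[group['start']:group['end']]) for group in groups]

# Hand-written tokens (illustrative, offsets computed by hand)
text_en = "Elton John lives in London"
tokens_en = [
    {'entity': 'B-PER', 'word': 'Elton', 'start': 0, 'end': 5},
    {'entity': 'I-PER', 'word': 'John', 'start': 6, 'end': 10},
    {'entity': 'B-LOC', 'word': 'London', 'start': 20, 'end': 26},
]
entities_en = group_bio_entities(text_en, tokens_en)
print(entities_en)  # [('PER', 'Elton John'), ('LOC', 'London')]

# Tokens taken from the German example output earlier in the notebook
text_de = "(Blackout Shorty hat ein Blackout)"
tokens_de = [
    {'entity': 'I-PER', 'word': '▁Short', 'start': 10, 'end': 15},
    {'entity': 'I-PER', 'word': 'y', 'start': 15, 'end': 16},
]
entities_de = group_bio_entities(text_de, tokens_de)
print(entities_de)  # [('PER', 'Shorty')]
```

The `text` argument has to be the string the offsets refer to, which in this notebook is the individual chunk passed to the model rather than the full lyrics.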