<a href="https://colab.research.google.com/github/NiclasFenton-Wiegleb/schlager-lyrics-bot/blob/main/Schlager_Bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/NiclasFenton-Wiegleb/schlager-lyrics-bot.git
!pip install accelerate -U
!pip install transformers
!pip install tokenizers
!pip install datasets
!pip install torch torchvision -U

Cloning into 'schlager-lyrics-bot'...
remote: Enumerating objects: 1908, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 1908 (delta 35), reused 34 (delta 9), pack-reused 1838
Receiving objects: 100% (1908/1908), 98.39 MiB | 25.36 MiB/s, done.
Resolving deltas: 100% (127/127), done.
Updating files: 100% (3283/3283), done.
Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.

# Training the Schlager Bot Language Model

To train the model that will ultimately generate Schlager lyrics, we first train a BERT model from scratch on a text dataset that combines Schlager lyrics with generic German text from an open-source NLP training dataset. The generic text supplies grammatical structure and vocabulary that the lyrics alone lack.

Transformers can be pre-trained with self-supervised tasks. Here we use Masked Language Modeling (MLM): a certain percentage of the tokens in a sentence is masked, and the model is trained to predict the masked tokens. One advantage of this method is that the model sees position information for the whole sentence, covering both the masked and the visible tokens.
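To make the masking step concrete, here is a minimal sketch in plain Python. It is not the collator used later in this notebook: the function name mask_tokens and the use of None as an "ignore" label are assumptions made for illustration, and real implementations typically also replace some selected tokens with random tokens or leave them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_probability=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels holds the original token at masked positions and None elsewhere,
    so only masked positions would contribute to the training loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append(MASK_TOKEN)   # hide the token from the model
            labels.append(token)        # the model must recover this token
        else:
            masked.append(token)        # token stays visible
            labels.append(None)         # position is ignored by the loss
    return masked, labels

sentence = "verrückt sind wir doch beide sowieso".split()
masked, labels = mask_tokens(sentence, mask_probability=0.2, seed=1)
print(masked)
print(labels)
```

Each run with the same seed masks the same positions, which makes the sketch easy to inspect. During real training the mask is redrawn for every batch.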

First we need to import the relevant dependencies.

In [2]:
import transformers
import json
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
import datasets
import os
from tokenizers import BertWordPieceTokenizer
from itertools import chain
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments
from transformers import Trainer
from datasets import Features, Value, Sequence

Before we can start putting our model together, we need to preprocess the training data and split it into training and test subsets. Splitting first ensures the model never sees the test data during training, so the evaluation loss remains an honest measure of how well the model generalizes and lets us detect overfitting.

In [None]:
#Loading the lyrics data as the dataset

#Iterating through the files in the directory and adding the names to the files variable

files = []

directory = "./schlager-lyrics-bot/lyrics"

for filename in os.scandir(directory):
    if filename.is_file():
        files.append(filename.path)

dataset = datasets.load_dataset("text", data_files= files, split= "train")

In [None]:
#Now we split the dataset into training (90%) and testing (10%)

d = dataset.train_test_split(test_size= 0.1)
d["train"], d["test"]

Now that the dataset is loaded and split into training and test data, it is time to train the tokenizer. To achieve this, we need to write our dataset into text files - keeping training and test data separate.

In [None]:
def dataset_to_text(dataset, output_filename="data.txt"):
  """Utility function to save dataset text to disk,
  useful for using the texts to train the tokenizer
  (as the tokenizer accepts files)"""
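  # write as UTF-8 explicitly so German umlauts survive regardless of the platform default
  with open(output_filename, "w", encoding="utf-8") as f: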
    for t in dataset["text"]:
      print(t, file=f)

# save the training set to train.txt
dataset_to_text(d["train"], "./schlager-lyrics-bot/model/model_data/train.txt")
# save the testing set to test.txt
dataset_to_text(d["test"], "./schlager-lyrics-bot/model/model_data/test.txt")

Next we define some parameters of the tokenizer. training_file lists the data we pass to the tokenizer for training; it can contain more than one file. vocab_size is the number of tokens in the vocabulary, while max_length is the maximum sequence length.

truncate_longer_samples is a boolean that controls how samples longer than max_length are handled. If it is True, longer samples are truncated to max_length. If it is False, nothing is truncated: the samples are concatenated and then split into chunks of max_length, so every resulting sample has exactly max_length tokens.
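The difference between the two strategies can be sketched on toy token lists. The helper names truncate and group are hypothetical and the code is a simplified illustration, not the function used later in this notebook.

```python
from itertools import chain

def truncate(samples, max_length):
    """Keep at most max_length tokens of each sample; the rest is discarded."""
    return [sample[:max_length] for sample in samples]

def group(samples, max_length):
    """Concatenate all samples, then cut the stream into full chunks of max_length."""
    stream = list(chain.from_iterable(samples))
    usable = (len(stream) // max_length) * max_length  # drop the incomplete tail
    return [stream[i:i + max_length] for i in range(0, usable, max_length)]

samples = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11]]
truncated = truncate(samples, max_length=4)
grouped = group(samples, max_length=4)
print(truncated)  # [[1, 2, 3], [4, 5, 6, 7], [10, 11]]
print(grouped)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Truncation keeps one output per input but loses the tokens past the limit and leaves samples of unequal length. Grouping yields uniform chunks and loses only the short tail of the stream, at the cost of mixing neighbouring samples in one chunk.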

In [11]:
special_tokens = [
  "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]

training_file = ["./schlager-lyrics-bot/model/model_data/train.txt"]
# 30,522 is the default vocabulary size of BERT
vocab_size = 30522
# maximum sequence length; lowering it speeds up training and leaves room for a larger batch size
max_length = 512
# whether to truncate long samples (True) or concatenate and re-split them into chunks of max_length (False)
truncate_longer_samples = False

We are now ready to train the tokenizer. BERT's default tokenizer is WordPiece, so we initialize the BertWordPieceTokenizer class from the tokenizers library and train it with its train() method. Training takes several minutes. We then save the tokenizer to a directory.
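As a rough illustration of what a WordPiece vocabulary does once it exists, the sketch below splits a word by greedy longest-match-first lookup, marking continuation pieces with "##". It covers only how a vocabulary is applied, not how it is learned, and the function name and toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # no piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"schlag", "lied", "##er", "##bot"}  # hypothetical vocabulary
print(wordpiece_tokenize("schlager", toy_vocab))     # ['schlag', '##er']
print(wordpiece_tokenize("schlagerbot", toy_vocab))  # ['schlag', '##er', '##bot']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

The last call shows why an unknown token is still needed: a word falls back to [UNK] only when no sequence of vocabulary pieces covers it.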

In [None]:
#Initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()

#Train the tokenizer
tokenizer.train(files=training_file, vocab_size=vocab_size, special_tokens=special_tokens)

#Truncation to max_length (512 tokens) is left disabled because truncate_longer_samples is False
# tokenizer.enable_truncation(max_length=max_length)

model_path = "./schlager-lyrics-bot/model/tokenizer/"

#Save the tokenizer to directory
tokenizer.save_model(model_path)

#Dumping some of the tokenizer config to config file,
#including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_path, "config.json"), "w") as f:
  tokenizer_cfg = {
      "do_lower_case": True,
      "unk_token": "[UNK]",
      "sep_token": "[SEP]",
      "pad_token": "[PAD]",
      "cls_token": "[CLS]",
      "mask_token": "[MASK]",
      "model_max_length": max_length,
      "max_len": max_length,
  }
  json.dump(tokenizer_cfg, f)

The tokenizer.save_model() method saves the vocabulary file to that path. We also manually save some tokenizer configuration, including the special tokens:

- unk_token: A special token that represents an out-of-vocabulary token. Because WordPiece can fall back to sub-word pieces, unknown tokens are rare, but not impossible.
- sep_token: A special token that separates two different sentences in the same input.
- pad_token: A special token used to pad sentences that are shorter than the maximum sequence length, since all token arrays in a batch must have the same size.
- cls_token: A special token placed at the start of every input. Its final representation stands for the input as a whole and is typically used for classification.
- mask_token: The token that replaces masked positions in the Masked Language Modeling (MLM) pre-training task.
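The sketch below shows how these tokens frame a single padded input. It assumes the special tokens receive ids in the order of the special_tokens list ([PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4), which matches the tokenized samples printed later in this notebook, where sequences start with 2 and end with 3. The function build_inputs is hypothetical.

```python
# Assumed ids, following the order of the special_tokens list
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 1, 2, 3, 4

def build_inputs(token_ids, target_length):
    """Wrap token ids in [CLS] ... [SEP] and pad the result to target_length."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    padding = target_length - len(ids)
    if padding < 0:
        raise ValueError("target_length is too short for this input")
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks real tokens, 0 marks padding the model should not attend to
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

encoded = build_inputs([984, 218, 133, 201, 901, 1353], target_length=10)
print(encoded["input_ids"])       # [2, 984, 218, 133, 201, 901, 1353, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```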

In [12]:
#Let's load the tokenizer
model_path = "./schlager-lyrics-bot/model/tokenizer/"
tokenizer = transformers.BertTokenizerFast.from_pretrained(model_path)

With the tokenizer ready to use, we can now tokenize our dataset.

In [13]:
#Load train and test data
train_data = datasets.load_dataset("text", data_files= "./schlager-lyrics-bot/model/model_data/train.txt", split= "train")
test_data = datasets.load_dataset("text", data_files= "./schlager-lyrics-bot/model/model_data/test.txt", split= "train")
#Tokenizing the training dataset
train_dataset = train_data.map((lambda x: tokenizer(x["text"], return_special_tokens_mask=True)), batched= True)

#Tokenizing the test dataset
test_dataset = test_data.map((lambda x: tokenizer(x["text"], return_special_tokens_mask=True)), batched= True)

print(train_dataset[0])



{'text': ' Verrückt sind wir doch beide sowieso', 'input_ids': [2, 984, 218, 133, 201, 901, 1353, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 1]}


In [14]:
# Keep only the columns the model needs; their values are still returned as Python lists

test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

print(test_dataset[0])

{'input_ids': [2, 232, 1445, 509, 1445, 16, 1583, 297, 1360, 1060, 3], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}


In [15]:
# Main data processing function: concatenates all texts from our dataset and splits them into chunks of
# max_length.
# adapted from: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py

max_length = 512

features = Features({
    "text": Sequence(Value("string")),
    "input_ids": Sequence(Value("int64")),
    "token_type_ids": Sequence(Value("int64")),
    "attention_mask": Sequence(Value("int64")),
    "special_tokens_mask": Sequence(Value("int64"))
})

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. Padding it would be an alternative if the model supported it;
    # customize this part to your needs.
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result

train_dataset = train_dataset.map(group_texts, batched=True, features=features,
                                desc=f"Grouping texts in chunks of {max_length}")
test_dataset = test_dataset.map(group_texts, batched=True,features=features,
                                desc=f"Grouping texts in chunks of {max_length}")

print(train_dataset[0])



{'input_ids': [2, 984, 218, 133, 201, 901, 1353, 3, 2, 250, 16, 228, 949, 296, 177, 317, 3, 2, 1849, 16, 1849, 126, 673, 274, 3, 2, 122, 874, 143, 3, 2, 162, 11, 42, 762, 2358, 4325, 7847, 3, 2, 250, 17, 121, 17, 121, 17, 121, 17, 121, 17, 121, 17, 121, 17, 121, 17, 121, 12, 297, 16, 297, 13, 3, 2, 122, 120, 115, 16, 133, 501, 913, 430, 126, 274, 3, 2, 333, 4502, 16, 333, 5196, 3, 2, 145, 131, 332, 16, 146, 3820, 244, 3637, 11, 37, 3, 2, 115, 235, 738, 332, 120, 274, 120, 1205, 153, 131, 3, 2, 122, 259, 450, 1266, 3, 2, 638, 176, 145, 672, 1727, 4933, 9537, 16, 3, 2, 509, 173, 16, 509, 173, 751, 120, 2326, 173, 3, 2, 115, 1248, 1394, 16, 216, 122, 1101, 3, 2, 732, 308, 869, 122, 126, 4799, 270, 3, 2, 1763, 3635, 3, 2, 336, 151, 848, 3332, 942, 3, 2, 226, 196, 11, 283, 2072, 4096, 3, 2, 149, 201, 263, 515, 16, 5983, 1169, 3, 2, 32, 806, 6530, 386, 4664, 302, 614, 7493, 321, 4325, 3, 2, 136, 3638, 16, 218, 184, 146, 1582, 459, 245, 381, 20, 3, 2, 1005, 17, 1005, 17, 1005, 2274, 2725, 16,

In [16]:
# convert them from lists to torch tensors
train_dataset.set_format(type='torch')
test_dataset.set_format(type='torch')

print(len(train_dataset), len(test_dataset))
print(train_dataset.format)


2172 242
{'type': 'torch', 'format_kwargs': {}, 'columns': ['text', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'], 'output_all_columns': False}


In [17]:
# initialize the model with the config
model_config = transformers.BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = transformers.BertForMaskedLM(config=model_config)

# initialize the data collator, randomly masking 20% (default is 15%) of the tokens for the Masked Language
# Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2, return_tensors='pt'
)


In [18]:
trained_model_path = "./schlager-lyrics-bot/model/trained_model"

training_args = TrainingArguments(
    output_dir=trained_model_path,  # directory where model checkpoints are saved
    evaluation_strategy="epoch",    # evaluate at the end of every epoch
    overwrite_output_dir=True,
    num_train_epochs=10,            # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # training batch size, set it as high as your GPU memory allows
    gradient_accumulation_steps=8,  # number of batches whose gradients are accumulated before each weight update
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=1000,             # log training metrics every 1000 update steps
    save_steps=1000,                # save a model checkpoint every 1000 update steps
    # load_best_model_at_end=True,  # whether to load the best model (in terms of loss) at the end of training
    # save_total_limit=3,           # keep only the 3 most recent checkpoints on disk to save space
)

# initialize the trainer and pass everything to it
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# train the model
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RuntimeError: ignored
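As a rough sanity check on the training arguments above, the following arithmetic estimates the effective batch size and the number of weight updates. It assumes a single device and uses the training set size printed earlier (2172 chunks). The exact rounding inside the Trainer may differ between library versions, so treat the figures as approximate.

```python
import math

num_train_samples = 2172           # len(train_dataset) printed earlier
per_device_train_batch_size = 10
gradient_accumulation_steps = 8
num_train_epochs = 10
num_devices = 1                    # assumption: one GPU

# samples that contribute to a single weight update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# batches the data loader yields per epoch (the last batch may be smaller)
batches_per_epoch = math.ceil(num_train_samples / (per_device_train_batch_size * num_devices))

# weight updates per epoch and over the whole run
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
# 80 218 27 270
```

Under these assumptions the run performs roughly 270 updates in total, so step-based intervals of 1000 would not be reached. Smaller values for logging_steps and save_steps may be worth considering if intermediate logs or checkpoints are wanted.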