# Google Colab Project: Music Composition with GPT-2

## Introduction

In this project, we explore music composition with a natural language processing model, GPT-2 (Generative Pre-trained Transformer 2). The goal is to fine-tune a pretrained GPT-2 model so that it generates music compositions in ABC notation.

## Project Workflow

### Data Loading and Preparation

- **Data Source**: The project begins by obtaining a dataset of music compositions in ABC notation. This dataset contains the music pieces used to train the model.

- **Data Preprocessing**: The dataset is preprocessed to clean and format the ABC notation for model training. This includes tokenization and encoding into fixed-length token sequences that the model can consume.
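
As a rough illustration of the cleaning step, an ABC tune can be split into its header fields (lines such as `K:C`) and its note body. The sketch below is a simplified stand-in for the notebook's own `read_abc` function; the helper name `split_abc` is hypothetical and the example tune is the one used in the tokenization cell further down.

```python
# Hypothetical helper for illustration only; the notebook's real parser is read_abc.
HEADER_PREFIXES = tuple(c + ":" for c in "BCDFGHIKLMmNOPQRrSsTUVWwXZ")


def split_abc(text):
    """Split ABC text into a list of header lines and one joined body string."""
    header, body = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if line.startswith(HEADER_PREFIXES):
            header.append(line)
        else:
            body.append(line)
    return header, "".join(body)


tune = "X:1\nT:My Tune\nM:4/4\nK:C\n| CDEF G2 A2 | B4 c2"
header, body = split_abc(tune)
print(header)
print(body)
```

Unlike this sketch, `read_abc` also drops the title line and strips spaces and silent bars from the body.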

### GPT-2 Model Training


- **Model Selection**: We fine-tune the pretrained GPT-2 model, a generative language model, using PyTorch and the Hugging Face Transformers `Trainer`.

- **Training Procedure**: The GPT-2 model is trained on the preprocessed music data, and each run is logged to Weights & Biases to track the training loss. Hyperparameters such as the learning rate are varied between runs to improve the model's ability to generate coherent and harmonious music compositions.


In [1]:
import sys

# Install the dependencies before importing them.
!{sys.executable} -m pip install wandb youtokentome transformers
!{sys.executable} -m pip install -U accelerate

import glob
import os
from argparse import ArgumentParser

import pandas as pd
import torch
from tqdm import tqdm
import wandb
import youtokentome as yttm
from transformers import Trainer, TrainingArguments, default_data_collator

# Prompts for the API key (or reads WANDB_API_KEY) instead of hard-coding it in the notebook.
wandb.login()




In [2]:
ORIGIN = os.path.normpath(os.getcwd())
print(ORIGIN)
TRAIN_DIR ="/content/drive/MyDrive/test2/"
VALID_DIR = "/content/drive/MyDrive/Music_project/valid_path/"
TEST_DIR = "/content/drive/MyDrive/Music_project/test_path/"
TOKENIZER_DIR = "/content/drive/MyDrive/Music_project/abc_run5.yttm"
DATASET_DIR ="/content/drive/MyDrive/Music_project/300,000_new_samples.csv"
# OUTPUT_DIR = "/content/drive/MyDrive/Music_project/output_GPT2_checkpoints6"
OUTPUT_DIR = "/content/drive/MyDrive/Music_project/"


/content


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"  # You can also use "gpt2-medium", "gpt2-large", etc., depending on the model size you want to use.
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token



In [None]:
abc_notation = "X:1\nT:My Tune\nM:4/4\nK:C\n| CDEF G2 A2 | B4 c2"
input_ids = tokenizer.encode(abc_notation, add_special_tokens=True, return_tensors="pt")
input_ids

tensor([[   55,    25,    16,   198,    51,    25,  3666, 42587,   198,    44,
            25,    19,    14,    19,   198,    42,    25,    34,   198,    91,
          6458, 25425,   402,    17,   317,    17,   930,   347,    19,   269,
            17]])

In [None]:
output = model.generate(input_ids, max_length=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
generated_abc = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_abc)

X:1
T:My Tune
M:4/4
K:C
| CDEF G2 A2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c2 | B4 c
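
The output above repeats the last bar because `generate` decodes greedily by default. Sampling is one common way to reduce such loops. The sketch below only builds keyword arguments for `model.generate`; the helper name and the specific values (`top_k=50`, `top_p=0.95`, `repetition_penalty=1.2`) are illustrative assumptions, not settings used in this notebook.

```python
def sampling_generate_kwargs(eos_token_id, max_length=100):
    """Return keyword arguments that make `model.generate` sample instead of decoding greedily."""
    return {
        "max_length": max_length,
        "do_sample": True,          # draw tokens from the distribution rather than taking the argmax
        "top_k": 50,                # assumed value: keep the 50 most likely tokens
        "top_p": 0.95,              # assumed value: nucleus sampling threshold
        "repetition_penalty": 1.2,  # assumed value: penalize tokens that were already generated
        "num_return_sequences": 1,
        "pad_token_id": eos_token_id,
    }


kwargs = sampling_generate_kwargs(eos_token_id=50256)  # 50256 is GPT-2's end-of-text token id
print(kwargs)

# Usage with the objects defined in the cells above:
# output = model.generate(input_ids, **sampling_generate_kwargs(tokenizer.eos_token_id))
```

Because sampling is random, each call can return a different continuation; call `torch.manual_seed` first if reproducible output is needed.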


In [5]:
# Header field prefixes recognised in ABC notation (e.g. "K:" for key, "M:" for meter).
USEABLE_PARAMS = [i + ":" for i in "BCDFGHIKLMmNOPQRrSsTUVWwXZ"]

def read_abc(path):
    """Return (header fields, note body) for one ABC file, or (None, None) if either is empty."""
    keys = []
    notes = []
    with open(path) as rf:
        for line in rf:
            line = line.strip()
            if line.startswith("%"):  # Skip comments
                continue

            if any(line.startswith(key) for key in USEABLE_PARAMS):
                if line.startswith("T:"):
                    continue  # Skip the title for better tokenization
                # Every file in this dataset uses the unit note length L:1/8.
                keys.append(line)
            else:
                notes.append(line)

    keys = " ".join(keys)

    # Join the body lines and drop the spaces between notes.
    notes = "".join(notes).strip()
    notes = notes.replace(" ", "")

    if notes.endswith("|"):
        notes = notes[:-1]
    # Remove line-continuation backslashes.
    notes = notes.replace("\\", "")
    # Remove silent bars: with L:1/8, x8 or z8 is one muted bar.
    notes = notes.replace("x8|", "")
    notes = notes.replace("z8|", "")

    notes = notes.strip()

    if not keys or not notes:
        return None, None

    return keys, notes

# from Transformer_model import  get_model


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:

OUTPUT_DIR

'/content/drive/MyDrive/Music_project/output_BERT_checkpoints6'

In [None]:
def load_dataset(path):
  """Tokenize the ABC files in `path` into (header tokens, note token lists) pairs."""
  data = []
  counter = 0

  for file in os.listdir(path):
      filename = os.path.join(path, file)
      print(filename)
      keys, notes = read_abc(filename)
      print("======================")
      print(keys)
      print(notes)
      if keys is None:
          continue  # Skip files without header fields or notes

      keys_tokens = tokenizer.encode(keys)

      # Split the note body on commas and encode each piece separately.
      bars = notes.split(",")
      notes_tokens = [tokenizer.encode(i + " | ") for i in bars]

      print("======total=====")
      print(notes_tokens)

      # Debugging limit: stop at the 10th usable file, which is not added.
      counter = counter + 1
      if counter == 10:
        break
      data.append((keys_tokens, notes_tokens))
  return data

In [None]:
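import ast

def read_dataset(path):
  """Load (key tokens, note tokens) pairs from a CSV whose first two columns hold Python list literals."""
  train_data = []
  df = pd.read_csv(path)
  for i in range(df.shape[0]):
    # Each cell stores a list as text, so parse it back into a Python object.
    int_key = ast.literal_eval(df.iloc[i, 0])
    int_notes = ast.literal_eval(df.iloc[i, 1])
    train_data.append((int_key, int_notes))
  return train_data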


In [None]:
train_data = read_dataset(TRAIN_DIR)


In [None]:
train_data = []
valid_data = []
test_data = []

train_data = load_dataset(TRAIN_DIR)
# valid_data = load_dataset(VALID_DIR)
# test_data = load_dataset(TEST_DIR)

/content/drive/MyDrive/test2/8352_9782.abc
[[58, 33, 18, 14, 17, 38, 18, 14, 17, 12, 35, 18, 14, 17, 12, 38, 11, 18, 14, 17, 12, 35, 11, 18, 14, 17, 12, 7131, 38, 14, 17, 35, 14, 17, 38, 11, 14, 17, 35, 11, 14, 17, 12, 7131, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 49146, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 49146, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 49146, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 49146, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 49146, 33, 14, 17, 38, 14, 17, 35, 14, 17, 35, 11, 14, 17, 12, 60, 35, 11, 14, 17, 12, 930, 220], [58, 33, 17, 38, 17, 35, 17, 38, 11, 17, 12, 35, 11, 17, 12, 7131, 33, 21, 38, 21, 35, 21, 38, 11, 21, 35, 11, 21, 12, 60, 930, 220], [58, 28, 32, 17, 37, 17, 35, 17, 35, 11, 17, 7131, 32, 14, 17, 12, 37, 14, 17, 12, 35, 14, 17, 7131, 32, 14, 17, 37, 14, 17, 7131, 32, 

In [None]:
def tokonize_abc_input(dataset):
  tokens = []
  for i in range(len(dataset)):
    if len(dataset['abc_input'][i]) > 1024:  # Skip tunes longer than 1024 characters
      continue
    token = tokenizer.encode(
        dataset['abc_input'][i],
        padding="max_length",  # Pad every sequence to max_length tokens
        max_length=512,  # Fixed sequence length in tokens; adjust as needed
        truncation=True,  # Truncate inputs longer than max_length
        add_special_tokens=True,
        return_tensors="pt")
    tokens.append(token[0])  # Drop the batch dimension
  return tokens

In [None]:
df = pd.read_csv(DATASET_DIR)
train_list = tokonize_abc_input(df)


In [None]:
from torch.utils.data import Dataset

class ABCDataset(Dataset):
    def __init__(self, tokenized_data):
        self.tokenized_data = tokenized_data

    def __len__(self):
        return len(self.tokenized_data)

    def __getitem__(self, idx):
        return self.tokenized_data[idx]
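
`ABCDataset` exposes only a length and integer indexing, which is enough to hold out part of the data for validation (evaluation is commented out in the training cell of this notebook). The following is a minimal sketch, assuming a random index split; the helper name, the 10% fraction, and the seed are hypothetical choices.

```python
import random


def train_valid_indices(n_samples, valid_fraction=0.1, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into (train, valid) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, so the global random state is untouched
    n_valid = max(1, round(n_samples * valid_fraction))
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = train_valid_indices(100)
print(len(train_idx), len(valid_idx))

# Usage once an ABCDataset instance (called abc_dataset here) has been built:
# from torch.utils.data import Subset
# train_subset = Subset(abc_dataset, train_idx)
# valid_subset = Subset(abc_dataset, valid_idx)
```

The validation subset could then be passed to the `Trainer` as `eval_dataset`.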


In [None]:
abc_dataset = ABCDataset(train_list)
abc_dataset

<__main__.ABCDataset at 0x7c3f1f486e00>

In [None]:
train_dataset_2 = ABCD(train_data)
# valid_dataset = ABCD(valid_data)

NameError: ignored

In [None]:
train_dataset_2[0]['labels']


tensor([58])

In [None]:
len(train_dataset_2[1]['input_ids'])
max_length = 0
for data in train_dataset_2:
  # print(data)
  # print(data['input_ids'])
  # print(len(data['input_ids']))
  if max_length < len(data['input_ids']):
    max_length = len(data['input_ids'])
print(max_length)

351


In [None]:
import torch


def collate_function(samples, max_seq_len=1024):
    """Pad a batch of {"input_ids": 1-D LongTensor} samples to max_seq_len tokens.

    This collator is not passed to the Trainer below, which uses
    DataCollatorForLanguageModeling instead.
    """
    # GPT-2 accepts at most 1024 positions and has no pad token; EOS was set as the pad token above.
    pad_id = tokenizer.pad_token_id
    input_ids, attention_mask, labels = [], [], []
    for sample in samples:
        seq = sample["input_ids"][:max_seq_len]
        n_pad = max_seq_len - len(seq)
        input_ids.append(torch.cat([seq, torch.full((n_pad,), pad_id, dtype=torch.long)]))
        # Build the mask from the lengths; comparing against the pad id would also hide real EOS tokens.
        attention_mask.append(torch.cat([torch.ones(len(seq), dtype=torch.long),
                                         torch.zeros(n_pad, dtype=torch.long)]))
        # -100 is ignored by the language-modeling loss, so padding does not contribute.
        labels.append(torch.cat([seq, torch.full((n_pad,), -100, dtype=torch.long)]))

    batch = {
        "input_ids": torch.stack(input_ids),
        "attention_mask": torch.stack(attention_mask),
        "labels": torch.stack(labels),
    }

    return batch



In [None]:
OUTPUT_DIR

'/content/drive/MyDrive/Music_project/output_BERT_checkpoints'

In [None]:
from transformers import Trainer, TrainingArguments,TrainerCallback
from transformers import get_cosine_schedule_with_warmup
from transformers import DataCollatorForLanguageModeling


training_args = TrainingArguments(
    output_dir=OUTPUT_DIR + 'run9_withGPT2_300,000_new_samples',
    overwrite_output_dir=True,
    # evaluation_strategy="steps",
    gradient_accumulation_steps=2,

    num_train_epochs=10,
    per_device_train_batch_size=4,
    # per_device_eval_batch_size=8,
    save_strategy='steps',
    save_steps=50000,
    # eval_steps=200,
    logging_strategy='epoch',
    fp16=True,
    report_to="wandb",
    run_name="Run_9_GPT2_toknizer-music_project_300,000_new_samples_DS",

    learning_rate=2e-3,  # Peak learning rate, reached at the end of warmup
    warmup_steps=200,  # A positive warmup_steps takes precedence over warmup_ratio, so warmup_ratio is omitted
    # weight_decay=0.01,  # Set weight decay if necessary
    seed=42,
)

# # Create the learning rate scheduler
# total_steps = len(train_dataset) * training_args.num_train_epochs // training_args.gradient_accumulation_steps
# training_args.learning_rate_scheduler = get_cosine_schedule_with_warmup(
#     optimizer, num_warmup_steps=training_args.warmup_steps, num_training_steps=total_steps

class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("flos", None)
        if state.is_local_process_zero:
            print(logs)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=abc_dataset,
    # eval_dataset=valid_dataset,
    data_collator= DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    callbacks=[PrinterCallback], 

)

# Start training
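
With `per_device_train_batch_size=4` and `gradient_accumulation_steps=2`, each optimizer step sees 8 samples on a single GPU. The sketch below estimates the number of optimizer steps per epoch under the assumption that the `Trainer` floors the number of batches divided by the accumulation steps. The sample count of 280,851 is an estimate inferred from the throughput and runtime that Run 9 reports, not a figure stated in the notebook; it is plausible given that tunes longer than 1,024 characters are dropped from the 300,000-sample CSV.

```python
import math


def optimizer_steps_per_epoch(n_samples, per_device_batch_size, grad_accum_steps, n_devices=1):
    """Estimate optimizer steps per epoch (assumes the last partial batch is kept)."""
    batches = math.ceil(n_samples / (per_device_batch_size * n_devices))
    return max(batches // grad_accum_steps, 1)


n_samples = 280_851  # estimated, see the note above
steps = optimizer_steps_per_epoch(n_samples, per_device_batch_size=4, grad_accum_steps=2)
print(steps)       # 35106
print(steps * 10)  # 351060 optimizer steps over 10 epochs
```

These figures agree with the `global_step=351060` that Run 9 reports.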


In [None]:
#9th Run

trainer.train()

Step,Training Loss
35106,0.453
70213,0.3086
105319,0.2322
140426,0.1777
175532,0.1387
210639,0.11
245745,0.0885
280852,0.0718
315958,0.3321
351060,0.0


{'loss': 0.453, 'learning_rate': 0.001801094453628228, 'epoch': 1.0}
{'loss': 0.3086, 'learning_rate': 0.0016010488513937182, 'epoch': 2.0}
{'loss': 0.2322, 'learning_rate': 0.001401003249159209, 'epoch': 3.0}
{'loss': 0.1777, 'learning_rate': 0.0012009519466453856, 'epoch': 4.0}
{'loss': 0.1387, 'learning_rate': 0.0010009234452488171, 'epoch': 5.0}
{'loss': 0.11, 'learning_rate': 0.0008008835432936214, 'epoch': 6.0}
{'loss': 0.0885, 'learning_rate': 0.0006008379410591119, 'epoch': 7.0}
{'loss': 0.0718, 'learning_rate': 0.0004008265405004845, 'epoch': 8.0}
{'loss': 0.3321, 'learning_rate': 0.0003597332269281195, 'epoch': 9.0}
{'loss': 0.0, 'learning_rate': 0.0003597332269281195, 'epoch': 10.0}
{'train_runtime': 65475.672, 'train_samples_per_second': 42.894, 'train_steps_per_second': 5.362, 'total_flos': 7.3383344603136e+17, 'train_loss': 0.19128658034105636, 'epoch': 10.0}


TrainOutput(global_step=351060, training_loss=0.19128658034105636, metrics={'train_runtime': 65475.672, 'train_samples_per_second': 42.894, 'train_steps_per_second': 5.362, 'total_flos': 7.3383344603136e+17, 'train_loss': 0.19128658034105636, 'epoch': 10.0})

In [None]:
#8th Run

trainer.train()

Step,Training Loss
35331,0.5422
70663,0.4488
105994,0.4111
141326,0.3835
176657,0.3622
211989,0.3449
247320,0.3318
282652,0.3214
317983,0.3143
353310,0.3098


{'loss': 0.5422, 'learning_rate': 1.801104471694373e-05, 'epoch': 1.0}
{'loss': 0.4488, 'learning_rate': 1.6010704879499307e-05, 'epoch': 2.0}
{'loss': 0.4111, 'learning_rate': 1.4010534960777096e-05, 'epoch': 3.0}
{'loss': 0.3835, 'learning_rate': 1.2010195123332673e-05, 'epoch': 4.0}
{'loss': 0.3622, 'learning_rate': 1.0009968565036392e-05, 'epoch': 5.0}
{'loss': 0.3449, 'learning_rate': 8.009572088017898e-06, 'epoch': 6.0}
{'loss': 0.3318, 'learning_rate': 6.009232250573476e-06, 'epoch': 7.0}
{'loss': 0.3214, 'learning_rate': 4.008835773554983e-06, 'epoch': 8.0}
{'loss': 0.3143, 'learning_rate': 2.008552575684631e-06, 'epoch': 9.0}
{'loss': 0.3098, 'learning_rate': 8.439296536490046e-09, 'epoch': 10.0}
{'train_runtime': 66380.1332, 'train_samples_per_second': 42.581, 'train_steps_per_second': 5.323, 'total_flos': 7.3853670260736e+17, 'train_loss': 0.37699885325563953, 'epoch': 10.0}


TrainOutput(global_step=353310, training_loss=0.37699885325563953, metrics={'train_runtime': 66380.1332, 'train_samples_per_second': 42.581, 'train_steps_per_second': 5.323, 'total_flos': 7.3853670260736e+17, 'train_loss': 0.37699885325563953, 'epoch': 10.0})

In [None]:
#check Run

trainer.train()


Step,Training Loss
24,2.0165
49,1.7082
73,1.5687
98,1.3302
122,1.2568
147,1.1024
171,1.0761
196,0.9928
220,0.9759
240,0.9194


{'loss': 2.0165, 'learning_rate': 2.2e-06, 'epoch': 0.98}
{'loss': 1.7082, 'learning_rate': 4.7e-06, 'epoch': 2.0}
{'loss': 1.5687, 'learning_rate': 7.100000000000001e-06, 'epoch': 2.98}
{'loss': 1.3302, 'learning_rate': 9.600000000000001e-06, 'epoch': 4.0}
{'loss': 1.2568, 'learning_rate': 1.2e-05, 'epoch': 4.98}
{'loss': 1.1024, 'learning_rate': 1.45e-05, 'epoch': 6.0}
{'loss': 1.0761, 'learning_rate': 1.69e-05, 'epoch': 6.98}
{'loss': 0.9928, 'learning_rate': 1.94e-05, 'epoch': 8.0}
{'loss': 0.9759, 'learning_rate': 1.1000000000000001e-05, 'epoch': 8.98}
{'loss': 0.9194, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.8}
{'train_runtime': 84.2824, 'train_samples_per_second': 23.018, 'train_steps_per_second': 2.848, 'total_flos': 496977444864000.0, 'train_loss': 1.3007516781489055, 'epoch': 9.8}


TrainOutput(global_step=240, training_loss=1.3007516781489055, metrics={'train_runtime': 84.2824, 'train_samples_per_second': 23.018, 'train_steps_per_second': 2.848, 'total_flos': 496977444864000.0, 'train_loss': 1.3007516781489055, 'epoch': 9.8})
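
The learning rates logged in the check run rise for most of training and then fall, which is the shape of a linear warmup followed by a linear decay to zero (the default `lr_scheduler_type` of `TrainingArguments` is `"linear"`). The following is a minimal sketch of that schedule, not the library's code. The function name is hypothetical, and the peak rate of 2e-5 with 200 warmup steps out of 240 total is inferred from the logged values rather than stated in the notebook.

```python
def linear_warmup_decay_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed settings for the check run (inferred from its log).
for step in (0, 22, 120, 200, 218, 240):
    print(step, linear_warmup_decay_lr(step, peak_lr=2e-5, warmup_steps=200, total_steps=240))
```

Under these assumptions the sketch gives about 2.2e-06 at step 22 and 1.1e-05 at step 218, the values the check run logged at global steps 24 and 220. The two-step lag is an inference: with `fp16=True`, an optimizer step that is skipped because of gradient overflow may not advance the scheduler.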

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Jun 15 13:12:18 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces