**Introduction**

In this assignment, you will implement a Recurrent Neural Network (RNN) for music generation.  

For this, you will use the Irish Massive ABC Notation (IrishMAN) dataset, which contains a collection of Irish folk tunes in ABC notation.  

The goal is to train an RNN to generate new tunes based on the patterns learned from the dataset.

**Dataset:**  
The IrishMAN dataset is available at [https://huggingface.co/datasets/sander-wood/irishman](https://huggingface.co/datasets/sander-wood/irishman).


**Tasks**

a) Data Preparation: Download the IrishMAN dataset and preprocess the ABC notation files to create a suitable input format for the RNN.  

This includes tokenizing the ABC notation and creating sequences of tokens.

In [32]:
from datasets import load_dataset

import torch
import numpy as np

# Load the IrishMAN dataset and select its train and validation splits
dataset = load_dataset("sander-wood/irishman")
train_dataset = dataset["train"]
validation_dataset = dataset["validation"]

# print("Train dataset size:", len(train_dataset))
# print("Validation dataset size:", len(validation_dataset))

# Concatenate all training tunes into one string, separated by newlines
train_data = "\n".join(i['abc notation'] for i in train_dataset)

numberOfSequences = 10
sequenceLength = 100

# Pre-allocate a list to hold the extracted sequences
sequence = [None] * numberOfSequences

# Extract consecutive, non-overlapping sequences of sequenceLength characters
# from the concatenated train_data string
for i in range(numberOfSequences):
    start_idx = i * sequenceLength
    end_idx = start_idx + sequenceLength
    sequence[i] = train_data[start_idx:end_idx]

In [33]:

# print(f"Sequence 1:\n{sequence[0]}\n")
# print(f"Length of sequence 1: {len(sequence[0])} characters\n")


In [None]:

# Sort the tunes by length (shortest first), then split each tune into characters
sorted_train_dataset = sorted(train_dataset, key=lambda x: len(x['abc notation']))

tokens = [list(tune['abc notation']) for tune in sorted_train_dataset]

# Create a vocabulary of unique tokens
all_tokens = set(token for tune in tokens for token in tune)

# Create token-to-index and index-to-token mappings
token2idx = {token: idx for idx, token in enumerate(sorted(all_tokens))}
idx2token = {idx: token for token, idx in token2idx.items()}

# Map each token in each tune to its corresponding integer
token_indices = [[token2idx[token] for token in tune] for tune in tokens]

print(f"Number of unique tokens: {len(all_tokens)}")
print("Sample tokens from first tune:", tokens[0])

Number of unique tokens: 95
Sample tokens from first tune: ['X', ':', '9', '2', '3', '\n', 'L', ':', '1', '/', '4', '\n', 'M', ':', '4', '/', '4', '\n', 'K', ':', 'D', '\n']
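
The character-to-index mappings built above can be turned into next-token training pairs by sliding a window over each tune, with the target being the input shifted by one position. The following is a minimal sketch under stated assumptions: the two toy tunes stand in for the real dataset, and `build_vocab` and `make_next_token_pairs` are hypothetical helper names, not part of the assignment.

```python
def build_vocab(tunes):
    """Build character-level token2idx / idx2token mappings (same idea as above)."""
    chars = sorted({ch for tune in tunes for ch in tune})
    token2idx = {ch: i for i, ch in enumerate(chars)}
    idx2token = {i: ch for ch, i in token2idx.items()}
    return token2idx, idx2token


def make_next_token_pairs(indices, seq_len):
    """Slide a window over one tune; each target is its input shifted by one."""
    pairs = []
    for start in range(len(indices) - seq_len):
        x = indices[start:start + seq_len]
        y = indices[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs


# Toy stand-ins for real tunes (assumption: real data comes from the dataset)
toy_tunes = ["X:1\nK:D\nABcd|", "X:2\nK:G\nGABc|"]
token2idx, idx2token = build_vocab(toy_tunes)

# Encode one tune to integers and decode it back to check the round trip
encoded = [token2idx[ch] for ch in toy_tunes[0]]
decoded = "".join(idx2token[i] for i in encoded)

pairs = make_next_token_pairs(encoded, seq_len=4)
print("Round trip OK:", decoded == toy_tunes[0])
print("Number of (input, target) pairs:", len(pairs))
```

A window length of 4 is only for illustration; on the real data a longer window (such as the 100 characters used earlier) is more plausible.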


b) Model Implementation: Implement an RNN model (RNN layer, LSTM layer) using a deep learning framework of your choice (e.g., PyTorch).  

The model should be able to take sequences of tokens as input and predict the next token in the sequence.  
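
One possible shape for such a model in PyTorch is an embedding layer, an LSTM, and a linear layer that maps each hidden state to vocabulary logits. This is a sketch rather than a required design: the class name `CharLSTM` and the layer sizes are assumptions, and the vocabulary size of 95 is taken from the tokenization output above.

```python
import torch
import torch.nn as nn


class CharLSTM(nn.Module):
    """Embedding -> LSTM -> linear head producing next-token logits."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Swap nn.LSTM for nn.RNN to get a plain recurrent layer instead
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        # x: (batch, seq_len) integer token indices
        emb = self.embedding(x)              # (batch, seq_len, embed_dim)
        out, state = self.lstm(emb, state)   # (batch, seq_len, hidden_dim)
        return self.head(out), state         # logits: (batch, seq_len, vocab)


vocab_size = 95  # from the tokenization step
model = CharLSTM(vocab_size)

# Random token indices as a shape check (not real data)
dummy_batch = torch.randint(0, vocab_size, (8, 100))
logits, state = model(dummy_batch)
print(logits.shape)
```

Returning the recurrent state alongside the logits makes it easy to feed one token at a time during generation.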

c) Training: Train your RNN model on the preprocessed dataset.  

Experiment with different hyperparameters such as learning rate, batch size, and number of epochs to achieve the best performance.
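
A training loop for next-token prediction could look like the sketch below. Everything in it is illustrative: the hyperparameter values, the `TinyLSTM` class, and the random batch are assumptions chosen so the example runs quickly, and a real run would iterate over batches of the preprocessed tunes instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical hyperparameters; these are the values to experiment with
hparams = {"lr": 1e-2, "batch_size": 4, "seq_len": 16, "epochs": 20}
vocab_size = 20  # toy value; the notebook's vocabulary has 95 tokens


class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.head(out)


model = TinyLSTM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=hparams["lr"])
criterion = nn.CrossEntropyLoss()

# Random stand-in batch: targets are the inputs shifted by one position
data = torch.randint(0, vocab_size,
                     (hparams["batch_size"], hparams["seq_len"] + 1))
inputs, targets = data[:, :-1], data[:, 1:]

losses = []
for epoch in range(hparams["epochs"]):
    optimizer.zero_grad()
    logits = model(inputs)  # (batch, seq_len, vocab)
    # CrossEntropyLoss expects (N, vocab) logits and (N,) class indices
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    # Gradient clipping is a common safeguard for recurrent models
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Because the loop repeatedly fits a single fixed batch, the falling loss only shows that the optimization step works; validation loss on held-out tunes is what indicates real progress.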

d) Music Generation: After training, use your RNN model to generate new music sequences.  

You can start with a seed sequence and let the model predict subsequent tokens to create a complete tune.
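
Generation from a seed can be sketched as follows: feed the seed through the model, sample the next token from the softmax over the last position, append it, and repeat while carrying the LSTM state forward. The `generate` function, its `temperature` parameter, and the toy vocabulary are assumptions for illustration, and the model here is untrained, so its output is random characters rather than a valid tune.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embedding(x), state)
        return self.head(out), state


@torch.no_grad()
def generate(model, seed_ids, num_new_tokens, temperature=1.0):
    """Sample num_new_tokens token ids, one at a time, after the seed."""
    model.eval()
    ids = list(seed_ids)
    # Run the whole seed once to warm up the recurrent state
    logits, state = model(torch.tensor([ids], dtype=torch.long))
    for _ in range(num_new_tokens):
        # Lower temperature -> more conservative samples
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
        # Feed only the new token, reusing the carried-over state
        logits, state = model(torch.tensor([[next_id]]), state)
    return ids


# Toy vocabulary (assumption); in the notebook, reuse token2idx / idx2token
vocab = sorted(set("X:1\nK:D\nABcdefg|"))
token2idx = {ch: i for i, ch in enumerate(vocab)}
idx2token = {i: ch for ch, i in token2idx.items()}

seed_text = "X:1\nK:D\n"
model = TinyLSTM(len(vocab))  # untrained: output is noise
generated_ids = generate(model, [token2idx[c] for c in seed_text],
                         num_new_tokens=30, temperature=0.8)
generated_text = "".join(idx2token[i] for i in generated_ids)
print(generated_text)
```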

e) Evaluation: Evaluate the quality of the generated music, either by listening to the tunes or with quantitative metrics such as top-1 or top-5 next-token accuracy.  

Other metrics worth considering include BLEU score and perplexity.  
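
Top-k accuracy and perplexity can both be computed directly from the model's logits and the true next tokens. The sketch below assumes flattened logits of shape `(N, vocab)` and targets of shape `(N,)`; the function names `top_k_accuracy` and `perplexity` are hypothetical, and the small tensors are hand-made examples rather than model output.

```python
import math

import torch
import torch.nn.functional as F


def top_k_accuracy(logits, targets, k=1):
    """Fraction of positions whose true token is among the k highest logits."""
    top_k = logits.topk(k, dim=-1).indices                 # (N, k)
    hits = (top_k == targets.unsqueeze(-1)).any(dim=-1)    # (N,)
    return hits.float().mean().item()


def perplexity(logits, targets):
    """exp of the mean cross-entropy (in nats) over all positions."""
    return math.exp(F.cross_entropy(logits, targets).item())


# Hand-made example: 3 positions, vocabulary of 4 tokens
logits = torch.tensor([[2.0, 1.0, 0.0, -1.0],
                       [0.0, 3.0, 1.0, 2.0],
                       [1.0, 0.0, 0.5, 5.0]])
targets = torch.tensor([0, 3, 2])

top1 = top_k_accuracy(logits, targets, k=1)
top2 = top_k_accuracy(logits, targets, k=2)

# A model that is uniform over V tokens has perplexity exactly V
uniform_ppl = perplexity(torch.zeros(5, 4), torch.tensor([0, 1, 2, 3, 0]))

print(f"top-1: {top1:.3f}, top-2: {top2:.3f}, uniform ppl: {uniform_ppl:.3f}")
```

The uniform case is a useful sanity check: a trained model should reach a perplexity well below the vocabulary size.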

Please use TensorBoard to visualize the training process and the generated music.  

Here is a tutorial on how to use TensorBoard with PyTorch: https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensorboard_with_pytorch.ipynb.
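
As a minimal sketch of TensorBoard logging, `SummaryWriter` from `torch.utils.tensorboard` can record a scalar per training step and store a generated tune as text. The log directory name, the tag names, and the placeholder loss values below are assumptions; in a real run, the logged value would be the actual training loss and the text would be model output.

```python
import math

from torch.utils.tensorboard import SummaryWriter

log_dir = "runs/irishman_rnn_demo"  # hypothetical directory name
writer = SummaryWriter(log_dir=log_dir)

for step in range(10):
    # Placeholder curve standing in for the real training loss
    placeholder_loss = math.exp(-0.3 * step)
    writer.add_scalar("Loss/train", placeholder_loss, step)

# Generated ABC notation can be logged as text next to the curves
writer.add_text("Generated/abc", "X:1\nK:D\nABcd|", global_step=9)
writer.close()
```

Running `tensorboard --logdir runs` in a terminal then serves the dashboard for these logs.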