<a href="https://colab.research.google.com/github/IngaKristin/Final-Project-TensorFlow/blob/vera-branch/dataset_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Important Links:**
- [Paper](https://arxiv.org/pdf/2011.13062.pdf)
- [GitHub](https://github.com/sfc-computational-creativity-lab/x-rhythm-can)

1. [Download](https://magenta.tensorflow.org/datasets/groove#format) the Groove MIDI Dataset
2. Upload the dataset to Google Drive, then mount the drive in Colab

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
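
Once the drive is mounted, it can help to confirm that the dataset files are where the notebook expects them before reading anything. The following is a minimal sketch: `missing_entries` is a hypothetical helper (not part of `google.colab`), and `DATASET_ROOT` assumes the same folder that the `info.csv` path in this notebook points to.

```python
from pathlib import Path


def missing_entries(root, names):
    """Return the entries from `names` that do not exist under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).exists()]


# Assumed location: the Drive folder this notebook reads info.csv from.
DATASET_ROOT = '/content/drive/MyDrive/TF-dataset/groove'

# An empty list means every expected entry was found.
print(missing_entries(DATASET_ROOT, ['info.csv']))
```

If the list is not empty, the upload path and the mount point are the first things to check.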


# **1. Load the Dataset**

##### **1.1 Before loading the dataset, let's have a look at the `info.csv` file.**

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

In [38]:
df = pd.read_csv('/content/drive/MyDrive/TF-dataset/groove/info.csv')
df.head(10)

Unnamed: 0,drummer,session,id,style,bpm,beat_type,time_signature,midi_filename,audio_filename,duration,split
0,drummer1,drummer1/eval_session,drummer1/eval_session/1,funk/groove1,138,beat,4-4,drummer1/eval_session/1_funk-groove1_138_beat_...,drummer1/eval_session/1_funk-groove1_138_beat_...,27.872308,test
1,drummer1,drummer1/eval_session,drummer1/eval_session/10,soul/groove10,102,beat,4-4,drummer1/eval_session/10_soul-groove10_102_bea...,drummer1/eval_session/10_soul-groove10_102_bea...,37.691158,test
2,drummer1,drummer1/eval_session,drummer1/eval_session/2,funk/groove2,105,beat,4-4,drummer1/eval_session/2_funk-groove2_105_beat_...,drummer1/eval_session/2_funk-groove2_105_beat_...,36.351218,test
3,drummer1,drummer1/eval_session,drummer1/eval_session/3,soul/groove3,86,beat,4-4,drummer1/eval_session/3_soul-groove3_86_beat_4...,drummer1/eval_session/3_soul-groove3_86_beat_4...,44.716543,test
4,drummer1,drummer1/eval_session,drummer1/eval_session/4,soul/groove4,80,beat,4-4,drummer1/eval_session/4_soul-groove4_80_beat_4...,drummer1/eval_session/4_soul-groove4_80_beat_4...,47.9875,test
5,drummer1,drummer1/eval_session,drummer1/eval_session/5,funk/groove5,84,beat,4-4,drummer1/eval_session/5_funk-groove5_84_beat_4...,drummer1/eval_session/5_funk-groove5_84_beat_4...,45.687518,test
6,drummer1,drummer1/eval_session,drummer1/eval_session/6,hiphop/groove6,87,beat,4-4,drummer1/eval_session/6_hiphop-groove6_87_beat...,drummer1/eval_session/6_hiphop-groove6_87_beat...,44.119242,test
7,drummer1,drummer1/eval_session,drummer1/eval_session/7,pop/groove7,138,beat,4-4,drummer1/eval_session/7_pop-groove7_138_beat_4...,drummer1/eval_session/7_pop-groove7_138_beat_4...,27.706547,test
8,drummer1,drummer1/eval_session,drummer1/eval_session/8,rock/groove8,65,beat,4-4,drummer1/eval_session/8_rock-groove8_65_beat_4...,drummer1/eval_session/8_rock-groove8_65_beat_4...,59.067313,test
9,drummer1,drummer1/eval_session,drummer1/eval_session/9,soul/groove9,105,beat,4-4,drummer1/eval_session/9_soul-groove9_105_beat_...,drummer1/eval_session/9_soul-groove9_105_beat_...,36.540504,test


In [39]:
df.tail(10)

Unnamed: 0,drummer,session,id,style,bpm,beat_type,time_signature,midi_filename,audio_filename,duration,split
1140,drummer2,drummer2/session2,drummer2/session2/6,rock,130,beat,4-4,drummer2/session2/6_rock_130_beat_4-4.mid,,11.041335,train
1141,drummer2,drummer2/session2,drummer2/session2/7,rock,130,beat,4-4,drummer2/session2/7_rock_130_beat_4-4.mid,,42.629765,train
1142,drummer2,drummer2/session2,drummer2/session2/8,rock,130,beat,4-4,drummer2/session2/8_rock_130_beat_4-4.mid,,42.629765,train
1143,drummer2,drummer2/session2,drummer2/session2/9,rock,130,beat,4-4,drummer2/session2/9_rock_130_beat_4-4.mid,,5.481725,train
1144,drummer2,drummer2/session2,drummer2/session2/10,rock,130,beat,4-4,drummer2/session2/10_rock_130_beat_4-4.mid,,1.834614,train
1145,drummer2,drummer2/session2,drummer2/session2/11,rock,130,beat,4-4,drummer2/session2/11_rock_130_beat_4-4.mid,,1.909613,train
1146,drummer2,drummer2/session2,drummer2/session2/12,rock,130,beat,4-4,drummer2/session2/12_rock_130_beat_4-4.mid,,1.808652,train
1147,drummer2,drummer2/session2,drummer2/session2/13,rock,130,beat,4-4,drummer2/session2/13_rock_130_beat_4-4.mid,,1.864421,train
1148,drummer2,drummer2/session2,drummer2/session2/14,rock,130,beat,4-4,drummer2/session2/14_rock_130_beat_4-4.mid,,1.87596,train
1149,drummer2,drummer2/session2,drummer2/session2/15,rock,130,beat,4-4,drummer2/session2/15_rock_130_beat_4-4.mid,,3.714419,train


##### **1.2 Next, let's have a look at the column names and their meaning.**

In [37]:
df.columns

Index(['drummer', 'session', 'id', 'style', 'bpm', 'beat_type',
       'time_signature', 'midi_filename', 'audio_filename', 'duration',
       'split'],
      dtype='object')

**Column Description:**

`drummer:` a string ID for the drummer of the performance. <br>
`session:` a string ID for each recording session (unique per drummer). <br>
`id:` a unique string ID for the performance. <br>
`style:` a string style for the performance formatted as `<primary>/<secondary>`. The primary style comes from the **Genre List** below. <br>
`bpm:` an integer tempo in beats per minute for the performance. <br>
`beat_type:` either “beat” or “fill”. <br>
`time_signature:` the time signature for the performance formatted as `<numerator>-<denominator>`. <br>
`midi_filename:` relative path to a MIDI file. <br>
`audio_filename:` relative path to the WAV file. <br>
`duration:` the float duration in seconds (of the MIDI). <br>
`split:` the predefined split: either "train", "validation" or "test". <br>

**Genre List:**

`afrobeat`,`afrocuban`,`blues`,`country`,`dance`,`funk`,`gospel`,`highlife`,`hiphop`

`jazz`,`latin`,`middleeastern`,`neworleans`,`pop`,`punk`,`reggae`,`rock`,`soul`
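
The `style` and `time_signature` formats described above can be unpacked with plain string operations. The sketch below is illustrative only: `split_style` and `parse_time_signature` are hypothetical helper names, and it assumes that a style without a slash (such as the `rock` rows at the end of the table) simply has no secondary part.

```python
# Primary styles, copied from the Genre List above.
GENRES = {
    'afrobeat', 'afrocuban', 'blues', 'country', 'dance', 'funk',
    'gospel', 'highlife', 'hiphop', 'jazz', 'latin', 'middleeastern',
    'neworleans', 'pop', 'punk', 'reggae', 'rock', 'soul',
}


def split_style(style):
    """Split '<primary>/<secondary>' into a (primary, secondary) tuple.

    The secondary part is None when the string contains no slash.
    """
    primary, _, secondary = style.partition('/')
    return primary, (secondary or None)


def parse_time_signature(signature):
    """Turn '<numerator>-<denominator>' into a pair of integers."""
    numerator, denominator = signature.split('-')
    return int(numerator), int(denominator)


print(split_style('funk/groove1'))   # ('funk', 'groove1')
print(split_style('rock'))           # ('rock', None)
print(parse_time_signature('4-4'))   # (4, 4)
```

Applied to the DataFrame, `df['style'].map(split_style)` would give one tuple per performance, and the primary part can be checked against `GENRES`.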

In [33]:
# Summarize the column dtypes and non-null counts
df.info()

In [47]:
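# Select the rows assigned to the predefined training split
train_data = df.loc[df['split'] == 'train']
# Display the split column of the selection; its Length is the number of training rows
train_data_count = train_data['split']
train_data_count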

10      train
11      train
12      train
13      train
14      train
        ...  
1145    train
1146    train
1147    train
1148    train
1149    train
Name: split, Length: 897, dtype: object
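
The size of every split can also be read off in one step instead of filtering one split at a time. This is a minimal sketch on a toy DataFrame (the `toy` frame and its five rows are made up for illustration and are not taken from `info.csv`); on the real data, the same calls would be applied to `df['split']`.

```python
import pandas as pd

# Toy stand-in for info.csv; only the `split` column matters here.
toy = pd.DataFrame(
    {'split': ['test', 'train', 'train', 'validation', 'train']}
)

# value_counts() tallies the rows for each distinct split label.
split_counts = toy['split'].value_counts()
print(split_counts)

# Summing a boolean mask counts the rows of a single split directly.
n_train = int((toy['split'] == 'train').sum())
print(n_train)  # 3
```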

In [1]:
import tensorflow_datasets as tfds