**Objective**: Read the AudioSet embeddings, keep the music-related clips, and save them to a JSON file.

**Prerequisites**:
1. Download the data from [AudioSet](https://research.google.com/audioset/download.html) and uncompress the `tar.gz` file. Make sure this notebook is in the same directory as the resulting `audioset_v1_embeddings` folder.

2. Download the class labels file [class_labels_indices.csv](http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv) and keep it in the current directory.
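
Before running the cells, it can help to confirm that both prerequisites are in place. The following is a minimal sketch using only the standard library; `missing_prerequisites` is a hypothetical helper (not part of the notebook), and the two paths are the ones the notebook reads later.

```python
from pathlib import Path

# Paths used by the notebook; the "eval" subfolder is the one read later on.
EMBEDDINGS_DIR = "audioset_v1_embeddings/eval"
CLASS_LABELS_FILE = "class_labels_indices.csv"


def missing_prerequisites(base_dir="."):
    """Return the prerequisite paths that do not exist under base_dir."""
    base = Path(base_dir)
    missing = []
    if not (base / EMBEDDINGS_DIR).is_dir():
        missing.append(EMBEDDINGS_DIR)
    if not (base / CLASS_LABELS_FILE).is_file():
        missing.append(CLASS_LABELS_FILE)
    return missing


if __name__ == "__main__":
    # Report anything that still needs to be downloaded or moved.
    for path in missing_prerequisites():
        print(f"Missing prerequisite: {path}")
```

An empty result means the notebook should find both the embeddings folder and the class labels file.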

In [1]:
#!pip3 install tensorflow

In [2]:
import numpy as np
import json
import tensorflow as tf
import os
import pandas as pd

In [3]:
directory = "audioset_v1_embeddings/eval"
class_labels_file = 'class_labels_indices.csv'
dataset = []
for file_name in os.listdir(directory):
    if file_name.endswith(".tfrecord"):
        dataset.append(os.path.join(directory, file_name))
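
`os.listdir` returns entries in arbitrary order, so the order of records can differ between machines. If a reproducible order matters, one option is to sort the file names. This is a small sketch; `list_tfrecords` is a hypothetical helper name, not something the notebook defines.

```python
import os


def list_tfrecords(directory):
    """Return the .tfrecord paths found in directory, sorted by file name."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".tfrecord")
    )
```

The result could be passed to `tf.data.TFRecordDataset` in place of `dataset`.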

In [4]:
raw_dataset = tf.data.TFRecordDataset(dataset)

In [5]:
class_labels = pd.read_csv(class_labels_file)
labels = class_labels['display_name'].tolist()

music_class = class_labels[class_labels['display_name'].str.contains('Music', case=False)]
music_labels = music_class['index'].tolist()

In [6]:
print(class_labels.head())
print('-----------------')
print(music_class.head())

   index        mid                   display_name
0      0   /m/09x0r                         Speech
1      1  /m/05zppz      Male speech, man speaking
2      2   /m/02zsn  Female speech, woman speaking
3      3   /m/0ytgt     Child speech, kid speaking
4      4  /m/01h8n0                   Conversation
-----------------
     index         mid        display_name
137    137    /m/04rlf               Music
138    138    /m/04szw  Musical instrument
152    152  /m/05148p4  Keyboard (musical)
216    216    /m/064t9           Pop music
217    217  /m/0glt670       Hip hop music
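
The music filter is a case-insensitive substring match on `display_name`, which is why rows such as "Musical instrument" and "Keyboard (musical)" are selected alongside "Music". As an illustration, the same selection can be sketched with the standard `csv` module. `music_label_indices` is a hypothetical helper, and the sample rows are copied from the output above rather than read from the real file.

```python
import csv
import io


def music_label_indices(csv_file, keyword="music"):
    """Return 'index' values of rows whose display_name contains keyword,
    ignoring case. csv_file is any open text file with the columns
    index, mid, display_name."""
    reader = csv.DictReader(csv_file)
    return [
        int(row["index"])
        for row in reader
        if keyword.lower() in row["display_name"].lower()
    ]


# Small sample built from rows printed above; the real file has 527 rows.
sample = io.StringIO(
    "index,mid,display_name\n"
    "0,/m/09x0r,Speech\n"
    "137,/m/04rlf,Music\n"
    "138,/m/04szw,Musical instrument\n"
    "152,/m/05148p4,Keyboard (musical)\n"
)
print(music_label_indices(sample))  # [137, 138, 152]
```

For the real file, open it with `open('class_labels_indices.csv', newline='')` and pass the file object to the function.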


In [7]:
print('class_labels', class_labels.shape)
print('music_class', music_class.shape)
print('-----------------')
print(class_labels.info())
print('-----------------')
print(music_class.info())

print('-----------------')
print(music_labels)

class_labels (527, 3)
music_class (44, 3)
-----------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 527 entries, 0 to 526
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   index         527 non-null    int64 
 1   mid           527 non-null    object
 2   display_name  527 non-null    object
dtypes: int64(1), object(2)
memory usage: 12.5+ KB
None
-----------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 137 to 282
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   index         44 non-null     int64 
 1   mid           44 non-null     object
 2   display_name  44 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.4+ KB
None
-----------------
[137, 138, 152, 216, 217, 219, 227, 230, 233, 234, 237, 239, 240, 245, 246, 247, 248, 249, 252, 253, 254, 256, 258, 259, 260, 261, 262, 264, 265, 267, 268, 269, 270

In [8]:
audios = []
counter = 0
NUM_SECONDS = 10

for raw_record in raw_dataset:
    example = tf.train.SequenceExample()
    example.ParseFromString(raw_record.numpy())
    
    # Audio metadata (stored in the SequenceExample context)
    audio_labels = example.context.feature['labels'].int64_list.value
    start_time = example.context.feature['start_time_seconds'].float_list.value
    end_time = example.context.feature['end_time_seconds'].float_list.value
    video_id = example.context.feature['video_id'].bytes_list.value
    
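    # Skip clips that carry none of the music-related labels
    if not (set(music_labels) & set(audio_labels)):
        continue

    # Audio feature: one quantized embedding (a byte string) per second of audio
    feature_list = example.feature_lists.feature_list['audio_embedding'].feature
    final_features = [list(feature.bytes_list.value[0]) for feature in feature_list]

    # Skip clips shorter than NUM_SECONDS before flattening their embeddings
    if len(final_features) < NUM_SECONDS:
        continue
    audio_embedding = [item for sublist in final_features[:NUM_SECONDS] for item in sublist]

    audio = {
        'label': list(audio_labels),              # plain list instead of a protobuf container
        'video_id': video_id[0].decode('utf-8'),  # bytes -> str
        'start_time': start_time[0],
        'end_time': end_time[0],
        'data': audio_embedding
    }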
    
    audios.append(audio)
    counter += 1
    if counter % 100 == 0:
        print(f"Processing {counter}th file ...")

Processing 100th file ...
Processing 200th file ...
Processing 300th file ...
Processing 400th file ...
Processing 500th file ...
Processing 600th file ...
Processing 700th file ...
Processing 800th file ...
Processing 900th file ...
Processing 1000th file ...
Processing 1100th file ...
Processing 1200th file ...
Processing 1300th file ...
Processing 1400th file ...
Processing 1500th file ...
Processing 1600th file ...
Processing 1700th file ...
Processing 1800th file ...
Processing 1900th file ...
Processing 2000th file ...
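
The loop keeps a clip only if it has at least one music label and at least `NUM_SECONDS` embedding frames, and then flattens the first `NUM_SECONDS` frames into one list of byte values. The sketch below isolates those two steps in plain Python so they can be checked without TensorFlow. `has_any_label` and `flatten_frames` are hypothetical helpers, and the 128-byte frame size in the usage lines is an assumption based on the AudioSet documentation, not something the notebook checks.

```python
NUM_SECONDS = 10  # same constant as in the notebook


def has_any_label(record_labels, wanted_labels):
    """True if the record carries at least one of the wanted labels."""
    return not set(wanted_labels).isdisjoint(record_labels)


def flatten_frames(frames, num_seconds=NUM_SECONDS):
    """Flatten the first num_seconds frames into one list of ints (0..255).

    frames is a list of bytes objects, one per second of audio.
    Returns None when the clip has fewer than num_seconds frames.
    """
    if len(frames) < num_seconds:
        return None
    # Iterating over a bytes object yields its values as ints.
    return [value for frame in frames[:num_seconds] for value in frame]


# Synthetic clip: 10 frames of 128 bytes each (frame size is an assumption).
frames = [bytes([second] * 128) for second in range(10)]
flat = flatten_frames(frames)
print(len(flat))  # 1280
```

With 10 frames of 128 bytes, each kept clip contributes 1280 values to its `data` field.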


In [9]:
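with open('music_set.json', 'w') as file:
    # Dump the records themselves (not repr(audios)) so the file holds structured JSON.
    # default= converts values json cannot encode: bytes -> str, protobuf containers -> list.
    json.dump(audios, file,
              default=lambda o: o.decode('utf-8') if isinstance(o, bytes) else list(o))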
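
Because `json.dump` will also accept a plain string, it is worth loading the saved file back and checking that it holds a list of records rather than one long string. A minimal sketch follows; `load_records` is a hypothetical helper, not part of the notebook.

```python
import json


def load_records(path):
    """Load a JSON file and return its records.

    Raises ValueError unless the top-level value is a list of dicts.
    """
    with open(path) as file:
        records = json.load(file)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError(
            f"{path} does not contain a list of records "
            f"(top-level type: {type(records).__name__})"
        )
    return records
```

Calling `load_records('music_set.json')` should return a list whose items have the keys `label`, `video_id`, `start_time`, `end_time`, and `data`.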

In [10]:
[audio['data'][:10] for audio in audios[:4]]

[[0, 255, 0, 255, 147, 255, 12, 255, 0, 0],
 [166, 73, 135, 117, 139, 31, 187, 200, 190, 99],
 [71, 24, 175, 143, 68, 126, 84, 118, 78, 157],
 [208, 255, 255, 68, 8, 145, 134, 220, 50, 205]]

### References

How to read from `tfrecord` files: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/tfrecord.ipynb#scrollTo=nsEAACHcnm3f