# Data Fusion Tutorial: Standardizing, Streamlining, and Merging Data from Multiple Datasets

This tutorial goes over the basics of how to read, standardize, extract metadata from, and merge data from multiple datasets. The resulting merged dataset file can be used, after some additional processing, to train and test machine learning models.

We'll be merging data from four publicly available acoustic datasets:
- Aggregated Smartphone Timeseries of Rocket-generated Acoustics (ASTRA)
- Smartphone High-explosives Audio Recordings Dataset (SHAReD)
- Hypersonic signals from the OSIRIS-REx capsule reentry (OREX UH)
- Environmental sound recordings from ESC-50, downsampled to 800 Hz

Pickle (PKL) files of each of these datasets can be downloaded from the __[Soundscapes Archive](https://www.higp.hawaii.edu/archive/isla/UH_Soundscapes/)__. Single-recording subsets of each are also included with this repository so that the tutorial can be run without first downloading the full datasets, if desired.

For information on each of the datasets, see the references listed in `README.md`.

## Section 0: Prerequisites and Imports
The following cell includes the imports necessary to run this example.

In [None]:
import numpy as np
import os
from typing import Dict, Tuple

import pandas as pd
pd.options.mode.copy_on_write = True

import uh_soundscapes.standard_labels as stl
import uh_soundscapes.dataset_standardization as ds_std

## Section 1: Loading the Datasets

In the following cell, we'll define the paths to the datasets. By default, this notebook will use the single-recording tutorial files included in the code repository. If you've downloaded the full datasets and would like to use them, change the variables `IMPORT_DIRECTORY`, `ASTRA_FILENAME`, `SHARED_FILENAME`, `ESC50_FILENAME`, and `OREX_FILENAME` to the directory and file names of the datasets on your machine.

In [None]:
CURRENT_DIRECTORY = os.getcwd()
IMPORT_DIRECTORY = CURRENT_DIRECTORY
ASTRA_FILENAME = "ASTRA_tutorial.pkl"
SHARED_FILENAME = "SHAReD_tutorial.pkl"
OREX_FILENAME = "OREX_tutorial.pkl"
ESC50_FILENAME = "ESC50_tutorial_800Hz.pkl"

# ASTRA_FILENAME = "<insert path to ASTRA.pkl on your device here>"
# SHARED_FILENAME = "<insert path to SHAReD.pkl on your device here>"
# OREX_FILENAME = "<insert path to OREX_UH_800Hz.pkl on your device here>"
# ESC50_FILENAME = "<insert path to ESC50_800Hz.pkl on your device here>"
# IMPORT_DIRECTORY = "<insert path to directory containing the above files on your device here>"

We'll also define a few other filenames for exporting various files. If you'd like to export files to a different directory, change the value of `EXPORT_DIRECTORY` to the path to the directory you'd like to save files to.

In [None]:
EXPORT_DIRECTORY = CURRENT_DIRECTORY

ASTRA_STANDARDIZED_FILENAME = "ASTRA_standardized.pkl"
ASTRA_EVENT_MD_FILENAME = "ASTRA_event_metadata.csv"
ASTRA_STATION_MD_FILENAME = "ASTRA_station_metadata.csv"

SHARED_STANDARDIZED_FILENAME = "SHAReD_standardized.pkl"
SHARED_EVENT_MD_FILENAME = "SHAReD_event_metadata.csv"
SHARED_STATION_MD_FILENAME = "SHAReD_station_metadata.csv"

OREX_STANDARDIZED_FILENAME = "OREX_standardized.pkl"
OREX_STATION_MD_FILENAME = "OREX_station_metadata.csv"

ESC50_STANDARDIZED_FILENAME = "ESC50_800Hz_standardized.pkl"
ESC50_EVENT_MD_FILENAME = "ESC50_800Hz_event_metadata.csv"

MERGED_FILENAME = "merged_standardized_dataset.pkl"

We can now read each dataset using the pandas module.

In [None]:
astra_ds = pd.read_pickle(os.path.join(IMPORT_DIRECTORY, ASTRA_FILENAME))
shared_ds = pd.read_pickle(os.path.join(IMPORT_DIRECTORY, SHARED_FILENAME))
orex_ds = pd.read_pickle(os.path.join(IMPORT_DIRECTORY, OREX_FILENAME))
esc50_ds = pd.read_pickle(os.path.join(IMPORT_DIRECTORY, ESC50_FILENAME))

Each row of each of the pandas DataFrames contains all data and metadata from a single recording. The column names and values in each DataFrame vary, however.

To keep track of all the column names, we can use the label classes in `uh_soundscapes.standard_labels`.

In [None]:
STL = stl.StandardLabels()
SL = stl.SHAReDLabels()
AL = stl.ASTRALabels()
EL = stl.ESC50Labels()
OL = stl.OREXLabels()


Just like in the individual dataset tutorial notebooks, we'll use these classes to easily access different fields in the dataset. In this tutorial, however, we'll also use them to extract the metadata and standardize the data column values and names so that the datasets can be merged.

In the following sections, we'll go through how this is done for each of the four datasets. For a script version of this process, see `uh_soundscapes.dataset_standardization`.

## Section 2: Streamlining and Standardizing ASTRA

In this section, we'll be streamlining and standardizing the ASTRA dataset so that it can be merged with the other datasets. Since the primary motivation for this is to be able to use the merged dataset for machine learning applications, we want to reduce the size of the file as much as possible. For reference, we'll print the size of the input file below.

In [None]:
astra_input_size_bytes = os.path.getsize(os.path.join(IMPORT_DIRECTORY, ASTRA_FILENAME))
print(f"Input ASTRA file size: {astra_input_size_bytes / 1e6:.4f} MB")

The ASTRA dataset includes a number of fields containing metadata on the rocket launches the recordings are from as well as the recording stations themselves. Since there are multiple recordings from each launch and by each station, we can reduce the size of the dataset significantly by extracting this metadata and storing it in separate files.

In the next cell, we'll do this using the `uh_soundscapes.dataset_standardization` function `compile_metadata`. For details on how this function works, see documentation in `uh_soundscapes.dataset_standardization`.

In [None]:
astra_event_metadata = ds_std.compile_metadata(astra_ds, AL.launch_id, AL.event_metadata)
astra_station_metadata = ds_std.compile_metadata(astra_ds, AL.station_id, AL.station_metadata)
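
To make the idea behind the cell above concrete, here is a minimal sketch of this kind of metadata extraction. It is not the implementation of `compile_metadata`; the function name `compile_metadata_sketch`, the toy column names, and the toy values are all hypothetical, and the sketch assumes that metadata values are identical across all rows sharing an ID.

```python
import pandas as pd


def compile_metadata_sketch(df: pd.DataFrame, index_column: str, metadata_columns: list) -> pd.DataFrame:
    """Hypothetical stand-in: return one row of metadata per unique ID in index_column."""
    # put the ID column first, followed by the other metadata columns, without repeats
    columns = [index_column] + [c for c in metadata_columns if c != index_column]
    # keep only the first row seen for each ID
    unique_rows = df[columns].drop_duplicates(subset=index_column, keep="first")
    # index the metadata table by the ID
    return unique_rows.set_index(index_column)


# toy recordings table: two recordings of launch "A" and one of launch "B"
toy_ds = pd.DataFrame({
    "launch_id": ["A", "A", "B"],
    "rocket_type": ["rocket_x", "rocket_x", "rocket_y"],
    "audio_data": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})
toy_event_metadata = compile_metadata_sketch(toy_ds, "launch_id", ["launch_id", "rocket_type"])
print(toy_event_metadata)
```

Storing the repeated fields once per ID, rather than once per recording, is what makes the separate metadata files smaller than the same information kept in the main dataset.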

There are a handful of fields in `StandardLabels` that are not in the ASTRA dataset. We want these fields in the final, merged dataset, so we'll add them to the ASTRA data. Three of these fields (`STL.station_alt`, `STL.data_source`, and `STL.station_network`) are common to all recordings in ASTRA, so we'll add them now.

<b>Note</b>: We are using the float value -9999.9 as a placeholder for an unknown surface altitude. Since the station latitude and longitude are included in the dataset and the stations are all on or near the surface, we can get the true altitudes from topography data and add them in later, if desired.

In [None]:
UNKN_SURF_ALT = -9999.9  # placeholder for unknown surface altitude
n_rows_astra = len(astra_ds)
astra_ds[STL.station_alt] = [UNKN_SURF_ALT] * n_rows_astra  # ASTRA stations are all surface stations
astra_ds[STL.data_source] = ["ASTRA"] * n_rows_astra # all data is from the ASTRA dataset
astra_ds[STL.station_network] = ["FLORIDA"] * n_rows_astra  # all data was recorded on the Florida network
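
Because -9999.9 is a placeholder rather than a measurement, it should be masked before altitude values are used in any calculation. The following is a minimal sketch of one way to do that, assuming a hypothetical column name `station_altitude_m` and toy altitude values.

```python
import numpy as np
import pandas as pd

UNKN_SURF_ALT = -9999.9  # same placeholder value used in this tutorial

# toy altitude column (hypothetical name and values) with two unknown entries
toy_ds = pd.DataFrame({"station_altitude_m": [UNKN_SURF_ALT, 12.0, UNKN_SURF_ALT, 30.0]})

# swap the placeholder for NaN so pandas skips it in statistics
known_alt = toy_ds["station_altitude_m"].replace(UNKN_SURF_ALT, np.nan)
mean_known_alt = known_alt.mean()  # NaNs are ignored by default
print(f"Mean of known altitudes: {mean_known_alt:.1f} m")
```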

In addition, the audio recordings in ASTRA are 5-10 minutes in duration. The full-duration recordings are useful and interesting for many applications, but we recommend starting with only the high-amplitude main launch signature for machine learning applications. This will also reduce the size of the dataset significantly.

We also want to keep samples of the pre-launch ambient noise if possible. Including these samples in training mitigates station bias, helping to ensure the model is not trained to classify signals by their recording station rather than their origin.

To do this, we'll make two copies of the dataset and select the rocket and noise samples, respectively, from each. To see how this is done in detail, look through the documentation and comments of the `uh_soundscapes.dataset_standardization` functions `select_astra_rocket_samples()` and `select_astra_noise_samples()`.

In [None]:
# make copies of the raw dataframe to select samples from
rocket_astra_ds = astra_ds.copy()
noise_astra_ds = astra_ds.copy()
# select 5 second rocket samples centered on the peak aligned time of arrival
rocket_astra_ds = ds_std.select_astra_rocket_samples(rocket_astra_ds)
# select < 50 second noise samples ending at least 60 seconds before the start-aligned time of arrival
noise_astra_ds = ds_std.select_astra_noise_samples(noise_astra_ds)
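
As an illustration of what selecting a fixed-duration sample from a longer recording involves, here is a minimal sketch that cuts a window centered on the largest absolute amplitude. This is an assumption made for illustration only: the `uh_soundscapes` functions use the aligned times of arrival stored in ASTRA, and the function name `window_around_peak`, the 800 Hz sample rate, and the synthetic waveform are all hypothetical.

```python
import numpy as np


def window_around_peak(waveform: np.ndarray, sample_rate_hz: float, duration_s: float) -> np.ndarray:
    """Hypothetical helper: return a duration_s window centered on the largest |amplitude|."""
    n_window = int(round(duration_s * sample_rate_hz))
    if n_window >= len(waveform):
        return waveform.copy()  # recording is shorter than the requested window
    peak_idx = int(np.argmax(np.abs(waveform)))
    # center the window on the peak, then clip it so it stays inside the recording
    start = peak_idx - n_window // 2
    start = max(0, min(start, len(waveform) - n_window))
    return waveform[start:start + n_window]


# synthetic 60 s recording at an assumed 800 Hz with a single spike at sample 24000
sample_rate_hz = 800.0
time_s = np.arange(48000) / sample_rate_hz
waveform = 0.01 * np.sin(2 * np.pi * 5.0 * time_s)
waveform[24000] = 1.0

segment = window_around_peak(waveform, sample_rate_hz, duration_s=5.0)
print(f"Segment length: {len(segment)} samples")
```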

We'll now fill in the other `StandardLabels` columns we want in the dataset.

In [None]:
rocket_astra_ds[STL.source_alt] = [UNKN_SURF_ALT] * len(rocket_astra_ds)  # ASTRA launches are all on the surface
rocket_astra_ds[STL.ml_label] = ["rocket"] * len(rocket_astra_ds)  # suggested label for ML applications

noise_astra_ds[STL.ml_label] = ["noise"] * len(noise_astra_ds)  # suggested label for ML applications

Next, we'll rename the columns to their `StandardLabels` equivalents using the `uh_soundscapes.standard_labels` function `standardize_df_columns()` and the standardization dictionary included in `ASTRALabels`.

In [None]:
rocket_astra_ds = stl.standardize_df_columns(dataset=rocket_astra_ds, label_map=AL.standardize_dict)
noise_astra_ds = stl.standardize_df_columns(dataset=noise_astra_ds, label_map=AL.standardize_dict)
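
Conceptually, this kind of renaming is a dictionary lookup from dataset-specific column names to standard ones. The sketch below shows the idea with plain pandas; it is not the implementation of `standardize_df_columns()`, and the mapping and column names are hypothetical.

```python
import pandas as pd

# hypothetical mapping from dataset-specific column names to standard names
toy_label_map = {"launch_id": "event_id", "audio_wf_raw": "audio_data"}

toy_ds = pd.DataFrame({"launch_id": ["A"], "audio_wf_raw": [[0.0, 0.1]], "extra": [1]})

# columns found in the mapping are renamed; all other columns are left unchanged
renamed_ds = toy_ds.rename(columns=toy_label_map)
print(list(renamed_ds.columns))
```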

We'll reset the `noise_astra_ds` source location and time values to NaNs (since the sounds in these samples are not from the launch), then check whether any of the standard columns are missing from either `rocket_astra_ds` or `noise_astra_ds`. For any missing standard columns, we'll add the column and fill it with NaNs.

In [None]:
# reset source location and time columns to NaN
noise_astra_ds[STL.source_lat] = [np.nan] * len(noise_astra_ds)
noise_astra_ds[STL.source_lon] = [np.nan] * len(noise_astra_ds)
noise_astra_ds[STL.source_alt] = [np.nan] * len(noise_astra_ds)
noise_astra_ds[STL.source_epoch_s] = [np.nan] * len(noise_astra_ds)
# fill in any other missing standard columns with NaNs
for col in STL.standard_labels:
    if col not in noise_astra_ds.columns:
        noise_astra_ds[col] = [np.nan] * len(noise_astra_ds)
    if col not in rocket_astra_ds.columns:
        rocket_astra_ds[col] = [np.nan] * len(rocket_astra_ds)

Finally, we'll reduce each of the two DataFrames to only the standard columns, then concatenate them, creating a single ASTRA dataset once again, this time with only the data we need to train and test machine learning models.

In [None]:
# keep only standard columns
rocket_astra_ds = rocket_astra_ds[STL.standard_labels]
noise_astra_ds = noise_astra_ds[STL.standard_labels]
# concatenate rocket and noise dataframes
astra_standardized_ds = pd.concat([rocket_astra_ds, noise_astra_ds], ignore_index=True)
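
As an aside, the two steps used above (adding missing standard columns filled with NaNs, then keeping only the standard columns) can also be expressed as a single pandas `reindex` call. The sketch below uses a hypothetical list of standard labels and a toy DataFrame; it is an alternative to the loop in this tutorial, not a description of what `uh_soundscapes` does internally.

```python
import pandas as pd

# hypothetical stand-in for STL.standard_labels
toy_standard_labels = ["event_id", "ml_label", "source_latitude"]

toy_ds = pd.DataFrame({"ml_label": ["rocket", "noise"], "event_id": ["A", "A"], "extra": [1, 2]})

# reindex keeps the listed columns in the listed order, drops the rest,
# and creates any listed column that is missing, filled with NaN
toy_standardized = toy_ds.reindex(columns=toy_standard_labels)
print(toy_standardized)
```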
    

Let's export the standardized dataset and the metadata files, then check how much we've reduced the size of the ASTRA data.

In [None]:
astra_standardized_ds.to_pickle(os.path.join(EXPORT_DIRECTORY, ASTRA_STANDARDIZED_FILENAME))
astra_event_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, ASTRA_EVENT_MD_FILENAME), index=True)
astra_station_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, ASTRA_STATION_MD_FILENAME), index=True)
astra_output_size_bytes = os.path.getsize(os.path.join(EXPORT_DIRECTORY, ASTRA_STANDARDIZED_FILENAME))
print(f"ASTRA file size before standardization: {astra_input_size_bytes / 1e6:.4f} MB")
print(f"ASTRA file size after standardization: {astra_output_size_bytes / 1e6:.4f} MB")
print(f"ASTRA file size REDUCED by: {(astra_input_size_bytes - astra_output_size_bytes) / astra_input_size_bytes * 100:.2f}%")
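
The same size comparison is repeated for each dataset in this tutorial, so the arithmetic could be wrapped in a small helper. The function `percent_reduction` below is a hypothetical convenience, not part of `uh_soundscapes`, and the file sizes in the example are made up.

```python
def percent_reduction(input_size_bytes: int, output_size_bytes: int) -> float:
    """Hypothetical helper: size reduction as a percentage of the input size.

    A negative result means the output file is larger than the input file.
    """
    return 100 * (input_size_bytes - output_size_bytes) / input_size_bytes


# made-up example: a 200 MB input file reduced to a 50 MB output file
example_reduction = percent_reduction(200_000_000, 50_000_000)
print(f"File size REDUCED by: {example_reduction:.2f}%")
```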

Before moving on to the next dataset, we'll take a look at the contents of the metadata files we just exported.

The first file contains the metadata in ASTRA that is associated with the launch event, indexed by the unique launch ID strings. The metadata fields are:
- 'launch_id': the ID string of the launch
- 'launch_pad_latitude': the latitude of the launch pad in degrees
- 'launch_pad_longitude': the longitude of the launch pad in degrees
- 'reported_launch_epoch_s': the reported launch time in epoch seconds
- 'rocket_type': the type of rocket launched (make and model name)
- 'rocket_model_number': the model number of the rocket launched
- 'n_solid_rocket_boosters': the number of solid rocket boosters used

In [None]:
astra_event_metadata.head()

The second file contains the metadata in ASTRA that is associated with the recording smartphone station, indexed by the unique station ID strings. The metadata fields are:
- 'station_id': the ID string of the station
- 'station_make': the make of the smartphone
- 'station_model_number': the model number of the smartphone

In [None]:
astra_station_metadata.head()

## Section 3: Streamlining and Standardizing SHAReD

In this section, we'll be streamlining and standardizing the SHAReD dataset so that it can be merged with the other datasets. Like with ASTRA, we want to reduce the size of the file as much as possible. For reference, we'll print the size of the input file below.

In [None]:
shared_input_size_bytes = os.path.getsize(os.path.join(IMPORT_DIRECTORY, SHARED_FILENAME))
print(f"Input SHARED file size: {shared_input_size_bytes / 1e6:.4f} MB")

Like ASTRA, the SHAReD dataset includes a number of fields containing metadata on the explosions the recordings are from as well as the recording stations themselves. Since there are multiple recordings from each explosion and by each station, we can reduce the size of the dataset significantly by extracting this metadata and storing it in separate files.

Before we do this, however, we'll make one quick change. If you've completed the SHAReD tutorial notebook, you'll remember that each explosion event in the dataset is associated with a non-unique `event_name` string as well as a unique `event_id_number` integer. To align with the other datasets, we'll combine these two fields so that the `event_name` strings are unique to each event.

In [None]:
# change NNSS event names from "NNSS" to "NNSS_<event_id_number>" to make them unique
for idx in shared_ds.index:
    if shared_ds[SL.event_name][idx] == "NNSS":
        shared_ds.at[idx, SL.event_name] = f"NNSS_{shared_ds[SL.event_id_number][idx]}"
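
For larger DataFrames, the same update can be written without an explicit loop by using a boolean mask. The sketch below uses a toy DataFrame with hypothetical column names and event names, and is meant as an equivalent alternative to the loop above.

```python
import pandas as pd

# toy events table: "SITE_1" is a hypothetical event name that is already unique
toy_ds = pd.DataFrame({
    "event_name": ["NNSS", "SITE_1", "NNSS"],
    "event_id_number": [3, 7, 4],
})

# select the rows with the non-unique name, then append each row's ID number
is_nnss = toy_ds["event_name"] == "NNSS"
toy_ds.loc[is_nnss, "event_name"] = "NNSS_" + toy_ds.loc[is_nnss, "event_id_number"].astype(str)
print(toy_ds)
```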

Now that the `event_name` values are unique to individual events, we'll extract the metadata. In the next cell, we'll do this using the `compile_metadata` function just like we did with ASTRA in the previous section.

In [None]:
shared_event_metadata = ds_std.compile_metadata(shared_ds, SL.event_id_number, SL.event_metadata)
shared_station_metadata = ds_std.compile_metadata(shared_ds, SL.smartphone_id, SL.station_metadata)

At this point, we'll add the missing standard columns that have the same values for both the 'explosion' and 'ambient' signals in SHAReD: `data_source`, `station_alt`, and `station_network`.

In [None]:
shared_ds[STL.data_source] = ["SHAReD"] * len(shared_ds) # all data is from the SHAReD dataset
shared_ds[STL.station_alt] = [UNKN_SURF_ALT] * len(shared_ds)  # placeholder for unknown surface altitude
shared_ds[STL.station_network] = [x.split("_")[0] for x in shared_ds[SL.event_name]] # network is first part of event name

To significantly reduce the size of SHAReD, we'll remove all the columns unrelated to the audio, location, and time data. We'll do this separately for the 'explosion' and 'ambient' samples of the audio recordings already separated and labeled in SHAReD, creating two DataFrames.

In [None]:
# columns to keep for the explosion DataFrame
explosion_columns = [SL.event_name, SL.smartphone_id, SL.microphone_data,
                     SL.microphone_time_s, SL.microphone_sample_rate_hz,
                     SL.external_location_latitude, SL.external_location_longitude,
                     SL.source_latitude, SL.source_longitude, SL.explosion_detonation_time,
                     STL.data_source, STL.station_alt, STL.station_network]
# columns to keep for the ambient DataFrame
ambient_columns = [SL.event_name, SL.smartphone_id, SL.ambient_microphone_time_s,
                   SL.ambient_microphone_data, SL.microphone_sample_rate_hz,
                   SL.external_location_latitude, SL.external_location_longitude,
                   SL.source_latitude, SL.source_longitude, 
                   STL.data_source, STL.station_alt, STL.station_network]
# create separate DataFrames for explosion and ambient data
explosion_df = shared_ds[explosion_columns]
ambient_df = shared_ds[ambient_columns]

Now that the 'explosion' and 'ambient' signals are separated, we'll rename the columns to their standard names, just like we did with ASTRA.

In [None]:
explosion_df = stl.standardize_df_columns(dataset=explosion_df, label_map=SL.standardize_dict)
ambient_df = stl.standardize_df_columns(dataset=ambient_df, label_map=SL.standardize_dict)

The next standard column we'll add is the column containing the epoch second of the first point in the waveform, after which we can eliminate the full time arrays associated with each audio waveform, further reducing the size of the dataset. 

<b>Note:</b> If the time array is required later, it can always be reconstructed from this single time value, the sample rate, and the length of the waveform array.

In [None]:
explosion_df[STL.t0_epoch_s] = [t[0] for t in explosion_df[SL.microphone_time_s]]
ambient_df[STL.t0_epoch_s] = [t[0] for t in ambient_df[SL.ambient_microphone_time_s]]
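
The reconstruction mentioned in the note above can be sketched as follows, assuming the waveform is uniformly sampled with no dropped samples. The function name `rebuild_time_array` and the example values are hypothetical.

```python
import numpy as np


def rebuild_time_array(t0_epoch_s: float, sample_rate_hz: float, n_samples: int) -> np.ndarray:
    """Hypothetical helper: epoch time of every sample, assuming uniform sampling."""
    # sample k occurs k / sample_rate_hz seconds after the first sample
    return t0_epoch_s + np.arange(n_samples) / sample_rate_hz


# example: 4 samples at 800 Hz starting at an arbitrary epoch time
time_epoch_s = rebuild_time_array(t0_epoch_s=1_700_000_000.0, sample_rate_hz=800.0, n_samples=4)
print(time_epoch_s - time_epoch_s[0])  # offsets from the first sample, in seconds
```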

We'll also add the `ml_label` and `source_alt` columns from `StandardLabels`.

In [None]:
explosion_df[STL.ml_label] = ["explosion"] * len(explosion_df)
ambient_df[STL.ml_label] = ["silence"] * len(ambient_df)

explosion_df[STL.source_alt] = [UNKN_SURF_ALT] * len(explosion_df)  # explosions are all on the surface
ambient_df[STL.source_alt] = [np.nan] * len(ambient_df)  # SHAReD ambient data has no identified source

Now, we'll check to see if the standard columns are all present in both DataFrames. For any missing columns, we'll add them and fill with NaNs.

In [None]:
for col in STL.standard_labels:
    if col not in explosion_df.columns:
        explosion_df[col] = [np.nan] * len(explosion_df)
    if col not in ambient_df.columns:
        ambient_df[col] = [np.nan] * len(ambient_df)

Next, we'll ensure the source location and time columns are filled with NaNs for the ambient data, which has no identified source.

In [None]:
# reset source location and time columns to NaN
ambient_df[STL.source_lat] = [np.nan] * len(ambient_df)
ambient_df[STL.source_lon] = [np.nan] * len(ambient_df)
ambient_df[STL.source_alt] = [np.nan] * len(ambient_df)
ambient_df[STL.source_epoch_s] = [np.nan] * len(ambient_df)

Finally, we'll reduce each of the two DataFrames to only the standard columns, then concatenate them, creating a single SHAReD dataset once again, this time with only the data we need to train and test machine learning models.

In [None]:
# keep only standard columns
ambient_df = ambient_df[STL.standard_labels]
explosion_df = explosion_df[STL.standard_labels]
# concatenate explosion and ambient dataframes
shared_standardized_ds = pd.concat([explosion_df, ambient_df], ignore_index=True)

Let's export the standardized dataset and the metadata files, then check how much we've reduced the size of the SHAReD data.

In [None]:
shared_standardized_ds.to_pickle(os.path.join(EXPORT_DIRECTORY, SHARED_STANDARDIZED_FILENAME))
shared_event_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, SHARED_EVENT_MD_FILENAME), index=True)
shared_station_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, SHARED_STATION_MD_FILENAME), index=True)
shared_output_size_bytes = os.path.getsize(os.path.join(EXPORT_DIRECTORY, SHARED_STANDARDIZED_FILENAME))
print(f"SHARED file size before standardization: {shared_input_size_bytes / 1e6:.4f} MB")
print(f"SHARED file size after standardization: {shared_output_size_bytes / 1e6:.4f} MB")
print(f"SHARED file size REDUCED by: {(shared_input_size_bytes - shared_output_size_bytes) / shared_input_size_bytes * 100:.2f}%")

Before moving on to the next dataset, we'll take a look at the contents of the metadata files we just exported.

The first file contains the metadata in SHAReD that is associated with the explosion event, indexed by the unique event name strings. The metadata fields are:
- 'event_name': the ID string of the explosion event
- 'training_validation_test': an integer unique to the explosion event
- 'source_yield_kg': the source yield of the event in equivalent kilograms of TNT
- 'effective_yield_category': the effective yield category the event belongs to
- 'source_latitude': the latitude of the explosion site in degrees
- 'source_longitude': the longitude of the explosion site in degrees
- 'explosion_detonation_time': the detonation time of the explosion in epoch seconds

In [None]:
shared_event_metadata.head()

The second file contains the metadata in SHAReD that is associated with the recording smartphone station, indexed by the unique station ID strings. The metadata fields are:
- 'smartphone_id': the ID string of the smartphone
- 'station_model': the model of the smartphone

In [None]:
shared_station_metadata.head()

## Section 4: Streamlining and Standardizing OREX Data

In this section, we'll be streamlining and standardizing the OREX_UH dataset so that it can be merged with the other datasets. Like with previous datasets, we want to reduce the size of the file as much as possible. For reference, we'll print the size of the input file below.

In [None]:
orex_input_size_bytes = os.path.getsize(os.path.join(IMPORT_DIRECTORY, OREX_FILENAME))
print(f"Input OREX file size: {orex_input_size_bytes / 1e6:.4f} MB")

We'll start by adding some ground truth information to the dataset.

In [None]:
n_orex_signals = len(orex_ds)
orex_ds[OL.audio_fs] = [800.] * n_orex_signals  # all OREX signals were recorded at 800 Hz
orex_ds[OL.event_id] = ["OREX"] * n_orex_signals  # all OREX signals are from the OSIRIS-REx reentry

We'll also extract the station model and station network from the station label strings using the functions and mapping dictionary defined in the next cell.

In [None]:
def get_station_network(station_label_string):
    return station_label_string.split(" ")[0]

def get_station_model_key(station_label_string):
    return station_label_string.split(" ")[-1].split("-")[0]

station_model_mapping = {
    "S08": {'make': "Samsung", 'model': "Galaxy S8"},
    "S10": {'make': "Samsung", 'model': "Galaxy S10"},
    "S21": {'make': "Samsung", 'model': "Galaxy S21"},
    "S22": {'make': "Samsung", 'model': "Galaxy S22"},
    "S23": {'make': "Samsung", 'model': "Galaxy S23"},
    "A53": {'make': "Samsung", 'model': "Galaxy A53"},
    "T06": {'make': "Samsung", 'model': "Galaxy Tab 6"},
}

orex_ds[OL.station_network] = [get_station_network(sls) for sls in orex_ds[OL.station_label]]

station_model_keys = [get_station_model_key(sls) for sls in orex_ds[OL.station_label]]

orex_ds[OL.station_make] = [station_model_mapping[key]['make'] for key in station_model_keys]
orex_ds[OL.station_model] = [station_model_mapping[key]['model'] for key in station_model_keys]
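
To see what the two parsing functions in the cell above do, the sketch below applies them to a made-up label string. The label format shown here (`"NET S22-1"`) is an assumption for illustration and may not match the labels in the OREX files. The sketch also shows one way to avoid a `KeyError` if a label contains a model key that is not in the mapping.

```python
def get_station_network(station_label_string: str) -> str:
    # network name is everything before the first space
    return station_label_string.split(" ")[0]


def get_station_model_key(station_label_string: str) -> str:
    # model key is the part of the last space-separated token before the first hyphen
    return station_label_string.split(" ")[-1].split("-")[0]


# shortened copy of the mapping, plus a fallback for unrecognized keys
toy_model_mapping = {"S22": {"make": "Samsung", "model": "Galaxy S22"}}
unknown_model = {"make": "unknown", "model": "unknown"}

toy_label = "NET S22-1"  # made-up label string
network = get_station_network(toy_label)
model_key = get_station_model_key(toy_label)
model_info = toy_model_mapping.get(model_key, unknown_model)  # no KeyError on unknown keys
print(network, model_key, model_info["model"])
```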

Unlike ASTRA and SHAReD, there's very little data in the OREX_UH file that is unnecessary. We can still extract station metadata, but as all the recordings in the OREX dataset are from the OSIRIS-REx reentry, there's no event metadata to extract. For information about the event, see the references listed in the README file.

In [None]:
orex_station_metadata = ds_std.compile_metadata(orex_ds, OL.station_id, OL.station_metadata)

We'll now standardize the column names.

In [None]:
orex_standardized_ds = stl.standardize_df_columns(dataset=orex_ds, label_map=OL.standardize_dict)

Next, we can add the rest of the standard columns with known values.

In [None]:
orex_standardized_ds[STL.data_source] = ["UH_OREX"] * n_orex_signals # all data is from the UH OREX dataset
orex_standardized_ds[STL.station_alt] = [UNKN_SURF_ALT] * n_orex_signals  # placeholder for unknown surface altitude
orex_standardized_ds[STL.ml_label] = ["hypersonic"] * n_orex_signals  # suggested label for ML applications
orex_standardized_ds[STL.t0_epoch_s] = [time[0] for time in orex_standardized_ds[OL.audio_epoch_s]]

Finally, we'll fill any missing standard columns with NaNs and reduce the dataset to only the standard columns.

In [None]:
# fill in any missing standard columns with NaNs
for col in STL.standard_labels:
    if col not in orex_standardized_ds.columns:
        orex_standardized_ds[col] = [np.nan] * n_orex_signals
# keep only the standard columns
orex_standardized_ds = orex_standardized_ds[STL.standard_labels]

Let's export the standardized dataset and the station metadata file, then check how much we've reduced the size of the OREX data.

In [None]:
orex_standardized_ds.to_pickle(os.path.join(EXPORT_DIRECTORY, OREX_STANDARDIZED_FILENAME))
orex_station_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, OREX_STATION_MD_FILENAME), index=True)
orex_output_size_bytes = os.path.getsize(os.path.join(EXPORT_DIRECTORY, OREX_STANDARDIZED_FILENAME))
print(f"OREX file size before standardization: {orex_input_size_bytes / 1e6:.4f} MB")
print(f"OREX file size after standardization: {orex_output_size_bytes / 1e6:.4f} MB")
print(f"OREX file size REDUCED by: {(orex_input_size_bytes - orex_output_size_bytes) / orex_input_size_bytes * 100:.2f}%")

Before moving on, we'll take a look at the station metadata file. The metadata fields are:
- 'station_ids': the ID string of the smartphone
- 'station_labels': the station label string of the smartphone
- 'station_make': the make of the smartphone
- 'station_model_number': the model of the smartphone
- 'deployment_network': the name of the network the smartphone was deployed on

In [None]:
orex_station_metadata.head()

## Section 5: Streamlining and Standardizing ESC-50 Data

In this section, we'll be streamlining and standardizing the ESC-50 dataset so that it can be merged with the other datasets. Like with previous datasets, we want to reduce the size of the file as much as possible. However, ESC-50 is already very streamlined and this will not be possible. Instead, we'll aim to increase the size as little as possible. For reference, we'll print the size of the input file below.

In [None]:
esc50_input_size_bytes = os.path.getsize(os.path.join(IMPORT_DIRECTORY, ESC50_FILENAME))
print(f"Input ESC-50 file size: {esc50_input_size_bytes / 1e6:.4f} MB")

We'll start by adding the data source column to the DataFrame.

In [None]:
n_esc50_signals = len(esc50_ds)
esc50_ds[STL.data_source] = ["ESC-50"] * n_esc50_signals # all data is from the ESC-50 dataset

We can then compile event metadata for the dataset. ESC-50 contains no information on the recording stations, so there is no station metadata to compile.

In [None]:
esc50_event_metadata = ds_std.compile_metadata(esc50_ds, EL.clip_id, EL.event_metadata)

We'll now standardize the column names.

In [None]:
esc50_standardized_ds = stl.standardize_df_columns(dataset=esc50_ds, label_map=EL.standardize_dict)

Finally, we'll fill any missing standard columns with NaNs and reduce the dataset to only the standard columns.

In [None]:
# fill in any missing standard columns with NaNs
for col in STL.standard_labels:
    if col not in esc50_standardized_ds.columns:
        esc50_standardized_ds[col] = [np.nan] * n_esc50_signals
# keep only the standard columns
esc50_standardized_ds = esc50_standardized_ds[STL.standard_labels]

Let's export the standardized dataset and the event metadata file, then check how much the size of the ESC-50 data has changed.

In [None]:
esc50_standardized_ds.to_pickle(os.path.join(EXPORT_DIRECTORY, ESC50_STANDARDIZED_FILENAME))
esc50_event_metadata.to_csv(os.path.join(EXPORT_DIRECTORY, ESC50_EVENT_MD_FILENAME), index=True)
esc50_output_size_bytes = os.path.getsize(os.path.join(EXPORT_DIRECTORY, ESC50_STANDARDIZED_FILENAME))
print(f"ESC-50 file size before standardization: {esc50_input_size_bytes / 1e6:.4f} MB")
print(f"ESC-50 file size after standardization: {esc50_output_size_bytes / 1e6:.4f} MB")
print(f"ESC-50 file size INCREASED by: {(esc50_output_size_bytes - esc50_input_size_bytes) / esc50_input_size_bytes * 100:.2f}%")

Before moving on, we'll take a look at the event metadata file. The metadata fields are:
- 'clip_id': the ID string of the Freesound clip the audio is sampled from
- 'true_class': the ESC-50 class of the sample
- 'inferred_class': the class predicted by YAMNet when run on the sample (after upsampling to 16 kHz)

In [None]:
esc50_event_metadata.head()

## Section 6: Data Fusion

Now that all the datasets are standardized, they can be easily merged into one dataset.

In [None]:
datasets_to_merge = [astra_standardized_ds, shared_standardized_ds, orex_standardized_ds, esc50_standardized_ds]
merged_ds = pd.concat(datasets_to_merge, ignore_index=True)
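
Concatenation only produces a clean merged dataset if every input has the same columns, which is what the standardization steps in the previous sections are for. The sketch below shows an optional sanity check on toy DataFrames with hypothetical column names: it confirms the columns match before merging and that no rows were lost afterwards.

```python
import pandas as pd

# toy standardized datasets with hypothetical column names
toy_a = pd.DataFrame({"event_id": ["A"], "ml_label": ["rocket"]})
toy_b = pd.DataFrame({"event_id": ["B", "C"], "ml_label": ["explosion", "silence"]})
toy_datasets = [toy_a, toy_b]

# every dataset should have the same columns in the same order
reference_columns = list(toy_datasets[0].columns)
for toy_dataset in toy_datasets:
    if list(toy_dataset.columns) != reference_columns:
        raise ValueError("Datasets must share the same columns before merging")

toy_merged = pd.concat(toy_datasets, ignore_index=True)

# the merged dataset should contain every row from every input
n_expected_rows = sum(len(toy_dataset) for toy_dataset in toy_datasets)
print(f"Merged rows: {len(toy_merged)} (expected {n_expected_rows})")
```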

Before exporting the merged dataset, we'll print out summaries of each of the included datasets and the resulting merged dataset.

In [None]:
for dataset in datasets_to_merge:
    ds_std.summarize_dataset(dataset)

In [None]:
print("MERGED DATASET")
ds_std.summarize_dataset(merged_ds)
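
For a quick look at class balance before training, the number of samples per data source and label can also be counted directly with pandas. The sketch below is not what `summarize_dataset` does; it uses a toy DataFrame with hypothetical column names and made-up rows.

```python
import pandas as pd

# toy merged dataset with hypothetical column names and made-up rows
toy_merged = pd.DataFrame({
    "data_source": ["ASTRA", "ASTRA", "ASTRA", "SHAReD", "SHAReD"],
    "ml_label": ["rocket", "rocket", "noise", "explosion", "silence"],
})

# count the samples for each (data source, label) pair
label_counts = toy_merged.groupby(["data_source", "ml_label"]).size()
print(label_counts)
```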

Finally, we'll export the merged dataset and check to see how big the file is compared to the original datasets, combined.

In [None]:
merged_ds.to_pickle(os.path.join(EXPORT_DIRECTORY, MERGED_FILENAME))

In [None]:
final_output_size_bytes = os.path.getsize(os.path.join(EXPORT_DIRECTORY, MERGED_FILENAME))
initial_combined_size_bytes = (astra_input_size_bytes + shared_input_size_bytes + orex_input_size_bytes + esc50_input_size_bytes)
print(f"Combined input file size: {initial_combined_size_bytes / 1e6:.4f} MB")
print(f"Merged file size: {final_output_size_bytes / 1e6:.4f} MB")
print(f"Total file size REDUCED by: {(initial_combined_size_bytes - final_output_size_bytes) / initial_combined_size_bytes * 100:.2f}%")

This concludes the tutorial on standardizing and merging the datasets for use with machine learning applications.