### Machine learning with audio data - Preparation

In this unit and the next, we will see how to prepare, explore and analyze audio data with the help of machine learning. As with other modalities, such as text or images, the trick is to first get the data into a machine-interpretable format.

The interesting thing with audio data is that you can treat it as many different modalities:

* You can extract high-level features and analyze the data like tabular data.
* You can compute frequency plots and analyze the data like image data.
* You can use speech-to-text models and analyze the data like text data.
* You can use time-sensitive models and analyze the data like time-series data.
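
To make the first two views more concrete, here is a minimal sketch that turns the same signal into a small row of tabular features and into an image-like spectrogram. It is only an illustration under stated assumptions: the signal is a synthetic 440 Hz tone rather than a real recording, the feature names and the frame and hop sizes are arbitrary choices, and it uses plain NumPy instead of the audio library used later in this unit.

```python
import numpy as np

# Assumption: a synthetic stand-in for a recording,
# 1 second of a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440 * t)

# Tabular view: a few summary numbers per clip (hypothetical feature names).
features = {
    "rms": float(np.sqrt(np.mean(y**2))),
    "peak": float(np.max(np.abs(y))),
    "duration_s": len(y) / sr,
}

# Image view: magnitude spectrogram with shape (frequency bins, time frames).
frame, hop = 512, 256  # illustrative frame length and hop size, in samples
n_frames = 1 + (len(y) - frame) // hop
frames = np.stack([y[i * hop : i * hop + frame] for i in range(n_frames)])
spectrogram = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)).T

print(features)
print(spectrogram.shape)
```

The dictionary could become one row of a table with one row per recording, while the 2D array can be displayed or processed like a grayscale image.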

In this course we will take a look at the first three approaches. But first, let’s take a closer look at what audio data is.

The data we will be using for this use-case was downloaded from the Common Voice repository on Kaggle. This 14 GB dataset is a small snapshot of a much bigger dataset from Mozilla. But don’t worry, for our use-case here we will use an even smaller subsample of the full dataset. More about this later.

#### 1. Audio data
While there are multiple Python libraries for loading and manipulating audio data (scipy is one of them), we will use librosa for this and the following unit.

So, let’s download the sample_?.mp3 files from the resources of this unit and load them with librosa.

In [2]:
import librosa

# Loads the mp3 file
y, sr = librosa.load("c4_sample-1.mp3", sr=16_000)

# Print some information
print(y, len(y), sr)

Loading the audio file gives us two things: y, the audio time-series data, here represented as 50’688 individual data points, and sr, the sampling rate with which the audio data was read from the file (use sr=None to keep the original sampling rate). To better understand what this time-series data contains, let’s plot it in a figure.

In [None]:
from matplotlib import pyplot as plt

plt.figure(figsize=(12, 3))
plt.plot(y)
plt.show()
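
The plot above uses the sample index on the x-axis. The relation between the number of samples, the sampling rate and time in seconds can be sketched as follows. The numbers are the ones reported in the text above; the helper sample_to_seconds is a hypothetical name introduced only for this illustration.

```python
# Values reported in the text above: 50,688 samples read at 16 kHz.
n_samples = 50_688
sr = 16_000

# Each second of audio contributes sr samples, so the duration is:
duration_s = n_samples / sr


def sample_to_seconds(index: int, sr: int) -> float:
    """Convert a sample index into a timestamp in seconds (hypothetical helper)."""
    return index / sr


print(f"duration: {duration_s:.3f} s")
print(f"sample 8000 is at {sample_to_seconds(8_000, sr):.2f} s")
```

Dividing every sample index by sr in the same way gives a time axis in seconds, which can be passed as the x values to plt.plot instead of the raw index.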