# Tutorial 2: EDA - Exploratory Data Analysis

After setting up AWS, we can finally dive into the exciting world of Data Science. In this Jupyter Notebook, we will explore audio data using various EDA techniques. EDA is an essential step in any Data Science project, including audio data modeling. It helps us gain insights into the data's characteristics, identify patterns, and understand relationships between variables. It also helps determine the best preparation steps for the modeling phase. Good data preparation is often more important than the choice of model and hyperparameters. If you'd like to learn more about best practices and the latest developments in engineering the data to build AI systems, please visit the [Data Centric AI site](https://datacentricai.org/), which offers plenty of material on that topic.

Throughout this tutorial, we will cover a range of EDA tools and methods, such as spectrograms, waveform plots, and statistical summaries, to visualize and summarize the audio data.

Let's get started!

**NOTE:** This notebook does not require a GPU instance.

## Setup

First, we need to import required libraries and functions. 

In [2]:
import sys # Python system library needed to load custom functions
import math # module with access to mathematical functions
import os # for changing the directory

import numpy as np  # for performing calculations on numerical arrays
import pandas as pd  # home of the DataFrame construct, _the_ most important object for Data Science

from IPython.display import Audio # for listening to our insects
import IPython
from scipy.fft import fft # function to calculate Fast Fourier Transform

import matplotlib.pyplot as plt  # allows creation of insightful plots
import seaborn as sns # another library to make even more beautiful plots

sys.path.append('../../src') # add the source directory to sys.path so that local functions and modules can be imported
# enable rendering plots under the code cell that created it
%matplotlib inline

from eda_utils import show_sampling, signal_generator, plot_random_spec, plot_spec, plot_waveform # functions to create plots for and from audio data
from gdsc_utils import download_directory, PROJECT_DIR # function to download GDSC data from S3 bucket and our root directory
from config import DEFAULT_BUCKET  # S3 bucket with the GDSC data
import warnings
warnings.filterwarnings("ignore")

os.chdir(PROJECT_DIR) # changing our directory to root

## Downloading the data

Next we need to download the official data for the GDSC from the S3 bucket. The S3 bucket is structured as follows:

```
S3_bucket/
    └── data/
        ├── labels.json
        ├── metadata.csv
        └── train/
            ├── train_file_1.wav
            ├── train_file_2.wav
            ├── ...
            ├── metadata.csv
        └── val/
            ├── val_file_1.wav
            ├── val_file_2.wav
            ├── ...
            ├── metadata.csv
        └── test/
            ├── test_file_1.wav
            ├── test_file_2.wav
            ├── ...
            ├── metadata.csv
    └── data_small/
        ├── labels.json
        └── train/
            ├── train_file_1.wav
            ├── train_file_2.wav
            ├── ...
            ├── metadata.csv
        └── val/
            ├── val_file_1.wav
            ├── val_file_2.wav
            ├── ...
            ├── metadata.csv
```

In the official S3 bucket, you can find 2 folders:

- *data* - it contains the complete dataset for the challenge.
- *data_small* - this folder contains a small sample of the training and validation datasets. It will be utilized in the 4th tutorial, so there's no need to download it at the moment.

For the purpose of this tutorial, we need to download the entire dataset, which includes the entire *data* directory. To accomplish this, we can make use of the ```download_directory``` function.

In [3]:
download_directory('data/', None, DEFAULT_BUCKET)

## Analysing the metadata

Let's start by loading the metadata file and printing the first few observations.

In [4]:
df = pd.read_csv('data/metadata.csv')
df.head()

In [5]:
import soundfile as sf
f = sf.SoundFile('data/train/Roeselianaroeselii_XC751814-dat028-019_edit1.wav')
print('samples = {}'.format(f.frames))
print('sample rate = {}'.format(f.samplerate))
print('seconds = {}'.format(f.frames / f.samplerate))

The metadata contains general information about our dataset. For each file we have:
- <i>file_name</i> - name of the file,
- <i>unique_file</i> - the name of the original recording (some original recordings were very long, so they were cut into smaller files; the names of those smaller files are in the file_name column),
- <i>path</i> - shows us where this specific file is located,
- <i>species</i> - tells us what species was recorded,
- <i>label</i> - is <i>species</i> encoded to a number,
- <i>subset</i> - indicates if the file belongs to the train or validation dataset,
- <i>sample_rate</i> - is a feature related to audio. It shows how many points are recorded every second,
- <i>num_frames</i> - is the total number of samples in the recording,
- <i>length</i> - duration of the audio file in seconds, which can be calculated by dividing <i>num_frames</i> by <i>sample_rate</i>.
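
The relationship between the last three columns can be checked directly. The snippet below is a minimal sketch on a made-up two-row table (the values are invented for illustration and are not taken from the dataset); only the column names follow the metadata described above.

```python
import pandas as pd

# Toy metadata rows; the numbers are invented for illustration only
toy = pd.DataFrame({
    "sample_rate": [44100, 44100],
    "num_frames": [44100, 110250],
})

# Duration in seconds = number of samples / samples per second
toy["length"] = toy["num_frames"] / toy["sample_rate"]
print(toy)
```

The same division applied to the real `df` should reproduce its <i>length</i> column, up to rounding.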

If some of these features are difficult to grasp, don't worry! We will do a deeper dive into sample rate, number of frames, and others later in this notebook.
If you would like to learn more about audio processing after the session, here are some useful links:
* [Wikipedia - Audio Signal Processing](https://en.wikipedia.org/wiki/Audio_signal_processing)
* [Medium article on audio features](https://medium.com/analytics-vidhya/audio-data-processing-feature-extraction-science-concepts-behind-them-be97fbd587d8)
* [Another Medium article, but this time on spectrograms and Fourier Transform](https://towardsdatascience.com/understanding-audio-data-fourier-transform-fft-spectrogram-and-speech-recognition-a4072d228520)

If this list seems a bit short, don't worry: we've also included more sources later in the notebook. We also encourage you to share any other useful materials that you find on the web on the Teams channel.

But let's get back to the analysis! Now we'll focus on inspecting the dataset characteristics. For starters, let's try to inspect the numerical variables of our dataset (sample_rate, num_frames, and length) to get an idea about their distribution. The [.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) Dataframe method gives us a quick overview of the column distributions.

In [6]:
df[["sample_rate", "num_frames", "length"]].describe()

Fortunately <i>sample_rate</i> is constant for all audio files. It will save us some preprocessing work!

We can see that the audio length varies heavily - from 1 second to almost 12 minutes. But half of the files are between 5 and 28 seconds.

Let's now inspect the most important categorical variables - file_name, unique_file, label, and subset.

In [7]:
df[["file_name", "unique_file", "label", "subset"]].apply(lambda x: len(x.unique()))

We see that we have 2331 audio files that come from 1252 unique recordings. This is because the data suppliers decided to split long files into smaller ones. We also notice that there are 66 labels in the dataset and that the set is split into two chunks: training and validation. Having the train-val split will be useful in the modeling phase, when you tune your model hyperparameters.

**Key insights**:
1. sampling rate of the files is constant
2. the files vary in length, which may be a challenge in preparing the data for the modeling phase
3. there are recordings from 66 different classes or insect species 
4. the dataset is already split into train and validation, which will make the evaluation and hyperparameter tuning of our models easier

Now let's visualize some of the qualities of this dataset to better understand the data we are working with. First, let's add a variable that combines the <i>species</i> and <i>label</i> columns, so it'll be easier to label the axes of the plots we are about to create.

In [8]:
df['species and label'] = df.apply(lambda x: f"{x['species']} ({str(x['label'])})", axis = 1)
df.head(2)

In [9]:
df_train_sub = df[df["subset"] == "train"].sort_values('label')
df_val_sub = df[df["subset"] == "validation"].sort_values('label')

In [10]:
df[["label", "subset"]].sort_values('label').describe()

At the end of the DataFrame you can now see our newly created column 'species and label'.

One of the main challenges when working with a large number of classes is class imbalance. It means that some of the classes have many example datapoints and some of them are underrepresented. Aggregating the data by species might help us to assess the balance of the classes.

In [11]:
# Calculating stats per label/species - total length of recording per class and the total number of class occurrences in the dataset
df_stats = df.groupby(['label','species and label']).agg(length = ('length', 'sum'), count = ('species', 'count')).reset_index()

# Calculating average length of an audio sample
df_stats['avg_len'] = df_stats['length']/df_stats['count']
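# Note: df_stats is aggregated over both subsets, so no 'subset' column is added here
# (copying df['subset'] would only align unrelated rows by index)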
df_stats.head()

*df_stats* shows us how many seconds of recordings we have for each species, the number of different entries as well as the average length per sample. Let's visualize this!

In [12]:
df_stats = df_stats.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_stats['species and label'], y = df_stats['count'], color = 'royalblue')
plt.title('Number of files per species', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

It looks like we have very high variance in the number of samples per species, but what about the audio length?

In [13]:
df_stats = df_stats.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_stats['species and label'], y = df_stats['length'], color = 'royalblue')
plt.title('Length of files per species', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [14]:
plt.scatter(x = df_stats['count'], y = df_stats['length'], color = 'royalblue')

The audio length also varies heavily from class to class. We have some species with just a few seconds of recordings and almost 5000 seconds (over 1 hour!) for *Grylluscampestris*, aka field crickets.

The last two plots show that we are working with an imbalanced dataset. This may be a challenge when preparing a model that will perform well on all classes, because not all classes may be well represented.
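
One simple way to put a number on the imbalance is to compare the largest and the smallest class. The snippet below is a minimal sketch on a hypothetical label column (in the notebook, `df['label']` would take its place); the ratio used here is just one possible measure, not an official metric of the challenge.

```python
import pandas as pd

# Hypothetical per-file labels; in the notebook this would be df['label']
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

counts = labels.value_counts()                  # number of files per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest class
share = counts / counts.sum()                   # fraction of the dataset per class

print(counts.to_dict())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 would indicate balanced classes; the further it is from 1, the more the smallest class is underrepresented.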

**Key insights:**

* the number of examples per class varies heavily, which means that we are working with an imbalanced dataset
* the total audio length also differs from class to class; together with the number of files, it gives a full picture of the amount of data we have per class

**Exercises**:
- To explore the data further, you could recreate the plots using different subsets, such as the training and validation sets.
- The audio samples for different species exhibit varying lengths. Which species have the shortest and longest audio samples?
- An interesting question to ask is which classes have the most and least amount of data.
- How can the problem of class imbalance be tackled when building an AI solution? Post your thoughts on the [GDSC Teams channel](https://teams.microsoft.com/l/channel/19%3ad6ae189bbba3496abbb5f7f8939c92a4%40thread.skype/Data%2520and%2520AI%2520Questions?groupId=7d77d672-dff1-4c9f-ac55-3c837c1bebf9&tenantId=76a2ae5a-9f00-4f6b-95ed-5d33d77c4d61)! 

In [15]:
df_train = df_train_sub.groupby(['label','species and label']).agg(length = ('length', 'sum'), count = ('species', 'count')).reset_index()

# Calculating average length of an audio sample
df_train['avg_len'] = df_train['length']/df_train['count']
#df_train['subset'] = df['subset']

In [16]:
df_val = df_val_sub.groupby(['label','species and label']).agg(length = ('length', 'sum'), count = ('species', 'count')).reset_index()

# Calculating average length of an audio sample
df_val['avg_len'] = df_val['length']/df_val['count']
#df_val['subset'] = df['subset']

In [17]:
df_train = df_train.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_train['species and label'], y = df_train['count'], color = 'royalblue')
plt.title('Number of files per species (train)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [18]:
df_val = df_val.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_val['species and label'], y = df_val['count'], color = 'royalblue')
plt.title('Number of files per species (validation)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [19]:
df_train = df_train.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_train['species and label'], y = df_train['length'], color = 'royalblue')
plt.title('Length of files per species (train)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [20]:
df_val = df_val.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_val['species and label'], y = df_val['length'], color = 'royalblue')
plt.title('Length of files per species (validation)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [21]:
df_train = df_train.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_train['species and label'], y = df_train['avg_len'], color = 'royalblue')
plt.title('Average file length per species (train)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [22]:
df_val = df_val.sort_values('label')

plt.figure(figsize = (20,6))
sns.barplot(x = df_val['species and label'], y = df_val['avg_len'], color = 'royalblue')
plt.title('Average file length per species (validation)', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

The audio samples for different species exhibit varying lengths. Which species have the shortest and longest audio samples?

In [23]:
df_train.sort_values('length')

In [24]:
df_val.sort_values('length')

In [25]:
df_train.sort_values('length',ascending=False)

In [26]:
df_val.sort_values('length',ascending=False)

In [27]:
df_train = df_train.sort_values('label')
df_val = df_val.sort_values('label')

In [28]:
plt.figure(figsize = (20,6))
plt.scatter(df_train['label'], df_train['length'], label = "df_train")
plt.scatter(df_val['label'], df_val['length'], label = "df_val")

#plt.plot(y, x, label = "line 2")
plt.legend()
plt.show()

In [29]:
plt.figure(figsize = (20,6))
plt.scatter(df_train['label'], df_train['avg_len'], label = "df_train")
plt.scatter(df_val['label'], df_val['avg_len'], label = "df_val")

#plt.plot(y, x, label = "line 2")
plt.legend()
plt.show()

In [30]:
plt.figure(figsize = (20,6))
plt.scatter(df_train['label'], df_train['count'], label = "df_train")
plt.scatter(df_val['label'], df_val['count'], label = "df_val")

#plt.plot(y, x, label = "line 2")
plt.legend()
plt.show()

## Foundations of audio processing

As we saw in the analysis above, in this year's Global Data Science Challenge we will work with audio data. It's a very specific kind of data with its own features and characteristics. Before we continue exploring our data in depth, let's try to understand some of the most important concepts around audio features and processing.

In [31]:
x = np.linspace(0, math.pi*6, 1000)
y = np.sin(x)

plt.figure(figsize = (20,6))
plt.plot(x, y, lw = 3)
plt.arrow(0, 0, math.pi*2, 0, lw = 4, color = 'red', head_width = 0.05, length_includes_head = True)
plt.arrow(math.pi*2, 0,-math.pi*2, 0, lw = 4, color = 'red', head_width = 0.05, length_includes_head = True)

plt.arrow(0.5*math.pi, 0, 0, 1, lw = 4, color = 'red', head_width = 0.05, length_includes_head = True)
plt.arrow(0.5*math.pi, 1, 0, -1, lw = 4, color = 'red', head_width = 0.05, length_includes_head = True)
plt.text(math.pi*0.5 + 0.1, 0.5, 'A - amplitude', fontsize = 'large')

plt.title('Sine function as a simple sound wave', fontsize = 20)

plt.xlabel("Time")
plt.ylabel("Amplitude")

plt.text(math.pi, 0.1, 'T - period', fontsize = 'large')
plt.grid()

A [sine wave](https://onlinetonegenerator.com/) is a continuous *beep*, the simplest sound we can make. In the above plot, you can see a sine wave that persists for over 18 seconds. On the x-axis you have time, and on the y-axis the amplitude of the wave. The amplitude represents the intensity or volume of an audio wave, while the period represents the time it takes for the wave to complete one cycle of oscillation.

As <i>amplitude</i> and <i>period</i> are quite straightforward, let's focus on more complex features.

In the analysis of the metadata table, we saw some information about the <i>sampling rate</i>. Let's see how we can understand it!

Let's assume that we want to record a continuous signal that looks exactly like the sine wave below. The problem is that it is impossible to record and store something continuously: we cannot store infinitely many points.

In [32]:
show_sampling(10, 1, 1, show_signal = True, show_sampling = True, plot_sampling = True)

The only thing we can do is to record only specific points in time. For example, let's say that our device allows us to record the signal once every 0.1 second. This would result in having 10 points (samples) per each 1 second of the signal. So the rate at which we sample the data is 10 points per second. This is exactly the definition of the sampling rate. It is the number of samples (or measurements) of the audio signal that are taken per second. It is typically measured in [Hertz (Hz)](https://en.wikipedia.org/wiki/Hertz), which is the inverse of the time unit.
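
The example from the paragraph above can be written out in a few lines of NumPy. This is a minimal sketch, independent of the `show_sampling` helper (whose implementation is not shown here); the 1 Hz sine wave is an assumption chosen for illustration.

```python
import numpy as np

sr = 10          # sampling rate in Hz (samples per second)
duration = 1.0   # seconds of signal to record
freq = 1.0       # frequency of the illustrative sine wave in Hz

n_samples = int(sr * duration)            # total number of recorded points
t = np.arange(n_samples) / sr             # sample times, spaced 1/sr seconds apart
samples = np.sin(2 * np.pi * freq * t)    # signal value at each sample time

print(t)
print(samples.round(3))
```

Only these discrete values are stored; everything the signal does between two sample times is lost.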

From the above plot, we can see that with a low sampling rate, the reconstruction of the signal is not that precise. What if we increased the sampling rate? Let's check it in the plot below.

In [33]:
show_sampling(10, 100, 1, show_signal = True, show_sampling = True, plot_sampling = False)

We can see that the higher the sampling rate is, the more accurately the analog (real-world) audio signal can be converted to digital format. A higher sampling rate results in better sound quality, but also leads to larger file sizes.

Hopefully, by now we have a good grasp of the sampling rate. A closely related feature is the number of frames: the sampling rate multiplied by the length of the file, which gives the total number of samples in a recording.

Now let's move on to other features.

One of the most important features of audio data is <i>frequency</i>. It is the inverse of the <i>period</i> and tells us how many cycles a signal completes per second, so its unit is also Hertz (Hz). Let's look at simple sine functions to understand this feature better.

In [34]:
signal, time = signal_generator(2, 1000, [1,1.5], show_signals = True, show_signals_sum = False, split_plots = False)

So from the above plot, we can see that the higher the frequency, the more cycles per second we have. If we were to use our newly gained knowledge about audio features we could rephrase that *the higher the frequency, the shorter the period of the signal is.* 
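
The inverse relationship between frequency and period is easy to verify numerically. The sketch below assumes the two frequencies 1 Hz and 1.5 Hz over a 2-second window, which appear to be the values passed to `signal_generator` above.

```python
duration = 2.0   # seconds shown in the plot (assumed from the call above)

periods = {}
cycles = {}
for freq in (1.0, 1.5):               # frequencies in Hz
    periods[freq] = 1.0 / freq        # period: seconds per cycle
    cycles[freq] = freq * duration    # number of full cycles within the window
    print(f"{freq} Hz -> period {periods[freq]:.3f} s, {cycles[freq]:.1f} cycles in {duration} s")
```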

**Key insights:**
* Audio data have different features. We've learned about amplitude, period, and frequency. Make sure that you understand them because we will build on top of that later on.
* The sampling rate determines how accurately the signal can be reconstructed.

So far we've inspected only separate sine waves, which were quite straightforward to analyze, but in practice, an audio recording is a sum of multiple signals. Let's consider a signal that consists of three overlapping frequencies.

In [35]:
t = 1 # time in seconds
sr = 1000 # sample_rate
freq = [4, 20, 10] # frequencies used to build signal

In [36]:
signal, time = signal_generator(t, sr, freq, show_signals = True, show_signals_sum = True, split_plots = True)

An audio signal is often a combination of multiple signals, also known as components or frequencies, which vary in amplitude and frequency. While the individual sine curves (the blue plots) are very simple, the combination (the red plot) seems almost chaotic.
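
To make the idea of "a sum of components" concrete, here is a minimal NumPy sketch that builds such a mixture by hand. It assumes unit-amplitude sine waves with the three frequencies used above; the actual implementation of `signal_generator` is not shown in this notebook, so it may differ in details.

```python
import numpy as np

sr = 1000                  # sampling rate in Hz
duration = 1.0             # seconds
freqs = [4, 20, 10]        # component frequencies in Hz

t = np.arange(int(sr * duration)) / sr
# One row per component: shape (number of components, number of samples)
components = np.array([np.sin(2 * np.pi * f * t) for f in freqs])
# The combined signal is the sample-by-sample sum of the components
mixture = components.sum(axis=0)

print(components.shape, mixture.shape)
```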

Luckily, there is a method for extracting the underlying components from a recording: signal decomposition. The process involves breaking down the audio signal into its constituent frequencies using mathematical techniques such as the Fourier transform. By analyzing each frequency component separately, we can gain a better understanding of the different elements that make up an audio signal, and this can be useful in a variety of applications, such as audio processing and enhancement.

In [37]:
time = 2 # time in seconds
sr = 1000 # sample_rate
freq = [7, 15, 40] # frequencies used to build signal

In [38]:
signal, time_points = signal_generator(time, sr, freq, show_signals = False, show_signals_sum = True, split_plots = False)

Let's save the signal above as a <i>complex_signal</i> variable. To do so we need to sum all of the components. This will effectively create a sum of amplitudes of the components for each sample point.

In [39]:
complex_signal = signal.sum(axis=0)
complex_signal.shape

We can recover the frequencies of the component signals using the Fourier Transform.

In [40]:
n_sample = time*sr # number of sample points (time has to be an integer here)
y = complex_signal

# plot the signal, sample by sample
plt.title('Complex signal')
plt.xlabel('Sample index')
plt.ylabel('Amplitude')
plt.plot(y)
plt.show()

yf = fft(y) # perform FFT - Fast Fourier Transform

# plot the graph to show frequency domain
# frequency of each FFT bin in Hz: bin k corresponds to k*sr/n_sample, from 0 up to (but excluding) sr/2
xf = np.arange(n_sample//2) * sr / n_sample
plt.plot(xf, 2.0/n_sample * np.abs(yf[0:n_sample//2]))
plt.xlim(0, 100)
plt.title('Signal in frequency domain after performing FFT')
plt.xlabel('Frequencies (0 to 100 Hz)')
plt.ylabel('Amplitude')
plt.xticks(np.arange(0, 100,5))
plt.grid()
plt.show()

With just a few functions, we were able to determine the frequencies of the components, which is a crucial step in the process of analyzing audio data. 
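
For real-valued signals, SciPy also offers `rfft` and `rfftfreq`, which return only the non-negative frequencies together with the matching frequency axis. The sketch below is self-contained: it rebuilds a test signal from unit-amplitude sine waves at 7, 15, and 40 Hz (an assumption about what `signal_generator` produced) and picks the three strongest bins.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

sr = 1000                  # sampling rate in Hz
duration = 2               # seconds
freqs = [7, 15, 40]        # component frequencies in Hz

n = sr * duration
t = np.arange(n) / sr
y = sum(np.sin(2 * np.pi * f * t) for f in freqs)

spectrum = np.abs(rfft(y)) * 2.0 / n    # amplitude per frequency bin
bins = rfftfreq(n, d=1.0 / sr)          # frequency in Hz of each bin

# Frequencies of the three strongest bins, in ascending order
top = np.sort(bins[np.argsort(spectrum)[-3:]])
print(top)
```

Picking the largest bins works here because the components fall exactly on FFT bins; for real recordings the peaks are broader and usually need a proper peak-finding step.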

Unfortunately, an explanation of the details of the Fourier Transform is beyond the scope of this tutorial. For those interested in delving deeper into the mathematics behind it, we have provided a few useful resources such as:
- a [YouTube video](https://www.youtube.com/watch?v=spUNpyF58BY&t=3s) explaining what <i>Fourier Transform</i> is,
- a Wikipedia [article](https://en.wikipedia.org/wiki/Fourier_transform) 
- and the SciPy fftfreq [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.fftfreq.html) 

While it may seem unnecessary for an EDA tutorial, these lessons will prove their value as we move on to the next section. Now, we can finally return to our dataset and resume our analysis.

**Key insights:**
* Audio data usually are a mix of different signals.
* We can analyze the data by extracting the frequencies and in that way gain a better understanding of the different elements that make up an audio signal.

**Exercises**:
- To deepen your understanding of audio features, you can experiment with the plotting functions and explore the different options available.
- An interesting question to ask is what is the minimum sampling rate needed to accurately recreate the input signal. Does this minimum rate depend on factors such as frequency or other features?
- To further practice working with audio data, you can create a complex audio sample and attempt to extract the individual component frequencies using signal decomposition techniques such as Fourier Transform.

## Advanced audio processing: Waveform and spectrogram

A waveform represents the shape of the audio signal over time. It is a graphical representation of the amplitude of the audio signal on the vertical axis versus time on the horizontal axis. A waveform provides a visual representation of the audio signal and allows us to identify patterns, variations, and trends in the signal. It can be useful in analyzing the characteristics of an audio signal, such as its volume, pitch, and duration.

Let's take a look at one of our insect recordings.

In [41]:
example_path = 'data/train/Chorthippusbiguttulus_XC751834-dat031-007_edit3.wav'
plot_waveform(example_path)

At first glance we can see that the file starts with silence; then some noise appears and gets louder and louder until it stops at around the 3rd or 4th second. After that, the noise appears a few more times.

If we zoom in on the first 0.01 seconds of the recording, we will see that it is a complex signal built of multiple components - exactly what we discussed previously. To zoom in, adjust the second parameter of the plotting function so that it displays only the beginning of the audio.
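
In terms of raw samples, zooming in on a time window simply means slicing the array of samples. The sketch below uses a placeholder array instead of a real recording and assumes the 44100 Hz sampling rate of our dataset; how `plot_waveform` implements the zoom internally is not shown here.

```python
import numpy as np

sr = 44100                          # sampling rate in Hz, as in the metadata
y = np.zeros(5 * sr)                # placeholder for a 5-second recording

zoom_seconds = 0.01
n_zoom = round(zoom_seconds * sr)   # number of samples inside the zoom window
y_zoom = y[:n_zoom]                 # the first 0.01 seconds of the signal

print(n_zoom)
```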

In [42]:
plot_waveform(example_path, 0.01)

A spectrogram, on the other hand, is a visual representation of the frequency content of an audio signal as it varies with time. It is a 2D plot that shows how the energy of different frequencies changes over time in an audio signal. In a spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the color intensity represents the energy or amplitude of the frequencies. To obtain a spectrogram we need the **Fourier Transform** (about which we learned a bit earlier) to decompose the audio signal into its constituent frequency components across time.
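
The idea of "a Fourier Transform across time" can be sketched with `scipy.signal.spectrogram`, which splits the signal into short windows and computes one spectrum per window. The example below uses a synthetic signal (a 1000 Hz tone followed by a 3000 Hz tone) and an illustrative sampling rate of 8000 Hz; it is not how `plot_spec` is implemented, which is not shown in this notebook.

```python
import numpy as np
from scipy.signal import spectrogram

sr = 8000                                   # sampling rate in Hz (illustrative)
t = np.arange(2 * sr) / sr                  # two seconds of sample times
# First second: 1000 Hz tone; second second: 3000 Hz tone
y = np.where(t < 1.0,
             np.sin(2 * np.pi * 1000 * t),
             np.sin(2 * np.pi * 3000 * t))

# One spectrum per 256-sample window: power has shape (frequencies, time windows)
freqs, times, power = spectrogram(y, fs=sr, nperseg=256)

dominant = freqs[power.argmax(axis=0)]      # strongest frequency in each window
print(dominant[0], dominant[-1])
```

Plotting `power` as an image, with `times` on the horizontal axis and `freqs` on the vertical axis, gives the familiar spectrogram picture.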

The spectrogram is the most useful way of plotting audio as it gives us all 3 important features:
- time,
- frequency,
- amplitude (volume)

Let's plot an audio sample from our dataset and see what we get.

In [None]:
plot_spec([example_path])

The frequency axis ranges from 0 to 22050 Hz, which comfortably covers the range of human hearing (up to approximately 20 kHz). The value of 22050 Hz is tied to our sampling rate, which (if you recall) is twice as high: 44100 Hz. Half of the sampling rate is called the [<i>Nyquist frequency</i>](https://en.wikipedia.org/wiki/Nyquist_frequency). According to the Nyquist-Shannon sampling theorem, a signal can only be represented accurately if the sampling rate is more than twice the highest frequency present in it, so a recording sampled at 44100 Hz can capture frequencies up to 22050 Hz.
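
What happens when the sampling rate is too low can be demonstrated on a synthetic tone. The sketch below samples a 9 Hz sine wave at two different rates and reports the strongest frequency found by the FFT; the helper `dominant_frequency` is a hypothetical function written for this illustration only.

```python
import numpy as np

def dominant_frequency(signal_freq, sr, duration=10.0):
    """Return the strongest frequency (Hz) found after sampling a sine wave."""
    n = int(sr * duration)
    t = np.arange(n) / sr
    y = np.sin(2 * np.pi * signal_freq * t)
    spectrum = np.abs(np.fft.rfft(y))
    bins = np.fft.rfftfreq(n, d=1.0 / sr)
    return bins[spectrum.argmax()]

high_rate = dominant_frequency(9, sr=100)   # sampled well above 2 * 9 Hz
low_rate = dominant_frequency(9, sr=10)     # sampled below 2 * 9 Hz
print(high_rate, low_rate)
```

At 100 Hz the tone should be recovered at about 9 Hz, while at 10 Hz it should show up at about 1 Hz instead. This effect is known as aliasing, and it is the reason the sampling rate has to exceed twice the highest frequency of interest.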

Let's check if we can hear what we see on the spectrogram!

In [None]:
Audio(example_path)

We can clearly hear that the insect is relatively quiet at the start of the recording and then begins to make louder, rhythmic sounds. The recording ends with a slightly different noise that stops quickly. All of this can also be seen on the spectrogram.

Do you think insects of the same species make similar sounds? Let's try to find out. We pick and plot four samples from the same species.

In [4]:
df_test = pd.read_csv('data/test/metadata.csv')
df_test.head()

In [7]:
# build the full path of each test file in one vectorized step (avoids chained assignment)
df_test['path'] = 'data/test/' + df_test['file_name']
df_test

In [11]:
import librosa
length0=librosa.get_duration(filename=df_test['path'][0])
round(length0)

In [19]:
# compute the duration of every test file and assign the whole column at once (avoids chained assignment)
# note: newer librosa releases appear to prefer the `path` argument over `filename`
df_test['length'] = [librosa.get_duration(filename=p) for p in df_test['path']]
df_test

In [29]:
df_test['length'] = 0
df_test['num_frames'] = 0
df_test['sample_rate'] = 0
df_test

In [36]:
df_test = df_test.drop(columns='length_new', errors='ignore') # drop the temporary column; do nothing if it does not exist

In [37]:
df_test

In [33]:
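import soundfile as sf

# read the header of each test file once and derive the audio metadata from it
infos = [sf.info(path) for path in df_test['path']]
df_test['num_frames'] = [info.frames for info in infos]
df_test['sample_rate'] = [info.samplerate for info in infos]
df_test['length'] = df_test['num_frames'] / df_test['sample_rate'] # duration in seconds
df_test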

In [38]:
df_test.to_csv('data/test/metadata_new.csv',index=False)

In [44]:
df_test[['length', 'num_frames']].describe()

In [43]:
df[['length', 'num_frames']].describe()

In [46]:
df_test.sort_values('length')

In [5]:
df_test = pd.read_csv('data/test/metadata_new.csv')

In [12]:
paths = list(df_test['path'])

In [None]:
plot_spec(paths[:30])

In [None]:
for path in df_test['path'][:50]:
    IPython.display.display(Audio(path))

In [None]:
paths = list(df[df['label']==65].sample(4)['path'])
paths

In [None]:
plot_spec(paths)

In [None]:
import IPython
for path in paths:
    IPython.display.display(Audio(path))

In [None]:
paths = list(df[df['label']==64].sample(4)['path'])
plot_spec(paths)
for path in paths:
    IPython.display.display(Audio(path))

In [None]:
for i in range(50,63):
    paths = list(df[df['label']==i].sample(4)['path'])
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))

In [None]:
for i in range(55,63):
    paths = list(df[df['label']==i].sample(4)['path'])
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))

In [None]:
for i in range(59,63):
    df_shorter = df[df['length']<90]
    paths = list(df_shorter[df_shorter['label']==i].sample(4)['path'])
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))

In [None]:
for i in range(63,65):
    df_shorter = df[df['length']<90]
    paths = list(df_shorter[df_shorter['label']==i].sample(4)['path'])
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))

In [None]:
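df_shorter = df[df['length']<90] # filter once, outside the loop
for i in range(0,50):
    print(i)
    df_label = df_shorter[df_shorter['label']==i]
    if df_label.empty: # no short recordings for this label
        continue
    paths = list(df_label.sample(min(4, len(df_label)))['path']) # sample at most as many files as exist
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))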

In [None]:
paths[0]

In [None]:
Audio(paths[1])

There are some visible similarities between the spectrograms, which are also clearly audible. 

Let's also look at the spectrograms of different species and check if we can also *see* differences.

In [None]:
plot_random_spec(df, labels = [10, 20, 30, 40])

In [None]:
for i in range(56,63):
    paths = list(df[df['label']==i].sample(4)['path'])
    plot_spec(paths)
    for path in paths:
        IPython.display.display(Audio(path))

There are some obvious differences between the recordings. Different classes produce different spectrograms.

Spectrograms are essential tools for analyzing audio signals, as they provide a visual representation of how the frequency content of the audio changes over time. By displaying audio as a two-dimensional image, spectrograms make it easier to identify patterns and features in complex audio signals. Moreover, many audio classification models rely on spectrograms as their input data, making them a crucial component of audio analysis and machine learning. You will learn more about it in the next tutorials.

**Key insights:**
* waveform is the plot of the amplitude of a signal over time
* spectrograms give a rich signal representation by combining the time and frequency domains
* different classes have different-looking spectrograms, while recordings of the same class have similar-looking spectrograms. This makes the spectrogram a useful audio representation and a possible input for our future models

**Exercises:** 
- Generate spectrograms and listen to the audio samples. Can you differentiate between the species based on their sound patterns?
- Can you observe any resemblances within a species? Perhaps certain species produce very similar sounds?
- Attempt to identify the predominant frequencies for particular species. Do they emit sounds in higher or lower frequencies?
- Analyze some of the waveforms generated from the dataset by decomposing them.

In the next tutorial, we'll prepare a baseline model with the use of the knowledge we gathered from this tutorial. You will also send your first submission which will put your team on the leaderboard!

**REMINDER: After finishing your work remember to shut down the instance.**